Building a Web Crawler in Java: Fetching HTTPS Pages with HttpClient

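This article explains in detail how to develop a web crawler in Java with the HttpClient library, covering how to send GET and POST requests, how to use a connection pool, and techniques for fetching HTTPS pages.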

Implementing a Web Crawler in Java

HttpClient

Introduction to Crawlers

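1. What is a crawler
A crawler is a program that fetches data from the internet and saves it locally.

The crawl process:

1) Use a program to simulate a browser.
2) Send a request to the server.
3) The server responds with HTML.
4) Parse the useful data out of the page.
5) Parse the link addresses in the page.
6) Add the link addresses to the URL queue.
7) The crawler takes the next URL from the queue and repeats from step 2.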
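
The loop described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name CrawlLoopSketch, the regex-based link extraction, the visited set, and the maxPages limit are not part of the original article, and the fetcher is a stand-in for the real HTTP request so the sketch runs without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrawlLoopSketch {
    // Naive pattern for absolute links in href="..." attributes (assumption:
    // a real crawler would use an HTML parser instead of a regex).
    private static final Pattern HREF = Pattern.compile("href=\"(https?://[^\"]+)\"");

    /** Extracts the absolute link addresses found in a page. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /**
     * Runs the crawl loop from a seed URL and returns the URLs fetched, in order.
     * The fetcher stands in for "send a request, receive HTML".
     */
    public static List<String> crawl(String seed, Function<String, String> fetcher, int maxPages) {
        Queue<String> queue = new ArrayDeque<>();   // the URL queue
        Set<String> seen = new LinkedHashSet<>();   // assumption: avoid fetching a URL twice
        List<String> fetched = new ArrayList<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty() && fetched.size() < maxPages) {
            String url = queue.poll();              // take the next URL from the queue
            String html = fetcher.apply(url);       // request the page
            fetched.add(url);
            if (html == null) {
                continue;                           // nothing to parse
            }
            // Extracting the useful data would happen here; this sketch only follows links.
            for (String link : extractLinks(html)) {
                if (seen.add(link)) {               // true only for URLs not seen before
                    queue.add(link);
                }
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        // Two in-memory pages replace the network (hypothetical URLs).
        Map<String, String> pages = new HashMap<>();
        pages.put("http://a.example/", "<a href=\"http://a.example/b\">b</a> <a href=\"http://a.example/\">home</a>");
        pages.put("http://a.example/b", "<p>no links</p>");
        System.out.println(crawl("http://a.example/", pages::get, 10));
        // prints [http://a.example/, http://a.example/b]
    }
}
```

In a real crawler the fetcher would be replaced by an HTTP call, for example one of the HttpClient requests shown later in this article.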

2. Stages of a crawl

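1) Fetch the page.
You can send requests with the URLConnection class provided by the Java API (java.net.URLConnection).
The HttpClient toolkit is recommended instead. It is an open-source Apache project and can simulate a browser.
2) Parse the page.
Use the Jsoup toolkit.
It lets you parse HTML much as you would with jQuery.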
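
For comparison with HttpClient, fetching a page with the JDK's own URLConnection might look like the following sketch. The class name UrlConnectionFetch, the 5-second timeouts, the short User-Agent value, and the assumption that the page is UTF-8 encoded are illustrative choices, not something the article prescribes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class UrlConnectionFetch {
    /** Fetches the resource behind the URL and decodes it as UTF-8 text (assumed encoding). */
    public static String fetch(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        conn.setConnectTimeout(5000);   // assumed timeout, in milliseconds
        conn.setReadTimeout(5000);      // assumed timeout, in milliseconds
        conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // present a browser-like client
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);   // copy the body chunk by chunk
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the URL used in this article's GET example.
        String url = args.length > 0 ? args[0] : "http://yun.itheima.com/search?keys=Java";
        System.out.println(fetch(url).length() + " characters fetched");
    }
}
```

Everything beyond this (cookies, redirect policy, connection reuse) has to be handled by hand with URLConnection, which is one reason a dedicated toolkit such as HttpClient is the more convenient choice.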

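Sending a GET Request with HttpClient

Steps:
1) Create an HttpClient object. Use the CloseableHttpClient type and create it with the HttpClients utility class.
2) Create an HttpGet object, which wraps the request URL.
3) Execute the request with the HttpClient.
4) Receive the content of the server's response.
The response contains the status line and response headers,
and the response body (HTML).
5) Close the connection.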

1. Add the dependencies

<dependencies>
        <!-- HttpClient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.3</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <!-- Logging -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.25</version>
        </dependency>
    </dependencies>

2. Send a GET request with HttpClient

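import org.apache.http.HttpEntity;
import org.apache.http.StatusLine;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.junit.Test;

import java.nio.charset.StandardCharsets;

public class HttpClientTest {
    @Test
    public void testGet() throws Exception {
        // 1. Set the URL to request
        HttpGet get = new HttpGet("http://yun.itheima.com/search?keys=Java");
        // 2. Create the client (comparable to opening a browser) and send the request.
        //    try-with-resources closes the response and the client, even if an exception is thrown.
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(get)) {
            // 3. Read the status line
            StatusLine statusLine = response.getStatusLine();
            System.out.println(statusLine);
            // 4. Read the status code
            int statusCode = statusLine.getStatusCode();
            System.out.println(statusCode);
            // 5. Read the HTML body; UTF-8 is used when the response declares no charset
            HttpEntity entity = response.getEntity();
            String html = EntityUtils.toString(entity, StandardCharsets.UTF_8);
            System.out.println(html);
        }
    }
}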

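Sending a POST Request with HttpClient

Steps:
1) Create an HttpClient object.
2) Create an HttpPost object, which wraps a URL.
3) If there are parameters, wrap them in a form.
4) Execute the request with the HttpClient.
5) Receive the HTML of the server's response.
6) Close the connection.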

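    // Add this method to HttpClientTest. It needs these imports:
    // java.nio.charset.StandardCharsets, java.util.ArrayList, java.util.List,
    // org.apache.http.HttpEntity, org.apache.http.NameValuePair,
    // org.apache.http.client.entity.UrlEncodedFormEntity,
    // org.apache.http.client.methods.CloseableHttpResponse,
    // org.apache.http.client.methods.HttpPost,
    // org.apache.http.impl.client.CloseableHttpClient, org.apache.http.impl.client.HttpClients,
    // org.apache.http.message.BasicNameValuePair, org.apache.http.util.EntityUtils
    @Test
    public void testPost() throws Exception {
        // 1. Wrap the URL in a POST object
        HttpPost post = new HttpPost("http://bbs.itheima.com/search.php");
        // 2. Wrap the parameters in a form
        List<NameValuePair> form = new ArrayList<>();
        form.add(new BasicNameValuePair("mod", "forum"));
        form.add(new BasicNameValuePair("searchid", "50"));
        form.add(new BasicNameValuePair("orderby", "lastpost"));
        form.add(new BasicNameValuePair("ascdesc", "desc"));
        form.add(new BasicNameValuePair("kw", "java"));
        // Encode the form as UTF-8; the one-argument constructor falls back to ISO-8859-1
        UrlEncodedFormEntity entity = new UrlEncodedFormEntity(form, StandardCharsets.UTF_8);
        post.setEntity(entity);
        // 3. Create the client and send the request; both resources are closed automatically
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(post)) {
            // 4. Receive the server's response
            HttpEntity resultEntity = response.getEntity();
            String html = EntityUtils.toString(resultEntity, StandardCharsets.UTF_8);
            System.out.println(html);
        }
    }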
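
To make the form step more concrete, the sketch below builds a form body of the kind that is sent with the Content-Type application/x-www-form-urlencoded, using only the JDK. It is an approximation for illustration and not HttpClient's own implementation; the class name FormBodySketch and the sample value "java & spring" are assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class FormBodySketch {
    /** Joins name=value pairs with '&', percent-encoding each name and value as UTF-8. */
    public static String encode(Map<String, String> form) throws UnsupportedEncodingException {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : form.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), "UTF-8")
                    + "=" + URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return body.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> form = new LinkedHashMap<>();
        form.put("mod", "forum");
        form.put("orderby", "lastpost");
        form.put("kw", "java & spring"); // the space and '&' must be escaped
        System.out.println(encode(form));
        // prints mod=forum&orderby=lastpost&kw=java+%26+spring
    }
}
```

The escaped '&' in the last value shows why parameters should go through a form encoder rather than being concatenated by hand.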

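HttpClient Connection Pool

Steps:
1) Create a connection pool object. It should be a singleton within the application.
2) Use the HttpClients utility class to configure the connection pool, and create the HttpClient object on top of it.
3) Send the request with the HttpClient.
4) Receive the data of the server's response.
5) Close the Response object. The HttpClient object does not need to be closed.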
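
The article lists these steps without code, so here is a minimal sketch of how they might look with HttpClient 4.5. It requires the httpclient dependency declared earlier. The class name PooledHttpClientSketch and the limits of 200 total and 20 per-route connections are assumptions chosen for illustration.

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

class PooledHttpClientSketch {
    // One pool for the whole application (singleton).
    private static final PoolingHttpClientConnectionManager POOL = createPool();

    private static PoolingHttpClientConnectionManager createPool() {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(200);          // assumed upper bound for all connections
        cm.setDefaultMaxPerRoute(20); // assumed upper bound per target host
        return cm;
    }

    public static PoolingHttpClientConnectionManager pool() {
        return POOL;
    }

    /** Builds a client on top of the shared pool. */
    public static CloseableHttpClient newClient() {
        return HttpClients.custom()
                .setConnectionManager(POOL)
                .setConnectionManagerShared(true) // closing a client must not shut down the pool
                .build();
    }

    /** Sends a GET request and returns the body, decoding as UTF-8 when no charset is declared. */
    public static String get(String url) throws IOException {
        CloseableHttpClient client = newClient();
        // Only the response is closed; the connection goes back to the pool for reuse.
        try (CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            return EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Same URL as the GET example in this article.
        System.out.println(get("http://yun.itheima.com/search?keys=Java").length());
    }
}
```

Without setDefaultMaxPerRoute, the pool's per-host default is low, so a crawler that targets a single site would gain little from a large total limit.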

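Fetching HTTPS Pages with HttpClient

It turns out that pages such as the JD.com home page can be fetched, but product data cannot.

First, add the HttpsUtils utility class.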

import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.ssl.SSLContextBuilder;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

public class HttpsUtils {
    private static final String HTTP = "http";
    private static final String HTTPS = "https";
    private static SSLConnectionSocketFactory sslsf = null;
    private static PoolingHttpClientConnectionManager cm = null;
    private static SSLContextBuilder builder = null;
    static {
        try {
            builder = new SSLContextBuilder();
            // Trust every certificate without verifying the server's identity.
            // WARNING: this removes protection against man-in-the-middle attacks;
            // use it only for crawling experiments, never for sensitive data.
            builder.loadTrustMaterial(null, new TrustStrategy() {
                @Override
                public boolean isTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
                    return true;
                }
            });
            // Offer TLS only; SSLv3 is insecure and disabled by default in current JDKs.
            sslsf = new SSLConnectionSocketFactory(builder.build(), new String[]{"TLSv1.2"}, null, NoopHostnameVerifier.INSTANCE);
            Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
                    .register(HTTP, new PlainConnectionSocketFactory())
                    .register(HTTPS, sslsf)
                    .build();
            cm = new PoolingHttpClientConnectionManager(registry);
            cm.setMaxTotal(200); // maximum number of connections in the pool
        } catch (Exception e) {
            // Fail fast instead of leaving sslsf and cm null
            throw new ExceptionInInitializerError(e);
        }
    }

    public static CloseableHttpClient getHttpClient() throws Exception {
        // The connection manager already carries the SSL socket factory through its registry,
        // so a separate setSSLSocketFactory call is not needed.
        CloseableHttpClient httpClient = HttpClients.custom()
                .setConnectionManager(cm)
                .setConnectionManagerShared(true) // clients share the pool; closing one leaves it open
                .build();
        return httpClient;
    }

}

With this utility class you can crawl data from HTTPS pages. Note that you also need to add a User-Agent request header.

    @Test
    public void testHttps() throws Exception {
        // Create the HttpClient object, backed by the shared connection pool
        CloseableHttpClient httpClient = HttpsUtils.getHttpClient();
        // Create the GET object
        HttpGet httpGet = new HttpGet("https://search.jd.com/Search?keyword=%E7%94%B5%E8%84%91&enc=utf-8&pvid=b1deb5e2163141b8bebbb6c0505a4fca");
        // Present a browser User-Agent
        httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36");
        // Execute the request; the response is closed automatically
        try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // Receive the result
            HttpEntity entity = response.getEntity();
            String html = EntityUtils.toString(entity, "utf-8");
            // Print the result
            System.out.println(html);
        }
    }