Web Crawler: How to get the data in the web page and disguise the header, disguise as a browser to visit many times, avoid a single visit leading to IP blocked

User agent: user agent. It is a kind of identification that provides information such as browser type, operating system and version, CPU type, browser rendering engine, browser language, browser plug-in, etc. The UA string is sent to the server every time the browser makes an HTTP request

Referer: http referer is a part of the header. When a browser sends a request to a web server, it usually brings a referer to tell the server which page I’m linking from, so that the server can get some information for processing

	public static String getHtmls(String url) throws IOException {
		RequestConfig globalConfig = RequestConfig.custom().setCookieSpec(CookieSpecs.IGNORE_COOKIES).build();
		String html = "";
		CloseableHttpClient httpClient = HttpClients.custom().setDefaultRequestConfig(globalConfig).build();
		HttpGet httpget = new HttpGet(url);
		//Browser identifier (OS identifier; encryption level identifier; browser language) Rendering engine identifier Version information
		httpget.setHeader("User-Agent","Mozilla/5.0 (Linux; U; Android 2.3.6; zh-cn; GT-S5660 Build/GINGERBREAD) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1 MicroMessenger/4.5.255");
	    // Camouflage head
		httpget.setHeader("Referer", "https://mp.weixin.qq.com");
		
		try {
			HttpResponse responce = httpClient.execute(httpget);//
			int resStatu = responce.getStatusLine().getStatusCode();
			if (resStatu == HttpStatus.SC_OK) {

				HttpEntity entity = responce.getEntity();
				if (entity != null) {
					html = EntityUtils.toString(entity);// Get html source code
				}
			}
		} catch (Exception e) {
			System.out.println("request " + url + " error!");
			e.printStackTrace();
		} finally {
			// close
			httpClient.close();
		}
		return html;
	}

Read More: