The urllib.request.urlopen() function is often used to fetch the source code of a web page so that it can be analyzed, but for some websites it raises an "HTTP Error 403: Forbidden" exception.
For example, executing the following statement
urllib.request.urlopen("http://blog.csdn.net/eric_sunah/article/details/11099295")
raises the following exception:
  File "D:\Python32\lib\urllib\request.py", line 475, in open
    response = meth(req, response)
  File "D:\Python32\lib\urllib\request.py", line 587, in http_response
    'http', request, response, code, msg, hdrs)
  File "D:\Python32\lib\urllib\request.py", line 513, in error
    return self._call_chain(*args)
  File "D:\Python32\lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "D:\Python32\lib\urllib\request.py", line 595, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Analysis:
The exception occurs because when urllib.request.urlopen() opens a URL directly, the server receives only a bare request for the page. The request tells the server nothing about the browser, operating system, or hardware platform that sent it, and requests that lack this information are often abnormal visits, such as those made by crawlers.
To prevent this kind of abnormal access, some websites check the User-Agent header of the request (it can describe the hardware platform, system software, application software, and user preferences). If the User-Agent is abnormal or missing, the request is rejected (as in the error message above).
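To see what the server actually receives, you can inspect the headers that urllib attaches by default. The snippet below is only a small sketch: it uses build_opener() to create an opener of the kind urlopen() relies on and prints its default User-Agent, which identifies the client as a Python script rather than a browser. The variable names are illustrative.

```python
import urllib.request

# build_opener() creates an opener with the default handlers,
# like the one urlopen() uses when no custom opener is installed.
opener = urllib.request.build_opener()

# addheaders lists the headers added to every request sent through this opener.
default_headers = dict(opener.addheaders)

# Prints something like "Python-urllib/3.x", depending on the Python version.
print(default_headers["User-agent"])
```

A site that filters on the User-Agent can easily recognize and reject a value like this.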
So you can try adding User-Agent information to the request.
Solution:
For Python 3.x, adding User-Agent information to the request is simple:
import urllib.request

# Without the User-Agent header below, the request fails with "HTTPError: HTTP Error 403: Forbidden".
# The site blocks crawlers, so the request adds a header that makes it look like it comes from a browser.
# A real User-Agent string can be copied from a browser, for example with the Firefox Firebug plugin.
chaper_url = "http://blog.csdn.net/eric_sunah/article/details/11099295"  # the URL from the example above
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url=chaper_url, headers=headers)
urllib.request.urlopen(req).read()
After replacing the plain urllib.request.urlopen(url).read() call with the code above, the problem page can be accessed normally.
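If you fetch many pages, it can help to wrap the idea in a small helper. The following is a minimal sketch, not a definitive implementation: the names BROWSER_UA, build_request, and fetch are hypothetical, the User-Agent string is the Firefox one used above, and the 10 second timeout is an arbitrary assumption. It also catches HTTPError, since a site may still refuse the request for other reasons.

```python
import urllib.error
import urllib.request

# Browser-like User-Agent string (the Firefox value used earlier in this post).
BROWSER_UA = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request for url that carries a browser-like User-Agent header."""
    return urllib.request.Request(url=url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return the page body as bytes, or None if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # A 403 can still occur if the site filters on something other than User-Agent.
        print("Request failed:", err.code, err.reason)
        return None
```

Calling fetch() with the blog URL from the example should then return the page source as bytes, provided the site only checks the User-Agent.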