Tag Archives: python crawler

[Solved] raise ContentTooShortError(urllib.error.ContentTooShortError: <urlopen error retrieval incomplete:

1. Problem description

The following error occurred during a crawler batch download:

 raise ContentTooShortError(
urllib.error.ContentTooShortError: <urlopen error retrieval incomplete: got only 0 out of 290758 bytes>

2. Cause of problem

Cause: urllib.request.urlretrieve received fewer bytes than the server's Content-Length header promised, so the download was incomplete.
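
For context, urlretrieve compares the number of bytes actually read against the Content-Length header and raises ContentTooShortError when fewer arrive. A minimal sketch of essentially that check (the URL is a placeholder):

import urllib.request

# Placeholder URL; substitute any downloadable file
url = "https://example.com/file.bin"

with urllib.request.urlopen(url) as resp:
    expected = int(resp.headers.get("Content-Length", -1))
    data = resp.read()

# urlretrieve performs essentially this comparison after reading
if expected >= 0 and len(data) < expected:
    print("retrieval incomplete: got only %d out of %d bytes" % (len(data), expected))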

3. Solution

1. Solution I

Use recursion to retry whenever urlretrieve downloads the file incompletely. The code is as follows:

import urllib.error
import urllib.request

def auto_down(url, filename):
    try:
        urllib.request.urlretrieve(url, filename)
    except urllib.error.ContentTooShortError:
        print('Network conditions are not good. Reloading...')
        auto_down(url, filename)

However, testing showed this to be unsatisfactory: whenever ContentTooShortError occurs, the whole file is downloaded again from scratch, which takes too long; the function often retries several times, sometimes more than a dozen, and occasionally falls into an endless loop.
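
One hedged interim improvement, before the socket-based solution below, is to cap the recursion depth so a bad connection cannot recurse forever (a sketch; the limit of 5 attempts is arbitrary):

import urllib.error
import urllib.request

def auto_down(url, filename, retries=5):
    # Give up after a fixed number of attempts instead of recursing forever
    if retries <= 0:
        raise RuntimeError('giving up on %s' % url)
    try:
        urllib.request.urlretrieve(url, filename)
    except urllib.error.ContentTooShortError:
        print('Network conditions are not good. Reloading...')
        auto_down(url, filename, retries - 1)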

2. Solution II

Therefore, the socket module is used to set a default timeout, which shortens each re-download attempt and avoids falling into an endless loop, improving efficiency.
The code is as follows:

import socket
import urllib.error
import urllib.request

# Set the default timeout to 30 s so a stalled connection fails quickly
socket.setdefaulttimeout(30)

# Retry incomplete downloads a bounded number of times instead of looping forever;
# url and image_name are assumed to be defined by the surrounding crawler code
try:
    urllib.request.urlretrieve(url, image_name)
except (socket.timeout, urllib.error.ContentTooShortError):
    count = 1
    while count <= 5:
        try:
            urllib.request.urlretrieve(url, image_name)
            break
        except (socket.timeout, urllib.error.ContentTooShortError):
            err_info = 'Reloading for %d time' % count if count == 1 else 'Reloading for %d times' % count
            print(err_info)
            count += 1
    if count > 5:
        print("downloading picture failed!")

[Solved] raise HTTPError(req.full_url, code, msg, hdrs, fp) urllib.error.HTTPError: HTTP Error 404: Not Found

import requests

url = ['www....', 'www.....', ...]
for i in range(0, len(url)):
    linkhtml = requests.get(url[i])

The crawler reported the following error:


  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Refer to this article on Stack Overflow:

Python: urllib.error.HTTPError: HTTP Error 404: Not Found – Stack Overflow

In a crawler scenario, some of the original links simply may not open, so an HTTP Error 404 is to be expected. What you need to do is skip such a link and continue crawling the pages that follow.

Corrected code:

import requests

url = ['www....', 'www.....', ...]
for i in range(0, len(url)):
    try:
        linkhtml = requests.get(url[i])
        linkhtml.raise_for_status()  # turn a 404 response into an exception
    except requests.exceptions.RequestException:
        continue  # skip the dead link and move on to the next one
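
Since the traceback above actually comes from urllib.request.urlretrieve rather than requests, the same skip-and-continue idea in urllib terms looks like this (a sketch; the URLs are placeholders):

import urllib.error
import urllib.request

# Placeholder URLs; the original list is elided
urls = ['https://example.com/a.jpg', 'https://example.com/b.jpg']

for i, u in enumerate(urls):
    try:
        urllib.request.urlretrieve(u, 'image_%d.jpg' % i)
    except urllib.error.HTTPError as e:
        print('skipping %s: %s' % (u, e))  # e.g. HTTP Error 404: Not Found
        continue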

Python: handling XML and HTML strings in a crawler

When crawling a web page, the data returned is sometimes an XML or HTML fragment that you have to process and analyze yourself. After searching the Internet for processing methods, here is a summary.
First, a simple “crawler”:

import urllib.request

def get_html(url, response_handler):
    response = urllib.request.urlopen(url)
    return response_handler(response.read())

# crawl the page
get_html("https://www.zhihu.com/topics", html_parser)

The above is a complete single-page “crawler”. It really is that simple, and it already shows the two core steps of any crawler: fetch the page, then analyze it.
The fetching step is essentially boilerplate; our focus is on the response_handler parameter, the function that analyzes the page. In popular crawler frameworks, too, analyzing the page is the part we mainly have to write ourselves. The example above passes html_parser; do not assume it has a default implementation (it does not), so let's complete this function:

# Use the lxml library to analyze the page; see the lxml documentation for details
import lxml.etree

def html_parser(response):
    page = lxml.etree.HTML(response)
    # XPath is used here to fetch the elements; see an XPath reference for the syntax
    for elem in page.xpath('//ul[@class="zm-topic-cat-main"]/li'):
        print(elem.xpath('a/text()')[0])

We used lxml.etree.HTML() to load the page content. The returned object can be queried with XPath to retrieve page elements; it also appears to support CSS selectors, for which please refer to the relevant documentation. The analysis function above prints the names of all topic categories.
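
For the CSS route just mentioned, lxml delegates to the cssselect package; a hedged sketch (requires pip install cssselect, and assumes the same zhihu markup as the XPath example):

import lxml.etree
from lxml.cssselect import CSSSelector

def html_parser_css(response):
    page = lxml.etree.HTML(response)
    # CSS equivalent of the XPath '//ul[@class="zm-topic-cat-main"]/li' + 'a/text()'
    for a in CSSSelector('ul.zm-topic-cat-main > li > a')(page):
        print(a.text)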

Parsing an XML string is similar to the above, for example:

import lxml.etree

def xml_parser(response):
    page = lxml.etree.fromstring(response)
    # do what you want with the parsed tree
    pass
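
As a concrete illustration of the “do what you want” part, here is a filled-in copy run on a made-up XML fragment:

import lxml.etree

def xml_parser_demo(response):
    page = lxml.etree.fromstring(response)
    # Example: print the name attribute of every <item> element
    for item in page.xpath('//item'):
        print(item.get('name'))

# Made-up XML purely for demonstration
xml_parser_demo(b'<root><item name="first"/><item name="second"/></root>')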

That’s all for now

Modify the default file location of the Jupyter Notebook

After installing Anaconda and Jupyter Notebook, opening Jupyter Notebook shows some folders, but it is not clear where exactly they live on disk. To make editing and saving files easier later on, you need to change the Jupyter Notebook default file location.
The detailed steps, verified in practice, are as follows:
1. Generate the configuration file from the Anaconda Prompt command window: find and open Anaconda Prompt in the Start menu, type the following command, and run it:
jupyter notebook --generate-config
2. Open the configuration file generated in the previous step; it is usually at C:\Users\admin\.jupyter\jupyter_notebook_config.py. This directory is Jupyter's default path.
3. In the Start menu, find Jupyter Notebook and open the previously generated configuration file with it. There are two ways to open it:
(1) drag the configuration file into the notebook window;
(2) click Upload in the upper-right corner of the notebook window, then find the configuration file at the path above and open it.
4. In the configuration file, find the notebook_dir setting (see the configuration sketch after these steps).
Before the change: #c.NotebookApp.notebook_dir = ''
After the change: remove the # prefix and enter the desired default location between the single quotation marks, for example: c.NotebookApp.notebook_dir = 'E:/Python_notebook'. Note: \ cannot be used in the path; use / instead.
5. Click Save under File in the menu at the top of the notebook window, and use this saved file to overwrite the originally generated configuration file on the C: drive (this is very important; be sure to overwrite it, otherwise the changes will not take effect), i.e. overwrite the configuration file at:
C:\Users\admin\.jupyter\jupyter_notebook_config.py
6. In the Start menu, find Jupyter Notebook, right-click > Properties > Shortcut > Target, and delete the trailing "%USERPROFILE%/" at the end of the Target box.
After the above modifications, everything is done. After restarting Jupyter Notebook, the default file location is changed to the desired location, namely E:/Python_notebook.
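
For reference, the edit in step 4 amounts to the following lines in jupyter_notebook_config.py, which is itself a Python file that Jupyter executes at startup (E:/Python_notebook is just this article's example path):

# jupyter_notebook_config.py -- executed by Jupyter at startup
c = get_config()  # provided by Jupyter; recently generated files already contain this line

# Before the change (commented out, i.e. the built-in default):
# c.NotebookApp.notebook_dir = ''

# After the change (note the forward slashes in the Windows path):
c.NotebookApp.notebook_dir = 'E:/Python_notebook'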

Python crawler: urllib.error.HTTPError: HTTP Error 404: Not Found

Seeing a 404 error, I was quite frustrated. I read my code again and again and found nothing wrong with it. As a beginner, this is the kind of error I fear most, but cracking it brings a real sense of accomplishment. Without further ado: I searched the Internet for the cause of the error, and the answers were all over the place, but they did point me in the right direction, namely to check whether the address was wrong.
And sure enough, the truth is there was something wrong with the address. Enough to make you cry.