Tag Archives: The crawler


BeautifulSoup4: retrieving the value of a tag's class attribute
Recently, while writing a crawler, I needed to check whether the class value of the current tag matched a specific value. A web search turned up little that helped, so I am noting the solution here.
Using the get() method

from bs4 import BeautifulSoup

def find_url(request_text):
	html = BeautifulSoup(request_text, 'lxml')
	for a in html.find_all('a'):
		# get('class') returns a list of class names, so test membership
		if 'xxx' in (a.get('class') or []):
			return a.get('href')
	return None

The above code does the following:

    parses the request text into a BeautifulSoup object, finds all the <a> tags, traverses them, and checks whether the class value of the current <a> tag contains 'xxx'. On a match, it returns that tag's href; if no tag matches, it returns None.
The get() function can do more than this: the argument to get() is the name of the attribute whose value you want.
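For reference, get() works the same way for any attribute, and the class attribute in particular comes back as a list, which is why a direct string comparison like i.get('class') == 'xxx' fails. A small sketch (the markup is made up for illustration; html.parser is used here so no extra parser is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, just to show what get() returns for various attributes.
soup = BeautifulSoup('<a class="btn primary" href="/home">Home</a>', 'html.parser')
a = soup.find('a')

print(a.get('class'))  # ['btn', 'primary'] -- class is multi-valued, returned as a list
print(a.get('href'))   # /home
print(a.get('title'))  # None -- a missing attribute returns None
```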

Python error: ssl.SSL.SysCallError: (-1, 'Unexpected EOF')

ssl.SSL.SysCallError: (-1, 'Unexpected EOF')

This error can occur in requests when crawling a web page through a proxy. The workaround is to add verify=False to the request.

Code before the fix:

import requests

URL = "https://docs.google.com/uc?export=download"

session = requests.Session()

# id: the file id being requested (defined elsewhere)
response = session.get(URL, params={'id': id}, stream=True)

After adding verify=False:

import requests

URL = "https://docs.google.com/uc?export=download"

session = requests.Session()

response = session.get(URL, params={'id': id}, stream=True, verify=False)

Solution to urllib2.HTTPError: HTTP Error 403: Forbidden

When crawling with Python, it is common to find that the target site returns 403 Forbidden to crawlers.

Q: Why does the 403 Forbidden error occur?
A: urllib2.HTTPError: HTTP Error 403: Forbidden occurs mainly because the target website blocks crawlers. It can be avoided by adding headers to the request.
Q: So how do you solve it?
A: Just add headers.

req = urllib.request.Request(url="http://en.wikipedia.org"+pageUrl)
html = urlopen(req)

and add headers so that it becomes:

import urllib.request
from urllib.request import urlopen

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url="http://en.wikipedia.org"+pageUrl, headers=headers)
html = urlopen(req)

Q: What should the headers look like?
A: You can copy them from the Network panel of your browser's developer tools, for example in Firefox.
Q: Are there any other problems with pretending to be a browser?
A: Yes; for example, the target site may block an IP address that sends too many queries.
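One common mitigation for that kind of IP blocking is to space requests out. A minimal sketch (the Throttle class and the delay value are my own illustration, not from the original post):

```python
import time

# Ensure at least `delay` seconds pass between consecutive requests,
# to reduce the chance of tripping the target site's rate limits.
class Throttle:
    def __init__(self, delay=2.0):
        self.delay = delay
        self.last_request = 0.0

    def wait(self):
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request = time.monotonic()

throttle = Throttle(delay=0.1)
for _ in range(3):
    throttle.wait()  # in a real crawler, fetch the next page after this
```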

Reproduced in: https://www.cnblogs.com/fonttian/p/7294845.html

Python – SSL certificate error

1. SSL certificate error

The reason for this problem is that the site's SSL certificate is not trusted. It shows up when making a request in code, for example:

import requests

url = "https://chinasoftinc.com/owa"
response = requests.get(url)

This returns a certificate verification error.

2. Solution

To make the request succeed, add the verify=False parameter:

import requests

url = "https://12306.cn/mormhweb/"
response = requests.get(url, verify=False)
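Note that with verify=False, requests (via urllib3) emits an InsecureRequestWarning on every call. If you have decided to accept the risk, the warning can be silenced; a sketch (fetch_insecure is a hypothetical helper, not part of requests):

```python
import requests
import urllib3

# Each verify=False request triggers urllib3's InsecureRequestWarning;
# disable it once, globally, if you have decided to accept the risk.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def fetch_insecure(url):
    # Hypothetical helper: skip certificate validation for this URL.
    return requests.get(url, verify=False, timeout=10)
```

Usage would then be, e.g., response = fetch_insecure("https://12306.cn/mormhweb/").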

Python crawler: urllib.error.HTTPError : HTTP Error 404: Not Found

Seeing a 404 error, I was quite frustrated. I read the code again and again and found nothing wrong; as a beginner, this kind of error is the scariest, though cracking it brings a real sense of accomplishment. Without further ado: I searched the web for the cause, and the answers varied widely, but a pattern emerged: check whether the address is wrong.
And indeed, the truth is there was something wrong with the address. Enough to make you cry.
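To make a wrong address fail with a readable message instead of a traceback, the HTTPError can be caught and inspected. A minimal sketch (fetch is a hypothetical helper):

```python
import urllib.request
import urllib.error

def fetch(url):
    # Hypothetical helper: report a 404 (usually a typo in the address)
    # instead of letting the crawler crash with a traceback.
    try:
        return urllib.request.urlopen(url).read()
    except urllib.error.HTTPError as e:
        if e.code == 404:
            print('404 Not Found -- check that the address is correct:', url)
        else:
            print('HTTP error', e.code, 'for', url)
        return None
```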