When crawling a web page, the data returned is sometimes XML or an HTML fragment that you have to process and analyze yourself. After researching the available approaches, here is a summary.
First, a minimal crawler:
import urllib.request

def get_html(url, response_handler):
    response = urllib.request.urlopen(url)
    return response_handler(response.read())

# crawl the page
get_html("https://www.zhihu.com/topics", html_parser)
The above fetches a single page. It really is that simple, but it also shows the two core steps of any crawler: fetch the page, then parse it.
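In practice the fetch step usually needs a little more care: many sites reject requests that carry no User-Agent header, and a hung connection will block forever without a timeout. Here is a hedged Python 3 sketch of the same fetcher; the header value and the timeout are arbitrary choices for illustration, not part of the original example:

```python
import urllib.request

def get_html(url, response_handler, timeout=10):
    # Send a browser-like User-Agent; many sites reject the library default.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # A timeout keeps a dead server from blocking the crawler forever.
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response_handler(response.read())
```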
The fetching part is essentially boilerplate; our focus is on response_handler, the parameter that analyzes the page. In popular crawler frameworks, this parsing step is usually the only part we need to write ourselves. The example above passes an html_parser function; it has no default implementation (there really is none), so let's write it:
# Use the lxml library to parse the page; see its documentation for details
import lxml.etree

def html_parser(response):
    page = lxml.etree.HTML(response)
    # XPath is used here to select elements; see an XPath tutorial for the syntax
    for elem in page.xpath('//ul[@class="zm-topic-cat-main"]/li'):
        print(elem.xpath('a/text()')[0])
We use lxml.etree.HTML() to load the page content; the returned object can be queried with XPath to retrieve elements, and lxml also supports CSS selectors (see the lxml documentation for details). The parser above prints the names of all topic categories.
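To see the XPath pattern in isolation, without any network access, here is a self-contained sketch using only the standard library's xml.etree.ElementTree, which understands a limited XPath subset. The HTML snippet is made up to mimic the page structure above; note that, unlike lxml, ElementTree requires well-formed markup:

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment mimicking the topic-list structure.
HTML = """<html><body>
<ul class="zm-topic-cat-main">
  <li><a href="/t/1">Tech</a></li>
  <li><a href="/t/2">Books</a></li>
</ul>
</body></html>"""

root = ET.fromstring(HTML)
# Same idea as the lxml XPath: match the <a> inside each <li> of the target <ul>.
names = [a.text for a in root.findall(".//ul[@class='zm-topic-cat-main']/li/a")]
print(names)  # ['Tech', 'Books']
```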
Parsing an XML string works much the same way:
import lxml.etree

def xml_parser(response):
    page = lxml.etree.fromstring(response)
    # do what you want with the parsed tree
    pass
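As a concrete filling-in of that stub, here is a hypothetical xml_parser that pulls ids and names out of a small XML document. The payload and tag names are invented for illustration, and the stdlib's ElementTree is used so the sketch runs without lxml; lxml.etree.fromstring behaves the same way on well-formed input:

```python
import xml.etree.ElementTree as ET

# Hypothetical payload; the tag names are made up for illustration.
XML = """<topics>
  <topic id="1"><name>Python</name></topic>
  <topic id="2"><name>XML</name></topic>
</topics>"""

def xml_parser(response):
    root = ET.fromstring(response)
    # Collect (id, name) pairs from each <topic> element.
    return [(t.get("id"), t.findtext("name")) for t in root.findall("topic")]

print(xml_parser(XML))  # [('1', 'Python'), ('2', 'XML')]
```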
That’s all for now