Python: handling XML and HTML strings in a crawler

When crawling a web page, the data returned is sometimes an XML or HTML fragment that you have to process and analyze yourself. After researching the available approaches, here is a summary.
First, here is a simple crawler:

import urllib2

def get_html(url, response_handler):
    response = urllib2.urlopen(url)
    return response_handler(response.read())

# crawl the page (html_parser is defined below)
get_html("https://www.zhihu.com/topics", html_parser)

The above fetches a single page. It is indeed very simple, but it already shows the two core steps of a crawler: fetch the page, then analyze it.
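One practical note: some sites reject requests that do not look like they come from a browser. A minimal variant of get_html that sends a User-Agent header might look like the sketch below; the header value is only an illustrative placeholder, not something the original code specifies.

import urllib2

def get_html(url, response_handler):
    # build a Request object so headers can be attached; many sites
    # refuse bare requests that carry no User-Agent
    request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    response = urllib2.urlopen(request)
    return response_handler(response.read())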
The fetching step is essentially fixed boilerplate; the real work happens in response_handler, the parameter that holds the page-analysis function. In popular crawler frameworks, this analysis step is likewise the main part you have to write yourself. The example above passes an html_parser function; do not assume there is a default implementation (there is none), so let's write it:

# Use the lxml library to parse the page; see the lxml
# documentation for details
import lxml.etree

def html_parser(response):
    page = lxml.etree.HTML(response)
    # select elements with XPath: every <li> in the topic-category
    # list, then the text of its <a> child
    for elem in page.xpath('//ul[@class="zm-topic-cat-main"]/li'):
        print elem.xpath('a/text()')[0]

We use lxml.etree.HTML() to load the page content; the object it returns can then be queried with XPath to pick out elements. lxml also appears to support CSS selector syntax (via the separate cssselect package); refer to the documentation for details. The parser above prints the names of all topic categories on the page.
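As a rough sketch of the CSS-selector route (this assumes the cssselect package is installed alongside lxml; the selector below mirrors the XPath used above):

import lxml.etree

def html_parser_css(response):
    page = lxml.etree.HTML(response)
    # cssselect() is available on lxml elements once the cssselect
    # package is installed; this matches the same <a> elements as
    # the XPath version
    for link in page.cssselect('ul.zm-topic-cat-main li a'):
        print link.text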

Parsing an XML string is similar; for example:

import lxml.etree

def xml_parser(response):
    page = lxml.etree.fromstring(response)
    # query the tree with XPath, read attributes, and so on
    pass
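
For instance, with a small XML fragment (the sample document and tag names here are made up purely for illustration):

import lxml.etree

# hypothetical XML fragment, just for illustration
xml_data = """<topics>
    <topic id="1">Python</topic>
    <topic id="2">Crawlers</topic>
</topics>"""

page = lxml.etree.fromstring(xml_data)
for topic in page.xpath('//topic'):
    # print each topic's id attribute and its text content
    print topic.get('id'), topic.text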

That's all for now.
