I have only recently started working with web crawlers, following online tutorials and learning a little at a time. Everything covered here is fairly basic. If you know a better approach, please share it in the comments.
My first crawler was very simple: it fetched a list of data from a web page, and the response format was simple too. The page returns JSON, so calling .json() on the response gives a dictionary that can be accessed directly.
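As a rough illustration of what that parsing step does, here is a minimal sketch using only the standard library. The response body and its field names (total, items, name) are made up for the example; a real crawler would receive the text from the server.

```python
import json

# Hypothetical response body; a real crawler would get this text from the server.
body = '{"total": 2, "items": [{"id": 1, "name": "first"}, {"id": 2, "name": "second"}]}'

# Calling .json() on a response object performs this parsing step for you.
data = json.loads(body)

# Once parsed, the data is an ordinary dictionary.
names = [item["name"] for item in data["items"]]
print(data["total"], names)
```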
After I started learning asynchronous coroutines and finished the exercises, I tried to convert the original crawler to use them, and that produced errors.
    import asyncio

    import aiohttp


    async def download_page(url):
        # Every call opens its own session
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                result = await resp.text()
                return result


    async def main(urls):
        tasks = []
        for url in urls:
            # My Python version is 3.9.6
            tasks.append(asyncio.create_task(download_page(url)))
        await asyncio.wait(tasks)


    if __name__ == '__main__':
        urls = [
            url1,
            url2,
            ……
        ]
        asyncio.run(main(urls))
This is the most basic asynchronous coroutine framework. With a small amount of data it basically does the job, but once the amount of data grows a little it starts raising errors. The error messages I collected are as follows:
    aiohttp.client_exceptions.ClientOSError: [WinError 64] The specified network name is no longer available.
    Task exception was never retrieved
    aiohttp.client_exceptions.ClientOSError: [WinError 121] The signal timeout has expired
    Task exception was never retrieved
    aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
    Task exception was never retrieved
In general, these errors point to a problem with the network requests and connections. As a beginner, I did not have a deep understanding of network connections or of asynchronous coroutines, so it was hard to work out how to fix them.
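The "Task exception was never retrieved" lines appear when a task fails and nothing ever reads its result. One way to make failures visible is to collect them explicitly. The following is only a sketch under assumptions: fake_download is a hypothetical stand-in for the real request (it fails for any URL containing "bad"), and asyncio.gather with return_exceptions=True is swapped in here in place of asyncio.wait.

```python
import asyncio


async def fake_download(url):
    # Hypothetical stand-in for a real request; fails for URLs containing "bad".
    await asyncio.sleep(0)
    if "bad" in url:
        raise ConnectionError(f"could not fetch {url}")
    return f"content of {url}"


async def download_all(urls):
    # return_exceptions=True hands exceptions back as ordinary results,
    # so every task's outcome is retrieved and nothing is silently dropped.
    results = await asyncio.gather(
        *(fake_download(url) for url in urls), return_exceptions=True
    )
    succeeded = {}
    failed = {}
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            failed[url] = result
        else:
            succeeded[url] = result
    return succeeded, failed


if __name__ == "__main__":
    ok, errors = asyncio.run(download_all(["good-1", "bad-1", "good-2"]))
    print(ok)
    print(errors)
```

With the failed URLs in hand, it becomes possible to log them or retry them later instead of losing them.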
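The main problem behind the errors above is that each task creates its own session. Each aiohttp.ClientSession keeps its own connection pool, so with many URLs the program opens a large number of sessions and connections at once, and that appears to be what triggers the errors.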
The fix I tried is to create only one session and share it across all tasks:
    import asyncio

    import aiohttp


    async def download_page(url, session):
        async with session.get(url) as resp:
            result = await resp.text()
            return result


    async def main(urls):
        tasks = []
        # The session is created once in main() and passed to
        # download_page() as an argument.
        async with aiohttp.ClientSession() as session:
            for url in urls:
                # My Python version is 3.9.6. Since Python 3.8, passing bare
                # coroutines to asyncio.wait() is deprecated, so wrap them with
                # asyncio.create_task(). On 3.8 to 3.10 the code still runs
                # without this but emits a warning.
                tasks.append(asyncio.create_task(download_page(url, session)))
            await asyncio.wait(tasks)


    if __name__ == '__main__':
        urls = [
            url1,
            url2,
            ……
        ]
        asyncio.run(main(urls))
With this change, the connection errors are avoided to a certain extent when crawling a large amount of data.
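If errors still show up with very long URL lists, another option that may help is to cap how many downloads run at the same time. This is not something I tested in the crawler above; it is a minimal sketch that assumes asyncio.Semaphore as the limiter. The names limited_download, max_concurrency and stats are hypothetical, and asyncio.sleep stands in for the real session.get(url) call so the example runs without a network.

```python
import asyncio


async def limited_download(url, semaphore, stats):
    # Only a fixed number of coroutines may be inside this block at once.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # hypothetical stand-in for session.get(url)
        stats["active"] -= 1
        return url


async def main(urls, max_concurrency=3):
    semaphore = asyncio.Semaphore(max_concurrency)
    stats = {"active": 0, "peak": 0}  # tracks how many downloads overlap
    tasks = [
        asyncio.create_task(limited_download(url, semaphore, stats))
        for url in urls
    ]
    results = await asyncio.gather(*tasks)
    return results, stats["peak"]


if __name__ == "__main__":
    done, peak = asyncio.run(main([f"url{i}" for i in range(10)]))
    print(len(done), "downloads finished, peak concurrency:", peak)
```

All tasks are still created up front, but at most max_concurrency of them are doing work at any moment.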
Takeaway: when programming, stay flexible in your thinking. A small change can improve efficiency a lot.