Background:
I have only just started learning web scraping from online tutorials, so my knowledge is still fairly shallow. If you know better approaches, please share them in the comments.
My first crawler was very simple: it fetched a data list from a web page, and the response format was also simple — a dictionary-like JSON payload that could be parsed directly with .json().
After my first exposure to asynchronous coroutines, I finished the exercises and tried to convert the original crawler, which resulted in errors.
Initial code:
import asyncio
import aiohttp

async def download_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            result = await resp.text()

async def main(urls):
    tasks = []
    for url in urls:
        tasks.append(asyncio.create_task(download_page(url)))  # my Python version is 3.9.6
    await asyncio.wait(tasks)

if __name__ == '__main__':
    urls = [ url1, url2, …… ]
    asyncio.run(main(urls))
This is the most basic asyncio crawling skeleton. With a small amount of data it basically meets the requirements, but once the amount of data grows slightly, it starts raising errors. The error messages I collected are as follows:
aiohttp.client_exceptions.ClientOSError: [WinError 64] The specified network name is no longer available.
Task exception was never retrieved
aiohttp.client_exceptions.ClientOSError: [WinError 121] The semaphore timeout period has expired
Task exception was never retrieved
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
Task exception was never retrieved
In short, the errors all point to problems with the network request and connection.
As a beginner, I had no deep understanding of network connections, nor had I studied asyncio in any depth, so these errors were difficult to resolve.
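One of the repeated messages above, "Task exception was never retrieved", means a task raised an exception but nothing ever awaited its result. A minimal offline sketch (the failing coroutine here is simulated, not a real aiohttp call) shows how asyncio.gather(..., return_exceptions=True) surfaces those exceptions instead of losing them:

```python
import asyncio

async def fetch(url):
    # Simulated request: any URL containing "bad" fails, standing in
    # for the real network call so the example runs offline.
    if "bad" in url:
        raise ConnectionError(f"cannot reach {url}")
    return f"content of {url}"

async def main(urls):
    tasks = [asyncio.create_task(fetch(u)) for u in urls]
    # With return_exceptions=True, a failed task's exception comes back
    # as a normal result instead of being dropped with the warning
    # "Task exception was never retrieved".
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(main(["http://good.example", "http://bad.example"]))
for r in results:
    if isinstance(r, Exception):
        print("request failed:", r)
```

Inspecting the returned list this way lets a crawler log failed URLs and retry them later rather than crashing or silently dropping them.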
Solution:
The root cause of the errors above is that each task creates its own session; when too many sessions are opened at once, the requests start to fail. The fix is to create only one session and share it:
import asyncio
import aiohttp

async def download_page(url, session):
    async with session.get(url) as resp:
        result = await resp.text()

async def main(urls):
    tasks = []
    # The session is created once in main() and passed to download_page as an argument.
    async with aiohttp.ClientSession() as session:
        for url in urls:
            # On Python 3.8+ (mine is 3.9.6), create tasks with asyncio.create_task();
            # passing bare coroutines still runs but emits a warning.
            tasks.append(asyncio.create_task(download_page(url, session)))
        await asyncio.wait(tasks)

if __name__ == '__main__':
    urls = [ url1, url2, …… ]
    asyncio.run(main(urls))
Sharing one session solves the connection errors to a large extent when crawling a large amount of data.
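Even with a single shared session, firing off hundreds of requests at once can still overwhelm the server or the local connection pool. A common further refinement, not from the original post, is to cap concurrency with an asyncio.Semaphore (aiohttp's TCPConnector(limit=...) is a built-in alternative). This sketch simulates the network call so it runs offline:

```python
import asyncio

MAX_CONCURRENT = 10  # assumed limit; tune it for the target site

async def download_page(url, session, sem):
    async with sem:  # at most MAX_CONCURRENT coroutines pass this point
        # A real crawler would do:
        #   async with session.get(url) as resp:
        #       return await resp.text()
        await asyncio.sleep(0)  # placeholder for the network round trip
        return f"fetched {url}"

async def main(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    session = None  # stand-in for the shared aiohttp.ClientSession()
    tasks = [asyncio.create_task(download_page(u, session, sem))
             for u in urls]
    # gather() preserves the order of the input tasks in its results.
    return await asyncio.gather(*tasks)

pages = asyncio.run(main([f"http://example.com/{i}" for i in range(50)]))
```

All 50 tasks are created up front, but only 10 requests would ever be in flight at the same time, which keeps the connection count well below the level that triggered the errors above.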
Takeaway: when programming, stay flexible in your thinking. A small change can improve efficiency a lot.