Tag Archives: crawler

[Solved] AttributeError: ‘WebDriver‘ object has no attribute ‘find_element_by_id‘

1. Question:

While learning the Selenium part of web crawling, the error AttributeError: 'WebDriver' object has no attribute 'find_element_by_id' appeared.

2. Reasons:

Because of version changes, the find_element_by_id method is no longer supported in the new version of Selenium (4.x).

3. Solution:

Change button = browser.find_element_by_id('su') to the following:

button = browser.find_element(By.ID,'su')

Then add the following import at the top of the file:

from selenium.webdriver.common.by import By
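
For reference, here is a minimal sketch of the new-style locator call end to end (a sketch only, assuming ChromeDriver is installed and on the PATH; the id 'su' is the search button on https://www.baidu.com, as in the original snippet):

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.baidu.com/')
button = browser.find_element(By.ID, 'su')  # replaces browser.find_element_by_id('su')
button.click()
browser.quit()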

[Solved] Python Selenium Error: AttributeError: ‘WebDriver‘ object has no attribute ‘find_element_by_xpath‘

Python selenium Error:

el = driver.find_element_by_xpath('//*[@id="changeCityBox"]/ul/li[2]/a')
driver.find_element_by_xpath('//*[@id="search_input"]').send_keys('python',Keys.ENTER)

Solution: Modify the code above to:

from selenium.webdriver.common.by import By
el = driver.find_element(By.XPATH,r'//*[@id="changeCityBox"]/ul/li[2]/a')
driver.find_element(By.XPATH,r'//*[@id="search_input"]').send_keys("python",Keys.ENTER)

Here the Chrome browser driver is used; other browsers such as Edge work as well with their corresponding driver, and switching drivers requires the driver executable to be configured in the environment (on the PATH).

from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys   # keyboard key constants
from selenium.webdriver.common.by import By
driver = Chrome()
driver.get("https://www.lagou.com/")
# Find the element to operate on (XPath copied from the browser's F12 devtools)
el = driver.find_element(By.XPATH,r'//*[@id="changeCityBox"]/ul/li[2]/a')
# el = driver.find_element_by_xpath('//*[@id="changeCityBox"]/ul/li[2]/a')
el.click() # click the element
# Find the search input box (XPath copied via F12), type "python", then press Enter
driver.find_element(By.XPATH,r'//*[@id="search_input"]').send_keys("python",Keys.ENTER)
# driver.find_element_by_xpath('//*[@id="search_input"]').send_keys('python',Keys.ENTER)

[Solved] Saving to Database Error: pymysql.err.DataError: (1366, “Incorrect string value…

DataError: (1366, "Incorrect string value: '\xE5\xA4\xAA\xE7\xA9\xBA...' for column 'title' at row 1"). Checking the job running in the background shows that the data were scraped successfully, so the error happens when saving to the database; this kind of problem is usually an encoding issue.

Going back to the database creation step, I found that no character set had been specified when the database was created. Recreating the database with charset=utf8 fixed the problem.
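
As a minimal sketch of the fix (the host, credentials, and database name below are placeholders; utf8mb4 is used here because it covers the full Unicode range, including emoji):

import pymysql

# Connect with an explicit character set so client and server agree on the encoding
conn = pymysql.connect(host='localhost', user='root', password='****', charset='utf8mb4')
with conn.cursor() as cur:
    # Recreate the database with an explicit character set ('spider_db' is a placeholder name)
    cur.execute("CREATE DATABASE IF NOT EXISTS spider_db DEFAULT CHARACTER SET utf8mb4")
conn.close()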

[Solved] urllib.error.URLError: <urlopen error [SSL: WRONG_VERSION_NUMBER] wrong version number

There are four methods to solve this error.

Solution:
Method 1: Bypass SSL certificate verification.
You can open the URL with the following code:

import ssl
import urllib.request

# This restores the old pre-verification behavior (certificate checks are skipped).
context = ssl._create_unverified_context()
response = urllib.request.urlopen("https://no-valid-cert", context=context)

Replace https://no-valid-cert with the website you actually want to open.

Method 2: Change https to http, because some Python versions verify the SSL certificate when urllib.urlopen is called on an https URL.

Method 3: Add the following lines:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

In my case, I just added these two lines to the .py file that calls the urllib library shipped with Python 3.8.0:

model = ALBEF(config=config, text_encoder=args.text_encoder, tokenizer=tokenizer, init_deit=True)

That is, the error occurs during model initialization (init_deit), when return self.sslsocket_class._create is reached via lib/python3.8/http/client.py under Python 3.8. Adding the two lines above at the beginning of the .py file that initializes the model fixes it.

Method 4:
Upgrade your Python interpreter, e.g. from 2.7 or 3.7 to 3.8 or even 3.9.

[Solved] transformers Install Error: error can‘t find rust compiler

After reinstalling my system I needed to install transformers again; I am recording the error I ran into so it can be looked up later.

When reinstalling with pip install transformers on Windows, the following error is reported:

error: can't find Rust compiler
      
    If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.
      
    To update pip, run:
      
        pip install --upgrade pip
      
    and then retry package installation.
      
    If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.
    [end of output]
  
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

Following the error prompt, I first ran pip install --upgrade pip, which did not help, and then installed the Rust compiler as suggested: go to the official website and download the installer, choose the 64-bit installation file (matching my system), run the downloaded .exe, and keep the default configuration during installation.

According to the instructions on the official website, all Rust tools live in the ~/.cargo/bin directory, including the rustc, cargo, and rustup commands, so that directory must be on the PATH. Windows configures the environment variable automatically, but it only takes effect after the computer is restarted. After restarting, run the installation command again:

pip install transformers

The installation then completes successfully.

[Solved] raise ContentTooShortError(urllib.error.ContentTooShortError: <urlopen error retrieval incomplete:

1. Problem description

The following error occurred during a batch download with the crawler:

 raise ContentTooShortError(
urllib.error.ContentTooShortError: <urlopen error retrieval incomplete: got only 0 out of 290758 bytes>

2. Cause of problem

Cause: the urlretrieve download was incomplete.

3. Solution

1. Solution I

Use a recursive retry to deal with files that urlretrieve downloads incompletely. The code is as follows:

import urllib.request
import urllib.error

def auto_down(url, filename):
    try:
        urllib.request.urlretrieve(url, filename)
    except urllib.error.ContentTooShortError:
        print('Network conditions are not good. Reloading...')
        auto_down(url, filename)

However, after testing, ContentTooShortError still occurs on some downloads; re-downloading the whole file takes too long, it often retries several or even a dozen times, and occasionally falls into an endless loop. This is very unsatisfactory.

2. Solution II

Therefore, the socket module is used to set a download timeout, which shortens each retry and avoids the endless loop, improving overall efficiency.
The code is as follows:

import socket
import urllib.request
#Set the timeout period to 30s
socket.setdefaulttimeout(30)
#Solve the problem of incomplete download and avoid falling into an endless loop
try:
    urllib.request.urlretrieve(url,image_name)
except socket.timeout:
    count = 1
    while count <= 5:
        try:
            urllib.request.urlretrieve(url,image_name)                                                
            break
        except socket.timeout:
            err_info = 'Reloading for %d time'%count if count == 1 else 'Reloading for %d times'%count
            print(err_info)
            count += 1
    if count > 5:
        print("downloading picture failed!")
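
A variant that combines the two ideas above, a socket timeout plus a bounded number of retries, keeps the simplicity of Solution 1 without the risk of an endless loop (a sketch; url and filename are placeholders):

import socket
import urllib.request
import urllib.error

socket.setdefaulttimeout(30)  # give up on a stalled transfer after 30 seconds

def download_with_retries(url, filename, max_retries=5):
    """Retry an incomplete or timed-out download a bounded number of times."""
    for attempt in range(1, max_retries + 1):
        try:
            urllib.request.urlretrieve(url, filename)
            return True
        except (urllib.error.ContentTooShortError, socket.timeout):
            print('Download incomplete, retry %d/%d' % (attempt, max_retries))
    return False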

[Solved] urllib.error.HTTPError: HTTP Error 404: Not Found

import requests
url=['www....','www.....',...]
for i in range(0,len(url)):
     linkhtml = requests.get(url[i])

The crawler reported the following error:


  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Users\lenovo7\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Refer to this article on Stack Overflow:

Python: urllib.error.HTTPError: HTTP Error 404: Not Found – Stack Overflow

In a crawler scenario, the original link may no longer be reachable, which naturally produces HTTP Error 404. What you need to do is skip that link and continue crawling the following pages.

Corrected code:

import requests
url=['www....','www.....',...]
for i in range(0,len(url)):
    try:
        linkhtml = requests.get(url[i])
        linkhtml.raise_for_status()  # a 404 (or other error status) raises an HTTPError here
    except requests.exceptions.RequestException:
        # skip links that cannot be fetched and move on to the next one
        continue

Scrapy Crawler Error: ImportError: cannot import name suppress

2021-11-02 15:56:03 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "/Users/tanya/Library/Python/2.7/lib/python/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/Users/tanya/Library/Python/2.7/lib/python/site-packages/scrapy/crawler.py", line 104, in crawl
    six.reraise(*exc_info)
  File "/Users/tanya/Library/Python/2.7/lib/python/site-packages/scrapy/crawler.py", line 86, in crawl
    self.engine = self._create_engine()
  File "/Users/tanya/Library/Python/2.7/lib/python/site-packages/scrapy/crawler.py", line 111, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/Users/tanya/Library/Python/2.7/lib/python/site-packages/scrapy/core/engine.py", line 67, in __init__
    self.scheduler_cls = load_object(self.settings['SCHEDULER'])
  File "/Users/tanya/Library/Python/2.7/lib/python/site-packages/scrapy/utils/misc.py", line 46, in load_object
    mod = import_module(module)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/Users/tanya/Library/Python/2.7/lib/python/site-packages/scrapy/core/scheduler.py", line 7, in <module>
    from queuelib import PriorityQueue
  File "/Users/tanya/Library/Python/2.7/lib/python/site-packages/queuelib/__init__.py", line 1, in <module>
    from queuelib.queue import FifoDiskQueue, LifoDiskQueue
  File "/Users/tanya/Library/Python/2.7/lib/python/site-packages/queuelib/queue.py", line 7, in <module>
    from contextlib import suppress
ImportError: cannot import name suppress

Solution: the traceback shows Scrapy running under Python 2.7, while the installed queuelib imports contextlib.suppress, which only exists in Python 3.4+. Pinning queuelib to an older, Python-2-compatible release (and reinstalling attrs) fixes it:

pip uninstall attrs
pip uninstall queuelib
pip install queuelib==1.5.0
pip install attrs
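
A quick way to confirm the root cause is to check the interpreter Scrapy is actually running under (the traceback above points to Python 2.7, which has no contextlib.suppress):

import sys
import contextlib

print(sys.version)                      # the failing queuelib release assumes Python 3
print(hasattr(contextlib, 'suppress'))  # False on Python 2.7, which causes the ImportError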

[Solved] unknown error: DevToolsActivePort file doesn't exist

Scenario

One day an error was reported when using Selenium, as shown in the title. The code is as follows:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

option = Options()
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')

browser = webdriver.Chrome('./chromedriver',chrome_options=option)

browser.get('http://www.baidu.com/')
print(browser.title)
browser.quit()  

Solution:

First, make sure that the versions of Google Chrome and ChromeDriver match. If you are sure they match, skip the following steps.

    1. View Google Chrome version command:
google-chrome --version

I am not sure of the exact command to check the ChromeDriver version (chromedriver --version should report it), but once you know your Google Chrome version you can download the matching ChromeDriver from the site below. If the versions do not match, download a matching ChromeDriver from there.

    2. Download locations:
google-chrome: http://dist.control.lth.se/public/CentOS-7/x86_64/google.x86_64/
chromedriver: https://sites.google.com/a/chromium.org/chromedriver/downloads

Then the solution is to add one more option:

option.add_argument("--remote-debugging-port=9222")  # this
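
Putting it together, the working snippet looks roughly like this (same driver path and URL as the original code above):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

option = Options()
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
option.add_argument('--remote-debugging-port=9222')  # the newly added option

browser = webdriver.Chrome('./chromedriver', chrome_options=option)
browser.get('http://www.baidu.com/')
print(browser.title)
browser.quit()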

[Solved] Request.url is not modifiable, use Request.replace() instead

This came up in a scrapy_redis project: URLs need to be filtered and deduplicated, so a custom dedup filter class was written.

I simply copied the source code and rewrote request_seen to change the logic; assigning to request.url directly, as in the original code, raises the error in the title.

    def request_seen(self, request):
        temp_request = request
        if "ref" in temp_request.url:
            # An error is reported here, you cannot assign a value directly
            temp_request.url = temp_request.url.split("ref")[0]
        fp = self.request_fingerprint(temp_request)
        added = self.server.sadd(self.key, fp)
        return added == 0

Solution:

Use _set_url(url) instead:

temp_request._set_url(temp_request.url.split("ref")[0])
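
Alternatively, as the error message itself suggests, a modified copy of the request can be produced with Request.replace(); a small sketch (not what the original code below ends up using):

from scrapy import Request

request = Request('https://example.com/item?ref=abc')        # example request
trimmed = request.replace(url=request.url.split('ref')[0])   # returns a new Request object
print(trimmed.url)                                           # https://example.com/item?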

The complete code is as follows:

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint
from scrapy_redis import get_redis_from_settings
from scrapy_redis import defaults
import logging
import time

logger = logging.getLogger(__name__)


class UrlFilter(BaseDupeFilter):
    logger = logger

    def __init__(self, server, key, debug=False):
        self.server = server
        self.key = key
        self.debug = debug
        self.logdupes = True

    @classmethod
    def from_settings(cls, settings):
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    @classmethod
    def from_crawler(cls, crawler):
        """Returns instance from crawler.

        Parameters
        ----------
        crawler : scrapy.crawler.Crawler

        Returns
        -------
        RFPDupeFilter
            Instance of RFPDupeFilter.

        """
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        temp_request = request
        if "ref" in temp_request.url:
            temp_request._set_url(temp_request.url.split("ref")[0])
        fp = self.request_fingerprint(temp_request)
        added = self.server.sadd(self.key, fp)
        return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        str

        """
        return request_fingerprint(request)

    @classmethod
    def from_spider(cls, spider):
        settings = spider.settings
        server = get_redis_from_settings(settings)
        dupefilter_key = settings.get("SCHEDULER_DUPEFILTER_KEY", defaults.SCHEDULER_DUPEFILTER_KEY)
        key = dupefilter_key % {'spider': spider.name}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler.

        Parameters
        ----------
        reason : str, optional

        """
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)

    def log(self, request, spider):
        """Logs given request.

        Parameters
        ----------
        request : scrapy.http.Request
        spider : scrapy.spiders.Spider

        """
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False
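
To actually use this custom filter, it presumably needs to be registered in the project's settings.py; a minimal sketch, assuming the class lives in a (hypothetical) module myproject.dupefilter:

# settings.py -- sketch; the module path 'myproject.dupefilter' is a placeholder
DUPEFILTER_CLASS = 'myproject.dupefilter.UrlFilter'
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'   # the scrapy_redis scheduler, as in a typical scrapy_redis setup
REDIS_URL = 'redis://localhost:6379'             # Redis connection read by get_redis_from_settings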

[Python] Selenium Right-Click to Save a Picture Error: AttributeError: module 'pyscreen' has no attribute 'locateonwindow'

When crawling pictures with the Selenium module, simulate a right click and choose "Save picture as" to save the picture (without using the pyautogui module):

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from time import sleep

url = 'https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fcdn.duitang.com%2Fuploads%2Fitem%2F201202%2F18%2F20120218194349_ZHW5V.thumb.700_0.jpg&refer=http%3A%2F%2Fcdn.duitang.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=jpeg?sec=1633348528&t=794e58b55dfe37e5f3082568feb89783'

driver = webdriver.Chrome()
driver.get(url)
element = driver.find_element_by_xpath('/html/body/img[1]')  

action = ActionChains(driver)  
action.move_to_element(element)  
action.context_click(element)  
action.send_keys(Keys.ARROW_DOWN)  
action.send_keys('v')  

action.perform()  
sleep(2)

driver.quit()  

When you run the code above, you will find that the keyboard input after the right click has no effect, because once the context menu is open the page is no longer controlled by Selenium. At this point we need the pyautogui module, which automates the mouse and keyboard.
You need to pip install it first:

pip install pyautogui

Modified code:

from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
from time import sleep
import pyautogui

url = 'https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fcdn.duitang.com%2Fuploads%2Fitem%2F201202%2F18%2F20120218194349_ZHW5V.thumb.700_0.jpg&refer=http%3A%2F%2Fcdn.duitang.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=jpeg?sec=1633348528&t=794e58b55dfe37e5f3082568feb89783'

driver = webdriver.Chrome()
driver.get(url)
element = driver.find_element_by_xpath('/html/body/img[1]') 

action = ActionChains(driver) 
action.move_to_element(element) 
action.context_click(element) 
action.perform() 

pyautogui.typewrite(['v']) 
sleep(1) 
pyautogui.typewrite(['1', '.', 'j', 'p', 'g'])
pyautogui.typewrite(['enter'])

sleep(1)

driver.quit() 

After running it, an error is printed in red:

attributeerror: module ‘pyscreen’ has no attribute ‘locateonwindow’

This problem is actually very simple to solve:
pip install pyautogui installs the latest version (0.9.53), which triggers this error;
installing version 0.9.50 instead avoids the problem:

pip install pyautogui==0.9.50

Just execute the code again~

Python Asynchronous Coroutine Crawler Error: aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected

Background description:

I have just started working with crawlers, learning bit by bit from online tutorials, so my grasp of the topic is still fairly shallow. If there are better methods, please comment and share.

The initial crawler is very simple: it crawls a data list from a web page, and the response format is also simple; it is JSON shaped like a dictionary, so it can be read directly with .json().

Having just learned about asynchronous coroutines, I tried to convert the original crawler after finishing the exercises, and ran into errors.

Initial code:

import asyncio
import aiohttp

async def download_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            result = await resp.text()

async def main(urls):
    tasks = []
    for url in urls:
        tasks.append(asyncio.create_task(download_page(url)))  # My Python version is 3.9.6
    await asyncio.wait(tasks)

if __name__ == '__main__':
    urls = [ url1, url2, …… ]
    asyncio.run(main(urls))

This is the most basic asynchronous coroutine skeleton. With a small amount of data it basically meets the requirements, but with slightly more data it starts throwing errors. The error messages I collected are as follows:

aiohttp.client_exceptions.ClientOSError: [WinError 64] The specified network name is no longer available.
Task exception was never retrieved

aiohttp.client_exceptions.ClientOSError: [WinError 121] The semaphore timeout period has expired
Task exception was never retrieved

aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
Task exception was never retrieved

The general message is that something went wrong with the network request or connection.
As a beginner, I did not yet understand network connections or asynchronous coroutines in any depth, so it was hard to work out the cause of the errors.

Solution:

The root problem behind these errors is that every task creates its own session; when too many sessions are created, requests start to fail.

The fix is to create only one session and share it:

import asyncio
import aiohttp

async def download_page(url, session):
    async with session.get(url) as resp:
        result = await resp.text()

async def main(urls):
    tasks = []
    # The session is created once in the main function and passed to download_page as an argument.
    async with aiohttp.ClientSession() as session:
        for url in urls:
            # On Python 3.8+ asynchronous tasks should be created via asyncio.create_task();
            # otherwise the code still runs but prints a warning. (My Python version is 3.9.6.)
            tasks.append(asyncio.create_task(download_page(url, session)))
        await asyncio.wait(tasks)

if __name__ == '__main__':
    urls = [ url1, url2, …… ]
    asyncio.run(main(urls))

In this way, the connection errors can be largely avoided when crawling larger amounts of data.
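
Beyond sharing a single session, it can also help to cap how many requests are in flight at once; the original post does not cover this, but a minimal sketch with an asyncio.Semaphore looks like this (the limit of 10 is an arbitrary assumption):

import asyncio
import aiohttp

async def download_page(url, session, sem):
    async with sem:                        # wait for a free slot before sending the request
        async with session.get(url) as resp:
            return await resp.text()

async def main(urls):
    sem = asyncio.Semaphore(10)            # at most 10 concurrent requests
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(download_page(u, session, sem)) for u in urls]
        await asyncio.gather(*tasks)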

Takeaway: when programming, keep your thinking flexible; a small change can improve efficiency a lot.