Scenario: in a scrapy-redis project, URLs needed to be filtered and deduplicated, so I wrote a custom dupefilter class.
I simply copied the scrapy-redis dupefilter source code and rewrote the request_seen method to change the logic. My first attempt assigned the new URL directly, which raises the error in the title, because Request.url is a read-only property:
def request_seen(self, request):
    temp_request = request
    if "ref" in temp_request.url:
        # An error is reported here; you cannot assign a value directly
        temp_request.url = temp_request.url.split("ref")[0]
    fp = self.request_fingerprint(temp_request)
    added = self.server.sadd(self.key, fp)
    return added == 0
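To see why the assignment fails, here is a minimal sketch of a read-only property. The class ReadOnlyUrl is a hypothetical stand-in written for illustration, not Scrapy's actual Request source; it only mimics the fact that url can be read but not assigned.

```python
class ReadOnlyUrl:
    """Hypothetical stand-in for a request whose url is a read-only property."""

    def __init__(self, url):
        self._url = url

    @property
    def url(self):
        # Only a getter is defined, so `obj.url = ...` is rejected.
        return self._url


req = ReadOnlyUrl("https://example.com/item?id=1&ref=home")

error = None
try:
    # Same pattern as the failing line in request_seen
    req.url = req.url.split("ref")[0]
except AttributeError as exc:
    error = exc

print(type(error).__name__)
print(req.url)  # unchanged, the assignment never happened
```

Python raises AttributeError for any property that has no setter, which matches the behavior seen with the request object above.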
Solution:
Use _set_url(url) instead of assigning to url:
    temp_request._set_url(temp_request.url.split("ref")[0])
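One thing to keep in mind: _set_url is a private method, and temp_request is the same object as request, so the call also changes the URL that will actually be fetched. If that is not wanted, a possible alternative is the public replace() method, which returns a modified copy. The helper name request_for_fingerprint below is hypothetical, and the sketch only assumes the request has a url attribute and a replace(url=...) method, as scrapy.Request does.

```python
def request_for_fingerprint(request):
    """Return the request to fingerprint, leaving the original untouched.

    Assumes `request` behaves like scrapy.Request: a readable `url`
    attribute and a `replace(url=...)` method that returns a copy.
    """
    if "ref" in request.url:
        # Build a copy with the truncated URL instead of mutating in place
        return request.replace(url=request.url.split("ref")[0])
    return request
```

Inside request_seen this would be used as fp = self.request_fingerprint(request_for_fingerprint(request)), so only the fingerprint sees the shortened URL.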
The complete code is as follows:
import logging
import time

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint
from scrapy_redis import get_redis_from_settings
from scrapy_redis import defaults

logger = logging.getLogger(__name__)


class UrlFilter(BaseDupeFilter):
    logger = logger

    def __init__(self, server, key, debug=False):
        self.server = server
        self.key = key
        self.debug = debug
        self.logdupes = True

    @classmethod
    def from_settings(cls, settings):
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    @classmethod
    def from_crawler(cls, crawler):
        """Returns instance from crawler.

        Parameters
        ----------
        crawler : scrapy.crawler.Crawler

        Returns
        -------
        UrlFilter
            Instance of UrlFilter.
        """
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        # temp_request is the same object as request, so _set_url below
        # also changes the URL of the request that will be fetched.
        temp_request = request
        if "ref" in temp_request.url:
            # Request.url is read-only; _set_url bypasses the property
            temp_request._set_url(temp_request.url.split("ref")[0])
        fp = self.request_fingerprint(temp_request)
        # sadd returns 0 when the fingerprint was already in the set
        added = self.server.sadd(self.key, fp)
        return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        str
        """
        return request_fingerprint(request)

    @classmethod
    def from_spider(cls, spider):
        settings = spider.settings
        server = get_redis_from_settings(settings)
        dupefilter_key = settings.get(
            "SCHEDULER_DUPEFILTER_KEY", defaults.SCHEDULER_DUPEFILTER_KEY
        )
        key = dupefilter_key % {'spider': spider.name}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler.

        Parameters
        ----------
        reason : str, optional
        """
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)

    def log(self, request, spider):
        """Logs given request.

        Parameters
        ----------
        request : scrapy.http.Request
        spider : scrapy.spiders.Spider
        """
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False
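A caveat on the substring test: "ref" in url also matches URLs such as /preferences, and split("ref") would cut those at the wrong place. If ref is really a query parameter (an assumption here, the post does not say what kind of URLs are involved), a stricter variant could remove only that parameter with the standard library. The helper name strip_query_param is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query_param(url, name="ref"):
    """Return url without any query parameter called `name`."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    # urlencode re-encodes the remaining pairs, so escaping may be normalized
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_query_param("https://example.com/item?id=1&ref=home"))
# https://example.com/item?id=1
print(strip_query_param("https://example.com/preferences?page=2"))
# https://example.com/preferences?page=2
```

The result could then be passed to _set_url (or replace) in request_seen in place of the split("ref")[0] expression.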