Below is a piece of code that crawls second-hand housing listings from Anjuke:
import re
import time
import pymongo
import requests
from bson import ObjectId
from lxml import etree
from pprint import pprint
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36",
    "cookie": "aQQ_ajkguid=243e5d558-8b13-d7bd-4922-3de583e03855; ctid=11; _ga=GA1.2.1030980732.1530799904; _gid=GA1.2.506397644.1530799904; 58tj_uuid=c606f59a-2fb9-4c91-9815-741fdf9cfe5d; als=0; lps=http%3A%2F%2Fwww.anjuke.com%2F%3Fpi%3DPZ-baidu-pc-all-biaoti%7Chttps%3A%2F%2Fwww.baidu.com%2Fs%3Fie%3Dutf-8%26f%3D8%26rsv_bp%3D0%26rsv_idx%3D1%26tn%3Dbaidu%26wd%3D%25E5%25AE%2589%25E5%25B1%2585%25E5%25AE%25A2%26rsv_pq%3Dd71198bd000395ca%26rsv_t%3D6172VDlcx2zzRQ%252FLyCdcEidtafr%252BSvVyVXrlZ0lsK3U1MEz8066IF4byz4c%26rqlang%3Dcn%26rsv_enter%3D1%26rsv_sug3%3D5%26rsv_sug1%3D5%26rsv_sug7%3D101; twe=2; sessid=3497C1D2-43A8-6143-B2D7-CFDA33FF0C0E; new_uv=2; __xsptplus8 7 c = 8.2.1530840314.1530840335.2%232%7Cwww.baidu.com % % 7 c % 7 c % 25 e5%25 ae b1%2585% % 2589% % 25 e5 25 25 e5%25 ae % 25 a2%23 z7v3xnqldcxthemliqlxqslhvxrh8k_r 7 c % 23% % 23",
    "referer": "https://shanghai.anjuke.com/?pi=PZ-baidu-pc-all-biaoti"
}
# connect to the database
client = pymongo.MongoClient('127.0.0.1', 27017)
# select the database
db = client.anjuke
# select the collection
coll = db.ershoufang
def get_info():
    count = 0
    for i in range(23):
        response = requests.get('https://shanghai.anjuke.com/sale/p{}/#filtersort'.format(i), headers=headers)
        item = response.text
        # print(item)
        # use etree.HTML to parse the string into an HTML document
        html = etree.HTML(item)
        htmls = html.xpath('//*[@id="houselist-mod-new"]/li')
        # print(htmls)
        # one dict per page, reused for every listing in the loop below
        house = {}
        for h in htmls:
            h_addr = h.xpath('./div[2]/div[1]/a/text()')[0].strip()
            h_type = h.xpath('./div[2]/div[2]/span[1]/text()')[0].strip()
            h_area = h.xpath('./div[2]/div[2]/span[2]/text()')[0].strip()
            h_hight = h.xpath('./div[2]/div[2]/span[3]/text()')[0].strip()
            h_name = h.xpath('./div[2]/div[2]/span[4]/text()')[0].strip()
            # the three "youshi" (advantage) tags are optional, so fall back to None
            try:
                h_youshi1 = h.xpath('./div[2]/div[4]/span[1]/text()')[0].strip()
            except IndexError:
                h_youshi1 = None
            try:
                h_youshi2 = h.xpath('./div[2]/div[4]/span[2]/text()')[0].strip()
            except IndexError:
                h_youshi2 = None
            try:
                h_youshi3 = h.xpath('./div[2]/div[4]/span[3]/text()')[0].strip()
            except IndexError:
                h_youshi3 = None
            h_price = h.xpath('./div[3]/span[1]/strong/text()')[0].strip()
            house['h_addr'] = h_addr
            house['h_type'] = h_type
            house['h_area'] = h_area
            house['h_hight'] = h_hight
            house['h_name'] = h_name
            house['h_youshi1'] = h_youshi1
            house['h_youshi2'] = h_youshi2
            house['h_youshi3'] = h_youshi3
            house['h_price'] = h_price
            # pprint(house)
            time.sleep(0.01)
            # coll.insert(house)
            save(house)
            count += 1
            print(count)

def save(house):
    # Collection.insert() was deprecated in PyMongo 3.0 and removed in 4.0;
    # insert_one() is the modern equivalent
    coll.insert(house)

def main():
    get_info()

if __name__ == "__main__":
    main()
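This code can only store a couple of records before it stops. Printing house shows two kinds of data: one with an '_id' field and one without. The reason is that inserting a dictionary into MongoDB adds an '_id' field to that same dictionary when it has none. Because house is created outside the inner loop, the next insert reuses the same '_id' and is rejected as a duplicate key.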
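The behaviour described above can be reproduced without a database. The following is only a minimal sketch: FakeCollection and its DuplicateKeyError are hypothetical in-memory stand-ins (not PyMongo classes) that imitate the two relevant behaviours, adding an '_id' to the caller's dictionary and rejecting an '_id' that is already stored. The sample prices are made up for illustration.

```python
import itertools


class DuplicateKeyError(Exception):
    """Hypothetical stand-in for pymongo.errors.DuplicateKeyError."""


class FakeCollection:
    """Hypothetical in-memory stand-in for a MongoDB collection."""

    def __init__(self):
        self.docs = {}
        self._ids = itertools.count(1)

    def insert(self, doc):
        # Like PyMongo, add an _id to the caller's dict when it has none.
        if "_id" not in doc:
            doc["_id"] = next(self._ids)
        if doc["_id"] in self.docs:
            raise DuplicateKeyError(
                "E11000 duplicate key: _id={}".format(doc["_id"]))
        self.docs[doc["_id"]] = dict(doc)


coll = FakeCollection()
house = {}          # created once, outside the loop (the bug)
inserted = 0
error = None
for price in ["300", "450", "520"]:   # made-up sample values
    house["h_price"] = price
    try:
        coll.insert(house)   # the first call adds _id to `house` itself
        inserted += 1
    except DuplicateKeyError as exc:
        error = exc          # the second call reuses that _id and is rejected
        break

print(inserted, error)
```

Only the first record is stored; the second attempt carries the '_id' left behind by the first insert.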
There are two solutions:
1. Set the '_id' field yourself in the program instead of letting the system assign it. With this change the program runs without problems.
2. Move house = {} inside the for loop, so that each listing gets a new dictionary.
Either approach solves the problem. My personal suggestion is the second: it is the cleaner way to write the code, and it lets the system assign the IDs itself.
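Both solutions can be sketched side by side. This is an illustration under stated assumptions, not the original program: fake_insert and stored are hypothetical stand-ins for coll.insert and the collection, the prices are made-up sample values, and uuid.uuid4().hex stands in for the ObjectId() call one would use with PyMongo (the crawler already imports ObjectId from bson).

```python
import itertools
import uuid

_ids = itertools.count(1)   # stands in for MongoDB's own ID generation
stored = {}                 # hypothetical in-memory "collection", keyed by _id


def fake_insert(doc):
    """Stand-in for coll.insert: adds _id if absent, rejects duplicates."""
    if "_id" not in doc:
        doc["_id"] = next(_ids)
    if doc["_id"] in stored:
        raise ValueError("E11000 duplicate key: _id={}".format(doc["_id"]))
    stored[doc["_id"]] = dict(doc)


prices = ["300", "450", "520"]   # made-up sample values

# Solution 1: keep the shared dict, but overwrite _id on every iteration.
house = {}
for price in prices:
    house["_id"] = uuid.uuid4().hex   # with PyMongo: house["_id"] = ObjectId()
    house["h_price"] = price
    fake_insert(house)

# Solution 2: build a fresh dict per listing and let the store assign the ID.
for price in prices:
    house = {}
    house["h_price"] = price
    fake_insert(house)

print(len(stored))
```

In the second loop each dictionary starts without an '_id', so nothing from a previous insert can leak into the next one.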