
pymongo.errors.DuplicateKeyError: E11000 duplicate key error collection: anjuke.ershoufang index

This bug kept me busy for a whole afternoon and evening before I finally knocked it out.
Here is the part of the crawler that collects second-hand housing listings from Anjuke:
import re
import time
import pymongo
import requests
from bson import ObjectId
from lxml import etree
from pprint import pprint
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36",
    "cookie": "aQQ_ajkguid=243e5d558-8b13-d7bd-4922-3de583e03855; ctid=11; _ga=GA1.2.1030980732.1530799904; _gid=GA1.2.506397644.1530799904; 58tj_uuid=c606f59a-2fb9-4c91-9815-741fdf9cfe5d; als=0; lps=http%3A%2F%2Fwww.anjuke.com%2F%3Fpi%3DPZ-baidu-pc-all-biaoti%7Chttps%3A%2F%2Fwww.baidu.com%2Fs%3Fie%3Dutf-8%26f%3D8%26rsv_bp%3D0%26rsv_idx%3D1%26tn%3Dbaidu%26wd%3D%25E5%25AE%2589%25E5%25B1%2585%25E5%25AE%25A2%26rsv_pq%3Dd71198bd000395ca%26rsv_t%3D6172VDlcx2zzRQ%252FLyCdcEidtafr%252BSvVyVXrlZ0lsK3U1MEz8066IF4byz4c%26rqlang%3Dcn%26rsv_enter%3D1%26rsv_sug3%3D5%26rsv_sug1%3D5%26rsv_sug7%3D101; twe=2; sessid=3497C1D2-43A8-6143-B2D7-CFDA33FF0C0E; new_uv=2; __xsptplus8 7 c = 8.2.1530840314.1530840335.2%232%7Cwww.baidu.com % % 7 c % 7 c % 25 e5%25 ae b1%2585% % 2589% % 25 e5 25 25 e5%25 ae % 25 a2%23 z7v3xnqldcxthemliqlxqslhvxrh8k_r 7 c % 23% % 23",
    "referer": "https://shanghai.anjuke.com/?pi=PZ-baidu-pc-all-biaoti"
}

# connect to the MongoDB server
client = pymongo.MongoClient('127.0.0.1', 27017)
# select the database
db = client.anjuke
# select the collection
coll = db.ershoufang

def get_info():
    count = 0
    for i in range(23):
        response = requests.get('https://shanghai.anjuke.com/sale/p{}/#filtersort'.format(i), headers=headers)
        item = response.text

        # print(item)
        # use etree.HTML to parse the string into an HTML document
        html = etree.HTML(item)
        htmls = html.xpath('//*[@id="houselist-mod-new"]/li')
        # print(htmls)

        # created once per page, outside the per-listing loop
        house = {}
        for h in htmls:
            # paths start with "./" so they are evaluated relative to the current <li>
            h_addr = h.xpath('./div[2]/div[1]/a/text()')[0].strip()
            h_type = h.xpath('./div[2]/div[2]/span[1]/text()')[0].strip()
            h_area = h.xpath('./div[2]/div[2]/span[2]/text()')[0].strip()
            h_hight = h.xpath('./div[2]/div[2]/span[3]/text()')[0].strip()
            h_name = h.xpath('./div[2]/div[2]/span[4]/text()')[0].strip()
            # the three "advantage" tags are optional, so a missing one becomes None
            try:
                h_youshi1 = h.xpath('./div[2]/div[4]/span[1]/text()')[0].strip()
            except IndexError:
                h_youshi1 = None
            try:
                h_youshi2 = h.xpath('./div[2]/div[4]/span[2]/text()')[0].strip()
            except IndexError:
                h_youshi2 = None
            try:
                h_youshi3 = h.xpath('./div[2]/div[4]/span[3]/text()')[0].strip()
            except IndexError:
                h_youshi3 = None
            h_price = h.xpath('./div[3]/span[1]/strong/text()')[0].strip()

            house['h_addr'] = h_addr
            house['h_type'] = h_type
            house['h_area'] = h_area
            house['h_hight'] = h_hight
            house['h_name'] = h_name
            house['h_youshi1'] = h_youshi1
            house['h_youshi2'] = h_youshi2
            house['h_youshi3'] = h_youshi3
            house['h_price'] = h_price
            # pprint(house)
            time.sleep(0.01)

            # coll.insert_one(house)
            save(house)
            count += 1
            print(count)

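def save(house):
    # insert_one replaces the legacy coll.insert(), which was removed in PyMongo 4;
    # like insert(), it adds an '_id' key to the dict when one is missing
    coll.insert_one(house)


def main():
    get_info()


if __name__ == "__main__":
    main()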
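This code only gets as far as the second record before it fails with the error above.

Printing the data shows two versions of the dictionary, one with an '_id' field and one without. The reason is that the insert call adds an '_id' key to the dictionary it is given whenever that key is missing. Because house is created once, outside the inner for loop, the second insert receives the same dictionary, still carrying the '_id' from the first insert, and MongoDB rejects it as a duplicate key.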
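To make the cause easier to see without a running MongoDB server, here is a minimal sketch that imitates the behaviour. FakeCollection and its DuplicateKeyError are made-up stand-ins written for this example, not part of PyMongo; they only mimic two things a real collection does, namely adding an '_id' to the dict when it is missing and refusing a second document with the same '_id'.

```python
import uuid


class DuplicateKeyError(Exception):
    """Hypothetical stand-in for pymongo.errors.DuplicateKeyError."""


class FakeCollection:
    """Hypothetical in-memory stand-in for a MongoDB collection."""

    def __init__(self):
        self.docs = {}

    def insert_one(self, doc):
        # Like the real call, add an '_id' to the caller's dict if it has none.
        if "_id" not in doc:
            doc["_id"] = uuid.uuid4().hex
        # Like the real server, refuse a second document with the same '_id'.
        if doc["_id"] in self.docs:
            raise DuplicateKeyError("E11000 duplicate key error: _id {}".format(doc["_id"]))
        self.docs[doc["_id"]] = dict(doc)


def insert_reusing_one_dict(coll, addresses):
    """Reproduce the bug: one dict is created before the loop and reused."""
    house = {}
    inserted = 0
    for addr in addresses:
        house["h_addr"] = addr  # overwrites the field, but '_id' stays from last time
        try:
            coll.insert_one(house)
        except DuplicateKeyError:
            break
        inserted += 1
    return inserted, house
```

Calling insert_reusing_one_dict(FakeCollection(), ["addr 1", "addr 2", "addr 3"]) stores only the first record: the second call already carries the first record's '_id', so it is rejected.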
There are two solutions:
1. Set the '_id' field yourself in the program instead of leaving it to the system.

With this change the program runs without problems.
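A rough sketch of the first approach, under stated assumptions: with_own_id and collect_ids are hypothetical helper names, and uuid4 is used only so the example runs without extra packages. With PyMongo, bson.ObjectId() would be the more usual way to generate the value. The point is simply that the reused dictionary gets a new '_id' before every save.

```python
import uuid


def with_own_id(house):
    """Overwrite '_id' with a freshly generated value and return the same dict."""
    # Assumption: uuid4 stands in for an ID generator such as bson.ObjectId().
    house["_id"] = uuid.uuid4().hex
    return house


def collect_ids(addresses):
    """Reuse one dict for every record, but give it a new '_id' before each save."""
    house = {}
    ids = []
    for addr in addresses:
        house["h_addr"] = addr
        with_own_id(house)
        # A real script would call coll.insert_one(house) at this point.
        ids.append(house["_id"])
    return ids
```

Every value returned by collect_ids is different, so no two saves would collide on '_id' even though the same dictionary object is reused.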

2. Move house = {} inside the for loop, so that every record gets a new dictionary.
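A small sketch of the second approach, again without a real database. fake_insert is a hypothetical stand-in that only imitates the relevant behaviour of an insert (add '_id' if missing, reject duplicates), and save_all is an illustrative name, not code from the crawler.

```python
import itertools

_next_id = itertools.count(1)


def fake_insert(doc, stored):
    """Hypothetical stand-in for an insert: adds '_id' if missing, rejects duplicates."""
    if "_id" not in doc:
        doc["_id"] = next(_next_id)
    if doc["_id"] in stored:
        raise ValueError("duplicate _id {}".format(doc["_id"]))
    stored[doc["_id"]] = doc


def save_all(rows):
    """Build a new dict for each (address, price) row and store it."""
    stored = {}
    for addr, price in rows:
        house = {}  # created inside the loop, so no '_id' carries over
        house["h_addr"] = addr
        house["h_price"] = price
        fake_insert(house, stored)
    return stored
```

Because each iteration starts from an empty dictionary, the stand-in assigns a new '_id' every time and all rows are stored.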

Either approach solves the problem. I personally recommend the second: it is the cleaner way to write the code, and it lets the system assign the ID.