Scraping Lianjia with RSelenium (Part 2: data storage and fault tolerance)

Continued from Part 1
The previous article, Scraping Lianjia with RSelenium (Part 1: simulated clicks and page scraping), focused on automating clicks on web pages. That code can capture the full data set, but only if no error or warning interrupts the run. Because LinkinfoFunc and HouseinfoFunc are both wrapped as functions, an interruption means the data captured before it is never written to a data frame or list.
When the amount of data is large and the scrape takes a long time, problems such as network interruptions are bound to occur. This article therefore builds on the previous one by adding a for loop that uses the tryCatch function for simple error handling. Compared with Part 1, the changes are as follows:

    Page setup and Step1: code unchanged
    Step2: remove the data frame result and move the data storage task to Step3
    Step3: add a for loop and introduce the tryCatch function

Code repeated from Part 1 is not commented again.
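As a minimal sketch of the mechanism (not the scraper itself): tryCatch evaluates an expression and, if it raises an error, passes the condition object to the error handler, whose return value becomes the result. The functions risky and safe_call below are hypothetical stand-ins for a network request and its wrapper.

```r
# Hypothetical stand-in for a request that may fail.
risky <- function(x) {
  if (x < 0) stop("negative input")
  sqrt(x)
}

# Return the value on success, or the error object on failure.
safe_call <- function(x) {
  tryCatch(risky(x), error = function(e) e)
}

good <- safe_call(4)
bad  <- safe_call(-1)

inherits(bad, "error")   # TRUE: the caller keeps running and can inspect the failure
conditionMessage(bad)    # the message that was passed to stop()
```

Because the error comes back as an ordinary value, a surrounding loop is not interrupted and can decide whether to retry.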


Page setup

library(rvest)
library(stringr)
library(RSelenium)
remDr <- remoteDriver(browserName = "chrome")
base1 <- "https://hui.lianjia.com/ershoufang/"
base2 <- c("danshui", "huiyangqu", "nanzhanxincheng")
url   <- paste(base1, base2, "/", sep = "")

Step1: Encapsulate the function LinkinfoFunc

LinkinfoFunc <- function(remDr, url) {
  result <- data.frame()
  remDr$open()
  for (i in seq_along(url)) {
    remDr$navigate(url[i])
    j = 0
    while (TRUE) {
      j = j + 1
      destination <- remDr$getPageSource()[[1]] %>% read_html()
      link        <- destination %>% html_nodes("li.clear div.title a") %>% html_attr("href")
      pageinfo    <- destination %>% html_nodes("div.house-lst-page-box") %>% 
                     html_attr("page-data") %>% str_extract_all(., ":[\\d]+") %>% 
                     unlist() %>% gsub(":", "", .)
      totalpage   <- pageinfo[1]
      curpage     <- pageinfo[2]
      data        <- data.frame(link, stringsAsFactors = FALSE)
      result      <- rbind(result, data)
      if (curpage != totalpage) {
        # cat() does not append a newline by itself, so it is part of the format string
        cat(sprintf("Area [%d], page [%d] captured successfully\n", i, j))
        remDr$executeScript("arguments[0].click();", 
                            list(remDr$findElement("css", "div.house-lst-page-box a.on+a")))
      } else {
        cat(sprintf("Area [%d], page [%d] captured successfully\n", i, j))
        break
      }
    }
    cat(sprintf("Area [%d] was captured successfully.\n", i))
  }
  remDr$close()
  cat("All work is done!\n")
  return(result)
}

Run the function

linkinfo  <- LinkinfoFunc(remDr, url) %>% unlist()
# Run LinkinfoFunc and flatten the result into linkinfo (a character vector of links)
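The paging logic inside LinkinfoFunc depends on pulling two numbers out of the page-data attribute. The sketch below reproduces that extraction on a made-up attribute value; the exact format of page_data is an assumption for illustration, and base R's regmatches and gregexpr are swapped in for str_extract_all so the example runs without stringr.

```r
# Hypothetical value of the "page-data" attribute (format assumed for illustration).
page_data <- '{"totalPage":12,"curPage":3}'

# Same idea as in LinkinfoFunc: match a colon followed by digits, then drop the colon.
matches  <- regmatches(page_data, gregexpr(":[0-9]+", page_data))[[1]]
pageinfo <- gsub(":", "", matches)

totalpage <- pageinfo[1]   # "12"
curpage   <- pageinfo[2]   # "3"

# Both values are character strings; comparing them as text is enough
# to detect whether the current page is the last one.
is_last_page <- curpage == totalpage
```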

Step2: Encapsulate the function HouseinfoFunc

HouseinfoFunc <- function(link) {
    destination <- read_html(link, encoding = "UTF-8")
    location    <- destination %>% html_nodes("a.no_resblock_a") %>% html_text()
    unit        <- destination %>% html_nodes(".price span.unit") %>% html_text()
    totalprice  <- destination %>% html_nodes(".price span.total:nth-child(1)") %>%
                   html_text() %>% paste(., unit, sep = "")
    downpayment <- destination %>% html_nodes(".taxtext span") %>% html_text() %>% .[1]
    persquare   <- destination %>% html_nodes("span.unitPriceValue") %>% html_text()
    area        <- destination %>% html_nodes(".area .mainInfo") %>% html_text()
    title       <- destination %>% html_nodes(".title h1") %>% html_text()
    subtitle    <- destination %>% html_nodes(".title div.sub") %>% html_text()
    room        <- destination %>% html_nodes(".room .mainInfo") %>% html_text()
    floor       <- destination %>% html_nodes(".room .subInfo") %>% html_text()
    data        <- data.frame(location, totalprice, downpayment, persquare, 
                              area, title, subtitle, room, floor,
                              stringsAsFactors = FALSE)
    return(data)
}
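One thing to keep in mind with a function like HouseinfoFunc: data.frame() stops with an error when its arguments have different lengths, which happens whenever a selector matches nothing and html_text() returns a zero-length vector. The error is caught later, but the whole record is lost. A possible refinement, shown here only as a sketch with a hypothetical helper first_or_na and simulated selector results, is to turn an empty match into NA.

```r
# Hypothetical helper: keep the first match, or NA if the selector matched nothing.
first_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else x[1]
}

# Simulated selector results: title matched once, subtitle did not match at all.
title    <- c("Sample listing title")
subtitle <- character(0)

# Building the data frame directly fails because the lengths differ (1 vs 0).
direct <- tryCatch(data.frame(title, subtitle), error = function(e) e)

# With the helper, the row is kept and the missing field becomes NA.
row <- data.frame(title    = first_or_na(title),
                  subtitle = first_or_na(subtitle),
                  stringsAsFactors = FALSE)
```

This is not part of the original code; it trades a hard failure for a row with missing fields, which may or may not be what you want.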

Step3: Catch exceptions with a for loop and tryCatch

result <- list()
# Create an empty list that will hold the captured data
# (alternatively: result <- vector("list", length(linkinfo)))
for (i in seq_along(linkinfo)) {
  if (!(linkinfo[i] %in% names(result))) {
  # Proceed only if linkinfo[i] has not been written to result yet,
  # so that links already fetched are not fetched again
    cat(paste("Doing", i, linkinfo[i], "..."))
    # Report which link is currently being fetched
    ok <- FALSE
    # Success flag, initially FALSE
    counter <- 0
    # Number of connection attempts made so far
    while (!ok && counter < 5) {
    # ok becomes TRUE once the data is captured, which ends the while loop
    # counter limits the retries for linkinfo[i]: it is 1 on the first attempt, and after an error the loop runs again for the 2nd to 5th attempts
      counter <- counter + 1
      # Count this attempt
      output <- tryCatch({                  
        HouseinfoFunc(linkinfo[i])
        # Run HouseinfoFunc on linkinfo[i] to scrape the page
      },
      error = function(e) {
      # If an error occurs
        Sys.sleep(2)
        # wait 2 seconds
        e
        # and return the error object
      }
      )
      if ("error" %in% class(output)) {
      # If output is an error object
        cat("NA...")
        # print "NA..." and go back to the while loop for the next attempt (up to the 5th)
      } else {
      # If output is the captured data rather than an error
        ok <- TRUE
        # set the flag to TRUE, which ends the while loop
        cat("Done.")
        # print "Done."
      }
    }
    cat("\n")
    result[[i]] <- output
    # On each pass of the for loop, write output to result
    names(result)[i] <- linkinfo[i]
    # Name the i-th element of result after the corresponding house link (URL)
  }
} 
# The result list collected in this step contains both the target data and the error objects returned for 404 Not Found pages, so the two need to be separated.
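The retry pattern used in the loop above can also be isolated into a small helper. The following is a sketch under stated assumptions, not the article's code: with_retry and flaky are hypothetical names, and flaky simulates a page that fails twice before succeeding so that the example runs without a network connection.

```r
# Hypothetical helper: call f(x) up to max_tries times, pausing `wait` seconds
# after each failure. Returns the value of f(x) on success, or the last error
# object if every attempt fails.
with_retry <- function(f, x, max_tries = 5, wait = 2) {
  output <- NULL
  for (attempt in seq_len(max_tries)) {
    output <- tryCatch(f(x), error = function(e) {
      Sys.sleep(wait)
      e
    })
    if (!inherits(output, "error")) break
  }
  output
}

# Stand-in for HouseinfoFunc: fails on the first two calls, then succeeds.
calls <- 0
flaky <- function(x) {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  paste("data for", x)
}

res <- with_retry(flaky, "link-1", wait = 0)

# A function that always fails yields an error object after max_tries attempts.
always_bad <- with_retry(function(x) stop("gone"), "link-2", max_tries = 2, wait = 0)
```

Inside the for loop, a call such as with_retry(HouseinfoFunc, linkinfo[i]) would then replace the while block, at the cost of losing the per-attempt "NA..." progress output.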

Data separation and extraction

result <- lapply(result, function(x) {
  if (unlist(x) %>% length() == 9) {
    return(x)
  } else {
    return(NULL)
  }
})
# Flatten each element of result. The target data has 9 variables, so a target element has length 9 once flattened.
# Use this to keep the target elements and set every other element to NULL.
result <- result[!sapply(result, is.null)]
# Drop the NULL elements so that result contains only target data.
houseinfo <- do.call(rbind, result)
# Combine the remaining elements with rbind to get houseinfo (a data.frame).
View(houseinfo)
write.table(houseinfo, "houseinfo.csv", row.names = FALSE, sep = ",")
# Inspect the data with View() and export it to a local CSV file.
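An alternative to the length-9 check, offered as a sketch rather than as the article's method, is to test each element's type directly: anything that is a data frame is kept, and the names of the rest are recorded for a later retry. The list result_demo below is toy data that mimics the shape of result.

```r
# Toy list mimicking the Step3 result: two data frames and one captured error.
result_demo <- list(
  a = data.frame(room = "3 rooms", floor = "mid", stringsAsFactors = FALSE),
  b = simpleError("HTTP error 404."),
  c = data.frame(room = "2 rooms", floor = "low", stringsAsFactors = FALSE)
)

# Flag the elements that are data frames.
is_df <- vapply(result_demo, is.data.frame, logical(1))

# Stack the valid rows and remember which links failed.
houses <- do.call(rbind, result_demo[is_df])
failed <- names(result_demo)[!is_df]
```

This does not depend on the number of columns, so it keeps working if HouseinfoFunc later returns more or fewer variables.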

Conclusion
Although LinkinfoFunc, the function wrapped in Step1, can capture every house link into linkinfo, each link is only valid for a limited time. Once a house is taken off the market, its page returns 404 Not Found. Not only can the page content no longer be captured, but the rest of the scrape is interrupted as well. Step3 therefore does the following:

    1. The for loop goes through each element of linkinfo and runs tryCatch on it to detect and catch failing links.
    2. For a failing linkinfo[i], the while loop retries the capture, up to 5 attempts in total, waiting 2 seconds after each failure.
    3. If the loop is interrupted manually, the if (!(linkinfo[i] %in% names(result))) check skips the links that have already been fetched, so rerunning the for loop resumes from the point of interruption.
    4. For a link that still fails, the value written to result would be NULL by default (Step3 writes the error object instead). These invalid values must be removed before rbind; otherwise rbind fails because the elements have inconsistent lengths.

Sample results and errors:

[For links that connect and are captured normally, the for loop prints the following]
Doing 1 https://hui.lianjia.com/ershoufang/105101098943.html ...Done.
Doing 2 https://hui.lianjia.com/ershoufang/105101085455.html ...Done.

[For links that cannot be connected or captured after 5 attempts, the for loop output and warning messages are as follows]

Doing 3 https://hui.lianjia.com/ershoufang/105101261413.html ...NA...NA...NA...NA...NA...
Doing 4 https://hui.lianjia.com/ershoufang/105112912491.html ...NA...NA...NA...NA...NA...
Warning messages:
1: closing unused connection 11 (https://hui.lianjia.com/ershoufang/105101261413.html) 
2: closing unused connection 10 (https://hui.lianjia.com/ershoufang/105101261413.html) 
3: closing unused connection 9 (https://hui.lianjia.com/ershoufang/105101261413.html) 
4: closing unused connection 8 (https://hui.lianjia.com/ershoufang/105101261413.html) 
5: closing unused connection 7 (https://hui.lianjia.com/ershoufang/105101261413.html) 

[The result from Step3 contains both target elements (the first two) and non-target elements (the last two)]

$`https://hui.lianjia.com/ershoufang/105101098943.html`
          location totalprice downpayment   persquare       area                                            title
1 Huizhou South Station, South and North Transparent, full five only, sincere and sincere sale at any time.
                                                  subtitle room floor
1 This house is the only one with full five, less taxes, mid-high floor, square, owner sincere sale 3 room 2 hall mid floor / 11 stories

$`https://hui.lianjia.com/ershoufang/105101085455.html`
  location totalprice downpayment   persquare      area                                          title
1 Pengcheng City 1,050,000 down payment 320,000 RMB 9981 yuan/sqm 105.2 sqm full five the only one Lin Shen area District government central location next to Metro Line 14.
                                                subtitle room floor
1 This house is full five only, no VAT tax, living room out of the balcony to see the garden, unobstructed. 3 Bedrooms 2 Bedrooms Mid floor/28 floors

$`https://hui.lianjia.com/ershoufang/105101261413.html`
<simpleError in open.connection(x, "rb"): HTTP error 404.>

$`https://hui.lianjia.com/ershoufang/105112912491.html`
<simpleError in open.connection(x, "rb"): HTTP error 404.>

[If rbind is performed on result directly, the following error will occur.]

Error in rbind(deparse.level, ...) : 
  invalid list argument: all variables should have the same length


Reference:
Iterating rvest scrape function gives: "Error in open.connection(x, "rb") : Timeout was reached"
