Tag Archives: r language

A sparse matrix in R that is too large for as.matrix

A very large matrix with 320,127 rows and 8,189 columns, filled entirely with zeros, takes 19.5 GB when stored as an ordinary numeric matrix and 9.8 GB when stored as an integer matrix:

cols <- 8189
rows <- 320127
mat <- matrix(data = 0, nrow=320127, ncol = 8189)
print(object.size(mat), unit="GB")
# 19.5 Gb
mat <- matrix(data = 0L, nrow=320127, ncol = 8189)
print(object.size(mat), unit="GB")
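# 9.8 Gb

The two zeros here are not the same. 0L is an integer, whereas a plain 0 is numeric (double) by default, which is why the integer matrix takes half the space. The most visible difference between the two types is overflow: 320127L * 8189L returns NA (with a warning) because the product exceeds the largest representable integer, whereas 320127 * 8189 is computed without trouble.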
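The overflow mentioned above can be checked directly. This is only a small illustration; the variable names n_int and n_dbl are made up for the example.

```r
# Integer arithmetic overflows past .Machine$integer.max and yields NA
n_int <- suppressWarnings(320127L * 8189L)

# Double (numeric) arithmetic has no such limit at this scale
n_dbl <- 320127 * 8189

print(.Machine$integer.max)  # largest value an R integer can hold
print(n_int)
print(n_dbl)
```

The number of cells in the matrix is therefore larger than any R integer, which is one reason operations on it behave differently from operations on smaller matrices.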
If you store it as a sparse matrix instead:

library(Matrix)
mat <- Matrix(data = 0L, nrow=320127, ncol = 8189, sparse = TRUE)
print(object.size(mat), unit="GB")
#0 Gb
dim(mat)
#[1] 320127   8189

Although the numbers of rows and columns are the same, the sparse matrix takes up almost no memory. Operations supported by ordinary matrices, such as row sums, column sums, and element extraction, also work on sparse matrices, although they take a little longer. Many R packages support sparse matrices as well, for example glmnet, which fits lasso regressions.
Sparse matrices look attractive, but some operations on a sparse matrix this large can go wrong in R:

> mat2 <- mat + 1
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Even converting it back to an ordinary matrix with as.matrix fails with the same error:

> mat3 <- Matrix::as.matrix(mat)
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Since the ready-made as.matrix cannot handle it, what can be done? The simplest and crudest approach is to create a new ordinary matrix, then walk through the sparse matrix and copy its values into the ordinary matrix one by one.

mat2 <- matrix(data = 0, nrow=320127, ncol = 8189)
for (i in seq_len(nrow(mat))){
    for (j in seq_len(ncol(mat))){
        mat2[i, j] <- mat[i, j]
    }
}

So how long does this take? On my computer it had not finished after two hours, so don't bother testing it.
Is there a way to speed it up? The way to go faster is to reduce the number of loop iterations: because the matrix is sparse and most entries are 0, only the nonzero entries need to be assigned to the new matrix.
This requires understanding the data structure of a sparse matrix:

> str(mat)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int(0) 
  ..@ p       : int [1:8190] 0 0 0 0 0 0 0 0 0 0 ...
  ..@ Dim     : int [1:2] 320127 8189
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num(0) 
  ..@ factors : list()

@Dim records the dimensions of the matrix, @Dimnames records the row and column names, and @x records the nonzero values. @i records the zero-based row index of each value in @x; here every entry is 0, so nothing is recorded. @p is more complicated. It is not simply the column index of each nonzero value but a vector of cumulative counts with one more element than there are columns: the difference between consecutive elements is the number of nonzero values in that column. The documentation did not make this clear to me, but a search turned up how to convert it into column indices.
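To make the slots concrete, here is a tiny sketch on a 3 x 4 matrix with three nonzero values. The data and the name m are invented for illustration; sparseMatrix() comes from the Matrix package.

```r
library(Matrix)

# Hypothetical example: nonzero values at (1,1), (3,1) and (2,4)
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 4),
                  x = c(10, 20, 30), dims = c(3, 4))

print(m@i)  # zero-based row index of each nonzero value
print(m@p)  # column pointers, one more element than there are columns
print(m@x)  # the nonzero values, stored column by column

# diff(m@p) is the number of nonzero values in each column,
# so repeating each column number that many times gives the column index
col_pos <- rep(seq_len(ncol(m)), times = diff(m@p))
print(col_pos)
```

The findInterval() expression used in this post gives the same column indices as the rep()/diff() form shown here.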
Therefore, the code is optimized as

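# Destination matrix (numeric, so about 19.5 GB for the matrix above)
tmp <- matrix(data = 0, nrow = mat@Dim[1], ncol = mat@Dim[2])

row_pos <- mat@i + 1                                          # one-based row index
col_pos <- findInterval(seq_along(mat@x) - 1, mat@p[-1]) + 1  # one-based column index
val <- mat@x

# Copy only the nonzero values
for (i in seq_along(val)){
    tmp[row_pos[i], col_pos[i]] <- val[i]
}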

You can encapsulate it as a function

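as_matrix <- function(mat){

  # Numeric destination, so assigning mat@x does not force a type conversion
  tmp <- matrix(data = 0, nrow = mat@Dim[1], ncol = mat@Dim[2])

  row_pos <- mat@i + 1                                          # one-based row index
  col_pos <- findInterval(seq_along(mat@x) - 1, mat@p[-1]) + 1  # one-based column index
  val <- mat@x

  # Copy only the nonzero values
  for (i in seq_along(val)){
      tmp[row_pos[i], col_pos[i]] <- val[i]
  }

  rownames(tmp) <- mat@Dimnames[[1]]
  colnames(tmp) <- mat@Dimnames[[2]]
  return(tmp)
}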
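Before reaching for compiled code, the loop can also be replaced by a single matrix-indexed assignment. This is a sketch of that alternative rather than the function from this post; the name as_matrix_vec is made up, and it assumes a dgCMatrix input.

```r
library(Matrix)

as_matrix_vec <- function(mat) {
  # Numeric destination with the same shape and names as the sparse matrix
  out <- matrix(0, nrow = nrow(mat), ncol = ncol(mat), dimnames = dimnames(mat))
  row_pos <- mat@i + 1L
  col_pos <- rep(seq_len(ncol(mat)), times = diff(mat@p))
  # A two-column index matrix assigns all nonzero values in one step
  out[cbind(row_pos, col_pos)] <- mat@x
  out
}

# Small hypothetical example to check the result against as.matrix()
small <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 4),
                      x = c(10, 20, 30), dims = c(3, 4))
dense <- as_matrix_vec(small)
print(dense)
```

On a small matrix the result can be compared with as.matrix(); on the large one only this manual route is available.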

If you need still more speed, you will probably have to turn to Rcpp. I wrote a simple version with reference to http://adv-r.had.co.nz/Rcpp.html; to continue reading, visit my blog at http://xuzhougeng.top/archives/R-Sparse-Matrix-Note.

Error: cannot allocate vector of size 88.1 MB

While training a model over the past few days, my code kept stopping with "Error: cannot allocate vector of size 88.1 MB". All I knew was that R could not allocate enough memory.
Here are some of the answers I found:
1. This is a characteristic of R, and there are several remedies:
(1) Upgrade to R 3.3.0 or later, where memory management and matrix computation are much better. Calculations that crash on R 3.2.5 run fine on R 3.3.0 and above.
(2) Load one of the R packages that cache data on disk (search for them).
(3) Add commands that free memory at suitable points in the code.
(4) Run multiple threads.
(5) Adding memory only helps so much. R 3.2.5 could crash a server with 44 cores and 512 GB of memory, so the code itself has to be optimized.
2. Sometimes adding memory cannot keep up with large data volumes, so a parallel computing strategy is used instead. If the data cannot be read in one go, the filematrix package can read it from disk in several passes, although this is much slower.
3. Find the setting in R for the maximum memory allocation; it is under Preferences or somewhere similar.
4. Download the bigmemory package. It rebuilds the classes used for large data sets and is close to the state of the art for handling them (including data of tens of gigabytes).
Link: cran.r-project.org/web/packages/bigmemory/
5. The bigmemory package works well. Two other options, mapReduce and RHIPE (which uses Hadoop), can also handle large data sets.
6. An expert's guide (http://bbs.pinggu.org/thread-3682816-1-1.html): "cannot allocate vector" is the typical sign of data too big to read.
There are three remedies:
(1) Upgrade the hardware
(2) Improve the algorithm
(3) Raise the upper limit of memory the operating system allocates to R. On Windows, the relevant functions are:
memory.size(TRUE) shows the memory allocated to R
memory.size(FALSE) shows the memory currently in use
memory.limit() shows the memory limit
object.size() shows how much memory a given variable takes up
If the current limit is not sufficient, change it with memory.limit(newLimit). Note that 32-bit R is capped at 4 GB and a program cannot use more than that (an address-width limit). In that case, consider using a 64-bit version.
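As a rough illustration of the "check object sizes and free memory" advice, the sketch below uses only object.size(), rm() and gc(), which work on every platform. The objects big and small are hypothetical stand-ins for real data.

```r
big <- numeric(1e6)   # hypothetical large object, roughly 8 MB
small <- 1:10

# Size in bytes of each named object, largest first
sizes <- sapply(c("big", "small"),
                function(nm) as.numeric(object.size(get(nm))))
print(sort(sizes, decreasing = TRUE))

# Drop what is no longer needed and let R return the memory
rm(big)
invisible(gc())
```

Listing sizes first makes it easier to decide which objects are worth removing before the next large allocation.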
For details, see this very good article: https://blog.csdn.net/sinat_26917383/article/details/51114265
1. http://jliblog.com/archives/276
2. http://cos.name/wp-content/uploads/2011/05/01-Li-Jian-HPC.pdf
3. R high performance computing and parallel computing: http://cran.r-project.org/web/views/HighPerformanceComputing.html
If you run into this problem, try the corresponding remedy; these methods work well.
The original address: https://www.cnblogs.com/babyfei/p/9565143.html

R learning notes (1) — ARIMA model

These notes record the problems I ran into, and their solutions, while working through an existing tutorial (http://blog.csdn.net/desilting/article/details/39013825).
If a time series has to be differenced d times to obtain a stationary series, use the ARIMA(p, d, q) model, where d is the order of differencing. ARIMA stands for Autoregressive Integrated Moving Average model. AR is the autoregressive part and p is the number of autoregressive terms; MA is the moving average part and q is the number of moving average terms; d is the number of differences taken to make the series stationary.
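As a minimal sketch of what d means in practice, the example below differences a simulated random walk once and fits an ARIMA(1, 1, 0) model. The simulated series and the chosen orders are assumptions for illustration only, not the data from the tutorial.

```r
set.seed(1)
x <- ts(cumsum(rnorm(200)))      # simulated random walk: not stationary

dx <- diff(x, differences = 1)   # one round of differencing, i.e. d = 1

# order = c(p, d, q); arima() does the differencing itself when d = 1
fit <- arima(x, order = c(1, 1, 0))
print(fit)
```

Passing the undifferenced series together with d = 1 is equivalent to modelling dx with d = 0 and no mean term.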
Here are some basic ways to view help:
1. help()
2. Through the R interface: open R, then click "packages" on the page that pops up and go to "…"
3. library(help="MASS")
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
data <- xts(data, seq(as.POSIXct("2014-01-01"), len=length(data), by="day"))
Error in as.vector(x, mode) :
cannot coerce type 'closure' to vector of type 'any'
Solution: the blogger simply forgot to substitute the input data (source). Without a variable of that name, data refers to R's built-in function data(), which is a closure. The line should be data <- xts(source, seq(as.POSIXct("2014-01-01"), len=length(source), by="day"))
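The "closure" in the message can be seen directly in a fresh session where no variable called data has been created. The snippet is only a demonstration; msg is a made-up name.

```r
# In a fresh session, `data` is the data() function from the utils package
print(is.function(data))
print(typeof(data))

# Treating that function as if it were a numeric vector fails
msg <- tryCatch(as.numeric(data), error = function(e) conditionMessage(e))
print(msg)
```

Giving the input a name that is not already a function, or creating the variable before use, avoids the problem.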

acf <- acf(data_diff1, lag.max=100, plot=FALSE)
Error in na.fail.default(as.ts(x)) : missing values in object
Solution: acf <- acf(data_diff1, lag.max=100, na.action = na.pass, plot=FALSE)
However, in the ACF plot produced this way, the maximum of the horizontal axis (the lag) is not 100, and the axis values grow exponentially, labelled e+00, e+02 and so on.
Inspecting the ACF values shows that lag = 1 now has a horizontal coordinate of 86400, so the unit seems to have been converted to seconds (?).
I have found no way to fix this so far, other than omitting the step data <- xts(data, seq(as.POSIXct("2014-01-01"), len=length(data), by="day")).

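data.fit <- arima(data, order = c(7, 0), seasonal = list(order = c(1, 0), period = 7))
This sets up the seasonal part of the ARIMA model; the rules for choosing these settings are unclear to me. Note that arima() expects three values, c(p, d, q), in each order vector.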
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
The final result is the same as the original blogger's.
Question: is a unit root test needed? (The series has been verified to be stationary.)

R language error in match.names(clabs, names(xi)) :

Today this error occurred when I used the rbind function in R, and I searched for a long time without finding the cause.
The reasons for the error:
1. The data being combined are data.frames
2. The column names (headers) of the rows to be merged are different

Solution: convert the data to matrices and rbind works. Alternatively, pass header=FALSE when reading the data, which also works. I tried that but found it awkward to use, so I switched to matrices.

Error message: Error in match.names(clabs, names(xi)) : names do not match previous names
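The error is easy to reproduce with two small data frames. The sketch below uses invented data and shows one more way out besides converting to a matrix: giving both data frames the same column names.

```r
a <- data.frame(id = 1:2, count = c(5, 7))
b <- data.frame(id = 3:4, n = c(9, 1))   # second column is named differently

# rbind() on data frames matches columns by name, so this fails
msg <- tryCatch(rbind(a, b), error = function(e) conditionMessage(e))
print(msg)

# Aligning the names makes the same call succeed
names(b) <- names(a)
combined <- rbind(a, b)
print(combined)
```

Renaming keeps the result a data.frame, which matters when the columns have different types.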

Error analysis of multiple linear regression in R language model.frame.default

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor factor(O) has new levels
Make sure the test and train data sets have the same factor levels when you create them. In short, every factor level that occurs in test must also occur in train. Otherwise, no amount of checking the data or rewriting the code will help.

It would seem test and train datasets have a different set of factor levels.
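A quick way to find the offending levels before calling predict() is to compare the two sets with setdiff(). The data frames train and test below are invented for illustration, and dropping the unseen rows is only one possible way to handle them.

```r
train <- data.frame(y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
                    g = factor(c("a", "b", "a", "b", "a", "b")))
test <- data.frame(g = factor(c("a", "b", "c")))   # "c" never occurs in train

fit <- lm(y ~ g, data = train)

# Levels present in test but unknown to the fitted model
new_levels <- setdiff(levels(test$g), levels(train$g))
print(new_levels)

# predict() fails while those rows are present
err <- tryCatch(predict(fit, newdata = test),
                error = function(e) conditionMessage(e))

# Dropping the unseen rows (and the unused level) lets it run
test_ok <- droplevels(test[!(test$g %in% new_levels), , drop = FALSE])
pred <- predict(fit, newdata = test_ok)
print(pred)
```

Whether to drop such rows, merge rare levels, or re-split the data so that train covers every level depends on the analysis.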

R language error messages and related solutions

Original: Huang Xiaoxian
Error: object of type 'closure' is not subsettable
The object cannot be subset. Check whether the object is empty; sometimes the file path or name is wrong and the data was not imported successfully, in which case the name may still refer to a built-in function.
Error: Remove duplicates before running TSNE
The data may contain duplicate rows. The Rtsne package has a parameter for this, check_duplicates = FALSE.
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
The imported data contains columns that are not numeric, which may be factors. This can be checked with sapply(x, class).
Error: (list) object cannot be coerced to type 'double'
as.matrix cannot convert columns of type factor to numeric. Use lapply(x, as.numeric) to convert them; note that for a factor this returns the internal level codes, so use as.numeric(as.character(f)) when the labels themselves are numbers.
Error in install.packages: cannot remove prior installation of package 'digest'
Delete the corresponding package directly from the folder where R packages are stored.
Error in df$type: $ operator is invalid for atomic vectors
The $method element
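For the factor-related errors in this list, the following sketch shows the difference between a factor's internal codes and its labels, and how sapply(x, class) reveals the non-numeric column. All data here is made up.

```r
f <- factor(c("10", "20", "10", "30"))

codes  <- as.numeric(f)                 # internal level codes
values <- as.numeric(as.character(f))   # the labels converted to numbers
print(codes)
print(values)

df <- data.frame(a = f, b = c(1.5, 2.5, 3.5, 4.5))
print(sapply(df, class))                # shows which columns are factors

# After converting the factor column, numeric summaries work
df$a <- as.numeric(as.character(df$a))
print(colMeans(df))
```

Converting through as.character() first is what keeps the original numbers rather than their positions among the levels.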

Error in file(file, ifelse(append, "a", "w")) :

Use of rep function in R

The official help document reads as follows:
Usage

rep(x, ...)

rep.int(x, times)

rep_len(x, length.out)

Arguments

x a vector (of any mode including a list) or a factor or (for rep only) a POSIXct or POSIXlt or Date object; or an S4 object containing such an object.
... further arguments to be passed to or from other methods. For the internal default method these can include:

times

An integer vector giving the (non-negative) number of times to repeat each element if of length length(x), or to repeat the whole vector if of length 1. Negative or NA values are an error.

length.out

Non-negative integer. The desired length of the output vector. Other inputs will be coerced to an integer vector and the first element taken. Ignored if NA or invalid.

each

Non-negative integer. Each element of x is repeated each times. Other inputs will be coerced to an integer vector and the first element taken. Treated as 1 if NA or invalid.

times see ... .
length.out non-negative integer: the desired length of the output vector.

rep takes four parameters. x is the vector or vector-like object to repeat. each is the number of times every element of x is repeated. times is applied to the vector produced by each: if times is a single value, the whole vector is repeated that many times; if it is a vector of the same length as the vector produced by each, every element is repeated the number of times given at the same position; otherwise an error is raised. length.out is the length of the final output: the vector is cut to that length or, if it is too short, extended by recycling. In other words, rep applies each to produce a vector x1, times operates on x1 to produce x2, and length.out operates on x2 to produce the final output x3. Here are some examples:
> rep(1:4,times=c(1,2,3,4)) # times has the same length as x
[1] 1 2 2 3 3 3 4 4 4 4
> rep(1:4,times=c(1,2,3)) # lengths differ: error
Error in rep(1:4,times=c(1,2,3)) : invalid 'times' argument
> rep(1:4,each=2,times=c(1,2,3,4)) # lengths still differ, because the vector after each has 8 elements, not 4
Error in rep(1:4,each=2,times=c(1,2,3,4)) :
invalid 'times' argument
> rep(1:4,times=c(1,2,3,4)) # equal-length mode, I wrote it twice o (╯/╰) o
[1] 1 2 2 3 3 3 4 4 4 4
> rep(1:4,times=c(1,2,3,4),each=3) # a repeat of the earlier example, don't beat me
Error in rep(1:4,times=c(1,2,3,4),each=3) :
invalid 'times' argument
> rep(1:4,each=2,times=1:8) # correct: times is a vector of length 8
[1] 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
> rep(1:4,each=2,times=1:8,len=3) # use of len; note the recycling
[1] 1 1 2
> rep(1:4,each=2,times=3) # each first, then times
[1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4
That's all for the rep function!
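The usage section also lists rep.int and rep_len, which the examples above do not exercise. A few extra lines, with made-up result names r1 to r4, show them next to rep:

```r
r1 <- rep(1:4, each = 2, times = 3)        # each first, then the whole vector 3 times
r2 <- rep(1:4, each = 2, length.out = 3)   # each first, then cut to length 3
r3 <- rep_len(1:3, length.out = 7)         # recycle x until the length is 7
r4 <- rep.int(1:2, times = 2)              # repeat the whole vector twice

print(r1)
print(r2)
print(r3)
print(r4)
```

rep.int and rep_len are simplified, faster versions of rep for these two common cases.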

Reproduced in: https://www.cnblogs.com/business-analysis/p/3414997.html

R language: loading xlsx fails with error: JAVA_HOME cannot be determined from the Registry

onLoad failed for 'xlsx':
loadNamespace() failed for 'rJava'; onLoad failed,
call: fun(libname, pkgname)
error: JAVA_HOME cannot be determined from the Registry
This is because no Java environment is installed on my computer, so I need to download and install the JDK. The official download page is https://www.oracle.com/java/technologies/javase-downloads.html, but it requires registering an account and the download is slow. A 64-bit download is also available on Baidu Netdisk: https://pan.baidu.com/s/1WQvo3UyVPtBypGcCLLCLzQ (extraction code: wg5o)
The download includes instructions for installing the JDK.
After installing the JDK, find where R installs its packages, usually the library folder under R-4.0.2, and delete the packages that previously failed to install. Here I deleted rJava, xlsxjars, and xlsx.

After deleting them, go back to the RStudio window and install the packages in the order rJava, xlsxjars, xlsx. There are two ways to install R packages; either works.
Method 1: run the script
install.packages("rJava")
install.packages("xlsxjars")
install.packages("xlsx")
Method 2:


After installation, library(xlsx) loads without errors!
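Before reinstalling, it can help to check from within R whether a JAVA_HOME environment variable is set and which of the three packages are already present. This is only a diagnostic sketch; it changes nothing on the system, and an unset JAVA_HOME does not by itself mean that Java is missing.

```r
# JAVA_HOME as seen by the current R session (NA if unset)
java_home <- Sys.getenv("JAVA_HOME", unset = NA)
if (is.na(java_home) || !nzchar(java_home)) {
  message("JAVA_HOME is not set in this session")
} else {
  message("JAVA_HOME = ", java_home)
}

# Which of the packages are currently installed
pkgs <- c("rJava", "xlsxjars", "xlsx")
installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
print(installed)
```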

R language packages installation failed: Error in install.packages: error reading from connection

Downloading and installing the "sp" package failed:

Solution:
1. If the download fails, switch RStudio's default download mirror to a domestic (Chinese) mirror;


2. There are several domestic mirrors; if one does not work, try others. I chose the Peking University mirror;
3. Enter install.packages("sp") and wait for the installation to finish.
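The mirror can also be set from code instead of through the RStudio menus. The sketch below uses the Tsinghua University CRAN mirror URL as an example; which mirror works best is an assumption that depends on your location.

```r
# Example mirror; substitute any CRAN mirror that is reachable from your network
mirror <- "https://mirrors.tuna.tsinghua.edu.cn/CRAN/"
options(repos = c(CRAN = mirror))
print(getOption("repos"))

# With the mirror set, the install can be retried:
# install.packages("sp")
```

The setting lasts for the current session only; putting the options() line in .Rprofile makes it permanent.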

13. R language: Error in match.names(clabs, names(xi)): names do not match previous names

Problem description

Count_bind = rbind(count_left,count_right)
Error in match.names(clabs, names(xi)) : names do not match previous names

Finding the cause
This problem arises in the match.names function called during rbind, for the simple reason that the two objects I wanted to rbind have different column names:


Solving the problem
Manually changing the column names solves the problem:

colnames(count_left) <- c("AAA")

Run the statement above on each of the two objects so that their column names match.
A small rant
I've never really understood why data.frame has a row.names argument for setting row names at declaration but no matching argument for column names (columns take their names from the argument names instead, as in data.frame(AAA = 1:3)). Below is the official usage, which only has row.names.

data.frame(..., row.names = NULL, check.rows = FALSE,
check.names = TRUE, fix.empty.names = TRUE,
stringsAsFactors = default.stringsAsFactors())

As a result, I often have to name the columns of a data frame by hand. The code I write usually handles genetic data, where this helps guarantee that the column names are accurate, but it is a nuisance when writing small scripts for statistics.
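For what it is worth, here are two ways of naming columns while a data frame is being created, using invented data: through the argument names, or by wrapping the call in setNames() when the names are held in a character vector.

```r
# Column names come from the argument names; row names from row.names
df <- data.frame(AAA = c(3, 1, 2), gene = c("g1", "g2", "g3"),
                 row.names = c("r1", "r2", "r3"))
print(df)

# When the names are stored in a vector, setNames() applies them in one step
col_names <- c("left", "right")
df2 <- setNames(data.frame(1:3, 4:6), col_names)
print(df2)
```

Either form avoids a separate colnames(x) <- ... statement after the data frame is built.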
Comments and discussion are welcome.

Installation and use of R language ggmap package

ggmap is a map-drawing package that calls the Google Maps API through the get_map function.
1 The first attempt went wrong
At first I used get_map directly in R to fetch the map data, but an error occurred: the interface the function calls refused our request (HTTP error code 403).

library(ggmap)
map=get_map(location='San Fransico',maptype='roadmap',zoom=12)

URL: http://maps.googleapis.com/maps/api/staticmap?center=San+Fransico&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
2 What a Baidu search turned up
Since July 16, 2018, Google has limited the number of API requests, charges for usage beyond the limit, and requires every project to use an official API key. Without an API key, map quality may be degraded or the map may not work at all. The API key must be linked to a credit card, and Google charges the card once the limit is exceeded. Google gives users a free credit of $200 per month. It also cut the number of free requests sharply, from 25,000 per day to 28,000 per month, or about 1,000 per day, roughly 1/25 of the former quota. Users who have not set up a billing account can access the interface only once a day.
3 Solving the problem
So the error occurs because the request URL lacks the key parameter. Apply for a key on the Google Maps developer platform:
https://developers.google.cn/maps/documentation/
We also found that a credit card has to be linked, that is, a billing account for the Google Maps API. After completing the application steps, the request URL showed that the function needs the staticmap interface, so we enabled it and obtained a key for it. However, the ggmap package we had just installed in R offered no way to add the key parameter to the URL. It turned out that someone had updated the ggmap package on GitHub. So remove the previously installed ggmap package and install the latest version from GitHub by entering the following in R:

if(!requireNamespace("devtools")) install.packages("devtools")
devtools::install_github("dkahle/ggmap", ref = "tidyup")
library(ggmap)
register_google(key="your google map API key")

Pass your key to register_google(), then call get_map() to fetch the map data. If you get an error saying that register_google cannot be found, the ggmap package from GitHub was not installed successfully.
https://stackoverflow.com/questions/53275443/unable-to-use-register-google-in-r
When I then called get_map(), the HTTP request failed with REQUEST_DENIED. According to StackOverflow, the Geocoding API must also be enabled on Google Maps. Once it is enabled, calling get_map should work.
https://stackoverflow.com/questions/52565472/get-map-not-passing-the-api-key/52617929#52617929
Appendix:
1. ggmap package address: https://github.com/dkahle/ggmap
2. As an alternative to Google Maps, OpenStreetMap is also worth a look:
https://www.openstreetmap.org/user/jbelien/diary/44356

R mac X11 library is missing: install XQuartz from xquartz.macosforge.org

Today, when I ran fix(), the following error occurred:

> fix(Carseats)
Error in check_for_XQuartz() : 
  X11 library is missing: install XQuartz from xquartz.macosforge.org

On a Mac the fix is simple: install XQuartz with Homebrew (current Homebrew versions use brew install --cask; older versions used brew cask install):

brew install --cask xquartz

Finally, restart so that the installation takes effect.
Reference
[1] ggplot2 sourcing error: X11 library is missing. https://stackoverflow.com/questions/28984243/ggplot2-sourcing-error-x11-library-is-missing