A very large matrix, with 320,127 rows and 8,189 columns, would require about 19.5 GB if stored as an ordinary all-zero matrix of the default numeric type, or 9.8 GB if stored as integers:
cols <- 8189
rows <- 320127
mat <- matrix(data = 0, nrow=320127, ncol = 8189)
print(object.size(mat), unit="GB")
# 19.5 Gb
mat <- matrix(data = 0L, nrow=320127, ncol = 8189)
print(object.size(mat), unit="GB")
# 9.8 Gb
The two zeros here need to be distinguished. 0L means the data type is integer, whereas a plain 0 is numeric (double) by default. The most visible difference between the two is that 320127L * 8189L overflows the integer range and returns NA (with a warning), whereas 320127 * 8189 does not.
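The sizes and the overflow mentioned above can be checked with plain arithmetic, without allocating anything. This is a minimal sketch that assumes 8 bytes per double, 4 bytes per integer, and the 1024-based "Gb" unit that object.size prints; the variable names are only illustrative.

```r
rows <- 320127
cols <- 8189

# Number of cells, computed in double arithmetic so it cannot overflow
n_cells <- rows * cols

# Approximate payload size in GB (1024^3 bytes), ignoring the small object header
size_double_gb  <- n_cells * 8 / 1024^3
size_integer_gb <- n_cells * 4 / 1024^3
print(round(c(double = size_double_gb, integer = size_integer_gb), 1))

# The same product in integer arithmetic exceeds the integer range and becomes NA
overflowed <- suppressWarnings(320127L * 8189L)
print(is.na(overflowed))
print(n_cells > .Machine$integer.max)
```

The cell count is larger than .Machine$integer.max, which is why the integer product cannot be represented.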
If you store it as a sparse matrix from the Matrix package instead:
library(Matrix)
mat <- Matrix(data = 0L, nrow=320127, ncol = 8189, sparse = TRUE)
print(object.size(mat), unit="GB")
# 0 Gb
dim(mat)
#[1] 320127 8189
Although the dimensions are the same, the sparse matrix occupies almost no memory. The operations that ordinary matrices support, such as row sums, column sums and element extraction, also work on sparse matrices, though they take a little more time. Many R packages also accept sparse matrices, for example glmnet, an R package for lasso regression.
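To make the point about supported operations concrete, here is a small sketch on a hypothetical 4 x 3 sparse matrix (the values are made up for illustration). It assumes the Matrix package is installed and uses its sparseMatrix constructor.

```r
library(Matrix)

# A tiny sparse matrix with three non-zero entries (illustrative values)
sm <- sparseMatrix(i = c(1, 3, 4), j = c(1, 2, 2), x = c(5, 2, 7),
                   dims = c(4, 3))

# Row sums, column sums and element extraction work as on an ordinary matrix
rs <- rowSums(sm)
cs <- colSums(sm)
el <- sm[3, 2]

print(rs)  # 5 0 2 7
print(cs)  # 5 9 0
print(el)  # 2
```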
Sparse matrices look attractive, but some operations on a sparse matrix this large fail in R:
> mat2 <- mat + 1
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Even converting it back to an ordinary matrix with as.matrix gives the same error:
> mat3 <- Matrix::as.matrix(mat)
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Since the ready-made as.matrix cannot handle it, what can be done? The simplest and crudest method is to create a new ordinary matrix, then walk through the sparse matrix and copy its values into the ordinary matrix one by one.
mat2 <- matrix(data = 0, nrow=320127, ncol = 8189)
for (i in seq_len(nrow(mat))){
  for (j in seq_len(ncol(mat))){
    # index with [i, j]; mat[i][j] does not address a single cell
    mat2[i, j] <- mat[i, j]
  }
}
So how long does it take? On my computer it had not finished after two hours, so don't bother testing it.
Is there a way to speed it up? The way to speed it up is to reduce the number of loop iterations: because the matrix is sparse and most entries are 0, we only need to assign the non-zero entries to the new matrix.
This requires understanding how a sparse matrix is stored:
> str(mat)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int(0)
..@ p : int [1:8190] 0 0 0 0 0 0 0 0 0 0 ...
..@ Dim : int [1:2] 320127 8189
..@ Dimnames:List of 2
.. ..$ : NULL
.. ..$ : NULL
..@ x : num(0)
..@ factors : list()
@Dim records the dimensions of the matrix, @Dimnames records the row and column names, and @x records the non-zero values. @i records the 0-based row index of each value in @x; here every entry is 0, so nothing is recorded. @p is more complex: it is not a simple record of the column index of each non-zero value, and the documentation does not make its meaning obvious, but a search turns up how to convert it into column indices. In short, @p has one more element than the matrix has columns, and each element holds the number of non-zero values stored before that column starts.
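The slots are easier to read on a matrix that actually contains values. The following sketch builds a hypothetical 4 x 3 dgCMatrix (illustrative values, not from the article) and shows one way to turn @p into column indices, assuming the compressed column layout described above.

```r
library(Matrix)

# Non-zero entries: (1,1)=5, (3,2)=2, (4,2)=7; column 3 is empty
sm <- sparseMatrix(i = c(1, 3, 4), j = c(1, 2, 2), x = c(5, 2, 7),
                   dims = c(4, 3))

print(sm@x)  # non-zero values, stored column by column: 5 2 7
print(sm@i)  # 0-based row index of each value: 0 2 3
print(sm@p)  # running count of values before each column: 0 1 3 3

# diff(@p) is the number of non-zero values in each column,
# so repeating each column number that many times gives the column indices
per_col <- diff(sm@p)
col_pos <- rep(seq_len(ncol(sm)), per_col)
row_pos <- sm@i + 1L

# The findInterval form yields the same column indices
col_pos_fi <- findInterval(seq_along(sm@x) - 1, sm@p[-1]) + 1

print(cbind(row = row_pos, col = col_pos, value = sm@x))
```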
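Therefore, the code is optimized as
tmp <- matrix(data=0L, nrow = mat@Dim[1], ncol = mat@Dim[2])
# @i is 0-based, so add 1 to get R row indices
row_pos <- mat@i+1
# map each non-zero value to its column via the column pointers in @p
col_pos <- findInterval(seq_along(mat@x)-1,mat@p[-1])+1
val <- mat@x
for (i in seq_along(val)){
  tmp[row_pos[i],col_pos[i]] <- val[i]
}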
You can wrap it in a function:
as_matrix <- function(mat){
  tmp <- matrix(data=0L, nrow = mat@Dim[1], ncol = mat@Dim[2])
  # @i is 0-based, so add 1 to get R row indices
  row_pos <- mat@i+1
  # seq_along is safe even when @x has length 0 or 1
  col_pos <- findInterval(seq_along(mat@x)-1,mat@p[-1])+1
  val <- mat@x
  for (i in seq_along(val)){
    tmp[row_pos[i],col_pos[i]] <- val[i]
  }
  rownames(tmp) <- mat@Dimnames[[1]]
  colnames(tmp) <- mat@Dimnames[[2]]
  return(tmp)
}
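Before reaching for compiled code, the remaining loop can also be replaced by a single assignment with a two-column index matrix. The sketch below is a variant, not the author's function: as_matrix_vec is a hypothetical name, and it has only been checked against as.matrix on a tiny example; how it behaves at the full 320127 x 8189 size is an assumption not tested here.

```r
library(Matrix)

# Hypothetical vectorised variant of the conversion function
as_matrix_vec <- function(mat) {
  tmp <- matrix(0, nrow = mat@Dim[1], ncol = mat@Dim[2])
  row_pos <- mat@i + 1L                              # 0-based to 1-based rows
  col_pos <- rep(seq_len(mat@Dim[2]), diff(mat@p))   # column of each value
  # Each row of the two-column index matrix addresses one cell of tmp
  tmp[cbind(row_pos, col_pos)] <- mat@x
  dimnames(tmp) <- mat@Dimnames
  tmp
}

# Small check against the built-in conversion (illustrative values)
sm <- sparseMatrix(i = c(1, 3, 4), j = c(1, 2, 2), x = c(5, 2, 7),
                   dims = c(4, 3))
dense <- as_matrix_vec(sm)
print(dense)
print(all(dense == as.matrix(sm)))
```

The result is always a double matrix because @x in a dgCMatrix holds doubles.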
If that is still not fast enough, you will probably need Rcpp. I wrote a simple version with reference to http://adv-r.had.co.nz/Rcpp.html; the rest of the article continues on my blog at http://xuzhougeng.top/archives/R-Sparse-Matrix-Note.