A very large matrix, with 320,201 rows and 8189 columns, would require 9.8GB if stored with a normal matrix of all zeros
cols <- 8189
rows <- 320127
mat <- matrix(data = 0, nrow=320127, ncol = 8189)
print(object.size(mat), unit="GB")
# 19.5 Gb
mat <- matrix(data = 0L, nrow=320127, ncol = 8189)
print(object.size(mat), unit="GB")
# 9.8 GbThe 0 here is actually to be distinguished
Here, 0L
means that the data type is integer
, which by default is numeric
. The biggest difference between the two is that when you use 320127L * 8189L
, you get a NA, whereas 320127 * 8189
does not
If you save it as a sparse matrix
mat <- Matrix(data = 0L, nrow=320127, ncol = 8189, sparse = TRUE)
print(object.size(mat), unit="GB")
#0 Gb
dim(mat)
#[1] 320127 8189
Although the number of rows and columns is the same, the sparse matrix occupies almost no memory. And the operations that ordinary matrices support, such as row sum, column sum and extraction of elements, are also possible in sparse matrices, but take a little more time. At the same time, there are many R packages that support sparse matrix, such as glmnet
, an R package that does lasso regression.
Although sparse matrices look nice, parts of a sparse matrix that big in R can go wrong
> mat2 <- mat + 1
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Even if I wanted to convert it back to the normal matrix with as. Matrix
, it would still give me an error
> mat3 <- Matrix::as.matrix(mat)
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Since the ready-made as. Matrix
cannot be processed, what can be done?The simplest and roughest method is to create a new ordinary matrix, then traverse the sparse matrix and put the values of the sparse matrix back to the ordinary matrix one by one.
mat2 <- matrix(data = 0, nrow=320127, ncol = 8189)
for (i in seq_len(nrow(mat))){
for (j in seq_len(ncol(mat))){
mat2[i][j] <- mat[i][j]
}
}
So how long does it take?My computer didn’t run for two hours anyway, so don’t test it.
Is there any way to speed it up?The way to speed up is to reduce the number of for loops, because we are a sparse matrix and most of the space is 0, we only need to assign the parts that are not 0 to the new matrix.
This requires us to understand the data structure of sparse matrices
> str(mat)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int(0)
..@ p : int [1:8190] 0 0 0 0 0 0 0 0 0 0 ...
..@ Dim : int [1:2] 320127 8189
..@ Dimnames:List of 2
.. ..$ : NULL
.. ..$ : NULL
..@ x : num(0)
..@ factors : list()
@dim </code b> records the dimension information of the matrix,
@dimnames </code b> records the row and column names,
@x
records non-0 values. @i
that doesn't record a 0 row index corresponds to @x
, all of which are 0, so it's not recorded. @p
is more complex, it is not a simple record of non-0 value column index, look at the document also do not know what is, but through searching can find its conversion relationship with non-0 value column index.
Therefore, the code is optimized as
row_pos <- mat@i+1
col_pos <- findInterval(seq(mat@x)-1,mat@p[-1])+1
val <- mat@x
for (i in seq_along(val)){
tmp[row_pos[i],col_pos[i]] <- val[i]
}
You can encapsulate it as a function
as_matrix <- function(mat){
tmp <- matrix(data=0L, nrow = mat@Dim[1], ncol = mat@Dim[2])
row_pos <- mat@i+1
col_pos <- findInterval(seq(mat@x)-1,mat@p[-1])+1
val <- mat@x
for (i in seq_along(val)){
tmp[row_pos[i],col_pos[i]] <- val[i]
}
row.names(tmp) <- mat@Dimnames[[1]]
colnames(tmp) <- mat@Dimnames[[2]]
return(tmp)
}
If you also need to improve, so may need to play Rcpp. I wrote a simple reference to http://adv-r.had.co.nz/Rcpp.html code, can come to my blog http://xuzhougeng.top/archives/R-Sparse-Matrix-Note to continue reading, can buy to continue reading in this paper.