Tag Archives: r language

R language notes – sample() function

Field studies and sample selection in medical statistics and epidemiology constantly come back to one term: random sampling. Random sampling is an important way to keep comparison groups balanced. So the first function introduced today is sample(), the function for sampling:

> x=1:10
> sample(x=x)

 [1]  3  5  9  6 10  7  2  1  8  4

The first line assigns the integers 1 to 10 to the vector x, and the second line draws a random sample from x. The output lists the elements in the order they were drawn. As you can see, the sampling is without replacement, so at most n elements can be drawn, where n is the number of elements in x.

If you want to specify how many elements are drawn from the vector, add the size parameter:

> x=1:1000
> sample(x=x,size=20)

 [1]  66 891 606 924 871 374 879 573 284 305 914 792 398 497 721 897 324 437
[19] 901  33

This samples from the positive integers 1 to 1000, where size specifies how many elements are drawn (20 here); the result is shown above.
Both examples so far sample without replacement, meaning that once an element is selected it is no longer available in the population. To sample with replacement, add the parameter replace=T:

> x=1:10
> sample(x=x,size=5,replace=T)

[1] 4 7 2 4 8

Here replace means that a drawn element is put back, so the same element can be drawn repeatedly; this is what is called sampling with replacement. In the result above, element 4 was selected twice in the course of 5 random draws.


R matches arguments by position (my wording may not be the professional term, but the idea is this): if the values we type appear in the same order as the parameters in the function definition, we can leave out the parameter names, for example:

> x=1:10
> sample(x,20,T)

 [1] 1 2 2 1 5 5 5 9 9 5 2 9 8 3 4 8 8 8 1 1

In the code above we omitted the parameter names x, size and replace, yet the call is still evaluated: it draws from the vector x 20 times with replacement. The reason I try to write the parameter names every time is that I think it is a good habit and it reads clearly. Besides, if you are not familiar with the order of a function's arguments, you will get the wrong result whenever the positions do not match. And many functions have too many arguments to remember where each one goes. If the names are given, the call works even when the positions do not correspond:

> x=1:10
> sample(size=20,replace=T,x=x)

 [1]  4  9  2  6  4  5  4  7 10  5  2  2  3  4  2  4  6  8  7  8

The advantage is obvious: the call is clear, and the order no longer matters. We can also see that with replacement the size can be as large as you like, whereas without replacement the size cannot exceed the size of the population.
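The size limit is easy to check directly. The following is a minimal sketch; the seed and the variable names are my own choices for illustration, not part of the original note:

```r
set.seed(1)  # fixed seed, assumed here only so the run is repeatable
x <- 1:10

# With replacement, size may exceed length(x).
with_replacement <- sample(x = x, size = 20, replace = TRUE)
length(with_replacement)

# Without replacement, asking for more than length(x) elements is an error;
# tryCatch() captures the message instead of stopping the script.
result <- tryCatch(
  sample(x = x, size = 20, replace = FALSE),
  error = function(e) conditionMessage(e)
)
print(result)
```

Wrapping the call in tryCatch() is just a convenience so both cases can be run in one script.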

Rolling a die or flipping a coin (probably a required introduction to sampling) is sampling with replacement.
Note that for the sample function, the parameter x can hold either numbers or characters. In fact, x can be any vector:

> a=c("A","B")
> sample(x=a,size=10,replace=T)

 [1] "B" "A" "A" "A" "B" "A" "A" "B" "A" "A"

The code above can be read as 10 flips of a coin, showing how often heads (A) and tails (B) occur.

In the sampling described above, every element is drawn with equal probability; this is equal-probability random sampling.
Sometimes the elements do not have equal probabilities of being drawn (for example, in common binomial-distribution problems). In that case we add the parameter prob, short for “probability”. Suppose a doctor's operation has an 80% chance of success; if he operates on 20 patients today, how many of the operations will succeed? The code is as follows:

> x=c("S","F")
> sample(x,size=20,replace=T,prob=c(0.8,0.2))

 [1] "F" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "F" "S" "S" "F" "S" "S"
[19] "F" "S"

Where “S” stands for success and “F” for failure.
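To answer the "how many succeed" question, the outcomes still have to be counted. A minimal sketch, assuming a fixed seed for repeatability; the rbinom() line is my own addition, showing that base R can also draw the number of successes directly:

```r
set.seed(42)  # assumed seed, only for repeatability

# Simulate 20 operations, each succeeding with probability 0.8.
outcomes <- sample(x = c("S", "F"), size = 20, replace = TRUE, prob = c(0.8, 0.2))

# Count the successes among the 20 simulated operations.
successes <- sum(outcomes == "S")
successes
table(outcomes)

# The same kind of quantity drawn in one step from a binomial distribution.
rbinom(n = 1, size = 20, prob = 0.8)
```

Each run is one simulated day, so the count varies around 16 rather than being exactly 16.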

> x=c(1,3,5,7)
> sample(x,size=20,replace=T,prob=c(0.1,0.2,0.3,0.9))

 [1] 3 5 7 3 7 3 7 5 3 7 7 7 1 5 7 5 7 7 3 7

These examples show that each element can be given its own probability. The values in the prob parameter do not have to add up to 1: they are treated as weights, and the chance of an element being drawn is its weight divided by the sum of all the weights.
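One way to see how prob values that do not sum to 1 behave is to draw many times and compare the observed frequencies with the weights rescaled to sum to 1. This is a rough sketch; the seed and the number of draws are arbitrary choices of mine:

```r
set.seed(123)  # assumed seed, only for repeatability
x <- c(1, 3, 5, 7)
weights <- c(0.1, 0.2, 0.3, 0.9)  # sums to 1.5, not 1
n_draws <- 100000

draws <- sample(x, size = n_draws, replace = TRUE, prob = weights)

# Observed share of each element, in the same order as x.
observed <- as.numeric(table(factor(draws, levels = x))) / n_draws

# Weights rescaled so that they sum to 1.
expected <- weights / sum(weights)

round(rbind(observed, expected), 3)
```

The two rows should agree closely, which suggests that the weights are rescaled rather than used as independent probabilities.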

For the sample function, the parameter x can be any vector in R (such as the character vector above). A closely related function is sample.int, where “int” is short for “integer”. Its argument n must be a single positive integer:

> x=-10.5:7.5
> sample(x=x,size=3);sample.int(n=x,size=3)

[1] -5.5 -7.5  0.5
Error in sample.int(x, size = 3) : invalid first argument

The first line of code generates a sequence from -10.5 to 7.5 in steps of 1. The first line of output is the result of sample. The second is the error from sample.int, “invalid first argument”, because n was given a vector of non-integer values rather than a single positive integer. The rest of the usage is the same as sample.
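For comparison, here is a call to sample.int that does work. This is only a sketch: it passes the length of x as n and then uses the drawn integers as positions into x, which is my own illustration rather than something from the original note:

```r
set.seed(7)  # assumed seed, only for repeatability
x <- -10.5:7.5

# n is a single positive integer: the number of positions available in x.
picked_positions <- sample.int(n = length(x), size = 3)

# Use the sampled positions to pull the actual values out of x.
picked_values <- x[picked_positions]

picked_positions
picked_values
```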

Source: http://www.wtoutiao.com/p/186VWin.html

R language: na.fail and na.omit

In practice a data set is rarely complete; in many cases the sample contains several missing values (NA), which cause trouble in data analysis and mining.
R can handle the missing values in a sample with na.fail and na.omit.

    na.fail(<vector a>): if vector a contains at least one NA, an error is returned; if it contains no NA, the original vector a is returned
    na.omit(<vector a>): returns vector a with the NAs removed
    attr(na.omit(<vector a>), "na.action"): returns the indices of the NAs in vector a
    is.na: determines whether each element of the vector is NA

Example:

data <- c(1, 2, NA, 2, 4, 2, 10, NA, 9)
data.na.omit <- na.omit(data)
data.na.omit
[1]  1  2  2  4  2 10  9
attr(,"na.action")
[1] 3 8
attr(,"class")
[1] "omit"

attr(data.na.omit, "na.action")
[1] 3 8
attr(,"class")
[1] "omit"
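The example above only exercises na.omit. A small sketch of na.fail on the same kind of vector follows; tryCatch() is used here (my choice) so the error message can be inspected without stopping the script:

```r
data <- c(1, 2, NA, 2, 4, 2, 10, NA, 9)

# na.fail() stops with an error because the vector contains NA.
msg <- tryCatch(na.fail(data), error = function(e) conditionMessage(e))
msg

# After removing the NAs, na.fail() simply returns its input.
cleaned <- na.omit(data)
same <- na.fail(as.vector(cleaned))  # as.vector() drops the na.action attribute
same

# Positions of the removed NAs, as plain integers.
na_positions <- as.integer(attr(cleaned, "na.action"))
na_positions
```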

You can also drop the NAs conveniently with the !x form. For example:

a <- c(1, 2, 3, NA, NA, 2, NA, 5)
a[!is.na(a)]
[1] 1 2 3 2 5

Here is.na tests whether each element of the vector is NA and returns c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE): an element of a that is NA gives TRUE at the corresponding position, and otherwise FALSE. !x is the logical NOT operator, so !is.na(a) is TRUE where the element of a is not NA and FALSE otherwise. Indexing with a[!is.na(a)] therefore picks out the elements of a that are not NA.
The functions na.fail and na.omit can be applied not only to vectors but also to matrices and data frames.
Example:

data <- read.table(text="
a b c d e f
NA 1 1 1 1 1
1 NA 1 1 1 1
1 1 NA 1 1 1
1 1 1 NA 1 1
1 1 1 1 NA 1
1 1 1 1 1 NA",header=T)
data <- na.omit(data)
data
[1] a b c d e f
<0 rows> (or 0-length row.names)
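In the example above every row has one NA, so nothing survives. A less extreme sketch, using a small made-up data frame of my own, shows which rows are kept and how complete.cases() reports the same thing as a logical vector:

```r
# Hypothetical data frame: rows 2 and 3 each contain one NA.
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z"),
  stringsAsFactors = FALSE
)

# TRUE for rows without any NA.
complete_rows <- complete.cases(df)
complete_rows

# na.omit() drops every row that has at least one NA.
kept <- na.omit(df)
kept

# Row positions that were dropped.
dropped <- as.integer(attr(kept, "na.action"))
dropped
```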

— — — — — — — — — — — — — — — — — — — — —
Author: SThranduil
Source: CSDN
Original: https://blog.csdn.net/SThranduil/article/details/71710283
Copyright notice: this is the blogger's original article; please include a link to the blog post when reposting.

No such file or directory

 
R reports “No such file or directory” when reading a file

Recently, while reading a file, I ran into the following problem:
> passenger = read.csv('international-airline-passengers.csv', sep = ',')
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'international-airline-passengers.csv': No such file or directory

R can’t find the file.
 
File paths are either absolute or relative. Absolute paths are cumbersome, so relative paths are normally used; a relative path is resolved against the current working directory.

The getwd() function returns the current working directory.

Solution 1
Set the directory that contains international-airline-passengers.csv as the working directory (with setwd()).

Solution 2
Copy the file international-airline-passengers.csv to the current working directory.
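Solution 1 can be sketched as follows. To keep the example self-contained it writes a tiny stand-in CSV into a temporary directory; the directory and the file contents are placeholders of mine, not the real data set:

```r
csv_name <- "international-airline-passengers.csv"

# Hypothetical data directory and made-up contents, for illustration only.
data_dir <- tempdir()
writeLines(c("month,passengers", "m1,1"), file.path(data_dir, csv_name))

getwd()                      # where R currently resolves relative paths
old_wd <- setwd(data_dir)    # setwd() returns the previous working directory

found <- file.exists(csv_name)   # check before reading
passenger <- read.csv(csv_name)

setwd(old_wd)                # restore the original working directory
```

Checking file.exists() first gives a plain TRUE or FALSE instead of the connection error.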

Error in .Call.graphics(C_palette2, .Call(C_palette2, NULL)) : invalid graphics state

I believe my dataframe is okay and my code is okay. In fact, I have eliminated parts of the dataframe and most of the graphing code to make things as basic as possible. But still, I get:

Error in .Call.graphics(C_palette2, .Call(C_palette2, NULL)) : 
  invalid graphics state

What is wrong here? Here is the data:

 date   trt var val
1/8/2008    cc  sw5 0.2684138
1/8/2008    cc  sw15    0.2897586
1/8/2008    cc  sw5 0.2822414
2/8/2008    cc  sw5 0.2494583
2/8/2008    cc  sw5 0.2692917
2/8/2008    cc  sw15    0.2619167
2/8/2008    cc  sw5 0.204375
3/8/2008    cc  sw5 0.2430625
3/8/2008    cc  sw5 0.2654375
3/8/2008    cc  sw5 0.2509583
3/8/2008    cc  sw5 0.2055625
1/8/2008    ccw sw15    0.2212414
1/8/2008    ccw sw5 0.3613448
1/8/2008    ccw sw5 0.2607586
2/8/2008    ccw sw5 0.2087917
2/8/2008    ccw sw15    0.3390417
2/8/2008    ccw sw5 0.2436458
2/8/2008    ccw sw5 0.290875
3/8/2008    ccw sw5 0.20175
3/8/2008    ccw sw15    0.328875
3/8/2008    ccw sw5 0.2328958
3/8/2008    ccw sw5 0.2868958

When I work with this data, I specify dates like this:

df<-df[order(as.Date(df$date,format="%d/%m/%Y")),,drop=FALSE]

and here I want to make a scatterplot:

ggplot(data = df,aes(x = date,y = val)) + geom_point(aes(group = trt))


I ran into this same error and solved it by running:

dev.off()

and then running the plot again. I think the graphics device was messed up earlier somehow by exporting some graphics and it didn’t get reset. This worked for me and it’s simpler than reinstalling ggplot2.
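If several devices were left open, a single dev.off() may not be enough. A sketch that closes every open device (graphics.off() should do the same in one call); the two pdf(NULL) calls only simulate leftover devices and are not part of the fix:

```r
# Simulate two leftover graphics devices that write to no file.
pdf(NULL)
pdf(NULL)

# dev.list() returns NULL once no device is open.
while (!is.null(dev.list())) dev.off()

devices_left <- dev.list()
devices_left
```

After this, rerunning the plot opens a fresh device.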


R note for Bioinfo: undefined columns selected

Problem:

Input: Table(GPL number)[1:number of rows selected, c("ID", "other column attributes", ...)]
Error:
Error in [.data.frame… undefined columns selected
Solution:
Table(GPL number)[1:number of rows selected, c("ID", "other column attributes", ...)]
Compare the other column attributes here against the original database to check that they match, and change any that do not match so they agree with the column names in the original database. The error then disappears and the code runs correctly.
The specific command that solved the problem:

Table(GPL)[1:10, c("ID", "GB_ACC", "Gene Title", "Gene Symbol", "ENTREZ_GENE_ID")]
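Comparing the requested names with the real column names can be done in code rather than by eye. A minimal sketch, where annot is a made-up stand-in for the table returned by Table(GPL) and the values are placeholders:

```r
# Hypothetical annotation table standing in for Table(GPL).
annot <- data.frame(
  ID = c("p1", "p2"),
  GB_ACC = c("acc1", "acc2"),
  `Gene Symbol` = c("g1", "g2"),
  check.names = FALSE   # keep the space in "Gene Symbol"
)

wanted <- c("ID", "GB_ACC", "Gene Title", "Gene Symbol")

# Names that would trigger "undefined columns selected".
missing_cols <- setdiff(wanted, colnames(annot))
missing_cols

# Select only the columns that really exist.
present <- intersect(wanted, colnames(annot))
subset_ok <- annot[1:2, present]
subset_ok
```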

R language troubleshooting: Error in output$nodeID : $ operator is invalid for atomic vectors

Problem: viewing a column with the “$” operator fails with “Error in output$nodeID : $ operator is invalid for atomic vectors”.

output <- dat$score
output <- cbind(nodeID=dat$nodeID,score=output)
head(output$nodeID)
 Error in output$nodeID : $ operator is invalid for atomic vectors
# Check the class of output: it turns out to be a matrix.
class(output)
 [1] "matrix"
# "$" only works on lists and data frames; for a matrix, index with [,] instead.
head(output[,1])
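If you would rather keep using "$", another option is to convert the matrix to a data frame. A small sketch with a made-up dat (the values are placeholders):

```r
# Hypothetical input data frame.
dat <- data.frame(nodeID = c(101, 102, 103), score = c(0.5, 0.7, 0.9))

# cbind() on plain vectors produces a matrix, not a data frame.
output <- cbind(nodeID = dat$nodeID, score = dat$score)
is.matrix(output)

# Option 1: index the matrix by column name.
by_name <- output[, "nodeID"]

# Option 2: convert to a data frame, after which "$" works again.
output_df <- as.data.frame(output)
output_df$nodeID
```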

 

 

Fixing a failure to load rJava in R

library(rJava)

Error : .onLoad failed in loadNamespace() for 'rJava', details:
  call: inDL(x, as.logical(local), as.logical(now), ...)
  error: unable to load shared object 'f:/Program Files/R/R-3.1.2/library/rJava/libs/x64/rJava.dll':
  LoadLibrary failure:  %1 is not a valid Win32 application.

Error: package or namespace load failed for 'rJava'

The error above occurs because your Java installation is 32-bit while your R is 64-bit. Download the 64-bit version of the JRE and point the environment variable JAVA_HOME to the location of the 64-bit JRE.
The JRE can be downloaded from http://www.java.com/en/download/manual.jsp
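Before reinstalling anything, it can help to confirm which side is 32-bit. This sketch only inspects the R side and the JAVA_HOME variable; it does not check the Java installation itself:

```r
# Pointer size in bytes times 8 gives the bitness of the running R session.
r_bits <- 8 * .Machine$sizeof.pointer
r_arch <- R.version$arch

# Empty string if JAVA_HOME is not set.
java_home <- Sys.getenv("JAVA_HOME")

cat("R is a", r_bits, "bit build (", r_arch, ")\n")
cat("JAVA_HOME is:", if (nzchar(java_home)) java_home else "<not set>", "\n")
```

If R reports 64 bit and JAVA_HOME points at a 32-bit JRE, that matches the situation described above.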

R reads JSON data

You can use the jsonlite package, loaded with library(jsonlite):
jsonmessage <- read_json(name.json, simplifyVector = FALSE)
jsonmessage$Month$minute[[1]]$number
You can work down through the nested lists to find the information you want.
However, jsonlite reports an error when reading a JSON file with a Chinese name:
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid char in json text.
In that case, switch to library(RJSONIO):
ts1 <- RJSONIO::fromJSON(name.json)
If an element is a character vector and you cannot select a single item from it, just turn it into a list:
ts1$month$minute[[1]]
name age gender job
Lili 20 female enginner
This character vector has a name attached to each value.
Convert it to a list:
ts1$month$minute[[1]] <- as.list(ts1$month$minute[[1]])
Once it is a list, you can go on selecting the items you want.
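The character-vector-to-list step can be tried without any JSON file. A minimal sketch using a hand-written named character vector as a stand-in for ts1$month$minute[[1]] (the values are illustrative):

```r
# Stand-in for one parsed element: a named character vector.
person <- c(name = "Lili", age = "20", gender = "female", job = "engineer")

# "$" does not work on an atomic vector; capture the error message.
dollar_error <- tryCatch(person$name, error = function(e) conditionMessage(e))
dollar_error

# Double brackets with a name do work on a named vector.
person[["name"]]

# After as.list(), "$" selection works.
person_list <- as.list(person)
person_list$name
```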

Reproduced in: https://www.cnblogs.com/zhenghuali/p/10455509.html

Interesting undefined columns selected from read.table and read.csv

I ran the following code:
read.table(site_file,header=T)-> data
data <- data[which(data[,5]=="ADD"),]
It failed with:
Error in `[.data.frame`(data, , 5) : undefined columns selected
Calls: plot_manhatton -> [ -> [.data.frame -> which -> [ -> [.data.frame
After a few attempts, I changed the command to:
read.csv(site_file,header=T)-> data
data <- data[which(data[,5]=="ADD"),]
Now it runs.
The undefined columns selected error arose because the file I imported was a CSV file but I read it with read.table. By default read.table splits fields on whitespace, so each comma-separated line was read as a single column and column 5 did not exist. After switching to read.csv, the problem went away.
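The difference is easy to reproduce with inline text instead of a file. In this sketch the column names and values are made up; only the "ADD" marker in column 5 mirrors the original code:

```r
# Made-up comma-separated content standing in for site_file.
csv_text <- "chr,pos,snp,stat,test\n1,100,rs1,0.5,ADD\n1,200,rs2,0.1,DOM"

# read.table() splits on whitespace by default, so each line is one field.
wrong <- read.table(text = csv_text, header = TRUE)
ncol(wrong)

# read.csv() splits on commas.
right <- read.csv(text = csv_text, header = TRUE)
ncol(right)

# read.table() also works once the separator is given explicitly.
also_right <- read.table(text = csv_text, header = TRUE, sep = ",")

add_rows <- right[which(right[, 5] == "ADD"), ]
add_rows
```

Checking ncol() or str() right after reading a file is a quick way to catch this kind of mismatch.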

Reproduced in: https://www.cnblogs.com/chenwenyan/p/5384714.html

Solving the problem of saving object set by save() function in R language

Saving a collection of objects with the save() function in R; see The Art of R Programming, p. 195.

>ls()
[1] "j"              "joe"            "k"              "o"              "print.employee" "z"             
> z<-rnorm(100000)
> hz<-hist(z)
> save(hz,"hzfile.RData")
Error in save(hz, "hzfile.RData") : object ‘hzfile.RData’ not found
> save(hz,"hzfile")
Error in save(hz, "hzfile") : object ‘hzfile’ not found
> save(hz,file="hzfile.RData")
> ls()
[1] "hz"             "j"              "joe"            "k"              "o"              "print.employee" "z"             
> rm(hz)
> ls()
[1] "j"              "joe"            "k"              "o"              "print.employee" "z"             
> load("hzfile")
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
  cannot open compressed file 'hzfile', probable reason 'No such file or directory'
> load("hzfile.RData")
> ls()
[1] "hz"             "j"              "joe"            "k"              "o"              "print.employee" "z"             



When you call save(), pass the file name through the file parameter and give it the “.RData” suffix. When you call load(), use the same file name, including the “.RData” suffix.
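The whole round trip can be sketched in a few lines. The file is written to a temporary directory here (my choice, to avoid cluttering the working directory), and the histogram is computed without plotting:

```r
# Build an object to save; plot = FALSE skips drawing the histogram.
hz <- hist(rnorm(1000), plot = FALSE)

# Hypothetical location for the saved file.
rdata_path <- file.path(tempdir(), "hzfile.RData")

# The file name must be passed through the named file argument.
save(hz, file = rdata_path)

rm(hz)

# load() restores the object and returns the names it restored.
loaded_names <- load(rdata_path)
loaded_names
exists("hz")
```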

A sparse matrix in R that is too large for as.matrix

A very large matrix with 320,127 rows and 8,189 columns, filled with zeros, requires 9.8 GB when stored as an ordinary integer matrix (and twice that as a numeric one):

cols <- 8189
rows <- 320127
mat <- matrix(data = 0, nrow=320127, ncol = 8189)
print(object.size(mat), unit="GB")
# 19.5 Gb
mat <- matrix(data = 0L, nrow=320127, ncol = 8189)
print(object.size(mat), unit="GB")
# 9.8 Gb

Note that the two zeros here are different.
Here 0L makes the data type integer, whereas a plain 0 is numeric (double) by default. The biggest difference between the two is that 320127L * 8189L gives NA, because the product overflows the integer range, whereas 320127 * 8189 does not.
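The overflow and the two memory figures can be checked with plain arithmetic, without allocating anything. A small sketch (suppressWarnings() hides the overflow warning that R would otherwise print):

```r
# Integer multiplication overflows and yields NA (with a warning).
int_product <- suppressWarnings(320127L * 8189L)
int_product

# Double multiplication has no such problem.
dbl_product <- 320127 * 8189
dbl_product

# Largest value an R integer can hold.
.Machine$integer.max

# Rough size of the dense matrix: 4 bytes per integer, 8 per double.
dbl_product * 4 / 2^30   # about 9.8, matching the integer matrix
dbl_product * 8 / 2^30   # about 19.5, matching the numeric matrix
```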
If you save it as a sparse matrix

library(Matrix)
mat <- Matrix(data = 0L, nrow=320127, ncol = 8189, sparse = TRUE)
print(object.size(mat), unit="GB")
#0 Gb
dim(mat)
#[1] 320127   8189

Although the number of rows and columns is the same, the sparse matrix takes up almost no memory. The operations that ordinary matrices support, such as row sums, column sums and element extraction, also work on sparse matrices, though they take a little longer. Many R packages also support sparse matrices, for example glmnet, an R package for lasso regression.
Sparse matrices look great, but in R some operations on one this large can fail:

> mat2 <- mat + 1
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Even converting it back to an ordinary matrix with as.matrix gives an error:

> mat3 <- Matrix::as.matrix(mat)
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Since the ready-made as.matrix cannot handle it, what can be done? The simplest and crudest method is to create a new ordinary matrix, then traverse the sparse matrix and copy its values into the ordinary matrix one by one.

mat2 <- matrix(data = 0, nrow=320127, ncol = 8189)
for (i in seq_len(nrow(mat))){
    for (j in seq_len(ncol(mat))){
        mat2[i, j] <- mat[i, j]
    }
}

So how long does that take? On my computer it had not finished after two hours, so don't bother testing it.
Is there a way to speed it up? The way to do so is to cut down the number of loop iterations: because the matrix is sparse and most entries are 0, we only need to assign the non-zero entries to the new matrix.
This requires us to understand the data structure of a sparse matrix:

> str(mat)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int(0) 
  ..@ p       : int [1:8190] 0 0 0 0 0 0 0 0 0 0 ...
  ..@ Dim     : int [1:2] 320127 8189
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num(0) 
  ..@ factors : list()

@Dim records the dimensions of the matrix, @Dimnames records the row and column names, and @x records the non-zero values. @i records the zero-based row index of each value in @x; here every entry is 0, so nothing is recorded. @p is more complicated: it is not a plain column index for the non-zero values (it holds the cumulative number of non-zero values up to each column), and the documentation does not make it easy to understand, but a search turns up how to convert it to the column index of each non-zero value.
Therefore, the code can be optimized as follows:

tmp <- matrix(data=0L, nrow = mat@Dim[1], ncol = mat@Dim[2])

row_pos <- mat@i+1
# seq_along() stays correct even when there is exactly one non-zero value
col_pos <- findInterval(seq_along(mat@x)-1,mat@p[-1])+1
val <- mat@x
    
for (i in seq_along(val)){
    tmp[row_pos[i],col_pos[i]] <- val[i]
}

You can encapsulate it as a function

as_matrix <- function(mat){

  tmp <- matrix(data=0L, nrow = mat@Dim[1], ncol = mat@Dim[2])
  
  row_pos <- mat@i+1
  col_pos <- findInterval(seq_along(mat@x)-1,mat@p[-1])+1
  val <- mat@x
    
  for (i in seq_along(val)){
      tmp[row_pos[i],col_pos[i]] <- val[i]
  }
    
  row.names(tmp) <- mat@Dimnames[[1]]
  colnames(tmp) <- mat@Dimnames[[2]]
  return(tmp)
}
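To see how the row and column positions are recovered, here is a tiny worked case in base R. The vectors i, p and x below are hand-written stand-ins for the i, p and x slots of a 3 x 3 sparse matrix (my own example); the last step also swaps the for loop for a single matrix-index assignment, which is a variation rather than the code above:

```r
# Hand-written column-compressed storage of:
#   0 2 0
#   1 0 0
#   0 3 4
i <- c(1L, 0L, 2L, 2L)      # zero-based row of each non-zero value
p <- c(0L, 1L, 3L, 4L)      # cumulative count of non-zeros per column
x <- c(1, 2, 3, 4)          # the non-zero values, column by column
dims <- c(3L, 3L)

row_pos <- i + 1L
col_pos <- findInterval(seq_along(x) - 1, p[-1]) + 1

# Equivalent view: column j holds diff(p)[j] values.
col_pos_alt <- rep(seq_len(dims[2]), times = diff(p))

# Fill all non-zero cells in one assignment via a two-column index matrix.
dense <- matrix(0, nrow = dims[1], ncol = dims[2])
dense[cbind(row_pos, col_pos)] <- x
dense
```

Indexing with a two-column matrix of (row, column) pairs avoids the explicit loop; whether it is fast enough for a matrix of the size discussed here is not something this sketch measures.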

If you need still more speed, you may have to turn to Rcpp. I wrote a simple version with reference to the code at http://adv-r.had.co.nz/Rcpp.html; you can continue reading on my blog at http://xuzhougeng.top/archives/R-Sparse-Matrix-Note.

Error: cannot allocate vector of size 88.1 MB

While training a model these past few days, my code kept stopping with: Error: cannot allocate vector of size 88.1 MB. All I knew was that there was not enough memory to allocate.
Here are some of the answers I found:
1. This is a characteristic of R, and there are several ways around it:
   1. Upgrade to R 3.3.0 or above, where memory management and matrix computation are much better. Calculations that crash on R 3.2.5 run fine on R 3.3.0 and later.
   2. Load one of the disk-caching packages for R (search for them).
   3. Add memory-cleanup commands at suitable points in your code.
   4. Run multiple threads.
   5. Adding memory only helps so much. R 3.2.5 can bring down a server with 44 cores and 512 GB of memory, so the code still has to be optimized.
2. Sometimes adding memory cannot keep up with a large volume of data, so a parallel computing strategy is used. If the data is too large to read in one go, you can use the filematrix package to read it from disk in several passes, although this is much slower.
3. Find the setting in R where the maximum memory allocation can be changed, under Preferences or something similar.
4. Download a package called bigmemory. It rebuilds the classes for large data sets and is basically at the cutting edge of handling large data sets (including tens of gigabytes).
   Link: cran.r-project.org/web/packages/bigmemory/
5. The bigmemory package works. Two other options are mapReduce and RHIPE (which uses Hadoop); they can also handle large data sets.
6. Advice from an expert (http://bbs.pinggu.org/thread-3682816-1-1.html): “cannot allocate vector” is the typical symptom of data that is too big to read in.
There are three methods:
1. Upgrade the hardware.
2. Improve the algorithm.
3. Raise the upper limit on the memory the operating system allocates to R. The relevant functions (memory.size and memory.limit are Windows only) are:
memory.size(T) shows the memory allocated
memory.size(F) shows the memory in use
memory.limit() shows the memory limit
object.size() shows how much memory a given variable takes up
If the current memory limit is not sufficient, you can raise it with memory.limit(newLimit). Note that in 32-bit R the limit is capped at 4 GB, and a program cannot use more than that (an address-width limit). In that case, consider using the 64-bit version.
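Of the functions listed, object.size() works on every platform, so it is a convenient first check for what is eating memory. A rough sketch; the vector big is a made-up example object:

```r
# A made-up large object: one million doubles, roughly 8 MB.
big <- numeric(1e6)

size_bytes <- as.numeric(object.size(big))
print(object.size(big), units = "Mb")

# Size in bytes of every object in the workspace, largest first.
workspace_sizes <- sapply(ls(), function(nm) as.numeric(object.size(get(nm))))
sort(workspace_sizes, decreasing = TRUE)

# Remove what is no longer needed and let R release the memory.
rm(big)
invisible(gc())
```

Removing large intermediate objects and calling gc() is one form of the "memory-cleanup commands" mentioned in the list above.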
For details, see this very good article: https://blog.csdn.net/sinat_26917383/article/details/51114265
1 http://jliblog.com/archives/276
2 http://cos.name/wp-content/uploads/2011/05/01-Li-Jian-HPC.pdf
3 R high performance computing and parallel computing: http://cran.r-project.org/web/views/HighPerformanceComputing.html
If you run into this problem, try the corresponding solution; these methods work well.
Original post: https://www.cnblogs.com/babyfei/p/9565143.html