R language notes – sample() function

Field studies and sample selection in medical statistics and epidemiology rely heavily on one idea: random sampling. Random sampling is an important way to keep comparison groups balanced. So the first function introduced today is sample, the R function for sampling:

> x=1:10
> sample(x=x)

 [1]  3  5  9  6 10  7  2  1  8  4

The first line assigns the integers 1 to 10 to the vector x, and the second line draws a random sample from x. The output lists the elements in the order they were drawn. Each value appears exactly once, so the sampling is without replacement, and because size defaults to the length of x, the result is a random permutation of all n elements of x.
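
Because the draw is random, the output above will differ on every run. As a minimal sketch (set.seed and the seed value 42 are additions here, not part of the original example), the shuffle can be made repeatable and checked to contain every element exactly once:

```r
# Fix the random seed so the shuffle can be reproduced (42 is an arbitrary choice)
set.seed(42)
x <- 1:10
shuffled <- sample(x = x)

# With no size argument, every element of x appears exactly once
length(shuffled) == length(x)     # TRUE
identical(sort(shuffled), x)      # TRUE

# Resetting the same seed reproduces the same shuffle
set.seed(42)
identical(sample(x = x), shuffled)  # TRUE
```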

If you want to specify how many elements are drawn from the vector, add the parameter size:

> x=1:1000
> sample(x=x,size=20)

 [1]  66 891 606 924 871 374 879 573 284 305 914 792 398 497 721 897 324 437
[19] 901  33

Here the sample is drawn from the positive integers 1 to 1000, and size sets the number of elements drawn, 20 in this case; the result is shown above.
This is still sampling without replacement: once an element has been selected, it is removed from the population and cannot be drawn again. To sample with replacement, add the parameter replace=T:

> x=1:10
> sample(x=x,size=5,replace=T)

[1] 4 7 2 4 8

Here replace means that each drawn element is put back into the population before the next draw, so the same element can be selected more than once. This is sampling with replacement. In the result above, element 4 was selected twice in 5 draws.
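
The difference between the two modes can be checked directly. The following is only a sketch (the variable names are made up for illustration): without replacement no value repeats and a request for more draws than elements fails, while the same request succeeds with replacement.

```r
x <- 1:10

# Without replacement (the default), no value can appear twice
no_repeat <- sample(x = x, size = 10)
anyDuplicated(no_repeat) == 0     # TRUE

# Asking for more draws than there are elements is an error without replacement;
# tryCatch turns that error into the string "error" so the script keeps running
too_many <- tryCatch(sample(x = x, size = 20), error = function(e) "error")
too_many                          # "error"

# With replacement the same request succeeds
with_repeat <- sample(x = x, size = 20, replace = TRUE)
length(with_repeat)               # 20
```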


R supports positional matching of arguments: if the values we enter are in the same order as the parameters in the function's definition, we do not have to write the parameter names, for example:

> x=1:10
> sample(x,20,T)

 [1] 1 2 2 1 5 5 5 9 9 5 2 9 8 3 4 8 8 8 1 1

In the code above we omitted the parameter names x, size and replace, yet the call still works: it draws 20 elements from x with replacement. I still try to write the parameter names every time, because I think it is a good habit and makes the code clearer. In addition, if you are not familiar with the order of a function's parameters, unnamed values can be matched to the wrong parameters and give the wrong result, and many functions have too many parameters to remember their order. When the parameters are named, the call works even if the positions do not correspond:

> x=1:10
> sample(size=20,replace=T,x=x)

 [1]  4  9  2  6  4  5  4  7 10  5  2  2  3  4  2  4  6  8  7  8

The advantage is obvious: the code is clearer, and the order of the arguments no longer matters. We can also see that with replacement the size can be as large as we like, whereas without replacement the size cannot exceed the size of the population.
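
One way to convince yourself that the positional and the named call do the same thing is to run both from the same random seed. This is a sketch under that assumption (set.seed and the seed value 1 are additions for illustration); the last lines show how swapping two unnamed values silently changes what is sampled.

```r
x <- 1:10

# Positional call: values are matched in the order x, size, replace
set.seed(1)
by_position <- sample(x, 20, TRUE)

# Named call with the arguments written in a different order
set.seed(1)
by_name <- sample(size = 20, replace = TRUE, x = x)

identical(by_position, by_name)   # TRUE

# Unnamed values in the wrong order do not raise an error, they change the meaning:
# this draws 10 values from 1:5, not 5 values from 1:10
swapped <- sample(5, 10, TRUE)
length(swapped)                   # 10
```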

Rolling a die or tossing a coin (probably the standard introduction to sampling) is sampling with replacement.
Note that for the sample function, the elements of x can be numbers or characters. In fact, x can be any vector:

> a=c("A","B")
> sample(x=a,size=10,replace=T)

 [1] "B" "A" "A" "A" "B" "A" "A" "B" "A" "A"

The code above can be read as 10 flips of a coin, where each flip comes up either heads (A) or tails (B).
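
To summarise the flips rather than read them one by one, the results can be tabulated. A small sketch, assuming the same coding of A for heads and B for tails (the seed and the variable names are arbitrary additions):

```r
set.seed(2024)                      # arbitrary seed, only for repeatability
coin <- c("A", "B")                 # A = heads, B = tails
flips <- sample(x = coin, size = 10, replace = TRUE)

# Count heads and tails; factor() keeps both levels even if one never occurs
counts <- table(factor(flips, levels = coin))
counts
sum(counts)                         # 10
```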

In the sampling described so far, each element is drawn with equal probability; this is simple random sampling.
Sometimes the elements should not be equally likely (for example, in common binomial probability problems). In that case we add the parameter prob, short for “probability”. Suppose a doctor's operations succeed with probability 80% and he operates on 20 patients today: how many of the operations succeed? The code is as follows:

> x=c("S","F")
> sample(x,size=20,replace=T,prob=c(0.8,0.2))

 [1] "F" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "F" "S" "S" "F" "S" "S"
[19] "F" "S"

Where “S” stands for success and “F” for failure.
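
The question was how many of the 20 operations succeed, so the successes still have to be counted. A sketch of one way to do that (the seed, the variable names and the 10000 repetitions are illustrative choices, not from the original):

```r
set.seed(7)                         # arbitrary seed
outcome <- sample(x = c("S", "F"), size = 20, replace = TRUE, prob = c(0.8, 0.2))

# Number of successful operations in this one simulated day
n_success <- sum(outcome == "S")
n_success

# Repeating the simulated day many times: the average should be near 20 * 0.8 = 16
sims <- replicate(10000, sum(sample(c("S", "F"), 20, TRUE, c(0.8, 0.2)) == "S"))
mean(sims)
```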

> x=c(1,3,5,7)
> sample(x,size=20,replace=T,prob=c(0.1,0.2,0.3,0.9))

 [1] 3 5 7 3 7 3 7 5 3 7 7 7 1 5 7 5 7 7 3 7

These examples show that each element can be given its own weight. The values in prob do not have to add up to 1: R treats them as relative weights and rescales them so that they sum to 1, so an entry is the actual selection probability only when the weights already sum to 1. Here the weights sum to 1.5, so 7 is drawn with probability 0.9/1.5 = 0.6, not 0.9.
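
The effect of weights that do not sum to 1 can be checked empirically. In this sketch (the sample size of 100000 and the seed are arbitrary choices) the observed frequencies are compared with the weights divided by their sum:

```r
x <- c(1, 3, 5, 7)
weights <- c(0.1, 0.2, 0.3, 0.9)    # sums to 1.5, not 1

# Selection probabilities implied by the weights
expected <- weights / sum(weights)
expected                            # approximately 0.067 0.133 0.200 0.600

# Draw a large sample and compare the observed frequencies
set.seed(123)
draws <- sample(x, size = 100000, replace = TRUE, prob = weights)
observed <- as.numeric(table(factor(draws, levels = x))) / length(draws)
round(observed, 2)
```

With this many draws the observed frequencies should sit close to the rescaled weights, in particular near 0.6 rather than 0.9 for the element 7.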

For the sample function, x can be a vector of any type (such as the character vector above). A closely related function is sample.int, where “int” is short for “integer”. Its argument n should be a single positive integer, and it samples from the integers 1 to n:

> x=-10.5:7.5
> sample(x=x,size=3);sample.int(n=x,size=3)

[1] -5.5 -7.5  0.5
Error in sample.int(x, size = 3) : invalid first argument

The first line of code creates the sequence from -10.5 to 7.5 in steps of 1. The first line of output is the result of sample. The second is the result of sample.int, which fails with “invalid first argument” because x is a vector containing negative and non-integer values, not a single positive integer. Otherwise sample.int is used in the same way as sample.
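
For comparison, here is a sketch of a call to sample.int that does work: it is given the number of elements in x instead of x itself, and the integers it returns are then used as positions in x. The variable names idx and picked are made up for illustration.

```r
x <- -10.5:7.5                      # 19 values: -10.5, -9.5, ..., 7.5

# sample.int(n, size) draws size integers from 1..n without replacement
idx <- sample.int(n = length(x), size = 3)

# Using those integers as positions gives a random sample of x itself
picked <- x[idx]

all(idx %in% seq_along(x))          # TRUE
all(picked %in% x)                  # TRUE
```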

Source: http://www.wtoutiao.com/p/186VWin.html