Removing duplicate lines from a file in the Linux shell

The original text file:

$ cat test
jason
jason
jason
fffff
jason

Method 1: sort -u

After removing duplicates:

$ sort -u test
fffff
jason

Notice that the original line order is not preserved.

Method 2: sort | uniq

After removing duplicates:

$ sort test | uniq
fffff
jason

The order is disrupted here as well; the principle is the same as with sort -u.
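
For comparison, uniq on its own only collapses adjacent duplicate lines, which is why the sort is needed first. Run against the unsorted test file, it leaves the non-adjacent repeat in place:

$ uniq test
jason
fffff
jason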

Method 3: awk '!a[$0]++'

After removing duplicates:

$ awk '!a[$0]++' test
jason
fffff

The original order is preserved. To deduplicate a file in place:

awk '!a[$0]++' test.txt >test.txt.tmp && mv -f test.txt.tmp test.txt

Here awk writes the result to a temporary file, which then overwrites the original. Redirecting awk's output directly to test.txt would not work, because the shell truncates test.txt before awk reads a single line.
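
As an aside, if the moreutils package happens to be installed (an assumption, not something the method above requires), its sponge command reads all of its input before writing to the output file, avoiding the explicit temporary file:

$ awk '!a[$0]++' test.txt | sponge test.txt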

The principle works as follows:

An awk program consists of pattern-action statements of the form Pattern { Action }. If the action is omitted, print $0 is executed by default.
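
For example, a pattern with no action simply prints each matching record; on the test file above, a pattern selecting the first two records behaves like head -n 2:

$ awk 'NR <= 2' test
jason
jason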

The pattern used here to remove duplicates is:

!a[$0]++

In awk, an uninitialized array element evaluates to 0 in a numeric context, so a[$0] starts out as 0. The ++ operator is a post-increment: it returns the current value first, then adds 1. On the first occurrence of a line, the pattern is therefore equivalent to

!0

Since 0 is false, the ! negates it and the whole pattern evaluates to 1 (true), equivalent to if (1). The pattern matches, so the current record is printed. In the test file, the first occurrence of each line ('jason' on line 1 and 'fffff' on line 4) is handled this way.
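
The post-increment behavior is easy to verify in isolation: printing an uninitialized array element with ++ yields its old value, 0, after which the stored value is 1:

$ awk 'BEGIN { print a["x"]++; print a["x"] }'
0
1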

When the duplicate 'jason' on line 2 is read, a[$0] is already 1; negating it gives 0, so the pattern is false, the match fails, and the record is not printed. All subsequent lines are handled the same way, so the duplicate lines are removed while the first occurrences keep their original order.
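
Printing the counter instead of filtering on it makes the whole pass visible. The value shown is a[$0] before the increment; only the records where it is 0 would pass the !a[$0]++ filter:

$ awk '{ print a[$0]++, $0 }' test
0 jason
1 jason
2 jason
0 fffff
3 jason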
