Text Processing
Useful tools:
- grep, egrep and fgrep: match patterns.
- sed (stream editor): programmatically edits files line by line.
- awk: processes text, and is especially useful for table-like text files such as CSV.
Replace characters
$ cat foo.txt | tr "," "_" > bar.txt
For an unprintable character, e.g. \u0007, press ctrl-V ctrl-G to type it.
Change Everything to Uppercase
$ cat foo.txt | tr "[a-z]" "[A-Z]"
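A minimal sketch of both tr uses above, run on inline sample strings rather than foo.txt (the sample strings are assumptions). It uses the POSIX character classes [:lower:] and [:upper:] as a locale-aware alternative to the a-z and A-Z ranges.

```shell
# Replace every comma with an underscore.
replaced=$(printf 'a,b,c\n' | tr ',' '_')

# Uppercase using POSIX character classes instead of explicit ranges.
upper=$(printf 'hello\n' | tr '[:lower:]' '[:upper:]')

echo "$replaced"
echo "$upper"
```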
Count Rows
$ cat foo.txt | wc -l
Count Columns
If delimiter is ,
$ cat foo.txt | awk -F, '{print NF}'
or
$ awk 'BEGIN {FS=","} {print NF}' file.txt
where
- NF=Number of Fields
- FS=Field Separator
If delimiter is \u0007 (ctrl-v ctrl-g)
$ cat foo.txt | awk -F'^G' '{print NF}'
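In a script, typing ctrl-V ctrl-G is awkward. One possible alternative is to generate the BEL character with printf '\a' and pass it to awk through a variable. The sample line and the variable names below are assumptions for illustration only.

```shell
# Store the BEL character (\u0007) in a variable.
bel=$(printf '\a')

# Hypothetical sample line with three BEL-separated fields.
sample=$(printf 'a\ab\ac')

# Count the fields using BEL as the field separator.
fields=$(printf '%s\n' "$sample" | awk -F "$bel" '{print NF}')
echo "$fields"
```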
Get Column Number
Replace <pattern> with the column name or pattern. RS must match the file's delimiter (| in this example).
$ head -1 foo.csv | awk -v RS="|" '/<pattern>/{print NR;}'
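The trick works because setting RS to the delimiter makes awk treat each column name as its own record, so NR is the column position. A small sketch with a made-up pipe-delimited header (the header and column names are assumptions):

```shell
# Hypothetical header line standing in for `head -1 foo.csv`.
header='id|name|price'

# Each column name becomes one record; NR is then the column number.
col=$(printf '%s\n' "$header" | awk -v RS='|' '/name/ {print NR}')
echo "$col"
```

Note that /name/ is a substring match, so a column called username would match as well; anchor the pattern if that matters.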
Print Rows by Number
print the second row:
$ awk 'NR==2' filename
print line 2 to line 10
$ awk 'NR==2,NR==10' filename
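To see the range form in action without a file, the sketch below feeds five sample lines (an assumption) through the same kind of NR range pattern:

```shell
# Print records 2 through 4 of a five-line input.
rows=$(printf '%s\n' one two three four five | awk 'NR==2,NR==4')
echo "$rows"
```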
Match
Show ssh processes
$ ps | grep ssh
Specify max count by -m:
$ cat foo.log | grep -m 10 ERROR
Add One Line To The Beginning
$ sed -i '1s/^/line to insert\n/' /path/to/file
Remove 1st line in place
$ sed -i 1d filename
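The two sed commands above rely on GNU sed behaviour: BSD/macOS sed expects an argument after -i and does not treat \n in the replacement as a newline. Where portability matters, a sketch along the following lines, using a temporary file, may be an option. The file contents and variable names are assumptions.

```shell
# Hypothetical sample file for the demonstration.
file=$(mktemp)
printf 'first\nsecond\n' > "$file"

# Prepend a line: write the new line plus the old content to a temp file, then swap.
tmp=$(mktemp)
{ printf '%s\n' 'line to insert'; cat "$file"; } > "$tmp" && mv "$tmp" "$file"
after_insert=$(cat "$file")

# Remove the first line: keep everything from line 2 onward.
tmp=$(mktemp)
tail -n +2 "$file" > "$tmp" && mv "$tmp" "$file"
after_delete=$(cat "$file")

printf '%s\n---\n%s\n' "$after_insert" "$after_delete"
rm -f "$file"
```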
Extract a Column
awk
$ cat file | awk '{print $2}'
cut
$ echo 'a b c' | cut -d ' ' -f1
a
$ echo 'a b c' | cut -d ' ' -f2
b
Append - to the field number to list that field and everything to its right
$ echo 'a b c' | cut -d ' ' -f2-
b c
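One difference between the two tools is worth illustrating: awk collapses runs of whitespace by default, while cut treats every single delimiter as a field boundary. The sample line below (with two spaces after a) is an assumption.

```shell
# Two spaces between "a" and "b".
line='a  b c'

# awk splits on runs of whitespace, so the second field is "b".
awk_field=$(printf '%s\n' "$line" | awk '{print $2}')

# cut sees an empty field between the two spaces, so field 2 is empty.
cut_field=$(printf '%s\n' "$line" | cut -d ' ' -f2)

echo "awk: [$awk_field]"
echo "cut: [$cut_field]"
```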
Add comma to the end of the line
$ cat foo.txt | sed 's/$/,/'
split
Split data into chunks
Split by number of lines: split myfile so that each chunk has 500 lines and is prefixed by segment_, i.e. segment_aa, segment_ab, segment_ac, ...
$ split -l 500 myfile segment_
Split by size: split myfile so that each chunk is 40k
$ split -b 40k myfile segment_
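As a rough check of what split produces, the sketch below builds a 1200-line file in a temporary directory (the directory and file contents are assumptions), splits it into 500-line chunks, and verifies that concatenating the chunks in name order reproduces the original.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 1200; i++) print "row " i }' > "$workdir/myfile"

# 1200 lines at 500 per chunk gives segment_aa, segment_ab, segment_ac.
split -l 500 "$workdir/myfile" "$workdir/segment_"
chunks=$(ls "$workdir"/segment_* | wc -l | tr -d ' ')
last_lines=$(wc -l < "$workdir/segment_ac" | tr -d ' ')

# The suffixes sort alphabetically, so cat restores the original order.
roundtrip=failed
cat "$workdir"/segment_* | cmp -s - "$workdir/myfile" && roundtrip=ok

echo "$chunks chunks, last chunk has $last_lines lines, round trip: $roundtrip"
rm -rf "$workdir"
```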
sort
- -k: (key) column number
- -t: delimiter
$ cat file | sort -nr -t \| -k 2 | head
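The -n flag matters when the key is numeric. The sketch below sorts three made-up pipe-delimited rows (an assumption) by their second column, once numerically and once lexically, to show the difference.

```shell
# Numeric, descending sort on column 2: 10 is the largest value.
top=$(printf '%s\n' 'a|3' 'b|10' 'c|2' | sort -nr -t '|' -k 2 | head -1)

# Without -n the keys are compared as text, so "3" sorts above "10".
lexical_top=$(printf '%s\n' 'a|3' 'b|10' 'c|2' | sort -r -t '|' -k 2 | head -1)

echo "$top"
echo "$lexical_top"
```

-k 2 uses everything from the second field to the end of the line as the key; -k 2,2 restricts the key to the second field alone.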
Pretty print JSON
$ cat data.json | python -m json.tool
Append Multiple Lines To File
Use Here Document syntax: << "EOF" means the multi-line text ends at the string EOF, and anything in between will be appended to the file.
cat >> path/to/file/to/append-to.txt << "EOF"
Some text here
And some text here
EOF
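Quoting the delimiter also has a second effect: the shell leaves the body untouched, whereas an unquoted delimiter lets variables expand. A small sketch, with a temporary file and a variable name chosen only for illustration:

```shell
out=$(mktemp)
name=world

# Quoted delimiter: the body is written literally.
cat >> "$out" << "EOF"
hello $name
EOF

# Unquoted delimiter: $name is expanded before writing.
cat >> "$out" << EOF
hello $name
EOF

line1=$(sed -n 1p "$out")
line2=$(sed -n 2p "$out")
echo "$line1"
echo "$line2"
rm -f "$out"
```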
grep / egrep / fgrep
grep, egrep and fgrep are used to match patterns in files. Here are the differences:
- grep: basic regular expressions
- egrep: extended regular expressions (?, +, |); equivalent to grep -E
- fgrep: fixed patterns, no regular expressions; faster than grep and egrep; equivalent to grep -F
Check out the Regular expression Wikipedia page for the definitions of POSIX basic and extended regular expressions.
Assume there's an example.txt file containing 4 lines:
$ cat example.txt
hello world
good luck
good day
linux
grep vs fgrep
fgrep does not support regular expressions at all, so this returns nothing
$ cat example.txt | fgrep g..d
Use grep instead
$ cat example.txt | grep g..d
good luck
good day
grep vs egrep
grep does not treat | as OR, so this returns nothing
$ cat example.txt | grep "good|linux"
However, egrep recognizes | as OR
$ cat example.txt | egrep "good|linux"
good luck
good day
linux
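The same comparisons can be written with the equivalent grep -F and grep -E forms, which may be preferable since newer GNU grep releases warn that egrep and fgrep are obsolescent. The sketch recreates the four-line example file in a temporary location (an assumption) and counts matches with -c.

```shell
# Recreate the four-line example file.
example=$(mktemp)
printf 'hello world\ngood luck\ngood day\nlinux\n' > "$example"

# grep exits non-zero when nothing matches; "|| true" keeps the script going.
fixed=$(grep -F -c 'g..d' "$example" || true)       # literal "g..d": no match
basic=$(grep -c 'g..d' "$example")                  # regex: matches "good"
bre_alt=$(grep -c 'good|linux' "$example" || true)  # "|" is literal here
ere_alt=$(grep -E -c 'good|linux' "$example")       # "|" means OR

echo "fixed=$fixed basic=$basic bre_alt=$bre_alt ere_alt=$ere_alt"
rm -f "$example"
```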
Count Occurrence: -c
$ cat example.txt | grep -c good
2
Get Context: -C
Set --context=0 to print that line alone
$ cat example.txt | grep --context=0 "good luck"
good luck
Set --context=1 to print 1 line above and 1 line below
$ cat example.txt | grep --context=1 "good luck"
hello world
good luck
good day
or use -C 1
$ cat example.txt | grep -C 1 "good luck"
hello world
good luck
good day
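When context is only wanted on one side, grep also offers -A (lines after) and -B (lines before). A brief sketch against a temporary copy of the example file (the temp file is an assumption):

```shell
example=$(mktemp)
printf 'hello world\ngood luck\ngood day\nlinux\n' > "$example"

# One line of trailing context only.
after=$(grep -A 1 'good luck' "$example")

# One line of leading context only.
before=$(grep -B 1 'good luck' "$example")

printf '%s\n---\n%s\n' "$after" "$before"
rm -f "$example"
```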
Ignore: -v
Use -v to exclude some lines (i.e. NOT)
$ cat example.txt | grep good | grep -v day
good luck
Case Insensitive: -i
$ cat example.txt | grep GOOD
$ cat example.txt | grep -i GOOD
good luck
good day
Show Match in Color: --color
$ cat example.txt | grep good --color
Show Matched Line Number: -n
$ cat example.txt | grep good -n
2:good luck
3:good day
Show Matched File Name: -l
grep is not limited to searching a single file. Compare the results below:
$ grep good example.txt
good luck
good day
to search from multiple files:
$ grep good *
example.txt:good luck
example.txt:good day
filename will be shown along with the matched lines; to show the filename only:
$ grep -l good *
example.txt
what happens to the "pipe" version?
$ cat example.txt | grep good -l
(standard input)
Search for Whole Words Only: -w
$ grep -w goo example.txt
This returns nothing, since goo appears only as part of a word, never as a whole word. Without -w it matches:
$ grep goo example.txt
good luck
good day
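The effect of -w can be checked by counting matches with -c. The sketch below uses a temporary copy of the example file (an assumption).

```shell
example=$(mktemp)
printf 'hello world\ngood luck\ngood day\nlinux\n' > "$example"

# "goo" never appears as a whole word, so -w finds nothing.
whole_goo=$(grep -c -w goo "$example" || true)

# "good" is a whole word on two lines.
whole_good=$(grep -c -w good "$example")

# Without -w, "goo" matches inside "good".
partial_goo=$(grep -c goo "$example")

echo "$whole_goo $whole_good $partial_goo"
rm -f "$example"
```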
Recursive grep: -R
This searches the given directories and all of their sub-directories recursively
$ grep -R pattern *
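A small sketch of a recursive search, combined with -l to list only the matching files. The directory layout and file names are made up for the example; a directory is passed directly instead of *, which avoids relying on shell glob expansion.

```shell
# Build a throwaway directory tree with matches at different depths.
root=$(mktemp -d)
mkdir -p "$root/sub/deeper"
printf 'good luck\n' > "$root/top.txt"
printf 'good day\n' > "$root/sub/deeper/nested.txt"
printf 'linux\n' > "$root/sub/other.txt"

# List every file under $root that contains "good".
matches=$(grep -R -l good "$root" | sort)
match_count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')

echo "$matches"
rm -rf "$root"
```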
Set Maximum Matches: -m
$ cat example.txt | grep -m 1 good
good luck