Text Processing

Last Updated: 2023-02-05

Useful tools:

  • grep, egrep and fgrep match patterns.
  • sed (stream editor) edits text programmatically, line by line.
  • awk processes text and is especially useful for table-like files such as CSV.

Replace characters

$ cat foo.txt | tr "," "_" > bar.txt

For an unprintable character such as \u0007 (BEL), press Ctrl-V then Ctrl-G to type it literally.

Change Everything to Uppercase

$ cat foo.txt | tr '[:lower:]' '[:upper:]'

Count Rows

$ cat foo.txt | wc -l

Count Columns

If delimiter is ,

$ cat foo.txt | awk -F, '{print NF}'

or

$ awk 'BEGIN {FS=","} {print NF}' file.txt

where

  • NF=Number of Fields
  • FS=Field Separator

If the delimiter is \u0007 (typed as Ctrl-V Ctrl-G, displayed as ^G)

$ cat foo.txt | awk -F'^G' '{print NF}'
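
The commands above print one field count per line, which is hard to scan on a large file. A minimal sketch that tallies how many lines have each field count, using hypothetical inline sample data in place of foo.txt:

```shell
# Sample data (hypothetical): two lines with 3 fields, one line with 2.
counts=$(printf 'a,b,c\nd,e,f\ng,h\n' |
  awk -F, '{ n[NF]++ } END { for (k in n) print k, n[k] }' |
  sort -n)   # awk's for-in order is unspecified, so sort the result
printf '%s\n' "$counts"
```

Each output line is a field count followed by the number of lines that have it. More than one output line suggests the rows are ragged.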

Get Column Number

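Replace <pattern> with the column name or a pattern that matches it. This assumes the delimiter is |, which is passed to awk as the record separator (RS), so each header field becomes one record:

$ head -n 1 foo.csv | awk -v RS="|" '/<pattern>/{print NR}'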
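
The awk pattern above matches anywhere inside a field, so a pattern like id would also match a column named paid. A sketch of an exact-name lookup, assuming a hypothetical |-delimited header in place of foo.csv:

```shell
# Hypothetical header line; in practice this would come from: head -n 1 foo.csv
header='id|name|price'

# Put each column name on its own line, number the lines with grep -n,
# keep only whole-line matches (-x), and strip everything after the number.
col=$(printf '%s\n' "$header" | tr '|' '\n' | grep -n -x 'price' | cut -d: -f1)
echo "$col"
```

For the sample header this prints 3, the position of the price column.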

Print Rows by Number

Print the second row:

$ awk 'NR==2' filename

Print lines 2 to 10:

$ awk 'NR==2,NR==10' filename

Match

Show ssh processes

$ ps | grep ssh

Specify max count by -m:

$ cat foo.log | grep -m 10 ERROR

Add One Line To The Beginning

$ sed -i '1s/^/line to insert\n/' /path/to/file

Remove 1st line in place

$ sed -i 1d filename
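
The two sed -i commands above are written for GNU sed. BSD sed (e.g. on macOS) requires an argument after -i, and \n in the replacement text is not portable. A sketch of the same two edits using only a temporary file, with a hypothetical scratch file standing in for the real path:

```shell
# Hypothetical scratch file standing in for /path/to/file.
f=$(mktemp)
printf 'first\nsecond\n' > "$f"

# Prepend a line: write the new line plus the old content to a temp file, then swap.
{ printf '%s\n' 'line to insert'; cat "$f"; } > "$f.tmp" && mv "$f.tmp" "$f"
after_insert=$(cat "$f")

# Remove the first line the same way; tail -n +2 prints from line 2 onward.
tail -n +2 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
after_delete=$(cat "$f")
```

The && ensures the original file is only replaced when the temporary copy was written successfully.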

Extract a Column

awk

$ cat file | awk '{print $2}'

cut

$ echo 'a b c' | cut -d ' ' -f1
a

$ echo 'a b c' | cut -d ' ' -f2
b

Append - to the field number to print that field and everything to its right

$ echo 'a b c' | cut -d ' ' -f2-
b c

Add a Comma to the End of Each Line

$ cat foo.txt | sed 's/$/,/'

split

Split data into chunks

Split by number of lines: split myfile into chunks of 500 lines each, with output files prefixed by segment_ (segment_aa, segment_ab, segment_ac, ...)

$ split -l 500 myfile segment_

Split by size: split myfile into chunks of 40k (40 × 1024 bytes) each

$ split -b 40k myfile segment_

sort

  • -k: (key) column number
  • -t: delimiter
  • -n: compare numerically
  • -r: reverse the order

$ cat file | sort -nr -t \| -k 2 | head
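
A small worked example of the sort options above, using hypothetical inline data. Note that -k 2 makes the key run from field 2 to the end of the line; -k 2,2 restricts it to field 2 alone, which matters once there are more columns.

```shell
# Hypothetical |-delimited rows: a name and a number.
sorted=$(printf 'a|3\nb|10\nc|2\n' | sort -nr -t '|' -k 2,2)
printf '%s\n' "$sorted"
```

With -n the values compare as numbers, so 10 sorts above 3; without it they would compare as text and 3 would come first.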

Pretty print JSON

$ cat data.json | python -m json.tool

Append Multiple Lines To File

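Use here-document syntax: everything between << "EOF" and the line containing only EOF is appended to the file. Quoting the delimiter ("EOF") keeps the text literal, so variables and command substitutions inside it are not expanded.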

cat >> path/to/file/to/append-to.txt << "EOF"
Some text here
And some text here
EOF
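
To illustrate what quoting the delimiter changes, here is a short sketch that appends the same line twice, once with a quoted and once with an unquoted delimiter. The variable name and the temporary file are hypothetical.

```shell
out=$(mktemp)   # hypothetical scratch file
name=world

# Quoted delimiter: the body is taken literally.
cat >> "$out" << "EOF"
hello $name
EOF

# Unquoted delimiter: the shell expands $name before appending.
cat >> "$out" << EOF
hello $name
EOF

cat "$out"
```

The first appended line is the literal text hello $name, the second is hello world.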

grep / egrep / fgrep

grep, egrep and fgrep match patterns in files. The differences are:

  • grep: basic regular expressions
  • egrep: extended regular expressions (?, +, |); equivalent to grep -E
  • fgrep: fixed patterns, no regular expressions; faster than grep and egrep; equivalent to grep -F

See the Wikipedia page on regular expressions for the definitions of POSIX basic and extended regular expressions.

Assume there's an example.txt file containing 4 lines:

$ cat example.txt
hello world
good luck
good day
linux

grep vs fgrep

fgrep does not interpret regular expressions at all. It searches for the literal string g..d, so this returns nothing:

$ cat example.txt | fgrep g..d

Use grep instead:

$ cat example.txt | grep g..d
good luck
good day

grep vs egrep

In basic regular expressions | is an ordinary character, so grep searches for the literal string good|linux and returns nothing:

$ cat example.txt | grep "good|linux"

egrep, however, treats | as OR:

$ cat example.txt | egrep "good|linux"
good luck
good day
linux
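
Recent GNU grep releases print a warning that egrep and fgrep are obsolescent, so the -E and -F flags are the safer spelling. A sketch of two equivalent ways to express the OR above with plain grep, recreating the sample file in a hypothetical temporary location:

```shell
# Recreate the 4-line sample file in a scratch location.
f=$(mktemp)
printf 'hello world\ngood luck\ngood day\nlinux\n' > "$f"

# Extended regular expression: | means OR.
with_E=$(grep -E 'good|linux' "$f")

# Several -e patterns: a line matches if any one of them matches.
with_e=$(grep -e good -e linux "$f")

printf '%s\n' "$with_E"
```

Both forms select the same three lines as the egrep command above.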

Count Occurrence: -c

$ cat example.txt | grep -c good
2

Get Context: -C

Set --context=0 to print that line alone

$ cat example.txt | grep --context=0 "good luck"
good luck

Set --context=1 to also print 1 line above and 1 line below

$ cat example.txt | grep --context=1 "good luck"
hello world
good luck
good day

or use -C 1

$ cat example.txt | grep -C 1 "good luck"
hello world
good luck
good day

Ignore: -v

Use -v to exclude matching lines (i.e. NOT)

$ cat example.txt | grep good | grep -v day
good luck

Case Insensitive: -i

$ cat example.txt | grep GOOD

$ cat example.txt | grep -i GOOD
good luck
good day

Show Match in Color: --color

$ cat example.txt | grep good --color

Show Matched Line Number: -n

$ cat example.txt | grep good -n
2:good luck
3:good day

Show Matched File Name: -l

grep is not limited to searching a single file. Compare the results below:

$ grep good example.txt
good luck
good day

To search multiple files:

$ grep good *
example.txt:good luck
example.txt:good day

When more than one file is searched, the filename is shown along with each matched line. To show the filename only:

$ grep -l good *
example.txt

What happens to the "pipe" version? grep reads from standard input, so there is no filename to report:

$ cat example.txt | grep good -l
(standard input)

Search for Whole Words Only: -w

$ grep -w goo example.txt

This returns nothing because goo matches only part of the word good, not a whole word. Without -w it matches:

$ grep goo example.txt
good luck
good day

Recursive grep: -R

This searches every file in the current directory and its sub-directories recursively

$ grep -R pattern *

Set Maximum Matches: -m

$ cat example.txt | grep -m 1 good
good luck