Advanced Filters

Advanced filters are commands used to filter and manipulate data, both on the command line and in shell scripts.

grep

grep is one of the most useful commands, and mastering it opens the way to mastering other tools such as awk, sed and perl. grep scans its input for a pattern and displays the lines containing the pattern and, optionally, the line numbers or filenames where the pattern occurs.

Syntax

grep options pattern filename(s)

grep searches for a pattern in one or more files, or in standard input if no filename is specified. The first argument (barring the options) is the pattern and the remaining arguments are filenames.

Examples: Let’s consider the following “emp.lst” table.

surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000

The command below searches for and prints only those lines that contain the pattern mumbai.

grep "mumbai" emp.lst
# output
john:softech:mumbai:16000
mith:softsol:mumbai:020000

Because grep is a filter, it can search its standard input for the pattern and also save its standard output in a file, for example:

who | grep mumbai > ff1
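
A minimal sketch of grep used as a filter in a pipeline: it reads the sorted table from standard input and saves the matching lines in a file. The scratch directory and the output name softsol.out are assumptions made for this illustration and are not part of the original example.

```shell
# Recreate the sample table in a scratch directory (illustrative setup).
tmpdir=$(mktemp -d)
cat > "$tmpdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# grep reads sort's output on standard input; no filename is given to grep.
# softsol.out is a hypothetical output file name.
sort "$tmpdir/emp.lst" | grep "softsol" > "$tmpdir/softsol.out"

# Show what was saved.
cat "$tmpdir/softsol.out"
```

Only the two softsol lines end up in the output file; emp.lst itself is not modified.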

grep Options with Examples

Ignore Case (-i)

The -i option ignores case when matching the pattern. With the pattern hyderabad, it locates HYDERABAD as well as hyderabad.

grep -i "hyderabad" emp.lst
# output
rahul:ibm:hyderabad:15000
iran:oracle:HYDERABAD:017000

Deleting Lines (-v)

grep has an inverse option (-v) that selects all lines except those containing the pattern. More often than not, when we use grep -v, we also redirect its output to a file as a means of getting rid of unwanted lines.

grep -v "mumbai" emp.lst
# output
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
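
As a minimal sketch of the redirect idiom described above, the following writes every line except the mumbai ones to a new file. The scratch directory and the file name emp.clean are assumptions made for illustration.

```shell
# Recreate the sample table in a scratch directory (illustrative setup).
tmpdir=$(mktemp -d)
cat > "$tmpdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# Keep only the lines that do NOT contain "mumbai".
# emp.clean is a hypothetical name for the cleaned copy.
grep -v "mumbai" "$tmpdir/emp.lst" > "$tmpdir/emp.clean"

# Count the lines that are left (an empty pattern matches every line).
remaining=$(grep -c "" "$tmpdir/emp.clean")
echo "$remaining"
```

The original emp.lst is left untouched; only the redirected copy lacks the unwanted lines.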

Display Line Numbers (-n)

The -n option displays the line number of each line containing the pattern, along with the line itself.

grep -n "softsol" emp.lst
# output
6:king:softsol:kolkata:14000
8:mith:softsol:mumbai:020000

Count Lines (-c)

The -c option counts the lines containing the pattern. The command below counts the matching lines while ignoring case.

grep -ic "hyderabad" emp.lst
# output
2

Matching multiple patterns (-e)

The -e option can be repeated to select lines that match any one of several patterns.

grep -e "mumbai" -e "kolkata" -e "pune" emp.lst
# output
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
mith:softsol:mumbai:020000
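
The same selection can also be written as a single extended regular expression. The sketch below assumes a grep that supports the POSIX -E option, which the text above does not cover, and checks that both forms select the same lines.

```shell
# Recreate the sample table in a scratch directory (illustrative setup).
tmpdir=$(mktemp -d)
cat > "$tmpdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# Three separate patterns, each introduced by -e.
with_e=$(grep -e "mumbai" -e "kolkata" -e "pune" "$tmpdir/emp.lst")

# One extended regular expression with | as the alternation operator.
with_ere=$(grep -E "mumbai|kolkata|pune" "$tmpdir/emp.lst")

[ "$with_e" = "$with_ere" ] && echo "both forms select the same lines"
echo "$with_ere"
```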

sed

The Stream Editor

sed is a multipurpose tool that combines the work of several filters. Its operation is straightforward: each file named on the command line is opened and read in turn. If no files are named, or if the filename is “-” (a single dash), standard input is used.

sed reads through each file one line at a time. The line is placed in an area of memory termed the pattern space. This is like a variable in a programming language: an area of memory that can be changed as desired under the direction of the editing commands. All editing operations are applied to the contents of the pattern space. When all operations have been completed, sed prints the final contents of the pattern space to standard output, and then goes back to the beginning, reading another line of input.

Learning sed will prepare you well for perl which uses many of these features. sed uses instructions to act on text. An instruction combines an address for selecting lines, with an action to be taken on them.

Syntax

sed options 'address action' file(s)

The address and action are enclosed within single quotes. Addressing in sed is done in two ways.

  • By one or two line numbers (like 3,7).
  • By specifying a /-enclosed pattern which occurs in a line (like /From:/).

In the first form, the address specifies either one line number to select a single line or a pair of line numbers (3,7) to select a group of contiguous lines. In the second form, the address selects every line that contains the pattern. The action component is drawn from sed's family of internal commands. It can be either a simple display (print) or an editing function such as insertion, deletion or substitution of text.

sed processes several instructions sequentially. Each instruction operates on the output of the previous instruction. The -e option lets you use multiple instructions, and the -f option takes instructions from a file.
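
A small sketch of this sequential behavior, using a single line fed through a pipe. The intermediate word madras is a made-up placeholder chosen only to make the two steps visible.

```shell
# The first instruction rewrites chennai; the second instruction then sees
# the already-modified line and rewrites it again.
result=$(printf 'surya:hcl:chennai:19000\n' |
  sed -e 's/chennai/madras/' -e 's/madras/CHENNAI/')

echo "$result"
```

If the second instruction worked on the original line instead, it would find no madras to replace and the result would still contain madras.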

Let’s consider the same “emp.lst” table that was used for the grep examples.

sed Options with Examples

To consider line addressing first, the instruction '3q' can be broken down into the address 3 and the action q (quit). When this instruction is enclosed within quotes and followed by one or more filenames, it simulates head -n 3:

sed '3q' emp.lst  # quits after line 3
# output
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000

Use the p (print) command to display lines. By default sed already prints every line, so with p the selected lines appear twice. To suppress this behavior, use the -n option, and remember to use it whenever you use the p command.

Example: print the second through fourth lines.

sed -n '2,4p' emp.lst
# output
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000

Select Last Line ($)

To select the last line of the file, use $:

Example:

sed -n '$p' emp.lst
# output
mith:softsol:mumbai:020000

Negation (!)

sed also has a negation operator (!), which can be used with any action. For instance, selecting the first two lines is the same as not selecting lines 3 through the end.

Example:

sed -n '3,$!p' emp.lst # prints only lines 1 and 2 (everything except lines 3 to end)
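
As a quick check of this equivalence, the sketch below compares the negated form with the q (quit) form introduced earlier. The scratch directory is an assumption made for illustration.

```shell
# Recreate the sample table in a scratch directory (illustrative setup).
tmpdir=$(mktemp -d)
cat > "$tmpdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# Print every line that is NOT in the range 3 to end.
negated=$(sed -n '3,$!p' "$tmpdir/emp.lst")

# Print lines until line 2, then quit.
quit_form=$(sed '2q' "$tmpdir/emp.lst")

[ "$negated" = "$quit_form" ] && echo "both forms print the first two lines"
echo "$negated"
```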

Multiple Instructions (-e and -f)

The -e option allows you to enter as many instructions as you wish, each preceded by the option.

Example:

sed -n -e '1,2p' -e '4,6p' -e '$p' emp.lst

Pass Commands From File

When there are too many instructions to type, or when there is a set of common instructions that you run often, they are better stored in a file.

Example:

cat inst1
# output
1,2p
4,6p
$p

You can now use the -f option to direct sed to take its instructions from the file:

Example:

sed -n -f inst1 emp.lst
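
Putting the two steps together, a minimal sketch that creates the instruction file and checks that it behaves like the equivalent -e form. The scratch directory is an assumption; inst1 is the file name used in the text.

```shell
# Recreate the sample table in a scratch directory (illustrative setup).
tmpdir=$(mktemp -d)
cat > "$tmpdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# Store the three instructions in a script file, one per line.
cat > "$tmpdir/inst1" <<'EOF'
1,2p
4,6p
$p
EOF

# Run the instructions from the file, then the same ones given with -e.
from_file=$(sed -n -f "$tmpdir/inst1" "$tmpdir/emp.lst")
from_args=$(sed -n -e '1,2p' -e '4,6p' -e '$p' "$tmpdir/emp.lst")

[ "$from_file" = "$from_args" ] && echo "both forms print the same lines"
echo "$from_file"
```

Lines 1, 2, 4, 5, 6 and the last line are printed, in file order.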

Context Addressing (/{string}/p)

Context addressing lets you specify one or two patterns to locate lines. Each pattern must be bounded by a / on either side.

sed -n '/softsol/p' emp.lst
# output
king:softsol:kolkata:14000
mith:softsol:mumbai:020000

You can also specify a comma-separated pair of context addresses to select a group of lines. Line and context addresses can also be mixed.

Example:

sed -n '/agni/,/king/p' emp.lst
# output
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000

The same output can also be produced by mixing a line address with a context address:

sed -n '2,/king/p' emp.lst

Deleting Lines (d)

sed uses the d (delete) command to delete lines, selected either by line number or by context address.

Example: The below command will delete the third line.

sed '3d' emp.lst

To delete a sequence of lines, use the following command:

sed '3,6d' emp.lst # deletes from third to sixth line

Note:

The syntax is similar to printing, but when deleting, use the d (delete) command and omit the -n option; with -n, nothing would be printed.

We can also delete by context addressing as we did in printing.

sed '/mumbai/d' emp.lst # deletes those lines which contain mumbai 

Note:

We can also specify multiple context addresses using the -e option.
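
For instance, a minimal sketch that deletes the lines matching either of two patterns. The choice of mumbai and pune is only for illustration, as is the scratch directory.

```shell
# Recreate the sample table in a scratch directory (illustrative setup).
tmpdir=$(mktemp -d)
cat > "$tmpdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# Each -e adds one instruction; a line is dropped if either pattern matches.
kept=$(sed -e '/mumbai/d' -e '/pune/d' "$tmpdir/emp.lst")

echo "$kept"
```

Five lines remain: the two mumbai lines and the pune line are removed from the output, while emp.lst itself is unchanged.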

Substitution (s/:/#/)

Substitution is easily the most important feature of sed. It lets you replace a pattern in its input with something else.

Example:

sed 's/:/#/' emp.lst | head -n 3
# output
surya#hcl:chennai:19000
agni#microsoft:banglore:20000
rahul#ibm:hyderabad:15000

Here only the first occurrence of : in each line has been replaced. We need to use the g (global) flag to replace all the colons:

Example:

sed 's/:/#/g' emp.lst|head -n 3
# output             
surya#hcl#chennai#19000
agni#microsoft#banglore#20000
rahul#ibm#hyderabad#15000

It is possible to limit the substitution to certain lines by specifying an address.

Example:

sed '1,3s/:/#/g' emp.lst # First three lines only

Substitution is not restricted to a single character; it can be any string.

Example: The command below replaces the word mumbai with MUMBAI from the fourth line to the end of the file.

sed '4,$s/mumbai/MUMBAI/' emp.lst
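
The address in front of s can also be a context address. As a sketch under that assumption, the following upper-cases mumbai only on lines that contain softsol; the choice of softsol is purely illustrative.

```shell
# Recreate the sample table in a scratch directory (illustrative setup).
tmpdir=$(mktemp -d)
cat > "$tmpdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# Substitute only on lines selected by the context address /softsol/.
result=$(sed '/softsol/s/mumbai/MUMBAI/' "$tmpdir/emp.lst")

echo "$result"
```

The john line also contains mumbai, but it is left alone because it does not match the address.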

Text editing (i\)

sed can insert, append and change text: i (insert), a (append), c (change).

Example: The command below inserts the text before the first line of file1.

sed '1i\
hi how are you
' file1

The address (here 1) specifies the line before which the text is inserted.

Note:

The same form applies to appending and changing text: use a (append) to add text after the addressed line and c (change) to replace the addressed line.
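
A minimal sketch of the a and c commands, written in the same backslash-newline form as the insert example. The three-line file1 and its contents are made up for this illustration.

```shell
# Create a small sample file (hypothetical contents).
tmpdir=$(mktemp -d)
printf 'line one\nline two\nline three\n' > "$tmpdir/file1"

# a (append): add a new line AFTER line 1.
appended=$(sed '1a\
added after line one
' "$tmpdir/file1")

# c (change): REPLACE line 2 with new text.
changed=$(sed '2c\
replacement for line two
' "$tmpdir/file1")

echo "$appended"
echo "---"
echo "$changed"
```

Appending grows the output by one line, while changing keeps the line count the same.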

awk

The awk text-processing language is useful for tasks such as:

  • Tallying information from text files and creating reports from the results.
  • Translating files from one format to another.
  • Creating small databases.
  • Performing mathematical operations on files of numeric data.

awk has two faces: it is a utility for performing simple text-processing tasks, and it is a programming language for performing complex text-processing tasks.

There are, however, things that awk is not. It is not really well suited for extremely large, complicated tasks. It is also an interpreted language; that is, an awk program cannot run on its own and must be executed by the awk utility itself. That means that it is relatively slow, though it is efficient as interpreted languages go, and that the program can only be used on systems that have awk.

One last item before proceeding: what does the name awk mean? awk stands for the names of its authors: Aho, Weinberger, & Kernighan. Kernighan later noted: "Naming a language after its authors ... shows a certain poverty of imagination."

It is easy to use awk from the command line to perform simple operations on text files.

Examples, Syntax and Options

Suppose a file named “coins.txt” describes a coin collection. Each line in the file contains the following information:

metal  weight date country              description

gold     1    1986  USA                 American Eagle
gold     1    1908  Austria-Hungary     Franz Josef 100 Korona
silver  10    1981  USA                 ingot
gold     1    1984  Switzerland         ingot
gold     1    1979  RSA                 Krugerrand
gold     0.5  1981  RSA                 Krugerrand
gold     0.1  1986  PRC                 Panda
silver   1    1986  USA                 Liberty dollar
gold     0.25 1986  USA                 Liberty 5-dollar piece
silver   0.5  1986  USA                 Liberty 50-cent piece
silver   1    1987  USA                 Constitution dollar
gold     0.25 1987  USA                 Constitution 5-dollar piece
gold     1    1988  Canada              Maple Leaf

To list all the gold pieces, invoke awk as follows:

awk '/gold/' coins.txt
# output
gold     1    1986  USA                 American Eagle
gold     1    1908  Austria-Hungary     Franz Josef 100 Korona
gold     1    1984  Switzerland         ingot
gold     1    1979  RSA                 Krugerrand
gold     0.5  1981  RSA                 Krugerrand
gold     0.1  1986  PRC                 Panda
gold     0.25 1986  USA                 Liberty 5-dollar piece
gold     0.25 1987  USA                 Constitution 5-dollar piece
gold     1    1988  Canada              Maple Leaf

This is all very nice, a critic might say, but any “grep” or “find” utility can do the same thing. True, but awk is capable of doing much more.

For example, suppose you want to print only the description field and leave all the other text out. Because a description can span several words, invoke awk as follows:

awk '/gold/ {print $5,$6,$7,$8}' coins.txt
# output
American Eagle
Franz Josef 100 Korona
ingot
Krugerrand
Krugerrand
Panda
Liberty 5-dollar piece
Constitution 5-dollar piece
Maple Leaf

The above example demonstrates the simplest general form of an awk program.

Syntax:

awk '<search pattern> {<program actions>}' file(s)

awk searches through the input file for each line that contains the search pattern. For each line found, awk then performs the specified actions. In this example, the action is specified as:

{print $5,$6,$7,$8}

The purpose of the print statement is obvious. $5, $6, $7 and $8 are fields, or field variables, which store the words in each line of text by their numeric sequence. $1, for example, stores the first word in the line, $2 has the second, and so on. By default, a word is defined as any string of printing characters separated by spaces. The longest description here has four words, so four fields are printed; a field that does not exist on a line is simply empty.

awk’s default program action is to print the entire line, which is what print does when invoked without parameters. This means that the first example,

awk '/gold/' coins.txt

is the same as:

awk '/gold/ {print}' coins.txt

Note that awk recognizes the field variable $0 as representing the entire line, so this could also be written as:

awk '/gold/ {print $0}' coins.txt

Example: The command below prints the first, fourth, and fifth fields of the lines that contain the pattern “RSA”.

awk '/RSA/ {print $1,$4,$5}' coins.txt

All the special characters that can be used with the echo command can also be used with awk print statements, including

  • \n (new line)
  • \t (tab)
  • \b (backspace)
  • \f (formfeed)
  • \r (carriage return)

Example: The command below outputs the fields separated by tabs.

awk '{print $1"\t"$3"\t"$5}' coins.txt

You can search for more than one pattern at a time by listing the alternatives inside the pattern and separating them with a pipe (|) symbol.

Example:

awk '/RSA|USA/' coins.txt
# output
gold     1    1986  USA                 American Eagle
silver  10    1981  USA                 ingot
gold     1    1979  RSA                 Krugerrand
gold     0.5  1981  RSA                 Krugerrand
silver   1    1986  USA                 Liberty dollar
gold     0.25 1986  USA                 Liberty 5-dollar piece
silver   0.5  1986  USA                 Liberty 50-cent piece
silver   1    1987  USA                 Constitution dollar
gold     0.25 1987  USA                 Constitution 5-dollar piece

Example: Consider the following file “emp.lst”

surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000

If you try to print the second field as shown below, awk prints only an empty line for each line of the file.

awk '{print $2}' emp.lst

This is because the field separator here is a colon (:). With the default separator, each line contains no spaces, so the whole line is a single field ($1) and $2 is empty. We have to specify the field separator using the -F option. In the earlier cases, we didn’t specify a field separator because awk treats a space, a tab, or a run of them as the default field separator. The field separator is also called the delimiter.

To get the second field in the above example, use the following:

awk -F ":" '{print $2}' emp.lst

The separator used in the output can be changed within the program by setting the Output Field Separator (OFS) variable. For example, to read the colon-separated file and display it with dashes, the command would be:

awk -F":" '{OFS="-"}{print $1,$2,$4}' emp.lst
# output
surya-hcl-19000
agni-microsoft-20000
rahul-ibm-15000
john-softech-16000
smith-amsol-13000
king-softsol-14000
iran-oracle-017000
mith-softsol-020000
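
Both separators can also be set once in a BEGIN block, which awk runs before reading any input. BEGIN and the FS variable are standard awk features that the text above does not introduce, so treat this as an alternative sketch rather than the form the examples rely on.

```shell
# Recreate the sample table in a scratch directory (illustrative setup).
tmpdir=$(mktemp -d)
cat > "$tmpdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# Form used in the text: -F for input, OFS assigned in an action.
with_flag=$(awk -F":" '{OFS="-"}{print $1,$2,$4}' "$tmpdir/emp.lst")

# Alternative: set the input (FS) and output (OFS) separators once, up front.
with_begin=$(awk 'BEGIN {FS=":"; OFS="-"} {print $1,$2,$4}' "$tmpdir/emp.lst")

[ "$with_flag" = "$with_begin" ] && echo "both forms give the same output"
echo "$with_begin"
```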

awk also provides pre-defined variables, for example:

  • NR - the current input line number
  • NF - number of fields in the input line
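
A short sketch that prints both variables for every line of the table, followed by the first field. The scratch directory is an assumption made for illustration.

```shell
# Recreate the sample table in a scratch directory (illustrative setup).
tmpdir=$(mktemp -d)
cat > "$tmpdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# NR is the current line number; NF is the number of colon-separated fields.
report=$(awk -F":" '{print NR, NF, $1}' "$tmpdir/emp.lst")

echo "$report"
```

Every line of this table has four fields, so NF is 4 throughout while NR counts from 1 to 8.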

Example: The command below prints the first, second and fourth fields of the second through fifth lines, selected with a range.

awk -F ":" 'NR==2,NR==5 {print $1,$2,$4}' emp.lst
# output
agni microsoft 20000
rahul ibm 15000
john softech 16000
smith amsol 13000

Note:

awk's default output field separator is a space. We can change this using the output field separator (OFS). awk is a powerful utility that can operate at both the row (line) level and the field level.

Querying Files Using awk

awk can be used to select data from files by comparing fields with the following relational operators:

<     less than
<=    less than or equal to
==    equal to
!=    not equal to
>=    greater than or equal to
>     greater than

Examples:

awk -F":" '$4>15000 {print $1,$4}' emp.lst
# output
surya 19000
agni 20000
john 16000
iran 017000
mith 020000

awk -F":" '$3=="mumbai" {print $1,$3}' emp.lst
# output
john mumbai
mith mumbai
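
Comparisons can also be combined with the logical operators && (and) and || (or). These are standard awk operators that the examples above do not use. As a minimal sketch, the following applies the salary test only to the first six lines of the table by combining it with a test on NR.

```shell
# Recreate the sample table in a scratch directory (illustrative setup).
tmpdir=$(mktemp -d)
cat > "$tmpdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# Both conditions must hold: line number at most 6 AND salary above 15000.
picked=$(awk -F":" 'NR <= 6 && $4 > 15000 {print $1, $4}' "$tmpdir/emp.lst")

echo "$picked"
```

Lines 7 and 8 are skipped by the NR test even though their salaries exceed 15000.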

Example of more detailed output

awk -F":" 'NR==2,NR==4 {print "employee name is",$1,"salary is",$4}' emp.lst
# output
employee name is agni salary is 20000
employee name is rahul salary is 15000
employee name is john salary is 16000

Output can also be formatted using printf. As in C, the arguments can be enclosed in parentheses, the format text is enclosed in double quotes, and the statement may be terminated with a semicolon.

The above command can be written as:

awk -F":" 'NR==2,NR==4 {printf("employee name is %s and salary is %d\n",$1,$4);}' emp.lst
# output
employee name is agni and salary is 20000
employee name is rahul and salary is 15000
employee name is john and salary is 16000
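
printf format specifiers also accept a field width, which helps line output up in columns. In the sketch below the widths 8 and 6 are arbitrary choices for this table, and the scratch directory is an assumption.

```shell
# Recreate the sample table in a scratch directory (illustrative setup).
tmpdir=$(mktemp -d)
cat > "$tmpdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# %-8s left-justifies the name in 8 columns; %6d right-justifies the salary in 6.
formatted=$(awk -F":" 'NR==2,NR==4 {printf("%-8s %6d\n", $1, $4)}' "$tmpdir/emp.lst")

echo "$formatted"
```

Unlike print, printf adds no newline of its own, which is why the format string ends with \n.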
