Advanced Filters
Advanced filters are used to filter and manipulate data, both on the command line and in shell scripts.
grep
grep is not only one of the most useful commands; mastering it also opens the gates to mastering other tools such as awk, sed and perl. grep scans its input for a pattern and displays the lines containing the pattern, the line numbers, or the filenames where the pattern occurs.
Syntax
grep options pattern filename(s)
grep searches for a pattern in one or more files, or in standard input if no filename is specified. The first argument (barring the options) is the pattern and the remaining arguments are filenames.
Examples: Consider the following “emp.lst” table
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
The command below searches for and prints only those lines that contain the pattern mumbai
grep "mumbai" emp.lst
# output
john:softech:mumbai:16000
mith:softsol:mumbai:020000
Because grep is a filter, it can search its standard input for the pattern, and its standard output can be saved in a file, for example
who | grep mumbai > ff1
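The filter behaviour described above can be tried without a real `who` listing. The following is a minimal sketch: the three sample rows are taken from emp.lst, while the temporary directory and the output file name `matches.txt` are assumptions made for illustration.

```shell
# Sketch: grep reading standard input and saving its output in a file.
workdir=$(mktemp -d)

# No filename is given to grep, so it reads the pipe instead.
printf '%s\n' \
  'john:softech:mumbai:16000' \
  'smith:amsol:pune:13000' \
  'mith:softsol:mumbai:020000' |
  grep "mumbai" > "$workdir/matches.txt"

cat "$workdir/matches.txt"
# john:softech:mumbai:16000
# mith:softsol:mumbai:020000
```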
grep Options with Examples
Ignore Case (-i)
The -i option ignores case when matching the pattern. This locates even the name HYDERABAD using the pattern hyderabad.
grep -i "hyderabad" emp.lst
# output
rahul:ibm:hyderabad:15000
iran:oracle:HYDERABAD:017000
Deleting Lines (-v)
grep has an inverse option (-v), which selects all lines except those containing the pattern. More often than not, when we use grep -v, we also redirect its output to a file as a means of getting rid of unwanted lines.
grep -v "mumbai" emp.lst
# output
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
Display Line Numbers (-n)
The -n option displays the line numbers of the lines containing the pattern, along with the lines themselves.
grep -n "softsol" emp.lst
# output
6:king:softsol:kolkata:14000
8:mith:softsol:mumbai:020000
Count Lines (-c)
The -c option counts the lines containing the pattern. The command below counts the lines containing the pattern, ignoring case.
grep -ic "hyderabad" emp.lst
# output
2
Matching multiple patterns (-e)
grep -e "mumbai" -e "kolkata" -e "pune" emp.lst
# output
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
mith:softsol:mumbai:020000
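The options above can be combined in a single invocation. The sketch below recreates emp.lst in a temporary directory (an assumption, so the example is self-contained) and pairs -i with -n and -v with -c.

```shell
# Sketch: combining grep options on a copy of emp.lst.
workdir=$(mktemp -d)
cat > "$workdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# -i and -n together: case-insensitive match, prefixed with line numbers
numbered=$(grep -in "hyderabad" "$workdir/emp.lst")
echo "$numbered"
# 3:rahul:ibm:hyderabad:15000
# 7:iran:oracle:HYDERABAD:017000

# -v and -c together: count the lines that do NOT contain the pattern
not_mumbai=$(grep -vc "mumbai" "$workdir/emp.lst")
echo "$not_mumbai"
# 6
```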
sed
The Stream Editor
sed is a multipurpose tool which combines the work of several filters. sed operations are straightforward. Each file named on the command line is opened and read in turn. If no files are named, standard input is used; the filename “-” (a single dash) also stands for standard input.
sed reads through each file one line at a time. The line is placed in an area of memory termed the pattern space. This is like a variable in a programming language: an area of memory that can be changed as desired under the direction of the editing commands. All editing operations are applied to the contents of the pattern space. When all operations have been completed, sed prints the final contents of the pattern space to standard output, and then goes back to the beginning, reading another line of input.
Learning sed will prepare you well for perl, which uses many of these features. sed uses instructions to act on text. An instruction combines an address for selecting lines with an action to be taken on them.
Syntax
sed options 'address action' file(s)
The address and action are enclosed within single quotes. Addressing in sed is done in two ways.
- By one or two line numbers (like 3,7).
- By specifying a /-enclosed pattern which occurs in a line (like /From:/).
In the first form, the address specifies either one line number to select a single line or a pair of numbers (3,7) to select a group of contiguous lines. The action component is drawn from sed's family of internal commands. It can be a simple display (print) or an editing function such as insertion, deletion or substitution of text.
sed processes several instructions sequentially. Each instruction operates on the output of the previous instruction. The -e option lets you use multiple instructions, and the -f option takes instructions from a file.
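To see that each instruction works on the result of the previous one, consider the following small sketch. The two substitutions are chosen only for illustration: the second pattern (#hcl) exists only after the first instruction has replaced the first colon.

```shell
# Sketch: instructions are applied to the pattern space in order.
line='surya:hcl:chennai:19000'

# First s turns the first ':' into '#'; the second s then finds '#hcl'.
result=$(echo "$line" | sed -e 's/:/#/' -e 's/#hcl/#HCL/')
echo "$result"
# surya#HCL:chennai:19000

# In the opposite order, '#hcl' does not exist yet when it is searched for.
reversed=$(echo "$line" | sed -e 's/#hcl/#HCL/' -e 's/:/#/')
echo "$reversed"
# surya#hcl:chennai:19000
```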
Let's consider the same “emp.lst” table that was used in the grep section.
sed Options with Examples
To consider line addressing first, the instruction '3q' can be broken down into the address 3 and the action q (quit). When this instruction is enclosed within quotes and followed by one or more filenames, you can simulate head -n 3 in this way:
sed '3q' emp.lst # quits after line no 3
# output
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
print (p)
Use the p (print) command to display lines. By default, sed prints every line in addition to the selected ones, so the selected lines appear twice. To suppress this behavior, use the -n option, and remember to use it whenever you use the p command.
Example: print the second through fourth lines.
sed -n '2,4p' emp.lst
# output
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
Select Last Line ($)
To select the last line of the file, use $:
Example:
sed -n '$p' emp.lst
# output
mith:softsol:mumbai:020000
Negation (!)
sed also has a negation operator (!), which can be used with any action. For instance, selecting the first two lines is the same as not selecting lines 3 through the end.
Example:
sed -n '3,$!p' emp.lst # Don't print lines 3 to end.
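The equivalence stated above can be checked directly. This sketch recreates emp.lst in a temporary directory (an assumption for self-containment) and compares the negated print with the quit command.

```shell
# Sketch: '3,$!p' prints exactly what '2q' prints.
workdir=$(mktemp -d)
cat > "$workdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

negated=$(sed -n '3,$!p' "$workdir/emp.lst")   # print lines NOT in 3..end
quit_early=$(sed '2q' "$workdir/emp.lst")      # quit after line 2

echo "$negated"
# surya:hcl:chennai:19000
# agni:microsoft:banglore:20000
```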
Multiple Instructions (-e and -f)
The -e option allows you to enter as many instructions as you wish, each preceded by the option.
Example:
sed -n -e '1,2p' -e '4,6p' -e '$p' emp.lst
Pass Commands From File
When there are too many instructions to type, or when there is a set of common instructions that you execute often, they are better stored in a file.
Example:
cat inst1
# output
1,2p
4,6p
$p
You can now use the -f option to direct sed to take its instructions from the file.
Example:
sed -n -f inst1 emp.lst
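A runnable version of the two approaches is sketched below. It writes emp.lst and the instruction file inst1 into a temporary directory (an assumption for illustration) and checks that -f and repeated -e give the same result.

```shell
# Sketch: the same instructions given with -e and with -f.
workdir=$(mktemp -d)
cat > "$workdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# Quoted 'EOF' keeps the shell from expanding $p inside the here-document.
cat > "$workdir/inst1" <<'EOF'
1,2p
4,6p
$p
EOF

from_file=$(sed -n -f "$workdir/inst1" "$workdir/emp.lst")
from_options=$(sed -n -e '1,2p' -e '4,6p' -e '$p' "$workdir/emp.lst")

echo "$from_file"
# surya:hcl:chennai:19000
# agni:microsoft:banglore:20000
# john:softech:mumbai:16000
# smith:amsol:pune:13000
# king:softsol:kolkata:14000
# mith:softsol:mumbai:020000
```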
Context Addressing (/{string}/p)
Context addressing lets you specify one or two patterns to locate lines. Each pattern must be bounded by a / on either side.
sed -n '/softsol/p' emp.lst
# output
king:softsol:kolkata:14000
mith:softsol:mumbai:020000
You can also specify a comma-separated pair of context addresses to select a group of lines. Line and context addresses can also be mixed.
Example:
sed -n '/agni/,/king/p' emp.lst
# output
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
The same output is produced by the following command, which mixes a line address with a context address
sed -n '2,/king/p' emp.lst
Deleting Lines (d)
sed uses the d (delete) command to delete lines, either by line number or by context addressing.
Example: The command below deletes the third line.
sed '3d' emp.lst
To delete a sequence of lines, use the following command
sed '3,6d' emp.lst # deletes from third to sixth line
Note:
The syntax is similar to printing, but when deleting, don't use the -n option; use the d (delete) command instead of p.
We can also delete by context addressing, as we did in printing.
sed '/mumbai/d' emp.lst # deletes those lines which contain mumbai
Note:
We can also specify multiple context addresses using the -e option.
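As an illustration of the note above, the following sketch deletes lines matching either of two patterns by giving one -e instruction per context address. The choice of mumbai and pune, and the temporary copy of emp.lst, are assumptions made for the example.

```shell
# Sketch: deleting with several context addresses, one per -e option.
workdir=$(mktemp -d)
cat > "$workdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

remaining=$(sed -e '/mumbai/d' -e '/pune/d' "$workdir/emp.lst")
echo "$remaining"
# surya:hcl:chennai:19000
# agni:microsoft:banglore:20000
# rahul:ibm:hyderabad:15000
# king:softsol:kolkata:14000
# iran:oracle:HYDERABAD:017000
```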
Substitution (s/:/#/)
Substitution is easily the most important feature of sed. It lets you replace a pattern in its input with something else.
Example:
sed 's/:/#/' emp.lst | head -n 3
# output
surya#hcl:chennai:19000
agni#microsoft:banglore:20000
rahul#ibm:hyderabad:15000
Here only the first occurrence of : in each line has been replaced. We need to use the g (global) flag to replace all the colons.
Example:
sed 's/:/#/g' emp.lst|head -n 3
# output
surya#hcl#chennai#19000
agni#microsoft#banglore#20000
rahul#ibm#hyderabad#15000
Substitution can be limited to a range of lines by specifying an address
Example:
sed '1,3s/:/#/g' emp.lst # First three lines only
Substitution is not restricted to a single character; it can be any string.
Example: The command below replaces the word mumbai with MUMBAI from the fourth line to the end of the file.
sed '4,$s/mumbai/MUMBAI/' emp.lst
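The effect of an address on substitution is easiest to see at the edge of the range. The sketch below, run against a temporary copy of emp.lst (an assumption for self-containment), shows the lines changed by the addressed substitutions above.

```shell
# Sketch: addressed substitutions only touch lines inside the address range.
workdir=$(mktemp -d)
cat > "$workdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

# Lines 4 to end: both mumbai rows fall in the range and are rewritten.
upper=$(sed '4,$s/mumbai/MUMBAI/' "$workdir/emp.lst" | grep "MUMBAI")
echo "$upper"
# john:softech:MUMBAI:16000
# mith:softsol:MUMBAI:020000

# Lines 1 to 3 only: line 3 is rewritten, line 4 is left alone.
boundary=$(sed '1,3s/:/#/g' "$workdir/emp.lst" | sed -n '3,4p')
echo "$boundary"
# rahul#ibm#hyderabad#15000
# john:softech:mumbai:16000
```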
Text editing (i\)
sed can insert, append and change text in a file, using the commands i (insert), a (append) and c (change).
Example: The command below inserts the text before the first line.
sed '1i\
hi how are you' file1 # inserts this text before line 1
The line number in the address specifies the line at which the text is inserted.
Note:
The same form applies to appending and changing text. Use a (append) to add text after a line and c (change) to replace a line.
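The a and c commands take the same backslash-newline form as i. The following is a minimal sketch; the three-line file1 and its contents are assumptions made for illustration.

```shell
# Sketch: append after the last line and change the second line.
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$workdir/file1"

# $ addresses the last line; a adds the text after it.
appended=$(sed '$a\
end of file' "$workdir/file1")
echo "$appended"
# one
# two
# three
# end of file

# c replaces the whole addressed line with the given text.
changed=$(sed '2c\
TWO' "$workdir/file1")
echo "$changed"
# one
# TWO
# three
```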
awk
The awk text-processing language is useful for such tasks as:
- Tallying information from text files and creating reports from the results.
- Translating files from one format to another.
- Creating small databases.
- Performing mathematical operations on files of numeric data.
awk has two faces: it is a utility for performing simple text-processing tasks, and it is a programming language for performing complex text-processing tasks.
There are, however, things that awk is not. It is not really well suited for extremely large, complicated tasks. It is also an interpreted language; that is, an awk program cannot run on its own, it must be executed by the awk utility itself. That means that it is relatively slow, though it is efficient as interpreted languages go, and that the program can only be used on systems that have awk.
One last item before proceeding: what does the name awk mean? awk stands for the names of its authors: Aho, Weinberger, & Kernighan. Kernighan later noted: "Naming a language after its authors ... shows a certain poverty of imagination."
It is easy to use awk from the command line to perform simple operations on text files.
Examples, Syntax and Options
Suppose a file named “coins.txt” describes a coin collection. Each line in the file contains the following information:
metal weight date country description
gold 1 1986 USA American Eagle
gold 1 1908 Austria-Hungary Franz Josef 100 Korona
silver 10 1981 USA ingot
gold 1 1984 Switzerland ingot
gold 1 1979 RSA Krugerrand
gold 0.5 1981 RSA Krugerrand
gold 0.1 1986 PRC Panda
silver 1 1986 USA Liberty dollar
gold 0.25 1986 USA Liberty 5-dollar piece
silver 0.5 1986 USA Liberty 50-cent piece
silver 1 1987 USA Constitution dollar
gold 0.25 1987 USA Constitution 5-dollar piece
gold 1 1988 Canada Maple Leaf
To list all the gold pieces, invoke awk as follows
awk '/gold/' coins.txt
# output
gold 1 1986 USA American Eagle
gold 1 1908 Austria-Hungary Franz Josef 100 Korona
gold 1 1984 Switzerland ingot
gold 1 1979 RSA Krugerrand
gold 0.5 1981 RSA Krugerrand
gold 0.1 1986 PRC Panda
gold 0.25 1986 USA Liberty 5-dollar piece
gold 0.25 1987 USA Constitution 5-dollar piece
gold 1 1988 Canada Maple Leaf
This is all very nice, a critic might say, but any “grep” or “find” utility can do the same thing. True, but awk is capable of doing much more.
For example, suppose we want to print the description field and leave all the other text out. Invoke awk as follows
awk '/gold/ {print $5}' coins.txt
# output
American
Franz
ingot
Krugerrand
Krugerrand
Panda
Liberty
Constitution
Maple
Because awk splits each line into fields on whitespace, $5 holds only the first word of a multi-word description.
The above example demonstrates the simplest general form of an awk program.
Syntax:
awk <search pattern> {<program actions>}
awk searches through the input file for each line that contains the search pattern. For each of these lines found, awk then performs the specified actions. In this example, the action is specified as:
{print $5}
The purpose of the print statement is obvious. $5 is a field, or field variable; field variables store the words in each line of text by their numeric sequence. $1, for example, stores the first word in the line, $2 the second, and so on. By default, a word is defined as any string of printing characters separated by spaces.
awk's default program action is to print the entire line, which is what print does when invoked without parameters. This means that the first example,
awk '/gold/' coins.txt
is the same as
awk '/gold/ {print}' coins.txt
Note that awk recognizes the field variable $0 as representing the entire line, so this could also be written as:
awk '/gold/ {print $0}' coins.txt
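Since fields are split on whitespace, a multi-word description in coins.txt occupies $5 through the last field. The sketch below contrasts printing $5 alone with joining fields 5 onward in a loop. It uses three rows of coins.txt in a temporary file, and relies on the built-in variable NF (the number of fields in the current line); the variable names `desc` and `i` are arbitrary.

```shell
# Sketch: $5 is one word; fields 5..NF together form the description.
workdir=$(mktemp -d)
cat > "$workdir/coins.txt" <<'EOF'
gold 1 1986 USA American Eagle
gold 1 1908 Austria-Hungary Franz Josef 100 Korona
silver 10 1981 USA ingot
EOF

first_words=$(awk '/gold/ {print $5}' "$workdir/coins.txt")
echo "$first_words"
# American
# Franz

descriptions=$(awk '/gold/ {
  desc = $5
  for (i = 6; i <= NF; i++)   # append the remaining words, space separated
    desc = desc " " $i
  print desc
}' "$workdir/coins.txt")
echo "$descriptions"
# American Eagle
# Franz Josef 100 Korona
```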
Example: The command below prints the first, fourth and fifth fields of the lines that contain the pattern “RSA”.
awk '/RSA/ {print $1,$4,$5}' coins.txt
All the special characters that can be used with the echo command can also be used in awk print statements, including
- \n (new line)
- \t (tab)
- \b (backspace)
- \f (formfeed)
- \r (carriage return)
Example: The command below outputs the fields separated by tabs.
awk '{print $1"\t"$3"\t"$5}' coins.txt
You can search for more than one pattern at a time by placing the alternatives inside the pattern and separating them with a pipe (|) symbol.
Example:
awk '/RSA|USA/' coins.txt
# output
gold 1 1986 USA American Eagle
silver 10 1981 USA ingot
gold 1 1979 RSA Krugerrand
gold 0.5 1981 RSA Krugerrand
silver 1 1986 USA Liberty dollar
gold 0.25 1986 USA Liberty 5-dollar piece
silver 0.5 1986 USA Liberty 50-cent piece
silver 1 1987 USA Constitution dollar
gold 0.25 1987 USA Constitution 5-dollar piece
Example: Consider the following file “emp.lst”
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
If you attempt to print the second field as below, awk prints only empty lines.
awk '{print $2}' emp.lst
This is because the field separator here is a colon (:), so each line contains no whitespace and is treated as a single field. We have to specify the field separator using the -F option. In the earlier cases we didn't specify a field separator because awk treats spaces and tabs (including runs of them) as the default field separator. The field separator is also called a delimiter.
To get the second field in the above example, use the following
awk -F ":" '{print $2}' emp.lst
The separator used in the output can be changed within the program by using the Output Field Separator (OFS) variable. For example, to read the file (separated by colons) and display it with dashes, the command would be
awk -F":" '{OFS="-"}{print $1,$2,$4}' emp.lst
# output
surya-hcl-19000
agni-microsoft-20000
rahul-ibm-15000
john-softech-16000
smith-amsol-13000
king-softsol-14000
iran-oracle-017000
mith-softsol-020000
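An alternative way to write the same command is to set both separators once in a BEGIN block, which awk runs before reading any input. The sketch below assumes a temporary two-line copy of emp.lst. It also shows that OFS is inserted only where print arguments are separated by commas; adjacent expressions are simply concatenated.

```shell
# Sketch: setting FS and OFS in a BEGIN block.
workdir=$(mktemp -d)
cat > "$workdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
EOF

dashed=$(awk 'BEGIN { FS = ":"; OFS = "-" } { print $1, $2, $4 }' "$workdir/emp.lst")
echo "$dashed"
# surya-hcl-19000
# agni-microsoft-20000

# Without commas the fields are concatenated and OFS is not used.
glued=$(awk -F ":" '{ print $1 $2 }' "$workdir/emp.lst")
echo "$glued"
# suryahcl
# agnimicrosoft
```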
Pre-defined variables, for example:
- NR - the current input line number
- NF - number of fields in the input line
Example: The command below prints the first, second and fourth fields of the second through fifth lines (a range).
awk -F ":" 'NR==2,NR==5{print $1,$2,$4}' emp.lst
# output
agni microsoft 20000
rahul ibm 15000
john softech 16000
smith amsol 13000
Note:
awk's default output field separator is a space. We can change this using the output field separator (OFS). awk is a powerful utility that operates at both the row level and the field level.
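The pre-defined variables NR and NF can be printed like any other value, which makes the row-level and field-level view concrete. In this sketch (using a temporary copy of emp.lst, an assumption for self-containment), $NF refers to the last field of each line because NF holds the field count.

```shell
# Sketch: line number, field count and last field of every line.
workdir=$(mktemp -d)
cat > "$workdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

summary=$(awk -F ":" '{ print NR, NF, $NF }' "$workdir/emp.lst")
echo "$summary"
# 1 4 19000
# 2 4 20000
# 3 4 15000
# 4 4 16000
# 5 4 13000
# 6 4 14000
# 7 4 017000
# 8 4 020000
```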
Querying Files Using awk
awk can select data from files using the following relational operators
< less than
<= less than or equal to
== equal to
!= not equal
>= greater than or equal to
> greater than
Examples:
awk -F":" '$4>15000{print $1,$4}' emp.lst
# output
surya 19000
agni 20000
john 16000
iran 017000
mith 020000
awk -F":" '$3=="mumbai"{print $1,$3}' emp.lst
# output
john mumbai
mith mumbai
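Conditions like these can also be combined. The sketch below uses awk's logical AND operator (&&), which is not covered in the list above, to require both a salary above 15000 and the city mumbai. It also shows that string comparison with == is case sensitive. The temporary copy of emp.lst is an assumption for self-containment.

```shell
# Sketch: combining two relational tests, and a case-sensitive comparison.
workdir=$(mktemp -d)
cat > "$workdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
smith:amsol:pune:13000
king:softsol:kolkata:14000
iran:oracle:HYDERABAD:017000
mith:softsol:mumbai:020000
EOF

high_mumbai=$(awk -F ":" '$4 > 15000 && $3 == "mumbai" { print $1, $4 }' "$workdir/emp.lst")
echo "$high_mumbai"
# john 16000
# mith 020000

# == compares strings exactly, so HYDERABAD does not match hyderabad.
hyd=$(awk -F ":" '$3 == "hyderabad" { print $1 }' "$workdir/emp.lst")
echo "$hyd"
# rahul
```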
Example of more detailed output
awk -F":" 'NR==2,NR==4{print "employee name is",$1,"salary is",$4}' emp.lst
# output
employee name is agni salary is 20000
employee name is rahul salary is 15000
employee name is john salary is 16000
Formatting output using printf: as in C, parentheses enclose the arguments, the format text is enclosed in double quotes, and the statement may be terminated with a semicolon.
The above command can be written as
awk -F":" 'NR==2,NR==4{printf("employee name is %s and salary is %d\n",$1,$4);}' emp.lst
# output
employee name is agni and salary is 20000
employee name is rahul and salary is 15000
employee name is john and salary is 16000
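printf also accepts C-style width specifiers, which helps when columns should line up. The following sketch is one possible layout, not the only one: the widths 8 and 6 are arbitrary choices, with %-8s left-aligning the name in eight columns and %6d right-aligning the salary in six. It runs on a temporary four-line copy of emp.lst.

```shell
# Sketch: aligned columns with printf width specifiers.
workdir=$(mktemp -d)
cat > "$workdir/emp.lst" <<'EOF'
surya:hcl:chennai:19000
agni:microsoft:banglore:20000
rahul:ibm:hyderabad:15000
john:softech:mumbai:16000
EOF

report=$(awk -F ":" 'NR==2,NR==4 { printf("%-8s %6d\n", $1, $4) }' "$workdir/emp.lst")
echo "$report"
# agni      20000
# rahul     15000
# john      16000
```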