Understanding awk command with examples


Awk is a simple and powerful programming language used for processing text files. It is widely used by system administrators as a data manipulation, extraction and reporting tool. Awk treats every file as a sequence of records and by default each records are separated by a new line. Each record consist of one or more fields which are by default separated by a space.

The syntax for awk command is

awk [option] condition {action} file_name
i.e awk will perform an action on a text file based on a condition provided. But this not always necessary. We can omit either the condition or the action from the syntax depending upon  our requirement. Before starting this awk tutorial lets create a file named bill.csv. The contents of the file is displayed below
cat bill.csv
Sample Output:
apple    fruit   2    200
banana   fruit   1    50
orange   fruit   3    240
onion    veg     2    50
potato   veg     1    28
By observing the above output we can say that the file consist of 5 lines and each line contains 4 fields separated by a space. In other words we can say that there are a total of 5 records and 4 fields per record. Each field in the input record may be referenced by its position: $1, $2, and so on. Where $1 represents the first field, $2 represents the second field and so on. The whole record is represented by $0.


print  and printf command

To display the output text using awk we use the print and printf commands. print $n will print the nth  field of each record in an  input file. So print $1 will display the first  field of each record, print $2 will display the second  field of each record and so on. Now lets run our first awk command to view the second field of the each record by executing
awk '{print $2}' bill.csv
Sample output:
fruit
fruit
fruit
veg
veg
The  awk command above will print the second field in each record line by line so that it will appear that awk command has extracted the second column from the input file. Using print command we can  further customize the outputs in more meaningful ways like the example below
awk '{print "Cost of " $3 "Kg of " $1 " is " $4}' bill.csv
Sample Output:
Cost of 2Kg of apple is 200
Cost of 1Kg of banana is 50
Cost of 3Kg of orange is 240
Cost of 2Kg of onion is 50
Cost of 1Kg of potato is 28
In the above example awk command in each line will print whatever is written inside the double quotes along with the $n data. One thing you have to remember is $n should be written outside the double quotes else print command will consider it as a normal string and will print as it is.
The same output can be displayed with printf command in a much more easier method.To get a similar output with printf command, execute
awk '{printf "Cost of %dKg of %s is %d\n",$3,$1,$4 }' bill.csv
Cost of 2Kg of apple is 200
Cost of 1Kg of banana is 50
Cost of 3Kg of orange is 240
Cost of 2Kg of onion is 50
Cost of 1Kg of potato is 28
where %d and %s are formats used by printf command to represent decimals and strings. The formats are written along with the statement inside double quotes and their corresponding $n values are written outside the double quotes in the order of appearance.  Please check the man pages for more printf formats

BEGIN and END Patterns

BEGIN and END patterns are special awk patterns which are not processed against the input. The statements written inside the BEGIN rule are executed before input file is processed. Similarly the statements written inside the END rules are executed only after processing the input file. Lets see how useful these patterns are
We saw how to print a particular column of a file with awk command and we also know that $0 will print the fields in a row. ie to display the entire content of bill.csv file
awk '{print $0}' bill.csv
Sample Output:
apple    fruit   2    200
banana   fruit   1    50
orange   fruit   3    240
onion    veg     2    50
potato   veg     1    28
Using BEGIN and END pattern we can format the output with additional details. To do so I am going to add a BEGIN rule and another END rule to the above command
awk 'BEGIN{print "Item\tType\tQty\tAmount\n"} { print $0} END {print "\n**Thank You**"}' bill.csv
Sample Output:
Item    Type  Qty  Amount

apple   fruit  2   200
banana  fruit  1   50
orange  fruit  3   240
onion   veg    2   50
potato  veg    1   28

**Thank You**
The output shows each column now has a heading and there is a message at the bottom. By examining the output we can say that column headings are the sentences written inside the BEGIN rule whereas the footer messages are sentences written inside the END rule. Here the BEGIN rule executed whatever was written inside it before processing the test file. After that awk processed the file and $0 output was printed. Once the file was completely processed, awk executed the statement inside the END rule.


Searching patterns with awk

awk can be used to search pattern. To display the lines containing  fruit, execute
awk /fruit/ bill.csv
Sample Output:
apple  fruit 2 200
banana fruit 1 50
orange fruit 3 240

awk built-in variables

awk  contains several built-in variables. Some of the most widely used variables are

NR

NR keeps a count total number of input records. By default two records are separated by a new line and and can be used to count line numbers. 
awk '{printf "%d\t%s\n",NR,$0}' bill.csv
Sample Output:
1   apple    fruit   2    200
2   banana   fruit   1    50
3   orange   fruit   3    240
4   onion    veg     2    50
5   potato   veg     1    28
In the above example awk first prints count of input record followed by a tab space and the whole records line by line. NR count is 1 in line one, 2 in line two and so on.

NF

NF is the built-in variable which keeps a count of total fields in a  record. The below awk command prints the number of fields per record at the end of each line.
awk '{printf "%s\tNumber of field = %d\n",$0,NF}' bill.csv
Sample Output:
apple    fruit   2    200  Number of field = 4  
banana   fruit   1    50   Number of field = 4
orange   fruit   3    240  Number of field = 4
onion    veg     2    50   Number of field = 4
potato   veg     1    28   Number of field = 4
In the above example awk first prints first record , which is then followed by a tab space and the number of fields in the first line. The process continues till all the records are processed.


FS

awk command by defaullt considers white space (space and tab) as the character that separates two fields. But there are cases when two fields are separated by characters other than white space. In such situations we use awk built-in variable FS to change the default Field Separator. For example, take the case of /etc/passwd file. The file contains the user account details. The file contains one record per line and each record contains 7 fields separated by a colon (:). The first field represents the account name. Lets see the below example where we are trying to view the first column (user account) of /etc/passwd file with default field separator.
awk '{print $1}' /etc/passwd
Sample Output:
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
In the above output you can see instead printing first field the whole record is printed. This is because awk  considers each record as a single field since the field seperator in the case was white space. Now lets change the field separator to colon (:) and lets see what happens.

awk 'BEGIN { FS = ":" } {print $1}' /etc/passwd
Sample Output:
root
bin
daemon
adm
lp
sync
shutdown
halt
mail
uucp
In the above example before processing the file we changed the fields separator value from space to colon ( : ) so that two fields are now separated by a colon instead of white spaces.


RS

RS is the built-in variable which keeps the current record separator character. By default the  record separator character is a new line (\n). RS variable can be used to change the default record separator from new line to any other character. Before moving towards the example lets add an empty line below the line starting with orange in bill.csv file, so that file's content will be
apple    fruit   2    200
banana   fruit   1    50
orange   fruit   3    240

onion    veg     2    50
potato   veg     1    28
Now lets execute an awk command to print whole records and display the total record count at the END
awk '{print $0} END {printf "Total %d records",NR}' bill.csv
Sample Output:
apple    fruit   2   200
banana   fruit   1    50
orange   fruit   3    240

onion    veg     2    50
potato   veg     1    28

Total 6 records
In the above example awk counted the number of records in the file as 6 including the empty line. Now lets change the record separator from a single newline (\n) to double  line (\n\n) and count the number of records.
awk 'BEGIN { RS = "\n\n" } {print $0} END {printf "Total %d records",NR}' bill.csv
Sample Output:
apple    fruit   2   200
banana   fruit   1    50
orange   fruit   3    240
onion    veg     2    50
potato   veg     1    28

Total 2 records

OFS

OFS is the built-in awk variable which keeps the current output field separator character.  the default output field separator character is a space. OFS can be used to change the default field separator character from space to any other character when awk prints them. For example, to print the first and fourth column of bill.csv file, execute

awk '{print $1,$4 }' bill.csv
Sample Output:
apple 200
banana 50
orange 240
onion 50
potato 28
awk prints the first and fourth column of bill.csv file. In this case both fields of the output is separated by a space. To replace the current field separator in the awk output with an equal to sign (=),execute
awk 'BEGIN { OFS = "=" } {print $1,$4 }' bill.csv
Sample Output:
apple=200
banana=50
orange=240
onion=50
potato=28


ORS

ORS is the built-in awk variable which keeps the current output record separator character.  the default output record separator character is a new line. ORS can be used to change the default field separator character from new line to any other character when awk prints them. In the below example lets see how to add an empty line at the end of each record.
awk 'BEGIN { ORS = "\n\n" } {print $0 }' bill.csv
Sample Output:
apple    fruit   2    200

banana   fruit   1    50

orange   fruit   3    240

onion    veg     2    50

potato   veg     1    28
From the output we can see that after changing the ORS character, two records are now separated by double line instead of a single new line. 

For more details about awk please refer the awk man pages.

Comments

Popular posts from this blog

what is an inode?

Understanding sed command with example -Part 1