Understanding awk command with examples
Awk is a simple and powerful programming language used for processing text files. It is widely used by system administrators as a data manipulation, extraction and reporting tool. Awk treats every file as a sequence of records and by default each records are separated by a new line. Each record consist of one or more fields which are by default separated by a space.
The syntax for awk command is
awk [option] condition {action} file_name
cat bill.csv
Sample Output:
apple fruit 2 200
banana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
banana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
print and printf command
To display the output text using awk we use the print and printf commands. print $n will print the nth field of each record in an input file. So print $1 will display the first field of each record, print $2 will display the second field of each record and so on. Now lets run our first awk command to view the second field of the each record by executing
awk '{print $2}' bill.csv
Sample output:
fruit
fruit
fruit
veg
veg
fruit
fruit
veg
veg
awk '{print "Cost of " $3 "Kg of " $1 " is " $4}' bill.csv
Sample Output:
Cost of 2Kg of apple is 200
Cost of 1Kg of banana is 50
Cost of 3Kg of orange is 240
Cost of 2Kg of onion is 50
Cost of 1Kg of potato is 28
In the above example awk command in each line will print whatever is written inside the double quotes along with the $n data. One thing you have to remember is $n should be written outside the double quotes else print command will consider it as a normal string and will print as it is.Cost of 1Kg of banana is 50
Cost of 3Kg of orange is 240
Cost of 2Kg of onion is 50
Cost of 1Kg of potato is 28
The same output can be displayed with printf command in a much more easier method.To get a similar output with printf command, execute
awk '{printf "Cost of %dKg of %s is %d\n",$3,$1,$4 }' bill.csv
Cost of 2Kg of apple is 200
Cost of 1Kg of banana is 50
Cost of 3Kg of orange is 240
Cost of 2Kg of onion is 50
Cost of 1Kg of potato is 28
where %d and %s are formats used by printf command to represent decimals and strings. The formats are written along with the statement inside double quotes and their corresponding $n values are written outside the double quotes in the order of appearance. Please check the man pages for more printf formatsCost of 1Kg of banana is 50
Cost of 3Kg of orange is 240
Cost of 2Kg of onion is 50
Cost of 1Kg of potato is 28
BEGIN and END Patterns
BEGIN and END patterns are special awk patterns which are not processed against the input. The statements written inside the BEGIN rule are executed before input file is processed. Similarly the statements written inside the END rules are executed only after processing the input file. Lets see how useful these patterns are
We saw how to print a particular column of a file with awk command and we also know that $0 will print the fields in a row. ie to display the entire content of bill.csv file
awk '{print $0}' bill.csv
Sample Output:
apple fruit 2 200
banana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
Using BEGIN and END pattern we can format the output with additional details. To do so I am going to add a BEGIN rule and another END rule to the above commandbanana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
awk 'BEGIN{print "Item\tType\tQty\tAmount\n"} { print $0} END {print "\n**Thank You**"}' bill.csv
Sample Output:
Item Type Qty Amount
apple fruit 2 200
banana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
**Thank You**
The output shows each column now has a heading and there is a message at the bottom. By examining the output we can say that column headings are the sentences written inside the BEGIN rule whereas the footer messages are sentences written inside the END rule. Here the BEGIN rule executed whatever was written inside it before processing the test file. After that awk processed the file and $0 output was printed. Once the file was completely processed, awk executed the statement inside the END rule.apple fruit 2 200
banana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
**Thank You**
Searching patterns with awk
awk can be used to search pattern. To display the lines containing fruit, execute
awk /fruit/ bill.csv
Sample Output:
apple fruit 2 200
banana fruit 1 50
orange fruit 3 240
banana fruit 1 50
orange fruit 3 240
awk built-in variables
awk contains several built-in variables. Some of the most widely used variables are
NR
NR keeps a count total number of input records. By default two records are separated by a new line and and can be used to count line numbers.
awk '{printf "%d\t%s\n",NR,$0}' bill.csv
Sample Output:
1 apple fruit 2 200
2 banana fruit 1 50
3 orange fruit 3 240
4 onion veg 2 50
5 potato veg 1 28
2 banana fruit 1 50
3 orange fruit 3 240
4 onion veg 2 50
5 potato veg 1 28
NF
NF is the built-in variable which keeps a count of total fields in a record. The below awk command prints the number of fields per record at the end of each line.
awk '{printf "%s\tNumber of field = %d\n",$0,NF}' bill.csv
Sample Output:
apple fruit 2 200 Number of field = 4
banana fruit 1 50 Number of field = 4
orange fruit 3 240 Number of field = 4
onion veg 2 50 Number of field = 4
potato veg 1 28 Number of field = 4
banana fruit 1 50 Number of field = 4
orange fruit 3 240 Number of field = 4
onion veg 2 50 Number of field = 4
potato veg 1 28 Number of field = 4
FS
awk command by defaullt considers white space (space and tab) as the character that separates two fields. But there are cases when two fields are separated by characters other than white space. In such situations we use awk built-in variable FS to change the default Field Separator. For example, take the case of /etc/passwd file. The file contains the user account details. The file contains one record per line and each record contains 7 fields separated by a colon (:). The first field represents the account name. Lets see the below example where we are trying to view the first column (user account) of /etc/passwd file with default field separator.
awk '{print $1}' /etc/passwd
Sample Output:
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
awk 'BEGIN { FS = ":" } {print $1}' /etc/passwd
Sample Output:
root
bin
daemon
adm
lp
sync
shutdown
halt
mail
uucp
bin
daemon
adm
lp
sync
shutdown
halt
uucp
RS
RS is the built-in variable which keeps the current record separator character. By default the record separator character is a new line (\n). RS variable can be used to change the default record separator from new line to any other character. Before moving towards the example lets add an empty line below the line starting with orange in bill.csv file, so that file's content will be
apple fruit 2 200
banana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
Now lets execute an awk command to print whole records and display the total record count at the END
banana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
awk '{print $0} END {printf "Total %d records",NR}' bill.csv
Sample Output:
apple fruit 2 200
banana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
Total 6 records
banana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
Total 6 records
awk 'BEGIN { RS = "\n\n" } {print $0} END {printf "Total %d records",NR}' bill.csv
Sample Output:
apple fruit 2 200
banana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
Total 2 records
banana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
Total 2 records
OFS
OFS is the built-in awk variable which keeps the current output field separator character. the default output field separator character is a space. OFS can be used to change the default field separator character from space to any other character when awk prints them. For example, to print the first and fourth column of bill.csv file, execute
awk '{print $1,$4 }' bill.csv
Sample Output:
apple 200
banana 50
orange 240
onion 50
potato 28
awk prints the first and fourth column of bill.csv file. In this case both fields of the output is separated by a space. To replace the current field separator in the awk output with an equal to sign (=),execute
banana 50
orange 240
onion 50
potato 28
awk 'BEGIN { OFS = "=" } {print $1,$4 }' bill.csv
Sample Output:
apple=200
banana=50
orange=240
onion=50
potato=28
banana=50
orange=240
onion=50
potato=28
ORS
awk 'BEGIN { ORS = "\n\n" } {print $0 }' bill.csv
Sample Output:
apple fruit 2 200
banana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
From the output we can see that after changing the ORS character, two records are now separated by double line instead of a single new line. banana fruit 1 50
orange fruit 3 240
onion veg 2 50
potato veg 1 28
For more details about awk please refer the awk man pages.
Comments
Post a Comment