awk - 10 examples to read files with multiple delimiters

In this article of awk series, we will see how to use awk to read or parse text or CSV files containing multiple delimiters or repeating delimiters. Also, we will discuss about some peculiar delimiters and how to handle them using awk.

Let us consider a sample file. This colon separated file contains item, purchase year and a set of prices separated by a semicolon.

$ cat file
Item1:2010:10;20;30
Item2:2012:12;29;19
Item3:2014:15;50;61


1. To print the 3rd column which contains the prices:

$ awk -F: '{print $3}' file
10;20;30
12;29;19
15;50;61

This is straight forward. By specifying colon(:) in the option with -F, the 3rd column can be retrieved using the $3 variable.

2. To print the 1st component of $3 alone:

$ awk -F '[:;]' '{print $4}' file
20
29
50

What did we do here? Specified multiple delimiters, one is : and other is ; . How awk parses the file? Its simple. First, it looks at the delimiters which is colon(:) and semi-colon(;). This means, while reading the line, as and when the delimiter : or ; is encountered, store the part read in $1. Continue further. Again on encountering one of the delimiters, store the read part in $2. And this continues till the end of the line is reached. In this way, $4 contained the first part of the price component above. Note: Always keep in mind. While specifying multiple delimiters, it has to be specified inside square brackets( [;:] ).
3. To sum the individual components of the 3rd column and print it:

$ awk -F '[;:]' '{$3=$3+$4+$5;print $1,$2,$3}' OFS=: file
Item1:2010:60
Item2:2012:60
Item3:2014:126

The individual components of the price($3) column are available in $3, $4 and $5. Simply, sum them up and store in $3, and print all the variables. OFS (output field separator) is used to specify the delimiter while printing the output. Note: If we do not use the OFS, awk will print the fields using the default output delimiter which is space.
4. Un-group or re-group every record depending on the price column:

$ awk -F '[;:]' '{for(i=3;i<=5;i++){print $1,$2,$i;}}' OFS=":" file
Item1:2010:10
Item1:2010:20
Item1:2010:30
Item2:2012:12
Item2:2012:29
Item2:2012:19
Item3:2014:15
Item3:2014:50
Item3:2014:61

The requirement here is: New records have to be created for every component of the price column. Simply, a loop is run on from columns 3 to 5, and every time a record is framed using the price component.

5-6. Read file in which the delimiter is square brackets:

$ cat file
123;abc[202];124
125;abc[203];124
127;abc[204];124

5. To print the value present within the brackets:

$ awk -F '[][]' '{print $2}' file
202
203
204

At the first sight, the delimiter used in the above command might be confusing. Its simple. 2 delimiters are to be used in this case: One is [ and the other is ]. Since the delimiters itself is square brackets which is to be placed within the square brackets, it looks tricky at the first instance.
Note: If square brackets are delimiters, it should be put in this way only, meaning first ] followed by [. Using the delimiter like -F '[[]]' will give a different interpretation altogether.
6. To print the first value, the value within brackets, and the last value:

$ awk -F '[][;]' '{print $1,$3,$5}' OFS=";" file
123;202;124
125;203;124
127;204;124

3 delimiters are used in this case with semi-colon also included.

7-8. Read or parse a file containing a series of delimiters:

$ cat file
123;;;202;;;203
124;;;213;;;203
125;;;222;;;203

The above file contains a series of 3 semi-colons between every 2 values.

7. Using the multiple delimiter method:

$ awk -F'[;;;]' '{print $2}' file



Blank output !!! The above delimiter, though specified as 3 colons is as good as one delimiter which is a semi-colon(;) since they are all the same. Due to this, $2 will be the value between the first and the second semi-colon which in our case is blank and hence no output.
8. Using the delimiter without square brackets:

$ awk -F';;;' '{print $2}' file
202
213
222

The expected output !!! No square brackets is used and we got the output which we wanted.

Difference between using square brackets and not using it : When a set of delimiters are specified using square brackets, it means an OR condition of the delimiters. For example, -F '[;:]' means to separate the contents either on encountering ':' or ';'. However, when a set of delimiters are specified without using square brackets, awk looks at them literally to separate the contents. For example, -F ':;' means to separate the contents only on encountering a colon followed by a semi-colon. Hence, in the last example, the file contents are separated only when a set of 3 continuous semi-colons are encountered.
9. Read or parse a file containing a series of delimiters of varying lengths:
In the below file, the 1st and 2nd column are separated using 3 semi-colons, however the 2nd and 3rd are separated by 4 semi-colons

$ cat file
123;;;202;;;;203
124;;;213;;;;203
125;;;222;;;;203
$ awk -F';'+ '{print $2,$3}' file
202 203
213 203
222 203

The '+' is a regular expression. It indicates one or more of previous characters. ';'+ indicates one or more semi-colons, and hence both the 3 semi-colons and 4 semi-colons get matched.
10. Using a word as a delimiter:

$ cat file
123Unix203
124Unix203
125Unix203

Retrieve the numbers before and after the word "Unix" :

$ awk -F'Unix' '{print $1, $2}' file
123 203
124 203
125 203

In this case, we use the word "Unix" as the delimiter. And hence $1 and $2 contained the appropriate values . Keep in mind, it is not just the special characters which can be used as delimiters. Even alphabets, words can also be used as delimiters.
P.S: We will discuss about the awk split command on how to use it in these types of multiple delimited files.