awk - Join or merge lines on finding a pattern
In earlier articles, we discussed joining all the lines in a file and joining every 2 lines in a file. In this article, we will see how to join lines based on a pattern, that is, how to join lines on encountering a pattern, using awk or gawk.
Let us assume a file with the following contents. Lines containing START appear at intervals. We have to join all the lines that follow each START line.
$ cat file
START
Unix
Linux
START
Solaris
Aix
SCO
1. Join the lines following the pattern START without any delimiter.
$ awk '/START/{if (NR!=1)print "";next}{printf "%s",$0}END{print "";}' file
UnixLinux
SolarisAixSCO
The idea is to print the lines that follow a START line one after another on the same output line, and to end that output line on encountering the next START. /START/ selects the lines containing the pattern START, and the commands within the first {} run only for those lines. There, a newline is printed unless the line is the first line of the file (NR!=1); this newline ends the output line of the previous group. Without this condition, the output would begin with a blank line, since the file starts with a START line.
The next statement prevents the rest of the program from being executed for the START lines. As a result, the second pair of braces {} is reached only for the lines not containing START. This part prints the line with printf, which, unlike print, does not add a terminating newline character. Hence, all the lines following a START appear on the same line. The END block prints a final newline; without it, the shell prompt would appear at the end of the last line of output.
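The flow described above may be easier to follow when the one-liner is spread over several lines. The sketch below is only an illustration: it recreates the sample file in a temporary location (the path comes from mktemp and is not part of the article), and it passes "%s" as the printf format so that a line containing a percent sign is printed literally instead of being interpreted as a format string.

```shell
# Recreate the sample file in a temporary location (an assumption of this sketch).
file=$(mktemp)
printf '%s\n' START Unix Linux START Solaris Aix SCO > "$file"

joined=$(awk '
/START/ {                   # a START line closes the previous group
    if (NR != 1) print ""   # end the previous output line, except at line 1
    next                    # skip the rule below for START lines
}
{ printf "%s", $0 }         # print the line without a trailing newline
END { print "" }            # terminate the last output line
' "$file")

echo "$joined"
rm -f "$file"
```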
2. Join the lines following the pattern START with space as delimiter.
$ awk '/START/{if (NR!=1)print "";next}{printf "%s ",$0}END{print "";}' file
Unix Linux
Solaris Aix SCO
This is the same as the previous command except for the format string "%s ", which prints a space, the delimiter in this case, after each line.
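One side effect of the "%s " format is that every output line ends with a space. If that matters, one possible variation is to print the separator before each line except the first line of a group. The variable name sep is an illustrative choice and not part of the original command.

```shell
# sep is a hypothetical helper: empty at the start of a group, " " afterwards.
joined=$(printf '%s\n' START Unix Linux START Solaris Aix SCO |
awk '/START/ { if (NR != 1) print ""; sep = ""; next }
     { printf "%s%s", sep, $0; sep = " " }
     END { print "" }')

echo "$joined"
```

Since the separator is emitted in front of a line rather than after it, no space is left dangling at the end of a group.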
3. Join the lines following the pattern START with comma as delimiter.
$ awk '/START/{if (x)print x;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file
Unix,Linux
Solaris,Aix,SCO
Here, we build the complete line in a variable x and print x whenever a new START line is found. The expression x=(!x)?$0:x","$0 uses the ternary operator, as in C or Perl: if x is empty, the current line ($0) is assigned to x; otherwise, a comma and the current line are appended to x. As a result, x contains the lines following a START joined with commas. In the END block, x is printed once more, since no START line follows the last group to trigger its printing.
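Testing x directly treats an empty string, and a line consisting of just 0, as "nothing collected yet", so such lines can get dropped from the result. A minimal sketch of one way around this, assuming a separate counter n tracks how many lines the current group holds and the delimiter is passed in as d through -v (both names are introduced here for illustration only):

```shell
# n counts the lines collected for the current group; d is the delimiter.
# Both names are illustrative assumptions, not part of the original command.
joined=$(printf '%s\n' START 0 Unix START Solaris Aix |
awk -v d=',' '
/START/ { if (n) print x; x = ""; n = 0; next }
{ x = (n++ ? x d $0 : $0) }
END { if (n) print x }')

echo "$joined"
```

With the counter in place, the END block also prints nothing when the last group is empty.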
4. Join the lines following the pattern START with comma as delimiter, including the pattern matching line.
$ awk '/START/{if (x)print x;x="";}{x=(!x)?$0:x","$0;}END{print x;}' file
START,Unix,Linux
START,Solaris,Aix,SCO
The difference here is the missing next statement. Without next, the commands in the second set of curly braces run for the START line as well, and hence it also gets concatenated.
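The effect of next is perhaps easiest to see with two rules that merely report which of them ran. This is a toy illustration rather than one of the article's commands; the labels "rule 1" and "rule 2" are made up for the purpose.

```shell
# With next: the START line is handled by the first rule only.
with_next=$(printf '%s\n' START Unix |
awk '/START/ { print "rule 1: " $0; next } { print "rule 2: " $0 }')

# Without next: the START line falls through to the second rule as well.
without_next=$(printf '%s\n' START Unix |
awk '/START/ { print "rule 1: " $0 } { print "rule 2: " $0 }')

echo "$with_next"
echo "---"
echo "$without_next"
```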
5. Join the lines following the pattern START with comma as delimiter and also print the pattern matching line. However, the pattern line should not be joined.
$ awk '/START/{if (x)print x;print;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file
START
Unix,Linux
START
Solaris,Aix,SCO
Here, instead of making START part of the variable x, the START line is printed right away with print. As a result, the START line comes out on a line of its own, and the remaining lines get joined.
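All the commands above use /START/, which matches any line that contains START, including a line such as RESTART. If the marker is always a line of its own, an exact comparison is one possible tightening. The sketch below applies it to the last command; the RESTART line in the sample input is invented here to show the difference.

```shell
# $0 == "START" matches only a line that is exactly START.
joined=$(printf '%s\n' START Unix RESTART START Solaris |
awk '$0 == "START" { if (x) print x; print; x = ""; next }
     { x = (!x) ? $0 : x "," $0 }
     END { print x }')

echo "$joined"
```

With this condition, RESTART is treated as ordinary data and joined with the other lines of its group.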