Developer's Diary
Software development, with Terry Ebdon
02-Aug-2020 awk regex groups

Retrieving regex groups in awk

Today I learned about gawk's enhancements to the match() function.

I wanted to report on disconnection messages in a Minecraft server's log, where the disconnect line appeared in one of two versions:

  1. …iHezi lost connection: Disconnected
  2. …com.mojang…authlib.GameProfile@6c2ffd3b[id=<null>,name=GLX27,properties=…

Disconnect version 1

The first version is easy:

  • field 1 is a timestamp
  • field 4 is the username
  • field 7 is the status, e.g. "Disconnected"

e.g.

[16:07:29] [Server thread/INFO]: OkqySeany lost connection: Disconnected

Reporting these disconnections is trivial:

print $1 "\t" $7 "\t" $4
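
As a quick check of the field numbering, here is a minimal sketch that runs the same print statement against the sample line above. The report_v1 wrapper and the /lost connection/ pattern are assumptions added for the example; nothing here needs gawk, so plain awk is used.

```shell
# Hypothetical helper: reads server log lines on stdin and reports
# version 1 disconnections as timestamp, status, username.
report_v1() {
  awk '/lost connection/ { print $1 "\t" $7 "\t" $4 }'
}

printf '%s\n' '[16:07:29] [Server thread/INFO]: OkqySeany lost connection: Disconnected' |
  report_v1
```

With default whitespace splitting, "[Server" and "thread/INFO]:" count as fields 2 and 3, which is why the username lands in field 4.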

Disconnect version 2

The second version is trickier. The username is still in field 4, but it's embedded in something resembling the output of a generated toString() method.

In this example, which shows the line from field 4 onwards, the username, GLX27, is deeply embedded, so how do I extract it?

com.mojang.authlib.GameProfile@6c2ffd3b[id=<null>,name=GLX27,properties={},legacy=false] (/199.26.81.77:64150) lost connection: Disconnected

Normally I'd use awk's built-in sub() function for something like this, but sub() doesn't give access to regex groups: I can write groups in the pattern, but I can't see what each one matched. The gawk implementation of match() provides this functionality.
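
To illustrate the limitation, here is a small sketch run from the shell. The wrap_match helper and the shortened input string are made up for this example. The only handle sub() gives on the matched text is &, which stands for the whole match rather than an individual group.

```shell
# Hypothetical helper: wraps whatever the pattern matched in angle brackets.
wrap_match() {
  awk '{
    # "&" in the replacement is the entire matched text;
    # sub() has no equivalent for the (.+) group on its own.
    sub(/name=(.+),properties/, "<&>")
    print
  }'
}

printf '%s\n' 'id=<null>,name=GLX27,properties={}' | wrap_match
```

The result contains the whole of "name=GLX27,properties" inside the brackets, with no way to isolate GLX27.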

This is the match() statement that extracts the username:

match( $4, /name=\y(.+)\y,properties/, array )

Here \y marks a word boundary. I needed the contents of the bracketed group, i.e. the part between the \y markers. When given an array as its third argument, gawk's match() fills that array with the matched groups, so the username ends up in array[1].

Putting it all together we get:

/lost connection/ {
    if ( index( $0, "id=<null>" ) > 0 ) {
        # This is a version 2 log entry
        match( $4, /name=\y(.+)\y,properties/, array )
        $4 = array[1] # Username
        $7 = $NF      # Status
    }
    print $1 "\t" $7 "\t" $4
}

Here I'm conditionally replacing the fields used by the print statement. This allows a single line to contain the output format, avoiding mistakes if a change is required.

match() is very powerful; it does a lot more than I've discussed here.

The above is an expanded explanation of the disconnect message handling outlined in my answer to a Stack Overflow question.

© 2020 Terry Ebdon.
