Generic text parsing

Basics

Generic text parsing uses regular expressions with named groups to extract fields from a text file.

By default, every line of the input file is interpreted as a record, but you can use the line break character to merge consecutive lines into a single record.

For every record, each pattern of the parsing rules is searched for. The parsing engine can either stop at the first pattern that matches (when the checkbox Stop on first pattern matching is ticked), or keep evaluating the remaining patterns to extract more data from the current record.

Reference: your contract n°(?<contractNumber>\d+)
Dear Mrs? (?<clientName>.*),

... could be used to parse standard letters and associate contract numbers with client names.
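
As a rough illustration of the matching loop described above, here is a minimal Python sketch. It is not the tool's implementation: the function parse_record and its stop_on_first parameter are hypothetical names, and the two patterns are rewritten with Python's (?P<name>...) syntax because Python's re module does not accept the (?<name>...) form used on this page.

```python
import re

# The two example patterns, rewritten with Python's (?P<name>...) syntax.
PATTERNS = [
    re.compile(r"Reference: your contract n°(?P<contractNumber>\d+)"),
    re.compile(r"Dear Mrs? (?P<clientName>.*),"),
]


def parse_record(record, patterns=PATTERNS, stop_on_first=False):
    """Search each pattern in the record and collect its named groups.

    stop_on_first is a hypothetical stand-in for the
    "Stop on first pattern matching" checkbox.
    """
    fields = {}
    for pattern in patterns:
        found = pattern.search(record)
        if found is None:
            continue
        fields.update(found.groupdict())
        if stop_on_first:
            break
    return fields


# By default, every line is one record.
letter = [
    "Reference: your contract n°12345",
    "Dear Mrs Smith,",
]
records = [parse_record(line) for line in letter]
```

With this input, the first record holds contractNumber and the second holds clientName; nothing links the two yet.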

Comments

For readability, you can add comment lines among the patterns by starting the line with a # sign.

# This is a comment

Aliases

For readability, aliases can be defined to reuse standard patterns:

#define IP \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

The alias can then be used in other patterns:

Your IP address is ${address::IP}

... replaces the ${...} group with the definition of IP, and the resulting group is named address in the current record.
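
The substitution can be pictured with the following Python sketch. It only illustrates the idea under stated assumptions: the ALIASES table and the expand_aliases function are hypothetical, and the generated group uses Python's (?P<name>...) syntax so that the result can be run with Python's re module.

```python
import re

# Hypothetical alias table, as it could be built from "#define IP ..." lines.
ALIASES = {"IP": r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"}

# Matches ${fieldName::ALIAS} references inside a pattern.
ALIAS_REF = re.compile(r"\$\{(\w+)::(\w+)\}")


def expand_aliases(pattern, aliases=ALIASES):
    """Replace each ${name::ALIAS} with a named group holding the alias body."""
    def substitute(ref):
        field, alias = ref.group(1), ref.group(2)
        return "(?P<{}>{})".format(field, aliases[alias])

    # A function is used as the replacement so that backslashes in the
    # alias body are copied as-is instead of being read as escapes.
    return ALIAS_REF.sub(substitute, pattern)


expanded = expand_aliases("Your IP address is ${address::IP}")
found = re.search(expanded, "Your IP address is 192.168.0.1")
address = found.group("address")
```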

Several alternative ways of parsing the same concept can also be described:

#define PORT \d+

#options PORTDEF
eq ${unique::PORT}
range ${min::PORT} ${max::PORT}
#endoptions

Such a definition can now be used like a simple alias:

port-mapping ${src::PORTDEF} -> ${dst::PORTDEF}

The previous line is equivalent to defining all of the following patterns (the full Cartesian product of the options):

port-mapping eq (?<srcUnique>\d+) -> eq (?<dstUnique>\d+)
port-mapping range (?<srcMin>\d+) (?<srcMax>\d+) -> eq (?<dstUnique>\d+)
port-mapping eq (?<srcUnique>\d+) -> range (?<dstMin>\d+) (?<dstMax>\d+)
port-mapping range (?<srcMin>\d+) (?<srcMax>\d+) -> range (?<dstMin>\d+) (?<dstMax>\d+)

Notice how the field names are built: the field name from the #options definition is appended to the field name from the main pattern in camel case (i.e. its first letter is set to uppercase), so that src and unique give srcUnique.
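
The expansion and the naming rule can be reproduced with a short Python sketch. It is an illustration only: the ALIASES and OPTIONS tables and the camel and expand functions are hypothetical, and the recursive handling of options is an assumption. The output is plain text that keeps the (?<name>...) syntax of this page, so it can be compared with the four patterns listed above.

```python
import itertools
import re

# Hypothetical tables, as they could be built from the #define and
# #options lines above.
ALIASES = {"PORT": r"\d+"}
OPTIONS = {
    "PORTDEF": ["eq ${unique::PORT}", "range ${min::PORT} ${max::PORT}"],
}

# Matches ${fieldName::DEFINITION} references inside a pattern.
REF = re.compile(r"\$\{(\w+)::(\w+)\}")


def camel(prefix, name):
    """Append name to prefix, setting the first letter of name to uppercase."""
    if not prefix:
        return name
    return prefix + name[0].upper() + name[1:]


def expand(pattern, prefix=""):
    """Return every pattern obtained by expanding the ${...} references."""
    parts = []  # one list of alternatives per fragment of the pattern
    pos = 0
    for ref in REF.finditer(pattern):
        parts.append([pattern[pos:ref.start()]])  # literal text before the reference
        field = camel(prefix, ref.group(1))
        definition = ref.group(2)
        if definition in OPTIONS:
            alternatives = []
            for option in OPTIONS[definition]:
                alternatives.extend(expand(option, field))
            parts.append(alternatives)
        else:
            parts.append(["(?<{}>{})".format(field, ALIASES[definition])])
        pos = ref.end()
    parts.append([pattern[pos:]])  # literal text after the last reference
    # Cartesian product of the alternatives of every fragment.
    return ["".join(combo) for combo in itertools.product(*parts)]


expanded = expand("port-mapping ${src::PORTDEF} -> ${dst::PORTDEF}")
```

The list contains the same four patterns as above, although not necessarily in the same order.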

Context

As mentioned above, every line (or group of merged lines) of the input file is treated as a single record. Some fields, however, are not repeated on every line: they carry contextual information that should be kept on all subsequent records. To achieve this, declare the field as contextual:

#contextual contractNumber

... means that as soon as the field contractNumber is found in a record, its value persists on all subsequent records. An optional #endcontext directive gives a pattern that ends the context:

#endcontext contractNumber ^Done$

... means that as soon as the parser finds a line that is exactly the string Done, it clears the value of the field contractNumber.
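
The behaviour of a contextual field can be sketched in Python as follows. This is a simplified model, not the tool's code: the CONTEXTUAL table and the parse_with_context function are hypothetical, the patterns use Python's (?P<name>...) syntax, and clearing the context before the record of the Done line is built is an assumption.

```python
import re

# Example patterns, rewritten with Python's (?P<name>...) syntax.
PATTERNS = [
    re.compile(r"Reference: your contract n°(?P<contractNumber>\d+)"),
    re.compile(r"Dear Mrs? (?P<clientName>.*),"),
]

# Hypothetical table: contextual field -> end pattern (None if there is none),
# as it could be built from the #contextual and #endcontext lines.
CONTEXTUAL = {"contractNumber": re.compile(r"^Done$")}


def parse_with_context(lines):
    """Parse one record per line, carrying contextual fields forward."""
    context = {}
    records = []
    for line in lines:
        # Assumption: the end pattern clears the value before the record is built.
        for field, end_pattern in CONTEXTUAL.items():
            if end_pattern is not None and end_pattern.search(line):
                context.pop(field, None)
        fields = {}
        for pattern in PATTERNS:
            found = pattern.search(line)
            if found:
                fields.update(found.groupdict())
        # Remember the contextual fields found in this record.
        for field in CONTEXTUAL:
            if field in fields:
                context[field] = fields[field]
        records.append({**context, **fields})
    return records


records = parse_with_context([
    "Reference: your contract n°12345",
    "Dear Mr Smith,",
    "Done",
    "Dear Mrs Jones,",
])
```

In this run, the record for Smith also carries contractNumber, while the record for Jones, which comes after Done, does not.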

Test of a context

Some patterns may be relevant only in a specific context (i.e. when a contextual field has a value). You can write blocks of contextual patterns as follows:

#ifdef contractNumber
Client name : (?<clientName>.*)$
Phone number : (?<phone>[\d\.]+)
#endif
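
The effect of such a block can be modelled with the Python sketch below. It is only an illustration: the GUARDED table and the parse function are hypothetical, the patterns use Python's (?P<name>...) syntax, and the sketch assumes that contractNumber has been declared contextual and that a value found on a line already counts as defined for that same line.

```python
import re

# Pattern that sets the contextual field contractNumber.
CONTRACT = re.compile(r"Reference: your contract n°(?P<contractNumber>\d+)")

# Hypothetical table: contextual field -> patterns of its #ifdef block.
GUARDED = {
    "contractNumber": [
        re.compile(r"Client name : (?P<clientName>.*)$"),
        re.compile(r"Phone number : (?P<phone>[\d.]+)"),
    ],
}


def parse(lines):
    """Parse one record per line; guarded patterns need their field in context."""
    context = {}
    records = []
    for line in lines:
        fields = {}
        found = CONTRACT.search(line)
        if found:
            fields.update(found.groupdict())
            context["contractNumber"] = found.group("contractNumber")
        for field, patterns in GUARDED.items():
            if field not in context:
                continue  # the #ifdef condition is false: skip the whole block
            for pattern in patterns:
                found = pattern.search(line)
                if found:
                    fields.update(found.groupdict())
        records.append({**context, **fields})
    return records


records = parse([
    "Phone number : 01.23.45",
    "Reference: your contract n°42",
    "Client name : Jane Doe",
    "Phone number : 01.23.45.67.89",
])
```

The first phone number is ignored because no contract number is known yet; the second one is extracted.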