4. Explaining the Mapfile Grammar

For a great introduction to Lark, the general-purpose parsing library used by mappyfile, see the Lark reference page.

The full Mapfile grammar file is shown at the end of this page. The latest version is available on GitHub.

4.1. A Simple Example

The easiest way to understand how the parser and grammar work is to go through a short example. We will parse the Mapfile snippet below and turn it into an AST (Abstract Syntax Tree).

MAP
    NAME 'Test'
END

This produces the following tree.

_images/tree.png

The tree is stored as a Python object, shown below:

Tree(start, [Tree(composite, [Tree(composite_type, [Token(__MAP32, 'MAP')]), Tree(attr, [Tree(attr_name, [Token(NAME, 'NAME')]), Tree(string, [Token(STRING2, "'Test'")])])])])

The grammar file contains rules and terminals, and the parser matches these to the input text to create the tree.

We will now go through all the rules matched by our example Mapfile.

The first rule start will always be at the root of the tree:

start: _NL* composite _NL*

The rule checks for a composite rule, surrounded by zero or more (denoted by *) newlines. The newline terminal _NL is defined at the end of the grammar file.

Next we’ll look at the composite rule:

composite: composite_type attr? _NL+ composite_body _END
       | composite_type points _END
       | composite_type pattern _END
       | composite_type attr _END

This rule can be matched by any one of a list of options, each on its own line and separated by the | (pipe) character. When writing or debugging a grammar you can comment out options to see which one matches for a particular input.

In our example the Mapfile matches the last option composite_type attr _END, which can be broken down as follows:

  • a composite_type rule (in this case MAP)
  • an attr rule
  • the _END terminal (the literal string “END”)

As each rule creates a new branch on the tree, a new composite branch is created from the start branch.

!composite_type: "CLASS"i // i here is used for case insensitive matches, so CLASS, Class, or class will all be matched
            | "LEGEND"i
            | "MAP"i
            // list cut for brevity

The composite_type rule again has a list of options, but in this case each option is a string literal. The ! before the rule name means the rule keeps all of its terminals. Without the !, the tree would look as below; note that the MAP value is missing from the composite_type branch.

_images/tree_no_terminals.png

Next we look at the attr rule:

attr: attr_name value+

This consists of two further rules: an attr_name match followed by one or more value rules (the + denotes one or more matches). A new attr branch is added to the tree with these rules as children.

The attr_name rule is as follows:

attr_name: NAME | composite_type

In our example the rule is matched by the NAME terminal (the alternative is a composite_type). This is defined using a regular expression to match a string:

NAME: /[a-z_][a-z0-9_]*/i

The regular expression can be explained as follows:

  • [a-z_] - a single letter from a to z, or an underscore
  • [a-z0-9_] - a single letter from a to z, a digit from 0 to 9, or an underscore
  • the * indicates zero or more matches of the preceding character class
  • the i indicates a case-insensitive search

Therefore, the following strings would all match: NAME, name, Name, NAME_, MY_NAME. However, 'NAME', My Name, and N@me would not.
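These cases can be checked with Python's re module. The sketch below uses fullmatch to ask whether a whole string is a single NAME token; this is a simplification, since the lexer itself matches tokens at the current position in the input rather than against whole strings.

```python
import re

# NAME terminal from the grammar; the trailing /i flag becomes re.IGNORECASE
NAME = re.compile(r"[a-z_][a-z0-9_]*", re.IGNORECASE)

candidates = ["NAME", "name", "Name", "NAME_", "MY_NAME", "'NAME'", "My Name", "N@me"]

# fullmatch: the entire string must be consumed by the pattern
results = {text: bool(NAME.fullmatch(text)) for text in candidates}

for text, is_name in results.items():
    print(repr(text), is_name)
```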

The final rule matched in our example is the value rule.

?value: bare_string | string | int | float | expression | not_expression | attr_bind | path | regexp | runtime_var | list

The ? preceding the rule causes it to be “inlined” if it has a single child. This means a new branch isn’t created for the value rule, and its child is added directly to the parent branch (attr in our example). The tree with “inlining” results in:

_images/tree_inlining.png

And without inlining (the ? character), the value branch appears:

_images/tree_noinlining.png

This example covers a large number of the rules in the grammar file, and hopefully provides a basis for understanding the more complicated rules.

4.2. Terminals

Terminals are displayed in uppercase in the grammar file. They are used to match tokens using string literals, regular expressions, or combinations of other terminals.

If a terminal's name is preceded by an underscore, it won’t appear in the tree; for example, the terminal that closes each of the composite types:

_END: "END"i

Many of the terminals make use of regular expressions to match tokens. The site https://regex101.com/ provides useful explanations of these. Remember to set the Python “Flavor” and to remove the surrounding forward slashes.

For example, to get an explanation of COMMENT: /\#[^\n]*/, enter \#[^\n]*:

_images/regex.png
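The same pattern can also be tried directly in Python, again without the surrounding forward slashes. The sample input line below is made up for the illustration.

```python
import re

# COMMENT terminal: a '#' followed by everything up to the end of the line
COMMENT = re.compile(r"\#[^\n]*")

text = "NAME 'Test' # the map name\nEND"
match = COMMENT.search(text)

print(match.group())  # prints: # the map name
```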

Some further explanations are below:

// check for path names e.g. /root/logs
PATH: /[a-z_]*[.\/][a-z0-9_\/.]+/i
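A quick way to see what the PATH pattern accepts is to test it against a few strings. The sample values below are invented for the illustration, and fullmatch is used to ask whether the whole string is a single PATH token.

```python
import re

# PATH terminal: optional letters/underscores, then a '.' or '/',
# then one or more letters, digits, underscores, slashes, or dots
PATH = re.compile(r"[a-z_]*[.\/][a-z0-9_\/.]+", re.IGNORECASE)

samples = ["/root/logs", "./data/roads.shp", "symbols.txt", "logs", "my-file.map"]
path_results = {text: bool(PATH.fullmatch(text)) for text in samples}

for text, is_path in path_results.items():
    print(repr(text), is_path)
```

The last two samples fail: logs contains no dot or slash, and my-file.map contains a hyphen, which neither character class allows.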

4.3. Miscellaneous Notes

  • Rules can be matched recursively: a composite_body can contain a _composite_item, which can in turn be another composite. This allows us to parse nested composite types such as a CLASS in a LAYER in a MAP.
composite_body: _composite_item*
_composite_item: (composite|attr|points|projection|metadata|pattern|validation|values) _NL+

4.4. Grammar File

The full Mapfile grammar is shown below.

start: (_NL* composite _NL*)+

composite: composite_type attr? _NL* composite_body _END
       | composite_type points _END
       | composite_type pattern _END
       | composite_type attr _END
       | metadata
       | validation

composite_body: _composite_item*
_composite_item: (composite|attr|points|projection|pattern|values) _NL+

points: "POINTS"i _NL* (_num_pair _NL*)* _END
pattern: "PATTERN"i _NL* (_num_pair _NL*)* _END

projection: "PROJECTION"i _NL* ((string _NL*)+|AUTO _NL+) _END
values: "VALUES"i _NL* ((string_pair) _NL+)+ _END

metadata: "METADATA"i _NL* ((string_pair|attr) _NL+)+ _END
validation: "VALIDATION"i _NL* ((string_pair|attr) _NL+)+ _END

attr: attr_name value+

attr_name: NAME | composite_type
?value: bare_string | string | int | float | expression | not_expression | attr_bind | path | regexp | runtime_var | list

int: SIGNED_INT
int_pair: int int
!bare_string: NAME | "SYMBOL"i | "AUTO"i | "GRID"i | "CLASS"i | "FEATURE"i
string: STRING1 | STRING2 | STRING3 
string_pair: string string
float: SIGNED_FLOAT
float_pair: float float
path: PATH
regexp: REGEXP1 | REGEXP2
runtime_var: RUNTIME_VAR
list: "{" value ("," value)* "}"

_num_pair: (int|float) _NL* (int|float)

attr_bind: "[" bare_string "]"

not_expression: ("!"|"NOT"i) expression
expression: "(" or_test ")"
?or_test : (or_test ("OR"i|"||"))? and_test
?and_test : (and_test ("AND"i|"&&"))? comparison
?comparison: (comparison compare_op)? add
!compare_op: ">=" | "<" | "=*" | "==" | "=" | "~" | "~*" | ">" | "<=" | "IN" | "NE" | "EQ"

?add: (add "+")? (func_call | value)
func_call: attr_name "(" func_params ")"
func_params: value ("," value)*

!composite_type: "CLASS"i
            | "CLUSTER"i
            | "COMPOSITE"i
            | "CONFIG"i
            | "FEATURE"i
            | "FONTSET"i
            | "GRID"i
            | "INCLUDE"i
            | "JOIN"i
            | "LABEL"i
            | "LAYER"i
            | "LEADER"i
            | "LEGEND"i
            | "MAP"i
            | "OUTPUTFORMAT"i
            | "QUERYMAP"i
            | "REFERENCE"i
            | "SCALEBAR"i
            | "SCALETOKEN"i
            | "STYLE"i
            | "SYMBOL"i
            | "WEB"i

AUTO: "AUTO"i
PATH: /[a-z_]*[.\/][a-z0-9_\/.]+/i
NAME: /[a-z_][a-z0-9_]*/i

SIGNED_FLOAT: ["-"|"+"] FLOAT
SIGNED_INT: ["-"|"+"] INT

%import common.FLOAT
%import common.INT

STRING1: /".*?(?<!\\\\)(\\\\\\\\)*?"i?/
STRING2: /'.*?(?<!\\\\)(\\\\\\\\)*?'i?/
STRING3: /`.*?`i?/   // XXX TODO
REGEXP1: /\/.*?\/i?/
REGEXP2: /\\\\.*?\\\\i?/
RUNTIME_VAR: /%.*?%/

COMMENT: /\#[^\n]*/
CCOMMENT: /\/(?s)[*].*?[*]\//

_END: "END"i

WS: /[ \t\f]+/
_NL: /[\r\n]+/

%ignore COMMENT
%ignore CCOMMENT
%ignore WS