Explaining the Mapfile Grammar

For a great introduction to Lark, the general-purpose parsing library used by mappyfile, see the Lark reference page.

The full Mapfile grammar file is shown at the end of this page. The latest version can be seen here on GitHub.

A Simple Example

The easiest way to understand how the parser and grammar work is to go through a short example. We will parse the Mapfile snippet below and turn it into an AST (Abstract Syntax Tree).

MAP
    NAME 'Test'
END

This produces the following tree.

_images/tree.png

The tree is stored as a Python object, shown below:

Tree(start, [Tree(composite, [Tree(composite_type, [Token(__MAP39, 'MAP')]),
Tree(composite_body, [Tree(attr, [Token(UNQUOTED_STRING, 'NAME'),
Tree(string, [Token(SINGLE_QUOTED_STRING, "'Test'")])])])])])

This can be formatted as follows:

start
  composite
    composite_type  MAP
    composite_body
      attr
        NAME
        string      'Test'
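
To make the link between the nested Python object and the indented view concrete, here is a minimal sketch. It uses plain tuples and strings as a stand-in for Lark's Tree and Token classes; the tuple layout and the format_tree helper are assumptions made for illustration and are not part of Lark or mappyfile.

```python
# Stand-in for the parse tree: a branch is (rule_name, [children]),
# and a token is a plain string. This is NOT Lark's own Tree class.
tree = ("start", [
    ("composite", [
        ("composite_type", ["MAP"]),
        ("composite_body", [
            ("attr", ["NAME", ("string", ["'Test'"])]),
        ]),
    ]),
])


def format_tree(node, depth=0):
    """Return indented lines; a branch holding only tokens goes on one line."""
    indent = "  " * depth
    if isinstance(node, str):
        return [indent + node]
    name, children = node
    if all(isinstance(child, str) for child in children):
        return [indent + name + "  " + " ".join(children)]
    lines = [indent + name]
    for child in children:
        lines.extend(format_tree(child, depth + 1))
    return lines


formatted = format_tree(tree)
print("\n".join(formatted))
```

When working with a real Lark tree, its own pretty() method produces a similar indented view.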

The grammar file contains rules and terminals, and the parser matches these to the input text to create the tree.

We will now go through all the rules matched by our example Mapfile.

The first rule start will always be at the root of the tree:

start: composite+

The rule consists of the composite rule followed by a plus sign, indicating that one or more composites must be found in the input.

Next we’ll look at the composite rule:

composite: composite_type composite_body _END
       | metadata
       | validation

This rule consists of a list of options, with each additional option on its own line starting with the | (pipe) character. When writing or debugging a grammar you can comment out options to see which one matches for a particular input.

In our example the Mapfile matches the first option composite_type composite_body _END, which can be broken down as follows:

  • a composite_type rule (in this case MAP)

  • a composite_body rule

  • the _END terminal (the literal string “END”)

As each rule creates a new branch on the tree, a new composite branch is created under the start branch.

!composite_type: "CLASS"i // i here is used for case insensitive matches, so CLASS, Class, or class will all be matched
            | "CLUSTER"i
            | "MAP"i
            // list cut for brevity

The composite_type rule again has a list of options, but in this case they consist of string literals. The ! before the rule name means the rule keeps all of its terminals. Without the !, the tree would look as below; note that the MAP value is missing from the composite_type branch.

_images/tree_no_terminals.png
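
The effect of the i suffix on the string literals in composite_type can be approximated with Python's re module. The sketch below is only an analogy: it joins three of the keywords into one case-insensitive regular expression, and the is_composite_type helper is a hypothetical name, not something the parser exposes.

```python
import re

# Hypothetical stand-in for the composite_type rule: three of its literals
# joined into a single case-insensitive regular expression.
COMPOSITE_TYPE = re.compile(r"CLASS|CLUSTER|MAP", re.IGNORECASE)


def is_composite_type(text):
    """Return True if the whole string is one of the keywords, in any case."""
    return COMPOSITE_TYPE.fullmatch(text) is not None


results = {word: is_composite_type(word) for word in ["MAP", "Map", "map", "MAPS"]}
print(results)
```

MAPS is rejected because the whole string has to be one of the keywords, not merely start with one.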

Next let’s look at the attr rule:

attr: attr_name value+

This consists of two further rules: an attr_name match and one or more value rules (as noted above, the + denotes one or more matches). A new attr branch is added to the tree with these rules as children.

The attr_name rule is as follows:

attr_name: NAME | composite_type

In our example the rule is matched by the NAME terminal (the alternative is a composite_type). This is defined using a regular expression to match a string:

NAME: /[a-z_][a-z0-9_]*/i

The regular expression can be explained as follows:

  • [a-z_] - a single letter from a to z, or an underscore

  • [a-z0-9_] - a single letter from a to z, a digit from 0 to 9, or an underscore

  • the * indicates zero or more repetitions of the preceding character class

  • the i indicates a case-insensitive search

Therefore, the following strings would all match: NAME, name, Name, NAME_, MY_NAME. However, 'NAME', My Name, and N@me would not.
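
The claims about which strings match can be checked with Python's re module. This is a small sketch, assuming that a string only counts as a match when the pattern covers all of it (hence fullmatch); the i flag of the terminal becomes re.IGNORECASE.

```python
import re

# The NAME terminal from above, with the trailing "i" flag expressed
# as re.IGNORECASE.
NAME = re.compile(r"[a-z_][a-z0-9_]*", re.IGNORECASE)

candidates = ["NAME", "name", "Name", "NAME_", "MY_NAME", "'NAME'", "My Name", "N@me"]

# fullmatch requires the pattern to cover the entire string.
matches = [text for text in candidates if NAME.fullmatch(text)]
rejected = [text for text in candidates if not NAME.fullmatch(text)]

print("match:", matches)
print("no match:", rejected)
```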

The final rule matched in our example is the value rule.

?value: bare_string | string | int | float | expression | not_expression | attr_bind | path | regexp | runtime_var | list

The ? preceding the rule causes it to be “inlined” if it has a single child. This means a new branch isn’t created for the value rule, and its child is added directly to the branch. The tree with “inlining” results in:

_images/tree_inlining.png

And without inlining (the ? character), the value branch appears:

_images/tree_noinlining.png
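
The inlining behaviour described above can be modelled in a few lines of Python. This is a sketch on a simplified tuple-based tree, not Lark's implementation; the inline function and the (rule_name, [children]) layout are assumptions made for illustration.

```python
def inline(node, inlined_rules):
    """Replace each branch named in inlined_rules that has one child by that child."""
    if isinstance(node, str):  # a token: nothing to inline
        return node
    name, children = node
    children = [inline(child, inlined_rules) for child in children]
    if name in inlined_rules and len(children) == 1:
        return children[0]  # drop the branch, keep its only child
    return (name, children)


# The attr branch as it would look without the "?" on the value rule.
with_value_branch = ("attr", ["NAME", ("value", [("string", ["'Test'"])])])

inlined = inline(with_value_branch, {"value"})
print(inlined)
```

The string branch ends up directly under attr, which mirrors the first of the two trees above.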

This example covers a large number of the rules in the grammar file, and hopefully provides a basis for understanding the more complicated rules.

Terminals

Terminals are displayed in uppercase in the grammar file. They are used to match tokens using string literals, regular expressions, or combinations of other terminals.

If the terminal name is preceded by an underscore, it won’t appear in the tree, for example the terminal that closes each of the composite types:

_END: "END"i
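
As a rough model of this filtering, the sketch below represents matched tokens as (terminal_name, text) pairs and drops those whose terminal name starts with an underscore. The pair layout and the kept_tokens helper are hypothetical and only mimic the visible effect on the tree.

```python
# Tokens as (terminal_name, text) pairs, in the order they were matched.
matched = [
    ("UNQUOTED_STRING", "NAME"),
    ("SINGLE_QUOTED_STRING", "'Test'"),
    ("_END", "END"),
]


def kept_tokens(tokens):
    """Keep only tokens whose terminal name does not start with an underscore."""
    return [token for token in tokens if not token[0].startswith("_")]


visible = kept_tokens(matched)
print(visible)
```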

Many of the terminals make use of regular expressions to match tokens. The site https://regex101.com/ provides useful explanations of these. Remember to set the Python “Flavor” and to remove the surrounding forward slashes.

For example, to get an explanation of COMMENT: /\#[^\n]*/ enter \#[^\n]*:

_images/regex.png
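
The same pattern can also be tried directly in Python. The sketch below applies the COMMENT expression to a short, made-up Mapfile string. Note that a bare regular expression like this would also swallow a # inside a quoted string; that is one reason to rely on the parser rather than on a standalone substitution.

```python
import re

# The COMMENT terminal without its surrounding forward slashes.
COMMENT = re.compile(r"\#[^\n]*")

mapfile = "MAP # the root object\n    NAME 'Test' # map name\nEND\n"

comments = COMMENT.findall(mapfile)          # every comment, up to the line end
without_comments = COMMENT.sub("", mapfile)  # the text with comments removed

print(comments)
print(without_comments)
```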

Some further explanations are below:

// check for path names e.g. /root/logs
PATH: /[a-z_]*[.\/][a-z0-9_\/.]+/i
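
A quick way to see what this PATH expression accepts is to test it against a few strings. The sample values below are invented for illustration, and fullmatch is used on the assumption that the whole string must be covered by the pattern.

```python
import re

# The PATH terminal shown above, with the "i" flag as re.IGNORECASE.
PATH = re.compile(r"[a-z_]*[.\/][a-z0-9_\/.]+", re.IGNORECASE)

samples = ["/root/logs", "./data/roads.shp", "data/roads.shp", "roads"]

# A string needs at least one "." or "/" after an optional leading name.
is_path = {text: PATH.fullmatch(text) is not None for text in samples}
print(is_path)
```

The last sample fails because it contains neither a dot nor a slash.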

Miscellaneous Notes

  • Rules can be matched recursively, so a composite_body can contain a _composite_item, which can in turn be another composite. This allows us to parse nested composite types such as a CLASS in a LAYER in a MAP.

composite_body: _composite_item*
_composite_item: (composite|attr|points|projection|metadata|pattern|validation|values) _NL+
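
The recursion between composite and composite_body can be illustrated with a small hand-written parser. This is a simplified sketch and not what Lark generates: it splits the input on whitespace, knows only three composite types, and treats everything else as a two-token attribute. All names in it are hypothetical.

```python
COMPOSITE_TYPES = {"MAP", "LAYER", "CLASS"}


def parse_composite(tokens, pos):
    """Parse `composite_type composite_body END` starting at tokens[pos].

    Returns the parsed (type, body) pair and the position after END.
    """
    name = tokens[pos].upper()
    pos += 1
    body = []
    while tokens[pos].upper() != "END":
        if tokens[pos].upper() in COMPOSITE_TYPES:
            # A composite inside a composite_body: recurse.
            child, pos = parse_composite(tokens, pos)
            body.append(child)
        else:
            # Simplified attr: one name followed by exactly one value.
            body.append(("attr", tokens[pos], tokens[pos + 1]))
            pos += 2
    return (name, body), pos + 1


tokens = "MAP LAYER NAME 'roads' CLASS NAME 'major' END END END".split()
tree, end = parse_composite(tokens, 0)
print(tree)
```

Each nested block is handled by the same function that handles the outer block, which is the same idea as composite appearing inside _composite_item.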

  • Terminals can also be imported from the common grammar that ships with Lark, for example:

%import common.INT

// this is defined in common.g as follows

DIGIT: "0".."9"
INT: DIGIT+

Grammar File

The full Mapfile grammar is shown below.

// =================================================================
// 
// Authors: Erez Shinan, Seth Girvin
// 
// Copyright (c) 2020 Seth Girvin
// 
// Permission is hereby granted, free of charge, to any person
// obtaining a copy of this software and associated documentation
// files (the "Software"), to deal in the Software without
// restriction, including without limitation the rights to use,
// copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the
// Software is furnished to do so, subject to the following
// conditions:
// 
// The above copyright notice and this permission notice shall be
// included in all copies or substantial portions of the Software.
// 
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
// EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
// OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
// NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
// HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
// WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
// OTHER DEALINGS IN THE SOFTWARE.
// 
// =================================================================

start: "SYMBOLSET"i composite_body _END   -> symbolset
     | composite+

composite: composite_type composite_body _END
       | metadata
       | validation
       | connectionoptions

composite_body: _composite_item*
_composite_item: (composite|attr|points|projection|pattern|values|config)

!projection: "PROJECTION"i (string*|AUTO) _END
!config: "CONFIG"i (string | UNQUOTED_STRING) (string | UNQUOTED_STRING)

!points: "POINTS"i num_pair* _END
!pattern: "PATTERN"i num_pair* _END

!values: "VALUES"i string_pair* _END
!metadata: "METADATA"i string_pair* _END
!validation: "VALIDATION"i string_pair* _END
!connectionoptions: "CONNECTIONOPTIONS"i string_pair* _END

attr: (UNQUOTED_STRING | composite_type) (value | UNQUOTED_STRING | UNQUOTED_STRING_VALUE)
%declare UNQUOTED_STRING_VALUE  // generated using the interactive parser. See issues #48, #98

?value: string | int | float | expression | not_expression | attr_bind | path
| regexp | runtime_var | list | NULL | true | false | extent | rgb | hexcolor
| colorrange | hexcolorrange | num_pair | attr_bind_pair | attr_mixed_pair | _attr_keyword

int: SIGNED_INT
int_pair: int int
rgb: int int int
colorrange: int int int int int int
hexcolorrange: hexcolor hexcolor
hexcolor: DOUBLE_QUOTED_HEXCOLOR | SINGLE_QUOTED_HEXCOLOR

extent: (int|float) (int|float) (int|float) (int|float) 

!_attr_keyword: "AUTO"i | "HILITE"i | "SELECTED"i 

string: DOUBLE_QUOTED_STRING | SINGLE_QUOTED_STRING | ESCAPED_STRING
string_pair: (string|UNQUOTED_STRING) (string|UNQUOTED_STRING)

attr_bind_pair: attr_bind attr_bind
attr_mixed_pair: attr_bind (int|float) | (int|float) attr_bind
float: SIGNED_FLOAT
float_pair: float float
path: PATH
regexp: REGEXP1 | REGEXP2
runtime_var: RUNTIME_VAR
list: "{" (value | UNQUOTED_STRING_SPACE) ("," (value | UNQUOTED_STRING_SPACE))* "}"

num_pair: (int|float) (int|float)

attr_bind: "[" UNQUOTED_STRING "]"

not_expression: ("!"|"NOT"i) comparison
expression: "(" or_test ")"
?or_test : (or_test ("OR"i|"||"))? and_test
?and_test : (and_test ("AND"i|"&&"))? comparison
?comparison: (comparison compare_op)? sum
!compare_op: ">=" | "<" | "=*" | "==" | "=" | "!=" | "~" | "~*" | ">" | "%"
| "<=" | "IN"i | "NE"i | "EQ"i | "LE"i | "LT"i | "GE"i | "GT"i | "LIKE"i

?sum: product
    | sum "+" product -> add
    | sum "-" product -> sub

?product: unary_expr
    | product "*" unary_expr -> mul
    | product "/" unary_expr -> div
    | product "^" unary_expr -> power

?unary_expr: atom
    | "-" unary_expr -> neg
    | "+" unary_expr

?atom: (func_call | value)
// ?multiply: (multiply "*")? (func_call | value)

func_call: UNQUOTED_STRING "(" func_params ")"
func_params: value ("," value)*

!true: "TRUE"i
!false: "FALSE"i

!composite_type: "CLASS"i
            | "CLUSTER"i
            | "COMPOSITE"i
            | "FEATURE"i
            | "GRID"i
            | "JOIN"i
            | "LABEL"i
            | "LAYER"i
            | "LEADER"i
            | "LEGEND"i
            | "MAP"i
            | "OUTPUTFORMAT"i
            | "QUERYMAP"i
            | "REFERENCE"i
            | "SCALEBAR"i
            | "SCALETOKEN"i
            | "STYLE"i
            | "WEB"i
            | "SYMBOL"i

AUTO: "AUTO"i
PATH: /([a-z0-9_]*\.*\/|[a-z0-9_]+[.\/])[a-z0-9_\/\.-]+/i

// rules allow optional alphachannel
DOUBLE_QUOTED_HEXCOLOR.2: /\"#(?:[0-9a-fA-F]{3}){1,2}([0-9a-fA-F]{2})?\"/
SINGLE_QUOTED_HEXCOLOR.2: /'#(?:[0-9a-fA-F]{3}){1,2}([0-9a-fA-F]{2})?'/

NULL: "NULL"i

SIGNED_FLOAT: ["-"|"+"] FLOAT
SIGNED_INT: ["-"|"+"] INT

INT: /[0-9]+(?![_a-zA-Z])/

%import common.FLOAT

// UNQUOTED_STRING: /[a-z_][a-z0-9_\-]*/i
UNQUOTED_STRING: /[a-z0-9_\xc0-\xff\-:]+/i
UNQUOTED_STRING_SPACE: /[a-z0-9\xc0-\xff_\-: ']+/i
DOUBLE_QUOTED_STRING: "\"" ("\\\""|/[^"]/)* "\"" "i"?
SINGLE_QUOTED_STRING: "'" ("\\'"|/[^']/)* "'" "i"?
ESCAPED_STRING: /`.*?`i?/
//KEYWORD: /[a-z]+/i

//UNQUOTED_NUMERIC_STRING: /[a-z_][a-z0-9_\-]*/i

REGEXP1.2: /\/.*?\/i?/
REGEXP2: /\\\\.*?\\\\i?/

RUNTIME_VAR: /%.*?%/

COMMENT: /\#[^\n]*/
CCOMMENT.3: /\/[*].*?[*]\//s

_END: "END"i

WS: /[ \t\f]+/
_NL: /[\r\n]+/

%ignore COMMENT
%ignore CCOMMENT
%ignore WS
%ignore _NL
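
As with the terminals discussed earlier, the more involved expressions in the grammar can be explored in Python. The sketch below tests the DOUBLE_QUOTED_HEXCOLOR pattern against a few invented values, assuming the whole string must match.

```python
import re

# DOUBLE_QUOTED_HEXCOLOR without its surrounding forward slashes:
# 3 or 6 hex digits, optionally followed by a 2-digit alpha channel.
DOUBLE_QUOTED_HEXCOLOR = re.compile(r'\"#(?:[0-9a-fA-F]{3}){1,2}([0-9a-fA-F]{2})?\"')

samples = ['"#ff0000"', '"#f00"', '"#ff000080"', '"#ff00"', '"ff0000"']

is_hexcolor = {text: DOUBLE_QUOTED_HEXCOLOR.fullmatch(text) is not None for text in samples}
print(is_hexcolor)
```

The four-digit value and the value without a leading # are rejected, while the eight-digit value is accepted as a colour with an alpha channel.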