Meta language

Support for any language can be added by writing a language spec using Sylver's meta language. Language specs are written in .syl files and contain a list of declarations representing node types, terminals, and rules.

A complete example is available at the end of this document.

Node Types

The first step of building a language spec is to define a set of types for the parse-tree nodes. These node types resemble structs in mainstream programming languages and can be declared as follows:

node TypeName {
field1: node_type,
field2: node_type,
[...]
}

TypeName must be an identifier starting with an uppercase letter.

node_type is either the name of another node type declared in the same spec or a list type of the form: List<TypeName>.

The field list can be empty.

For instance:

node Statement { }
node Expr { }

node If {
condition: Expr,
then: Statement,
else: Statement
}

node Block {
statements: List<Statement>
}

[...]

Node Type inheritance

node Expr { }
node Number: Expr { }
node Boolean: Expr { }

Node types can declare a parent type using the : syntax, as shown above. This is mainly useful when running queries on a parse tree: given the above spec, a query for all Expr nodes also returns the nodes of kind Number and Boolean.

caution

Field inheritance is not supported at the moment.
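The query behaviour described for parent types can be pictured with a small Python sketch. This is only an illustration of the idea, not Sylver's implementation or query language: the PARENTS table, the is_kind and query helpers, and the flat list standing in for a parse tree are all hypothetical.

```python
# Hypothetical model of the node kinds declared above: child kind -> parent kind.
PARENTS = {"Number": "Expr", "Boolean": "Expr", "Expr": None}

def is_kind(kind, queried):
    """Return True if `kind` is `queried` or inherits from it."""
    while kind is not None:
        if kind == queried:
            return True
        kind = PARENTS.get(kind)
    return False

def query(nodes, queried):
    """Keep the nodes whose kind matches `queried`, directly or through a parent."""
    return [n for n in nodes if is_kind(n[0], queried)]

# A flat list of (kind, source text) pairs standing in for a parse tree.
tree = [("Number", "42"), ("Boolean", "true"), ("Number", "7")]
expr_nodes = query(tree, "Expr")      # every node, since both kinds derive from Expr
number_nodes = query(tree, "Number")  # only the Number nodes
```

Querying for Expr walks up the parent chain of each node, which is why Number and Boolean nodes are included in the result.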

Terminals

Before building the parse tree, Sylver groups the individual characters of the input stream into tokens of different sorts (in the case of a programming language these tokens could be: string literals, keywords, punctuation, etc.).

Every language spec contains terminal declarations describing the different sorts of token in a given language.

The longest match prevails when two or more terminal declarations match the same portion of the input stream. If two matches have the same length, the terminal declared with a literal is chosen.
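The selection rule above can be sketched in Python. This is a minimal model under stated assumptions, not Sylver's lexer: the IDENTIFIER and IF_KEYWORD terminals, the TERMINALS table and both helper functions are hypothetical, and the sketch does not say what happens when two regex terminals tie, since the text above does not cover that case.

```python
import re

# Hypothetical terminal declarations: (tag, kind, pattern), where kind is
# "literal" or "regex". The tags are illustrative, not part of any real spec.
TERMINALS = [
    ("IDENTIFIER", "regex", r"[a-z]+"),
    ("IF_KEYWORD", "literal", "if"),
]

def match_length(kind, pattern, text, pos):
    """Length of the match for one terminal at `pos`, or 0 if it does not match."""
    if kind == "literal":
        return len(pattern) if text.startswith(pattern, pos) else 0
    m = re.compile(pattern).match(text, pos)
    return len(m.group(0)) if m else 0

def next_token(text, pos):
    """Pick the longest match; on a tie, prefer the terminal declared with a literal."""
    best = None
    for tag, kind, pattern in TERMINALS:
        length = match_length(kind, pattern, text, pos)
        if length == 0:
            continue
        # Compare by length first, then by "is a literal" to break ties.
        key = (length, kind == "literal")
        if best is None or key > best[0]:
            best = (key, tag, text[pos:pos + length])
    return (best[1], best[2]) if best else None
```

With these two declarations, the input if matches both terminals with the same length, so the literal wins and the token is tagged IF_KEYWORD. The input iffy is matched in full by the regex only, so the longer IDENTIFIER match prevails.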

Simple terminal declarations are of the form:

term TAG_NAME = term_expression

This means that the parser will build a token with tag TAG_NAME every time that it encounters a sequence of characters matching term_expression in the input.

TAG_NAME must be a unique uppercase identifier (possibly including underscores and digits).

Terminal expressions

Three types of terminal expressions are supported: literals, regexes, and terminal functions.

Literals

term CLASS_KEYWORD = 'class'

A terminal of the given tag is created every time the literal between the quotes is found in the input string.

Regex

term INT_LITERAL = `[0-9]+`

A terminal of the given tag is created every time a match for the regex between backticks (`) is found. In the example above, any sequence of one or more digits will be grouped into an INT_LITERAL terminal. Regexes use a Perl-style syntax.

tip

Literal and regex terminals can be inlined into node expressions, as demonstrated in the operator rule of the complete example.

Terminal functions

term COM_START = '/*'
term COM_END = '*/'

term COM = nested(start=COM_START, end=COM_END)

Supported terminal functions are:

  • nested: takes a start and an end terminal as arguments and matches any sequence of text delimited by them. Nesting (as in nested comments) is supported, so in the example above, the entire string /* nested /* comment */ */ would be matched instead of stopping at the first */.
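The nesting behaviour can be illustrated with a depth counter. The Python sketch below is an assumption about how such matching could work, not Sylver's code: match_nested is a hypothetical helper, and returning None for unbalanced input is a choice made for this sketch only.

```python
def match_nested(text, pos, start="/*", end="*/"):
    """Return the end index of the nested block starting at `pos`, or None."""
    if not text.startswith(start, pos):
        return None
    depth = 0
    i = pos
    while i < len(text):
        if text.startswith(start, i):
            depth += 1          # entering one more level of nesting
            i += len(start)
        elif text.startswith(end, i):
            depth -= 1          # leaving one level of nesting
            i += len(end)
            if depth == 0:
                return i        # the outermost block is closed
        else:
            i += 1
    return None  # unbalanced: a start delimiter was never closed

source = "/* nested /* comment */ */ x"
end_index = match_nested(source, 0)
matched = source[:end_index]
```

Because the counter only returns to zero at the second */, the match covers the whole outer comment rather than stopping at the first end delimiter.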

Modified terminals

Some special terminal declarations can be prefixed with an optional modifier that gives them additional properties. The supported modifiers are:

ignore

ignore term WHITESPACE = `\s`

ignore terminals are ignored by the later stages of the parser. They are mainly used to specify the regexes/literals for whitespace, as illustrated in the example above.

comment

comment term SINGLE_LINE_COM = '//'

Whenever it sees a comment token in the token stream, the parser adds a Comment node to the parse tree.

Rules

The structure of the parse tree is specified through rule declarations. Rule declarations resemble the Backus-Naur Form notation, and are written as follows:

rule RuleName = alternative1 | alternative2 | ...

Where RuleName is a unique identifier for the rule, and every alternative is a node expression or a reference to another rule.

A rule can have a single alternative, in which case the pipe separator (|) must be omitted.

Node expressions

The syntax for node expressions is:

TypeName { component1 component2... }

Which roughly means: if a match can be found for every component, build a TypeName node with the matched components as children. A component is either a (possibly inlined) terminal, or a rule invocation.

Nodes resulting from a rule invocation must be bound to a field of the node being created, using the binding syntax (@). For example, assuming that we have a complete spec for an imaginary language supporting expressions (parsed by the expr rule), we could add support for the unary '-' operator with the following declarations:

[...]

node UnaryMinus {
valueExpr: Expr
}

rule unary_minus = UnaryMinus { '-' valueExpr@expr }

[...]
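To make the binding step concrete, here is a rough Python sketch of what the unary_minus rule does when it matches. It is not how Sylver is implemented: parse_expr is a hypothetical stand-in for the expr rule that only accepts digit tokens, and nodes are represented as plain dictionaries for illustration.

```python
def parse_expr(tokens, pos):
    """Hypothetical stand-in for the expr rule: accepts a single digit token."""
    if pos < len(tokens) and tokens[pos].isdigit():
        return {"kind": "Expr", "text": tokens[pos]}, pos + 1
    return None

def parse_unary_minus(tokens, pos):
    """Sketch of: UnaryMinus { '-' valueExpr@expr }"""
    # First component: the inlined '-' literal terminal.
    if pos < len(tokens) and tokens[pos] == "-":
        # Second component: invoke the expr rule on the remaining tokens.
        result = parse_expr(tokens, pos + 1)
        if result is not None:
            child, next_pos = result
            # The @ binding stores the resulting node in the valueExpr field.
            return {"kind": "UnaryMinus", "valueExpr": child}, next_pos
    return None  # one of the components did not match
```

The node is only built when every component matches, and the node produced by the rule invocation ends up in the field named before the @ symbol.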

Advanced rule invocations

The rule invocation following a @ symbol can be replaced with an advanced rule invocation of the following form:

  • opt(rule_name) or rule_name?: matches the specified rule, or nothing
  • many(rule_name) or rule_name*: matches the specified rule 0 or more times
  • some(rule_name) or rule_name+: matches the specified rule 1 or more times
  • sepBy(TERM_NAME, rule_name): matches the specified rule 0 or more times, interleaved with the specified terminal
  • sepByTr(TERM_NAME, rule_name): same as sepBy, but accepts a trailing terminal
  • sepBy1(TERM_NAME, rule_name): matches the specified rule 1 or more times, interleaved with the specified terminal
  • sepByTr1(TERM_NAME, rule_name): same as sepBy1 but accepts a trailing terminal

So, for example, if we wanted to add a rule to match a list of comma-separated expressions:

node ExprList {
valueExprs: List<Expr>
}

term COMA = ','

rule expr_list = ExprList { valueExprs@sepBy(COMA, expr) }
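The difference between the four sepBy variants comes down to two switches: whether at least one item is required, and whether a trailing separator is accepted. The Python sketch below models this on a list of token strings. It is an illustration under assumptions, not Sylver's parser: sep_by, its min_items and trailing parameters, and the use of str.isdigit as a stand-in for the expr rule are all hypothetical.

```python
def sep_by(tokens, sep, is_item, min_items=0, trailing=False):
    """Match items interleaved with `sep` at the start of `tokens`.

    Returns (items, tokens_consumed), or None if fewer than `min_items` match.
    """
    items, pos = [], 0
    while pos < len(tokens) and is_item(tokens[pos]):
        items.append(tokens[pos])
        pos += 1
        if pos < len(tokens) and tokens[pos] == sep:
            # Consume the separator only if another item follows,
            # or if a trailing separator is allowed.
            if pos + 1 < len(tokens) and is_item(tokens[pos + 1]):
                pos += 1
            elif trailing:
                pos += 1
                break
            else:
                break
        else:
            break
    if len(items) < min_items:
        return None
    return items, pos

# sepBy    -> sep_by(tokens, ",", str.isdigit)
# sepByTr  -> sep_by(tokens, ",", str.isdigit, trailing=True)
# sepBy1   -> sep_by(tokens, ",", str.isdigit, min_items=1)
# sepByTr1 -> sep_by(tokens, ",", str.isdigit, min_items=1, trailing=True)
```

In this model, the sepBy form leaves a trailing separator unconsumed, the Tr forms consume it, and the 1 forms fail on an empty list.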

Complete example

node Expr { }

node Number: Expr { }

node Binop: Expr {
left: Expr,
op: Operator,
right: Expr
}

node Operator { }

node Plus: Operator { }

node Minus: Operator { }

term NUMBER = `[0-9]+`

rule main = expr

rule expr =
Number { NUMBER }
| Binop { left@expr op@operator right@expr }

rule operator = Plus { '+' } | Minus { '-' }

The preceding spec describes an expression language, in which expressions are either a number or (possibly nested) additions/subtractions.