Meta language
Support for any language can be added by writing a language spec using Sylver's meta language.
Language specs are written in .syl files and contain a list of declarations representing node types, terminals, and rules.
A complete example is available at the end of this document.
Node Types
The first step of building a language spec is to define a set of types for the parse-tree nodes. These node types resemble structs in mainstream programming languages and can be declared as follows:
node TypeName {
    field1: node_type,
    field2: node_type,
    [...]
}
TypeName must be an identifier starting with an uppercase letter.
node_type is either the name of another node type declared in the same spec or a list type of the form List<TypeName>.
The field list can be empty.
For instance:
node Statement { }
node Expr { }
node If {
    condition: Expr,
    then: Statement,
    else: Statement
}
node Block {
    statements: List<Statement>
}
[...]
Node Type inheritance
node Expr { }
node Number: Expr { }
node Boolean: Expr { }
Node types can declare a parent type using the : syntax (the parent's name follows the colon).
This is mainly useful when running queries on a parse tree: for example, given the above spec, querying all the Expr nodes would return the nodes of kind Number and Boolean as well.
Field inheritance is not supported at the moment.
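The query behaviour described above can be illustrated with a small Python sketch. This is not Sylver's implementation or query API: the PARENTS table and the is_kind and query helpers are hypothetical names introduced only to show how a query on a parent type can also return nodes of its child types.

```python
# Hypothetical child -> parent table mirroring the spec above.
# Sylver's real data structures are not described in this document.
PARENTS = {"Expr": None, "Number": "Expr", "Boolean": "Expr"}


def is_kind(node_type, queried):
    """Return True if node_type is `queried` or inherits from it."""
    current = node_type
    while current is not None:
        if current == queried:
            return True
        current = PARENTS.get(current)
    return False


def query(node_types, queried):
    """Keep every node whose type is `queried` or one of its descendants."""
    return [t for t in node_types if is_kind(t, queried)]
```

With this sketch, query(["Number", "Boolean", "Expr"], "Expr") keeps all three entries, while querying for "Number" keeps only the Number entry.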
Terminals
Before building the parse tree, Sylver groups the individual characters of the input stream into tokens of different sorts (in the case of a programming language these tokens could be: string literals, keywords, punctuation, etc.).
Every language spec contains terminal declarations describing the different sorts of token in a given language.
The longest match prevails when two or more terminal declarations match the same portion of the input stream. If two matches have the same length, the terminal declared with a literal is chosen.
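The disambiguation rule above can be sketched in Python as follows. This is an illustration under assumptions, not Sylver's lexer: the TERMINALS table, the IDENT terminal, and the match_length and next_token helpers are hypothetical, and the sketch only covers literal and regex terminals.

```python
import re

# Hypothetical terminal table: (tag, kind, pattern).
# IDENT is an assumed terminal added to create a tie with the literal.
TERMINALS = [
    ("IDENT", "regex", r"[a-z]+"),
    ("CLASS_KEYWORD", "literal", "class"),
    ("INT_LITERAL", "regex", r"[0-9]+"),
]


def match_length(kind, pattern, text, pos):
    """Length of the match starting exactly at pos, or 0 if none."""
    if kind == "literal":
        return len(pattern) if text.startswith(pattern, pos) else 0
    m = re.compile(pattern).match(text, pos)
    return len(m.group(0)) if m else 0


def next_token(text, pos):
    """Pick the longest match; on equal length, prefer a literal terminal."""
    best = None
    for tag, kind, pattern in TERMINALS:
        length = match_length(kind, pattern, text, pos)
        if length == 0:
            continue
        # Tuples compare element by element: length first, then literal-ness.
        key = (length, kind == "literal")
        if best is None or key > best[0]:
            best = (key, tag, text[pos:pos + length])
    return None if best is None else (best[1], best[2])
```

On the input "class", both IDENT and CLASS_KEYWORD match five characters, so the literal wins. On "classes", IDENT matches seven characters and wins on length.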
Simple terminal declarations are of the form:
term TAG_NAME = term_expression
This means that the parser will build a token with tag TAG_NAME every time that it encounters a sequence of characters matching term_expression in the input.
TAG_NAME must be a unique uppercase identifier (possibly including underscores and digits).
Terminal expressions
Three types of terminal expressions are supported: literals, regexes, and terminal functions.
Literals
term CLASS_KEYWORD = 'class'
A terminal of the given tag is created every time the literal between the single quotes is found in the input.
Regex
term INT_LITERAL = `[0-9]+`
A terminal of the given tag is created every time a match for the regex between backticks (`) is found.
In the example above, any sequence of one or more digits will be grouped into an INT_LITERAL terminal.
Regexes use a Perl-style syntax.
Literal and regex terminals can be inlined into node expressions, as demonstrated in the operator rule of the complete example.
Terminal functions
term COM_START = '/*'
term COM_END = '*/'
term COM = nested(start=COM_START, end=COM_END)
Supported terminal functions are:
nested: it takes a start and an end terminal as arguments and matches any sequence of text delimited by its arguments. Nesting (as in nested comments) is supported, so in the example above, the entire string /* nested /* comment */ */ would be matched instead of stopping at the first */.
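The matching behaviour of nested can be illustrated with a depth counter. The Python sketch below is an assumption about the semantics described above, not Sylver's implementation; match_nested is a hypothetical helper that treats the start and end terminals as plain strings.

```python
def match_nested(text, pos, start, end):
    """Return the index just past the balanced region that begins at pos.

    Returns None if text does not begin with `start` at pos, or if the
    delimiters are never balanced.
    """
    if not text.startswith(start, pos):
        return None
    depth = 0
    i = pos
    while i < len(text):
        if text.startswith(start, i):
            depth += 1          # entering one more nesting level
            i += len(start)
        elif text.startswith(end, i):
            depth -= 1          # leaving one nesting level
            i += len(end)
            if depth == 0:
                return i        # outermost delimiter closed
        else:
            i += 1
    return None
```

Applied to a string starting with /* nested /* comment */ */, the sketch stops after the second */ rather than the first, because the depth only returns to zero there.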
Modified terminals
Some special terminal declarations can be prefixed with an optional modifier that gives them additional properties. The supported modifiers are:
ignore
ignore term WHITESPACE = `\s`
ignore terminals are ignored by the later stages of the parser. They are mainly used to specify the regexes/literals for whitespace, as illustrated in the example above.
comment
comment term SINGLE_LINE_COM = '//'
Whenever it sees a comment token in the token stream, the parser adds a Comment node to the parse tree.
Rules
The structure of the parse tree is specified through rule declarations. Rule declarations resemble the Backus-Naur Form notation, and are written as follows:
rule RuleName = alternative1 | alternative2 | ...
Where RuleName is a unique identifier for the rule, and every alternative is a node expression or a reference to another rule.
A rule can have a single alternative, in which case the pipe separator (|) must be omitted.
Node expressions
The syntax for node expressions is:
TypeName { component1 component2... }
Which roughly means: if a match can be found for every component, build a TypeName node with the matched components as children. A component is either a (possibly inlined) terminal, or a rule invocation.
Nodes resulting from a rule invocation must be bound to a field of the node being created, using the binding syntax (@). For example, assuming that we have a complete spec for an imaginary language supporting expressions (parsed by the expr rule), we could add support for the unary '-' operator with the following declarations:
[...]
node UnaryMinus {
    valueExpr: Expr
}
rule unary_minus = UnaryMinus { '-' valueExpr@expr }
[...]
Advanced rule invocations
The rule invocation following a @ symbol can be replaced with an advanced rule invocation of one of the following forms:
opt(rule_name) or rule_name?: matches the specified rule, or nothing
many(rule_name) or rule_name*: matches the specified rule 0 or more times
some(rule_name) or rule_name+: matches the specified rule 1 or more times
sepBy(TERM_NAME, rule_name): matches the specified rule 0 or more times, interleaved with the specified terminal
sepByTr(TERM_NAME, rule_name): same as sepBy, but accepts a trailing terminal
sepBy1(TERM_NAME, rule_name): matches the specified rule 1 or more times, interleaved with the specified terminal
sepByTr1(TERM_NAME, rule_name): same as sepBy1, but accepts a trailing terminal
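The difference between the four sepBy variants can be sketched in Python. This is a simplified illustration, not Sylver's parser: sep_by is a hypothetical helper, the "rule" is reduced to a single token tag, and leaving an unaccepted trailing separator unconsumed is an assumption of this sketch.

```python
def sep_by(tokens, pos, sep, item, min_items=0, trailing=False):
    """Match `item` tokens interleaved with `sep` tokens, starting at pos.

    tokens is a list of (tag, text) pairs. Returns (texts, new_pos), or
    None when fewer than min_items items were matched.
    min_items=1 models sepBy1/sepByTr1; trailing=True models sepByTr/sepByTr1.
    """
    items = []
    i = pos
    while i < len(tokens) and tokens[i][0] == item:
        items.append(tokens[i][1])
        i += 1
        if i < len(tokens) and tokens[i][0] == sep:
            if i + 1 < len(tokens) and tokens[i + 1][0] == item:
                i += 1          # separator followed by another item
                continue
            if trailing:
                i += 1          # consume the trailing separator
        break
    if len(items) < min_items:
        return None
    return items, i
```

For the token sequence 1 , 2 , the sketch returns both items in every case; only the variants with trailing=True also consume the final separator.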
So, for example, if we wanted to add a rule to match a list of comma-separated expressions:
node ExprList {
    valueExprs: List<Expr>
}
term COMA = ','
rule expr_list = ExprList { valueExprs@sepBy(COMA, expr) }
Complete example
node Expr { }
node Number: Expr { }
node Binop: Expr {
    left: Expr,
    op: Operator,
    right: Expr
}
node Operator { }
node Plus: Operator { }
node Minus: Operator { }
term NUMBER = `[0-9]+`
rule main = expr
rule expr =
    Number { NUMBER }
    | Binop { left@expr op@operator right@expr }
rule operator = Plus { '+' } | Minus { '-' }
The preceding spec describes an expression language, in which expressions are either a number or (possibly nested) additions/subtractions.