View Syntaxes
Syntax definitions (“syntaxes” henceforth) add structural information to text. Formats like HTML, CSS, XML, JavaScript, etc. are merely text until something tells Espresso that certain keywords, constructs and pieces of text have special meaning. Syntaxes define that extra information using flexible syntax rules.
Big Picture
Before going into details about how to make a syntax, it’s important to understand how Espresso loads and uses syntaxes.
(sugar collection) > (detected xml syntaxes) + (detected xml syntax injections) > syntax compiling and injecting > (pool with syntaxes)
- First, Espresso collects all the sugars (see SugarBasics) it can find.
- Syntaxes and syntax injections (more info on injections later) are collected from the sugars. Each syntax has a root identifier, which is used to reference it for Languages, to include it in other syntaxes, etc.
- Once all syntaxes and injections have been detected, the compiling and injecting process starts. It involves three steps: first, syntax injections are prepared by cross-injecting them with other injections; next, syntaxes are compiled for internal usage; finally, all prepared injections are injected into the compiled syntaxes.
- Espresso now has one pool of syntaxes ready to go.
Compatibility
Espresso syntaxes are similar to TextMate language definitions, so anyone with previous experience there should feel right at home. To convert a .tmLanguage to the Espresso XML format, you can use the Espresso Syntax Tool.
Composition of a Syntax
Espresso syntaxes are defined using XML files. The basic structure looks as follows:
<?xml version="1.0"?>
<syntax name="root.type.identifier">
<!-- Syntax zones go into the <zones> tag -->
<zones>
...
</zones>
<!-- The (optional) <library> tag contains collections of reusable zones -->
<library>
...
</library>
</syntax>
The key piece in this XML is the name of the syntax. This identifier should be unique for each syntax in Espresso, so it’s a good idea to prefix it with your domain name or something similar. Syntaxes shipping with Espresso reserve the prefix espresso.default.
In the following sections, we’ll have a look at what the <zones> and <library> tags should contain.
Syntax Zones and Rules
Espresso has a syntax core based on regular expressions (“regexes” henceforth). When all syntax rules have been evaluated for the entire text, the result is a hierarchical structure that contains syntax zones mapping to parts of the text. Each of those zones has a type identifier that describes what it is, and can contain more syntax zones within.
To understand how the core builds this tree, let’s start with the notion of a syntax context. This context maintains a list of all the currently active syntax rules: the rules that can be evaluated to find a new piece of structure. The initial context consists of the rules defined in the root of the syntax, that is, all the rules in the <zones> tag.
There are 4 rule types that can be defined: start/end, match, include and cut-off. Each has a specific purpose and XML syntax. For performance reasons, the syntax core has one limitation that you must absolutely remember: syntax regular expressions only search in a single line of the text.
Start/End Rules
A start/end rule produces a syntax zone that begins with text matching the regex in the <starts-with> tag and ends with text matching the regex in the <ends-with> tag. When the rule becomes active (when a start match is found), the new syntax context consists of the syntax rules defined in its <subzones> tag. Once the end expression is encountered, the previous syntax context is restored.
<zone name="zone.identifier">
<starts-with>
<expression>regular expression here</expression>
</starts-with>
<ends-with>
<expression>regular expression here</expression>
</ends-with>
<subzones>
...
</subzones>
</zone>
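As an illustration, here is a minimal sketch of a start/end rule for a double-quoted string. The zone names are hypothetical and not taken from any shipping syntax, and the regexes assume a common regex flavor. Between the two quotes, only the rule inside <subzones> is active.

```xml
<!-- Hypothetical zone: a double-quoted string -->
<zone name="string.quoted.double">
    <starts-with>
        <expression>"</expression>
    </starts-with>
    <ends-with>
        <expression>"</expression>
    </ends-with>
    <subzones>
        <!-- Active only while inside the string: a backslash plus one character -->
        <zone name="constant.character.escape">
            <expression>\\.</expression>
        </zone>
    </subzones>
</zone>
```

Each expression here fits on a single line of text, in keeping with the single-line search limitation described above.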
Match Rules
For simple structures like keywords, it’s sufficient to match a single piece of text. The syntax looks as follows:
<zone name="zone.identifier">
<expression>regular expression here</expression>
</zone>
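A match rule for a handful of keywords could look like the sketch below. The zone name and the keyword list are made up for illustration, and the \b word-boundary syntax assumes a common regex flavor.

```xml
<!-- Hypothetical zone: a few control keywords, matched as whole words -->
<zone name="keyword.control">
    <expression>\b(if|else|while|return)\b</expression>
</zone>
```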
Include Rules
For more complex syntaxes, it can be handy to use the rules of other syntaxes (for CSS embedded in HTML, for example) or to reuse sets of rules. Include rules address both, and come in 3 forms:
<include syntax="syntax.identifier"/>
<include syntax="self"/>
This rule collects all the root rules (inside the <zones> tag) of the syntax named syntax.identifier, and inserts them in place of the include rule. To recursively include the current syntax’s root rules (useful for languages like XML), use “self” as the attribute value.
<include collection="collection.identifier"/>
<include syntax="syntax.identifier" collection="collection.identifier"/>
In this form, the rule collects all zones from the collection named collection.identifier. Without a “syntax” attribute, the collection is looked up in the syntax that contains the include rule.
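To make the embedding case concrete, the following sketch shows how an HTML-like syntax might pull in CSS rules between <style> tags. The identifiers embedded.css and com.example.css are hypothetical. Since the syntax file is XML, the angle brackets in the expressions are written as &lt; and &gt; so the file stays well-formed.

```xml
<!-- Hypothetical zone: CSS embedded in a style element -->
<zone name="embedded.css">
    <starts-with>
        <expression>&lt;style&gt;</expression>
    </starts-with>
    <ends-with>
        <expression>&lt;/style&gt;</expression>
    </ends-with>
    <subzones>
        <!-- Inserts the root rules of the (hypothetical) CSS syntax here -->
        <include syntax="com.example.css"/>
    </subzones>
</zone>
```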
Cut-off Rules
<cut-off>
<expression>regular expression here</expression>
</cut-off>
This rule is a processing instruction, rather than a true syntax rule, since it never generates a syntax zone. Normally, the regexes in the various rules are searched in the rest of the current line. Some situations may arise where a search expression could become unwieldy trying to define the cut-off point for the search.
Regex lookahead can often handle these cases, but it is much more convenient to be able to say “no search expression for a rule in the current context can go beyond this point”. When the cut-off expression is found in the text, no expression of a real rule can match beyond the location of the cut-off.
The usefulness of cut-offs becomes apparent with large syntaxes and the use of syntax injections (covered later). For example, the HTML syntax uses cut-offs to make sure embedded CSS doesn’t go beyond its closing </style> tag. Instead of making the CSS syntax explicitly aware of this by modifying the rule regexes, a simple external cut-off has the same result. Because the CSS syntax should require no knowledge about the languages it’s embedded in, this is clearly preferable to a pure regex-based approach.
Important: a cut-off rule only has an effect on the rules in the same context, not in higher or deeper contexts.
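A sketch of the </style> case described above might look as follows. It is not the actual rule from Espresso's HTML syntax; the identifiers embedded.css and com.example.css are hypothetical. The cut-off sits in the same context as the included CSS root rules, so those rules cannot search past the closing tag.

```xml
<!-- Hypothetical zone: embedded CSS guarded by a cut-off -->
<zone name="embedded.css">
    <starts-with>
        <expression>&lt;style&gt;</expression>
    </starts-with>
    <ends-with>
        <expression>&lt;/style&gt;</expression>
    </ends-with>
    <subzones>
        <!-- Rules in this context may not search beyond the closing tag -->
        <cut-off>
            <expression>&lt;/style&gt;</expression>
        </cut-off>
        <include syntax="com.example.css"/>
    </subzones>
</zone>
```

As noted above, this cut-off applies only to the rules in this context. Rules in deeper contexts, such as the subzones of a CSS start/end rule, would need a cut-off of their own.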
Order and Precedence
When looking for a new syntax zone, the syntax core evaluates all rules in the order they’re written in the syntax definition. The rule that has the earliest occurrence in the text is then used to create a new syntax zone.
Start or match expressions have no precedence over others in the same context; they are simply evaluated in their source order. However, if an end expression starts at the same point as a start or match expression, the end expression takes precedence. Note that this applies only to occurrences at the exact same position. A match expression of rule A that begins before the end expression of rule B will always be chosen over the end of B, even if A’s match extends far past B’s end.
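The last case can be illustrated with a small sketch (zone names hypothetical). Rule B is a bracketed list and rule A is a quoted string inside it.

```xml
<!-- Hypothetical rule B: a bracketed list -->
<zone name="list.bracketed">
    <starts-with>
        <expression>\[</expression>
    </starts-with>
    <ends-with>
        <expression>\]</expression>
    </ends-with>
    <subzones>
        <!-- Hypothetical rule A: a quoted string inside the list -->
        <zone name="string.quoted">
            <expression>"[^"]*"</expression>
        </zone>
    </subzones>
</zone>
```

Given the text ["a]b"], the string match begins before the first ] and is therefore chosen over the end of the list, even though it extends past that bracket. Following the rules above, the list zone should then end at the final ].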
Rule Captures
Now that we have discussed the basic workings of syntax rules, let’s look at some details. Because expressions for syntax rules are often composed of several semantic parts (for example, an HTML <tag> has at least angle brackets and the actual tag name), <expression> tags inside <zone> tags can be combined with <capture> tags to easily define subzones based on captures in the regex. An example:
<zone name="a.surrounded">
<expression>(\+)(a*)(\+)</expression>
<capture number="1" name="plus"/>
<capture number="2" name="multiple.a"/>
<capture number="3" name="plus"/>
</zone>
Instead of simply generating a syntax zone with identifier “a.surrounded”, the capture definitions insert subzones for “plus”, “multiple.a” and “plus” inside the “a.surrounded” zone.
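Applied to the HTML tag case mentioned earlier, a sketch with hypothetical zone names could look like this. It matches only a bare opening tag without attributes, and the angle brackets are XML-escaped so the syntax file remains well-formed.

```xml
<!-- Hypothetical zone: a bare opening tag, split into three captured subzones -->
<zone name="tag.open">
    <expression>(&lt;)([A-Za-z][A-Za-z0-9]*)(&gt;)</expression>
    <capture number="1" name="punctuation.begin"/>
    <capture number="2" name="tag.name"/>
    <capture number="3" name="punctuation.end"/>
</zone>
```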
Syntax Rule Libraries
<library>
<collection name="collection.identifier">
...
</collection>
</library>
The optional zone library in a syntax can define several zone collections, which can then be referenced using include rules. Each collection needs to specify a name, and can contain the same elements as the <zones> tag.
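Putting the pieces together, the sketch below shows a small syntax that defines one collection and includes it in two places. All identifiers (com.example.demo, block.braces, literals, constant.numeric) are hypothetical.

```xml
<?xml version="1.0"?>
<syntax name="com.example.demo">
    <zones>
        <zone name="block.braces">
            <starts-with>
                <expression>\{</expression>
            </starts-with>
            <ends-with>
                <expression>\}</expression>
            </ends-with>
            <subzones>
                <!-- Reuse the collection inside the block -->
                <include collection="literals"/>
            </subzones>
        </zone>
        <!-- Reuse the same collection at the root -->
        <include collection="literals"/>
    </zones>
    <library>
        <collection name="literals">
            <zone name="constant.numeric">
                <expression>\b[0-9]+\b</expression>
            </zone>
        </collection>
    </library>
</syntax>
```

Because neither include has a “syntax” attribute, both refer to the collection in this same syntax.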
Syntax Injections
To avoid unnecessary dependencies between syntaxes and to prevent duplication, Espresso offers syntax injections. Injections are best compared to rule collections, except that they’re included automatically by the syntax core. The basic structure of a syntax injection looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<injections>
<injection name="injection.identifier" selector="..." action="...">
...
</injection>
</injections>
Each <injection> element contains a list of rules that need to be injected. The allowed elements are the same as for collections, so they won’t be repeated here. Each syntax injection must have a unique name (once again, the espresso.default prefix is reserved).
The selector attribute specifies which syntax rules have to be targeted for this particular injection. Note that the selector is interpreted against syntax rules in the syntax definitions, not against the resulting syntax zones. For all targeted rules, an action is performed with the rules in the injection body. The action attribute can have any of these values:
- replace-target: replace the target rule with the injection rules
- replace-children: replace the target’s subzone rules with the injection rules
- attach-before-target: insert the injection rules before the target rule, inside the target’s parent rule
- attach-after-target: insert the injection rules after the target rule, inside the target’s parent rule
- insert-before-children: insert the injection rules into the target rule, before its first subzone rule
- insert-after-children: insert the injection rules into the target rule, after its last subzone rule
Syntax injections have no enforced relative order, but any replace-target action will neutralize any subzone-related actions for the same target rule.
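As a rough sketch, an injection that adds a marker rule after the existing subzone rules of its targets might look like this. The injection name and zone name are hypothetical. The selector is deliberately left elided, since selector syntax is not covered in this section.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<injections>
    <!-- Selector left elided on purpose: it must identify the target rules -->
    <injection name="com.example.todo-markers" selector="..." action="insert-after-children">
        <!-- Hypothetical zone: TODO and FIXME markers -->
        <zone name="comment.marker.todo">
            <expression>\b(TODO|FIXME)\b</expression>
        </zone>
    </injection>
</injections>
```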