Better C++ Syntax Highlighting - Part 4: Functions
Functions are next on our list. Their declarations, definitions, and calls appear throughout C++ code, and Clang provides a rich set of node types for processing them.
Consider the following example:
And corresponding AST:
Annotating functions and operators is a lot more involved than annotating the node types we’ve seen previously, so let’s establish some success criteria before getting started:
- Function names (regular and template) should be annotated with function - this includes the equal template function declarations on lines 6 and 12, its use on line 49, the distance static class member function on line 23 and its use on line 39, and the main function definition on line 26.
- Unary operators (lines 13, 43, and 46) should be annotated with unary-operator.
- Binary operators (lines 7, 17, and 38) should be annotated with binary-operator.
- Compound assignment operators like += on line 38 should also use binary-operator.
- Overloaded operator declarations and definitions, such as operator== on lines 30-31 and its use on line 39, should be annotated with function-operator.
- User-defined literal operators like the s operator on line 53 and the ms operator on line 56 should match their underlying parameter type.
- Variable declarations using functional-style initialization (p1 and p2 on lines 36 and 37) should remain as plain tokens.
To annotate functions and operators, we’ll define visitor functions for eight new node types:
- FunctionDecl nodes, for regular function declarations and definitions,
- FunctionTemplateDecl nodes, for template function declarations and definitions,
- UnaryOperator nodes, for unary operators,
- BinaryOperator nodes, for binary operators,
- CallExpr nodes, for function calls,
- CXXOperatorCallExpr nodes, for overloaded operator calls,
- CompoundAssignOperator nodes, for compound assignment operators, and
- UserDefinedLiteral nodes, for user-defined literal operators.
Function declarations
FunctionDecl nodes represent standard function declarations and definitions.
We’ll annotate these with a function tag, with special handling for overloaded operators.
The isImplicit() check prevents annotating compiler-generated placeholder declarations.
For overloaded operators, isOverloadedOperator() is used to detect operator functions and skip the first 8 characters (operator) to annotate only the operator symbol.
The operator keyword itself should instead be highlighted as a language keyword, which we’ll handle in a later post in this series.
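The prefix-skipping step can be sketched in isolation. This is a minimal, standard-library-only illustration rather than the visitor itself: operator_symbol is a hypothetical helper, and the string it receives stands in for the name a FunctionDecl would report. Skipping whitespace after the keyword is an extra assumption that goes beyond the fixed 8-character skip described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Clang): given the spelled name of an
// overloaded operator function, such as "operator==", return only the symbol.
// The name string is a stand-in for the text a FunctionDecl would report.
std::string operator_symbol(const std::string& name) {
    const std::string keyword = "operator";  // 8 characters
    // Precondition: the name starts with the keyword.
    assert(name.compare(0, keyword.size(), keyword) == 0);

    // Skip the keyword, then any spaces between it and the symbol
    // (assumption: spellings like "operator ==" should also be handled).
    std::size_t start = keyword.size();
    while (start < name.size() && name[start] == ' ') {
        ++start;
    }
    return name.substr(start);
}
```

For operator== this returns ==, which is the only part that would receive the function-operator annotation.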
Note that CXXMethodDecl nodes (class member functions) are picked up by this visitor, since they derive from FunctionDecl.
This includes static class functions, constructors, and destructors.
Similarly, template function declarations are also visited because each FunctionTemplateDecl contains a child FunctionDecl node representing the actual function.
This means we don’t actually need to set up dedicated visitors for these nodes (unless we want some specialized logic).
With this visitor implemented, function declarations and definitions are properly annotated:
Function calls
CallExpr nodes represent function calls.
As before, we’ll annotate the function name of each call with the function tag.
We can retrieve the function name from the underlying declaration, which we get through getCalleeDecl().
Unlike other AST nodes, CallExpr does not provide direct access to the function name’s location in the source.
The getBeginLoc() function returns the location of the fully-qualified function call, including any namespace and/or class qualifiers.
To work around this, we’ll tokenize the function call’s source range and annotate only the token matching the function’s name.
This approach elegantly handles arbitrarily qualified function calls.
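To make the token-matching idea concrete, here is a rough sketch that uses only the standard library. The tokenize function is a toy stand-in for Clang's lexer, and both it and find_callee_offset are hypothetical names introduced for illustration. It shows how the name token can be located inside a qualified call without knowing how many qualifiers precede it.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// A token with its text and its offset into the original source snippet.
struct Token {
    std::string text;
    std::size_t offset;
};

// True for characters that may appear inside an identifier.
bool is_word_char(char ch) {
    return std::isalnum(static_cast<unsigned char>(ch)) != 0 || ch == '_';
}

// Toy tokenizer (an assumption standing in for Clang's lexer): identifiers and
// numeric literals become one token each, "::" becomes one token, and every
// other non-space character becomes a single-character token.
std::vector<Token> tokenize(const std::string& source) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < source.size()) {
        const unsigned char c = static_cast<unsigned char>(source[i]);
        if (std::isspace(c)) {
            ++i;
        } else if (std::isdigit(c)) {
            // Keep a literal such as 1.0f together so its suffix is not
            // mistaken for an identifier.
            std::size_t end = i;
            while (end < source.size() && (is_word_char(source[end]) || source[end] == '.')) {
                ++end;
            }
            tokens.push_back({source.substr(i, end - i), i});
            i = end;
        } else if (is_word_char(source[i])) {
            std::size_t end = i;
            while (end < source.size() && is_word_char(source[end])) {
                ++end;
            }
            tokens.push_back({source.substr(i, end - i), i});
            i = end;
        } else if (source.compare(i, 2, "::") == 0) {
            tokens.push_back({"::", i});
            i += 2;
        } else {
            tokens.push_back({std::string(1, source[i]), i});
            ++i;
        }
    }
    return tokens;
}

// Return the offset of the first token spelled exactly like the callee name,
// or std::string::npos if no token matches.
std::size_t find_callee_offset(const std::string& call_source, const std::string& callee) {
    assert(!callee.empty());
    for (const Token& token : tokenize(call_source)) {
        if (token.text == callee) {
            return token.offset;
        }
    }
    return std::string::npos;
}
```

For the call text math::Vector3::distance(p1, p2) and the callee name distance, the returned offset points at the d of distance, past both qualifiers. One limitation of this sketch is that taking the first match would pick the wrong token if a qualifier happened to share the function's name.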
Built-in operators
Unary, binary, and compound assignment operators are captured under UnaryOperator, BinaryOperator, and CompoundAssignOperator nodes respectively.
All three follow the same implementation pattern, so we’ll focus on unary operators as an example.
Unlike other nodes, operator nodes provide direct access to the operator’s location through getOperatorLoc().
We retrieve the operator symbol using the static getOpcodeStr() function.
The implementations of VisitBinaryOperator and VisitCompoundAssignOperator follow the same pattern, using their respective getOpcodeStr() functions.
Unary operators are annotated with unary-operator, while binary and compound assignment operators are annotated with binary-operator.
Another type of built-in operator is the array subscript operator, represented by the ArraySubscriptExpr AST node.
Handling this requires setting up a dedicated visitor, as these nodes are not visited by other operator visitors.
Unlike most other operators, Clang does not provide a direct way of retrieving the locations of both the opening and closing brackets.
Functions like getExprLoc() only return the location of the expression the operator is applied to, and not the operator symbols themselves.
To work around this, we simply tokenize the source range of the node and manually annotate both the [ and ] tokens as operators.
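One way to pick out the right pair of brackets is sketched below with plain strings. This is an illustration under stated assumptions, not the visitor from this post: subscript_brackets is a hypothetical helper, the input is assumed to be the source text of a single subscript expression (so it ends with the closing bracket), and brackets inside string or character literals are not considered.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: given the source text of one subscript expression
// (for example "grid[row][col]"), return the offsets of the "[" and "]" that
// belong to the outermost subscript. Returns a pair of npos if the brackets
// are unbalanced.
std::pair<std::size_t, std::size_t> subscript_brackets(const std::string& expr) {
    // Precondition (assumption): the text ends at the closing bracket, as the
    // source range of a subscript expression would.
    assert(!expr.empty() && expr.back() == ']');

    const std::size_t close = expr.size() - 1;
    int depth = 0;
    // Walk backwards from the closing bracket to find its matching opener,
    // stepping over any nested subscripts in the index expression.
    for (std::size_t i = close + 1; i-- > 0;) {
        if (expr[i] == ']') {
            ++depth;
        } else if (expr[i] == '[') {
            if (--depth == 0) {
                return {i, close};
            }
        }
    }
    return {std::string::npos, std::string::npos};
}
```

Scanning from the end matters for chained subscripts: in grid[row][col], the outer expression's brackets are the last pair, while the first pair belongs to the nested grid[row] expression, which gets its own visit.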
Overloaded operators
CXXOperatorCallExpr nodes represent calls to overloaded operators.
The implementation largely follows the same structure as for built-in operators:
We use getOperatorSpelling() to retrieve the operator symbol and annotate it with function-operator to match our handling of overloaded operator declarations from earlier.
Overloaded array subscript operators are handled separately from other overloaded operators, as these require two annotations instead of one.
Similar to what we did when annotating ArraySubscriptExpr nodes in the previous section, we tokenize the source range of the function call and manually annotate both the [ and ] tokens with the function-operator tag.
Note that overloaded operators in template contexts (particularly with fold expressions) can be challenging to annotate, because the operator cannot be resolved until the template is instantiated. One possible solution is to iterate through the tokens of a template function definition and annotate those that match operator spellings. However, this is difficult to automate, as C++ provides a lot of flexibility when it comes to defining custom operators. Because of this, I decided to annotate these manually. I prefer this approach, as I don’t use many fold expressions in my code.
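A small example shows where the ambiguity comes from. The names below (Meters, sum) are hypothetical and only serve the illustration. It assumes C++17 for the fold expression.

```cpp
// Hypothetical type with an overloaded operator+.
struct Meters {
    int value;
};

constexpr Meters operator+(Meters lhs, Meters rhs) {
    return Meters{lhs.value + rhs.value};
}

// The "+" in this fold has no fixed meaning inside the template definition:
// it is the built-in operator for sum(1, 2, 3) and the overloaded operator
// for sum(Meters{1}, Meters{2}). Which one applies is only known once the
// template is instantiated.
template <typename... Ts>
constexpr auto sum(Ts... values) {
    return (values + ...);
}
```

The same token would therefore need binary-operator in one instantiation and function-operator in another, which is why a single automatic annotation for the template definition is hard to justify.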
User-defined literal operators
UserDefinedLiteral nodes represent user-defined literal operators.
We’ll annotate these to match the type of literal they’re applied to.
Unlike built-in operators, we need to retrieve the operator name from the function declaration:
We get the function declaration through getCalleeDecl() and strip the operator"" prefix to get the actual suffix used in the code.
The annotation of the operator should match the underlying literal type.
Rather than relying on getLiteralOperatorKind() (which can be misleading for template-based operators), we parse the token directly:
We can do this because literal operators can only be applied to integer, floating-point, character, and string literals.
If the token containing our operator suffix contains quotation marks, the operator is annotated as a string - otherwise, we know the operator applies to a number.
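The rule itself is small enough to sketch with a plain string. The helper name and the annotation names string and number are assumptions made for this illustration. This sketch also checks only for double quotes, so that digit separators such as 1'000ms are not mistaken for quoted text; character literals would need their own check, which is not shown here.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: choose the annotation for a user-defined literal
// suffix from the spelling of the whole literal token (literal plus suffix).
std::string literal_suffix_annotation(const std::string& token) {
    assert(!token.empty());
    // A double quote means the suffix is attached to a string literal.
    // Assumption: single quotes are ignored so that digit separators
    // (1'000ms) still count as numbers.
    const bool quoted = token.find('"') != std::string::npos;
    return quoted ? "string" : "number";
}
```

With this rule, 200ms and 1.5s are classified as numbers, while "hello"s is classified as a string.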
An alternative approach uses the getLiteralOperatorKind() function, which returns a category corresponding to the function signature of the operator according to the specification:
However, this approach has some unexpected drawbacks.
For example, the C++ std::chrono library does not provide an overload to resolve 200ms into a function that accepts an integer.
Instead, the following overload is called (accepting a variadic list of characters as the digits of the number):
Why is this implemented in such a way? Well, I’m not sure.
Using this approach categorizes 200ms as a string of characters, incorrectly marking the ms as a string instead of a number.
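To see why a character-pack signature still belongs to a numeric literal, consider the following sketch of a numeric literal operator template. The _count suffix is hypothetical and is not taken from any standard library; it only demonstrates the mechanism.

```cpp
#include <cstddef>

// Hypothetical suffix "_count": a numeric literal operator template.
// The compiler passes each character of the literal's spelling as a template
// argument, so 200_count instantiates operator""_count<'2', '0', '0'>().
template <char... Digits>
constexpr std::size_t operator""_count() {
    return sizeof...(Digits);
}

// The literal is still written as a number in the source, even though the
// operator receives it as a pack of characters.
static_assert(200_count == 3, "characters: 2, 0, 0");
static_assert(3.14_count == 4, "characters: 3, ., 1, 4");
```

A classification based on the operator's signature sees only the character pack, whereas the token in the source is plainly a number, which is what the token-based check relies on.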
We’ll also need some special handling for annotating function declarations of literal operators.
Currently, our VisitFunctionDecl visitor incorrectly annotates the operator"" portion of the function name in addition to the operator itself.
We’ll fix this by checking for and handling this case explicitly:
We use the getLiteralIdentifier() function to check if the function declaration refers to a literal operator.
Using the returned IdentifierInfo struct, we can query the name of the operator using the getName() function and search for it in the source range of the node.
Unfortunately, there is no direct way to retrieve the location of the operator itself, so we’ll resort to manually searching for a token that matches the name of the operator using the tokenization approach.
One small caveat here is that literal operators are one of the few kinds of functions whose name may contain a space.
Names that contain no space (for example operator""ms) will combine the quotes with the name of the function into the same token.
This must also be accounted for so that only the operator name is annotated.
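Handling both spellings could look roughly like the sketch below. It works on plain token strings rather than Clang tokens, and suffix_offset_in_token is a hypothetical helper introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: given one token from a literal operator declaration
// and the suffix name, return the offset of the suffix inside that token,
// or std::string::npos if the token is not the one carrying the suffix.
std::size_t suffix_offset_in_token(const std::string& token, const std::string& suffix) {
    assert(!suffix.empty());
    const std::string quotes = "\"\"";

    // Spelled with a space (operator"" ms): the suffix is its own token.
    if (token == suffix) {
        return 0;
    }
    // Spelled without a space (operator""ms): the quotes and the suffix share
    // one token, so the annotation must start after the two quote characters.
    if (token == quotes + suffix) {
        return quotes.size();
    }
    return std::string::npos;
}
```

Adding the returned offset to the token's start position gives the location at which the function annotation should begin, so the quotes are never included.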
Literal operator declarations, as with declarations for other functions, are annotated with the function annotation.
Functional-style variable declarations
In most other syntax highlighters, variable declarations using functional-style initialization are incorrectly highlighted as function calls. This likely occurs because functions are identified based on the presence of parentheses.
We can fix this by implementing a visitor for VarDecl nodes, which represent variable declarations and definitions.
The key is the isDirectInit() check, which helps identify variables using functional-style initialization.
We annotate these as plain tokens to prevent them from being highlighted as function calls.
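For reference, the two look-alike forms are shown side by side in the sketch below. The type and function names are hypothetical and chosen only to mirror the kind of code discussed in this post.

```cpp
// Hypothetical type and function, used only to show the two similar forms.
struct Vector3 {
    float x, y, z;
    constexpr Vector3(float x, float y, float z) : x(x), y(y), z(z) {}
};

constexpr float dot(const Vector3& a, const Vector3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

constexpr float example() {
    Vector3 p1(1.0f, 2.0f, 3.0f);  // variable declaration: p1 is a variable name
    Vector3 p2(4.0f, 5.0f, 6.0f);  // variable declaration: p2 is a variable name
    return dot(p1, p2);            // function call: dot is a function name
}
```

All three lines in example have the shape name(arguments), but only the last is a call. A highlighter that looks for an identifier followed by parentheses cannot tell them apart, whereas the AST distinguishes a variable declaration from a call expression.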
With this visitor implemented, functional-style variable declarations are properly handled:
Styling
The final step is to add definitions for the various CSS styles for the different kinds of function annotations:
The plain CSS style is language-agnostic, and provides the default style to use for tokens in code blocks.
We’ve added support for annotating function declarations, definitions, calls, and several kinds of operators. We also improved the consistency of our syntax highlighting by overriding annotations on functional-style variable initializations. In the <LocalLink text={"next post"} to={"Better C++ Syntax Highlighting - Part 5: Classes"}>, we’ll take a deeper look at annotating the different components of classes: declarations, static and class member variables, constructor initializer lists, and type aliases. Thanks for reading!