Better C++ Syntax Highlighting - Part 10: Keywords

The final piece of the annotation puzzle is language keywords. While it’s possible to detect keywords by adding specific AST visitors, like VisitIfStmt for if or VisitWhileStmt for while statements, this approach quickly becomes tedious due to the sheer number of keywords (and therefore visitors) we would need to implement. A simpler, more reliable approach is to classify keywords at the tokenization stage.

Approach 1: Using the Lexer

When tokenizing the source file before AST traversal, we can check each token against a list of known C++ keywords and annotate any matches with the keyword tag:

cpp
void Tokenizer::tokenize() {
    // ...
    static const std::unordered_set<std::string> keywords = {
        "alignas",
        "alignof",
        "and",
        "and_eq",
        "asm",
        "auto",
        // ...
    };

    // Tokenize with raw lexer
    clang::Lexer lexer { ... };
    clang::Token token;
    while (true) {
        lexer.LexFromRawLexer(token);
        if (token.is(clang::tok::eof)) {
            break;
        }

        clang::SourceLocation location = token.getLocation();
        std::string spelling = clang::Lexer::getSpelling(token, source_manager, options);
        unsigned line = source_manager.getSpellingLineNumber(location);
        unsigned column = source_manager.getSpellingColumnNumber(location);
        bool is_keyword = keywords.contains(spelling);

        m_tokens.emplace_back(spelling, line, column, is_keyword);
    }
}

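LexFromRawLexer() does not run the preprocessor, so tokens in preprocessor directives are kept unmodified. In raw mode the lexer also doesn’t classify keywords itself (they come back as plain raw identifiers), which is why we check each spelling against the list ourselves. After processing the AST, all tokens identified as keywords are annotated with the keyword tag: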

cpp
void Consumer::visit_keywords() {
    for (auto it = m_tokenizer->begin(); it != m_tokenizer->end(); ++it) {
        const Token& token = *it;
        if (token.is_keyword) {
            m_annotator->insert_annotation("keyword", token.line, token.column, token.spelling.length());
        }
    }
}

The visit_keywords() function is called at the end of HandleTranslationUnit(). This method is simple, fast, and extensible: supporting C++ keywords from new standards simply means adding any missing keywords to the keywords set in the tokenizer.
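To make the lookup step concrete without pulling in Clang, here is a minimal, standard-library-only sketch. Everything in it is a stand-in I made up for illustration: SketchToken, is_cpp_keyword() and scan_words() are hypothetical names, the keyword list is heavily abbreviated, and the toy scanner only splits out identifier-like words. Unlike the real lexer, it would also flag keywords inside comments and string literals, so treat it as a picture of the set lookup and the line/column bookkeeping, not as a replacement for clang::Lexer.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the Token type used in this post.
struct SketchToken {
    std::string spelling;
    unsigned line;
    unsigned column;
    bool is_keyword;
};

// Abbreviated keyword list (an assumption for this sketch; the real list is much longer).
inline bool is_cpp_keyword(std::string_view spelling) {
    static const std::unordered_set<std::string_view> keywords = {
        "auto", "break", "const", "else", "for", "if", "int", "return", "while",
    };
    return keywords.count(spelling) != 0;
}

// Toy scanner: collects identifier-like words with 1-based line/column positions.
// Punctuation is skipped, and words that start with a digit are treated as numbers.
inline std::vector<SketchToken> scan_words(std::string_view source) {
    std::vector<SketchToken> tokens;
    unsigned line = 1;
    unsigned column = 1;
    std::size_t i = 0;
    const auto is_word_char = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
    };
    while (i < source.size()) {
        const char c = source[i];
        if (c == '\n') {
            ++line;
            column = 1;
            ++i;
            continue;
        }
        if (is_word_char(c)) {
            const std::size_t start = i;
            const unsigned start_column = column;
            while (i < source.size() && is_word_char(source[i])) {
                ++i;
                ++column;
            }
            if (std::isdigit(static_cast<unsigned char>(source[start])) == 0) {
                const std::string_view word = source.substr(start, i - start);
                tokens.push_back({std::string(word), line, start_column, is_cpp_keyword(word)});
            }
            continue;
        }
        ++i;
        ++column;
    }
    return tokens;
}
```

Calling scan_words() on a small snippet yields one SketchToken per word, with is_keyword set only for spellings found in the set. Supporting a newer standard would mean adding entries to that set, exactly as with the real tokenizer.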

Approach 2: Using libclang

libclang is the official C interface to Clang. Unlike Clang’s C++ API, which is more powerful but volatile across versions, libclang offers a stable - though simplified - way to interact with the Clang AST. While it doesn’t expose all the richness of Clang’s internals, it provides more than enough functionality for tasks like annotating language keywords.

Initial setup

The setup process with libclang mirrors the process of setting up an ASTFrontendAction. First, we create an index, which represents a set of translation units that could be compiled or linked together:

cpp
CXIndex index = clang_createIndex(0, 0);

Next, we load the translation unit for the file we’re processing using clang_createTranslationUnitFromSourceFile():

cpp
std::vector<const char*> compilation_flags { ... };
CXTranslationUnit translation_unit = clang_createTranslationUnitFromSourceFile(
    index,
    filepath.c_str(),
    compilation_flags.size(),
    compilation_flags.data(),
    0,
    nullptr);

The CXTranslationUnit holds the parsed AST. From here, we are ready to start traversing.

Traversing the AST

The AST consists of a set of cursors, which represent elements in the source code. Cursors are the libclang equivalent of AST nodes in the C++ API.

We start by retrieving the root cursor for the translation unit and visiting its descendants, using an explicit stack in place of recursion:

cpp
std::stack<CXCursor> cursors;
cursors.push(clang_getTranslationUnitCursor(translation_unit));
while (!cursors.empty()) {
    CXCursor cursor = cursors.top();
    cursors.pop();

    // Visitor logic goes here
    // ...

    // Visit children
    clang_visitChildren(cursor, [](CXCursor child, CXCursor /* parent */, CXClientData user_data) -> CXChildVisitResult {
        static_cast<std::stack<CXCursor>*>(user_data)->push(child);
        return CXChildVisit_Continue;
    }, &cursors);
}
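One detail of the explicit-stack loop above is worth seeing in isolation: because the most recently pushed child is popped first, siblings are visited in reverse source order. The sketch below reproduces the same loop shape on a toy tree using only the standard library. Node and visit_order() are hypothetical names for this illustration and have nothing to do with libclang's types.

```cpp
#include <stack>
#include <string>
#include <vector>

// Toy stand-in for a cursor: a name plus child nodes.
struct Node {
    std::string name;
    std::vector<Node> children;
};

// Same loop shape as the libclang traversal: pop a node, "visit" it,
// then push all of its children onto the stack.
inline std::vector<std::string> visit_order(const Node& root) {
    std::vector<std::string> order;
    std::stack<const Node*> pending;
    pending.push(&root);
    while (!pending.empty()) {
        const Node* node = pending.top();
        pending.pop();
        order.push_back(node->name);  // visitor logic would go here
        for (const Node& child : node->children) {
            pending.push(&child);  // the last child pushed is the first one popped
        }
    }
    return order;
}
```

For a root with children a and b, where a has a child a1, this visits the root, then b, then a, then a1. For keyword annotation the order probably doesn't matter, since each annotation carries its own line and column, but it is something to keep in mind if a visitor ever depends on seeing siblings in source order.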

The libclang API defines various CXCursorKind values for identifying different cursor types. While this list isn’t as exhaustive as the C++ API, it’s enough to get a high-level overview of the AST structure.

Before inserting any annotations, we filter for only those cursors that originate from the file we are annotating. This is equivalent to the isInMainFile() check from the C++ API:

cpp
CXSourceLocation location = clang_getCursorLocation(cursor);
CXFile file;
clang_getSpellingLocation(location, &file, nullptr, nullptr, nullptr);
CXString filename = clang_getFileName(file);
const char* name = clang_getCString(filename);
// name may be null if the cursor has no associated file
if (name && strcmp(name, filepath.c_str()) == 0) {
    // Cursor is part of "main" file
    // ...
}
// Cleanup
clang_disposeString(filename);

Annotating keywords

To annotate keywords, we tokenize the source range of each cursor and tag any token of kind CXToken_Keyword as a keyword:

cpp
CXSourceRange extent = clang_getCursorExtent(cursor);
unsigned num_tokens;
CXToken* tokens;
clang_tokenize(translation_unit, extent, &tokens, &num_tokens);
for (unsigned i = 0; i < num_tokens; ++i) {
    const CXToken& token = tokens[i];
    CXTokenKind kind = clang_getTokenKind(token);
    if (kind == CXToken_Keyword) {
        CXString spelling = clang_getTokenSpelling(translation_unit, token);
        CXSourceLocation location = clang_getTokenLocation(translation_unit, token);
        unsigned line, column;
        // Only the line and column are needed here, so the file is not requested
        clang_getSpellingLocation(location, nullptr, &line, &column, nullptr);
        m_annotator->insert_annotation("keyword", line, column, strlen(clang_getCString(spelling)));
        clang_disposeString(spelling);
    }
}
clang_disposeTokens(translation_unit, tokens, num_tokens);

clang_tokenize() returns all tokens within the cursor’s extent. We retrieve the location at which to annotate keywords using the CXSourceLocation of the token.
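One thing to watch for here: a cursor's extent encloses the extents of its children, so tokenizing every cursor separately can report the same keyword more than once. Whether the annotator already tolerates duplicate insertions isn't stated above, so the following is only a sketch of one possible guard, assuming duplicates do need filtering. SeenLocations and first_visit() are hypothetical names, not part of libclang or this project.

```cpp
#include <set>
#include <utility>

// Remembers which (line, column) positions have already been annotated.
class SeenLocations {
public:
    // Returns true the first time a position is offered, false on repeats.
    bool first_visit(unsigned line, unsigned column) {
        return m_seen.emplace(line, column).second;
    }

private:
    std::set<std::pair<unsigned, unsigned>> m_seen;
};
```

In the loop above, the insert_annotation() call could then be wrapped in a check such as if (seen.first_visit(line, column)), with a single SeenLocations instance shared across all cursors.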

Note that, as with many C-style APIs, resources like strings and token buffers must be explicitly freed after use.
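If the manual dispose calls start to pile up, one common C++ pattern is a small scope guard that runs a cleanup callable when it goes out of scope. The sketch below is a generic, standard-library-only version; ScopeExit is a hypothetical helper written for this illustration (it is not something libclang provides), and the libclang usage is shown only as a comment.

```cpp
#include <utility>

// Runs the stored callable exactly once, when the guard is destroyed.
template <typename F>
class ScopeExit {
public:
    explicit ScopeExit(F cleanup) : m_cleanup(std::move(cleanup)) {}
    ~ScopeExit() { m_cleanup(); }

    // Non-copyable, so the cleanup cannot run twice.
    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;

private:
    F m_cleanup;
};

// Hypothetical usage with the calls from this post (not compiled here):
//   CXString spelling = clang_getTokenSpelling(translation_unit, token);
//   ScopeExit release_spelling{[&] { clang_disposeString(spelling); }};
//   // ... use spelling; it is disposed on every exit path from the scope
```

The benefit over explicit calls is that early returns and exceptions no longer skip the cleanup. For a handful of strings the explicit version shown above is arguably just as readable.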

Cleanup

Once we’ve finished annotating, the translation unit and index must be disposed of as well:

cpp
clang_disposeTranslationUnit(translation_unit);
clang_disposeIndex(index);

Styling

This section would not be complete without the definitions for the keyword CSS style:

css
.language-cpp .keyword {
    color: rgb(206, 136, 70);
}

Fun fact: This project originally started using libclang. At the time, I wasn’t fully aware of how limited its introspection capabilities were compared to Clang’s full C++ API. Once I began running into roadblocks - missing AST nodes, incomplete type information, and limited traversal flexibility - I decided to migrate the project to the C++ API. That said, libclang still has its place. For lightweight tooling tasks, it’s hard to beat in terms of simplicity and ease of use.

Of course, switching to the C++ API came with a much steeper learning curve. Setting up the project and understanding how the API exposes information took time, but once things clicked, it became fun to experiment and see what worked. One of the biggest challenges early on was figuring out how to access the specific symbols I wanted to annotate. That led to the creation of core helper utilities like the Tokenizer, which helped bridge the gaps in the raw AST traversal and introduced common annotation patterns that we used across the majority of visitors. The amount of information Clang exposes is really exciting, and I constantly found myself coming up with ideas for mini tools I could integrate into my other projects. Even now, this project only scratches the surface of what Clang’s AST offers, which speaks volumes about the depth and richness of the tooling available. I encourage you to try messing around with the API yourself! There’s something in there for everyone.

After a long journey, it’s finally time to bring everything together! Below is the fully annotated version of the example from the very first post in this series:

[Fully annotated example: the listing is collapsed in the original post]

There’s also something satisfying about seeing the raw annotated code. It’s a great way to appreciate how all the visitors work together behind the scenes:

[Raw annotated code: the listing is collapsed in the original post]

That’s all for this project! As mentioned in one of the earlier posts, this project remains an eternal work in progress. I tried to cover visitors for as many different language features as I could find, but I’m sure I missed some. If any additions fit nicely, I’ll update the posts accordingly. If not, the full project source is available here for you to explore and extend.

It’s also entirely possible (read: likely) that future Clang releases will render some of these approaches obsolete or overly complicated. Or they’ll break entirely.

Maybe the real visitors were the friends we made along the way. Until next time!

Series: Better C++ Syntax Highlighting - Part 10 of 10