TypeScript Contribution Diary: Identifiers after Numeric Literals

I’ve started making contributions to TypeScript, working my way up from small error reporting nitpicks to (hopefully!) real syntactic and control flow improvements. These blog posts are documentation of the steps I took to figure out what to do for these contributes & then do them. If you’re thinking of contributing to TypeScript or other large open source libraries, I hope seeing how to delve into a giant foreign monolith is of some help!

Meme of drake gesturing away from a first panel and happily pointing towards a second. First: 'User reported bugs or feature suggestions'. Second: 'ECMAScript spec syntax errata'.

Admittedly, this contribution is not the most impactful.

Problem Statement It’s been a while since I poked into TypeScript and I had some free time, so I wanted to challenge myself. This contribution involved going a bit deeper into TypeScript parsing and scanning. To find the issue, I had searched for issues with both the Effort: Moderate and help wanted tags, then sorted by oldest (issues query link). One from 2015 seemed interesting:

#4702: IdentifierStart cannot occur following a NumericLiteral

Rephrasing the issue post: according to the ECMA specification, section 11.8.3 — NumericLiterals, the SourceCharacter after a NumericLiteral must not be an IdentifierStart. For example:

3in: should become an error, not be considered the same as 3 in (referring to checking whether 3 exists in something)
let hasFour = 3in[null, null, null, null]; as an expansion of ☝

tl;dr

You need a space between number literals (3) and identifiers or keywords (a, in). 3in is bad; 3 in is allowed.

Terminology

What on Earth do those terms mean?

SourceCharacter: Skimming through the ECMA specification with Ctrl+F to find other places it’s mentioned, it seem to refer to the characters in the source code — makes sense, given the name.
NumericLiteral: ts.SyntaxKind.NumericLiteral refers to numbers you type out, such as 123 (regular floats), 1_234_567 (floats with separators), and 1234n (BigInts). A “literal” generally refers to built-in primitive values such as booleans, numbers, or strings.
IdentifierStart: Generally refers to the name of something, such as a variable or class. Let’s interpret that in this context to mean any name is either some built-in keyword (such as in) or an identifier (such as a).

For NumericLiteral and IdentifierStart, you can see the equivalent NumericLiteral and Identifier nodes by copy & pasting into ts-ast-viewer:

let value = 7;

It shows that value is an Identifier node (in this case, the name of a variable) and 7 is a NumericLiteral.

If you copy a simple example like 3in[null] into ts-ast-explorer, it shows a tree with nodes (and syntax highlighting) equivalent to 3 in [null]. Looks like the goal for this issue is to add a check after the numeric literal to make sure there’s not what could be the start of an identifier.

Enter the Compiler

How does TypeScript parse in a source file and convert that into an AST?

According to the Basarat Gitbook, that’s the well-named src/parser.ts and src/scanner.ts in the TypeScript source.

The parser is the driving force to understand a source file. It uses the scanner to read in nodes, then puts them together as an AST.
The scanner is used by the parser each time a new node needs to be read into the tree and parsed. It keeps track of the source text and character position within it.

Starting Point

How does TypeScript launch the parsing and scanning of a file?

Skimming through parser.ts in order, we see…

A few type declarations, such as enum Signature Flags
Public node utilities such as visitNodes and forEachChild
A createSourceFile function that takes file name, text, and some other things to create a SourceFile

…that seems promising! It also has some performance.markers around parsing, which we can assume help log timing around how long important operations take. Seems like a good place to look into. In order, this function:

Calls to Parser.parseSourceFile (that’s a good sign too!), which…
Calls parseSourceFileWorker for non-JSON files, which…
Creates a new blank sourceFile with createSourceFile, processes pragmas, creates sourceFile.statements with parseList, and sets some metadata on the source file.

I don’t know or care about pragmas or miscellaneous source file metadata, but based on AST viewing websites, the statements property contains the root-level nodes of the file. parseList must therefore be the place that creates these nodes, so it must call things that parse nodes.

parseList takes in a parseElement lambda that it passes to parseListElement, which will either call some consumeNode function or parseElement.

Question: which is relevant? Do we care about consumeNode or parseElement?

Aside: Debugging

At this point, someone capable with JavaScript tooling might attach debugger breakpoints and debug through TypeScript to answer that question.

Not me!

I just wanted to quickly modify the compiled TypeScript code and re-run it to print out which things are being used without having to worry about setting up debugging (about which I am clueless). My setup instead consisted of:

C:/Code/typescript: A Git cloned repository for navigating the source. I’d made it globally npm linked with npm link and built with npm run build.
C:/Code/tsdevtest: Stub project with a minimal package.json and index.ts. I’d symlinked it to TypeScript with npm link typescript.

index.ts contained a simple test case:

let hasFour = 3in[null]

With this setup, running tsc index.ts in C:/Code/tsdevtest uses C:/Code/typescript/lib/tsc.js.

I also added the following line at the beginning of createSourceFile:

global.wat = fileName === "index.ts";

That line creates a global variable named wat which is only true for files named index.ts. Later on in the code, running wat && console.log(":)") would only print while parsing index.ts.

This is useful because TypeScript will also compile files containing built-in types, such as lib.d.ts. Running with --noLib removes them but results in “Cannot find global type Array” and similar errors if parsing succeeds.

Back to the Hunt

…anyway, we wanted to see how nodes are parsed out of the source file, and were looking at parseListElement to see which sub-function it calls. I put a wat && console.log("consumeNode") before return consumeNode(node); and a wat && console.log("parseElement") before return parseElement();.

$Running 'type index.ts & tsc index.ts' in C:\Code\tsdevtest. There are three output lines of parseElement.$ Three parseElements and no consumeNodes.

😲 Only parseElement was being printed, so only parseStatement was being called to. It contains a switch statement that, depending on the value of token(), can call a bunch of different parse* functions, or default to parseExpressionOrLabeledStatement.

The token function returns a variable named currentToken, of type SyntaxKind, so it seems safe to assume this is the current syntax kind being parsed. It’s parsed with scanner.scan, which we know from the Basarat Gitbook Scanner page starts at the starting index of a node and finds the end of it. scan contains another switch statement over a ch variable, which is the character code of the SourceCharacter being looked at.

Given that the relevant contents for parsing start with 3in, ch would have to be "3".charCodeAt(0), or 51, for the start of parsing 3 as a NumericLiteral. tsc.js showed a case 51: as compiled from an equivalent line in scanner.ts: case CharacterCodes._3:. That line calls to scanNumber.

…so the function that scans a number to be parsed is scanNumber. 💪

Number Scanning

How does scanNumber scan source code into NumericLiterals?

scanNumber was about 50 lines of real code and kind of hard to understand. My first guesses squinting through its first three root-level if statements were that they, in order:

Check if the number starts with a ., such as with .5
Check if the number is a “scientific” number, meaning it starts with E or e, such as with 1e3 (1000)
Check if the number has a _ separator

The last if/else pair checks whether the number is a decimal or scientific:

If either, it returns a numeric literal
If neither, it checks whether it’s a BigInt before returning

Awesome! Here is the one place where a NumericLiteral may be created. Just be sure, I added a wat && console.log("Result:", result) at the end:

$Running 'type index.ts & tsc index.ts' in C:\Code\tsdevtest. There is three output line, Result: 3.$ Result: 3.

Confirmed. Great.

Checking for Identifiers

How can we know whether the next token is an identifier?

In theory, we could check if the character code of the next thing is an a-Z letter, but there are almost certainly esoteric rules around that. How does the scanner scan in an identifier, rather than a numeric literal? Presumably there must be some internal TypeScript logic that checks whether we’re starting with an identifier?

Going back to scan, the default case for any character code not handled as a special character immediately calls to an isIdentifierStart function. isIdentifierStart checks if the character code is within A-Z or a-z, is $ or _ (which are valid name starters), is a high-valued ASCII character, or is a unicode identifier. I don’t know what those last two mean but that looks like the right place!

Reinforced Number Scanning

Per a previous post of mine that included how to add diagnostics, I made a new diagnostic:

"An identifier cannot follow a numeric literal.": {
	"category": "Error",
	"code": 1351
},

…and in scanNumber, before the if/else returns, added:

if (isIdentifierStart(text.charCodeAt(pos), languageVersion)) {
	error(Diagnostics.An_identifier_cannot_follow_a_numeric_literal, pos, 1);
}

Upon running tsc index.ts, voila! The error showed!

$Running 'type index.ts & tsc index.ts' in C:\Code\tsdevtest. An error, 'An identifier cannot follow a numeric literal.', is reported on the 'i' after the 3 in let hasFour = 3in[null]$ It never stops feeling amazing to give myself errors.

Test Time

Per my previous post on verifying parser changes, TypeScript has a “baselines” system where you check in source files with expected types and errors. I added a tests/cases/compiler/identifierStartAfterNumericLiteral.ts to explicitly test these changes:

let valueIn = 3in[null]

…but after re-running jake baseline, there were new errors found in a bunch of files, such as bigIntIndex.errors.txt. 😢

…which reminded me, BigInts such as 1n are valid numbers I should verify still work. Adding a 3in[null], 3nin[null], 1e9, and 1n to index.ts:

$Running 'type index.ts & tsc index.ts' in C:\Code\tsdevtest. Four raw logs: 3in[null], 3nin[null], 1e9, 1n. Three 'An identifier cannot follow a numeric literal' errors on the character after a number: 3in[null], 3nin[null], 1n.$ Errors for bigints! That's not good.

😕. It looks like TypeScript thought in the second and third cases that n was starting a new identifier instead of whatever was after the number.

A few minutes of cursing and poking around scanNumber later, I noticed that checkBigIntSuffix increments pos for the case of it finding an n suffix. Brain blast! 💡 Aha! We can’t know what the end position of the number is until after we’ve scanned past any last n; checking for a subsequent IdentifierStart therefore has to come after.

I extracted the new logic into a checkForIdentifierAfterNumericLiteral method and called it immediately before each return in scanNumber.

$Running 'type index.ts & tsc index.ts' in C:\Code\tsdevtest. Six raw logs: 3in[null], 3nin[null], 1e9, 1n, 1e9in[null], 1nin[null]. Four 'An identifier cannot follow a numeric literal' errors on the character after a number: 3in[null], 3nin[null], 1e9in[null], 1nin[null].$ ✨ It all works! ✨

The Pull Request

Added error for IdentifierStart immediately after a NumericLiteral by Josh Goldberg · Pull Request #28857:

A couple pieces of feedback were given on the pull request:

The whole identifier, not the first character of it, should be complained on.
The error message was a little vague.

Sounds reasonable. I updated the error message pretty simply. Expanding the red squiggly was a little less straightforward…

Full Identifier Complaints

In order to have the error messages’ span length contain the full identifier, it’s necessary to know how long the identifier is. That necessitates reading in the identifier using whatever logic TypeScript would normally use.

I didn’t already know where to find this, but numbers are scanned in with scanNumber, it made sense to first check for scanIdentifier.

…which doesn’t exist…

…but a scanIdentifierParts does exist, which consists of a while loop that continuously advances as long as it finds characters that match a function named isIdentifierPart or some complicated things following a character matching CharacterCodes.backslash. Seems legit.

One issue with scanning for an identifier is that the scanIdentifierParts function updates pos normally, even though we didn’t want to actually skip parsing the new identifier. I solved this by keeping track of pos’ original value and restoring it after the function was called.

I added a few test cases for identifiers of longer lengths and updated the PR.

$Running 'type index.ts & tsc index.ts' in C:\Code\tsdevtest. Five raw logs: 12334in[null]; 1234of[null]; 1234nof[null]; 1234na[null]; 1234abcd[null];. Five subsequent 'An identifier cannot follow a numeric literal' errors on the character before [.$ Available in the TypeScript nightlies: npm i -g typescript@next

The PR was merged in and a followup issue filed to help improve the diagnostic messages even more. Hooray! 🎉

Maybe I’ll tackle that other issue another day…

Key Takeaways

It’s totally fine, sometimes even better, to use console.log debugging in a pinch! (though I love debuggers for longer use)
Understand your edge cases: numbers are complicated.
Pay attention to your error messages.

It’s worth it to help your users understand their exact errors. Thanks danielrossenwaser and sheetalkamat for the rapid reviews!

A Note on Streamlining

It was commented to me after my last two posts that the progression of steps documented in them could make it appear that I effortlessly glided through the investigations linearly, without any real “I’m stumped.” moments. Not so! I intentionally omit dead ends in the from these blog posts because they would be seven times as long. I have little to no idea what I’m doing. 💯