A High Level Architecture of the TypeScript compiler
Notes on how the TypeScript compiler works. Resources I took the notes from:
Program
The project coordinator:
Read the TSConfig
: setup the program, get starting filesPre-process files
: follow imports to discover all possible filesTokenize and Parse
: convert text to a syntax treeBinder
: convert identifiers in syntax tree to symbolsType Check
: use binder and syntax tree to look for issues in codeTransform
: changes the syntax tree to match tsconfig optionsEmit
: prints the syntax tree into .js, .d.ts, and other files
Let's see how that API looks like:
import * as ts from 'typescript';
// code to syntax tree
const program = ts.createProgram(files, opts);
// binding and type checking
const checker = program.getTypeChecker();
// syntax tree to code
program.emit();
The steps can broken down into three pieces
- Source Code to Data: Tokenize and parsing phases
- Type Checking: binding and type checking phases
- Creating Files: emitting files into the disk
Source code to data
Converting code to a syntax tree
function welcome(str: string) {
console.log(str);
}
const msg: string = 'Hello World!';
welcome(msg); // Hello World!
It creates the syntax tree using a scanner and a parser.
Scanner
The scanner
receives the text (source code) and outputs a sequence of tokens: text -> tokens
const msg: string = 'Hello World';
It transforms this source code into tokens like this:
ConstKeyword
WhitespaceTrivia
Identifier
ColonToken
WhitespaceTrivia
StringKeyword
WhitespaceTrivia
EqualsToken
WhitespaceTrivia
...
The scanner
also has diagnostics, an expressive way to show users their JavaScript/TypeScript code is invalid, has errors, or is missing anything.
No closing quote to the open quote:
const noEnd = '; // => Unterminated string literal.
An invalid character:
const ๐ฏ๐ต = 'Japan Flag'; // Invalid character
Numeric separators can't be consecutive:
const num = 1__0; // Multiple consecutive numeric separators are not permitted.
And so on.
Parser
The parser
gets the tokens generated by the scanner
and creates the syntax tree: tokens -> AST
The AST is a series of Node
s. Each Node
holds its position, the syntax kind, the value, etc.
A simple example of a syntax tree is the representation of a variable declaration. Given this source code:
const num = 10;
The parser
will generate this syntax tree:
SourceFile
- VariableStatement
- VariableDeclarationList
- VariableDeclaration
- Identifier
- NumericLiteral
- EndOfFileToken
The VariableStatement
has a list of VariableDeclaration
, which has the Identifier
(num
) and the NumericLiteral
(10
) as children.
If we add the a number
type to the num
, the AST will have new child Node
.
const num: number = 10;
The syntax will look like this:
SourceFile
- VariableStatement
- VariableDeclarationList
- VariableDeclaration
- Identifier
- NumberKeyword // --> this node represents the `number` type
- NumericLiteral
- EndOfFileToken
There's a cool AST playground we can use to have a better understanding on how the source code and the generated syntax tree are related one to another.
The parser
diagnostics shows the "right JavaScript code in the wrong place".
Private identifiers outside of a class:
#constant = 10; // The left-hand side of an assignment expression must be a variable or a property access.
BigInt
should be integers not decimals:
const decimalBigInt = 1.2n; // A bigint literal must be an integer.
Reserved words in JavaScript:
const extends = 'extends'; // 'extends' is not allowed as a variable declaration name.
And so on.
Type Checking
Binder
The binder
transforms the syntax into symbols: syntax -> symbols
It produces symbols in the form of symbol tables. It keeps track of the declaration(s) of identifiers, so if the checker wants to find the type of an expression, it can look up the identifier in the symbol table, get the declaration, and find its type annotation or initial value.
Processing code, there're different scopes. Let's see an example
const msg: string = 'Hello World!';
function welcome(str: string) {
console.log(str);
}
welcome(msg);
For each scope, we have different symbols:
- In the global scope, we have the
msg
and thewelcome
variables. - In the function scope, we have the
str
variable.
Symbols are tables for each scope in the program to store identifier with its metadata (where it was declared โ line โ and its flag).
Getting the syntax tree from the previous source code, the binder
generates these symbol tables:
Global Scope:
msg
:- declared line 0
- flags:
BlockScopedVariable
welcome
:- declared line 6
- flags:
Function
welcome Function Scope:
parent
: Global Scopestr
:- declared line 2
- flags:
BlockScopedVariable
And the binder
tries to keep track of the identifiers across files.
Another responsibility of the binder
is to structure the flow nodes. It creates the flow graph to represent each scope and its types. It has different "containers" and each container could have different types for the same variable. Let's see an example.
function logValue(x: string | number) {
if (typeof x === 'string') {
console.log('string', x);
} else {
console.log('number', x);
}
x;
}
logValue
function containerx
with typestring | number
- "if condition is true" container
x
with typestring
- "if condition is false" container
x
with typenumber
x
again with typestring | number
- it should check all the flow nodes because it's possible
x
has been changed by different conditionals- return statements
- mutation
- casting
- it should check all the flow nodes because it's possible
The binder
diagnostics show errors like:
The impossibility to declare a variable with the same naming it was used before for another variable:
const num = 1;
const num = 2; // Cannot redeclare block-scoped variable 'num'.
When identifiers are duplicated:
class User {}
type User = {}; // Duplicate identifier 'User'.
Checker
There're three main systems: how it checks, comparing types, and the inference system.
The checker
checks each part of the syntax tree.
If we have a simple variable statement:
const msg: string = 'test';
It results into this syntax tree:
SourceFile
- VariableStatement
- VariableDeclarationList
- VariableDeclaration
- Identifier
- StringKeyword
- StringLiteral
- EndOfFileToken
And the checker
checks each node of the syntax tree.
checker
.checkSourceElementWorker()
.checkVariableStatement()
.checkGrammarVariableDeclarationList()
.checkVariableDeclaration()
.checkVariableLikeDeclaration()
.checkTypeAssignableToAndOptionallyElaborate()
.isTypeRelatedTo(identifier, stringLiteral);
And the checker
considers the source type and target string as the same type and returns true
.
A more complex type checking analysis is when you need to check object type. e.g.
{ hello: number } = { hello: 'world' }
- both are an object type
- compare each object "field" (object's keys)
- the return type (the value): are they assignable?
- if fails the check
Another example:
Promise<string> = Promise<{ hello: string }>;
- both are a
Promise
object - compare the type arguments inside the promise
- the string is not assignable to the object and it fails the check
The initializer inference
For type inference, the syntax tree, rather than providing the StringKeyword
, there's no type given:
const msg = 'test';
SourceFile
- VariableStatement
- VariableDeclarationList
- VariableDeclaration
- Identifier
// - No type given
- StringLiteral
- EndOfFileToken
So the type checker
tries to understand the syntax shape to fill in the gaps: From StringLiteral
, it understands that the type is a string (StringKeyword
) and move that into the type.
Type parameter inference is a bit different. Take this code as an example:
declare function setup<T>(config: { initial(): T }): T;
For type parameter inference, it needs to be defined further down. For the type checker to understand or to infer the type of T
, you should write the function passing an object with the a initial function which returns a value and the type checker will "infer" that the value's type is T
.
Generic Function
- Generic Arg
T
- Return value is
T
- Param is an object which has a function
initial
that returns aT
And you write the actual setup
function call:
const test = setup({ initial: () { return 'test' }});
Now the return type of your implementation (string type) will be moved to the T
type.
{ initial: (): T } = { initial: () { return 'test' }}
So now the type looks like this for this specific function:
declare function setup<string>(config: { initial(): string }): string;
Emitting files
Emitter
It transforms the syntax tree into files: syntax tree -> files
- What to emit? e.g.
*.js
,*.map
,*.d.ts
- Printer: syntax tree to text
- Tracking temp vars
- Transformers: syntax tree to syntax tree
Tranforming the syntax tree into a new syntax tree to remove static types. Basically transforming TypeScript into legitimate JavaScript.
SourceFile SourceFile
- VariableStatement - VariableStatement
- VariableDeclarationList - VariableDeclarationList
- VariableDeclaration ---> - VariableDeclaration
- Identifier - Identifier
- StringKeyword // removed type
- StringLiteral - StringLiteral
- EndOfFileToken - EndOfFileToken
The emitter
remove the types, emits JavaScript output and .dts
files (type check them).