A Look at Lexical Analysis in TypeScript #TypeScript

Examining the TypeScript source code

This article looks at the lexical analysis (scanner) part of the TypeScript compiler.

Source file examined

src/compiler/scanner.ts
https://github.com/microsoft/TypeScript/blob/main/src/compiler/scanner.ts

・ Lexical analysis in TypeScript

Characters

A program's source code is a sequence of individual characters
(letters, digits, whitespace, symbols, and so on).

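Every character has its own numeric code.
For example, in the widely used Unicode standard, "A" is U+0041 (0x41 in hexadecimal, 65 in decimal; UTF-8 encodes it as the single byte 0x41).
To test whether a character is an uppercase English letter, check whether its code lies between the code for "A" and the code for "Z".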
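
The range check described above can be sketched as follows. The helper name `isUpperAsciiLetter` is hypothetical and exists only for illustration; it is not taken from scanner.ts.

```typescript
// Hypothetical helper for illustration; not part of scanner.ts.
function isUpperAsciiLetter(ch: number): boolean {
    // "A" is 0x41 (65) and "Z" is 0x5A (90).
    return ch >= 0x41 && ch <= 0x5A;
}

const codeOfA = "A".charCodeAt(0);
console.log(codeOfA.toString(16));                   // "41"
console.log(isUpperAsciiLetter(codeOfA));            // true
console.log(isUpperAsciiLetter("a".charCodeAt(0)));  // false
```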

In TypeScript, these codes are defined in the CharacterCodes enum.

/* Excerpt only */
/** @internal */
export const enum CharacterCodes {
    // ... null character, line breaks, and so on
    nullCharacter = 0,
    maxAsciiCharacter = 0x7F,

    lineFeed = 0x0A,              // \n
    carriageReturn = 0x0D,        // \r
    lineSeparator = 0x2028,
    paragraphSeparator = 0x2029,
    nextLine = 0x0085,
    
    // ... digits
    _0 = 0x30,
    _1 = 0x31,
    _2 = 0x32,
    _3 = 0x33,
    _4 = 0x34,

    // ... letters
    a = 0x61,
    b = 0x62,
    c = 0x63,
    d = 0x64,
    e = 0x65,
    f = 0x66,
    
    // ... symbols
    ampersand = 0x26,             // &
    asterisk = 0x2A,              // *
    at = 0x40,                    // @
    backslash = 0x5C,             // \
    // ...
}

Detecting digits

A character is a digit if its code lies between "0" (0x30) and "9" (0x39).

function isDigit(ch: number) {
    // TODO(cyrusn): Find a way to support this for unicode digits.
    return ch >= CharacterCodes._0 && ch <= CharacterCodes._9;
}
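
A scanner usually applies a predicate like this in a loop to consume a run of digits. Below is a minimal sketch of that idea. `scanDigitRun` is a hypothetical helper, and the real scanner's number handling covers much more (separators, hex literals, exponents, and so on).

```typescript
function isAsciiDigit(ch: number): boolean {
    return ch >= 0x30 && ch <= 0x39; // "0" through "9"
}

// Hypothetical helper: returns the run of ASCII digits starting at `pos`.
function scanDigitRun(text: string, pos: number): string {
    const start = pos;
    while (pos < text.length && isAsciiDigit(text.charCodeAt(pos))) {
        pos++;
    }
    return text.substring(start, pos);
}

console.log(scanDigitRun("123abc", 0)); // "123"
console.log(scanDigitRun("abc", 0));    // ""
```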

Detecting line breaks

Four line break characters are supported (LF, CR, LS, and PS).

export function isLineBreak(ch: number): boolean {
    // ES5 7.3:
    // The ECMAScript line terminator characters are listed in Table 3.
    //     Table 3: Line Terminator Characters
    //     Code Unit Value     Name                    Formal Name
    //     \u000A              Line Feed               <LF>
    //     \u000D              Carriage Return         <CR>
    //     \u2028              Line separator          <LS>
    //     \u2029              Paragraph separator     <PS>
    // Only the characters in Table 3 are treated as line terminators. Other new line or line
    // breaking characters are treated as white space but not as line terminators.

    return ch === CharacterCodes.lineFeed ||
        ch === CharacterCodes.carriageReturn ||
        ch === CharacterCodes.lineSeparator ||
        ch === CharacterCodes.paragraphSeparator;
}
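
One typical use of a line break predicate is building a table of the offsets where each line starts. The following is a rough sketch under my own assumptions: `lineStartOffsets` is a hypothetical name, and treating "\r\n" as a single break is a choice made for this example.

```typescript
function isLineBreakCode(ch: number): boolean {
    return ch === 0x0A || ch === 0x0D || ch === 0x2028 || ch === 0x2029;
}

// Hypothetical helper: returns the offset at which each line begins.
function lineStartOffsets(text: string): number[] {
    const starts: number[] = [0];
    for (let pos = 0; pos < text.length; pos++) {
        const ch = text.charCodeAt(pos);
        if (!isLineBreakCode(ch)) continue;
        // Treat "\r\n" as a single line break.
        if (ch === 0x0D && text.charCodeAt(pos + 1) === 0x0A) pos++;
        starts.push(pos + 1);
    }
    return starts;
}

console.log(lineStartOffsets("a\nb\r\nc\u2028d")); // [0, 2, 5, 7]
```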

Detecting whitespace

export function isWhiteSpaceLike(ch: number): boolean {
    return isWhiteSpaceSingleLine(ch) || isLineBreak(ch);
}

/** Does not include line breaks. For that, see isWhiteSpaceLike. */
export function isWhiteSpaceSingleLine(ch: number): boolean {
    // Note: nextLine is in the Zs space, and should be considered to be a whitespace.
    // It is explicitly not a line-break as it isn't in the exact set specified by EcmaScript.
    return ch === CharacterCodes.space ||
        ch === CharacterCodes.tab ||
        ch === CharacterCodes.verticalTab ||
        ch === CharacterCodes.formFeed ||
        ch === CharacterCodes.nonBreakingSpace ||
        ch === CharacterCodes.nextLine ||
        ch === CharacterCodes.ogham ||
        ch >= CharacterCodes.enQuad && ch <= CharacterCodes.zeroWidthSpace ||
        ch === CharacterCodes.narrowNoBreakSpace ||
        ch === CharacterCodes.mathematicalSpace ||
        ch === CharacterCodes.ideographicSpace ||
        ch === CharacterCodes.byteOrderMark;
}
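
As a rough illustration of how such a predicate gets used, the sketch below skips leading whitespace before a token. It checks only a small subset of the characters listed above to keep it short, and both function names are hypothetical.

```typescript
// Simplified subset of the whitespace set (an assumption for brevity).
function isSimpleWhiteSpace(ch: number): boolean {
    return ch === 0x20      // space
        || ch === 0x09      // tab
        || ch === 0x0A      // line feed
        || ch === 0x0D      // carriage return
        || ch === 0x3000    // ideographic space
        || ch === 0xFEFF;   // byte order mark
}

// Hypothetical helper: index of the first non-whitespace character at or after `pos`.
function skipSimpleWhiteSpace(text: string, pos: number): number {
    while (pos < text.length && isSimpleWhiteSpace(text.charCodeAt(pos))) {
        pos++;
    }
    return pos;
}

console.log(skipSimpleWhiteSpace(" \t\n let x", 0)); // 4
console.log(skipSimpleWhiteSpace("\u3000x", 0));     // 1
```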

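Validating variable names

JavaScript imposes several rules on variable names (identifiers).
In the TypeScript source, isUnicodeIdentifierStart and isUnicodeIdentifierPart
check whether a character may start an identifier or appear inside one.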

/** @internal */ export function isUnicodeIdentifierStart(code: number, languageVersion: ScriptTarget | undefined) {
    return languageVersion! >= ScriptTarget.ES2015 ?
        lookupInUnicodeMap(code, unicodeESNextIdentifierStart) :
        languageVersion === ScriptTarget.ES5 ? lookupInUnicodeMap(code, unicodeES5IdentifierStart) :
            lookupInUnicodeMap(code, unicodeES3IdentifierStart);
}

function isUnicodeIdentifierPart(code: number, languageVersion: ScriptTarget | undefined) {
    return languageVersion! >= ScriptTarget.ES2015 ?
        lookupInUnicodeMap(code, unicodeESNextIdentifierPart) :
        languageVersion === ScriptTarget.ES5 ? lookupInUnicodeMap(code, unicodeES5IdentifierPart) :
            lookupInUnicodeMap(code, unicodeES3IdentifierPart);
}

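The set of characters allowed in identifiers is large, so it is stored as a flat array of [start, end] code ranges instead of a list of every character, which saves memory. Lookups then use binary search over those ranges.

const unicodeES5IdentifierStart = [170, 170, 181, /* ... omitted */]
const unicodeES5IdentifierPart = [170, 170, 181, /* ... omitted */]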

function lookupInUnicodeMap(code: number, map: readonly number[]): boolean {
    // Bail out quickly if it couldn't possibly be in the map.
    if (code < map[0]) {
        return false;
    }

    // Perform binary search in one of the Unicode range maps
    let lo = 0;
    let hi: number = map.length;
    let mid: number;

    while (lo + 1 < hi) {
        mid = lo + (hi - lo) / 2;
        // mid has to be even to catch a range's beginning
        mid -= mid % 2;
        if (map[mid] <= code && code <= map[mid + 1]) {
            return true;
        }

        if (code < map[mid]) {
            hi = mid;
        }
        else {
            lo = mid + 2;
        }
    }

    return false;
}
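
To make the range-pair layout easier to follow, here is a self-contained sketch with a tiny made-up table. It is a simplified variant of the same idea, not the scanner's code: `sampleRanges` and `inRanges` are hypothetical names, and the table holds only three ranges.

```typescript
// Hypothetical table: flat [start, end] pairs, sorted ascending.
// Covers "A"-"Z", "a"-"z", and the single code 0xAA.
const sampleRanges: readonly number[] = [0x41, 0x5A, 0x61, 0x7A, 0xAA, 0xAA];

function inRanges(code: number, map: readonly number[]): boolean {
    let lo = 0;
    let hi = map.length;
    while (lo < hi) {
        // Pick the middle pair; pair indices are always even.
        const pair = lo + Math.floor((hi - lo) / 4) * 2;
        if (code < map[pair]) {
            hi = pair;          // search the pairs before this one
        }
        else if (code > map[pair + 1]) {
            lo = pair + 2;      // search the pairs after this one
        }
        else {
            return true;        // start <= code <= end
        }
    }
    return false;
}

console.log(inRanges("q".charCodeAt(0), sampleRanges)); // true
console.log(inRanges("5".charCodeAt(0), sampleRanges)); // false
```

A single-character range such as 0xAA is stored as the pair (0xAA, 0xAA), which matches the repeated 170, 170 at the start of the real tables.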

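By combining isUnicodeIdentifierStart and isUnicodeIdentifierPart (through the isIdentifierStart and isIdentifierPart wrappers used below), the scanner can determine whether a whole variable name is valid.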

/** @internal */
export function isIdentifierText(name: string, languageVersion: ScriptTarget | undefined, identifierVariant?: LanguageVariant): boolean {
    let ch = codePointAt(name, 0);
    if (!isIdentifierStart(ch, languageVersion)) {
        return false;
    }

    for (let i = charSize(ch); i < name.length; i += charSize(ch)) {
        if (!isIdentifierPart(ch = codePointAt(name, i), languageVersion, identifierVariant)) {
            return false;
        }
    }

    return true;
}
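
The loop above advances by the size of each code point, not by one code unit. The following sketch shows the same shape with deliberately simplified rules (ASCII letters, "$", "_", and digits only). All names here are hypothetical stand-ins, and the real rules also accept many non-ASCII characters.

```typescript
// Reads a full code point, joining a surrogate pair when one is present.
function codePointAtIndex(text: string, i: number): number {
    const hi = text.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < text.length) {
        const lo = text.charCodeAt(i + 1);
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
        }
    }
    return hi;
}

// Code points at or above 0x10000 occupy two UTF-16 code units.
function charSizeOf(ch: number): number {
    return ch >= 0x10000 ? 2 : 1;
}

// Simplified rules (assumption): ASCII letters, "$" and "_" may start a name.
function isAsciiIdStart(ch: number): boolean {
    return (ch >= 0x41 && ch <= 0x5A) || (ch >= 0x61 && ch <= 0x7A)
        || ch === 0x24 || ch === 0x5F;
}

// ASCII digits are additionally allowed after the first character.
function isAsciiIdPart(ch: number): boolean {
    return isAsciiIdStart(ch) || (ch >= 0x30 && ch <= 0x39);
}

function isSimpleIdentifier(name: string): boolean {
    if (name.length === 0) return false;
    let ch = codePointAtIndex(name, 0);
    if (!isAsciiIdStart(ch)) return false;
    for (let i = charSizeOf(ch); i < name.length; i += charSizeOf(ch)) {
        ch = codePointAtIndex(name, i);
        if (!isAsciiIdPart(ch)) return false;
    }
    return true;
}

console.log(isSimpleIdentifier("foo1")); // true
console.log(isSimpleIdentifier("1foo")); // false
```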

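Wrapping up

Looking inside the TypeScript we use every day turned out to be surprisingly interesting.
Next time I would like to look at checker.ts, including how it analyzes errors.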