More than 3 years have passed since last update.

本家Rustコンパイラのソースを読もうとしてみる（2）

Rust

Last updated at 2020-11-19Posted at 2020-08-16

前回までのあらすじ

Guide to Rustc Developmentという便利なガイドに沿って、Rustの本家コンパイラであるrust-lang/rustのソースコードを読んでみようという試み。今回は「Source Code Representations」から。どこまで読めるかは自分にも分かりません。

目次（初回）

（2020/9/2追記）第3回を書いている途中でソースのディレクトリ構成が大きく変わってしまったので、この記事に書いてあるファイルのパスは最新のものとは異なっています。ディレクトリ名は大体推測可能なので大きな問題にはならないと思いますが、第3回の記事にもう少し情報を記載しています。

Source Code Representations

ここでは与えられたソースコードがコンパイラ内部でどのように変換され、表現されていくのかを中心に説明されている。ガイドがそうであるように、あまり細かいロジックには深入りせず、大きな流れを把握することを目的に見ていこう。

コンパイラの入り口

rustc_driverというクレートがコンパイラの入り口のようだ。補足として、ソースのlibrustc_xxxディレクトリにあるのがrustc_xxxクレート、という命名規則になっている。

実質的にはrustc_driverのrun_compiler()から処理が開始され、まずコマンドライン引数などを解析、その後rustc_interfaceの同名の関数run_compiler()を起動し、コンパイラ処理を開始するという流れのようだ。

librustc_driver/lib.rs

// Parse args and run the compiler. This is the primary entry point for rustc.
// The FileLoader provides a way to load files from sources other than the file system.
pub fn run_compiler(
    at_args: &[String],
    callbacks: &mut (dyn Callbacks + Send),
    file_loader: Option<Box<dyn FileLoader + Send + Sync>>,
    emitter: Option<Box<dyn Write + Send>>,
) -> interface::Result<()> {
    let mut args = Vec::new();
    for arg in at_args {
        match args::arg_expand(arg.clone()) {
            Ok(arg) => args.extend(arg),
            Err(err) => early_error(
                ErrorOutputType::default(),
                &format!("Failed to load argument file: {}", err),
            ),     
        }     
    }     
    let diagnostic_output =
        emitter.map(|emitter| DiagnosticOutput::Raw(emitter)).unwrap_or(DiagnosticOutput::Default);
    let matches = match handle_options(&args) {
        Some(matches) => matches,
        None => return Ok(()),
    };
...
    interface::run_compiler(config, |compiler| {
...
    })
}

ソースファイルからtoken列まで

前回確認したように、ソースファイル→token列→ASTという流れがあるはずなので、それを念頭に見ていこう。

しかしtokenの処理はちょっと追いづらい。文字列をtoken列に変換する処理ではrustc_lexerが中心的な役割を担っているものの、一部の処理がrustc_parse::lexerにあったり、そこで使われる定義がrustc_astにあったりする。ドキュメントを見る感じだと、rustc_lexerは単独で他の用途に再利用できるように設計されているのだが、ここでの解析ではコンパイラが使うには十分でないため、rustc_parse::lexerで追加の解析を行うような設計になっているようだ。

まず、分かりやすいparserの入り口から見ていく。parse_crate_from_file()でparserを生成し、そのメソッドであるparse_crate_mod()によってASTを生成し、値を返しているように見える。

librustc_parse/lib.rs

pub fn parse_crate_from_file<'a>(input: &Path, sess: &'a ParseSess) -> PResult<'a, ast::Crate> {
    let mut parser = new_parser_from_file(sess, input, None);
    parser.parse_crate_mod()
}

parserの生成について、parse_crate_from_file()を少し追っていくと、maybe_source_file_to_parser()という関数にあるmaybe_file_to_stream()でソースファイルからtoken列の変換を行っているようだ。

librustc_parse/lib.rs

/// Given a `source_file` and config, return a parser. Returns any buffered errors from lexing the
/// initial token stream.
fn maybe_source_file_to_parser(
    sess: &ParseSess,
    source_file: Lrc<SourceFile>,
) -> Result<Parser<'_>, Vec<Diagnostic>> {
    let end_pos = source_file.end_pos;
    let (stream, unclosed_delims) = maybe_file_to_stream(sess, source_file, None)?;
    let mut parser = stream_to_parser(sess, stream, None);
    parser.unclosed_delims = unclosed_delims;
    if parser.token == token::Eof {
        parser.token.span = Span::new(end_pos, end_pos, parser.token.span.ctxt());
    }

    Ok(parser)
}

maybe_file_to_stream()の実装についてもう少し見ていこう。ソースファイルはStringReaderという形式に変換され、into_token_trees()のメソッドが実行される。into_token_trees()の内部では、さらにTokenTreesReaderという形式に変換され、parse_all_token_trees()のメソッドが実行される。このメソッドを追っていくと、next_token()というメソッドに行き着く。

librustc_parse/lib.rs

// parsing the token stream.
pub fn maybe_file_to_stream(
    sess: &ParseSess,
    source_file: Lrc<SourceFile>,
    override_span: Option<Span>,
) -> Result<(TokenStream, Vec<lexer::UnmatchedBrace>), Vec<Diagnostic>> {
    let srdr = lexer::StringReader::new(sess, source_file, override_span);
    let (token_trees, unmatched_braces) = srdr.into_token_trees();

    match token_trees {
        Ok(stream) => Ok((stream, unmatched_braces)),
        Err(err) => {
            let mut buffer = Vec::with_capacity(1);
            err.buffer(&mut buffer);
            // Not using `emit_unclosed_delims` to use `db.buffer`
            for unmatched in unmatched_braces {
                if let Some(err) = make_unclosed_delims_error(unmatched, &sess) {
                    err.buffer(&mut buffer);
                }
            }
            Err(buffer)
        }
    }
}

librustc_parse/lexer/tokentrees.rs

impl<'a> StringReader<'a> {
    crate fn into_token_trees(self) -> (PResult<'a, TokenStream>, Vec<UnmatchedBrace>) {
        let mut tt_reader = TokenTreesReader {
            string_reader: self,
            token: Token::dummy(),
            joint_to_prev: Joint,
            open_braces: Vec::new(),
            unmatched_braces: Vec::new(),
            matching_delim_spans: Vec::new(),
            last_unclosed_found_span: None,
            last_delim_empty_block_spans: FxHashMap::default(),
            matching_block_spans: Vec::new(),
        };
        let res = tt_reader.parse_all_token_trees();
        (res, tt_reader.unmatched_braces)
    }
}

librustc_parse/lexer/mod.rs

impl<'a> StringReader<'a> {
...
    /// Returns the next token, including trivia like whitespace or comments.
    pub fn next_token(&mut self) -> Token {
        let start_src_index = self.src_index(self.pos);
        let text: &str = &self.src[start_src_index..self.end_src_index];

        if text.is_empty() {
            let span = self.mk_sp(self.pos, self.pos);
            return Token::new(token::Eof, span);
        }

        {
            let is_beginning_of_file = self.pos == self.start_pos;
            if is_beginning_of_file {
                if let Some(shebang_len) = rustc_lexer::strip_shebang(text) {
                    let start = self.pos;
                    self.pos = self.pos + BytePos::from_usize(shebang_len);

                    let sym = self.symbol_from(start + BytePos::from_usize("#!".len()));
                    let kind = token::Shebang(sym);

                    let span = self.mk_sp(start, self.pos);
                    return Token::new(kind, span);
                }
            }
        }

        let token = rustc_lexer::first_token(text);

        let start = self.pos;
        self.pos = self.pos + BytePos::from_usize(token.len);

        debug!("try_next_token: {:?}({:?})", token.kind, self.str_from(start));

        let kind = self.cook_lexer_token(token.kind, start);
        let span = self.mk_sp(start, self.pos);
        Token::new(kind, span)
    }
...
}

ここまではrustc_parse::lexerで処理が進んできたが、このnext_token()でrustc_lexerのfirst_token()が呼ばれ、その先のadvance_token()で文字列からtokenへの変換が行われる。（詳細な処理は割愛したが、文字列の先頭を見てtoken種別に当てはめる処理が並んでいる。）

librustc_lexer/src/lib.rs

/// Parses the first token from the provided input string.
pub fn first_token(input: &str) -> Token {
    debug_assert!(!input.is_empty());
    Cursor::new(input).advance_token()
}
...
impl Cursor<'_> {
    /// Parses a token from the input string.
    fn advance_token(&mut self) -> Token {
        let first_char = self.bump().unwrap();
        let token_kind = match first_char {
...       （ここでtoken化）
        };
        Token::new(token_kind, self.len_consumed())
    }
...
}

再びnext_token()に戻ると、first_token()によってtokenが得られた後cook_lexer_token()という関数を呼んでいる。コメントにあるように、どうもここで
rustc_lexer::TokenKindからよりリッチなrustc_ast::TokenKindへの変換を行っているようだ。

librustc_parse/lexer/mod.rs

mpl<'a> StringReader<'a> {
...
    /// Turns simple `rustc_lexer::TokenKind` enum into a rich
    /// `librustc_ast::TokenKind`. This turns strings into interned
    /// symbols and runs additional validation.
    fn cook_lexer_token(&self, token: rustc_lexer::TokenKind, start: BytePos) -> TokenKind {
        match token {
...
        }
    }
...
}

token列からASTまで

再度parserの入り口に戻り、今度はparse_crate_mod()を深掘りしていこう。

librustc_parse/lib.rs

pub fn parse_crate_from_file<'a>(input: &Path, sess: &'a ParseSess) -> PResult<'a, ast::Crate> {
    let mut parser = new_parser_from_file(sess, input, None);
    parser.parse_crate_mod()
}

parse_crate_mod()はast::Crateを返しており、その要素となるMod（モジュール）はparse_mod()により与えられている。さらにそのModはparse_item()が返すItemをベクタの要素として持っている。AST（のルート）はクレートごとに作られ、クレートにはモジュールがあり、モジュールはアイテムという大きい単位で構成されているという感じだろうか。

librustc_parse/parser/item.rs

impl<'a> Parser<'a> {
    /// Parses a source module as a crate. This is the main entry point for the parser.
    pub fn parse_crate_mod(&mut self) -> PResult<'a, ast::Crate> {
        let lo = self.token.span;
        let (module, attrs) = self.parse_mod(&token::Eof)?;
        let span = lo.to(self.token.span);
        let proc_macros = Vec::new(); // Filled in by `proc_macro_harness::inject()`.
        Ok(ast::Crate { attrs, module, span, proc_macros })
    }
...
    /// Parses the contents of a module (inner attributes followed by module items).
    pub fn parse_mod(&mut self, term: &TokenKind) -> PResult<'a, (Mod, Vec<Attribute>)> {
        let lo = self.token.span;
        let attrs = self.parse_inner_attributes()?;
        let module = self.parse_mod_items(term, lo)?;
        Ok((module, attrs))
    }

    /// Given a termination token, parses all of the items in a module.
    fn parse_mod_items(&mut self, term: &TokenKind, inner_lo: Span) -> PResult<'a, Mod> {
        let mut items = vec![];
        while let Some(item) = self.parse_item()? {
            items.push(item);
            self.maybe_consume_incorrect_semicolon(&items);
        }

        if !self.eat(term) {
            let token_str = super::token_descr(&self.token);
            if !self.maybe_consume_incorrect_semicolon(&items) {
                let msg = &format!("expected item, found {}", token_str);
                let mut err = self.struct_span_err(self.token.span, msg);
                err.span_label(self.token.span, "expected item");
                return Err(err);
            }
        }

        let hi = if self.token.span.is_dummy() { inner_lo } else { self.prev_token.span };

        Ok(Mod { inner: inner_lo.to(hi), items, inline: true })
    }
}

アイテムについてもう少し詳しく見ていこう。parse_item()を追っていくと、parse_item_kind()で大きなif/else if式があり、ここで先頭キーワード（use、fn、externなど）をもとにした大きな分類を行っている。この大きな分類がItemという単位のようだ。

librustc_parse/parser/item.rs

impl<'a> Parser<'a> {
...
    /// Parses one of the items allowed by the flags.
    fn parse_item_kind(
        &mut self,
        attrs: &mut Vec<Attribute>,
        macros_allowed: bool,
        lo: Span,
        vis: &Visibility,
        def: &mut Defaultness,
        req_name: ReqName,
    ) -> PResult<'a, Option<ItemInfo>> {
        let mut def = || mem::replace(def, Defaultness::Final);

        let info = if self.eat_keyword(kw::Use) {
            // USE ITEM
            let tree = self.parse_use_tree()?;
            self.expect_semi()?;
            (Ident::invalid(), ItemKind::Use(P(tree)))
        } else if self.check_fn_front_matter() {
            // FUNCTION ITEM
            let (ident, sig, generics, body) = self.parse_fn(attrs, req_name)?;      
            (ident, ItemKind::Fn(def(), sig, generics, body))
        } else if self.eat_keyword(kw::Extern) {  
...         （キーワードのelse ifが続く）
        } else {
            return Ok(None);
        };
        Ok(Some(info))
    }
...
}

これ以降は基本的にItemごとの処理が適用されていくことになる。ここでは、プログラムにおいて大半を占めるであろう関数関連の処理を見ていこう。関数の本体（{}で囲まれたブロック）は、基本的にstatement（文）の集まりであり、statementはexpression（式）をベースとして構成されていることを念頭に置いておく。

処理はparse_fn()、parse_fn_body()と続いていき、parse_block_tail()でstatementを集めているようだ（コードは記載してません）。そのまま追っていくと、parse_stmt_without_recovery()でstatementの分類が行われているように見える。

librustc_parse/parser/stmt.rs

impl<'a> Parser<'a> {
...
    fn parse_stmt_without_recovery(&mut self) -> PResult<'a, Option<Stmt>> {
        maybe_whole!(self, NtStmt, |x| Some(x));
    
        let attrs = self.parse_outer_attributes()?;
        let lo = self.token.span;

        let stmt = if self.eat_keyword(kw::Let) {
            self.parse_local_mk(lo, attrs.into())?
        } else if self.is_kw_followed_by_ident(kw::Mut) {
...         （キーワードのelse ifが続く）
        } else if self.token != token::CloseDelim(token::Brace) {
            // Remainder are line-expr stmts.
            let e = self.parse_expr_res(Restrictions::STMT_EXPR, Some(attrs.into()))?;
            self.mk_stmt(lo.to(e.span), StmtKind::Expr(e))
        } else {
            self.error_outer_attrs(&attrs);
            return Ok(None);
        };
        Ok(Some(stmt))
    }
...
}

そして処理はexpressionの解析へと続いていく。このあたりから条件ごとの枝分かれが激しくなってくるので確信も薄くなってくるのだが、大抵の処理（いわゆる普通の演算とか）はparse_expr_res()に流れていく感じだろうか。

その先を追っていくと、parse_assoc_expr_with()、parse_prefix_expr()、parse_bottom_expr()あたりでexpressionの大まかな分類を行っているようだ。それぞれ、2項演算子、単項演算子、制御などのキーワード（if、for、whileなど）を主に解析しているように見える。

librustc_parse/parser/expr.rs

impl<'a> Parser<'a> {
...
    /// Parses an associative expression with operators of at least `min_prec` precedence.
    pub(super) fn parse_assoc_expr_with(
        &mut self,
        min_prec: usize,
        lhs: LhsExpr,
    ) -> PResult<'a, P<Expr>> {
...
        while let Some(op) = self.check_assoc_op() {
...
            lhs = match op {
                AssocOp::Add
                | AssocOp::Subtract
                | AssocOp::Multiply
                | AssocOp::Divide
                | AssocOp::Modulus
                | AssocOp::LAnd
                | AssocOp::LOr
                | AssocOp::BitXor
                | AssocOp::BitAnd
                | AssocOp::BitOr
                | AssocOp::ShiftLeft
                | AssocOp::ShiftRight
                | AssocOp::Equal
                | AssocOp::Less
                | AssocOp::LessEqual
                | AssocOp::NotEqual
                | AssocOp::Greater
                | AssocOp::GreaterEqual => {
                    let ast_op = op.to_ast_binop().unwrap();
                    let binary = self.mk_binary(source_map::respan(cur_op_span, ast_op), lhs, rhs);
                    self.mk_expr(span, binary, AttrVec::new())
                }
                AssocOp::Assign => {
                    self.mk_expr(span, ExprKind::Assign(lhs, rhs, cur_op_span), AttrVec::new())
                }
                AssocOp::AssignOp(k) => {
                    let aop = match k {
                        token::Plus => BinOpKind::Add,
                        token::Minus => BinOpKind::Sub,
                        token::Star => BinOpKind::Mul,
                        token::Slash => BinOpKind::Div,
                        token::Percent => BinOpKind::Rem,
                        token::Caret => BinOpKind::BitXor,
                        token::And => BinOpKind::BitAnd,
                        token::Or => BinOpKind::BitOr,
                        token::Shl => BinOpKind::Shl,
                        token::Shr => BinOpKind::Shr,
                    };
                    let aopexpr = self.mk_assign_op(source_map::respan(cur_op_span, aop), lhs, rhs);
                    self.mk_expr(span, aopexpr, AttrVec::new())
                }
                AssocOp::As | AssocOp::Colon | AssocOp::DotDot | AssocOp::DotDotEq => {
                    self.span_bug(span, "AssocOp should have been handled by special case")
                }
            };

...
        Ok(lhs)
    }
...
    /// Parses a prefix-unary-operator expr.                                  
    fn parse_prefix_expr(&mut self, attrs: Option<AttrVec>) -> PResult<'a, P<Expr>> {
        let attrs = self.parse_or_use_outer_attributes(attrs)?;
        self.maybe_collect_tokens(!attrs.is_empty(), |this| {
            let lo = this.token.span;
            // Note: when adding new unary operators, don't forget to adjust TokenKind::can_begin_expr()
            let (hi, ex) = match this.token.uninterpolate().kind {
                token::Not => this.parse_unary_expr(lo, UnOp::Not), // `!expr`
                token::Tilde => this.recover_tilde_expr(lo),        // `~expr`
                token::BinOp(token::Minus) => this.parse_unary_expr(lo, UnOp::Neg), // `-expr`
                token::BinOp(token::Star) => this.parse_unary_expr(lo, UnOp::Deref), // `*expr`
                token::BinOp(token::And) | token::AndAnd => this.parse_borrow_expr(lo),
                token::Ident(..) if this.token.is_keyword(kw::Box) => this.parse_box_expr(lo),
                token::Ident(..) if this.is_mistaken_not_ident_negation() => {
                    this.recover_not_expr(lo)
                }
                _ => return this.parse_dot_or_call_expr(Some(attrs)),
            }?;
            Ok(this.mk_expr(lo.to(hi), ex, attrs))
        })
    }
...
    /// At the bottom (top?) of the precedence hierarchy,
    /// Parses things like parenthesized exprs, macros, `return`, etc.
    ///
    /// N.B., this does not parse outer attributes, and is private because it only works
    /// correctly if called from `parse_dot_or_call_expr()`.
    fn parse_bottom_expr(&mut self) -> PResult<'a, P<Expr>> {
...
        if let token::Literal(_) = self.token.kind {
...
        } else if self.eat_keyword(kw::If) {
            self.parse_if_expr(attrs)
        } else if self.check_keyword(kw::For) {
            if self.choose_generics_over_qpath(1) {
                // NOTE(Centril, eddyb): DO NOT REMOVE! Beyond providing parser recovery,
                // this is an insurance policy in case we allow qpaths in (tuple-)struct patterns.
                // When `for <Foo as Bar>::Proj in $expr $block` is wanted,
                // you can disambiguate in favor of a pattern with `(...)`.
                self.recover_quantified_closure_expr(attrs)
            } else {
                assert!(self.eat_keyword(kw::For));
                self.parse_for_expr(None, self.prev_token.span, attrs)
            }
        } else if self.eat_keyword(kw::While) {
...
        } else {
            self.parse_lit_expr(attrs)
        }
    }
...
}

ここまで見てきた関係をざっくりまとめると、こんな感じになる。

crate > module > item > block > statement > expression

今回は関数の本体を例として見たのでこのような感じだが、場合によってはitemの下にblockがないようなパターンなどもあり得るだろう。（例えば、constでグローバルな定数を定義したりとか。もちろん他にもいろいろなパターンがありそう。）

マクロの展開

上記ASTの生成後、マクロの展開が行われる。ガイドによると、マクロの展開はrustc_expandで行われ、特に中心的な役割を担うのがfully_expand_fragment()というメソッドのようだ。ただ、Rustのマクロは複雑で一筋縄ではいかなそうなので、あまり深追いはしないでおこうと思う。

最初にASTを生成する段階ではマクロは展開せず、単純に「マクロ呼び出し」として分類することで印だけ付けているような状態だ。処理としては以下のparse_path_start_expr()あたりと思われる。

librustc_parse/parser/expr.rs

impl<'a> Parser<'a> {
...
    fn parse_path_start_expr(&mut self, attrs: AttrVec) -> PResult<'a, P<Expr>> {
        let path = self.parse_path(PathStyle::Expr)?;     
        let lo = path.span;     
     
        // `!`, as an operator, is prefix, so we know this isn't that.
        let (hi, kind) = if self.eat(&token::Not) {
            // MACRO INVOCATION expression
            let mac = MacCall {
                path,
                args: self.parse_mac_args()?,
                prior_type_ascription: self.last_type_ascription,
            };
            (self.prev_token.span, ExprKind::MacCall(mac))
        } else if self.check(&token::OpenDelim(token::Brace)) {
            if let Some(expr) = self.maybe_parse_struct_expr(&path, &attrs) {
                return expr;
            } else {
                (path.span, ExprKind::Path(None, path))    
            }
        } else {
            (path.span, ExprKind::Path(None, path))
        };

        let expr = self.mk_expr(lo.to(hi), kind, attrs);
        self.maybe_recover_from_bad_qpath(expr, true)
    }
...
}

AST生成後、rustc_expandのexpand_crate()が呼ばれ、そこからfully_expand_fragment()が呼び出される。fully_expand_fragment()はAstFragmentというASTの断片を返すようになっており、この断片でもとのASTにおけるマクロ呼び出し箇所を置き換えることで、マクロが展開されたASTを生成している（たぶん）。

librustc_expand/expand.rs

impl<'a, 'b> MacroExpander<'a, 'b> {
...
    pub fn expand_crate(&mut self, mut krate: ast::Crate) -> ast::Crate {
...
        match self.fully_expand_fragment(krate_item).make_items().pop().map(P::into_inner) {
            Some(ast::Item { attrs, kind: ast::ItemKind::Mod(module), .. }) => {
                krate.attrs = attrs;
                krate.module = module;
            }
            None => {
                // Resolution failed so we return an empty expansion
                krate.attrs = vec![];
                krate.module = ast::Mod { inner: orig_mod_span, items: vec![], inline: true };
            }
            Some(ast::Item { span, kind, .. }) => {
                krate.attrs = vec![];
                krate.module = ast::Mod { inner: orig_mod_span, items: vec![], inline: true };
                self.cx.span_err(
                    span,
                    &format!(
                        "expected crate top-level item to be a module after macro expansion, found {} {}",
                        kind.article(), kind.descr()
                    ),
                );
            }
        };
        self.cx.trace_macros_diag();
        krate
    }
...
    // Recursively expand all macro invocations in this AST fragment.
    pub fn fully_expand_fragment(&mut self, input_fragment: AstFragment) -> AstFragment {
...
    }
...
}

つづく

ざっくりだがAST生成のところまで流れを追うことができた。次回は「Source Code Representations」の続きで、HIRやMIRの生成部分を見ていきたい。

本家Rustコンパイラのソースを読もうとしてみる（3）←次回

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up