The MIR (Mid-level IR)
MIR is Rust's Mid-level Intermediate Representation. It is constructed from HIR. MIR was introduced in RFC 1211. It is a radically simplified form of Rust that is used for certain flow-sensitive safety checks – notably the borrow checker! – and also for optimization and code generation.
If you'd like a very high-level introduction to MIR, as well as some of the compiler concepts that it relies on (such as control-flow graphs and desugaring), you may enjoy the rust-lang blog post that introduced MIR.
Introduction to MIR
MIR is defined in the compiler/rustc_middle/src/mir/
module, but much of the code
that manipulates it is found in compiler/rustc_mir_build
,
compiler/rustc_mir_transform
, and
compiler/rustc_mir_dataflow
.
Some of the key characteristics of MIR are:
- It is based on a control-flow graph.
- It does not have nested expressions.
- All types in MIR are fully explicit.
Key MIR vocabulary
This section introduces the key concepts of MIR, summarized here:
- Basic blocks: units of the control-flow graph, consisting of:
- statements: actions with one successor
- terminators: actions with potentially multiple successors; always at the end of a block
- (if you're not familiar with the term basic block, see the background chapter)
- Locals: Memory locations allocated on the stack (conceptually, at
least), such as function arguments, local variables, and
temporaries. These are identified by an index, written with a
leading underscore, like
_1
. There is also a special "local" (_0
) allocated to store the return value. - Places: expressions that identify a location in memory, like
_1
or_1.f
. - Rvalues: expressions that produce a value. The "R" stands for
the fact that these are the "right-hand side" of an assignment.
- Operands: the arguments to an rvalue, which can either be a
constant (like
22
) or a place (like_1
).
- Operands: the arguments to an rvalue, which can either be a
constant (like
You can get a feeling for how MIR is constructed by translating simple programs into MIR and reading the pretty printed output. In fact, the playground makes this easy, since it supplies a MIR button that will show you the MIR for your program. Try putting this program into play (or clicking on this link), and then clicking the "MIR" button on the top:
fn main() { let mut vec = Vec::new(); vec.push(1); vec.push(2); }
You should see something like:
// WARNING: This output format is intended for human consumers only
// and is subject to change without notice. Knock yourself out.
fn main() -> () {
...
}
This is the MIR format for the main
function.
MIR shown by above link is optimized.
Some statements like StorageLive
are removed in optimization.
This happens because the compiler notices the value is never accessed in the code.
We can use rustc [filename].rs -Z mir-opt-level=0 --emit mir
to view unoptimized MIR.
This requires the nightly toolchain.
Variable declarations. If we drill in a bit, we'll see it begins with a bunch of variable declarations. They look like this:
let mut _0: (); // return place
let mut _1: std::vec::Vec<i32>; // in scope 0 at src/main.rs:2:9: 2:16
let mut _2: ();
let mut _3: &mut std::vec::Vec<i32>;
let mut _4: ();
let mut _5: &mut std::vec::Vec<i32>;
You can see that variables in MIR don't have names, they have indices,
like _0
or _1
. We also intermingle the user's variables (e.g.,
_1
) with temporary values (e.g., _2
or _3
). You can tell apart
user-defined variables because they have debuginfo associated to them (see below).
User variable debuginfo. Below the variable declarations, we find the only
hint that _1
represents a user variable:
scope 1 {
debug vec => _1; // in scope 1 at src/main.rs:2:9: 2:16
}
Each debug <Name> => <Place>;
annotation describes a named user variable,
and where (i.e. the place) a debugger can find the data of that variable.
Here the mapping is trivial, but optimizations may complicate the place,
or lead to multiple user variables sharing the same place.
Additionally, closure captures are described using the same system, and so
they're complicated even without optimizations, e.g.: debug x => (*((*_1).0: &T));
.
The "scope" blocks (e.g., scope 1 { .. }
) describe the lexical structure of
the source program (which names were in scope when), so any part of the program
annotated with // in scope 0
would be missing vec
, if you were stepping
through the code in a debugger, for example.
Basic blocks. Reading further, we see our first basic block (naturally it may look slightly different when you view it, and I am ignoring some of the comments):
bb0: {
StorageLive(_1);
_1 = const <std::vec::Vec<T>>::new() -> bb2;
}
A basic block is defined by a series of statements and a final terminator. In this case, there is one statement:
StorageLive(_1);
This statement indicates that the variable _1
is "live", meaning
that it may be used later – this will persist until we encounter a
StorageDead(_1)
statement, which indicates that the variable _1
is
done being used. These "storage statements" are used by LLVM to
allocate stack space.
The terminator of the block bb0
is the call to Vec::new
:
_1 = const <std::vec::Vec<T>>::new() -> bb2;
Terminators are different from statements because they can have more
than one successor – that is, control may flow to different
places. Function calls like the call to Vec::new
are always
terminators because of the possibility of unwinding, although in the
case of Vec::new
we are able to see that indeed unwinding is not
possible, and hence we list only one successor block, bb2
.
If we look ahead to bb2
, we will see it looks like this:
bb2: {
StorageLive(_3);
_3 = &mut _1;
_2 = const <std::vec::Vec<T>>::push(move _3, const 1i32) -> [return: bb3, unwind: bb4];
}
Here there are two statements: another StorageLive
, introducing the _3
temporary, and then an assignment:
_3 = &mut _1;
Assignments in general have the form:
<Place> = <Rvalue>
A place is an expression like _3
, _3.f
or *_3
– it denotes a
location in memory. An Rvalue is an expression that creates a
value: in this case, the rvalue is a mutable borrow expression, which
looks like &mut <Place>
. So we can kind of define a grammar for
rvalues like so:
<Rvalue> = & (mut)? <Place>
| <Operand> + <Operand>
| <Operand> - <Operand>
| ...
<Operand> = Constant
| copy Place
| move Place
As you can see from this grammar, rvalues cannot be nested – they can
only reference places and constants. Moreover, when you use a place,
we indicate whether we are copying it (which requires that the
place have a type T
where T: Copy
) or moving it (which works
for a place of any type). So, for example, if we had the expression x = a + b + c
in Rust, that would get compiled to two statements and a
temporary:
TMP1 = a + b
x = TMP1 + c
(Try it and see, though you may want to do release mode to skip over the overflow checks.)
MIR data types
The MIR data types are defined in the compiler/rustc_middle/src/mir/
module. Each of the key concepts mentioned in the previous section
maps in a fairly straightforward way to a Rust type.
The main MIR data type is Body
. It contains the data for a single
function (along with sub-instances of Mir for "promoted constants",
but you can read about those below).
- Basic blocks: The basic blocks are stored in the field
Body::basic_blocks
; this is a vector ofBasicBlockData
structures. Nobody ever references a basic block directly: instead, we pass aroundBasicBlock
values, which are newtype'd indices into this vector. - Statements are represented by the type
Statement
. - Terminators are represented by the
Terminator
. - Locals are represented by a newtype'd index type
Local
. The data for a local variable is found in theBody::local_decls
vector. There is also a special constantRETURN_PLACE
identifying the special "local" representing the return value. - Places are identified by the struct
Place
. There are a few fields:- Local variables like
_1
- Projections, which are fields or other things that "project
out" from a base place. These are represented by the newtype'd type
ProjectionElem
. So e.g. the place_1.f
is a projection, withf
being the "projection element" and_1
being the base path.*_1
is also a projection, with the*
being represented by theProjectionElem::Deref
element.
- Local variables like
- Rvalues are represented by the enum
Rvalue
. - Operands are represented by the enum
Operand
.
Representing constants
When code has reached the MIR stage, constants can generally come in two forms:
MIR constants (mir::Constant
) and type system constants (ty::Const
).
MIR constants are used as operands: in x + CONST
, CONST
is a MIR constant;
similarly, in x + 2
, 2
is a MIR constant. Type system constants are used in
the type system, in particular for array lengths but also for const generics.
Generally, both kinds of constants can be "unevaluated" or "already evaluated".
An unevaluated constant simply stores the DefId
of what needs to be evaluated
to compute this result. An evaluated constant (a "value") has already been
computed; their representation differs between type system constants and MIR
constants: MIR constants evaluate to a mir::ConstValue
; type system constants
evaluate to a ty::ValTree
.
Type system constants have some more variants to support const generics: they
can refer to local const generic parameters, and they are subject to inference.
Furthermore, the mir::Constant::Ty
variant lets us use an arbitrary type
system constant as a MIR constant; this happens whenever a const generic
parameter is used as an operand.
MIR constant values
In general, a MIR constant value (mir::ConstValue
) was computed by evaluating
some constant the user wrote. This const evaluation produces
a very low-level representation of the result in terms of individual bytes. We
call this an "indirect" constant (mir::ConstValue::Indirect
) since the value
is stored in-memory.
However, storing everything in-memory would be awfully inefficient. Hence there
are some other variants in mir::ConstValue
that can represent certain simple
and common values more efficiently. In particular, everything that can be
directly written as a literal in Rust (integers, floats, chars, bools, but also
"string literals"
and b"byte string literals"
) has an optimized variant that
avoids the full overhead of the in-memory representation.
ValTrees
An evaluated type system constant is a "valtree". The ty::ValTree
datastructure
allows us to represent
- arrays,
- many structs,
- tuples,
- enums and,
- most primitives.
The most important rule for
this representation is that every value must be uniquely represented. In other
words: a specific value must only be representable in one specific way. For example: there is only
one way to represent an array of two integers as a ValTree
:
ValTree::Branch(&[ValTree::Leaf(first_int), ValTree::Leaf(second_int)])
.
Even though theoretically a [u32; 2]
could be encoded in a u64
and thus just be a
ValTree::Leaf(bits_of_two_u32)
, that is not a legal construction of ValTree
(and is very complex to do, so it is unlikely anyone is tempted to do so).
These rules also mean that some values are not representable. There can be no union
s in type
level constants, as it is not clear how they should be represented, because their active variant
is unknown. Similarly there is no way to represent raw pointers, as addresses are unknown at
compile-time and thus we cannot make any assumptions about them. References on the other hand
can be represented, as equality for references is defined as equality on their value, so we
ignore their address and just look at the backing value. We must make sure that the pointer values
of the references are not observable at compile time. We thus encode &42
exactly like 42
.
Any conversion from
valtree back to a MIR constant value must reintroduce an actual indirection. At codegen time the
addresses may be deduplicated between multiple uses or not, entirely depending on arbitrary
optimization choices.
As a consequence, all decoding of ValTree
must happen by matching on the type first and making
decisions depending on that. The value itself gives no useful information without the type that
belongs to it.
Promoted constants
See the const-eval WG's docs on promotion.