Parallel Compilation

As of August 2022, the only stage of the compiler that is already parallel is codegen. Some parts of the compiler already have parallel implementations, such as query evaluation, type check and monomorphization, but the general version of the compiler does not include these parallelization functions. To try out the current parallel compiler, one can install rustc from source code with parallel-compiler = true in the config.toml.

The lack of parallelism at other stages (for example, macro expansion) also represents an opportunity for improving compiler performance.

These next few sections describe where and how parallelism is currently used, and the current status of making parallel compilation the default in rustc.

Codegen

During monomorphization the compiler splits up all the code to be generated into smaller chunks called codegen units. These are then generated by independent instances of LLVM running in parallel. At the end, the linker is run to combine all the codegen units together into one binary. This process occurs in the rustc_codegen_ssa::base module.

Data Structures

The underlying thread-safe data-structures used in the parallel compiler can be found in the rustc_data_structures::sync module. These data structures are implemented differently depending on whether parallel-compiler is true.

data structureparallelnon-parallel
Lrcstd::sync::Arcstd::rc::Rc
Weakstd::sync::Weakstd::rc::Weak
Atomic{Bool}/{Usize}/{U32}/{U64}std::sync::atomic::Atomic{Bool}/{Usize}/{U32}/{U64}(std::cell::Cell<bool/usize/u32/u64>)
OnceCellstd::sync::OnceLockstd::cell::OnceCell
Lock<T>(parking_lot::Mutex<T>)(std::cell::RefCell)
RwLock<T>(parking_lot::RwLock<T>)(std::cell::RefCell)
MTRef<'a, T>&'a T&'a mut T
MTLock<T>(Lock<T>)(T)
ReadGuardparking_lot::RwLockReadGuardstd::cell::Ref
MappedReadGuardparking_lot::MappedRwLockReadGuardstd::cell::Ref
WriteGuardparking_lot::RwLockWriteGuardstd::cell::RefMut
MappedWriteGuardparking_lot::MappedRwLockWriteGuardstd::cell::RefMut
LockGuardparking_lot::MutexGuardstd::cell::RefMut
MappedLockGuardparking_lot::MappedMutexGuardstd::cell::RefMut
  • These thread-safe data structures interspersed during compilation can cause a lot of lock contention, which actually degrades performance as the number of threads increases beyond 4. This inspires us to audit the use of these data structures, leading to either refactoring to reduce use of shared state, or persistent documentation covering invariants, atomicity, and lock orderings.

  • On the other hand, we still need to figure out what other invariants during compilation might not hold in parallel compilation.

WorkLocal

WorkLocal is a special data structure implemented for parallel compiler. It holds worker-locals values for each thread in a thread pool. You can only access the worker local value through the Deref impl on the thread pool it was constructed on. It will panic otherwise.

WorkLocal is used to implement the Arena allocator in the parallel environment, which is critical in parallel queries. Its implementation is located in the rustc-rayon-core::worker_local module. However, in the non-parallel compiler, it is implemented as (OneThread<T>), whose T can be accessed directly through Deref::deref.

Parallel Iterator

The parallel iterators provided by the rayon crate are easy ways to implement parallelism. In the current implementation of the parallel compiler we use a custom fork of rayon to run tasks in parallel.

Some iterator functions are implemented to run loops in parallel when parallel-compiler is true.

Function(Omit Send and Sync)IntroductionOwning Module
par_iter<T: IntoParallelIterator>(t: T) -> T::Itergenerate a parallel iteratorrustc_data_structure::sync
par_for_each_in<T: IntoParallelIterator>(t: T, for_each: impl Fn(T::Item))generate a parallel iterator and run for_each on each elementrustc_data_structure::sync
Map::par_body_owners(self, f: impl Fn(LocalDefId))run f on all hir owners in the craterustc_middle::hir::map
Map::par_for_each_module(self, f: impl Fn(LocalDefId))run f on all modules and sub modules in the craterustc_middle::hir::map
ModuleItems::par_items(&self, f: impl Fn(ItemId))run f on all items in the modulerustc_middle::hir
ModuleItems::par_trait_items(&self, f: impl Fn(TraitItemId))run f on all trait items in the modulerustc_middle::hir
ModuleItems::par_impl_items(&self, f: impl Fn(ImplItemId))run f on all impl items in the modulerustc_middle::hir
ModuleItems::par_foreign_items(&self, f: impl Fn(ForeignItemId))run f on all foreign items in the modulerustc_middle::hir

There are a lot of loops in the compiler which can possibly be parallelized using these functions. As of August 2022, scenarios where the parallel iterator function has been used are as follows:

callerscenariocallee
rustc_metadata::rmeta::encoder::prefetch_mirPrefetch queries which will be needed later by metadata encodingpar_iter
rustc_monomorphize::collector::collect_crate_mono_itemsCollect monomorphized items reachable from non-generic itemspar_for_each_in
rustc_interface::passes::analysisCheck the validity of the match statementsMap::par_body_owners
rustc_interface::passes::analysisMIR borrow checkMap::par_body_owners
rustc_typeck::check::typeck_item_bodiesType checkMap::par_body_owners
rustc_interface::passes::hir_id_validator::check_crateCheck the validity of hirMap::par_for_each_module
rustc_interface::passes::analysisCheck the validity of loops body, attributes, naked functions, unstable abi, const bodysMap::par_for_each_module
rustc_interface::passes::analysisLiveness and intrinsic checking of MIRMap::par_for_each_module
rustc_interface::passes::analysisDeathness checkingMap::par_for_each_module
rustc_interface::passes::analysisPrivacy checkingMap::par_for_each_module
rustc_lint::late::check_crateRun per-module lintsMap::par_for_each_module
rustc_typeck::check_crateWell-formedness checkingMap::par_for_each_module

There are still many loops that have the potential to use parallel iterators.

Query System

The query model has some properties that make it actually feasible to evaluate multiple queries in parallel without too much of an effort:

  • All data a query provider can access is accessed via the query context, so the query context can take care of synchronizing access.
  • Query results are required to be immutable so they can safely be used by different threads concurrently.

When a query foo is evaluated, the cache table for foo is locked.

  • If there already is a result, we can clone it, release the lock and we are done.
  • If there is no cache entry and no other active query invocation computing the same result, we mark the key as being "in progress", release the lock and start evaluating.
  • If there is another query invocation for the same key in progress, we release the lock, and just block the thread until the other invocation has computed the result we are waiting for. Cycle error detection in the parallel compiler requires more complex logic than in single-threaded mode. When worker threads in parallel queries stop making progress due to interdependence, the compiler uses an extra thread (named deadlock handler) to detect, remove and report the cycle error.

Parallel query still has a lot of work to do, most of which is related to the previous Data Structures and Parallel Iterators. See this tracking issue.

Rustdoc

As of November 2022, there are still a number of steps to complete before rustdoc rendering can be made parallel. More details on this issue can be found here.

Resources

Here are some resources that can be used to learn more (note that some of them are a bit out of date):