Day 3 - rayon
Let me start this article with a confession: I suck at writing parallel code.
I have only a vague idea about mutexes, locks and other low level primitives
for handling data between threads. And since I work mostly with Python,
I tend to use multiprocessing
(if I bother to parallelize at all) rather
than threads.
However, one of the selling points of Rust is its approach to concurrent/parallel programming. Mozilla recently released a video encouraging people to write parallel code that is safe and approachable. Then I heard about rayon - a lightweight library for data parallelism.
If you have an iterator chain that computes something on every
element, you can stick some variation of par_iter()
in the chain and now
your code is*
parallel. So simple!
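For a taste of what that looks like, here's a minimal sketch (the vector of numbers and the sum of squares are just a made-up example; rayon needs to be listed as a dependency in Cargo.toml):
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let numbers: Vec<u64> = (0..1_000_000).collect();
    // par_iter() is the parallel counterpart of iter();
    // map() and sum() read just like in the sequential version
    let sum_of_squares: u64 = numbers.par_iter().map(|&n| n * n).sum();
    println!("{}", sum_of_squares);
}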
Niko Matsakis, the author of rayon
, has a
blog post
describing the motivation behind this crate and some implementation details.
Worth checking out if you're curious how rayon
works under the hood.
How much is π?
Let's assume we don't know the value of π. How would we calculate it? One
approach is to measure the area of a unit circle (a circle with radius=1).
Since the area of a circle is πr² and r=1, the area of a unit circle
equals π.
If we plot the circle as a graph of a function, the area under the graph can be calculated as a definite integral. Let's represent our circle function in Rust:
fn semicircle(x: f64, r: f64) -> f64 {
(r * r - x * x).sqrt()
}
Note: this is actually a semicircle, because a circle doesn't represent a
function (it maps an x
value to two y
values).
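Also note that the code below integrates the semicircle only from x = 0 to x = 1, so it computes a quarter of the circle's area:

π/4 = ∫₀¹ √(1 − x²) dx

That's why the final results further down get multiplied by 4.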
There are various techniques for numerical integration. One of them is the trapezoidal rule: split [0, 1] into n strips of width δ = 1/n, approximate the area of each strip between xᵢ = i·δ and xᵢ + δ by a trapezoid of area δ · (f(xᵢ) + f(xᵢ + δ)) / 2, and sum these up. Here's a naive Rust implementation:
fn integrate<F>(f: F, points: usize) -> f64
where F: Fn(f64) -> f64
{
let delta = 1f64 / points as f64;
(0..points)
.map(|i| {
let a = i as f64 * delta;
delta * (f(a) + f(a + delta)) / 2f64
})
.sum()
}
We cut the area under the function into narrow strips of width delta and sum up
their areas. Don't be afraid if you don't follow all the math here. The
important thing is that every strip can be computed independently of the
others. This is an embarrassingly parallel task! Let's see how rayon
can help us.
extern crate rayon;
use rayon::prelude::*;
fn parallel_integrate<F>(f: F, points: usize) -> f64
where F: Fn(f64) -> f64 + Sync
{
let delta = 1f64 / points as f64;
(0..points)
.into_par_iter()
.map(|i| {
let a = i as f64 * delta;
delta * (f(a) + f(a + delta)) / 2f64
})
.sum()
}
See any changes? Apart from the imports, the only things that changed are
the trait bound on the function (it needs to be Sync, so that it can be safely
shared between threads) and the conversion of the range into a parallel
iterator with into_par_iter(). That's it!
println!("sequential: {}", 4f64 * integrate(semicircle, 10_000_000));
println!("parallel: {}", 4f64 * parallel_integrate(semicircle, 10_000_000));
sequential: 3.1415926535533574
parallel: 3.141592653552613
We're pretty close to the value of π!
Monte Carlo method
Imagine you're throwing darts at a square board with a circle drawn on it. If you're as poor at darts as I am and throw at random, some of your hits will land inside the circle and some outside of it. The Monte Carlo method for calculating π works just like that: pick a number of points at random and count those that fall within the circle. The ratio of that count to the total number of points approximates the ratio of the two areas: the unit circle has area π, while the enclosing square (with side 2) has area 4, so the ratio is π/4.
Note: this is a worse candidate for parallelization, because rand::random()
uses a thread-local random number generator that reseeds itself from an
OS-provided entropy source.
extern crate rand;
fn monte_carlo_pi(points: usize) -> f64 {
let within_circle = (0..points)
.filter_map(|_| {
let x = rand::random::<f64>() * 2f64 - 1f64;
let y = rand::random::<f64>() * 2f64 - 1f64;
if x * x + y * y <= 1f64 { Some(1) } else { None }
})
.count();
4f64 * within_circle as f64 / points as f64
}
fn parallel_monte_carlo_pi(points: usize) -> f64 {
let within_circle = (0..points)
.into_par_iter()
.filter_map(|_| {
let x = rand::random::<f64>() * 2f64 - 1f64;
let y = rand::random::<f64>() * 2f64 - 1f64;
if x * x + y * y <= 1f64 { Some(1) } else { None }
})
.count();
4f64 * within_circle as f64 / points as f64
}
println!("sequential MC: {}", monte_carlo_pi(1_000_000));
println!("parallel MC: {}", parallel_monte_carlo_pi(1_000_000));
Again our parallel implementation differs only in the call to into_par_iter()
.
sequential MC: 3.138308
parallel MC: 3.142376
The results will differ from run to run, but they do sort of look like π.
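As the note above mentioned, every call to rand::random() goes through the thread-local generator. If you wanted to avoid that, one option is to give each parallel job its own generator. This is just a sketch, not what I benchmark below; it assumes a newer rand (around 0.8, for StdRng and the Rng trait) and a rayon version that provides map_init:
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use rayon::prelude::*;

fn parallel_monte_carlo_pi_local_rng(points: usize) -> f64 {
    let within_circle = (0..points)
        .into_par_iter()
        // map_init calls the first closure once per rayon work unit and hands
        // its result (a &mut StdRng here) to every item that unit processes,
        // so we skip the thread-local generator entirely
        .map_init(|| StdRng::from_entropy(), |rng, _| {
            let x = rng.gen::<f64>() * 2f64 - 1f64;
            let y = rng.gen::<f64>() * 2f64 - 1f64;
            (x * x + y * y <= 1f64) as usize
        })
        .sum::<usize>();
    4f64 * within_circle as f64 / points as f64
}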
Benchmarks
Twelve o'clock and all is well, but how do we know if the parallel versions really perform better? Benchmarks!
The following code works only on nightly Rust, due to the test feature gate.
#![feature(test)]
extern crate test;
#[cfg(test)]
mod tests {
use super::{semicircle, integrate, parallel_integrate, monte_carlo_pi, parallel_monte_carlo_pi};
use test::Bencher;
#[bench]
fn bench_sequential_integrate(b: &mut Bencher) {
b.iter(|| integrate(semicircle, 100_000))
}
#[bench]
fn bench_parallel_integrate(b: &mut Bencher) {
b.iter(|| parallel_integrate(semicircle, 100_000))
}
#[bench]
fn bench_sequential_monte_carlo(b: &mut Bencher) {
b.iter(|| monte_carlo_pi(100_000))
}
#[bench]
fn bench_parallel_monte_carlo(b: &mut Bencher) {
b.iter(|| parallel_monte_carlo_pi(100_000))
}
}
$ cargo bench
Finished release [optimized] target(s) in 0.0 secs
Running target\release\rayon-bench-06c3c3f0de9581df.exe
running 4 tests
test tests::bench_parallel_integrate ... bench: 958,000 ns/iter (+/- 457,837)
test tests::bench_parallel_monte_carlo ... bench: 4,604,488 ns/iter (+/- 1,072,339)
test tests::bench_sequential_integrate ... bench: 1,532,289 ns/iter (+/- 111,210)
test tests::bench_sequential_monte_carlo ... bench: 9,736,849 ns/iter (+/- 455,576)
test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured
The results above are on my dual core Intel i5 laptop. As you can see, the
parallel versions run almost twice as fast, utilizing both CPU cores. However,
rayon
doesn't guarantee parallel execution. To quote Niko:
the decision of whether or not to use parallel threads is made dynamically, based on whether idle cores are available.
*
This approach is known as potential parallelism. Keep that in mind when using
rayon
and always measure performance.
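Speaking of measuring: rayon lets you control the size of its thread pool, which can be handy for experiments. Here's a sketch assuming a recent rayon (the configuration API has changed between versions); there's also the RAYON_NUM_THREADS environment variable if you'd rather not touch the code.
extern crate rayon;

fn main() {
    // Configure the global thread pool before any parallel work runs;
    // the thread count here is just an example value.
    rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .build_global()
        .unwrap();

    // ... parallel iterators will now use at most 2 worker threads
}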