Day 3 - rayon

Let me start this article with a confession: I suck at writing parallel code. I have only a vague idea about mutexes, locks and other low-level primitives for sharing data between threads. And since I work mostly with Python, I tend to use multiprocessing (if I bother to parallelize at all) rather than threads.

However, one of the selling points of Rust is its approach to concurrent/parallel programming. Mozilla recently released a video encouraging people to write safe parallel code. Around the same time I heard about rayon - a lightweight, very approachable library for data parallelism.

If you have an iterator chain that computes something on every element, you can stick some variation of par_iter() in the chain and now your code is* parallel. So simple!
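For example (a minimal sketch, not taken from the original post), summing squares over a slice only requires swapping iter() for par_iter():

extern crate rayon;

use rayon::prelude::*;

fn sum_of_squares(input: &[i64]) -> i64 {
    // par_iter() splits the work across rayon's thread pool;
    // the rest of the chain looks exactly like the sequential version
    input.par_iter().map(|&x| x * x).sum()
}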

Niko Matsakis, the author of rayon, has a blog post describing the motivation behind this crate and some implementation details. Worth checking out if you're curious how rayon works under the hood.

How much is π?

Let's assume we don't know the value of π. How would we calculate it? One approach is to measure the area of a unit circle (a circle with radius=1). Since circle area is πr² and r=1, the area of a unit circle equals π.

If we plot the circle as a graph of a function, the area under the graph can be calculated as a definite integral. Let's represent our circle function in Rust:

// upper half of the unit circle (radius = 1)
fn semicircle(x: f64) -> f64 {
    (1f64 - x * x).sqrt()
}

Note: this is actually a semicircle, because a full circle isn't the graph of a function (it would map one x value to two y values).

There are various techniques for numerical integration. One of those is known as the trapezoidal rule: cut the area under the curve into narrow vertical strips and approximate each one with a trapezoid.
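Since the quarter of the unit circle over [0, 1] has area π/4, we can recover π from a definite integral, and the trapezoidal rule approximates that integral as a sum of trapezoid areas (here n corresponds to the points argument below and Δ to delta):

\[
\pi = 4 \int_0^1 \sqrt{1 - x^2}\, dx,
\qquad
\int_0^1 f(x)\, dx \approx \sum_{i=0}^{n-1} \frac{f(i\Delta) + f((i+1)\Delta)}{2}\, \Delta,
\quad \Delta = \frac{1}{n}
\]

Here's a naive Rust implementation: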

fn integrate<F>(f: F, points: usize) -> f64
    where F: Fn(f64) -> f64
{
    // width of a single strip; we integrate over [0, 1]
    let delta = 1f64 / points as f64;
    (0..points)
        .map(|i| {
            let a = i as f64 * delta;
            // area of the trapezoid between a and a + delta
            delta * (f(a) + f(a + delta)) / 2f64
        })
        .sum()
}

We cut the area under the function into narrow strips of width delta and sum up their areas. Don't be afraid if you don't follow all the math here. The important thing is - every strip is independent of the others. This is an embarrassingly parallel task! Let's see how rayon can help us.

extern crate rayon;

use rayon::prelude::*;

fn parallel_integrate<F>(f: F, points: usize) -> f64
    where F: Fn(f64) -> f64 + Sync
{
    // width of a single strip; we integrate over [0, 1]
    let delta = 1f64 / points as f64;
    (0..points)
        .into_par_iter()
        .map(|i| {
            let a = i as f64 * delta;
            // area of the trapezoid between a and a + delta
            delta * (f(a) + f(a + delta)) / 2f64
        })
        .sum()
}

See any changes? Apart from the imports, the only things that changed are the trait bounds on the function (we need it to be Sync) and the conversion of the range into a parallel iterator with into_par_iter(). That's it! Remember that our integral over [0, 1] is the area of a quarter circle, π/4, so we multiply the result by 4 to get π:

println!("sequential: {}", 4f64 * integrate(semicircle, 10_000_000));
println!("parallel: {}", 4f64 * parallel_integrate(semicircle, 10_000_000));
sequential: 3.1415926535533574
parallel: 3.141592653552613

We're pretty close to the value of π!

Monte Carlo method

Imagine you're throwing darts at a square board with a circle drawn on it. If you're as poor at darts as I am and throw at random, some of your hits will land inside the circle and some outside of it. The Monte Carlo method for calculating π works just like that: pick a number of points at random and count those that fall within the circle. The ratio of this count to the total number of points approximates the ratio of the two areas, which is π/4.
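Concretely, for a unit circle inscribed in a 2×2 square:

\[
\frac{\text{points inside the circle}}{\text{all points}}
\approx \frac{\text{circle area}}{\text{square area}}
= \frac{\pi \cdot 1^2}{2 \cdot 2}
= \frac{\pi}{4}
\]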

Note: this is a worse fit for parallelization, because rand::random() uses a thread-local random generator that periodically reseeds itself from an OS-provided entropy source, and every worker thread pays that cost (more on this after the results below).

extern crate rand;

fn monte_carlo_pi(points: usize) -> f64 {
    let within_circle = (0..points)
        .filter_map(|_| {
            // random point in the square [-1, 1] x [-1, 1]
            let x = rand::random::<f64>() * 2f64 - 1f64;
            let y = rand::random::<f64>() * 2f64 - 1f64;
            // keep it only if it falls inside the unit circle
            if x * x + y * y <= 1f64 { Some(1) } else { None }
        })
        .count();
    4f64 * within_circle as f64 / points as f64
}

fn parallel_monte_carlo_pi(points: usize) -> f64 {
    let within_circle = (0..points)
        .into_par_iter()
        .filter_map(|_| {
            // random point in the square [-1, 1] x [-1, 1]
            let x = rand::random::<f64>() * 2f64 - 1f64;
            let y = rand::random::<f64>() * 2f64 - 1f64;
            // keep it only if it falls inside the unit circle
            if x * x + y * y <= 1f64 { Some(1) } else { None }
        })
        .count();
    4f64 * within_circle as f64 / points as f64
}

println!("sequential MC: {}", monte_carlo_pi(1_000_000));
println!("parallel MC: {}", parallel_monte_carlo_pi(1_000_000));

Again our parallel implementation differs only in the call to into_par_iter().

sequential MC: 3.138308
parallel MC: 3.142376

The results will be different in every run, but we can see it sort of looks like π.
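As the earlier note mentioned, rand::random() reaches for the thread-local generator on every call. One small tweak (a sketch with a hypothetical random_point helper, not from the original post; it assumes a rand version where thread_rng() and Rng::gen() are available) is to grab that generator once per point and reuse it for both coordinates:

use rand::Rng;

fn random_point() -> (f64, f64) {
    // one thread-local lookup per point instead of two
    let mut rng = rand::thread_rng();
    let x = rng.gen::<f64>() * 2f64 - 1f64;
    let y = rng.gen::<f64>() * 2f64 - 1f64;
    (x, y)
}

The closures in both Monte Carlo functions could call this helper instead of rand::random(); the parallel behaviour stays the same, since each rayon worker thread still owns its own generator.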

Benchmarks

Twelve o'clock and all is well, but how do we know if the parallel versions really perform better? Benchmarks!

The following code works only on nightly Rust, due to the test feature gate.

#![feature(test)]

extern crate test;

#[cfg(test)]
mod tests {
    use super::{semicircle, integrate, parallel_integrate, monte_carlo_pi, parallel_monte_carlo_pi};
    use test::Bencher;

    #[bench]
    fn bench_sequential_integrate(b: &mut Bencher) {
        b.iter(|| integrate(semicircle, 100_000))
    }

    #[bench]
    fn bench_parallel_integrate(b: &mut Bencher) {
        b.iter(|| parallel_integrate(semicircle, 100_000))
    }

    #[bench]
    fn bench_sequential_monte_carlo(b: &mut Bencher) {
        b.iter(|| monte_carlo_pi(100_000))
    }

    #[bench]
    fn bench_parallel_monte_carlo(b: &mut Bencher) {
        b.iter(|| parallel_monte_carlo_pi(100_000))
    }
}
$ cargo bench
    Finished release [optimized] target(s) in 0.0 secs
     Running target\release\rayon-bench-06c3c3f0de9581df.exe

running 4 tests
test tests::bench_parallel_integrate     ... bench:     958,000 ns/iter (+/- 457,837)
test tests::bench_parallel_monte_carlo   ... bench:   4,604,488 ns/iter (+/- 1,072,339)
test tests::bench_sequential_integrate   ... bench:   1,532,289 ns/iter (+/- 111,210)
test tests::bench_sequential_monte_carlo ... bench:   9,736,849 ns/iter (+/- 455,576)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

The results above are from my dual-core Intel i5 laptop. As you can see, the parallel versions run almost twice as fast, utilizing both CPU cores. However, rayon doesn't guarantee parallel execution. To quote Niko:

the decision of whether or not to use parallel threads is made dynamically, based on whether idle cores are available.

* This approach is known as potential parallelism. Keep that in mind when using rayon and always measure performance.
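When measuring, it can also help to pin the size of rayon's thread pool so runs are comparable. In current rayon releases (1.x, not the version used here) that looks roughly like the sketch below; setting the RAYON_NUM_THREADS environment variable works as well.

fn main() {
    // build the global pool explicitly with a fixed number of threads;
    // this fails if a global pool has already been initialized
    rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .build_global()
        .expect("failed to build rayon thread pool");

    println!("parallel: {}", 4f64 * parallel_integrate(semicircle, 10_000_000));
}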

Further reading