# Day 3 - rayon

Let me start this article with a confession: *I suck at writing parallel code.*
I have only a vague idea about mutexes, locks and other low level primitives
for handling data between threads. And since I work mostly with Python,
I tend to use `multiprocessing`

(if I bother to parallelize at all) rather
than threads.

However, one of the selling points of Rust is its approach to concurrent/parallel programming. Mozilla recently released a video encouraging people to write parallel code that is safe and approachable. Then I heard about rayon - a library for data parallelism that is very approachable and lightweight.

If you have an iterator chain that computes something on every
element, you can stick some variation of `par_iter()`

in the chain and now
your code is`*`

parallel. So simple!

Niko Matsakis, the author of `rayon`

, has a
blog post
describing the motivation behind this crate and some implementation details.
Worth checking out if you're curious how `rayon`

works under the hood.

## How much is π?

Let's assume we don't know the value of π. How would we calculate it? One
approach is to measure the area of a unit circle (a circle with radius=1).
Since circle area is `πr<sup>2</sup>`

and r=1, the area of a unit circle
equals π.

If we plot the circle as a graph of a function, the area under the graph can be calculated as a definite integral. Let's represent our circle function in Rust:

```
fn semicircle(x: f64, r: f64) -> f64 {
(r * r - x * x).sqrt()
}
```

Note: this is actually a semicircle, because a circle doesn't represent a
function (it maps an `x`

value to two `y`

values).

There are various techniques for numerical integration. One of those is known as the trapezoidal rule. Here's a naive Rust implementation:

```
fn integrate<F>(f: F, points: usize) -> f64
where F: Fn(f64) -> f64
{
let delta = 1f64 / points as f64;
(0..points)
.map(|i| {
let a = i as f64 * delta;
delta * (f(a) + f(a + delta)) / 2f64
})
.sum()
}
```

We cut the area under function into narrow strips of width `delta`

and sum up
their areas. Don't be afraid if you don't follow all the math here. The
important thing is - every strip is independent of each other. This is an
embarassingly parallel
task! Let's see how `rayon`

can help us.

```
extern crate rayon;
use rayon::prelude::*;
fn parallel_integrate<F>(f: F, points: usize) -> f64
where F: Fn(f64) -> f64 + Sync
{
let delta = 1f64 / points as f64;
(0..points)
.into_par_iter()
.map(|i| {
let a = i as f64 * delta;
delta * (f(a) + f(a + delta)) / 2f64
})
.sum()
}
```

See any changes? Apart from the imports, the *only* things that changed are
trait bounds on the function (we need it to be `Sync`

) and the range is
converted into a parallel iterator. That's it!

```
println!("sequential: {}", 4f64 * integrate(semicircle, 10_000_000));
println!("parallel: {}", 4f64 * parallel_integrate(semicircle, 10_000_000));
```

```
sequential: 3.1415926535533574
parallel: 3.141592653552613
```

We're pretty close to the value of π!

## Monte Carlo method

Imagine you're throwing darts at a square board which has a circle drawn on it. If you're as poor at darts as I am and throw at random, some of your hits will be inside of the circle, some outside of it. Monte Carlo method for calculating π works just like that: pick a number of points at random and count these within the circle. The ratio of this count to total number of points approximates the ratio of the areas (which is π/4).

**Note**: this is a worse example of parallelization because `rand::random()`

uses a thread-local random generator that reseeds itself from OS-provided
entropy source.

```
extern crate rand;
fn monte_carlo_pi(points: usize) -> f64 {
let within_circle = (0..points)
.filter_map(|_| {
let x = rand::random::<f64>() * 2f64 - 1f64;
let y = rand::random::<f64>() * 2f64 - 1f64;
if x * x + y * y <= 1f64 { Some(1) } else { None }
})
.count();
4f64 * within_circle as f64 / points as f64
}
fn parallel_monte_carlo_pi(points: usize) -> f64 {
let within_circle = (0..points)
.into_par_iter()
.filter_map(|_| {
let x = rand::random::<f64>() * 2f64 - 1f64;
let y = rand::random::<f64>() * 2f64 - 1f64;
if x * x + y * y <= 1f64 { Some(1) } else { None }
})
.count();
4f64 * within_circle as f64 / points as f64
}
println!("sequential MC: {}", monte_carlo_pi(1_000_000));
println!("parallel MC: {}", parallel_monte_carlo_pi(1_000_000));
```

Again our parallel implementation differs only in the call to `into_par_iter()`

.

```
sequential MC: 3.138308
parallel MC: 3.142376
```

The results will be different in every run, but we can see it sort of looks like π.

## Benchmarks

*Twelve o'clock and all is well*, but how do we know if the parallel versions
really perform better?
Benchmarks!

The following code works only on nightly, due to `test`

feature gate.

```
#![feature(test)]
extern crate test;
#[cfg(test)]
mod tests {
use super::{semicircle, integrate, parallel_integrate, monte_carlo_pi, parallel_monte_carlo_pi};
use test::Bencher;
#[bench]
fn bench_sequential_integrate(b: &mut Bencher) {
b.iter(|| integrate(semicircle, 100_000))
}
#[bench]
fn bench_parallel_integrate(b: &mut Bencher) {
b.iter(|| parallel_integrate(semicircle, 100_000))
}
#[bench]
fn bench_sequential_monte_carlo(b: &mut Bencher) {
b.iter(|| monte_carlo_pi(100_000))
}
#[bench]
fn bench_parallel_monte_carlo(b: &mut Bencher) {
b.iter(|| parallel_monte_carlo_pi(100_000))
}
}
```

```
$ cargo bench
Finished release [optimized] target(s) in 0.0 secs
Running target\release\rayon-bench-06c3c3f0de9581df.exe
running 4 tests
test tests::bench_parallel_integrate ... bench: 958,000 ns/iter (+/- 457,837)
test tests::bench_parallel_monte_carlo ... bench: 4,604,488 ns/iter (+/- 1,072,339)
test tests::bench_sequential_integrate ... bench: 1,532,289 ns/iter (+/- 111,210)
test tests::bench_sequential_monte_carlo ... bench: 9,736,849 ns/iter (+/- 455,576)
test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured
```

The results above are on my dual core Intel i5 laptop. As you can see, the
parallel versions run almost twice as fast, utilizing both CPU cores. However,
`rayon`

doesn't *guarantee* parallel execution. To quote Niko:

the decision of whether or not to use parallel threads is made dynamically, based on whether idle cores are available.

`*`

This approach is known as *potential parallelism*. Keep that in mind when using
`rayon`

and always measure performance.