Immutable vs Mutable Programming Performance in Golang

The work I’m currently doing on my migration of Oystr from Node.JS to Golang requires me to revisit a feature in which I take a graph data set and perform a number of transformations on it to achieve the desired result. When I built this the first time in Node.JS, I thought I’d attempt the transforms by mutating an array of values. After a pretty frustrating day it became obvious that I needed to move to an immutable mindset.

Although the data sets aren’t huge, a 12×12 matrix and a 200×8 matrix, they are large enough (to my mind) that it was difficult to follow the ~10 mutating transformations being applied to each data set. Once I had refactored my transformations into an immutable pipeline, the simplicity of debugging and maintenance was immediately clear and well worth any performance hit that may have been incurred by the increased memory usage… or was it?

That last bit sort of bugged me as I couldn’t say with any real certainty whether the performance hit was big or small. With Node.JS, unlike with Go, testing the performance isn’t as simple as writing a BenchmarkXXX function and, at the time, I didn’t have a working mutable version of my algorithm anyway, so I decided to drop it. Fast-forward to now, and I am re-doing the feature in Go, so I figured I’d run some quick tests to see how much less (or maybe more) efficient an immutable transform would be.

This post presents what I found and some suggestions for you to consider when you find yourself in the same predicament in the future.

Use Case

In my specific scenario I’m building 10 different transforms over matrix-like data sets. I didn’t want to go through and build the full thing twice, so I chose a simpler task. Instead I built a basic matrix type with implementations of the addition, subtraction, scalar multiplication, and matrix multiplication operations. I hoped that these four operations would provide enough variety to observe different performance scenarios.
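To make the mutable/immutable distinction concrete, here’s a minimal sketch of the two styles for the addition operation on a slice-backed matrix. The type and method names are mine for illustration and aren’t necessarily the ones used in the linked repository.

```go
package matrix

// SliceMatrix is a square matrix backed by a slice of slices.
type SliceMatrix [][]float64

// NewSliceMatrix allocates an n×n matrix of zeros.
func NewSliceMatrix(n int) SliceMatrix {
	m := make(SliceMatrix, n)
	for i := range m {
		m[i] = make([]float64, n)
	}
	return m
}

// AddMutable adds other into m in place, mutating the receiver and
// performing no allocations of its own.
func (m SliceMatrix) AddMutable(other SliceMatrix) {
	for i := range m {
		for j := range m[i] {
			m[i][j] += other[i][j]
		}
	}
}

// AddImmutable leaves both operands untouched and returns a new matrix,
// paying for a fresh allocation on every operation.
func (m SliceMatrix) AddImmutable(other SliceMatrix) SliceMatrix {
	result := NewSliceMatrix(len(m))
	for i := range m {
		for j := range m[i] {
			result[i][j] = m[i][j] + other[i][j]
		}
	}
	return result
}
```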

In addition to varying the operation, I also built two different forms of each matrix. For one set of matrices the underlying data structure is a slice; for the other it is an array. My reasoning is that I wanted to see what impact a reduced-allocation scenario would have on the different matrix types.
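To contrast the two backings (again with illustrative names rather than the repository’s), the key difference is that a Go array’s length is part of its type, so the array-backed form has to fix its dimensions at compile time, and, being a value type, it is copied in full whenever it is assigned, passed, or returned.

```go
// The slice-backed type from the sketch above can be any size; the
// array-backed type below embeds its storage directly in the value.

// arraySize is a hypothetical compile-time dimension for the array-backed variant.
const arraySize = 90

// ArrayMatrix is a value type: assigning or returning one copies every element.
type ArrayMatrix [arraySize][arraySize]float64
```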

You’ll find the code for all four matrix types (immutable slice, mutable slice, immutable array, mutable array) at https://github.com/chris-tomich/immutability-benchmarking.

Results

N.B. The following benchmarks were only performed on the operations themselves (this is what I’m interested in) and ignore the initial allocation, which is why there can be zero allocations for some of the tests.
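As an illustration of what that looks like in practice, here is a minimal benchmark sketch (using the hypothetical names from the earlier snippets) where the matrices are built before the timer is reset, so only the operation itself is measured.

```go
package matrix

import "testing"

// Run with something like: go test -bench=. -benchmem -count=10
func BenchmarkSliceMatrixAddImmutable(b *testing.B) {
	a := NewSliceMatrix(90)
	c := NewSliceMatrix(90)

	b.ResetTimer() // exclude the setup allocations above from the measurement
	for i := 0; i < b.N; i++ {
		_ = a.AddImmutable(c)
	}
}
```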

The setup I used is a Dell XPS 15 with an i7-4702HQ, 16GB RAM, and a 1TB mSATA SSD.

You can find my Excel spreadsheets with my go test output tables at https://github.com/chris-tomich/immutability-benchmarking/tree/master/benchmarks.

Slices

I first performed the tests with slices (as this is what I’ll be using). I ran each benchmark 10 times to try to smooth out any potential external factors (background processes etc.). Additionally, I ran the benchmarks with various matrix sizes to see how that would impact the performance. The graphs below summarise these tests. The y-axis represents the percentage increase in time when using an immutable structure over a mutable structure, and the x-axis represents the length of a side of the square matrix.

[Graphs: immutable vs mutable performance for slice-based matrix addition, subtraction, scalar multiplication, and matrix multiplication]

After looking at the graphs above, what I noticed was that as the matrix gets larger, the performance hit for the immutable structures appears to reach a floor for addition, subtraction, and scalar multiplication. Taking into consideration these results and how each operation is implemented, I imagine this is because the time required to allocate the memory doesn’t change much even as the matrix gets larger.

The obvious difference is the matrix multiplication operation. With this operation the difference is minimal, a ~5% – ~10% decrease in processing performance. The reason here is pretty obvious: matrix multiplication requires a third matrix to be created even in the mutable implementation, as the values within the original matrices are needed for the full duration of computing the result, and the result could even be a different-sized matrix. The interesting thing to note is that this doesn’t hold true when the size of the matrix starts to get really large. I’m not sure if this is related to the amount of memory in my machine, memory paging, addressing etc., so it probably needs to be explored further if you’re dealing with really large data sets.
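To see why even the “mutable” version needs a separate destination, here’s a sketch of a multiply that writes into a caller-supplied matrix (using the hypothetical slice-backed type from earlier). Each result cell depends on a whole row of one operand and a whole column of the other, so writing results into either operand as you go would clobber values that later cells still need.

```go
// MultiplyInto computes m × other into dst without allocating.
// dst must be a separate, pre-allocated matrix of the right size; square
// matrices are assumed here to keep the sketch short.
func (m SliceMatrix) MultiplyInto(other, dst SliceMatrix) {
	n := len(m)
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			sum := 0.0
			for k := 0; k < n; k++ {
				sum += m[i][k] * other[k][j]
			}
			dst[i][j] = sum
		}
	}
}
```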

Array

After doing the slices I thought to myself “I wonder if I can improve the performance by using an array instead”. You know… the whole “zero allocs bro” line of thinking. Of course using arrays is kind of impractical for most scenarios, but I figured there may be some cases where I could use a statically sized array, so I figured I may as well build array-based data structures and run the benchmarks again. I used the same sizes as the slice-based versions and got the following (surprising) results.
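For reference, an immutable add on the array-backed type looks something like the sketch below (again with illustrative names). Because the operands and the result are all value types, nothing necessarily needs to be heap-allocated, but the entire result is copied when it is returned.

```go
// AddImmutable returns a new ArrayMatrix without touching either operand.
// There are no explicit allocations, but the whole arraySize×arraySize
// result is copied by value on return.
func (m ArrayMatrix) AddImmutable(other ArrayMatrix) ArrayMatrix {
	var result ArrayMatrix
	for i := range m {
		for j := range m[i] {
			result[i][j] = m[i][j] + other[i][j]
		}
	}
	return result
}
```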

[Graphs: immutable vs mutable performance for array-based matrix addition, subtraction, scalar multiplication, and matrix multiplication]

For me this is where the real surprises came. The mutable structures performed slightly better than their slice counterparts. I expected this to be the case, but what I didn’t expect was that the immutable versions performed worse! And a lot worse in some cases, like matrix multiplication, which is on average ~650x worse! Even though I was able to reduce the number of memory allocations 52-fold by using an array, and reduce memory usage ever so slightly, my processing performance is considerably worse.

The poor performance is particularly pronounced for large data sets. I imagine (pure speculation here) that this is related to how much contiguous memory can be allocated for the array. What makes me suspect this is that, as you’ll notice in the graphs above, I only provide benchmarks for 10×10, 30×30, and 90×90. This is because the 270×270 and 810×810 arrays took in excess of 10 minutes to complete, and by default go test waits a maximum of 10 minutes for a benchmark run to complete.

Conclusion

For me personally, I’ve got quite a few important lessons from this exercise. Depending on how you want to interpret the results above, you may agree or disagree with them.

The first lesson is not to make any assumptions about what the compiler is doing in terms of optimization and performance. There are clearly some shortcuts the compiler is able to take in the slice-based immutable implementations that it can’t take in the array-based immutable implementations, leading to the horrible performance.

Second, and this is probably based more on a personal gripe I have with mindless, overly simple metrics: fewer allocations does not mean better. A common “selling point” I see in a lot of Go libraries is “zero allocations”. It’s clear from these results that this doesn’t automatically mean better performance in every scenario.

The third lesson is a compound one, as it directly answers the original intent of this post. In some scenarios an immutable algorithm is the only solution anyway, so programming with immutable data structures has little effect on performance (e.g. the matrix multiplication operation). In the cases where you do observe a statistically significant hit on performance for operations that can be done mutably, you need to really consider whether that statistical significance is practically significant.

A ~40% increase in processing time may seem significant, but in the context of my use case it’s actually very insignificant compared to what I gain in ease of debugging and maintenance. When we make decisions for the “greater good”, the “greater” part should include non-performance-related gains.

In terms of moving forward I’ll definitely be using immutable transforms in Oystr. In a future post I’ll explore how the immutable data structures here could be easily extended with lazy evaluation which should improve the performance for certain scenarios and shouldn’t (fingers crossed) be conceptually difficult to implement.
