In February 2015, I posted the first pull request for a port of Prakash Surya's multilist and ARC re-work to ZFS on Linux. The goal was to reduce the lock contention on arcs_mtx, a single mutex embedded within each of the per-state ARC lists. The new multilist facility provided as part of this pull request includes an almost drop-in replacement for the standard linked list type, list_t. Rather than maintaining a single lock for a single linked list (per ARC state), the lists were split up into a number of sub-lists, each with its own mutex.
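The sub-list idea can be illustrated with a small sketch. This is a toy Python model written for this post, not the ZFS code (which is C): the class name MultiList, the hash-based choice of sub-list, and the default of four sub-lists are all assumptions made for the example. It shows only the core point, that an insert or remove takes the lock of one sub-list rather than a single list-wide lock.

```python
import threading
from collections import deque


class MultiList:
    """Toy multilist: several sub-lists, each guarded by its own lock."""

    def __init__(self, num_sublists=4):
        self._sublists = [deque() for _ in range(num_sublists)]
        self._locks = [threading.Lock() for _ in range(num_sublists)]

    def _index(self, obj):
        # Hypothetical index function: hash the object to choose a sub-list.
        return hash(obj) % len(self._sublists)

    def insert(self, obj):
        i = self._index(obj)
        with self._locks[i]:          # only this sub-list is locked
            self._sublists[i].append(obj)

    def remove(self, obj):
        i = self._index(obj)
        with self._locks[i]:
            self._sublists[i].remove(obj)

    def __len__(self):
        # Walk the sub-lists one at a time, holding one lock at a time.
        total = 0
        for lock, sub in zip(self._locks, self._sublists):
            with lock:
                total += len(sub)
        return total


def demo(num_threads=8, per_thread=1000):
    """Insert from several threads at once and return the filled list."""
    ml = MultiList(num_sublists=4)

    def worker(tid):
        for n in range(per_thread):
            ml.insert((tid, n))

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ml
```

With a single list and a single lock, every one of those inserts would serialize on the same mutex. Here, two threads block each other only when their objects happen to land in the same sub-list.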
The benchmark used for testing this work consists of numerous concurrent 4K reads of 100% cached data. In the original OpenZFS ARC implementation, with a single arcs_mtx lock, the benchmark didn't scale well as additional reader tasks were added. There was a great deal of contention on the single mutex. (The before-and-after results for illumos are described here.)
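The access pattern of such a benchmark can be sketched as follows. This is not the tool used for the measurements, which is not named here. It is a minimal Python illustration in which the function names, the file size, and the thread counts are all assumptions. A Python harness would likely be dominated by interpreter overhead, so it shows the shape of the workload (many threads issuing 4K reads at random offsets in a file small enough to stay cached) rather than a way to measure lock contention.

```python
import os
import random
import tempfile
import threading

BLOCK = 4096  # 4K reads, matching the workload described above


def make_test_file(size_bytes):
    """Create a small file of random data; small enough to stay cached."""
    f = tempfile.NamedTemporaryFile(delete=False)
    f.write(os.urandom(size_bytes))
    f.close()
    return f.name


def read_worker(fd, file_size, num_reads, counts, slot):
    """Issue num_reads block-aligned 4K reads at random offsets."""
    rng = random.Random(slot)
    blocks = file_size // BLOCK
    done = 0
    for _ in range(num_reads):
        offset = rng.randrange(blocks) * BLOCK
        data = os.pread(fd, BLOCK, offset)
        if len(data) == BLOCK:
            done += 1
    counts[slot] = done


def run_readers(path, num_threads, reads_per_thread):
    """Run the readers concurrently; return the number of full reads."""
    file_size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    counts = [0] * num_threads
    try:
        threads = [
            threading.Thread(
                target=read_worker,
                args=(fd, file_size, reads_per_thread, counts, i),
            )
            for i in range(num_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        os.close(fd)
    return sum(counts)
```

Scaling is then examined by repeating the run with an increasing number of readers and comparing the aggregate throughput at each step.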
Given ZoL's divergent development with respect to the "upstream" OpenZFS code (from illumos), porting this patch required dealing with a number of conflicts which developed over time. Some of the issues are documented in the final commit.
Once the code was ported and in working condition, my next step was to try to duplicate the benchmark results under Linux. My initial results were not encouraging: performance hadn't improved much at all and, in some cases, was even worse. My benchmarking was also handicapped by the lack of access to sufficiently "big" hardware. The largest system to which I had direct access was a 2x6-core Opteron (2-node NUMA system) with only 64GiB of RAM. I began using large spot instances on Amazon EC2 to run the tests, but it wasn't very convenient. It also brought to light the differences in the behavior of the locking primitives in a virtualized (Xen) environment as opposed to running on bare metal.
I was eventually put in touch with the good people at ServerCentral, who, in the name of furthering the ZoL development effort, gave me access to a dedicated server with 4 E7-4850 CPUs, each of which has 10 cores and 2 threads per core. The system has 80 threads available. Backing it is 512GiB of RAM and a bunch of hard drives in several JBODs. In short, it's a perfect system on which to perform this type of testing.
Using this 4xE7 system, not only was I able to find and fix some (rather trivial) bottlenecks, which greatly improved the performance of the benchmark mentioned above, but I also found several other similar bottlenecks, some of which have been fixed and some of which have not.
In subsequent postings, I'll outline some of the specific bottlenecks I encountered and their fixes, if any. Pretty much every scaling-related fix or issue I posted or commented on regarding ZoL (zfs or spl repositories) was discovered through testing on the E7 system.
This guest post was published with permission from Tim Chase. It originally appeared here.