Lightroom Performance Testing - Multithreading
Every so often a thread pops up with someone asking how well LR scales with CPUs, and in the ensuing melee many claims of good/bad multithreading, scaling etc. get thrown around with little supporting evidence beyond a screenshot of the Windows Task Manager.
This has bugged me for a while, because it is an inherently flawed way to evaluate multithreaded applications: an application can occupy a CPU/core without actually doing anything useful with it. This might result from simply bad coding or a compiler bug, or (more usually) from an I/O bottleneck of some sort - the CPU/core is "busy" but is doing nothing while it waits for data to process. For "distributed task" type parallelism (such as processing multiple raw files at once), where the calculations on each CPU/core are independent of one another, the wait is likely to be on memory or disk I/O. For truly parallel calculations, where a single calculation is spread across multiple CPUs/cores and each processor relies on the output of the others, communication between CPUs/cores becomes the limiting factor. In terms of scaling, the distributed model usually scales close to linearly, whilst a single parallel calculation usually scales with diminishing returns and often reaches an optimal number of CPUs/cores before actually becoming slower with more.
So I decided to do some better testing of LR to see just how well it does scale. For this test I am using a not so well known (though far from secret) feature in Windows 7 - the ability to set the number of CPU cores you boot with. Booting with a different number of cores each time makes it possible to measure directly how export time changes with core count, rather than inferring it from load graphs.
Intel 975 chipset, Q6600 quad core running at 2.4GHz, 4GB DDR2-800 RAM.
Win 7 64-bit, LR 3.3.
Files are read from a conventional SATA drive (~80MB/sec sequential read speed) and written to an Intel X25-M SSD.
The test is an export test - converting 88 RAW images from a Canon 350D (8MP) to full size JPEG. Three export tests are performed - all 88 as a single export, 2 x 44 images in parallel, and 4 x 22 images in parallel. In all tests the same 88 images are exported in total; the only difference is how the work is loaded onto the processor.
As I don't know how to set multiple parallel exports going at once, I did these by selecting each batch and choosing "export with previous" multiple times as fast as I could. This introduces a small error, but one I couldn't avoid. Similarly, timing was done manually, with me starting and stopping the watch. Both of these introduce an error of a few seconds in the results, but I don't think it affects the overall outcome of the experiment.
A note - an export test doesn't test how well other parts of LR, such as the brush tools and image correction tools, scale with CPU/core count. Testing this would be rather harder, I think. It might also yield a very different result, as exports come under the "distributed" heading of parallelism whilst most edits probably don't. I wouldn't be surprised if many of the Develop module functions are single threaded, due to the additional difficulty of writing them multithreaded.
First the raw results, then some discussion (including the Task Manager screenshots I maligned earlier). I've tried to condense everything as much as possible, and on balance it seems easier to plot everything in terms of relative speedup. So these plots show how much faster things go with increasing core count and parallel export number. As a guide, the base times plotted here are 12 mins 37 secs (1 core / single export), 7 mins 10 secs (2 cores / single export), 5 mins 10 secs (3 cores / single export) and 4 mins 40 secs (4 cores / single export).
1 - Speedup with core count
2 - Speedup with number of parallel exports
Note that all speedups are calculated relative to the slowest export on the individual trend lines, not against the slowest overall export (hence all lines start from 1).
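For anyone who wants to check the arithmetic, here is a minimal Python sketch that turns the four single-export base times quoted above into speedup and per-core efficiency figures. The times are the ones from this post; the function and variable names are just mine for illustration, and it only covers the single-export trend line.

```python
# Single-export base times quoted in the post, converted to seconds.
base_times = {
    1: 12 * 60 + 37,  # 1 core
    2: 7 * 60 + 10,   # 2 cores
    3: 5 * 60 + 10,   # 3 cores
    4: 4 * 60 + 40,   # 4 cores
}


def speedup_table(times):
    """Return {cores: (speedup, efficiency)} relative to the 1-core time."""
    t1 = times[1]
    table = {}
    for cores, t in sorted(times.items()):
        speedup = t1 / t
        # Efficiency is the fraction of ideal linear scaling achieved.
        table[cores] = (speedup, speedup / cores)
    return table


table = speedup_table(base_times)
for cores, (s, e) in table.items():
    print(f"{cores} core(s): speedup {s:.2f}x, efficiency {e:.0%}")
```

This gives roughly 1.76x on 2 cores, 2.44x on 3 and 2.70x on 4, i.e. efficiency falling from about 88% to about 68%. Bear in mind the few seconds of hand-timing error mentioned earlier.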
So what do these tell us? Well, clearly LR is automatically detecting and using multiple CPU cores, and spawning off 1 raw conversion per core, I would guess. The speedup with core count is a little non-linear, but that's neither unusual nor unexpected. Consider that the inherent system overhead contributes to some non-linearity (with more impact at low core counts), and that I/O issues increasingly limit things as core count goes up. The beneficial effect of running a second export process is unexpected though, and suggests that the optimum output rate occurs at a higher load than LR chooses to generate on its own.
Overall I think these are pretty good scaling results for a real world application, but given the tailing off between 3 and 4 cores here I'd be interested to see how these numbers change for 6 and 8 core machines.
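As a very rough guess at what 6 and 8 cores might do, here is a Python sketch that fits Amdahl's law to the 1 core and 4 core single-export times from this post and extrapolates. To be clear, this is an assumption on my part and not a measurement: Amdahl's law treats the job as a fixed serial part plus a perfectly parallel part, and ignores the I/O and memory effects that probably matter here. The function names are hypothetical.

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n cores when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


def parallel_fraction(speedup, n):
    """Invert Amdahl's law: the parallel fraction implied by a speedup on n cores."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


# Measured single-export times from the post, in seconds (1 core and 4 cores).
t1, t4 = 757.0, 280.0

# Parallel fraction implied by the 4-core result (an assumption-laden fit).
p = parallel_fraction(t1 / t4, 4)

# Projected speedups for core counts that were NOT tested.
projection = {n: amdahl_speedup(p, n) for n in (6, 8)}

print(f"implied parallel fraction: {p:.2f}")
for n, s in projection.items():
    print(f"{n} cores: projected speedup {s:.1f}x")
```

On these numbers the implied parallel fraction is about 0.84, which projects to roughly 3.3x on 6 cores and 3.8x on 8. Treat that as a ballpark only; faster memory or disks on a newer machine could easily move it either way.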
Task Manager Plots
Just to give an idea of how the system was loaded.
3. Task Manager history for 1 core testing. From left to right - 1, 2 and 4 parallel exports
A couple of interesting things to see here. Firstly the little drops in the CPU load for the single export job, and their absence when more heavily loaded with 2 and 4 parallel jobs. I'm pretty sure from looking at the export in real time that these coincide with similar memory usage drops and are associated with file I/O. Interestingly they don't happen for every file processed (at least not as the LR export slider lists the file conversions), but sometimes for every second file. I wonder if LR is caching more than 1 conversion in memory. At higher loads (2 and 4 parallel exports) these troughs are "filled in". A simplistic explanation would suggest this means fewer lost cycles and thus explains the speed up when running more than one export in parallel. It might even be right.
I also noticed that LR uses more memory with increasing numbers of parallel exports.
4. Task Manager history for 2 core testing. From left to right - 1, 2 and 4 parallel exports
A few things to notice here. Firstly, for the single export process there is a lot more up and down in the CPU core loads than in the single CPU core test above. Running multiple parallel export jobs again "fills in" these troughs and keeps the CPU busier, yielding the performance increase seen in the timings. As with the single CPU core results, running 4 parallel exports yields little extra because the CPU is already saturated at 2. I would note that the real difference between the 2 and 4 parallel exports was 3 seconds, and thus within the experimental error. No significance should be given to one being faster than the other.
Another thing to notice is some asymmetry between the core loads in the single export process - one core is more loaded than the other. I don't really know why this is, but I wonder if Windows is holding threads to a core quite strongly, so as to stop them hopping between cores and wasting the CPU cycles associated with the switch. I remember earlier multiprocessor versions of Windows being criticised for that kind of hopping.
5. Task Manager history for 3 core testing. From left to right - 1, 2 and 4 parallel exports.
Pretty much the same story as for the 2 core testing. It looks like we can still see that affinity effect in the single export process too.
6. Task Manager history for 4 core testing. From left to right - 1, 2 and 4 parallel exports.
Again, I think it's the same story here as before.
I think it's pretty conclusive that LR is using multiple CPU cores when exporting images to disk. It looks like it uses a "distributed" type model, so that each image conversion is spawned off to a separate CPU core, which obviously means that to get the maximum speedup you need to be exporting several images at a time.
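To show what I mean by a "distributed" model, here is a minimal Python sketch of the pattern: a pool of workers, each handed one independent file at a time. This is purely illustrative and is not how LR is actually written (I have no idea what its internals look like); the convert function and the file names are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def convert(name):
    """Hypothetical stand-in for one raw conversion; it only renames the file."""
    return name.rsplit(".", 1)[0] + ".jpg"


def export_all(names, workers):
    """Hand each file to a worker pool; no task depends on any other task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, whichever worker finishes first.
        return list(pool.map(convert, names))


# 88 made-up file names, matching the size of the test batch in this post.
names = [f"IMG_{i:04d}.CR2" for i in range(1, 89)]
jpgs = export_all(names, workers=4)
print(len(jpgs), jpgs[0], jpgs[-1])
```

Because no conversion needs the result of another, the workers never have to talk to each other, which is why this kind of job tends to scale well until memory or disk gets in the way. For genuinely CPU-heavy work in Python you would swap in ProcessPoolExecutor, since threads there don't run Python code on several cores at once.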
The scaling is pretty good, at least within the confines of my system, but the tailing off between 3 and 4 CPU cores does suggest that higher core counts might not yield so great a performance increase. Of course, hexa and octa core processors come with a faster memory subsystem and better caching than my system, which might serve to compensate somewhat. I wouldn't want to speculate on the effect of having two physical CPUs rather than one with the same core count.
There is an interesting and significant performance boost for splitting the export job in two and running parallel exports. This seems to load the CPU a little more optimally and also improves the scalability with CPU core count.
And finally a reminder that this test only covers file exporting. It is much harder to determine whether the interactive editing tools are multithreaded. It is my feeling that they aren't, though I have no actual evidence for this. If that is the case, you will get a snappier feeling LR experience with a faster dual core than with a slower quad, but you will get faster exports with the quad for sure.
Comments and stuff welcome - I've got my beer and popcorn ready...