
Moderated by: Fred Miranda

FM Forums | Post-processing & Printing | Join Upload & Sell

  

Archive 2011 · Lightroom Performance Testing - Multithreading
  
 
15Bit
p.1 #1 · Lightroom Performance Testing - Multithreading


Every so often a thread pops up asking how well LR scales with CPU cores, and in the ensuing melee many claims of good or bad multithreading, scaling etc. get thrown around with little supporting evidence beyond a screenshot of the Windows Task Manager.

This has bugged me for a while, because it is an inherently flawed method for evaluating multithreaded applications: an application can occupy a CPU/core without actually doing anything with it. This might result from simply bad coding or a compiler bug, or (more usually) it is related to I/O bottlenecks of some sort - the CPU/core is "busy" but is doing nothing while waiting for data to process. For "distributed task" type parallelism (such as when processing multiple raw files at once), where the calculations on each CPU/core are independent of one another, it is likely to be memory or disk I/O that is being waited on. For truly parallel calculations, where a single calculation is spread across multiple CPUs/cores and all processors rely on one another's output, the communication between CPUs/cores becomes the limiting factor. In terms of scaling, the distributed model usually scales close to linearly, whilst the single parallel calculation usually scales with diminishing returns and often reaches an optimal number of CPUs/cores before actually becoming slower with more.

So I decided to do some better testing of LR in order to see just how well it does scale. For this test I am using a not so well known (though far from secret) feature in Windows 7 - the ability to set the number of CPU cores you boot with. By booting with a different number of cores each time, it is possible to measure directly how export time changes with core count.

Testing setup:

Computer:

Intel 975 chipset, Q6600 quad core running at 2.4 GHz, 4 GB DDR2-800 RAM.
Win 7 64-bit, LR 3.3.

Files are read from a conventional SATA hard drive (~80 MB/s sequential read speed) and written to an Intel X25-M SSD.

Test

The test is an export test - converting 88 RAW images from a Canon 350D (8MP) to full size JPEG. Three export tests are performed - all 88 as a single export, 2 x 44 images in parallel, and 4 x 22 images in parallel. In all tests the same 88 images are exported in total; the only difference is how we load up the processor while doing it.
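For anyone wanting to reproduce the 1 x 88, 2 x 44 and 4 x 22 splits, here is a minimal Python sketch of the batching. It is only an illustration: the function name `split_into_batches` and the file names are made up for the example, and it does not drive Lightroom in any way.

```python
def split_into_batches(items, n_batches):
    """Split items into n_batches contiguous batches of near-equal size."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    base, extra = divmod(len(items), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # The first `extra` batches take one additional item each.
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


# Hypothetical file names standing in for the 88 raw files.
raw_files = [f"IMG_{i:04d}.CR2" for i in range(1, 89)]

for n in (1, 2, 4):
    print(n, [len(batch) for batch in split_into_batches(raw_files, n)])
```

Each batch would then be selected and exported as its own job.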

As I don't know how to set multiple parallel exports going at once, I did these by selecting the images and choosing "export with previous" multiple times as fast as I could. This introduces a small error, but one I couldn't avoid. Similarly, timing was conducted manually, with me starting and stopping the watch. Both of these introduce an error of a few seconds in the results, but I don't think it affects the overall outcome of the experiment.

A note - an export test doesn't test how well other parts of LR, such as the brush tools and image correction tools, scale with CPU/core count. Testing this would be rather harder, I think. It might also yield a very different result, as exports come under the "distributed" heading of parallelism whilst most edits probably don't. I wouldn't be surprised if many of the Develop module functions are single threaded due to the additional difficulty in writing them multithreaded.

The Results

First the raw results, then some discussion (including the Task Manager screenshots I maligned earlier). I've tried to condense everything as much as possible, and on balance it seems easier to plot everything in terms of relative speedup. So these plots show how much faster things go with increasing core count and parallel export number. As a guide, the base times plotted here are 12 mins 37 secs (1 core / single export), 7 mins 10 secs (2 cores / single export), 5 mins 10 secs (3 cores / single export) and 4 mins 40 secs (4 cores / single export).
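To make the arithmetic behind "relative speedup" explicit, here is a small Python sketch that turns the four single-export base times quoted above into speedup and per-core efficiency figures. The function names are just for illustration, and only the single-export times are used because those are the only raw times given in the post.

```python
def to_seconds(minutes, seconds):
    """Convert a stopwatch reading to seconds."""
    return minutes * 60 + seconds


# Single-export base times from the post, keyed by core count.
single_export_times = {
    1: to_seconds(12, 37),
    2: to_seconds(7, 10),
    3: to_seconds(5, 10),
    4: to_seconds(4, 40),
}


def speedup_table(times):
    """Return {cores: (speedup, efficiency)} relative to the fewest-core run."""
    base = times[min(times)]
    table = {}
    for cores, elapsed in sorted(times.items()):
        speedup = base / elapsed
        table[cores] = (speedup, speedup / cores)
    return table


for cores, (speedup, efficiency) in speedup_table(single_export_times).items():
    print(f"{cores} core(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these times the single-export speedups come out at roughly 1.76x, 2.44x and 2.70x for 2, 3 and 4 cores, i.e. about 88%, 81% and 68% efficiency per core. Bear in mind the hand timing is only good to a few seconds.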

1 - Speedup with core count







2 - Speedup with number of parallel exports







Note that all speedups are calculated relative to the slowest export on the individual trend lines, not against the slowest overall export (hence all lines start from 1).

So what do these tell us? Well, clearly LR is automatically detecting and using multiple CPU cores, and spawning off 1 raw conversion per core, I would guess. The speedup with core count is a little non-linear, but that's neither unusual nor unexpected. Consider that the inherent system overhead contributes to some non-linearity (with more impact at low core counts), and that I/O issues increasingly limit things as core count goes up. The beneficial effect of running a second export process is unexpected though, and suggests that the optimum output rate occurs at a higher load than LR chooses to generate automatically.

Overall I think these are pretty good scaling results for a real world application, but given the tailing off between 3 and 4 cores here I'd be interested to see how these numbers change for 6 and 8 core machines.
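One rough way to put a number on the tailing off is the Karp-Flatt metric, which back-calculates an apparent "serial fraction" from a measured speedup, together with Amdahl's law, which projects speedup from a given serial fraction. The Python sketch below applies both to the single-export times quoted earlier. This is a textbook model swapped in for illustration, not something that was measured: it assumes the serial fraction stays constant as cores are added, which ignores the memory and disk I/O effects discussed above, so the 6 and 8 core figures it prints are guesses and nothing more.

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's law: ideal speedup when a fixed fraction cannot be parallelised."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def karp_flatt(speedup, cores):
    """Karp-Flatt metric: apparent serial fraction implied by a measured speedup."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


# Single-export times in seconds (12:37, 7:10, 5:10, 4:40), from the post.
times = {1: 757, 2: 430, 3: 310, 4: 280}
measured = {cores: times[1] / t for cores, t in times.items() if cores > 1}

fractions = {cores: karp_flatt(s, cores) for cores, s in measured.items()}
for cores, fraction in fractions.items():
    print(f"{cores} cores: apparent serial fraction {fraction:.3f}")

# Model-based projection only: bracket it with the lowest and highest fractions.
low, high = min(fractions.values()), max(fractions.values())
for cores in (6, 8):
    print(
        f"{cores} cores: projected speedup between "
        f"{amdahl_speedup(high, cores):.2f}x and {amdahl_speedup(low, cores):.2f}x"
    )
```

For these times the apparent serial fraction lands somewhere between about 0.11 and 0.16. Given the few seconds of timing error, the differences between the three values should not be over-interpreted.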

Task Manager Plots

Just to give an idea of how the system was loaded.

3. Task Manager history for 1 core testing. From left to right - 1, 2 and 4 parallel exports







A couple of interesting things to see here. Firstly the little drops in the CPU load for the single export job, and their absence when more heavily loaded with 2 and 4 parallel jobs. I'm pretty sure from looking at the export in real time that these coincide with similar memory usage drops and are associated with file I/O. Interestingly they don't happen for every file processed (at least not as the LR export slider lists the file conversions), but sometimes for every second file. I wonder if LR is caching more than 1 conversion in memory. At higher loads (2 and 4 parallel exports) these troughs are "filled in". A simplistic explanation would suggest this means fewer lost cycles and thus explains the speed up when running more than one export in parallel. It might even be right.

I also noticed that LR uses more memory with increasing numbers of parallel exports.

4. Task Manager history for 2 core testing. From left to right - 1, 2 and 4 parallel exports







A few things to notice here. Firstly, for the single export process there is a lot more up and down in the CPU core loads than in the single CPU core test above. Running multiple parallel export jobs again "fills" up these troughs, keeping the CPU busier and yielding the performance increase seen in the timings. As with the single CPU core results, running 4 parallel exports yields little extra because the CPU is already saturated at 2. I would note that the real difference between the 2 and 4 parallel exports was 3 seconds, and thus within the experimental error. No significance should be given to one being faster than the other.

Another thing to notice is that there is some asymmetry apparent between the core loads in the single export process - one core is more loaded than the other. I don't really know why this is, but I wonder if the Windows kernel has its affinity set quite strongly so as to stop threads hopping between cores and wasting the CPU cycles associated with the switch. I remember earlier multiprocessor versions of Windows were criticised for doing this.

5. Task Manager history for 3 core testing. From left to right - 1, 2 and 4 parallel exports.







Pretty much the same story as for the 2 core testing. It looks like we can still see that affinity effect in the single export process too.

6. Task Manager history for 4 core testing. From left to right - 1, 2 and 4 parallel exports.







Again, I think it's the same story here as before.


Conclusion

I think it's pretty conclusive that LR is using multiple CPU cores when exporting images to disk. It looks like it uses a "distributed" type model, so that each image conversion is spawned off to a separate CPU core, which obviously means that to get the maximum speedup you need to be exporting several images at a time.
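For readers unfamiliar with the "distributed" model, here is a minimal Python sketch of the general pattern: a pool of workers, each handed one independent job at a time. This is emphatically not how Lightroom is implemented (that is not known from this test). `convert_image` is a hypothetical stand-in that just hashes some bytes to burn CPU time, and the fake image data is invented for the example. Threads are used here on the assumption that `hashlib` releases Python's global interpreter lock while hashing large buffers; pure-Python CPU work would need a process pool instead.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def convert_image(raw_bytes):
    """Hypothetical stand-in for one raw conversion: independent, CPU-heavy work."""
    return hashlib.sha256(raw_bytes).hexdigest()


def export_all(images, workers=None):
    """Hand each image to the next free worker; results keep the input order."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_image, images))


# Invented 1 MB blobs standing in for raw files.
fake_images = [bytes([i % 256]) * 1_000_000 for i in range(8)]
results = export_all(fake_images, workers=4)
print(len(results), "images converted")
```

Because no job depends on another's output, the only things the workers compete for are memory bandwidth and disk I/O, which is why this style of parallelism tends to scale well until one of those runs out.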

The scaling is pretty good, at least within the confines of my system, but the tailing off between 3 and 4 CPU cores does suggest that higher core counts might not yield so great a performance increase. Of course, hexa and octa core processors come with a faster memory subsystem and better caching than my system, which might serve to compensate somewhat. I wouldn't want to speculate on the effect of having two physical CPUs rather than one with the same core count.

There is an interesting and significant performance boost for splitting the export job in two and running parallel exports. This seems to load the CPU a little more optimally and also improves the scalability with CPU core count.

And finally a reminder that this test only covers file exporting. It is much harder to determine whether the interactive editing tools are multi-threaded. It is my feeling that they aren't, though I have no actual evidence for this. If this is the case though, you will get a snappier feeling LR experience with a faster dual core CPU than with a slower quad core, but you will get faster exports with the quad for sure.

Comments and stuff welcome - I've got my beer and popcorn ready...



May 01, 2011 at 05:52 PM
James_N
p.1 #2 · Lightroom Performance Testing - Multithreading


Interesting and detailed research; the conclusion on speeding up exports has been known for about two years. See How to get Faster JPEG Exports from Lightroom
and
Optimizing Adobe Lightroom



May 01, 2011 at 07:09 PM
Alan321
p.1 #3 · Lightroom Performance Testing - Multithreading


The thing is that even if Lr was using only a single core for file export - as I believe Ps does - that fact would be masked by Lr doing fresh multi-cored reading, conversion and processing to each file prior to exporting it. Ps does not need to do all of this re-work because its files have already been processed, but Lr is parameter-based and starts almost afresh every time a file has to be exported or is opened in the Develop module. Some pre-processing of the files may be held in the Lr cache but you can't always count on that unless you are exporting recently edited files and have a big enough cache.


If you look at Lloyd Chambers' comprehensive Mac Performance Guide you'll find articles on the multi-core scalability of software such as Ps and Lr with up to 12 cores. The main site is at http://macperformanceguide.com/index_topics.html. He concentrates on Ps and some of the tests are now out of date, but there's plenty of good info to be found if you hunt around for it. That may be boring for Windows users, however, because it is strictly a Mac site.

For example, Lr used to be - but I don't know if it still is - much slower at exporting compressed TIFF files than any other file type. Uncompressed TIFFs could be exported fastest, so what you export matters along with how you break the task into multiple concurrent jobs.

While Lr utilises multiple cores for a number of tasks (perhaps most of them) it does not seem to use them all fully and so the scalability is not as good as it ought to be. As an example, in my 2011 MacBook Pro I have often seen the four cores being used at less than 100% but almost never see the four virtual cores being used even when the primary cores are fully used. Something in the Lr implementation seems to waste the available processing power.

Getting file data from a drive is a bottleneck too, often restricted to a single core until the data is on-board for processing. If a benchmark test exports the same files over and over then it might be gaining a benefit from the operating system caching and/or hard drive caching that might not apply if different files were used. This could account for different results by different testers.

- Alan



May 02, 2011 at 01:14 PM
15Bit
p.1 #4 · Lightroom Performance Testing - Multithreading


Alan,

I don't think there were any hard drive caching issues in my tests, as I repeated some of them to make sure there was consistency. If there were, then they applied equally to all tests, making the results at least internally consistent.

The hard disk access was playing on my mind when I did the testing - the thread was initially called "....multithreading and hard disk speeds", but I didn't have time yesterday to do any more tests. I might have a go at that tonight to see what difference it makes running off various storage media.

When discussing scalability you should really stick to just real cores and not count the Hyperthreading virtual cores. These can offer an improvement for some tasks, but because a virtual core has no execution units of its own (it shares them with a real core), you can end up with confusing numbers on the scaling charts. The fact that LR doesn't touch them might be deliberate on Adobe's part, as for some tasks HT was found to slow things down.



May 02, 2011 at 03:45 PM
Alan321
p.1 #5 · Lightroom Performance Testing - Multithreading


I didn't expect it to use the virtual cores and the real cores fully, but I had hoped that it would use the combination to the equivalent of 100%, instead of partially loading the primary cores and ignoring the virtual cores.

I'm now well out of touch with software development but I've read that multi-threading is very easy to implement on the Mac OS, taking just a few lines of code. Even if Windows cannot allow it, I'd like to see the Mac versions of software do so. Then we wouldn't need to manually split exports into batches or just suffer unnecessary delays. I wonder why it is that modern software from the big software companies still doesn't get it right, even on things like file handling. Surely they can afford to learn how to implement it better.

In the meantime, the best gains for Lr seem to come from using more RAM and a speedy solid state drive (preferably SATA 3.0) rather than lots of extra CPU cores.

- Alan



May 02, 2011 at 07:07 PM
 




thedigitalbean
p.1 #6 · Lightroom Performance Testing - Multithreading


Alan321 wrote:
[snip]
I'm now well out of touch with software development but I've read that multi-threading is very easy to implement on the Mac OS, taking just a few lines of code.[snip]
- Alan


Nope. GCD (Grand Central Dispatch) and Intel's TBB (Threading Building Blocks) can make life a little easier, but it's far, far, far from taking existing code and just adding a few lines to make it multithreaded. And even that only applies to algorithms or operations which aren't inherently serial (reading from or writing to a disk being an example of one that is).



May 02, 2011 at 07:27 PM
thedigitalbean
p.1 #7 · Lightroom Performance Testing - Multithreading


As for the 'virtual cores', they come about because of hyper-threading, and yes, avoiding them is usually a deliberate decision. Hyper-threading can actually end up hurting performance (because of cache implications) and I've seen many pieces of software avoid using it.


May 02, 2011 at 07:30 PM
PhilDrinkwater
p.1 #8 · Lightroom Performance Testing - Multithreading


Thanks for doing the tests. Appreciated.


May 05, 2011 at 11:37 AM
Monkey Falls
p.1 #9 · Lightroom Performance Testing - Multithreading


Great info.

I haven't run any tests and I'm certainly no computer expert, but I do observe all 4 of my real cores and all four of my virtual cores fully utilized when I export from LR3. My little CPU widget in Win7 shows all 8 "cores" utilized. Of course it bounces around and does not show 100% usage constantly, but on average all 8 "cores" are being highly utilized.



May 05, 2011 at 01:05 PM






    
 
