Performance Testing - How does LR4 utilise multiple cores
/forum/topic/1165920/0


15Bit
Registered: Jan 27, 2008
Total Posts: 3817
Country: Norway

The appearance of yet another thread discussing the slow performance of LR4 and the question of how well it uses multicore CPUs has spurred me into doing some actual performance testing, just to see if I can identify more quantitatively some of the slowdowns many of us seem to see. I have done some testing previously, though it did not look at the interactive aspects of the Library and Develop modules. Those earlier tests are at:

Hard Disk I/O with LR4 – http://www.fredmiranda.com/forum/topic/1123725/0#10730582

Export Module CPU scaling with LR3.3 - http://www.fredmiranda.com/forum/topic/1005214/0&year=2011#9547107

The objective of this current test is to determine how well LR uses multiple cores in the Library and Develop modules of LR and to give some subjective evaluation of how responsive the system “feels” along with it. I apologise in advance for how long this is, but I tried to make it fairly thorough and brevity is then difficult.

The system on test is:

Intel i5-3570K clocked down to 1600MHz with turbo boost off
Asus P8Z77-V-Pro motherboard
16GB of DDR3-1600
128GB Samsung 830 SSD - Boot drive
80GB Intel X25 G2 SSD – Images, LR catalogue, ACR temp dir

I down-clocked the CPU to reduce the responsiveness of the system as much as possible so as to give a good chance of seeing where the slowdowns occur. Also, all I/O is going to or from an SSD to minimise any I/O interference.

I am testing with LR4.2 and using a nifty little tool from Sysinternals called Process Explorer which allows monitoring of the CPU usage of a single process. It also allows the user to set processor affinity for a process – giving the possibility to limit the number of cores a process can use (from 1-4 on my system).
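To make the affinity setting concrete: on Windows a process's affinity is stored as a bitmask with one bit per logical core, and the checkboxes in Process Explorer's affinity dialog simply set or clear those bits. The following is a minimal sketch of how such a mask is built for 1 to 4 cores. The helper names are hypothetical, and nothing here calls a Windows API or touches a real process; it only shows the arithmetic.

```python
def affinity_mask(core_indices):
    """Return a bitmask with one bit set per allowed logical core (bit 0 = core 0)."""
    mask = 0
    for index in core_indices:
        if index < 0:
            raise ValueError("core index must be non-negative")
        mask |= 1 << index
    return mask


def first_n_cores(n):
    """Core indices for restricting a process to the first n cores (hypothetical helper)."""
    return list(range(n))


# Masks corresponding to the 1, 2, 3 and 4 core runs used in the tests.
for n in range(1, 5):
    mask = affinity_mask(first_n_cores(n))
    print(f"{n} core(s): mask = {mask:#06b} ({mask})")
```

Which physical cores are chosen for each run is not stated in the post; using the first n cores is an assumption made for illustration.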

Before each test I restarted LR and cleared the ACR and Previews caches, or at least tried to. I noticed a bug here – the “Discard 1:1 Previews” command doesn’t seem to work. At least it doesn’t delete anything from the hard disk, and clicking between images after running it doesn’t give the re-rendering you would expect if it really had deleted them. So I manually deleted the preview cache from disk instead.

For these tests I have made two specific catalogues. One has 81 images taken from a Fuji X10 and is a mixture of RAW and JPG files. The second is a single image - a panorama made up of several images from a 5D stitched together and saved as a TIF. It has a lot of trees, leaves etc. and so a lot of fine detail to render. The image is 13147 x 4532 pixels and takes up 454MB on disk.

---------------------------------------------------------------------------
Test 1 – Library Module: Rendering previews for 81 images.

For this test I selected all 81 images and set the catalogue to render 1:1 previews. I timed it for rendering on 1, 2, 3 and 4 cores and also took a trace of the CPU load during the processing.

Rendering on 1 core







As you would expect, the CPU is pegged at 100% for the core for the duration of the rendering.

Rendering on 2 cores







Now we start to see spikes in the trace, but each spike peaks at 50% overall load (i.e. filling 2 cores).

Rendering on 3 cores







Again we see spikes in the trace, but they are now peaking below the 75% max load allowed for 3 cores.

Rendering on 4 cores







And again the spikes, and again they are not reaching near the 100% max allowed load for 4 cores, though they do peak higher than for 3 cores, at around 75%

Scaling with core count







Scaling reflects what we see in the CPU traces – there is a speed up but scaling is some way from ideal. Note that the slowest time (1 core) was 6m 3sec, whilst the fastest (4 cores) was 2m 43sec.
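The two timings quoted above are enough to put numbers on "some way from ideal". The sketch below computes the speedup and parallel efficiency from them, and then, as an extra assumption not made in the original test, fits Amdahl's law to estimate what fraction of the work behaves as if it were parallel. The function names are illustrative only.

```python
def speedup(t_one_core, t_n_cores):
    """How many times faster the n-core run is than the 1-core run."""
    return t_one_core / t_n_cores


def efficiency(s, n_cores):
    """Fraction of ideal (linear) scaling actually achieved."""
    return s / n_cores


def amdahl_parallel_fraction(s, n_cores):
    """Solve S = 1 / ((1 - p) + p / n) for p, the parallelisable fraction."""
    return (1 - 1 / s) / (1 - 1 / n_cores)


t1 = 6 * 60 + 3   # 1 core: 6m 3sec = 363 s
t4 = 2 * 60 + 43  # 4 cores: 2m 43sec = 163 s

s = speedup(t1, t4)
print(f"speedup on 4 cores:  {s:.2f}x")                               # 2.23x
print(f"parallel efficiency: {efficiency(s, 4):.0%}")                 # 56%
print(f"Amdahl parallel fraction: {amdahl_parallel_fraction(s, 4):.2f}")  # 0.73
```

Fitting Amdahl's law to a single pair of timings is crude, so the 0.73 figure should be read as a rough indication rather than a measured property of LR.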

Notes

It seems that LR is using more cores as they are available, though not with great scaling – rendering previews on 4 cores is only 2.2x faster than on 1 core. It is interesting that for rendering on 2-4 cores there are 41 “spikes” in the cpu trace, corresponding to the rendering of 81 images. From the poor scaling and the way that the “spikes” increase in height with core count I suspect that individual renders are to some extent spread over more than 1 core. We can test this by rendering the large panorama on 4 cores:







So rendering of a single image is multithreaded and we can speculate that maybe size “matters” – if there is enough data all cores can spin up to render the image, but if the image is smaller maybe this does not happen.

---------------------------------------------------------------------------
Test 2 – Library Module: Rendering previews on the fly - clicking from image to image

For this test I just clicked on images in the 81 image catalogue, zooming in on each to make the image render.

Rendering on 1 core







Rendering on 4 cores







Notes:

I only present the 1 and 4 core results as the others pretty much followed the trend above. We can confirm here that the rendering of a single image is multithreaded, and that we are not making full use of the 4 cores. I note though that the X10 images are not all that big – it is set to record at 6MP.

Perceptually I would say that it was pretty responsive when I did this on images which had no edits, and a bit slower on images which had. I couldn’t see any increase in the CPU load plot for rendering of images with edits though.

---------------------------------------------------------------------------
Test 3 – Develop Module: Rendering and scrolling around

Now in the Develop module, for this test I’ve used the 81 image catalogue and clicked from image to image, zooming in and scrolling around on each. Some of them are jpegs, some are RAW, some have edits, some don’t. LR is allowed to use all 4 cores.







Notes

Every tall spike here represents a change of image; the lower loads represent scrolling around within the image. It thus appears that whilst initial rendering can take up a big chunk of CPU power, scrolling around doesn’t even utilise 2 cores fully. Perceptually I would say that scrolling around within the images felt a little sluggish. I could see the picture being re-drawn every time I dragged it, and when I clocked the CPU back up to 4.3GHz scrolling around in images was much snappier.

---------------------------------------------------------------------------
Test 4 – Develop Module: Moving the sliders

This test is done on the big panorama and is to see how image adjustments are reflected in the CPU trace. To start, all adjustments are set to default (i.e. the image is “reset”) and then individual sliders are moved around and set back to default before moving to the next. I was zoomed out of the image so that it fit in the screen while doing these, but I did try repeating with the image zoomed to 100% and saw no difference in the results. In order of the numbers on the trace, the adjustments are:

1. Exposure
2. Highlights
3. Shadows
4. Whites
5. Blacks
6. Clarity
7. Vibrance
8. Saturation
9. Quick tweak on exposure, highlights, shadows followed by zooming in and scrolling around







Notes

It seemed that no matter how fast I moved the sliders around I couldn’t make LR use much more than 2 cores’ worth of CPU. I would say that I saw no real lag in the editing response. So the edits are threaded, but are also so fast that they don’t need a lot of CPU. I remind you that this is a 454MB TIF image, and I’ve down-clocked my CPU to 1.6GHz, so the software seems pretty well coded.

Again I see that scrolling around the image (no.9 at the end) doesn’t chew up a lot of CPU, but again I can see the image re-drawing as I drag. And again bumping the CPU back to full speed makes it a lot more responsive.

---------------------------------------------------------------------------
Test 5 – Develop Module: The sharpening and noise reduction sliders

Using the panorama, with the image “reset” to default before starting. Zoomed to fit the screen to begin:

1. Sharpening by typing a number in the box
2. Sharpening by moving the slider
3. Noise reduction by typing a number in the box
4. Noise reduction by moving the slider
5. Same as 1, but zoomed to 100% view
6. Same as 2, zoomed at 100% view
7. Same as 3, zoomed at 100% view
8. Same as 4, zoomed at 100% view







Notes

Sharpening is again threaded according to the CPU trace, and perceptually it also happens quickly. Noise reduction is the first action where I can peg all 4 cores, and I would say it is also perceptually slower to respond. It wasn’t “very” slow though. Scrolling around after applying sharpening and NR was a lot slower, however, leading to test 6…

---------------------------------------------------------------------------
Test 6 – Develop Module: Scrolling around with sharpening and NR applied

Simple test here - scrolling around the panorama with and without sharpening and NR applied. So:

1. Scrolling around “reset” image
2. Sharpening and NR applied
3. Scrolling around again







Notes

I think here we have found the problem. With no sharpening and NR applied, the response when scrolling around and zooming is decent. Once sharpening and NR are applied it slows down a lot. Note though that the CPU load is about the same in both cases. Although I didn’t do a CPU load plot for it, I have separately tested NR and sharpening, and it’s the NR that is the culprit.

---------------------------------------------------------------------------
Test 7 – Develop Module: Local adjustments with and without NR applied

This test further explores the effect of having NR turned on. Again I used the panorama, sweeping around a local adjustment brush that has a combination of Exposure, Contrast, Highlights, Shadows, Clarity and Saturation set. The image was “reset” between adjustments:

1. Zoomed out to fit screen, no NR applied
2. Zoomed out to fit screen, NR applied
3. Zoomed to 100%, no NR applied
4. Zoomed to 100%, NR applied







Notes

Some differences can be seen in the CPU trace: doing local adjustments with NR turned on does seem to use more CPU, but it doesn’t get near 100%. Perceptually I would have to say that it really *needs* to be using 100%, as there were several seconds of lag when doing the local adjustments with NR turned on and the image zoomed to 100%. With NR turned off it was a touch sluggish, but more than acceptably responsive.

---------------------------------------------------------------------------
Test 8 – Exporting

Testing whether image exporting is using all the CPU. Again the big panorama image. During the course of this I noticed something “funny” going on, and so tried exporting with the long edge set to different lengths, just to see:







Notes

Again, LR is showing good multithreading – exporting a single image can use multiple cores. This is significant I think, as the easiest option to speed up exports would be by not having multithreading and assigning 1 image per core. Doing it this way indicates that Adobe has probably spent a lot of time on the code for the export module.

There is some interesting behaviour though: with the long edge of the image set to anything above 1000 pixels all 4 cores get utilised for the export; with the long edge set between 750 and 1000px you get 3 cores; between 500 and 750px you get 2 cores; and below that you get 1 core. I expect this is a deliberate choice Adobe have made, but I don’t know the reasons and I don’t feel like speculating. I’ve not tested it, but my guess is that when exporting a batch of downsized images LR will export multiple images in parallel to compensate for limiting the number of cores each export can use.
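The observed relationship between export size and core usage can be written down as a small lookup. This is only a restatement of the behaviour seen on this one 4-core machine, not Adobe's actual logic, and the function name is hypothetical. How the exact boundary values (500, 750 and 1000px) are treated was not tested, so the comparisons below are an assumption.

```python
def observed_export_cores(long_edge_px):
    """Cores seen in use when exporting one image, by long-edge length in pixels.

    Thresholds are the ones observed in the test on a 4-core CPU; behaviour at
    exactly 500, 750 and 1000px was not measured and is assumed here.
    """
    if long_edge_px > 1000:
        return 4
    if long_edge_px > 750:
        return 3
    if long_edge_px > 500:
        return 2
    return 1


for edge in (400, 600, 900, 2000, 13147):
    print(f"long edge {edge:>5}px -> {observed_export_cores(edge)} core(s)")
```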

---------------------------------------------------------------------------
CONCLUSIONS

I think we can conclude that LR4 is pretty thoroughly multi-threaded, for which Adobe should be praised. All the adjustments I tried utilise multiple cores as they are being implemented. Scaling is not perfect, but it never is for these types of calculation. The only part of the software that seemingly doesn’t make good use of multiple cores is zooming and panning images, and when NR has been applied this really really slows down.

I confess I don’t really understand why this is. It is obvious from the CPU traces that edits (incl NR) are applied as you move the slider, not as you move around the image. You can tell this from the fact that applying NR peaks the CPU at over 90%, whilst scrolling around the image with NR applied never gets it close to this. On the other hand, if the images are being fully rendered as you move the sliders then zooming and scrolling should simply involve shifting the data from RAM to display, which should surely be instantaneous as it is in the Library module. So I am curious about what actually goes on in the Develop module.

In terms of how many cores you should buy, I would say that from these numbers there is little value in going above 4 cores to improve responsiveness. I would expect more cores to significantly speed up large batches of image exports though.

So in summary, for responsiveness in Library and Develop modules I reckon an overclocked quad core CPU is the way to go. And do noise reduction as late in your workflow as possible.



snapsy
Registered: Feb 24, 2008
Total Posts: 4687
Country: United States

Nice work. One important note though is that observing CPU utilization spread across cores is not necessarily indicative of an application that was explicitly written to take advantage of multiple cores, i.e. it doesn't imply multi-threaded operation for a given task. The distribution of utilization also occurs when the OS dispatches a different core across executive time slices of the kernel. For example, if I write a single-threaded app that does processor-intensive work I'll see all cores participating in that work even though only one core is executing the logic at a time. The preemptive dispatches occur at intervals fast enough that they're not detected by the coarse sampling of perfmon, giving the false appearance that all cores are participating in the work at the same time.
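The effect described here can be illustrated with a toy simulation: a single thread that is moved to a different core on each time slice makes every core look partly busy when averaged over a sampling window, even though only one core is ever working at a time. The uniform random choice of core is a simplifying assumption and is not how a real scheduler decides; the function name is illustrative.

```python
import random


def simulate_single_thread(n_cores=4, n_slices=10_000, seed=0):
    """Toy model: one thread, placed on a randomly chosen core for every time slice."""
    rng = random.Random(seed)
    busy_slices = [0] * n_cores
    for _ in range(n_slices):
        # Exactly one core runs the thread in any given slice.
        busy_slices[rng.randrange(n_cores)] += 1
    # Utilisation of each core, averaged over the whole sampling window.
    per_core = [count / n_slices for count in busy_slices]
    # Total work done, expressed in cores' worth of CPU.
    core_equivalents = sum(busy_slices) / n_slices
    return per_core, core_equivalents


per_core, core_equivalents = simulate_single_thread()
print("per-core utilisation:", [f"{u:.0%}" for u in per_core])
print("total core-equivalents:", core_equivalents)
print("share of a 4-core machine:", f"{core_equivalents / 4:.0%}")
```

In this model every core shows roughly a quarter load, but the total never exceeds one core's worth, which is 25% of a quad core machine.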



15Bit
Registered: Jan 27, 2008
Total Posts: 3817
Country: Norway

Thanks snapsy,

You are right, but in the case you describe you would see the single threaded process hopping between cores as it executes (a sign of poorly handled scheduling / affinity if it does that a lot), and at no time would it be able to take up more than the equivalent of 100% of one core. That's why i used Process Explorer rather than Windows Task Manager - Process Explorer can give information *only* for the chosen process, not the system as a whole or as an average. So, as i understand it, a single threaded process would not be able to hop around cores and appear to take more than 100% of 1 core.

Also, as best i can measure (and feel subjectively), more cores do make things faster for these tests. And that's measured by changing the core count only for LR, as i'm not rebooting with different numbers of cores turned on.
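The reasoning in this reply can be turned into a simple rule: convert the per-process load (as a percentage of the whole machine) into cores' worth of CPU, and round up to get the smallest number of threads that must have been running at once. This is a sketch under the assumption that the load figure is normalised to the total capacity of all cores, as the 50% = 2 cores reading in Test 1 implies; the function names are hypothetical.

```python
import math


def core_equivalents(process_load_pct, logical_cores):
    """Convert a whole-machine load percentage into cores' worth of CPU."""
    return process_load_pct / 100 * logical_cores


def min_busy_threads(process_load_pct, logical_cores):
    """Smallest number of simultaneously running threads that could produce this load."""
    # Small tolerance so that exactly 2.0 cores' worth is not rounded up to 3.
    return math.ceil(core_equivalents(process_load_pct, logical_cores) - 1e-9)


# Loads of the kind seen in the traces on the 4-core test machine.
for load in (25, 50, 75, 100):
    print(f"{load:>3}% load = {core_equivalents(load, 4):.1f} cores' worth, "
          f"at least {min_busy_threads(load, 4)} busy thread(s)")
```

Any sustained per-process reading above 25% on the quad core therefore cannot come from a single thread, however much it hops between cores.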



Hammy
Registered: May 21, 2002
Total Posts: 2844
Country: temp

15Bit wrote:The only part of the software that seemingly doesn’t make good use of multiple cores is zooming and panning images, and when NR has been applied this really really slows down.


Just guessing here, but would the image manipulation be done more with the graphics card - as in the case of Photoshop, moreso even with Premiere Pro? That seems to be Adobe's direction and pattern: certain functions/filters being highly multi-threaded, others nil or less so, and appearance being handled by the display module/hardware.


15Bit wrote:I down-clocked the CPU to reduce the responsiveness of the system as much as possible so as to give a good chance of seeing where the slowdowns occur.

Is that what naturally happens at 63°25' North latitude in the coming winter months?
Or does that make it easier to vent outside air in for 6Ghz overclocking?

Thanks for the time to do the tech/leg work on LR!



amonline
Registered: Jul 16, 2006
Total Posts: 6840
Country: United States

15Bit wrote:
In terms of how many cores you should buy, I would say that from these numbers there is little value in going above 4 cores to improve responsiveness. I would expect more cores to significantly speed up large batches of image exports though.

So in summary, for responsiveness in Library and Develop modules I reckon an overclocked quad core CPU is the way to go. And do noise reduction as late in your workflow as possible.


I agree completely.

Great job on the testing.



15Bit
Registered: Jan 27, 2008
Total Posts: 3817
Country: Norway

Hammy,

From what i have gathered reading around, there is no graphics acceleration in LR except for video. There is a post somewhere on Adobe's forum from one of their devs stating that the LR rendering pipeline is long and complex and wouldn't be easy to adapt to graphics card optimisation. That said, Phase One have quite a lot of OpenCL acceleration in Capture One, so i am expecting LR5 to have it too.

A couple of years ago i did set my home quad core machine to doing 1 week long molecular dynamics calculations during the winter, partly offsetting the electric cost against heating my flat

I seem to remember you use LR for your business - how is your experience of the core scaling for exports?



Matthew Cham
Registered: Sep 13, 2007
Total Posts: 338
Country: United States

Thank you for this brilliant testing. This is exactly the kind of information I was looking for in my other thread. Sounds like additional cores above 1 core provide incrementally lower performance gains. Best bang for the buck is 1 core, and lowest bang for the buck is 6 cores. If money is no object, then get as many cores as you can afford.

This brings me to my next question: If money is no object, will you see more responsiveness from a 6-core CPU (@OC 4.0 GHz) or from an overclocked 4-core CPU (@OC 4.4 GHz)? That is, will the underutilized 5th and 6th core outperform the overclocking gains from the most expensive 4-core CPU (i7-3820 @OC 4.4 GHz)?



morganb4
Registered: Nov 03, 2005
Total Posts: 5313
Country: Australia

Brilliant work. Very thorough.

When I apply nr, I see 4 threads light up with another 2 barely active. My net cpu usage does not climb above 41% even with process priority set to high. Ram is not the bottleneck: my latency is pretty low and my speed is 2133.

I would be very interested to see your tests on a specific condition, which is shadow or highlight set to something and then nr applied.



Hammy
Registered: May 21, 2002
Total Posts: 2844
Country: temp

15Bit wrote:
A couple of years ago i did set my home quad core machine to doing 1 week long molecular dynamics calculations during the winter, partly offsetting the electric cost against heating my flat


ROFL!!



15Bit wrote:
I seem to remember you use LR for your business - how is your experience of the core scaling for exports?


Actually, I've been stuck with PS/CS since ... a long time ago. But I'm looking more and more at LR for the bulk of the simple things that we do and that it can do more easily.



15Bit
Registered: Jan 27, 2008
Total Posts: 3817
Country: Norway

Matthew Cham wrote:
This brings me to my next question: If money is no object, will you see more responsiveness from a 6-core CPU (@OC 4.0 GHz) or from an overclocked 4-core CPU (@OC 4.4 GHz)? That is, will the underutilized 5th and 6th core outperform the overclocking gains from the most expensive 4-core CPU (i7-3820 @OC 4.4 GHz)?


Without a 6 core setup to check i can't be definitive, but i strongly suspect the quad @ 4.4Ghz will feel faster. Mine certainly feels perky @4.3Ghz (except when i have NR turned on). Another thing i haven't checked is Hyperthreading, as i don't have it. But from what i've read, LR doesn't benefit greatly from Hyperthreading, and looking at the plots above i would tend to agree. So i'm not convinced an i7 is worth the money over an i5.



morganb4
Registered: Nov 03, 2005
Total Posts: 5313
Country: Australia

^On 8 of the threads, only 4 of them are properly active during me playing around with the noise slider, 2 of them are sort of doing something and 2 are doing nothing. My net cpu usage does not go above 41%.

In export, if I select more than one export task (irrespective of how many images in each task), then all 8 light up. If I have only 1 multi-image export task, only 5 light up. Exporting benefits from multithreading.

Other threads in other forums indicate that the LR experience feels snappier with HT turned off.

I get what you are saying about an OC 4 possibly beating a stock 6 core SB-E but would you expect much of an improvement if that 6 core was also OC to 4.5 or so?



15Bit
Registered: Jan 27, 2008
Total Posts: 3817
Country: Norway

morganb4 wrote:
I get what you are saying about an OC 4 possibly beating a stock 6 core SB-E but would you expect much of an improvement if that 6 core was also OC to 4.5 or so?


At this low core count we aren't at the point where scaling becomes negative (i.e. getting slower with more cores), we just see a limited performance increase. So if you are comparing a quad core and a hex core, both running at the same clock speed, i would expect the hex core to be noticeably faster when rendering 1:1 previews, and maybe a shade faster when moving sliders and such (though this is pretty much instant on my quad at full speed). The "bottleneck" in terms of the user experience - zooming and scrolling around - obviously isn't using the 4 cores i have, so i doubt a 6 core will be perceptibly faster. For sure you won't be sitting at your computer shouting "eureka".



morganb4
Registered: Nov 03, 2005
Total Posts: 5313
Country: Australia

Ok... So would you please confirm my little experiment above? I.e. do you get instant noise slider joy with highlight or shadow or clarity set to anything other than zero? Zoomed in on a high mp raw like 5d3 or 2 or something?

If so, I'm going to buy your setup.



15Bit
Registered: Jan 27, 2008
Total Posts: 3817
Country: Norway

I'll try to test it tonight when i get home. I don't have any modern high MP RAW files though - the biggest i have are 1DsII files. I shoot an original 5D...



morganb4
Registered: Nov 03, 2005
Total Posts: 5313
Country: Australia

The 1ds2 is 16mp which will be fine. Thanks HEAPS.



15Bit
Registered: Jan 27, 2008
Total Posts: 3817
Country: Norway

Ok, a bit more testing then. This time to try out the interplay of various sliders with the NR plugin.

So for this one i've used a RAW file from a 1DsII (16Mp). Again i clocked down the CPU to 1.6GHz. The test is simple - i set a value on one of the adjustments listed below and then slowly moved the NR slider up to 60 and down again. At each end i waited for the image to "catch up" with the slider setting. I did this a couple of times for each. I reset the adjusted value to "0" before testing the next adjustment. Everything was done at 100% zoom. So:

1. Reset image, no adjustments
2. Clarity +1
3. Highlights +1
4. Shadows +1
5. White clipping +1
6. Black clipping +1
7. Exposure +30
8. Contrast +15







Notes

Perceptually i found NR to be quite responsive without any adjustments applied. With clarity, highlights and shadows turned on, it was a *lot* slower, showing a couple of seconds lag at each end of the NR slider. The other adjustments were nice and snappy, just like having nothing turned on.

You can actually see this in the plots - the three "slow" adjustments use more CPU, *but* they are not maxing out the 4 cores for the duration of "lag" as i would hope.

I would say that i repeated this experiment with some sharpening turned on and response slowed a touch for all the "fast" adjustments above (including "no" adjustment). It slowed a lot for the clarity, highlights and shadows though.

Then, just to see, i clocked my CPU back to 4.3GHz and repeated the NR adjustment with no adjustments and with clarity set to +1:







The CPU use is about the same as at 1.6Ghz, and again there is a noticeable slowdown with the clarity set, but the responsiveness is a lot higher.

And finally, I would comment that whilst responsiveness of LR when *moving* the NR slider is seriously slowed by the application of clarity etc, zooming and scrolling around the screen with sharpening, NR and clarity turned on seemed to be about the same as with just NR and sharpening on (and clarity off). So it seems clarity, highlights and shadows only affect the calculation of NR, not the redrawing of the image to the screen.

It would seem that this is a complex issue with many variables. They all (so far) involve the NR slider though, so the easiest solution seems to be to do NR at the end of your workflow.


morganb4
Registered: Nov 03, 2005
Total Posts: 5313
Country: Australia

Wow, so perceptually, with your fully clocked system what is the lag on the noise sliders with shadow/highlight set? 0.5 seconds or still 2 seconds?

Thanks massively.

Ben



15Bit
Registered: Jan 27, 2008
Total Posts: 3817
Country: Norway

On the 1DsII image i reckon around a second maximum. I would describe it as laggy but usable. For sure i would not sit swearing at the screen in frustration. For the big pano i tested earlier, the lag is more like 3 secs, so it's moving into annoying territory there.

What system are you running currently?



morganb4
Registered: Nov 03, 2005
Total Posts: 5313
Country: Australia

Clearly your setup is working better than mine. I have my suspicions that there is some weird CPU board interaction going on here.

i7 2600K on a GA-Z68X-UD3R-B3 board. at 4.5GHz, tmax = ~low 40s during working.
8GB 2133 RAM dropped to command rate = 1 and other tweaks - 26ns latency (was a 16gig kit but couldn't run all at T1 so removed offending sticks)
SSD . 2xHD6850.
WIN7 & MacOS 10.8.2

I will splash for a 6 core if you think I will see an improvement on a different board.



15Bit
Registered: Jan 27, 2008
Total Posts: 3817
Country: Norway

There is nothing in my tests that suggests a 6 core chip is worth buying unless you do really big batch processing and exports.

On your system I would first try turning off Hyperthreading and seeing if that makes a difference. In principle, your i7 at 4.5Ghz should be more than the equal of my i5, and the only difference between them that i can think of is the Hyperthreading cores.


