15Bit
Performance Testing - How does LR4 utilise multiple cores
The appearance of yet another thread discussing the slow performance of LR4 and the question of how well it uses multicore CPUs has spurred me into doing some actual performance testing, to see if I can identify more quantitatively some of the slowdowns many of us seem to see. I have done some testing previously, but without looking at the interactive aspects of the Library and Develop modules. The earlier tests are at:
Hard Disk I/O with LR4 – https://www.fredmiranda.com/forum/topic/1123725/0#10730582
Export Module CPU scaling with LR3.3 - https://www.fredmiranda.com/forum/topic/1005214/0&year=2011#9547107
The objective of this current test is to determine how well LR uses multiple cores in the Library and Develop modules, and to give some subjective evaluation of how responsive the system “feels” along with it. I apologise in advance for the length, but I tried to be fairly thorough, which makes brevity difficult.
The system on test is:
Intel i5-3570K clocked down to 1600MHz with turbo boost off
Asus P8Z77-V-Pro motherboard
16GB of DDR3-1600
128GB Samsung 830 SSD – Boot drive
80GB Intel X25 G2 SSD – Images, LR catalogue, ACR temp dir
I down-clocked the CPU to reduce the responsiveness of the system as much as possible so as to give a good chance of seeing where the slowdowns occur. Also, all I/O is going to or from an SSD to minimise any I/O interference.
I am testing with LR4.2 and using a nifty little tool from Sysinternals called Process Explorer which allows monitoring of the CPU usage of a single process. It also allows the user to set processor affinity for a process – giving the possibility to limit the number of cores a process can use (from 1-4 on my system).
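For anyone who would rather script the core restriction than click through Process Explorer, affinity is normally expressed as a bitmask with one bit per logical core (this is the form the Windows SetProcessAffinityMask call takes). A minimal sketch, assuming cores are numbered from 0 and that the first N cores are the ones allowed; `affinity_mask` is a hypothetical helper name of my own, not part of any tool used here:

```python
def affinity_mask(n_cores: int, total_cores: int = 4) -> int:
    """Return a bitmask allowing the first n_cores logical cores (bit i = core i)."""
    if not 1 <= n_cores <= total_cores:
        raise ValueError("n_cores must be between 1 and total_cores")
    return (1 << n_cores) - 1


# Show the mask for each of the four configurations tested in this post.
for n in range(1, 5):
    print(n, "core(s):", format(affinity_mask(n), "04b"))
```

Which physical cores the first N bits map to is an assumption; for these tests it only matters how many are enabled.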
Before each test I restarted LR and cleared the ACR and Previews caches, or at least tried to. I noticed a bug here – the “Discard 1:1 Previews” command doesn’t seem to work. At least it doesn’t delete anything from the hard disk, and clicking between images after running it doesn’t give the re-rendering you would expect if it really had deleted them. So I manually deleted the preview cache from disk instead.
I have made two catalogues specifically for these tests. One has 81 images taken with a Fuji X10 and is a mixture of RAW and JPG files. The second is a single image - a panorama made up of several images from a 5D stitched together and saved as a TIF. It has a lot of trees, leaves etc. and so a lot of fine detail to render. The image is 13147 x 4532 pixels and takes up 454MB on disk.
---------------------------------------------------------------------------
Test 1 – Library Module: Rendering previews for 81 images.
For this test I selected all 81 images and set the catalogue to render 1:1 previews. I timed it for rendering on 1, 2, 3 and 4 cores and also took a trace of the CPU load during the processing.
Rendering on 1 core
http://farm9.staticflickr.com/8062/8184763360_a9888aeb7d_z.jpg
As you would expect, the CPU is pegged at 100% for the core for the duration of the rendering.
Rendering on 2 cores
http://farm9.staticflickr.com/8198/8185329966_2c40fd13e1_z.jpg
Now we start to see spikes in the trace, but each spike peaks at 50% overall load (i.e. filling 2 cores).
Rendering on 3 cores
http://farm9.staticflickr.com/8478/8185329848_9f1b472dd1_z.jpg
Again we see spikes in the trace, but they are now peaking below the 75% max load allowed for 3 cores.
Rendering on 4 cores
http://farm9.staticflickr.com/8065/8185293783_53ed7774da_z.jpg
And again the spikes, and again they do not get near the 100% max load allowed for 4 cores, though they do peak higher than for 3 cores, at around 75%.
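Because the trace shows load as a share of the whole 4-core machine, the ceiling drops as cores are taken away, which is where the 50% and 75% limits above come from. A small sketch of the conversion, assuming a 4-core system like mine; both function names are just illustrative:

```python
def load_ceiling_percent(allowed_cores: int, total_cores: int = 4) -> float:
    """Highest overall load a process can show when limited to allowed_cores."""
    return 100.0 * allowed_cores / total_cores


def cores_in_use(overall_load_percent: float, total_cores: int = 4) -> float:
    """Convert an overall load reading into an equivalent number of busy cores."""
    return overall_load_percent / 100.0 * total_cores


print(load_ceiling_percent(2))  # 50.0
print(load_ceiling_percent(3))  # 75.0
print(cores_in_use(75.0))       # 3.0
```

So a peak of around 75% with all 4 cores enabled is roughly 3 cores' worth of work.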
Scaling with core count
http://farm9.staticflickr.com/8339/8185361463_0dbb106477_z.jpg
Scaling reflects what we see in the CPU traces – there is a speed up but scaling is some way from ideal. Note that the slowest time (1 core) was 6m 3sec, whilst the fastest (4 cores) was 2m 43sec.
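To put numbers on that, here is a rough sketch of the arithmetic using the two timings above. The speed-up and efficiency follow directly from the measurements; the parallel fraction is only an estimate from Amdahl's law, which is a simplified model I am bringing in here, not something I measured:

```python
def speedup(t_single: float, t_multi: float) -> float:
    """How many times faster the multi-core run was."""
    return t_single / t_multi


def efficiency(s: float, n_cores: int) -> float:
    """Fraction of ideal (linear) scaling actually achieved."""
    return s / n_cores


def amdahl_parallel_fraction(s: float, n_cores: int) -> float:
    """Solve Amdahl's law S = 1 / ((1 - p) + p / n) for p."""
    return (1 - 1 / s) / (1 - 1 / n_cores)


t1 = 6 * 60 + 3    # 1 core: 6m 3s, in seconds
t4 = 2 * 60 + 43   # 4 cores: 2m 43s, in seconds

s = speedup(t1, t4)
print(f"speed-up: {s:.2f}x")                                        # 2.23x
print(f"efficiency: {efficiency(s, 4):.0%}")                        # 56%
print(f"parallel fraction: {amdahl_parallel_fraction(s, 4):.2f}")   # 0.73
```

If that model holds, roughly a quarter of the preview rendering work is effectively serial, which would cap the speed-up at around 3.8x however many cores are added.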
Notes
It seems that LR is using more cores as they become available, though not with great scaling – rendering previews on 4 cores is only 2.2x faster than on 1 core. It is interesting that for rendering on 2-4 cores there are 41 “spikes” in the CPU trace, corresponding to the rendering of 81 images. From the poor scaling and the way that the “spikes” increase in height with core count I suspect that individual renders are to some extent spread over more than 1 core. We can test this by rendering the large panorama on 4 cores:
http://farm9.staticflickr.com/8340/8185424918_c06b01bc6a.jpg
So rendering of a single image is multithreaded and we can speculate that maybe size “matters” – if there is enough data all cores can spin up to render the image, but if the image is smaller maybe this does not happen.
---------------------------------------------------------------------------
Test 2 – Library Module: Rendering previews on the fly - clicking from image to image
For this test I just clicked on images in the 81 image catalogue, zooming in on each to make the image render.
Rendering on 1 core
http://farm9.staticflickr.com/8069/8185413335_54a0f6f37d_z.jpg
Rendering on 4 cores
http://farm9.staticflickr.com/8342/8185413253_26f93bc5a7_z.jpg
Notes:
I only present the 1 and 4 core results as the others pretty much followed the trend above. We can confirm here that the rendering of a single image is multithreaded, and that we are not making full use of the 4 cores. I note though that the X10 images are not all that big – it is set to record at 6MP.
Perceptually I would say that it was pretty responsive when I did this on images which had no edits, and a bit slower on images which had. I couldn’t see any increase in the CPU load plot for rendering of images with edits though.
---------------------------------------------------------------------------
Test 3 – Develop Module: Rendering and scrolling around
Now in the Develop module, for this test I’ve used the 81 image catalogue and clicked from image to image, zooming in and scrolling around on each. Some of them are jpegs, some are RAW, some have edits, some don’t. LR is allowed to use all 4 cores.
http://farm9.staticflickr.com/8480/8185455107_eca7db836c_z.jpg
Notes
Every tall spike here represents a change of image; the lower loads represent scrolling around within the image. It thus appears that whilst initial rendering can take up a big chunk of CPU power, scrolling around doesn’t even utilise 2 cores fully. Perceptually I would say that scrolling around within the images felt a little sluggish. I could see the picture being re-drawn every time I dragged it, and when I clocked the CPU back to 4.3GHz scrolling around in images was much more snappy.
---------------------------------------------------------------------------
Test 4 – Develop Module: Moving the sliders
This test is done on the big panorama and is to see how image adjustments are reflected in the CPU trace. To start, all adjustments are set to default (i.e. the image is “reset”) and then individual sliders are moved around and set back to default before moving to the next. I was zoomed out of the image so that it fit in the screen while doing these, but I did try repeating with the image zoomed to 100% and saw no difference in the results. In order of the numbers on the trace, the adjustments are:
1. Exposure
2. Highlights
3. Shadows
4. Whites
5. Blacks
6. Clarity
7. Vibrance
8. Saturation
9. Quick tweak on exposure, highlights, shadows followed by zooming in and scrolling around
http://farm9.staticflickr.com/8344/8185511549_67f667dbeb_z.jpg
Notes
It seemed that no matter how fast I moved the sliders around I couldn’t make LR use much more than 2 cores’ worth of CPU. I would say that I saw no real lag in the editing response. So the edits are threaded, but are also so fast that they don’t need a lot of CPU. I remind you that this is a 454MB TIF image, and I’ve down-clocked my CPU to 1.6GHz, so the software seems pretty well coded.
Again I see that scrolling around the image (no. 9 at the end) doesn’t chew up a lot of CPU, but again I can see the image re-drawing as I drag. And again bumping the CPU back to full speed makes it a lot more responsive.
---------------------------------------------------------------------------
Test 5 – Develop Module: The sharpening and noise reduction sliders
Using the panorama, with the image “reset” to default before starting. Zoomed to fit the screen to begin:
1. Sharpening by typing a number in the box
2. Sharpening by moving the slider
3. Noise reduction by typing a number in the box
4. Noise reduction by moving the slider
5. Same as 1, but zoomed to 100% view
6. Same as 2, zoomed at 100% view
7. Same as 3, zoomed at 100% view
8. Same as 4, zoomed at 100% view
http://farm9.staticflickr.com/8206/8185616774_4f489fe010_z.jpg
Notes
Sharpening is again threaded according to the CPU trace, and perceptually happens quickly too. Noise reduction is the first action where I can peg all 4 cores, and I would say it is also perceptually slower to respond. It wasn’t “very” slow though. Scrolling around after applying sharpening and NR was a lot slower, however, leading to test 6…
---------------------------------------------------------------------------
Test 6 – Develop Module: Scrolling around with sharpening and NR applied
Simple test here - scrolling around the panorama with and without sharpening and NR applied. So:
1. Scrolling around “reset” image
2. Sharpening and NR applied
3. Scrolling around again
http://farm9.staticflickr.com/8067/8185897004_c9bdd97809_z.jpg
Notes
I think here we have found the problem. Scrolling around and zooming with no sharpening and NR applied, the response is decent. Once sharpening and NR are applied it slows down a lot. Note though that the CPU load is about the same in both cases. Although I didn’t do a CPU load plot for it, I have tested NR and sharpening separately, and it’s the NR that is the culprit.
---------------------------------------------------------------------------
Test 7 – Develop Module: Local adjustments with and without NR applied
Further exploring the effect of having NR turned on. Again the panorama, this time sweeping around a local adjustment brush that has a combination of Exposure, Contrast, Highlights, Shadows, Clarity and Saturation set. The image was “reset” between adjustments:
1. Zoomed out to fit screen, no NR applied
2. Zoomed out to fit screen, NR applied
3. Zoomed to 100%, no NR applied
4. Zoomed to 100%, NR applied
http://farm9.staticflickr.com/8349/8185649823_35e4bed71b_z.jpg
Notes
Some differences to be seen in the CPU trace: when doing local adjustments with the NR turned on it does seem to use more CPU, but it doesn’t get near using 100%. Perceptually I would have to say that it really *needs* to be using 100%, as there were several seconds of lag when doing the local adjustments with NR turned on and the image zoomed to 100%. With the NR turned off it was a touch sluggish, but more than acceptably responsive.
---------------------------------------------------------------------------
Test 8 – Exporting
Testing whether image exporting is using all the CPU. Again the big panorama image. During the course of this I noticed something “funny” going on, and so tried exporting with the long edge set to different lengths, just to see:
http://farm9.staticflickr.com/8482/8185680105_3b8d5f0a9b_z.jpg
Notes
Again, LR is showing good multithreading – exporting a single image can use multiple cores. This is significant I think, as the easiest option to speed up exports would be by not having multithreading and assigning 1 image per core. Doing it this way indicates that Adobe has probably spent a lot of time on the code for the export module.
There is some interesting behaviour though - with the long edge of the image set to anything above 1000 pixels all 4 cores get utilised for the export; with the long edge set between 750 and 1000px you get 3 cores, between 500 and 750px you get 2 cores, and below that you get 1 core. I expect this is a deliberate choice Adobe have made, but I don’t know the reasons and I don’t feel like speculating. I’ve not tested, but it is my guess that when exporting a batch of downsized images LR will export multiple images in parallel to compensate for limiting the number of cores each export can use.
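Written out as a lookup, the behaviour I saw looks something like the sketch below. This only describes my observations on this one 4-core machine, it is not how Adobe actually implements it, and I did not test exactly what happens at the 500, 750 and 1000px boundaries, so which side those fall on is an assumption:

```python
def observed_export_cores(long_edge_px: int) -> int:
    """Cores seen in use for a single export, by long-edge size (4-core system)."""
    if long_edge_px > 1000:
        return 4
    if long_edge_px > 750:   # boundary handling is assumed, not measured
        return 3
    if long_edge_px > 500:   # boundary handling is assumed, not measured
        return 2
    return 1


for edge in (400, 600, 800, 2000):
    print(edge, "px ->", observed_export_cores(edge), "core(s)")
```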
---------------------------------------------------------------------------
CONCLUSIONS
I think we can conclude that LR4 is pretty thoroughly multi-threaded, for which Adobe should be praised. All the adjustments I tried utilise multiple cores as they are being applied. Scaling is not perfect, but it never is for these types of calculation. The only part of the software that seemingly doesn’t make good use of multiple cores is zooming and panning images, and when NR has been applied this really slows down.
I confess I don’t really understand why this is. It is obvious from the CPU traces that edits (incl NR) are applied as you move the slider, not as you move around the image. You can tell this from the fact that applying NR peaks the CPU at over 90%, whilst scrolling around the image with NR applied never gets it close to this. On the other hand, if the images are being fully rendered as you move the sliders then zooming and scrolling should simply involve shifting the data from RAM to display, which should surely be instantaneous as it is in the Library module. So I am curious about what actually goes on in the Develop module.
In terms of how many cores you should buy, I would say that from these numbers there is little value in going above 4 cores to improve responsiveness. I would expect more cores to significantly speed up large batches of image exports though.
So in summary, for responsiveness in Library and Develop modules I reckon an overclocked quad core CPU is the way to go. And do noise reduction as late in your workflow as possible.