poniedziałek, 6 kwietnia 2015

Real Boost.multi_index performance




Update 09.05.2015: thanks to Joaquín it turned out that I had slight bug in my plotting script - x tick labels were wrong. All pictures have been fixed.

Several months ago I prepared and conducted a presentation about multi_index library from Boost at Nokia TechMeetUp. Essentially I showed that it is pretty useful library that should be used when there's a need to represent the same data with multiple "views". For instance, if we have to access the same list of something sorted by different attributes, Boost.MultiIndex seems to be very good choice as the documentation claims strong consistency and, what's also important, better performance. What's also interesting is that Facebook's library - folly - utilizes that in its TimeoutQueue class.
In the middle of the presentation I put some performance graphs which I mindlessly copied them from the documentation. Had I known that they are old, I'd never do that. Fortunately it was pointed out by friend of mine. These graphs in fact illustrate performance gains, but measurements were made with ancient compiler  versions like GCC 3.4.5.

After the presentation I promised myself that I'll investigate this a bit, but as usual time didn't permit. Till now.

I made measurements with newer versions of GCC and Clang. Unfortunately I have no access to Visual Studio. If you can provide me VS stats I'd be pleased, though.

Here are my results for:
CPU: i5-3320M
RAM: 8GB
OS: Debian 7
Compilation flags: -O3
Boost version: 1.57

Images from documentation are placed at the right side.




In the case of ordered index it's visible that either standard containers were improved or compilers learned how to optimize code using them better. Nevertheless, there is visible trend showing that with more than 106 entries, multi_index container could perform better than naive approach.





In case of two indexes - one indexed and the second sequenced situation changes. In the original chart from the documentation we can observe huge improvement in favor to multi_index starting from 105 elements. My results are quite different - there is no performance gain with multi_index. For 106 elements the difference is dramatic comparing the two diagrams: ~130% and ~30%.





Interestingly, on my machine multi_index performed better in case of only 1 sequenced index. This is surprise for me, because in the documentation it is opposite. Well, in this scenario multi_index outperformed naive approach.
Rest of diagrams are presented below. As you can see they illustrate that multi_index documentation is outdated. Sometimes there are some performance gains, but not that huge as presented in diagrams from documentation.









Undoubtedly this deserves deeper investigation. Especially it should be discovered why something that had to perform better stopped to do so.
Unfortunately currently I have no time to dig in into this. (Maybe in next days I'll have some time to check how cache-friendly multi_index library is). Nevertheless, everything that I showed pushes me even harder into opinion that everything should be measured - even 3rd-party things with a lot of promises in the documentation.

7 komentarzy:

  1. Hi Sławomir,

    I'd like to know more about your results, as they don't seem to be on line with what I'd expect. I've run the test on the same machine (Windows 7, i5-2520M @2.5GHz, 4GB RAM) using different g++ and Boost versions:

    A: g++ (GCC) 3.4.4 (cygming special), Boost 1.34 (May 2007)

    1 ordered index
    10^3 elmts: 110.12% ( 0.59 ms / 0.53 ms)
    10^4 elmts: 115.78% ( 6.21 ms / 5.36 ms)
    10^5 elmts: 112.63% ( 69.64 ms / 61.83 ms)
    space gain: 80.00%
    1 sequenced index
    10^3 elmts: 100.21% ( 0.46 ms / 0.46 ms)
    10^4 elmts: 96.37% ( 4.64 ms / 4.81 ms)
    10^5 elmts: 100.58% ( 46.53 ms / 46.26 ms)
    space gain: 100.00%
    2 ordered indices
    10^3 elmts: 60.17% ( 0.69 ms / 1.15 ms)
    10^4 elmts: 72.24% ( 8.33 ms / 11.53 ms)
    10^5 elmts: 75.94% (101.96 ms / 134.27 ms)
    space gain: 70.00%
    1 ordered index + 1 sequenced index
    10^3 elmts: 57.40% ( 0.59 ms / 1.03 ms)
    10^4 elmts: 62.12% ( 6.49 ms / 10.45 ms)
    10^5 elmts: 66.82% ( 75.62 ms / 113.17 ms)
    space gain: 75.00%
    3 ordered indices
    10^3 elmts: 46.93% ( 0.79 ms / 1.68 ms)
    10^4 elmts: 52.33% ( 8.91 ms / 17.02 ms)
    10^5 elmts: 60.13% (117.43 ms / 195.28 ms)
    space gain: 66.67%
    2 ordered indices + 1 sequenced index
    10^3 elmts: 43.53% ( 0.70 ms / 1.60 ms)
    10^4 elmts: 43.83% ( 7.45 ms / 16.99 ms)
    10^5 elmts: 43.25% ( 92.01 ms / 212.77 ms)
    space gain: 69.23%

    B: g++ (GCC) 3.4.4 (cygming special), Boost 1.58 (April 2015)

    1 ordered index
    10^3 elmts: 108.30% ( 0.58 ms / 0.54 ms)
    10^4 elmts: 112.31% ( 6.12 ms / 5.45 ms)
    10^5 elmts: 112.74% ( 70.07 ms / 62.15 ms)
    space gain: 80.00%
    1 sequenced index
    10^3 elmts: 101.40% ( 0.47 ms / 0.46 ms)
    10^4 elmts: 101.07% ( 4.65 ms / 4.60 ms)
    10^5 elmts: 102.15% ( 46.94 ms / 45.95 ms)
    space gain: 100.00%
    2 ordered indices
    10^3 elmts: 63.36% ( 0.70 ms / 1.10 ms)
    10^4 elmts: 68.48% ( 8.13 ms / 11.87 ms)
    10^5 elmts: 76.46% (104.27 ms / 136.38 ms)
    space gain: 70.00%
    1 ordered index + 1 sequenced index
    10^3 elmts: 53.41% ( 0.60 ms / 1.12 ms)
    10^4 elmts: 61.96% ( 6.56 ms / 10.59 ms)
    10^5 elmts: 68.55% ( 76.93 ms / 112.22 ms)
    space gain: 75.00%
    3 ordered indices
    10^3 elmts: 46.71% ( 0.79 ms / 1.68 ms)
    10^4 elmts: 53.25% ( 9.10 ms / 17.08 ms)
    10^5 elmts: 60.52% (119.69 ms / 197.76 ms)
    space gain: 66.67%
    2 ordered indices + 1 sequenced index
    10^3 elmts: 44.04% ( 0.70 ms / 1.58 ms)
    10^4 elmts: 43.08% ( 7.36 ms / 17.09 ms)
    10^5 elmts: 46.10% ( 91.17 ms / 197.78 ms)
    space gain: 69.23%

    C: g++ (GCC) 4.8.1, Boost 1.58 (April 2015)

    1 ordered index
    10^3 elmts: 100.81% ( 0.17 ms / 0.17 ms)
    10^4 elmts: 98.71% ( 1.96 ms / 1.99 ms)
    10^5 elmts: 88.64% ( 22.18 ms / 25.02 ms)
    space gain: 80.00%
    1 sequenced index
    10^3 elmts: 90.37% ( 0.08 ms / 0.09 ms)
    10^4 elmts: 90.31% ( 0.78 ms / 0.87 ms)
    10^5 elmts: 90.84% ( 8.34 ms / 9.18 ms)
    space gain: 100.00%
    2 ordered indices
    10^3 elmts: 71.05% ( 0.26 ms / 0.37 ms)
    10^4 elmts: 75.69% ( 3.01 ms / 3.98 ms)
    10^5 elmts: 68.74% ( 35.52 ms / 51.67 ms)
    space gain: 70.00%
    1 ordered index + 1 sequenced index
    10^3 elmts: 55.04% ( 0.16 ms / 0.29 ms)
    10^4 elmts: 62.44% ( 1.93 ms / 3.09 ms)
    10^5 elmts: 69.11% ( 24.11 ms / 34.89 ms)
    space gain: 75.00%
    3 ordered indices
    10^3 elmts: 64.47% ( 0.35 ms / 0.54 ms)
    10^4 elmts: 68.54% ( 4.04 ms / 5.89 ms)
    10^5 elmts: 69.46% ( 48.26 ms / 69.49 ms)
    space gain: 66.67%
    2 ordered indices + 1 sequenced index
    10^3 elmts: 56.58% ( 0.27 ms / 0.48 ms)
    10^4 elmts: 59.80% ( 3.14 ms / 5.25 ms)
    10^5 elmts: 60.27% ( 38.38 ms / 63.69 ms)
    space gain: 69.23%

    (continued in next post)

    OdpowiedzUsuń
  2. (continues from previous post)

    So: A and B were run with a fairly old compiler (3.4.4, 2005) using Boost 1.34 (2007) and Boost 1.58 (current, 2015), respectively. Results are basically the same, which shows that there's been no degradation in Boost.MultiIndex code per se. C, on the other hand, uses GCC 4.8.1 (2013) with Boost 1.58: *relative* performance of Boost.MultiIndex is somewhat poorer (but nowhere as poor as you got), but please note that overall absolute timings, when compared with B, are 2.5-5x times faster! So, my interpretation is that compilers have gotten much better (improved overall performance) which also results, incidentally, in a reduction in Boost.MultiIndex relative advantage.

    Please contact me back if you'd like to further analyze your results.

    Best regards,

    OdpowiedzUsuń
    Odpowiedzi
    1. Hi, i think it was good idea to compare Boost 1.34 and 1.58 with this old gcc version. Indeed it shows that there were no significant changes in code base (in terms of performance).
      I also agree that compilers have gotten much better.
      And regarding C - this is also a surprise for me. I doubt OS influences results in any way. What about clang? Can you post your results with recent stable version of it?

      Usuń
  3. It'll take me some days to grab a copy of Clang and do that test. In the meantime, I understand you've changed the test code from the original in Boost.MultiIndex docs as n ranges between 10^4 and 10^6. Can you upload the source code you used? Thank you.

    OdpowiedzUsuń
    Odpowiedzi
    1. No, I haven't changed anything in the code. I checked plotting script and found a bug (instead of 10e2, 10e3, 10e4 I had 10e3, 10e4, 10e5...). So everything but the x ticks is okay. I'll try to update pictures in the post still today.

      Usuń
  4. Can you post the program output? A look at absolute timings might shed some light, I think. Thank you

    OdpowiedzUsuń
    Odpowiedzi
    1. I found some results on the disk, but they are not that used in this post.
      http://pastebin.com/apDnN91W

      Unfortunately I couldn't find output logs for the results from pictures.
      And here is my plotting script with values used in the pictures, for the record: http://pastebin.com/XCfTfu2V

      Usuń