Profiling C++


Testing out some C++ profiling techniques on Linux.


I’ve long intended to learn code profiling techniques, but I always started at the deep end with perf flamegraphs.

I would follow Brendan Gregg’s instructions, generate my flamegraph, and then not know what to do from that point. It’s not easy to understand a flamegraph at first glance, especially for complex code with lots of dependencies.

I eventually realized that I was trying to do too much. I didn’t understand basic profiling tools and I was reaching straight for the top-shelf techniques.

I took a step back and tried to follow some basic profiling guides:


The code I’ll be playing around with is my pitch-detection code in C++.

This code has evolved over the years from a very C-like C++ to a real attempt to use modern C++ idioms:

for (auto it = acf.begin() + size2/2; it != acf.end(); ++it)

Also, std::pair, which is not an n-tuple but a tuple of exactly size 2:

return (!den) ? std::make_pair(x, array[x]) : std::make_pair(x + delta / (2 * den), array[x] - delta*delta/(8*den));
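That one-liner is doing parabolic peak interpolation: fit a parabola through the three samples around a peak index and return a refined (position, amplitude) pair. A standalone sketch, assuming delta and den are the usual first and second differences of the neighboring samples (the actual definitions in the library may differ):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical sketch of parabolic peak interpolation. delta and den are
// assumed to be the first and second differences around the peak at x;
// if den is zero the three points are collinear and x is returned as-is.
std::pair<double, double>
parabolic_interp(const std::vector<double> &array, int x)
{
        double delta = array[x - 1] - array[x + 1];
        double den = array[x - 1] - 2 * array[x] + array[x + 1];
        return (!den) ? std::make_pair((double)x, array[x])
                      : std::make_pair(x + delta / (2 * den),
                                       array[x] - delta * delta / (8 * den));
}
```

For a symmetric peak like {1, 3, 1} the refined position is exactly the middle index; for an asymmetric one like {1, 3, 2} the position shifts toward the larger neighbor and the amplitude estimate rises slightly above the sampled maximum.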

In fact, the code change I was interested in profiling was to use tie to unpack a tuple instead of std::get:

+       double period_tmp, amplitude_tmp;

        for (auto i : estimates) {
-               if (std::get<1>(i) >= actual_cutoff) {
-                       period = std::get<0>(i);
+               std::tie(period_tmp, amplitude_tmp) = i;
+               if (amplitude_tmp >= actual_cutoff) {
+                       period = period_tmp;
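If the compiler supports C++17, structured bindings offer a third option that avoids both the std::get noise and the pre-declared temporaries that std::tie needs. A standalone sketch (select_period and its arguments are made up for illustration, not the library’s API):

```cpp
#include <utility>
#include <vector>

// Hypothetical helper: return the period of the last estimate whose
// amplitude clears the cutoff, or -1.0 if none does. The structured
// binding declares period_tmp/amplitude_tmp in place, so no separate
// declarations (as std::tie requires) are needed.
double select_period(const std::vector<std::pair<double, double>> &estimates,
                     double actual_cutoff)
{
        double period = -1.0;
        for (auto [period_tmp, amplitude_tmp] : estimates) {
                if (amplitude_tmp >= actual_cutoff)
                        period = period_tmp;
        }
        return period;
}
```

With std::tie, period_tmp and amplitude_tmp would have to be declared before the loop, as in the diff above.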

Using perf on Fedora

Install a package and invoke a sysctl:

$ sudo dnf install perf
$ sudo sysctl -w kernel.perf_event_paranoid=-1

Without the sysctl, perf from userspace is not allowed to collect the information it needs to do profiling.

perf record

From the pitch-detection repository, I run make sinewave to build the included sample program, which generates a sinewave with the specified frequency and size and runs a pitch detection algorithm on it:

$ perf record ./bin/sinewave --freq 1337 --size 64000
Freq: 1337      pitch: 1337.02
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.024 MB (133 samples) ]

Now we run perf report to see the costly functions:

Samples: 133  of event 'cycles:ppp', Event count (approx.): 86841198
Overhead  Command   Shared Object             Symbol
  10.50%  sinewave              [.] __sincos
   7.70%  sinewave         [.] 0x0000000000064013
   4.62%  sinewave         [.] 0x000000000006403b
   3.16%  sinewave         [.] fftw_md5putc
   2.62%  sinewave         [.] 0x0000000000026acb
   2.39%  sinewave              [.] malloc_consolidate.part.0
   2.28%  sinewave  [.] __muldc3

As I suspected, such a trivial change as get -> tie probably has no effect and is a worthless optimization at best. Since the tie version also looks uglier, I’ll revert it.


Adding -ffast-math to CXX_FLAGS in the Makefile made a difference:

Samples: 130  of event 'cycles:ppp', Event count (approx.): 89080953
Overhead  Command   Shared Object             Symbol
  13.29%  sinewave         [.] 0x0000000000064013
   7.17%  sinewave              [.] __sincos
   5.25%  sinewave         [.] 0x00000000000642d1
   3.90%  sinewave              [.] _int_malloc
   3.02%  sinewave         [.] 0x000000000006403b
   2.06%  sinewave  [.] __muldc3

It appears that __sincos from the math stdlib got faster. The overall execution time of the program didn’t change much:

27914573 nanoseconds, 1000 repetitions (without -ffast-math)
27449447 nanoseconds, 1000 repetitions (-ffast-math)

I ran the timing comparison several times and the results were usually quite close. Insignificant, probably.

This is the template code I used to time the function, inspired by this Stack Overflow answer:

template <typename TimeT = std::chrono::nanoseconds> struct measure {
        template <typename F, typename... Args>
        static typename TimeT::rep
        execution(int reps, F &&func, Args &&... args)
        {
                double tot = 0.0;

                for (int i = 0; i < reps; ++i) {
                        auto start = std::chrono::steady_clock::now();

                        func(args...); // run the function under test

                        auto duration = std::chrono::duration_cast<TimeT>(
                            std::chrono::steady_clock::now() - start);

                        tot += duration.count();
                }

                return tot / reps;
        }
};

After adding -flto to CXX_FLAGS, here’s the perf report output:

Samples: 111K of event 'cycles:ppp', Event count (approx.): 75366060043
Overhead  Command   Shared Object             Symbol
  10.22%  sinewave         [.] 0x0000000000064013
   9.90%  sinewave              [.] __sincos
   2.94%  sinewave         [.] 0x000000000006403b
   2.92%  sinewave        [.] get_pitch_mpm
   2.90%  sinewave         [.] 0x00000000000643a1
   2.85%  sinewave         [.] 0x00000000000642d1
   2.69%  sinewave         [.] fftw_md5putc
   2.06%  sinewave              [.] _int_malloc

Inconclusive? Do I need more samples here to truly understand this?

29551224 nanoseconds, 1000 repetitions (without -flto)
27406473 nanoseconds, 1000 repetitions (-flto)

In this case the timing was consistently better - an unscientific 7% gain.

perf conclusions

perf record and perf top probably gave me decent insight into what -ffast-math brings to the table.


Callgrind

Callgrind is a profiling tool that ships with Valgrind.

I also installed KCacheGrind, which is available on Fedora 27 (though it pulls in some KDE dependencies alongside).

The callgrind command generates a callgrind output file:

$ valgrind --tool=callgrind ./bin/sinewave --freq 1337 --size 64000
$ ls callgrind*

We feed this output file to KCacheGrind to see what it shows.

From KCacheGrind’s call graph, I learned that the sincos call comes from FFTW3, the dependency I use for performing cross-correlation in my library.

In fact, most of the top calls come from FFTW3. It’s probably a good thing that my library’s runtime is dominated by the computational work required for pitch detection, and not by weird pathologies I introduced via bad code.

There’s probably more to glean from the tool and callgrind’s output, but I won’t delve too deep on this first dive into C++ performance analysis.