Tags: gprof, profiler, profiling

Alternatives to gprof [closed]

172

What other programs do the same thing as gprof?

6

  • 2

    What platforms are you interested in?

    – osgx

    Mar 24, 2010 at 22:10

  • 2

    I’m interested in Linux.

    Apr 16, 2010 at 19:37

  • 2

    possible duplicate of stackoverflow.com/questions/375913/…

    Apr 21, 2010 at 0:59

  • 13

    @Gregory – I’m inclined to agree, and maybe he should contribute answers of his own; 229 vs 6, all 6 of those answers being to his own questions…

    Sep 14, 2010 at 1:18

  • 5

    How can this question not be constructive?

    May 5, 2013 at 23:04


78

Valgrind has an instruction-count profiler (the Callgrind tool) with a very nice visualizer called KCacheGrind. As Mike Dunlavey recommends, Valgrind counts the fraction of instructions for which a procedure is live on the stack, although I’m sorry to say it appears to become confused in the presence of mutual recursion. But the visualizer is very nice and light years ahead of gprof.
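
A minimal way to try this, assuming valgrind and kcachegrind are installed (./my_program stands in for whatever binary you want to profile):

# run under callgrind; execution is much slower, but every instruction is counted
valgrind --tool=callgrind ./my_program

# callgrind writes a callgrind.out.<pid> file; open it in the visualizer
kcachegrind callgrind.out.*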

3

  • 2

    @Norman: ++ That confusion about recursion seems endemic to systems that have the concept of propagating times among nodes in a graph. Also I think wall-clock time is generally more useful than CPU instruction times, and code lines (call instructions) are more useful than procedures. If stack samples at random wall clock times are taken, then the fractional cost of a line (or procedure, or any other description you can make) is simply estimated by the fraction of samples that exhibit it.

    Dec 20, 2009 at 13:50

  • 1

    … I’m emphasizing call instructions, but it applies to any instructions. If one has an honest-to-goodness hotspot bottleneck, such as a bubble sort of a large array of numbers, then the compare/jump/swap/increment instructions of the inner loop will be at the top/bottom of nearly every stack sample. But (especially as software gets big and hardly any routine has much “self” time) many problems actually are call instructions, requesting work that, when it is clear how much it costs, doesn’t really have to be done.

    Dec 20, 2009 at 14:07

  • 3

    … Check this out. I think they are nearly on the right track: rotateright.com/zoom.html

    Dec 20, 2009 at 14:18

66

Since I didn’t see anything here about perf, which is a relatively new tool for profiling the kernel and user applications on Linux, I decided to add this information.

First of all, here is a tutorial about Linux profiling with perf.

You can use perf if your Linux kernel is newer than 2.6.32, or oprofile if it is older. Neither program requires you to instrument your program (as gprof does). However, to get correct call graphs in perf you need to build your program with -fno-omit-frame-pointer. For example: g++ -fno-omit-frame-pointer -O2 main.cpp.

You can see “live” analysis of your application with perf top:

sudo perf top -p `pidof a.out` -K
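
(The -K flag hides kernel symbols, so only user-space code is shown.)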

Or you can record performance data of a running application and analyze them after that:

1) To record performance data:

perf record -p `pidof a.out`

or to record for 10 secs:

perf record -p `pidof a.out` sleep 10

or to record with call graphs:

perf record -g -p `pidof a.out` 
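
(The -g form relies on frame pointers, which is why -fno-omit-frame-pointer matters above; newer versions of perf can also collect call graphs without frame pointers via perf record --call-graph dwarf, at the cost of larger data files.)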

2) To analyze the recorded data

perf report --stdio
perf report --stdio --sort=dso -g none
perf report --stdio -g none
perf report --stdio -g
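
(These variants differ only in presentation: the first is the default per-symbol report, --sort=dso aggregates by shared object, -g none omits call chains, and -g prints them.)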

Or you can record the performance data of an application and analyze it afterwards, simply by launching the application this way and waiting for it to exit:

perf record ./a.out

Here is an example of profiling a test program.

The test program is in the file main.cpp (I will put main.cpp at the bottom of this message).

I compile it in this way:

g++ -m64 -fno-omit-frame-pointer -g main.cpp -L.  -ltcmalloc_minimal -o my_test

I use libtcmalloc_minimal.so since it is compiled with -fno-omit-frame-pointer, while libc’s malloc seems to be compiled without this option.
Then I run my test program:

./my_test 100000000 

Then I record performance data of a running process:

perf record -g  -p `pidof my_test` -o ./my_test.perf.data sleep 30

Then I analyze load per module:

perf report --stdio -g none --sort comm,dso -i ./my_test.perf.data

# Overhead  Command                 Shared Object
# ........  .......  ............................
#
    70.06%  my_test  my_test
    28.33%  my_test  libtcmalloc_minimal.so.0.1.0
     1.61%  my_test  [kernel.kallsyms]

Then I analyze load per function:

perf report --stdio -g none -i ./my_test.perf.data | c++filt

# Overhead  Command                 Shared Object                       Symbol
# ........  .......  ............................  ...........................
#
    29.30%  my_test  my_test                       [.] f2(long)
    29.14%  my_test  my_test                       [.] f1(long)
    15.17%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator new(unsigned long)
    13.16%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator delete(void*)
     9.44%  my_test  my_test                       [.] process_request(long)
     1.01%  my_test  my_test                       [.] operator delete(void*)@plt
     0.97%  my_test  my_test                       [.] operator new(unsigned long)@plt
     0.20%  my_test  my_test                       [.] main
     0.19%  my_test  [kernel.kallsyms]             [k] apic_timer_interrupt
     0.16%  my_test  [kernel.kallsyms]             [k] _spin_lock
     0.13%  my_test  [kernel.kallsyms]             [k] native_write_msr_safe

     and so on ...

Then I analyze call chains:

perf report --stdio -g graph -i ./my_test.perf.data | c++filt

# Overhead  Command                 Shared Object                       Symbol
# ........  .......  ............................  ...........................
#
    29.30%  my_test  my_test                       [.] f2(long)
            |
            --- f2(long)
               |
                --29.01%-- process_request(long)
                          main
                          __libc_start_main

    29.14%  my_test  my_test                       [.] f1(long)
            |
            --- f1(long)
               |
               |--15.05%-- process_request(long)
               |          main
               |          __libc_start_main
               |
                --13.79%-- f2(long)
                          process_request(long)
                          main
                          __libc_start_main

    15.17%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator new(unsigned long)
            |
            --- operator new(unsigned long)
               |
               |--11.44%-- f1(long)
               |          |
               |          |--5.75%-- process_request(long)
               |          |          main
               |          |          __libc_start_main
               |          |
               |           --5.69%-- f2(long)
               |                     process_request(long)
               |                     main
               |                     __libc_start_main
               |
                --3.01%-- process_request(long)
                          main
                          __libc_start_main

    13.16%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator delete(void*)
            |
            --- operator delete(void*)
               |
               |--9.13%-- f1(long)
               |          |
               |          |--4.63%-- f2(long)
               |          |          process_request(long)
               |          |          main
               |          |          __libc_start_main
               |          |
               |           --4.51%-- process_request(long)
               |                     main
               |                     __libc_start_main
               |
               |--3.05%-- process_request(long)
               |          main
               |          __libc_start_main
               |
                --0.80%-- f2(long)
                          process_request(long)
                          main
                          __libc_start_main

     9.44%  my_test  my_test                       [.] process_request(long)
            |
            --- process_request(long)
               |
                --9.39%-- main
                          __libc_start_main

     1.01%  my_test  my_test                       [.] operator delete(void*)@plt
            |
            --- operator delete(void*)@plt

     0.97%  my_test  my_test                       [.] operator new(unsigned long)@plt
            |
            --- operator new(unsigned long)@plt

     0.20%  my_test  my_test                       [.] main
     0.19%  my_test  [kernel.kallsyms]             [k] apic_timer_interrupt
     0.16%  my_test  [kernel.kallsyms]             [k] _spin_lock
     and so on ...

So at this point you know where your program spends time.
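
If you want to drill down from functions to individual source lines and instructions, perf annotate can do that too; a minimal sketch, assuming the binary was built with -g as above:

perf annotate -i ./my_test.perf.data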

And this is main.cpp for the test:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

time_t f1(time_t time_value)
{
  for (int j =0; j < 10; ++j) {
    ++time_value;
    if (j%5 == 0) {
      double *p = new double;
      delete p;
    }
  }
  return time_value;
}

time_t f2(time_t time_value)
{
  for (int j =0; j < 40; ++j) {
    ++time_value;
  }
  time_value=f1(time_value);
  return time_value;
}

time_t process_request(time_t time_value)
{

  for (int j =0; j < 10; ++j) {
    int *p = new int;
    delete p;
    for (int m =0; m < 10; ++m) {
      ++time_value;
    }
  }
  for (int i =0; i < 10; ++i) {
    time_value=f1(time_value);
    time_value=f2(time_value);
  }
  return time_value;
}

int main(int argc, char* argv2[])
{
  int number_loops = argc > 1 ? atoi(argv2[1]) : 1;
  time_t time_value = time(0);
  printf("number loops %d\n", number_loops);
  printf("time_value: %ld\n", time_value );

  for (int i =0; i < number_loops; ++i) {
    time_value = process_request(time_value);
  }
  printf("time_value: %ld\n", time_value );
  return 0;
}

8

  • I just ran your example & took 5 stackshots. Here’s what they found: 40% (roughly) of the time f1 was calling delete. 40% (roughly) of the time process_request was calling delete. A good part of the remainder was spent in new. The measurements are rough, but the hotspots are pinpointed.

    Jan 16, 2013 at 1:55

    What is a stackshot? Is it what pstack outputs?

    – user184968

    Jan 16, 2013 at 14:36

  • 2

    As in my answer, you run it under a debugger and hit ^C at a random time and capture the stack trace (a minimal sketch of this is given after these comments). 1) I think that your technique is not useful when you need to analyze performance problems for a program running on your customer’s server. 2) I am not sure how you apply this technique to get information for a program having lots of threads that handle different requests. I mean when the general picture is quite complicated.

    – user184968

    Apr 30, 2013 at 10:46

  • 2

    As for #1. Sometimes customers call and say that your program works slowly. You can’t say immediately that the problem is outside your code, can you? You might need some information in order to support your point. In this situation you might at some point need to profile your application. You can’t just ask your customer to start gdb, press ^C and get call stacks. This was my point. This is an example spielwiese.fontein.de/2012/01/22/…. I had this problem and profiling helped a lot.

    – user184968

    Apr 30, 2013 at 14:31

  • 2

    As for #2. Simplifying is a good approach, I agree. Sometimes it works. If a performance problem occurs only on a customer’s server and you can’t reproduce it on your own server, then profiles are of use.

    – user184968

    Apr 30, 2013 at 15:59
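

The "stackshot" technique discussed in these comments can be approximated with gdb in batch mode. A minimal sketch, reusing the my_test binary from this answer (the sample count and sleep interval are arbitrary):

# take 5 stack samples of the running process at roughly random times
for i in 1 2 3 4 5; do
    sleep 1
    gdb -batch -ex bt -p `pidof my_test`
done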