
Why is reading lines from stdin much slower in C++ than Python?

2098

I wanted to compare reading lines of string input from stdin using Python and C++ and was shocked to see my C++ code run an order of magnitude slower than the equivalent Python code. Since my C++ is rusty and I’m not yet an expert Pythonista, please tell me if I’m doing something wrong or if I’m misunderstanding something.


(TLDR answer: include the statement cin.sync_with_stdio(false), or just use fgets instead.)

TLDR results: scroll all the way down to the bottom of my question and look at the table.)


C++ code:

#include <iostream>
#include <string>   // std::string and std::getline
#include <time.h>

using namespace std;

int main() {
    string input_line;
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    while (cin) {
        getline(cin, input_line);
        if (!cin.eof())
            line_count++;
    };

    sec = (int) time(NULL) - start;
    cerr << "Read " << line_count << " lines in " << sec << " seconds.";
    if (sec > 0) {
        lps = line_count / sec;
        cerr << " LPS: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

// Compiled with:
// g++ -O3 -o readline_test_cpp foo.cpp

Python Equivalent:

#!/usr/bin/env python
import time
import sys

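count = 0
start_time = time.time()

# Iterating over sys.stdin yields one line per iteration.
for line in sys.stdin:
    count += 1

delta_sec = int(time.time() - start_time)
if delta_sec > 0:
    lines_per_sec = int(round(count / delta_sec))
    print("Read {0} lines in {1} seconds. LPS: {2}".format(count, delta_sec,
          lines_per_sec))
else:
    # Under one second elapsed: skip LPS to avoid dividing by zero.
    print("Read {0} lines in {1} seconds.".format(count, delta_sec))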

Here are my results:

$ cat test_lines | ./readline_test_cpp
Read 5570000 lines in 9 seconds. LPS: 618889

$ cat test_lines | ./readline_test.py
Read 5570000 lines in 1 seconds. LPS: 5570000

I should note that I tried this both under Mac OS X v10.6.8 (Snow Leopard) and Linux 2.6.32 (Red Hat Linux 6.2). The former is a MacBook Pro, and the latter is a very beefy server, not that this is too pertinent.

$ for i in {1..5}; do echo "Test run $i at `date`"; echo -n "CPP:"; cat test_lines | ./readline_test_cpp ; echo -n "Python:"; cat test_lines | ./readline_test.py ; done
Test run 1 at Mon Feb 20 21:29:28 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 2 at Mon Feb 20 21:29:39 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 3 at Mon Feb 20 21:29:50 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 4 at Mon Feb 20 21:30:01 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 5 at Mon Feb 20 21:30:11 EST 2012
CPP:   Read 5570001 lines in 10 seconds. LPS: 557000
Python:Read 5570000 lines in  1 seconds. LPS: 5570000
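Because time(NULL) only resolves whole seconds, the LPS figures above are coarse: a run that takes 1.4 seconds is reported as 1. A minimal sketch of sub-second timing with std::chrono follows. The helper names lines_per_second and time_it are made up for illustration and are not part of the programs that produced these results.

```cpp
#include <chrono>

// Hypothetical helper: throughput in lines per second for a measured duration.
// Returns 0 when the elapsed time is not positive, to avoid dividing by zero.
double lines_per_second(long line_count, std::chrono::duration<double> elapsed) {
    const double seconds = elapsed.count();
    return seconds > 0.0 ? static_cast<double>(line_count) / seconds : 0.0;
}

// Hypothetical helper: wall-clock time taken by any callable, measured with
// steady_clock so that system clock adjustments do not distort the result.
template <typename Work>
std::chrono::duration<double> time_it(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    return std::chrono::steady_clock::now() - start;
}
```

In the C++ test program, the read loop could be passed to time_it as a lambda and the result handed to lines_per_second together with line_count.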

Tiny benchmark addendum and recap

For completeness, I thought I’d update the read speed for the same file on the same box with the original (synced) C++ code. Again, this is for a 100M line file on a fast disk. Here’s the comparison, with several solutions/approaches:

Implementation              Lines per second
python (default)                   3,571,428
cin (default/naive)                  819,672
cin (no sync)                     12,500,000
fgets                             14,285,714
wc (not fair comparison)          54,644,808
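The code behind the fgets row is not shown here. As a rough sketch of what an fgets-based line counter might look like (the function name and the 4096-byte buffer are assumptions, not the code that was actually benchmarked):

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Counts lines on a C stream using fgets.
// A line longer than the buffer arrives in several pieces; only the piece
// that ends in '\n' is counted, so long lines are not counted twice.
// A final line with no trailing newline is counted as well.
// Assumes the input contains no embedded NUL bytes (strlen stops at them).
long count_lines_fgets(std::FILE* in) {
    char buf[4096];
    long count = 0;
    bool partial = false;  // true while inside a line not yet terminated
    while (std::fgets(buf, sizeof buf, in) != nullptr) {
        const std::size_t len = std::strlen(buf);
        if (len > 0 && buf[len - 1] == '\n') {
            ++count;
            partial = false;
        } else {
            partial = true;
        }
    }
    return partial ? count + 1 : count;
}
```

Calling count_lines_fgets(stdin) from main would give a program comparable to the cin version.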

6

  • 24

    Did you run your tests multiple times? Perhaps there is a disk cache issue.

    Feb 21, 2012 at 2:20

  • 3

    @VaughnCato Yes, and on two different machines as well.

    – JJC

    Feb 21, 2012 at 2:22

  • 20

    The problem is synchronization with stdio — see my answer.

    Feb 21, 2012 at 3:30

  • 26

    Since nobody seems to have mentioned why you get an extra line with C++: Do not test against cin.eof()!! Put the getline call into the ‘if’ statement.

    – Xeo

    Feb 21, 2012 at 18:29
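The idiom this comment points at can be sketched as follows: the result of getline itself is the loop condition, so the counter only advances after a successful read. The names count_lines and count_lines_in are illustrative.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Counts lines by using the result of getline as the loop condition.
// The stream converts to false only when an extraction fails, so every
// successfully read line is counted exactly once, including a final line
// that has no trailing newline.
long count_lines(std::istream& in) {
    std::string line;
    long count = 0;
    while (std::getline(in, line)) {
        ++count;
    }
    return count;
}

// Convenience wrapper for trying the loop on an in-memory string.
long count_lines_in(const std::string& text) {
    std::istringstream in(text);
    return count_lines(in);
}
```

For comparison, a loop that increments unconditionally after each getline call counts one extra line at end of input, and a loop that increments only when eof() is false skips a last line that lacks a trailing newline.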

  • 27

    wc -l is fast because it reads the stream more than one line at a time (it might be fread(stdin)/memchr('\n') combination). Python results are in the same order of magnitude e.g., wc-l.py

    – jfs

    Feb 27, 2012 at 0:21
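A sketch of the fread/memchr combination the comment guesses at is shown below. It is not taken from the wc source. The function name and the 64 KiB block size are arbitrary choices for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Counts '\n' bytes on a C stream by reading large blocks with fread and
// scanning each block with memchr, instead of extracting one line at a time.
// Like wc -l, this counts newline characters, so a final line without a
// trailing newline is not included.
long count_newlines(std::FILE* in) {
    std::vector<char> buf(1 << 16);  // 64 KiB block
    long count = 0;
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), in)) > 0) {
        const char* p = buf.data();
        std::size_t left = n;
        while (const void* hit = std::memchr(p, '\n', left)) {
            ++count;
            // Resume the scan just past the newline that was found.
            const char* next = static_cast<const char*>(hit) + 1;
            left -= static_cast<std::size_t>(next - p);
            p = next;
        }
    }
    return count;
}
```

No per-line string is built here, which is part of why this kind of loop is not a fair comparison with getline-based readers.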

1873

tl;dr: Because of different default settings in C++ requiring more system calls.

By default, cin is synchronized with stdio, which causes it to avoid any input buffering. If you add this to the top of your main, you should see much better performance:

std::ios_base::sync_with_stdio(false);

Normally, when an input stream is buffered, instead of reading one character at a time, the stream will be read in larger chunks. This reduces the number of system calls, which are typically relatively expensive. However, since the FILE* based stdio and iostreams often have separate implementations and therefore separate buffers, this could lead to a problem if both were used together. For example:

int myvalue1;
cin >> myvalue1;
int myvalue2;
scanf("%d",&myvalue2);

If more input was read by cin than it actually needed, then the second integer value wouldn’t be available for the scanf function, which has its own independent buffer. This would lead to unexpected results.

To avoid this, by default, streams are synchronized with stdio. One common way to achieve this is to have cin read each character one at a time as needed using stdio functions. Unfortunately, this introduces a lot of overhead. For small amounts of input, this isn’t a big problem, but when you are reading millions of lines, the performance penalty is significant.
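As a rough illustration of why chunk size matters, the sketch below counts how many read(2) calls it takes to drain a file descriptor with a given buffer size. It is not a model of how any particular standard library implements synchronized cin. It assumes a POSIX system, and count_read_calls is a name made up for this example.

```cpp
#include <cstddef>
#include <unistd.h>  // POSIX read(2); not available on non-POSIX systems
#include <vector>

// Drains a file descriptor with a buffer of the given size and returns how
// many read(2) calls that took, including the final call that reports EOF.
// Returns -1 if a read fails.
long count_read_calls(int fd, std::size_t buffer_size) {
    std::vector<char> buf(buffer_size);
    long calls = 0;
    for (;;) {
        const ssize_t n = read(fd, buf.data(), buf.size());
        ++calls;
        if (n < 0) return -1;
        if (n == 0) return calls;
    }
}
```

With a one-byte buffer the call count grows with the number of bytes in the input. With a buffer of a few kilobytes it grows with the number of chunks instead.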

Fortunately, the library designers decided that you should also be able to disable this feature to get improved performance if you knew what you were doing, so they provided the sync_with_stdio method. From this link (emphasis added):

If the synchronization is turned off, the C++ standard streams are allowed to buffer their I/O independently, which may be considerably faster in some cases.

3

  • 175

    This should be at the top. It is almost certainly correct. The answer cannot lie in replacing the read with an fscanf call, because that quite simply doesn’t do as much work as Python does. Python must allocate memory for the string, possibly multiple times as the existing allocation is deemed inadequate – exactly like the C++ approach with std::string. This task is almost certainly I/O bound and there is way too much FUD going around about the cost of creating std::string objects in C++ or using <iostream> in and of itself.

    Feb 21, 2012 at 3:34

  • 65

    Yes, adding this line immediately above my original while loop sped the code up to surpass even python. I’m about to post the results as the final edit. Thanks again!

    – JJC

    Feb 21, 2012 at 3:45

  • 66

    Note that sync_with_stdio() is a static member function, and a call to this function on any stream object (e.g. cin) toggles on or off synchronization for all standard iostream objects.

    Jan 21, 2015 at 1:16

207

Just out of curiosity I’ve taken a look at what happens under the hood, and I’ve used dtruss/strace on each test.

C++

./a.out < in
Saw 6512403 lines in 8 seconds.  Crunch speed: 814050

syscalls (sudo dtruss -c ./a.out < in):

CALL                                        COUNT
__mac_syscall                                   1
<snip>
open                                            6
pread                                           8
mprotect                                       17
mmap                                           22
stat64                                         30
read_nocancel                               25958

Python

./a.py < in
Read 6512402 lines in 1 seconds. LPS: 6512402

syscalls (sudo dtruss -c ./a.py < in):

CALL                                        COUNT
__mac_syscall                                   1
<snip>
open                                            5
pread                                           8
mprotect                                       17
mmap                                           21
stat64                                         29

0

    202

    I’m a few years behind here, but:

    In ‘Edit 4/5/6’ of the original post, you are using the construction:

    $ /usr/bin/time cat big_file | program_to_benchmark
    

    This is wrong in a couple of different ways:

    1. You’re actually timing the execution of cat, not your benchmark. The ‘user’ and ‘sys’ CPU usage displayed by time are those of cat, not your benchmarked program. Even worse, the ‘real’ time is also not necessarily accurate. Depending on the implementation of cat and of pipelines in your local OS, it is possible that cat writes a final giant buffer and exits long before the reader process finishes its work.

    2. Use of cat is unnecessary and in fact counterproductive; you’re adding moving parts. If you were on a sufficiently old system (i.e. with a single CPU and, in certain generations of computers, I/O faster than CPU), the mere fact that cat was running could substantially color the results. You are also subject to whatever input and output buffering and other processing cat may do. (This would likely earn you a ‘Useless Use Of Cat’ award if I were Randal Schwartz.)

    A better construction would be:

    $ /usr/bin/time program_to_benchmark < big_file
    

    In this statement it is the shell which opens big_file, passing it to your program (well, actually to time which then executes your program as a subprocess) as an already-open file descriptor. 100% of the file reading is strictly the responsibility of the program you’re trying to benchmark. This gets you a real reading of its performance without spurious complications.

    I will mention two possible, but actually wrong, ‘fixes’ which could also be considered (but I ‘number’ them differently as these are not things which were wrong in the original post):

    A. You could ‘fix’ this by timing only your program:

    $ cat big_file | /usr/bin/time program_to_benchmark
    

    B. or by timing the entire pipeline:

    $ /usr/bin/time sh -c 'cat big_file | program_to_benchmark'
    

    These are wrong for the same reasons as #2: they’re still using cat unnecessarily. I mention them for a few reasons:

    • they’re more ‘natural’ for people who aren’t entirely comfortable with the I/O redirection facilities of the POSIX shell

    • there may be cases where cat is needed (e.g.: the file to be read requires some sort of privilege to access, and you do not want to grant that privilege to the program to be benchmarked: sudo cat /dev/sda | /usr/bin/time my_compression_test --no-output)

    • in practice, on modern machines, the added cat in the pipeline is probably of no real consequence.

    But I say that last thing with some hesitation. If we examine the last result in ‘Edit 5’ —

    $ /usr/bin/time cat temp_big_file | wc -l
    0.01user 1.34system 0:01.83elapsed 74%CPU ...
    

    — this claims that cat consumed 74% of the CPU during the test; and indeed 1.34/1.83 is approximately 74%. Perhaps a run of:

    $ /usr/bin/time wc -l < temp_big_file
    

    would have taken only the remaining .49 seconds! Probably not: cat here had to pay for the read() system calls (or equivalent) which transferred the file from ‘disk’ (actually buffer cache), as well as the pipe writes to deliver them to wc. The correct test would still have had to do those read() calls; only the write-to-pipe and read-from-pipe calls would have been saved, and those should be pretty cheap.

    Still, I predict you would be able to measure the difference between cat file | wc -l and wc -l < file and find a noticeable (2-digit percentage) difference. Each of the slower tests will have paid a similar penalty in absolute time, which would, however, amount to a smaller fraction of its larger total time.

    In fact I did some quick tests with a 1.5 gigabyte file of garbage, on a Linux 3.13 (Ubuntu 14.04) system, obtaining these results (these are actually ‘best of 3’ results; after priming the cache, of course):

    $ time wc -l < /tmp/junk
    real 0.280s user 0.156s sys 0.124s (total cpu 0.280s)
    $ time cat /tmp/junk | wc -l
    real 0.407s user 0.157s sys 0.618s (total cpu 0.775s)
    $ time sh -c 'cat /tmp/junk | wc -l'
    real 0.411s user 0.118s sys 0.660s (total cpu 0.778s)
    

    Notice that the two pipeline results claim to have taken more CPU time (user+sys) than real wall-clock time. This is because I’m using the shell (bash)’s built-in ‘time’ command, which is cognizant of the pipeline; and I’m on a multi-core machine where separate processes in a pipeline can use separate cores, accumulating CPU time faster than realtime. Using /usr/bin/time I see smaller CPU time than realtime — showing that it can only time the single pipeline element passed to it on its command line. Also, the shell’s output gives milliseconds while /usr/bin/time only gives hundredths of a second.

    So at the efficiency level of wc -l, the cat makes a huge difference: taking 409 ms as the mean of the two pipeline runs, 409 / 280 = 1.461, or 46.1% more realtime, and 775 / 280 = 2.768, or a whopping 177% more CPU used, at least on my random it-was-there-at-the-time test box.

    I should add that there is at least one other significant difference between these styles of testing, and I can’t say whether it is a benefit or fault; you have to decide this yourself:

    When you run cat big_file | /usr/bin/time my_program, your program is receiving input from a pipe, at precisely the pace sent by cat, and in chunks no larger than written by cat.

    When you run /usr/bin/time my_program < big_file, your program receives an open file descriptor to the actual file. Your program — or in many cases the I/O libraries of the language in which it was written — may take different actions when presented with a file descriptor referencing a regular file. It may use mmap(2) to map the input file into its address space, instead of using explicit read(2) system calls. These differences could have a far larger effect on your benchmark results than the small cost of running the cat binary.

    Of course it is an interesting benchmark result if the same program performs significantly differently between the two cases. It shows that, indeed, the program or its I/O libraries are doing something interesting, like using mmap(). So in practice it might be good to run the benchmarks both ways; perhaps discounting the cat result by some small factor to “forgive” the cost of running cat itself.

    2

    • 36

      Wow, that was quite insightful! While I’ve been aware that cat is unnecessary for feeding input to stdin of programs and that the < shell redirect is preferred, I’ve generally stuck to cat due to the left-to-right flow of data that the former method preserves visually when I reason about pipelines. Performance differences in such cases I’ve found to be negligible. But, I do appreciate your educating us, Bela.

      – JJC

      May 9, 2017 at 1:16

    • 15

      Redirection is parsed out of the shell command line at an early stage, which allows you to do one of these, if it gives a more pleasing appearance of left-to-right flow:

      $ < big_file time my_program
      $ time < big_file my_program

      This should work in any POSIX shell (i.e. not `csh`, and I’m not sure about exotica like `rc`).

      May 10, 2017 at 21:55