
How can I read and parse CSV files in C++?

314

I need to load and use CSV file data in C++. At this point it can really just be a comma-delimited parser (i.e. don’t worry about escaping newlines and commas). The main need is a line-by-line parser that returns a vector for the next line each time the method is called.

I found this article which looks quite promising:
http://www.boost.org/doc/libs/1_35_0/libs/spirit/example/fundamental/list_parser.cpp

I’ve never used Boost’s Spirit, but am willing to try it, though only if there isn’t a more straightforward solution I’m overlooking.

6

  • 11

    I have looked at boost::spirit for parsing. It is more for parsing grammars than parsing a simple file format. Someone on my team was trying to use it to parse XML and it was a pain to debug. Stay away from boost::spirit if possible.

    – chrish

    Jul 13, 2009 at 19:30

  • 52

    Sorry chrish, but that’s terrible advice. Spirit isn’t always an appropriate solution but I’ve used it – and continue to use it – successfully in a number of projects. Compared to similar tools (Antlr, Lex/yacc etc) it has significant advantages. Now, for parsing CSV it’s probably overkill…

    – MattyT

    Jul 14, 2009 at 12:09

  • 4

    @MattyT IMHO spirit is pretty hard to use for a parser combinator library. Having had some (very pleasant) experience with Haskell’s (atto)parsec libraries, I expected it (spirit) to work similarly well, but gave up on it after fighting with 600-line compiler errors.

    – fho

    Jul 14, 2014 at 13:24

  • 4

    Why don’t you want to escape commas and newlines? Every search links to this question, and I could not find one answer that considers the escaping! 😐

    – Shafquat

    May 18, 2021 at 15:58

348

If you don’t care about escaping commas and newlines,
AND you can’t embed commas and newlines in quotes (if you can’t escape, then…)
then it’s only about three lines of code (OK, 14, but it’s only 15 to read the whole file).

std::vector<std::string> getNextLineAndSplitIntoTokens(std::istream& str)
{
    std::vector<std::string>   result;
    std::string                line;
    std::getline(str,line);

    std::stringstream          lineStream(line);
    std::string                cell;

    while(std::getline(lineStream,cell, ','))
    {
        result.push_back(cell);
    }
    // This checks for a trailing comma with no data after it.
    if (!lineStream && cell.empty())
    {
        // If there was a trailing comma then add an empty element.
        result.push_back("");
    }
    return result;
}
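
For reference, here is a minimal sketch of the "read the whole file" loop mentioned above. The names `splitLine` and `readAllRows` are hypothetical and not part of the code above. The splitting logic is the same; the difference is that the result of `std::getline` is checked before splitting, which is one way to tell a blank line apart from the end of the input.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split one already-read line on commas.
// No quoting or escaping is handled, as in the function above.
std::vector<std::string> splitLine(const std::string& line)
{
    std::vector<std::string> result;
    std::stringstream        lineStream(line);
    std::string              cell;

    while (std::getline(lineStream, cell, ','))
    {
        result.push_back(cell);
    }
    // After a trailing comma (or for an empty line) the last read fails
    // and leaves `cell` empty: add the empty field that this implies.
    if (!lineStream && cell.empty())
    {
        result.push_back("");
    }
    assert(!result.empty());   // every line yields at least one field
    return result;
}

// Hypothetical helper: read every remaining line of the stream.
std::vector<std::vector<std::string>> readAllRows(std::istream& in)
{
    std::vector<std::vector<std::string>> rows;
    std::string                           line;
    // The loop stops as soon as getline cannot read another line.
    while (std::getline(in, line))
    {
        rows.push_back(splitLine(line));
    }
    return rows;
}
```

Any `std::istream` works as the argument, so the same function can be fed a `std::ifstream` or a `std::istringstream`.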

I would just create a class representing a row.
Then stream into that object:

#include <iterator>
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <string>
#include <string_view>

class CSVRow
{
    public:
        std::string_view operator[](std::size_t index) const
        {
            return std::string_view(&m_line[m_data[index] + 1], m_data[index + 1] -  (m_data[index] + 1));
        }
        std::size_t size() const
        {
            return m_data.size() - 1;
        }
        void readNextRow(std::istream& str)
        {
            std::getline(str, m_line);

            m_data.clear();
            m_data.emplace_back(-1);
            std::string::size_type pos = 0;
            while((pos = m_line.find(',', pos)) != std::string::npos)
            {
                m_data.emplace_back(pos);
                ++pos;
            }
            // The end of the line is the final boundary (this also covers a trailing comma).
            pos   = m_line.size();
            m_data.emplace_back(pos);
        }
    private:
        std::string         m_line;
        std::vector<int>    m_data;
};

std::istream& operator>>(std::istream& str, CSVRow& data)
{
    data.readNextRow(str);
    return str;
}

int main()
{
    std::ifstream       file("plop.csv");

    CSVRow              row;
    while(file >> row)
    {
        std::cout << "4th Element(" << row[3] << ")\n";
    }
}
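
To make the bookkeeping in `CSVRow` concrete: `m_data` holds the position of every comma, bracketed by a sentinel `-1` before the line and the line length after it, so field `i` is the text between boundary `i` and boundary `i + 1`. The following sketch shows that idea as free functions. The names `fieldBoundaries` and `fieldAt` are hypothetical, and the bounds check is an addition that the class above does not make.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: comma positions bracketed by -1 and line.size(),
// the same layout CSVRow keeps in m_data.
std::vector<int> fieldBoundaries(std::string_view line)
{
    std::vector<int> bounds{-1};
    for (std::size_t pos = 0; pos < line.size(); ++pos)
    {
        if (line[pos] == ',')
        {
            bounds.push_back(static_cast<int>(pos));
        }
    }
    bounds.push_back(static_cast<int>(line.size()));
    return bounds;
}

// Hypothetical helper: field `index` is the text strictly between two boundaries.
std::string_view fieldAt(std::string_view line, std::vector<int> const& bounds, std::size_t index)
{
    assert(index + 1 < bounds.size());   // there are bounds.size() - 1 fields
    std::size_t const start  = static_cast<std::size_t>(bounds[index] + 1);
    std::size_t const length = static_cast<std::size_t>(bounds[index + 1]) - start;
    return line.substr(start, length);
}
```

For the line `a,bb,,c` the boundaries are -1, 1, 4, 5, 7, which gives four fields, the third of them empty. Note that the returned `std::string_view` is only valid while the underlying line is alive and unchanged, which is also true of `CSVRow::operator[]`.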

But with a little work we could technically create an iterator:

class CSVIterator
{   
    public:
        typedef std::input_iterator_tag     iterator_category;
        typedef CSVRow                      value_type;
        typedef std::ptrdiff_t              difference_type;
        typedef CSVRow*                     pointer;
        typedef CSVRow&                     reference;

        CSVIterator(std::istream& str)  :m_str(str.good()?&str:nullptr) { ++(*this); }
        CSVIterator()                   :m_str(nullptr) {}

        // Pre Increment
        CSVIterator& operator++()               {if (m_str) { if (!((*m_str) >> m_row)){m_str = nullptr;}}return *this;}
        // Post increment
        CSVIterator operator++(int)             {CSVIterator    tmp(*this);++(*this);return tmp;}
        CSVRow const& operator*()   const       {return m_row;}
        CSVRow const* operator->()  const       {return &m_row;}

        bool operator==(CSVIterator const& rhs) const {return ((this == &rhs) || ((this->m_str == nullptr) && (rhs.m_str == nullptr)));}
        bool operator!=(CSVIterator const& rhs) const {return !((*this) == rhs);}
    private:
        std::istream*       m_str;
        CSVRow              m_row;
};


int main()
{
    std::ifstream       file("plop.csv");

    for(CSVIterator loop(file); loop != CSVIterator(); ++loop)
    {
        std::cout << "4th Element(" << (*loop)[3] << ")\n";
    }
}

Now that we are in 2020, let’s add a CSVRange object:

class CSVRange
{
    std::istream&   stream;
    public:
        CSVRange(std::istream& str)
            : stream(str)
        {}
        CSVIterator begin() const {return CSVIterator{stream};}
        CSVIterator end()   const {return CSVIterator{};}
};

int main()
{
    std::ifstream       file("plop.csv");

    for(auto& row: CSVRange(file))
    {
        std::cout << "4th Element(" << row[3] << ")\n";
    }
}
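
The fields returned by `row[i]` are still text. If numeric values are needed, one option is to convert each `std::string_view` explicitly. The sketch below assumes C++17 and uses `std::from_chars` for integers. The helper name `toLong` is hypothetical, and rejecting fields with leftover characters is a design choice of this sketch rather than anything the code above prescribes.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: convert a whole field to long.
// Returns std::nullopt if the field is empty, not a number,
// out of range, or has characters left over after the number.
std::optional<long> toLong(std::string_view field)
{
    long              value = 0;
    char const* const first = field.data();
    char const* const last  = first + field.size();

    auto const [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last)
    {
        return std::nullopt;
    }
    return value;
}
```

`std::from_chars` does not skip leading whitespace or accept a leading `+`, so fields such as ` 5` are rejected here; trim first if the data needs it.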

33

  • 29

    first() next(). What is this Java! Only Joking.

    Jul 14, 2009 at 5:15

  • 5

    @DarthVader: An overly broad statement that by its broadness is silly. Would you like to clarify why it is bad, and then why this badness applies in this context?

    Jan 12, 2012 at 20:10

  • 12

    @DarthVader: I think it is silly to make broad generalizations. The code above works correctly, so I can’t actually see anything wrong with it. But if you have any specific comment on the above, I will definitely consider it in this context. But I can see how you can come to that conclusion by mindlessly following a set of generalized rules for C# and applying it to another language.

    Jan 12, 2012 at 21:29


  • 5

    Also, if you run into weird linking problems with the above code because another library somewhere defines istream::operator>> (like Eigen), add an inline before the operator declaration to fix it.

    – sk29910

    Jun 28, 2013 at 0:58

  • 4

    The parsing part is missing, one still ends up with strings. This is just an over-engineered line splitter.

    Jul 3, 2014 at 9:16

77

My version is not using anything but the standard C++11 library. It copes well with Excel CSV quotation:

spam eggs,"foo,bar","""fizz buzz"""
1.23,4.567,-8.00E+09

The code is written as a finite-state machine and is consuming one character at a time. I think it’s easier to reason about.

#include <istream>
#include <string>
#include <vector>

enum class CSVState {
    UnquotedField,
    QuotedField,
    QuotedQuote
};

std::vector<std::string> readCSVRow(const std::string &row) {
    CSVState state = CSVState::UnquotedField;
    std::vector<std::string> fields {""};
    size_t i = 0; // index of the current field
    for (char c : row) {
        switch (state) {
            case CSVState::UnquotedField:
                switch (c) {
                    case ',': // end of field
                              fields.push_back(""); i++;
                              break;
                    case '"': state = CSVState::QuotedField;
                              break;
                    default:  fields[i].push_back(c);
                              break; }
                break;
            case CSVState::QuotedField:
                switch (c) {
                    case '"': state = CSVState::QuotedQuote;
                              break;
                    default:  fields[i].push_back(c);
                              break; }
                break;
            case CSVState::QuotedQuote:
                switch (c) {
                    case ',': // , after closing quote
                              fields.push_back(""); i++;
                              state = CSVState::UnquotedField;
                              break;
                    case '"': // "" -> "
                              fields[i].push_back('"');
                              state = CSVState::QuotedField;
                              break;
                    default:  // end of quote
                              state = CSVState::UnquotedField;
                              break; }
                break;
        }
    }
    return fields;
}

/// Read CSV file, Excel dialect. Accept "quoted fields ""with quotes"""
std::vector<std::vector<std::string>> readCSV(std::istream &in) {
    std::vector<std::vector<std::string>> table;
    std::string row;
    // getline returns the stream, which tests false at end of file or on error.
    while (std::getline(in, row)) {
        table.push_back(readCSVRow(row));
    }
    return table;
}
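
Because `readCSV` splits the input into lines before running the state machine, a quoted field that contains a newline is cut in two. The following is a sketch of a variant that reads one record straight from the stream instead, so the newline stays inside the field. It also uses `fields.back()` in place of the index `i`, since `i` always refers to the last element. The names `RecordState`, `readCSVRecord` and `readCSVText` are hypothetical. As in the code above, a `'\r'` is not treated specially and a character directly after a closing quote is dropped.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

enum class RecordState { Unquoted, Quoted, QuoteInQuoted };

// Hypothetical variant of readCSVRow: reads one record from the stream.
// Returns false when there was nothing left to read.
bool readCSVRecord(std::istream& in, std::vector<std::string>& fields)
{
    fields.assign(1, std::string());   // a record always has at least one field
    RecordState state  = RecordState::Unquoted;
    bool        sawAny = false;
    char        c;
    while (in.get(c))
    {
        sawAny = true;
        switch (state)
        {
            case RecordState::Unquoted:
                if (c == ',')       { fields.emplace_back(); }
                else if (c == '"')  { state = RecordState::Quoted; }
                else if (c == '\n') { return true; }   // end of record
                else                { fields.back().push_back(c); }
                break;
            case RecordState::Quoted:
                // Commas and newlines are ordinary characters here.
                if (c == '"') { state = RecordState::QuoteInQuoted; }
                else          { fields.back().push_back(c); }
                break;
            case RecordState::QuoteInQuoted:
                if (c == ',')       { fields.emplace_back(); state = RecordState::Unquoted; }
                else if (c == '"')  { fields.back().push_back('"'); state = RecordState::Quoted; }   // "" -> "
                else if (c == '\n') { return true; }   // end of record after closing quote
                else                { state = RecordState::Unquoted; }   // character is dropped
                break;
        }
    }
    // End of input: report a record only if some characters were consumed.
    return sawAny;
}

// Hypothetical convenience wrapper for parsing text that is already in memory.
std::vector<std::vector<std::string>> readCSVText(const std::string& text)
{
    std::istringstream                    in(text);
    std::vector<std::vector<std::string>> table;
    std::vector<std::string>              fields;
    while (readCSVRecord(in, fields))
    {
        table.push_back(fields);
    }
    return table;
}
```

A blank line comes back as a record with a single empty field, which matches what `readCSVRow` returns for an empty string.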

5

  • This nested vector of strings is a no-go for modern processors. It throws away their caching ability.

    Apr 5, 2018 at 6:56

  • plus you got all those switch statements

    Apr 5, 2018 at 7:05

  • The top answer didn’t work for me, as I am on an older compiler. This answer worked, vector initialisation may require this: const char *vinit[] = {""}; vector<string> fields(vinit, end(vinit));

    – dr_rk

    Apr 6, 2018 at 9:16


  • Looks like a great solution and the best solution. Thank you. I think that you could avoid using the counter i by using the method back on your vector called fields.

    – Mark S.

    Jun 9, 2021 at 19:52

  • Very clean solution, this is a better answer than the topmost answer !

    – jgx

    Aug 6, 2021 at 7:19

53

Solution using Boost Tokenizer:

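#include <boost/tokenizer.hpp>
#include <string>
#include <vector>

// `line` is a std::string holding one row of the file.
std::vector<std::string> vec;
using namespace boost;
// Escape character '\', field separator ',', quote character '"'.
tokenizer<escaped_list_separator<char> > tk(
   line, escaped_list_separator<char>('\\', ',', '\"'));
for (tokenizer<escaped_list_separator<char> >::iterator i(tk.begin());
   i!=tk.end();++i)
{
   vec.push_back(*i);
}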

3

  • 11

    The boost tokenizer doesn’t fully support the complete CSV standard, but there are some quick workarounds. See stackoverflow.com/questions/1120140/csv-parser-in-c/…

    Apr 13, 2010 at 23:03

  • 3

    Do you have to have the whole boost library on your machine, or can you just use a subset of their code to do this? 256mb seems like a lot for CSV parsing..

    – NPike

    Apr 27, 2011 at 23:28

  • 6

    @NPike : You can use the bcp utility that comes with boost to extract only the headers you actually need.

    – ildjarn

    May 24, 2011 at 23:06