Categories
dataframe pandas python

How to iterate over rows in a DataFrame in Pandas

3513

I have a pandas dataframe, df:

   c1   c2
0  10  100
1  11  110
2  12  120

How do I iterate over the rows of this dataframe? For every row, I want to be able to access its elements (values in cells) by the name of the columns. For example:

for row in df.rows:
   print(row['c1'], row['c2'])

I found a similar question which suggests using either of these:

for date, row in df.T.iteritems():
for row in df.iterrows():

But I do not understand what the row object is and how I can work with it.

  • 25

    df.iteritems() iterates over columns, not rows. To make it iterate over rows, you have to transpose (the “T”), which swaps rows and columns (reflects the frame over its diagonal). As a result, you effectively iterate over the rows of the original dataframe when you use df.T.iteritems().

    Dec 14, 2017 at 23:41

  • 106

    In contrast to what cs95 says, there are perfectly fine reasons to want to iterate over a dataframe, so new users should not feel discouraged. One example is if you want to execute some code using the values of each row as input. Also, if your dataframe is reasonably small (e.g. less than 1000 items), performance is not really an issue.

    – oulenz

    Oct 16, 2019 at 8:53

  • 4

    @cs95 It seems to me that dataframes are the go-to table format in Python. So whenever you want to read in a csv, or you have a list of dicts whose values you want to manipulate, or you want to perform simple join, groupby or window operations, you use a dataframe, even if your data is comparatively small.

    – oulenz

    Nov 16, 2019 at 12:19


  • 4

    @cs95 No, but this was in response to “using a DataFrame at all”. My point is that this is why one may have one’s data in a dataframe. If you then want to e.g. run a script for each line of your data, you have to iterate over that dataframe.

    – oulenz

    Nov 16, 2019 at 18:55

  • 35

    I second @oulenz. As far as I can tell pandas is the go-to choice for reading a csv file, even if the dataset is small. It’s simply easier programming to manipulate the data with its APIs.

    – F.S.

    Nov 18, 2019 at 21:29


4613

DataFrame.iterrows is a generator which yields both the index and row (as a Series):

import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df = df.reset_index()  # make sure indexes pair with number of rows

for index, row in df.iterrows():
    print(row['c1'], row['c2'])

Output:

10 100
11 110
12 120

  • 350

    Note: “Because iterrows returns a Series for each row, it does not preserve dtypes across the rows.” Also, “You should never modify something you are iterating over.” According to pandas 0.19.1 docs

    – viddik13

    Dec 7, 2016 at 16:24


  • 7

    @viddik13 that’s a great note, thanks. Because of that I ran into a case where numerical values like 431341610650 were read as 4.31E+11. Is there a way around this that preserves the dtypes?

    – Aziz Alto

    Sep 5, 2017 at 16:30


  • 46

    @AzizAlto use itertuples, as explained below. See also pandas.pydata.org/pandas-docs/stable/generated/…

    – Axel

    Sep 7, 2017 at 11:45


  • 167

    Do not use iterrows. Itertuples is faster and preserves data type. More info

    – James L.

    Dec 1, 2017 at 16:14

  • 25

    From the documentation: “Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed[…]”. Your answer is correct (in the context of the question) but does not mention this anywhere, so it isn’t a very good one.

    – cs95

    May 28, 2019 at 5:00


1945

How to iterate over rows in a DataFrame in Pandas?

Answer: DON’T*!

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with “iter” in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().
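
For example, to print the whole frame without truncation (using the df from the question):

print(df.to_string())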

Do you want to compute something? In that case, search for methods in this order (list modified from here):

  1. Vectorization
  2. Cython routines
  3. List Comprehensions (vanilla for loop)
  4. DataFrame.apply(): i)  Reductions that can be performed in Cython, ii) Iteration in Python space
  5. DataFrame.itertuples() and iteritems()
  6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used only in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed […].

* It’s actually a little more complicated than “don’t”. df.iterrows() is the correct answer to this question, but “vectorize your ops” is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you’re not sure whether you need an iterative solution, you probably don’t. PS: To know more about my rationale for writing this answer, skip to the very bottom.
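
To illustrate the kind of case where iteration genuinely cannot be avoided, here is a minimal sketch of a row-dependent recurrence (the column name and the clamping rule are made up for the example):

import pandas as pd

df = pd.DataFrame({'deposit': [100, -150, 50]})

# Each new balance depends on the previous balance *after* clamping,
# so this cannot be replaced by a simple elementwise vectorized
# expression (a plain cumsum would give 100, -50, 0 instead).
balance = 0
balances = []
for amount in df['deposit']:
    balance = max(balance + amount, 0)  # balance may not go negative
    balances.append(balance)
df['balance'] = balances
print(df)

Note that even here the loop runs over a single column’s values rather than over iterrows.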


Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are “vectorized” by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorized method for your problem.
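
For instance, using the df from the question, all of the following are vectorized calls; no explicit Python loop is needed:

import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

df['total'] = df['c1'] + df['c2']  # elementwise arithmetic
big = df[df['c2'] > 100]           # boolean comparison + filtering
col_sums = df[['c1', 'c2']].sum()  # a reduction, computed per column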

If none exists, feel free to write your own using custom Cython extensions.


Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you’re trying to perform an elementwise transformation on your data. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[['col1', ..., 'coln']].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df['col1'], ..., df['coln'])]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.
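
For instance, a sketch with a made-up rule f (the function and its threshold are hypothetical, purely for illustration):

import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

def f(x, y):
    # Hypothetical business rule: flag rows whose combined value is large.
    return 'high' if x + y > 120 else 'ok'

df['flag'] = [f(x, y) for x, y in zip(df['c1'], df['c2'])]
print(df)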

Caveats

List comprehensions assume that your data is easy to work with – what that means is your data types are consistent and you don’t have NaNs, but this cannot always be guaranteed.

  1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
  2. When dealing with mixed data types you should iterate over zip(df['A'], df['B'], ...) instead of df[['A', 'B']].to_numpy(), as the latter implicitly upcasts data to the most common type. For example, if A is numeric and B is string, to_numpy() will cast the entire array to object dtype, which may not be what you want. Fortunately, zipping your columns together is the most straightforward workaround; a short demo follows this list.
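
A quick demonstration of that upcasting, using made-up columns A (integer) and B (float):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [1.5, 2.5]})

# to_numpy() finds one common dtype for the whole block,
# so the integers in A are silently upcast to floats:
arr = df[['A', 'B']].to_numpy()
print(arr.dtype)  # float64

# zip keeps each column's own dtype
# (with a numeric A and a string B, to_numpy() would give an object array):
for a, b in zip(df['A'], df['B']):
    print(type(a).__name__, type(b).__name__)  # int64 float64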

*Your mileage may vary for the reasons outlined in the Caveats section above.


An Obvious Example

Let’s demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The benchmark also measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you’re doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).
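
The full benchmarking code is linked rather than reproduced here, but a minimal sketch of such a comparison, using only the standard library’s timeit, might look like the following (the function bodies are illustrative, not the original benchmark, though the names vec and vec_numpy match those mentioned above):

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(10_000, 2)), columns=['A', 'B'])

def vec(df):         # idiomatic, vectorized pandas
    return df['A'] + df['B']

def vec_numpy(df):   # "numpandas": drop down to raw NumPy arrays
    return df['A'].to_numpy() + df['B'].to_numpy()

def list_comp(df):   # list comprehension over zipped columns
    return [a + b for a, b in zip(df['A'], df['B'])]

def iterrows_add(df):
    return [row['A'] + row['B'] for _, row in df.iterrows()]

for fn in (vec, vec_numpy, list_comp, iterrows_add):
    t = timeit.timeit(lambda: fn(df), number=10)
    print(f'{fn.__name__:>12}: {t:.4f} s')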

I should mention, however, that it isn’t always this cut and dry. Sometimes the answer to “what is the best method for an operation” is “it depends on your data”. My advice is to test out different approaches on your data before settling on one.


My Personal Opinion *

Most of the analyses performed on the various alternatives to the iter family have been through the lens of performance. However, in most situations you will typically be working on a reasonably sized dataset (nothing beyond a few thousand to 100K rows), and performance will come second to the simplicity/readability of the solution.

Here is my personal preference when selecting a method to use for a problem.

For the novice:

Vectorization (when possible); apply(); List Comprehensions; itertuples()/iteritems(); iterrows(); Cython

For the more experienced:

Vectorization (when possible); apply(); List Comprehensions; Cython; itertuples()/iteritems(); iterrows()

Vectorization prevails as the most idiomatic method for any problem that can be vectorized. Always seek to vectorize! When in doubt, consult the docs, or look on Stack Overflow for an existing question on your particular task.

I do tend to go on about how bad apply is in a lot of my posts, but I do concede it is easier for a beginner to wrap their head around what it’s doing. Additionally, there are quite a few use cases for apply, as explained in this post of mine.

Cython ranks lower down on the list because it takes more time and effort to pull off correctly. You will usually never need to write pandas code that demands a level of performance that even a list comprehension cannot satisfy.

* As with any personal opinion, please take with heaps of salt!


Further Reading

* Pandas string methods are “vectorized” in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.


Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form “How can I iterate over my df to do X?”, showing code that calls iterrows() inside a for loop. Here is why: a new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning whether iteration is the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I’m not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

  • 4

    Note that there are important caveats with iterrows and itertuples. See this answer and pandas docs for more details.

    – viddik13

    May 30, 2019 at 11:56

  • 132

    This is the only answer that focuses on the idiomatic techniques one should use with pandas, making it the best answer for this question. Learning to get the right answer with the right code (instead of the right answer with the wrong code – i.e. inefficient, doesn’t scale, too fit to specific data) is a big part of learning pandas (and data in general).

    May 30, 2019 at 14:26

  • 17

    I think you are being unfair to the for loop, though, seeing as for loops are only a bit slower than list comprehensions in my tests. The trick is to loop over zip(df['A'], df['B']) instead of df.iterrows().

    Jun 24, 2019 at 0:58

  • 4

    Under List Comprehensions, the “iterating over multiple columns” example needs a caveat: DataFrame.values will convert every column to a common data type. DataFrame.to_numpy() does this too. Fortunately we can use zip with any number of columns.

    Jan 16, 2020 at 20:44

  • 5

    @Dean I get this response quite often and it honestly confuses me. It’s all about forming good habits. “My data is small and performance doesn’t matter so my use of this antipattern can be excused” ..? When performance actually does matter one day, you’ll thank yourself for having prepared the right tools in advance.

    – cs95

    Jul 26, 2020 at 4:46


526

First consider if you really need to iterate over rows in a DataFrame. See this answer for alternatives.

If you still need to iterate over rows, you can use the methods below. Note some important caveats that are not mentioned in any of the other answers.

itertuples() is supposed to be faster than iterrows()

But be aware, according to the docs (pandas 0.24.2 at the moment):

  • iterrows: dtype might not match from row to row

Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally much faster than iterrows()

  • iterrows: Do not modify rows

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

Use DataFrame.apply() instead:

    new_df = df.apply(lambda x: x * 2, axis=1)

  • itertuples:

The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
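
A minimal sketch of both caveats, using the columns from the question (with c2 made float so the frame has mixed dtypes):

import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100.5, 110.5, 120.5]})

# iterrows packs each row into a Series, which forces a single dtype:
_, first_row = next(df.iterrows())
print(first_row['c1'])  # 10.0 -- the integer was upcast to float

# itertuples yields namedtuples and keeps each column's dtype:
for row in df.itertuples(index=True):
    print(row.Index, row.c1, row.c2)  # c1 stays an integer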

See pandas docs on iteration for more details.

  • 6

    Just a small question from someone reading this thread so long after its completion: how does df.apply() compare to itertuples in terms of efficiency?

    Jan 26, 2018 at 13:16

  • 7

    Note: you can also say something like for row in df[['c1','c2']].itertuples(index=True, name=None): to include only certain columns in the row iterator.

    Jun 29, 2018 at 7:29

  • 13

    Instead of getattr(row, "c1"), you can use just row.c1.

    – viraptor

    Aug 13, 2018 at 6:20

  • 1

    I am about 90% sure that if you use getattr(row, "c1") instead of row.c1, you lose any performance advantage of itertuples, and if you actually need to get to the property via a string, you should use iterrows instead.

    Aug 24, 2018 at 10:34

  • 3

    I have stumbled upon this question because, although I knew there’s split-apply-combine, I still really needed to iterate over a DataFrame (as the question states). Not everyone has the luxury to improve with numba and cython (the same docs say that “It’s always worth optimising in Python first”). I wrote this answer to help others avoid (sometimes frustrating) issues as none of the other answers mention these caveats. Misleading anyone or telling “that’s the right thing to do” was never my intention. I have improved the answer.

    – viddik13

    May 30, 2019 at 12:32