Categories
csv pandas python

open selected rows with pandas using “chunksize” and/or “iterator”

I have a large csv file and I open it with pd.read_csv as follows:

df = pd.read_csv('path/fileName.csv', sep=' ', header=None)

As the file is really large, I would like to be able to read it in blocks of rows:

from 0 to 511
from 512 to 1023
from 1024 to 1535
...
from 512*n to 512*(n+1) - 1

where n = 0, 1, 2, …
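To make the goal concrete, here is a toy sketch of the kind of access I am after, using skiprows and nrows on a small in-memory stand-in for the file (I am not sure this is the right approach, which is why I am asking):

```python
import io

import pandas as pd

# small stand-in for the large file: one number per row
csv_text = "\n".join(str(i) for i in range(10))

block_size = 3  # would be 512 for the real file
n = 2           # block index

# skip the first block_size*n rows, then read block_size rows
block = pd.read_csv(io.StringIO(csv_text), sep=' ', header=None,
                    skiprows=block_size * n, nrows=block_size)
print(block[0].tolist())  # rows 6, 7 and 8 -> [6, 7, 8]
```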

If I add chunksize=512 to the arguments of read_csv,

df = pd.read_csv('path/fileName.csv', sep=' ', header=None, chunksize=512)

and I type

df.get_chunk(5)

then I am able to read rows 0 to 4 (the first 5 rows), or I may be able to divide the file into chunks of 512 rows using a for loop:

data = []
for chunk in df:
    data.append(chunk)

But this is of little use, since the whole file still has to be read, which takes time. How can I read only rows 512*n to 512*(n+1) - 1?
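For reference, this is what I observe with get_chunk on a small stand-in file: the reader apparently keeps its position between calls, so consecutive calls return consecutive row ranges (my observation, not something I found stated explicitly):

```python
import io

import pandas as pd

csv_text = "\n".join(str(i) for i in range(10))  # 10 rows, one number each

reader = pd.read_csv(io.StringIO(csv_text), sep=' ', header=None, chunksize=3)

first = reader.get_chunk(5)   # rows 0..4
second = reader.get_chunk(5)  # rows 5..9, continuing where the first call stopped
print(first[0].tolist(), second[0].tolist())
```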

Looking around, I often saw that “chunksize” is used together with “iterator”, as follows:

df = pd.read_csv('path/fileName.csv', sep=' ', header=None, iterator=True, chunksize=512)

But after many attempts I still don’t understand what benefit this boolean flag gives me. Could you explain it to me, please?
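For what it is worth, in my own small tests the two calls seem to yield exactly the same chunks whether or not iterator=True is passed, as long as chunksize is set (this is just my observation on a toy file):

```python
import io

import pandas as pd

csv_text = "\n".join(str(i) for i in range(6))  # 6 rows, one number each

with_iterator = pd.read_csv(io.StringIO(csv_text), sep=' ', header=None,
                            iterator=True, chunksize=2)
without_iterator = pd.read_csv(io.StringIO(csv_text), sep=' ', header=None,
                               chunksize=2)

chunks_a = [c[0].tolist() for c in with_iterator]
chunks_b = [c[0].tolist() for c in without_iterator]
print(chunks_a, chunks_b)  # identical: [[0, 1], [2, 3], [4, 5]] both times
```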