Pandas HDF5 as a Database

I’ve been using Python pandas for the last year and I’m really impressed by its performance and functionality. However, pandas is not a database yet. Lately I’ve been thinking about ways to combine the analysis power of pandas with a flat-file HDF5 database. Unfortunately, HDF5 is not designed to deal natively with concurrency.

I’ve been looking around for inspiration in locking systems, distributed task queues, parallel HDF5, flat-file database managers and multiprocessing, but I still don’t have a clear idea of where to start.
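
To make the locking idea concrete, here is a minimal sketch of the simplest variant I can think of: an advisory lock file that every writer must hold before touching the data file. It assumes a Unix system (it relies on `fcntl.flock` from the standard library), and the names `exclusive_lock` and `append_record` are hypothetical helpers of mine. A plain text file stands in for the HDF5 file so that the example needs nothing beyond the standard library. It only serializes cooperating processes on one machine and is not a full concurrency solution.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Block until this process holds an exclusive advisory lock on lock_path."""
    with open(lock_path, "w") as lock_file:
        # flock blocks here while another process holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def append_record(data_path, record):
    """Append one line to data_path while holding the lock (hypothetical helper).

    In the real use case, the body of the inner block would be the HDF5 write.
    """
    with exclusive_lock(data_path + ".lock"):
        with open(data_path, "a") as data_file:
            data_file.write(record + "\n")
```

Each writer process would call `append_record` (or wrap its own HDF5 write in `exclusive_lock`), so that only one of them modifies the file at a time. Readers would need to take the same lock, or a shared one, to avoid seeing a half-written file.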

Ultimately, I would like to have a RESTful API to interact with the HDF5 file to create, retrieve, update and delete data. A possible use case for this could be building a time series store where sensors can write data and analytical services can be implemented on top of it.
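
As a rough illustration of the storage layer such an API could sit on, here is a sketch using `pandas.HDFStore` in table format, which supports appending rows and querying by index. It assumes PyTables is installed. The key layout `sensors/<name>` and the function names `append_readings` and `read_readings` are my own assumptions for this example. It covers only the "create" and "retrieve" operations, and it does nothing about concurrent access.

```python
import os
import tempfile

import pandas as pd


def append_readings(path, sensor, readings):
    """Append a DataFrame of readings (indexed by timestamp) for one sensor."""
    with pd.HDFStore(path, mode="a") as store:
        # format="table" makes the dataset appendable and queryable.
        store.append(f"sensors/{sensor}", readings, format="table")


def read_readings(path, sensor, start):
    """Return all readings for one sensor with a timestamp at or after start."""
    with pd.HDFStore(path, mode="r") as store:
        return store.select(
            f"sensors/{sensor}",
            where=f"index >= Timestamp('{start}')",
        )
```

A sensor would call `append_readings` with a small DataFrame each time it reports, and an analytical service would call `read_readings` to pull a time window into pandas. Update and delete are harder in this layout and are left out of the sketch.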

Any ideas about possible paths to follow, similar existing projects, or the advantages and drawbacks of the whole idea would be very much appreciated.

PS: I know I could use a SQL/NoSQL database to store the data instead, but I want to use HDF5 because I haven’t seen anything faster when it comes to retrieving large volumes of data.