Categories: database-design, distributed, hadoop, mysql, postgresql

Hadoop (+HBase/HDFS) vs MySQL (or PostgreSQL) – Loads of independent, structured data to be processed and queried

Hi there at SO,

I would like some ideas/comments on the following from you, honorable and venerable bunch.

I have 100M records which I need to process. I have 5 nodes (in a Rocks cluster) to do this. The data is very structured and falls nicely into the relational data model. I want to do things in parallel, since my processing takes some time.

As I see it I have two main options:

Install MySQL on each node and put 20M records on each. Use the head node to delegate queries to the nodes and aggregate the results. Query capabilities++, but I might risk some headaches when it comes to choosing a partitioning strategy etc. (Q: Is this what they call a MySQL/Postgres cluster?). The really bad part is that the processing of the records is now left up to me to take care of (how to distribute it across machines, etc.)…
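Roughly what I have in mind for the head node, as a sketch only (the shard host names, credentials and the wine table below are made up, and it assumes the mysql-connector-python driver):

    # Sketch: fan the same query out to every shard, then merge the results.
    from concurrent.futures import ThreadPoolExecutor

    import mysql.connector

    SHARD_HOSTS = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical names

    def query_shard(host, sql, params=()):
        """Run the query against one shard and return its rows."""
        conn = mysql.connector.connect(host=host, user="reader",
                                       password="secret", database="wines")
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            return cur.fetchall()
        finally:
            conn.close()

    def fan_out(sql, params=()):
        """Send the query to every shard in parallel and concatenate the results."""
        with ThreadPoolExecutor(max_workers=len(SHARD_HOSTS)) as pool:
            partials = pool.map(lambda h: query_shard(h, sql, params), SHARD_HOSTS)
        return [row for rows in partials for row in rows]

    # The head node answers an ad-hoc query by merging the per-shard results.
    rows = fan_out("SELECT * FROM wine WHERE colour = %s", ("brown",))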

Alternatively, install Hadoop, Hive and HBase (note that this might not be the most efficient way to store my data, since HBase is column-oriented) and just define the nodes. We write everything in the MapReduce paradigm and, bang, we live happily ever after. The problem here is that we lose the “real time” query capabilities (I know you can use Hive, but that is not suggested for real-time queries – which I need), since I also have some normal SQL queries to execute at times, e.g. “select * from wine where colour='brown'”.
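In that case the per-record processing would become a map-only streaming job, roughly like this sketch (the tab-separated record layout and process_record() are placeholders for whatever my actual processing is):

    #!/usr/bin/env python
    # Hadoop Streaming mapper sketch: each record is processed independently,
    # so no reduce step is needed.
    import sys

    def process_record(fields):
        # Placeholder for the real per-record processing.
        return ",".join(fields)

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if not fields or fields == [""]:
            continue
        # Emit key<TAB>value; with zero reduce tasks this goes straight to HDFS.
        sys.stdout.write("%s\t%s\n" % (fields[0], process_record(fields)))

That would be submitted as a map-only job via Hadoop Streaming (something like -numReduceTasks 0; the exact invocation depends on the Hadoop version).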

Note that in theory, if I had 100M machines, I could do the whole thing instantly, since for each record the processing is independent of the others. Also, my data is read-only; I do not envisage any updates happening. I do not need/want 100M records on one node, and I do not want redundant data (since there is lots of it), so keeping it in BOTH MySQL/Postgres and Hadoop/HBase/HDFS is not a real option.

Many Thanks