I have successfully completed the free online course ‘Introduction to Data Science’ offered by coursera.org, and I just received my statement of accomplishment.
For the last several months I have been designing and implementing a statistics/analytics system for a social network game. Many users play at the same time, and all of them produce a huge volume of events. One of the standard requirements for an analytics system is fast queries over the collected data. Logically, I used OLAP cubes to collect all the kinds of events our team needs to analyze. Technically, the best fit in our case is column-based storage, and I use Infobright. A regular RDBMS (SQL) or a document database like MongoDB is not enough here because of performance: they are better suited for OLTP, not for MGD OLAP. On the other hand, a heavyweight tool like a Hadoop-based solution would be overkill. So Infobright is exactly the right fit, and choosing it was one of the best decisions I have made as a software architect in the last several months:
- Because it is a pure OLAP solution, I am free to implement any ETL/storage/query scheme;
- Because Infobright is a column-based store, even my most sophisticated queries over huge recordsets execute extremely quickly;
- Because heavy functionality like aggregation and filtering is hidden inside Infobright’s internals, I can concentrate on my business task and design/implement/add a new module/scheme/query very quickly.
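To make the kind of query concrete, here is a minimal sketch of an OLAP-style rollup such a system runs constantly. Infobright is queried over its MySQL-compatible interface, but to keep the example self-contained I use SQLite; the table and column names (`events`, `user_id`, `event_type`, `amount`) are hypothetical, not the real schema.

```python
import sqlite3

# Hypothetical event table; in production this would live in Infobright
# (queried over its MySQL-compatible interface), not in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "purchase", 100),
        (1, "purchase", 50),
        (2, "purchase", 70),
        (2, "login", 0),
        (3, "login", 0),
    ],
)

# A typical OLAP-style aggregation: event count and total amount per type.
rows = conn.execute(
    "SELECT event_type, COUNT(*) AS events, SUM(amount) AS total "
    "FROM events GROUP BY event_type ORDER BY event_type"
).fetchall()
for row in rows:
    print(row)
```

A column store shines on exactly this shape of query: it only touches the columns named in the SELECT, so scans over huge recordsets stay fast.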
If you have as much data as I do now, you can hit a performance pitfall even in seemingly safe places like IO. Not RDBMS vs. NoSQL, not algorithmic complexity O(..), but the regular ETL process of copying data from one storage to another.
E.g., one job needs to load ~1M records from storage, process them, and save the output to a CSV file. A simple straightforward implementation of such an algorithm may be fine for small datasets, but in this case it took 20 minutes to load, process, and save all the data. The bottleneck was saving data into the .csv file row by row, without buffering. After adding buffering, it took just several seconds.
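The difference can be sketched as follows. This is a minimal illustration, not the actual production code: I assume the slow path effectively forced every row to disk, while the fast path lets a large write buffer batch the rows; all function names here are made up.

```python
import csv
import os

def save_rows_row_by_row(rows, path):
    # Slow variant: force every single row out to disk. At ~1M records
    # the per-row sync cost dominates everything else in the job.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)
            f.flush()
            os.fsync(f.fileno())  # one disk sync per row

def save_rows_buffered(rows, path):
    # Fast variant: a large in-memory buffer batches the rows;
    # data is flushed to disk once, when the file is closed.
    with open(path, "w", newline="", buffering=1024 * 1024) as f:
        csv.writer(f).writerows(rows)
```

Both functions produce identical files; only the number of trips to the disk differs, and that is where the 20 minutes went.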
So, 1-st: try your solution on huge test datasets before going to production with real big data.
2-nd: use buffered IO when processing big data.