Big Data Chronicles: Buffered IO

If you work with as much data as I do now, you can fall into a performance pitfall even in places that seem safe, like IO. The culprit was neither RDBMS vs. NoSQL nor algorithmic complexity O(..), but a regular ETL process that copies data from one storage to another.

For example, the task was to load ~1M records from storage, process them, and save the output to a CSV file. A simple, straightforward implementation of such an algorithm may be fine for small datasets, but in this case it took 20 minutes to load, process, and save all the data. The bottleneck was saving the data into the .csv file row by row, with no buffering. After adding buffering, the same job took just a few seconds.
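A minimal sketch of the difference in Python (the original post does not show code, so the function names, the record iterable, and the 1 MiB buffer size are my assumptions). The key change is letting rows accumulate in a large in-memory buffer instead of flushing each row to disk:

```python
import csv

def save_rows_row_by_row(rows, path):
    # Slow variant: line buffering (buffering=1) forces a flush after every
    # row, so each of the ~1M records costs a separate trip to the OS.
    with open(path, "w", newline="", buffering=1) as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)

def save_rows_buffered(rows, path, buffer_size=1 << 20):
    # Buffered variant: rows accumulate in a ~1 MiB buffer and are written
    # to disk in large chunks, drastically reducing per-row IO overhead.
    with open(path, "w", newline="", buffering=buffer_size) as f:
        writer = csv.writer(f)
        writer.writerows(rows)

# Hypothetical usage:
# rows = load_records_from_storage()   # assumed loader yielding ~1M tuples
# save_rows_buffered(rows, "output.csv")
```

The point is not the exact buffer size but avoiding a flush per record; any reasonably large buffer (or batching rows before writing) gives a similar speedup.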

So, first: try your solution on large test datasets before going to production with real big data.

Second: use buffered IO when processing big data.
