Big Data Chronicles: Buffered IO

If you work with as much data as I do now, you can hit a performance pitfall even in places that seem safe, like IO. Not RDBMS vs. NoSQL, not O(...) complexity, but a regular ETL process copying data from one storage to another.

For example, the task was to load ~1M records from storage, process them, and save the output to a CSV file. A simple, straightforward implementation of such an algorithm may be fine for small datasets, but in this case it took 20 minutes to load, process, and save all the data. The bottleneck was saving data to the .csv file row by row, without buffering. After adding buffering, the same job took just a few seconds.

So, first: try your solution on huge test datasets before going to production with real big data.

Second: use buffered IO when processing big data.

