Adventures in biggish data

This is going to be an evolving blog post retracing my ongoing attempt at dealing with a dataset of 65 gigabytes. It will often look silly – that’s because I am not a programmer by training, and I make an effort to honestly record the steps I took – including all mistakes and “doooohh!” moments.

See the bottom of the post for answers to some questions. Add yours in the comments if you wish; I’ll do my best to respond.

I am doing this with the goal of exploring this dataset visually (an interesting methodological question, I find) – and maybe foremost, to learn how to work with big datasets in practice. That’s harder than I thought.

The dataset:

This is the IRI dataset, which is documented in the journal Marketing Science (link to the pdf, link to the journal website).

It is delivered by post on an external hard drive containing a hierarchy of folders with csv files (in various formats) and Excel files holding weekly data on product purchases in drugstores and shopping malls – collected across 10 years in participating stores in the US. The size of each file ranges from a couple of megabytes (MB) to ~ 800 MB. In total they amount to ~ 160 GB, of which I’ll end up using only 65 GB.

COUNTER OF COSTS SO FAR:

80 euros (server rental costs)
70 euros (one Terabyte external hard drive).

TOTAL ———–> 150 euros.

ACHIEVED SO FAR:

The files have been imported into a database.

Early November 2013

– delivery of the dataset (160 GB) on a 500 GB hard drive.
– reading of the 75-page PDF that comes with the dataset. The dataset contains several different parts; I realize I’ll start by using only a portion of it, amounting to 65 GB.
– copy of the dataset to the hard drive of my laptop (450 GB, spinning disk). Note: the laptop has a second hard drive where the OS runs (SSD, 120 GB, almost full).
– I write Java code to parse the files and import them into a Mongo database stored on my 450 GB hard drive, using the wonderfully helpful Morphia (it makes the syntax so easy) – see the sketch after this list.
– First attempts at importing: I realize that the database will be much bigger than the original flat files. Why? I investigate on StackOverflow and manage to reduce the size of the future db significantly.
– Still, I don’t know the final size of the db, so there is a risk that my hard drive will fill up. I buy a 1 Terabyte / USB 3.0 external hard drive (Seagate, 70 euros at my local store).
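
For the record, here is roughly what that import code looks like. This is a minimal sketch: the class name, field names and csv layout are invented for illustration (the real IRI files carry more columns), and the Morphia package names depend on the version in use. One trick that helps keep the database size down is to shorten the stored field names with @Property, since BSON repeats every field name inside every document.

import java.io.BufferedReader;
import java.io.FileReader;
import com.mongodb.MongoClient;
import org.mongodb.morphia.Datastore;
import org.mongodb.morphia.Morphia;
import org.mongodb.morphia.annotations.*;
import org.bson.types.ObjectId;

// Hypothetical POJO for one line of the weekly purchase files.
// @Property("w") means the field is stored as "w" in MongoDB.
@Entity("purchases")
class WeeklyPurchase {
    @Id ObjectId id;
    @Property("w") int week;
    @Property("s") String storeId;
    @Property("u") String upc;      // product code
    @Property("q") int units;
    @Property("d") double dollars;
}

class Importer {
    public static void main(String[] args) throws Exception {
        // one datastore for the whole import
        Datastore ds = new Morphia().createDatastore(new MongoClient(), "iri");
        // args[0] is the path to one of the csv files
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] f = line.split(",");
                WeeklyPurchase p = new WeeklyPurchase();
                p.week = Integer.parseInt(f[0]);
                p.storeId = f[1];
                p.upc = f[2];
                p.units = Integer.parseInt(f[3]);
                p.dollars = Double.parseDouble(f[4]);
                ds.save(p);   // one document per csv line
            }
        }
    }
}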

Mid November 2013

– First attempts to import the Excel / csv files into MongoDB on this external hard drive. The laptop grinds to a halt after 2 hours or so: memory issues. What, on my 16 GB RAM laptop? The cause: by design, MongoDB will use all the memory available on the system if it needs it. It is supposed to leave enough RAM for other processes, but apparently it does not. I feel stuck. Oh wait, running MongoDB in a virtual machine would allow me to allocate a specific amount of RAM to it? I tried Oracle’s VirtualBox but, long story short, I can’t run a 64-bit virtual machine on my 64-bit laptop because a virtualization setting in my BIOS should be switched on to allow it, but my BIOS does not expose this setting (and I won’t flash a BIOS, that’s beyond what I feel able to do).

– At this point I realize that the external hard drive I bought won’t serve me here. I need a remote server for the database, where Mongo will sit alone. Or were there other options to keep the data locally?

End November 2013

– I try to rent a server from OVH (13 euros for a month + 13 euros setup costs: a 1 Terabyte server with a small processor from Kimsufi, their low-cost offer). I don’t get access to it in the following 3 days and give up. I got a refund later.

– I rent a server (at ~ 40 euros per month, no setup cost) with 2 Terabyte hard drives, 24 GB of RAM (!!) and a high-performing processor (i9720) from Hetzner’s auction site. Sounds dodgy and too good to be true, yet I get access to it within 3 hours and install Debian and Mongo on it (easier than I thought, given that I am a Linux noob).

– Re-run my Java code on my laptop to import the Excel / csv files onto this distant server. New bottleneck: it takes ages for the data to travel over my wifi connection to the server. Of course…

– I rent a second server (at ~ 40 euros per month, still at Hetzner), in the same geographical region as the first, where I’ll put the data and run my Java code from.
– Start uploading the data to it: it takes ages (more than two weeks at this pace).

Early December 2013

– Went to my university to benefit from its transfer speed. After some hiccups, I got the 65 GB to transfer from my laptop to one of the remote servers I rented in just a couple of hours.
– Starting the import of these 65 GB of csv / Excel files from this server to the MongoDB server. Monitoring the thing for the last 30 minutes, I see that 917,000,000 (close to 1 billion!!) weekly purchase entries have already been transferred to the db – and counting! (One entry looks like “this week, 45 packs of Guinness were bought at the store XXX located in Austin, Texas for a total of $200”.) Big data, here I come! For some reason the store descriptions didn’t get stored yet, though. I’ll look into that later. Very excited about the 1 billion transaction thing. Also worried about how to query this. We’ll see.
– For some reason the database crashed after 1.1 billion transactions were imported. Trying to relaunch the import where it stopped, I accidentally drop (delete) the database. Oooops.
– Before relaunching the import, I optimize the code a bit, fix a bug, and go!
– 14 hours since this new import started: 2,949 stores found and stored, 138,985 products found and stored, and 1.3 billion transactions found and stored – and counting. Wow. No crash, looks good.
– 2 days after it started, the import has finished without a crash! 2.29 billion “weekly purchase data” entries were found and stored in the db. The csv / Excel files take 65 GB of disk space, but once imported into the db the same data takes 400 GB. Wow. Next step: building indexes and running a first query – see the sketch below.
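
The plan for that next step, sketched with the same made-up POJO as above (field names, store id and database name are mine, not the real IRI ones; imports omitted): annotate the fields to index, call ensureIndexes(), then run a first query through Morphia.

// In the WeeklyPurchase POJO, mark the fields to index:
//   @Indexed @Property("s") String storeId;
//   @Indexed @Property("u") String upc;

Datastore ds = new Morphia().createDatastore(new MongoClient(), "iri");
ds.ensureIndexes();   // builds the declared indexes – on 2.29 billion documents this will take a while

// First query: the weekly entries of one (made-up) store
Query<WeeklyPurchase> query = ds.createQuery(WeeklyPurchase.class)
                                .field("storeId").equal("234140");
long entriesForThatStore = ds.getCount(query);          // how many weekly entries for this store?
List<WeeklyPurchase> sample = query.limit(10).asList(); // peek at the first few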

QUESTIONS:

– Why not use university infrastructure?

I am transitioning between two universities at the moment (from Erasmus University Rotterdam to EMLyon Business School), so it is not the right moment to ask for a server to be set up, which could take weeks anyway. When I arrive at EMLyon I’ll reconsider my options. The other reason is that I want to learn how “big data” works in practice. My big dataset is still smallish, and I already run into so many issues. So I am happy to go through it, as it will give me a better understanding of what’s involved in dealing with the next scale: terabytes. I feel that this first-hand knowledge will allow me to teach the students in a better way, and that I will make more informed choices when dealing with experts (IT admins from the university or the CNRS) when the moment comes to launch larger-scale projects in big data.

– Why MongoDB?

I was just seduced by the ease of their query syntax. That’s horrifying as a decision parameter, I know. Still, I stand by it. I feel that it is indeed a determining factor, because if the underlying performance is good enough (I’ll see about that), then as a coder I can choose the db system which is the least painful / nicest to use (though I don’t use it myself, the MongoDB javascript console is, I think, a main driver behind the adoption of Mongo as a default in the Node.js community). And with the Morphia library added to it, Mongo for Java is just a breeze to use: create POJOs, save POJOs, query POJOs. That’s it:

// connect and map the annotated POJO
Morphia morphia = new Morphia();
morphia.map(Employee.class);
Datastore ds = morphia.createDatastore(new MongoClient(), "hr");

// save a POJO
ds.save(new Employee("Mister", "GOD", null, 0));

// get an employee without a manager
Employee boss = ds.find(Employee.class).field("manager").equal(null).get();

No tables, no rubbish query syntax, none of that.
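
For completeness, here is what such a POJO can look like once annotated for Morphia – my own minimal reconstruction of the Employee class implied by the snippet above, not the official Morphia example:

// assumes the usual imports: org.mongodb.morphia.annotations.* and org.bson.types.ObjectId
@Entity("employees")
class Employee {
    @Id ObjectId id;
    String firstName;
    String lastName;
    @Reference Employee manager;   // null for the big boss
    double salary;

    Employee() {}                  // Morphia needs a no-arg constructor
    Employee(String firstName, String lastName, Employee manager, double salary) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.manager = manager;
        this.salary = salary;
    }
}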

Of course, I’ll see with this experiment whether Mongo fits the job in terms of performance. If it doesn’t, I’ll explore Neo4j or SQL (in that order).

– Why not Amazon services?

Yes, yes. I am constrained by my attachment to MongoDB here. I would have run MongoDB on Amazon and all would have been fine, maybe. But the instructions on how to run Mongo on Amazon EC2 scared me off.

5 thoughts on “Adventures in biggish data”

  1. Hi Clement,
    We’re currently developing a data discovery tool which allows exploring and discovering hidden correlations within the data files of scientific experiments.

    The tool is called AutoDiscovery and is explained in detail (including a real success case) on our website (www.butlerscientifics.com).

    I’ll be glad to collaborate on this, so feel free to contact me to go further with it.

  2. I’m also working on this data set and am interested in collaborating where mutually beneficial. My chosen tool is R, running on our HPC cluster at Lancaster University. Feel free to email me – I’ve responded to your posting on the Google Group regarding the number of records in the data set, which I estimate to be 1.2bn based on the import of 3 categories to date.

  3. Hi there,

    Cool post here:)

    I am currently dealing with large data like yours (50+ GB, a year of tweets). My first choice was also to use Mongo to store everything (such a nice and straightforward tool!). After some time, I figured out that data + indexes on Mongo were taking way too much space on my disk without any significant advantage.

    So I deleted the db and went back to prototyping the complete workflow on smaller samples. It turned out to be a much better approach, avoiding the pitfalls of large scale, like spending hours on optimization and waiting for millions of rows to be processed before being able to debug.

    What I ended up with was:
    * Manipulate zipped csv files with Python (Pandas) to explore the data, write feature extraction routines and processing tasks http://pandas.pydata.org/
    * Store in Mongo only the parts of the data where relevant features are present (straight from csv to Mongo, chunk by chunk)
    * Use the Mongo aggregation framework (mostly map-reduce) to process datasets and get them ready for visualization http://docs.mongodb.org/manual/aggregation/

    With this optimized approach, I can now get my features extracted in 10 min for a sample of 20 million tweets (the first approach with Mongo took something like 20 hours to do the same thing, with even poorer results).

    For the hardware, I bought a 256 GB SSD and installed Mint/Debian on it. Worth every penny so far: it has saved me hundreds of hours of computing (it IS blazing fast). I did buy a VPS as well; I wanted to use it as a Mongo server with a Node.js API, but Mongo was so slow over the network that I gave up entirely. I am still asking myself whether I want to run tasks on EC2 to get more computing power, because I am not sure it is worth the time it takes to set up the EC2 instance with all the required software. Honestly, 60 GB is not much, and scientific computing doesn’t require production-quality code or distributed computing. So I may as well avoid all the scripting and install process for multiple machines…

    Finally, a small note about indexes on Mongo: building indexes on a large dataset after storage can be really costly (i.e. hours…) and can even fail. On the other hand, if you index on the fly while you’re storing the data, you won’t notice any difference. This is easily done in Mongo with the function ensure_index().

    Hope it helps!
    Cheers
