How I scraped and stored over 3 million tweets

In the last entry I talked about how mapping Twitter sentiment can be a complete bitch. In this entry I’m taking a different approach by looking at the methods that I’ve used to scrape and store over 3 million tweets, while keeping performance and stability to a maximum.

This is part of a series of entries that will be taking a look behind the scenes of the Twitter sentiment analysis project.


Choosing a storage system

Before I even thought about how I was going to scrape the tweets, I needed to decide on a method for storing them. I was certain that I wanted to store them all in a database, as it would be the easiest method of getting the tweets back out again, but I had no idea which one to choose. All I knew was that I had two options.

Relational (MySQL)

The first option was to go with a standard MySQL database. These relational databases are what I'm used to, and are the type that I've developed with for years. However, the issue in my mind was that I wanted to perform some particularly rapid development on this project, so I didn't want to spend too much time setting up the database schemas, especially as they were likely to change in the future.

I also really wanted to play with some of the new database options that were doing the rounds, so I decided to take a look at them to see how they compared with MySQL.

Document-oriented (MongoDB or CouchDB)

MongoDB and CouchDB are probably two of the most well known document-oriented database systems around. They're new, they're cool, and they sound exciting to work with. The problem was that I'd never worked with either of them up until this point. In fact, I'd not even worked with any document-oriented database systems at all!

I was desperate to give this kind of database system a go, and this project seemed like the perfect opportunity to do that. After all, both CouchDB and MongoDB use JSON to store their data (or at least a variation of it), which is exactly the same format that the Twitter API data is returned as. It was just meant to be.

In the end I went with MongoDB for a few simple reasons: it looked easier to use (CouchDB forces you to use a REST API), it has much better documentation, and it has an immensely prettier website. What can I say? A well designed website gives me the impression that the product has been well designed as well. For the record, I'm sure CouchDB is built to a high standard as well (the BBC use it, after all).

So I had the database system. Now all I needed was a way of accessing the Twitter Streaming API and pushing that data into MongoDB.
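To give a feel for why MongoDB is such a natural fit, here's a hypothetical, heavily abridged Streaming API status (field names taken from the scraper code later in this entry; real payloads carry far more, such as the full user object and entities). A JSON-shaped hash like this can go into MongoDB essentially as-is:

```ruby
require "json"

# Hypothetical, heavily abridged tweet from the Streaming API.
# The values here are made up for illustration.
status = {
  "created_at"  => "Mon Sep 05 21:00:00 +0000 2011",
  "id"          => 110000000000000000,
  "id_str"      => "110000000000000000",
  "text"        => "Lovely weather in Brighton today!",
  "geo"         => nil,
  "coordinates" => { "type" => "Point", "coordinates" => [-0.1372, 50.8225] }
}

# MongoDB stores documents as BSON, so no schema or mapping layer
# is needed -- the hash is the document.
puts JSON.pretty_generate(status)
```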

Starting out with Node

I've been using Node extensively with the Rawkets multiplayer game, so it seemed logical to use it for scraping the Twitter Streaming API and sending the tweets to MongoDB. What's even better is that Node has modules for both MongoDB (Mongoose) and the Twitter Streaming API (node-twitter), which makes things a lot easier for me.

What I started out with was a basic script that scraped the Twitter Streaming API, took the JSON response for each tweet as it came in, and added that JSON directly into the MongoDB database. The great thing was that this worked like a charm. The bad thing was that it only ran while I had my computer on and had an Internet connection, so it wasn't 24/7 because my WiFi turned out to be really crap. The worst thing was that the script was prone to crashing (not Node itself), and I just didn't have the time to work out why, or how to reboot it automatically when it crashed.

I love Node, but I needed a more robust option that I could rely on without having to worry about the stability of my script or the reliability of my home Internet connection.

Moving to Ruby

Ruby isn't a language that I use often, but it's definitely one that I seem to turn to when I'm looking for something reliable and powerful. Perhaps I'm crazy? Still, Ruby also had gems for both MongoDB (mongo) and the Twitter Streaming API (tweetstream), so it was all systems go.

It turns out that it took surprisingly little Ruby code to achieve the same thing as in Node. Plus it was more stable!

require "tweetstream"
require "mongo"
require "time"

db = Mongo::Connection.new("MY_DB_URL", 27017).db("MY_DB_NAME")
tweets = db.collection("DB_COLLECTION_NAME")

TweetStream::Daemon.new("TWITTER_USER", "TWITTER_PASS", "scrapedaemon").on_error do |message|
  # Log your error message somewhere
end.filter({"locations" => "-12.72216796875, 49.76707407366789, 1.977539, 61.068917"}) do |status|
  # Do things when nothing's wrong
  data = {"created_at" => Time.parse(status.created_at), "text" => status.text, "geo" => status.geo, "coordinates" => status.coordinates, "id" => status.id, "id_str" => status.id_str}
  tweets.insert({"data" => data})
end

That isn't absolutely everything, but you can see how little code it takes to scrape the Twitter Streaming API and send each tweet to MongoDB. Cool, ey?

Another cool thing about the Ruby implementation is that the tweetstream gem comes with the option to run the script as a daemon. This means that you can start up the script and have it run in the background on the computer, which turned out to be extremely useful for what I did next.

Optimising performance, uptime, and reliability

When I was using the Node script, I literally passed the entire response from the Twitter Streaming API into MongoDB. This turned out to be a really bad idea, as each tweet was consuming about 2KB of disk space. That doesn't sound like much, but imagine if you had 60,000 tweets (what I was scraping a day) – that would be 120MB. Now imagine 3,000,000 tweets (what I've scraped in total so far) – that would be 6,000MB (6GB)!

As you can see in the Ruby code that I included above, I tried reducing the disk space required by only sending a small amount of data about each tweet to MongoDB – just the data that was necessary, like the tweet text, its ID, and the date that it was sent. The difference was instant and astounding, with each tweet consuming roughly 0.3KB – around six times smaller! A simple change, but one which had a dramatic effect. 60,000 tweets is now 18MB, and 3,000,000 tweets is now just 900MB (0.9GB).
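To make the saving concrete, here's a small, self-contained sketch of the same idea. The field list is the one from the scraper above; the "full" status is a made-up stand-in, with padding where the user object and entities would normally sit, since those are what bloat each tweet:

```ruby
require "json"

# Only the fields the scraper actually stores.
KEPT_FIELDS = %w[created_at text geo coordinates id id_str]

def trim(status)
  status.select { |key, _| KEPT_FIELDS.include?(key) }
end

# Hypothetical full status; "user" and "entities" stand in for the
# bulky extra data a real Streaming API payload carries.
full = {
  "created_at"  => "Mon Sep 05 21:00:00 +0000 2011",
  "text"        => "Lovely weather in Brighton today!",
  "geo"         => nil,
  "coordinates" => nil,
  "id"          => 110000000000000000,
  "id_str"      => "110000000000000000",
  "user"        => { "screen_name" => "someone", "description" => "x" * 500 },
  "entities"    => { "hashtags" => [], "urls" => [] }
}

trimmed = trim(full)
puts "full:    #{JSON.generate(full).bytesize} bytes"
puts "trimmed: #{JSON.generate(trimmed).bytesize} bytes"
```

The exact numbers depend entirely on the tweet, but the trimmed document is always a fraction of the size of the full one.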

Now that the tweets were compacted, it was time to move the scraper and database to a remote server to increase the uptime and reliability. To do this I moved the MongoDB database to MongoLab, who have fantastic customer service, and the Ruby script to my own private Web server. The result of this is that I now have effectively 100% uptime for the Internet connection, and the script has proven to be incredibly reliable (running 24/7 for weeks at a time).

Moving the data to MongoLab has also proven to be a fantastic decision because the databases are automatically backed up on to my own Amazon S3 account every hour, allowing me to quickly and easily download copies to my local computer without causing any problems on the live database.

Still, I need to come up with a system that handles failure better. Right now the Ruby script won't be restarted automatically if it fails, or if the server is rebooted. This results in gaps where tweets aren't collected between the scraper stopping and me restarting it manually. It's a situation that I'm OK with for now, but in an ideal world I'd take the time out to fix it, probably with a monit script.
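For reference, the monit approach would look something like the fragment below. This is only a sketch: the paths and the pidfile location are assumptions (tweetstream builds on the daemons gem, which writes a pidfile named after the daemon, so the "scrapedaemon" name comes from the script above).

```
# Hypothetical monit configuration -- adjust paths to suit.
check process scraper with pidfile /path/to/scraper/scrapedaemon.pid
  start program = "/usr/bin/ruby /path/to/scraper/scraper.rb start"
  stop program  = "/usr/bin/ruby /path/to/scraper/scraper.rb stop"
```

With a start program defined, monit relaunches the process whenever the pidfile's process disappears, which would close exactly the gaps described above.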


So as you can see, my journey with scraping the Twitter Streaming API and storing the resulting tweets in a simple database has proven to be an eventful one. I've learnt an incredible amount during this process, from simple programming tricks to full-blown solutions for consistently gathering huge amounts of data.

Hopefully you've been able to take something from this entry. I'd love to hear about similar issues that you've had, and how you overcame them.
