How to build a scalable crawler on the cloud that can mine thousands of data points, costing less than a Tim Hortons box of Timbits a month.

Motivation and some context

As someone who just moved from Rio de Janeiro (Brazil) all the way to Vancouver (Canada), the first thing that hit me right in the face (aside from the beautiful scenery and the Tim Hortons) was the rental prices. As some of you may know, Vancouver is consistently ranked among the top 5 most expensive cities in the world to live in. It is well known that the rental price of a property is a function of how expensive it is to actually own and mortgage that same property (a.k.a. the price-to-rent ratio).

With that in mind, I decided to start a little side project that could mine a decent number of housing listings and crunch the data so that I could come up with my own conclusions about the current real estate market in Vancouver. Truth be told, there's a bunch of well-formatted data living on these listing websites, so why not just go ahead and grab it? This is how this project was born.

This post will walk you through the architecture, costs, pros and cons, and more of the first crawler I've built using no servers at all, living 100% on the cloud, using only AWS services. The code is open on GitHub if you want to check it out.

Wait, did you say ‘No Server’?

Sure enough, everything you run on the cloud is backed by servers at the end of the day. What I mean by serverless is that you won't have to actually maintain any server or virtual machine of any kind yourself. The trick behind it is to build your architecture around cloud-native services such as AWS Lambda, DynamoDB, RDS MySQL and CloudWatch, and make them work together in a clever way.

Shall we start?

Project Architecture

[Architecture diagram: ServerlessCrawler_Diagram]

In case you're not familiar with what these services are, I will summarize them for you:

  • AWS Lambda: Short-lived functions that run on the cloud. Whenever these are invoked (or triggered) they spin up, run the code you deployed to them, and shut down as soon as they're done running. You only pay for the seconds each function is actually doing something, meaning there's no idle state where the function sits there doing nothing while you pay for it (unlike EC2).
  • DynamoDB: Fully managed NoSQL database on the cloud. You can feed it JSON records and they will be stored for you (on a server you won't have to maintain). You can easily scale your read/write throughput on the fly in seconds, and as of early 2017 it supports a Time To Live (TTL) mechanism, allowing your objects to be deleted automatically once their TTL is reached (see the sketch after this list).
  • RDS MySQL: Fully managed MySQL database on the cloud. Scale up/down and take backups as you wish. AWS recently announced a new start/stop feature that allows you to keep your instance stopped for up to 7 days in a row (while you pay only for the instance's volume, instead of paying for its instance hours as well).
  • CloudWatch: Logs on the cloud. You basically get this for free, since every log message emitted from Python on Lambda goes straight to a CloudWatch stream.
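
To give a feel for how these pieces fit together, here is a minimal sketch of writing a record to DynamoDB with a TTL attribute. The table and attribute names are hypothetical, not taken from the repo:

    import time
    import boto3

    # Hypothetical table name; the real table names live in the project's config.
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("listing_urls")

    # Insert a record whose TTL attribute is set 24 hours in the future.
    # DynamoDB's TTL collector deletes it automatically once that epoch
    # timestamp is in the past (assuming TTL is enabled on "expires_at").
    table.put_item(
        Item={
            "url": "https://example.com/listing/123",  # partition key
            "captured_at": int(time.time()),
            "expires_at": int(time.time()) + 24 * 60 * 60,
        }
    )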

Project Goals

Starting off this project I had a few goals in mind, and I improvised as I went along. The ideal project, to me, would have to:

[Image: a box of Timbits. How can you not love this?]

  • Be fully managed by AWS on the Cloud and require no server
  • Scale up and down according to load (elastic)
  • Be capable of processing tens of thousands of listings to start with
  • Cost less than a box of Timbits (which currently costs 4.80 CAD for 40 pieces)

Costs Breakdown

You can safely rely on the Lambda and CloudWatch free tiers for this project, making them totally free (unless you run this constantly, non-stop, in which case the bill will come).

For the storage layers (DynamoDB and RDS MySQL) you will be paying under 3 bucks a month, since you can stop your RDS database for up to 7 days in a row and scale your DynamoDB tables down to 1 read and 1 write capacity unit when you're not using them. This brings your total cost to an estimated 2.40 USD a month (check my documentation for a more detailed breakdown). A sketch of those two cost levers follows.
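
A minimal boto3 sketch of the two cost levers mentioned above; the table and instance identifiers are placeholders, not the real names from the repo:

    import boto3

    dynamodb = boto3.client("dynamodb")
    rds = boto3.client("rds")

    # Scale the table down to the minimum of 1 read / 1 write capacity unit
    # while the crawler is idle (hypothetical table name).
    dynamodb.update_table(
        TableName="listing_urls",
        ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
    )

    # Stop the RDS instance; AWS restarts it automatically after 7 days.
    rds.stop_db_instance(DBInstanceIdentifier="crawler-mysql")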

The Journey

From start to finish the whole project took me about 19 hours of work. Sure enough, your mileage may vary according to your previous knowledge of AWS and Python (I am fairly familiar with both, just not with the DynamoDB and Lambda services specifically).

The setup of Lambda functions takes time to get used to, and it's definitely sub-par compared to other AWS services when it comes to usability and metrics. Once you get used to the whole Lambda development dance (edit Python files locally -> create a .zip package -> upload it to replace your Lambda function -> save and test), it gets better. The integration with CloudWatch is definitely a plus (and it comes for free, really), and it comes in handy when you're trying to understand why your Lambda failed after that HTTP request, or during that other loop you forgot to indent properly. Making use of environment variables, adjusting function resources/timeouts and enabling/disabling triggers for testing works smoothly, blends in really well and doesn't require you to redeploy your functions. I also noticed that the spin-up of the Lambda functions is really fast, with almost no noticeable delay (I assume they're using some sort of smart-cached ECS under the hood, but I wouldn't know).
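
As an illustration of that workflow, here is a minimal handler sketch that reads its configuration from environment variables and logs to CloudWatch; the variable names are my own placeholders, not the ones used in the repo:

    import logging
    import os

    # Anything logged from a Lambda ends up in a CloudWatch log stream for free.
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    # Hypothetical variable names; set them in the function's configuration
    # instead of baking credentials into the .zip package.
    DB_HOST = os.environ["DB_HOST"]
    DB_USER = os.environ["DB_USER"]

    def lambda_handler(event, context):
        records = event.get("Records", [])
        logger.info("Received %d records", len(records))
        # ... crawl / parse / insert ...
        return "ok"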

Setting up DynamoDB tables couldn't be easier, really. We're talking about a one-prompt setup, where you only have to fill in two boxes (your table name and the partition key for your table), and that's it. Configuring TTL for each table works just fine, but you can't toggle it on and off frequently (DynamoDB prevents this, probably to avoid abuse, since TTL deletes your records without charging you for those operations). Inserting DynamoDB records manually into each table (for testing purposes) works perfectly, and each insert (or batch) triggers the Lambda functions instantly, with little to no delay. Tweaking each table's capacity (read and write units) up and down was a breeze and lets you adjust them on the fly, with only a few seconds of delay to apply the new configuration.
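
I only used the console for this, but the equivalent API calls are just as short. A sketch with hypothetical table and attribute names:

    import boto3

    client = boto3.client("dynamodb")

    # Create a table with a single partition key (hypothetical names).
    client.create_table(
        TableName="listing_urls",
        AttributeDefinitions=[{"AttributeName": "url", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "url", "KeyType": "HASH"}],
        ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
    )

    # Enable TTL on an epoch-timestamp attribute.
    client.update_time_to_live(
        TableName="listing_urls",
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
    )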

Configuring RDS MySQL was definitely easier than Lambda, but it has more steps than DynamoDB. You get more options too, as you can pick the instance type, volume sizes and types, redundancy, maintenance windows, backup retention periods, etc. Once you set it up, you'll have your shiny MySQL instance ready to rock in about 10 minutes, give or take.
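
Once the instance is up, the Lambdas connect to it like any other MySQL server. A minimal connection sketch, assuming pymysql is bundled into the deployment package and the credentials come from the (hypothetical) environment variables mentioned earlier:

    import os
    import pymysql  # not in the Lambda runtime, so it ships inside the .zip

    connection = pymysql.connect(
        host=os.environ["DB_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
        connect_timeout=5,
    )

    with connection.cursor() as cursor:
        cursor.execute("SELECT VERSION()")
        print(cursor.fetchone())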

After the setup and test phase ended, I had a moment of contemplation as the listings made their way into MySQL, while I could sit back, relax and have a beer while the capture happened. Or maybe 3 beers. Maybe take a nap? Shit, this thing is slow.

Rough Edges

Performance was never my goal here (tinkering with the available technologies and building something cool was), but DAMN, I didn't expect it to be this slow. In the end, it was able to capture around 11,000 listings every 6 hours (which translates to about one listing every ~2 seconds). I've written distributed crawlers with rates easily thirtyfold faster than this (maybe not as exciting, though).

Sure, each HTTP request for a page takes between 0.7 and 1.1 seconds to return on average. Factor in the time it takes to spin up each Lambda container, plus connecting to MySQL across the wire and inserting each record, and you have your 2 seconds right there. Each Lambda receives a batch (or stream) of 5 DynamoDB records, so the average lifespan of each Lambda function was about 7 seconds (for the parsing Lambdas).

A few optimizations could be made here: performing the HTTP requests for each batch in parallel and doing batch inserts into MySQL, as sketched below.
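
A rough sketch of what that could look like, assuming a parse_listing helper and a listings table with made-up column names; none of this is in the current code:

    from concurrent.futures import ThreadPoolExecutor

    import requests

    def fetch(url):
        # Pages take ~1 second each; fetching the batch of 5 concurrently
        # hides most of that latency.
        return requests.get(url, timeout=10).text

    def parse_listing(html):
        # Placeholder for the real parser: return (url, price, bedrooms).
        raise NotImplementedError

    def process_batch(urls, connection):
        with ThreadPoolExecutor(max_workers=len(urls)) as pool:
            pages = list(pool.map(fetch, urls))

        rows = [parse_listing(page) for page in pages]

        # One executemany() round trip instead of one INSERT per listing.
        with connection.cursor() as cursor:
            cursor.executemany(
                "INSERT INTO listings (url, price, bedrooms) VALUES (%s, %s, %s)",
                rows,
            )
        connection.commit()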

Speaking of parallelism, the biggest cold shower for me was the fact that Lambda does not scale horizontally that well here. In my head, every batch inserted into Dynamo would immediately trigger one Lambda function to process it, meaning that Lambda would always be keeping up with the pace of inserts on Dynamo, and I would have tens of Lambda functions running at any given time, all in parallel, beautifully. I was wrong.

What actually happens is that Lambda has a limit of concurrent executions that is tied to (among other things) how many shards your DynamoDB table's stream has. Since my table had only one shard, there was only one Lambda function running at any time (terrible, I know). What ended up happening is that, even though the inserts into one of the DynamoDB tables were done in a couple of minutes, the second layer of Lambdas was triggered slowly, one after the other, as if there were some sort of internal queue storing my Dynamo stream records and feeding them to Lambda one by one, serializing my execution (instead of parallelizing it).

Another point worth mentioning is that every change to a DynamoDB table's content will trigger the Lambda functions listening to its stream. The catch here is that these changes may not only be inserts, but also updates, and the deletes triggered when the TTL collector kicks in and starts wiping your set-to-expire records. Luckily, each record in a DynamoDB stream carries an attribute that tells you whether that object was inserted, updated or deleted. In my case, I was receiving everything (because there's no way to configure Lambda otherwise), but only processing the inserts, as shown below.
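
A minimal sketch of that filter inside the stream-triggered handler; the item attribute name is hypothetical:

    def lambda_handler(event, context):
        # The stream delivers INSERTs, MODIFYs and REMOVEs (including the
        # deletes produced by the TTL collector); only INSERTs get processed.
        for record in event.get("Records", []):
            if record["eventName"] != "INSERT":
                continue
            new_image = record["dynamodb"]["NewImage"]
            url = new_image["url"]["S"]  # hypothetical attribute name
            # ... fetch and parse the listing ...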

Pros and Cons

Pros:

  • Cheap
  • Fully managed / serverless
  • Bleeding-edge technology (?)
  • Flexible infrastructure
  • If you find a bug, you can change your Lambdas on the fly to fix every following batch

Cons:

  • Slow
  • Once it starts, you can't pause it and restart from where it left off
  • Only possible to tweak so far (code-wise changes)
  • Testing specific parts requires you to constantly disable/enable Lambda triggers

Final Verdict

Despite the initial appeal, in its current state I wouldn't recommend this architecture for something that requires performance and the flexibility to easily change the architecture and tweak more than just the code that's running. On the other hand, this setup is cheap, and for something small it works just fine. It may not be the easiest to set up, but once you're past that part, the maintenance is roughly zero.

I had fun, for sure, writing this and gluing all these pieces together to build this little Frankenstein, and I would do it again. I still checked all the boxes of my initial goals for this project, but yes, performance could be (way) better.

In the end, I managed to download data on over 40k listings by running this process a few times. With that in hand, I plan on writing the code to crunch this data, but as of now, this is still a WIP.

I can only thank you if you made it this far. I've put together a guide on how to set it up on your own AWS account, and since the code is open source anyway, go hack it!

Feel free to reach out to me through any of the contacts on my personal page, in case you have any questions or simply want to chat.

See you on the next one 🙂
