Map rendering on EC2

Over the last two years I've been running the OpenCycleMap tileserver on Amazon's EC2 service. Plenty of other people do the same, and I get asked about it a lot when I'm doing consulting for other companies. I thought it would be good to take some time to say a bit about my experiences, and maybe this will be useful to you at some point.

EC2 is great if you need lots and lots of computing power, and your need for CPUs fluctuates. At its best, you have a task that needs hundreds of CPUs, but only for a few hours. So you can spin up as many instances as you like, do your task, and switch them back off again. Map rendering, and here I'm talking about mapnik/mod_tile rendering of OpenStreetMap data, initially seems to fit that use-case - generating map tiles involves lots of processing of the map data, and then you have your finished map images which are trivial to serve.

But that's not really the case, it turns out. After you've finished experimenting with small areas and start moving to a global map, you find that disk IO is by far the most important thing. There are two stages to the data processing - import and rendering. During import you take a 10GB OpenStreetMap planet file and feed it into PostGIS with osm2pgsql. You want to use osm2pgsql --slim (to allow diff updates), but that involves huge amounts of writing and reading from disk for the intermediate tables. It can take literally weeks to import. When you're rendering, renderd lifts the data from the database, renders it and writes the tiles back to disk, and then mod_tile reads the disk store to send the tiles to the client. All in all, lots of disk activity. And hugely more if you mention contours or hillshading.

Which wouldn't be too bad, except the disks on EC2 suck. It's not a criticism, since it's an Elastic Compute Cloud, not an Elastic Awesome-Disks Cloud. It's a system designed for doing calculations, not for reading and writing huge datasets to and from disk. So their virtual disks are much slower than you would like or expect from the rest of the specs. On the OpenCycleMap "large" EC2 instance, roughly one core is being used for processing, and the rest are all blocked on IO. Although it's marked as having "high" IO performance on their instance types page, I'd suggest for "moderate" and "high" you should read "dreadful" and "merely poor" respectively.

Amazon's S3 is their storage component of their Web Services suite. So instead of thrashing the disks on EC2, how about storing tiles on S3? It's possible, but the main drawback is that it makes it much, much harder to generate tiles on-the-fly. If you point your web app at an S3 bucket there's no way that I know of to pass 404s onto an EC2 instance to fulfil. If you're happy with added latency, then you could still run a server that queries S3 before deciding to render, and copy the output to S3, but I can't imagine that being faster than using EC2's local storage. You can certainly use S3 to store limited tilesets, such as limited geographical areas or a limited number of zooms. But pre-generating a full planet's worth of z18 tiles would take up terabytes of space, and only a vanishingly small number of tiles would ever be served.
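To put a rough number on that last point, here is a quick back-of-the-envelope sketch. It assumes the usual OpenStreetMap tile scheme, where zoom level z is a grid of 2^z by 2^z tiles. The average tile size is purely a placeholder of mine for illustration, not a measured figure, so treat the storage estimate as an order of magnitude at best.

```python
# Rough tile-count arithmetic for the standard slippy-map scheme,
# where zoom level z is a 2**z by 2**z grid of tiles.

# Hypothetical average tile size in bytes. This is an assumption for
# illustration only; real tiles vary a lot (empty sea vs. dense city).
ASSUMED_BYTES_PER_TILE = 1_000


def tiles_at_zoom(zoom: int) -> int:
    """Number of tiles covering the whole planet at one zoom level."""
    return 4 ** zoom


def tiles_through_zoom(max_zoom: int) -> int:
    """Total tiles for every zoom level from 0 up to max_zoom inclusive."""
    return sum(tiles_at_zoom(z) for z in range(max_zoom + 1))


z18_tiles = tiles_at_zoom(18)
print(z18_tiles)  # 68719476736

# Storage estimate in terabytes under the assumed tile size.
print(z18_tiles * ASSUMED_BYTES_PER_TILE / 1e12)
```

Even with a small assumed tile size, z18 alone lands in the tens of terabytes, which is why pre-generating everything only makes sense for limited areas or limited zooms.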

Finally, there is the cost of running a tileserver. Although Amazon are quite cheap if you want a hundred servers for a few hours, the costs start mounting if you have only one server running 24 hours a day - which is what you need from a tileserver or any other kind of webserver. $0.34 per hour seems reasonable until you price up the first four weeks' uptime, at which point all kinds of non-cloud providers come into play, where you simply pay monthly rent on a server instead. Factoring in bandwidth costs for a moderately well-used tileserver can make it mightily expensive. Any extras add to the bill too - EBS if you want your database to survive the instance being pulled, or S3 storage.
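The four-week figure is easy to work out. This small sketch uses only the $0.34 hourly rate quoted above; the function name is mine, and bandwidth, EBS and S3 charges are deliberately left out because I haven't given numbers for them here.

```python
from decimal import Decimal

# Hourly instance price quoted in the post, in US dollars.
HOURLY_RATE_USD = Decimal("0.34")


def uptime_cost(hourly_rate: Decimal, days: int) -> Decimal:
    """Cost of one instance running 24 hours a day for the given days.

    Instance time only: bandwidth, EBS and S3 are not included.
    """
    return hourly_rate * 24 * days


four_weeks = uptime_cost(HOURLY_RATE_USD, 28)
print(four_weeks)  # 228.48
```

That is the number to compare against a monthly rental quote for a dedicated server, before any of the extras are added on.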

EC2 is, more or less, exactly not what you want from a tileserver. Expensive to run, slow disks. So why is it popular? First off is buzzwords - cloud, scalable and so on. If you aren't careful you can easily empty the piggybank on running a handful of tileservers long before you're running enough to do proper demand-based scaling that changes from hour to hour during the day. If you're trying to "enterprise" your system you'll worry about failovers long before you need such elastic scaling, and you need your failovers and load balancers running 24x7 too. Second is capacity planning - if you want to do no planning whatsoever, then EC2 is great! But it's much cheaper to rent a few servers for the first couple of months, and add more to your pool when (if?) your tileserver gets popular. But there is a third reason that is quite cool - for people like Development Seed with TileMill - you can give your tileserver image to someone else extremely easily, and it's their credit card that gets billed, and they can turn on and off as many servers as they like without hassling you.

I've been setting up a new tileserver for OpenCycleMap that's not on EC2, and I'll post here again later with details of how I got on. I'm also working on another couple of map styles - with terrain data, of course, and if you're interested in hearing more then get in touch.

So in summary: slow disks and round-the-clock running costs make EC2 a poor fit for a tileserver.

Any thoughts? Running a tileserver on EC2 and disagree? Let me know below.

This post was posted on 5 July 2010 and tagged OpenStreetMap