14 Dec 2010

Why should you use a non-relational database?

I wrote this a while ago, and I'm not sure I still agree with everything I said.

Sometimes I hear people asking what non-relational databases are and what they can be used for. A common answer is that relational databases are not flexible enough, or are difficult to work with in some cases. I believe that non-relational databases are becoming increasingly popular not (only) because they offer more flexible ways to store data, but mostly because they can provide a performance boost.

Let’s start with what’s wrong with relational databases from a scaling point of view. It is difficult to split them across multiple servers. Most of the time the ‘relational’ part only works on rows within the same database, hence the same server. So when a single server isn’t enough, ‘sharding’ is typically used. Sharding means that an algorithm decides which database a row (in the case of a relational database) should be stored in, depending on its ID for example. There are multiple algorithms for this. A basic, impractical one would be to store a row in database A if its ID is even, and in database B otherwise. This simplistic algorithm is only good for explaining the concept of sharding. There are better ways to do this, such as consistent hashing, and in many cases you can exploit the nature of your data to shard it efficiently. So far so good, but as you can imagine, if rows are stored in multiple databases, relational operations become more complicated and less efficient. It turns out that in many cases you can simply do without them. Sometimes that involves a bit of redundancy in your data, but you can still live without them.
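To make the two sharding ideas above concrete, here is a minimal sketch in Python. It shows the toy even/odd rule and a small consistent-hash ring. The names (parity_shard, HashRing, the "db-a" style server names) and the choice of MD5 with 100 virtual nodes per server are my own assumptions for illustration, not something a particular database prescribes.

```python
import bisect
import hashlib


def parity_shard(row_id: int) -> str:
    """The toy scheme: even IDs go to database A, odd IDs to database B."""
    return "A" if row_id % 2 == 0 else "B"


def _hash(key: str) -> int:
    # Use a stable hash; Python's built-in hash() for strings changes
    # between processes, which would scatter keys after a restart.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, servers, replicas=100):
        self._replicas = replicas
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> server name
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        # Each server is placed at several positions to even out the load.
        for i in range(self._replicas):
            point = _hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def lookup(self, key: str) -> str:
        # The owner is the first point clockwise from the key's hash,
        # wrapping around to the start of the ring.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["db-a", "db-b", "db-c"])
print(parity_shard(42), ring.lookup("user:42"))
```

The point of the ring is what happens when you add a server: only the keys that now belong to the new server move, while everything else stays where it was. With the even/odd rule (or any plain modulo), adding a third database would reshuffle most rows.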

I believe relational databases are great and very useful in concept, but if scaling forces you to give up the ‘relational’ part, then why would you use one? Well, one reason would be that relational databases have been around for a long time and are known to be reliable. But I think the recent hype around non-relational databases happened because people realized they had to do without relations in order for their applications to scale well, rather than because of the new features these databases provide. In fact this is old news. Google stores everything in a non-relational way across multiple servers. But since people started using Rails and other frameworks that make it so easy to use relational databases, it seems the collective consciousness sort of forgot about non-relational databases for a while. Now the topic of scaling and performance seems to be back in the spotlight.

When you leave the world of relational databases, you get different options. There is more than one approach to non-relational databases. Some, such as MongoDB or CouchDB, are so-called ‘document-oriented’ databases, which really means storing data in some sort of JSON-like tree form. Some people really like this, but I haven’t yet come across a case where I found it very useful. The other approach, the one I prefer, is key-value storage. You specify a key, a value of some kind, and that’s about it. My personal favorite is Redis, which has some useful value types and operations. Since key-value storage is so elementary, it is really easy to shard data, hence it scales very well across multiple servers (though Redis has a few ‘relational’ operations such as unions and intersections which can’t be sharded). It is fast, simple, and your code deals with structuring data.
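Here is a rough sketch of why plain key-value access shards so easily while a union does not. It uses in-memory dictionaries as stand-ins for servers rather than a real Redis client, and the ShardedKV class, its method names, and the CRC32 modulo placement are all hypothetical choices made to keep the example short.

```python
import zlib


class ShardedKV:
    """Toy key-value store spread over several in-memory 'servers'."""

    def __init__(self, n_servers: int):
        # Each dict stands in for one independent server.
        self._servers = [dict() for _ in range(n_servers)]

    def _server_for(self, key: str) -> dict:
        # crc32 is stable across runs; plain modulo placement for brevity.
        index = zlib.crc32(key.encode("utf-8")) % len(self._servers)
        return self._servers[index]

    def set_add(self, key: str, *members: str) -> None:
        # A single-key write touches exactly one server.
        self._server_for(key).setdefault(key, set()).update(members)

    def set_members(self, key: str) -> set:
        # A single-key read also touches exactly one server.
        return set(self._server_for(key).get(key, set()))

    def set_union(self, *keys: str) -> set:
        # The keys may live on different servers, so no single server
        # can answer this. Fetch each set and merge on the client side.
        result = set()
        for key in keys:
            result |= self.set_members(key)
        return result


store = ShardedKV(3)
store.set_add("followers:alice", "bob", "carol")
store.set_add("followers:dave", "carol", "erin")
print(sorted(store.set_union("followers:alice", "followers:dave")))
```

Reads and writes on one key only ever need one server, which is why they scale so well. The union is the odd one out: it has to pull data from several places and combine it in your own code, which is the ‘your code deals with structuring data’ trade-off in practice.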
