How to choose a database in 2018
I met a friend for lunch the other day and we were chatting about the jungle of databases and data storage solution out there- and what a jungle it is: At the time of writing DB Engines Ranking lists more than 340 actively developed and up to date databases to choose from - all of which essentially present slightly different solutions to the same problem: storing and retrieving text-based information. To help you navigate this jungle, this article tries to offer a simple framework for deciding which database is right for your project.
What do we mean by database?
For this article, we talk about Database-Management-Systems (the correct term) that store text-based information on a Harddrive/SSD for immediate (a couple of seconds) access.
That means we are NOT talking about:
- Pure in-memory caches such as Memcached that loose data as soon as the process stops
- File or other means of storage, such as AWS S3 or Azure Managed Disks
- Long-term archives and data-warehousing solutions such as AWS Glacier
So how do you choose the right database for your project?
Proprietary or Open Source?
Let's start with a mundane business question: Are you prepared to pay for your DB? This doesn't involve payment for cloud-based offerings (we'll talk about those in the next paragraph), but licenses for self-hosted on-premise DBs.
But wait - aren't there fantastic open-source databases out there that you can just use for free? Well - yes, but many companies prefer the peace of mind that comes with a 99.999% uptime SLA and a support number one can call at five in the morning. In fact, four of the top ten most popular databases come with a commercial license (Oracle, Microsoft SQL Server, IBM DB2 and Microsoft Access), including the number one spot (Oracle).
Cloud or Self Hosted?
Running a database is dead-easy: you just click an executable or start a service and your database server is humming along nicely. But actually RUNNING a database in production is really hard. Sharding, creating read and write replicas, continuous backups, updates without downtime - phew - I guess DB-Admin is a job profile for a reason.
If this is not for you, you might be best of using a hosted database as part of a cloud-service instead. Here again, we need to distinguish between two categories - let's call them "managed" and "serverless":
A managed DB is a cloud service that offers you an often popular open source DB as a fully managed service. AWS RDS is an example, running MySQL/MariaDB/Postgres etc. instances. The problem here is that you still need to make decisions such as "how big a database/how many instances do I need?", "how often do I want to take backups and where do I store them?" and so on.
If you don't want to do that there's now an ever-growing offering of "serverless" or "as-a-service" databases. These are proprietary creations of cloud providers that let you simply connect to an endpoint and store your data. AWS DynamoDB, Azure CosmosDB, and Google Firebase are examples of this.
Now for what's probably the most important question: Which model fits your data-structure best? Your options are:
Relational / Table
Relational Databases store rows of data in tables with multiple columns, each with their own datatype (text, number, date etc). Each row is identified by a unique id, called a "primary key". Relations between data are specified by referencing this key from other rows.
Data is usually created and read using an expressive "Structured Query Language" (SQL) that comes with a rich vocabulary of features.
Why choose a relational DB?
Relational databases are the most common and widely used kind - and for good reasons. The model, albeit inflexible, is solid, avoids duplication of data and enables developers to work with save and predictable results. In addition, this category holds many of the most mature databases available, such as PostgreSQL or MySQL - and maturity and the resulting stability and security is probably the most important factor when choosing any database.
Why not to choose a relational DB?
The pre-defined table structure of a relational database is usually fairly static. While it is possible to modify columns of existing tables this does come with some overhead. Likewise, each column can only store a certain type of data. This helps the Database Management System reduce storage space and avoid duplication, but proponents of the NoSQL movement (see next paragraph) argue that storage space has become so cheap and abundant that this tradeoff for flexibility doesn't make much sense anymore.
Object / Document / NoSQL Database
Some exceptions aside, these databases store all information related to a given object (e.g. the user of an application or an album in a record catalog) in a single, loosely structured blob of JSON data, referred to as an "object" or a "document". Multiple documents of the same type are arranged in "lists" or "collections". Object Databases vary widely in how to create, read and search for data, ranging from RESTful APIs (CouchDB) to functional programming concepts such as Map-Reduce (MongoDB).
Why choose an object DB?
A few years back, the NoSQL hype swept like a wave through the usually so serene database landscape. NoSQL wasn't a concept, it as a movement: "Down with relational databases!" was all the rage. Today, countless DevOps emergencies and overly hasty migrations later, NoSQL has found its rightful place among storage solutions - as a great tool for the right task, rather than a panacea to cure all diseases. The flexibility that comes with just throwing differently structured documents into the same collection, not having to worry about datatypes and strong relations and leaving it to the DBMS to sort it all out are great ways to make a developer's life easier.
Why not to choose an object DB?
A bit of chaos can be nice - but an object-DB's lack of enforcement needs to be offset by developer-discipline and vigilance - not an easy requirement in large companies. Likewise, highly relational data is still better stored in a relational or graph database, only a subset of use cases with fairly isolated datasets are good fits for the document/collection model.
Graph DBs offer another way of thinking about the relationship within your data and can offer a great compromise between strict tables and loose documents. In a graph database each entity (node) is an isolated document listing a set of nestable properties. These nodes are connected by "edges" that specify their relationship. Edges can have properties in their own right. That might sound complicated, but think of it this way: A statement like "John has four red apples" can be broken down into John (node A) has (edge) four( edge property) red(node B property) apples (node B)"
Most GraphDBs come with rich toolsets to query, traverse and evaluate complex networks of nodes and edges, e.g. to find all connected nodes, the node with the most connections, the nearest neighbor and so on.
Why choose a GraphDB?
GraphDBs are a great way to structure highly interconnected data. If you are building a payment system with money changing hands between many accounts in individual transactions a GraphDB will significantly reduce the complexity of e.g. tracing a single payment through a number of steps.
Why not to choose a GraphDB?
GraphDBs can come with quite a bit of overhead for simpler use-cases. Especially linear data-series, e.g. a log of sessions, are poorly suited for a graph model. Furthermore thinking in graphs is probably the most challenging model in this list, putting additional strain on the developers.
Sometimes, all you need is a quick way to store some data, identified by a key - and Key-Value stores let you do just that. Their simplicity makes access very fast which makes them a great fit for non-relational data - think e.g. an URL shortener like bit.ly that just has to look up the original URL for a shortcode without any need for elaborate querying.
When to use a key-value store?
If your data's simplicity permits it, a key-value store can help you scale quickly and cheaply.
When not to use a key-value store?
If your data is relational, has complex structures or requires querying and searching.
Wide Column Stores
Imagine a key-value store with a second dimension - or, better yet, an infinite excel spreadsheet with any number of dynamic columns and rows. Wide column stores make it quick to find rows, columns or their intersection and allow to quickly write or read massive amounts of data.
When to use a wide-column store?
If you could imagine organizing all your data in one giant excel sheet. A WCS' simple data model allows it to scale well and makes replication and sharding easy.
When not to use a wide-column store?
Whilst some stores such as Apache Cassandra support basic querying, higher level functions such as aggregates (sums, counts, averages) might still be off limits. Likewise many wide column stores provide eventual rather than strong consistency, making them a poor choice for transactional applications.
Time-series DBs do one thing and do it well. They store linear data series with two dimensions, usually time and one or more values. Time-series are frequently used for server monitoring or financial recordings but are by no means general purpose solutions.
Why choose a time-series DB?
If you have to efficiently store and query timeseries data. If there are only some time series aspects to your app, a relational DB might suffice. Many time series DBs also offer useful higher level functions, such as alerts when a certain value is breached.
Why not to choose a time-series DB?
If your data is not a time-series or you only have to store a small amount of time-series data as part of a larger app.
What else is there?
This covers the major categories of databases. There are however a number of more specialized types, e.g. search engines that replicate an underlying DB's data such as Elasticsearch or special format stores for XML or YAML. In addition, there is a number of multi-concept databases that e.g. combine relational and graph data or allow dynamic table structures that create almost object-like flexibility. Depending on your requirements they might present a good alternative.
What else to consider?
Once you've decided on commercial/open-source, cloud or on-premise and your data-model, there are still a few things to take into account:
Scaling and consistency
Many databases offer "strong" consistency - a guarantuee that all writes are completed and all your data is up to date at the time of reading - even if some nodes within a DB-Cluster fail or other chaotic events occur. This comes with a significant performance overhead, but for many applications, it is a strict necessity. To find out if a database satisfies your consistency requirements you might want to look for its "ACID compliance" (Atomicity, Consistency, Isolation, Durability). For a more in-depth test and understanding of its behavior in failure situations have a look at its related Jepsen report.
Realtime Queries, Alerting, Triggers and other features
Many databases offer additional features beyond the pure storage and retrieval of data. Realtime querying (queries that update their resultset as the underlying data changes) are e.g. offered by RethinkDB, others such as Elasticsearch or Algolia provide complex full-text search, including word-stemming and synonyms and even others such as PostgreSQL (the author's favorite :-) are effectively fully programmable data-environments, complete with trigger functions and pub-sub messaging.
Where to go from here?
Armed with the above I'd recommend deciding on your criteria - and once you have to head over to DB Engines Ranking which lists every imaginable database, scored by its popularity. Once there, pick the one or two most popular options that match your criteria and take your project to the stars!