Relational databases have been around for a long time, developers tend to use them often, and provided feature set is familiar. It is enough to be familiar with SQL to use them. The design of relational databases is doing a great job in hiding the internals from users. Questions, like how data is stored on disk, how the write path really works underneath, is database caching, and how, are often too advanced for regular database usage.
We live in a data-intensive world, and in the last couple of years we have witnessed an expansion of social networks and IoT, everything is connected to the internet and everything emits and receives events. We see more and more requests to store unstructured data just for the case, and value emerging from it will come later on. Both the amount of data produced/consumed and data stored is rising exponentially. That further means that the huge percentage (around 90%) of it was created in just last two-three years.
The above implies that total processed data volume (kept for only a short period of time), the total amount of data stored, and the versatility of data type/structure is ever increasing figures. All this renders the traditional relational databases in this context not an optimal choice, but even an impossible choice.
Those are just some of the reasons why we need a shift in thinking about storage and why a single database storage implementation is not an option any longer.
In the NoSQL world, things are a bit different. Databases are not general-purpose, they are often built to solve a specific use case. They work great with that use case but not so good with other problems. They are often a choice for Big Data projects where we have high non-functional requirements and it is almost impossible to work with NoSQL databases without knowing the internals.
We have four major types in the NoSQL space: document databases, key-value stores, column families, and graphs.
Document databases are probably the most popular ones, mainly because of MongoDB which was one of the first NoSQL databases to enter the top 10 storage engines on DB-Engine rankings. MongoDB is rich in features, has master/slave replication with built in sharding and uses memory mapped files for data storage. It is great for prototyping because of flexibility when storing and querying data and it is a good substitute in most places where PosgreSQL or other RDBMS would fit in but predefined schema holds you back. Other databases of this type are CouchDB document store with master replication and eventual consistency and Couchbase which is both key-value and document store, provides JSON API which is useful as the storage engine for client-server apps.
When you think of a key-value store, you can think of the distributed implementation of Map. Redis and HazelCast are databases of this type. Redis is a blazing fast database written in C. It has master-slave replication with automatic failover. It is the best choice for rapidly changing data with a foreseeable database size (should fit mostly in memory). An example of usage could be session storage, shopping cart for e-commerce sites, and real-time analytics. HazelCast is an in-memory distributed storage and computing platform. Offering out-of-the-box distributed implementations for many APIs that Java developers are familiar with, such as Map, Set, List, Semaphore, Executor, etc., it can be seen as a plug-in replace for tools like EhCache, Redis, and JCache.
Column family databases contain columns of related data, stored as key-value pairs. The most popular databases of this type are Cassandra and HBase. Cassandra originated from Amazon’s DynamoDB and Google BigTable and it is great for storing huge amounts of data. All nodes are equal, it is shared-nothing architecture, masterless with no single point of failure. CQL (Cassandra Query Language) is similar to SQL. It is best used for web analytics, hit counting, transaction logging, and storage for sensor data. HBase is a part of the Hadoop framework and uses its HDFS file system as storage. It is best used for Hadoop’s Map/Reduce, analyzing log data and in any place where scanning huge, two-dimensional join-less tables is a requirement.
In graph databases, relations between two entities are more important than entities themselves. The main representatives of this type are Neo4J and JanusGraph. Neo4J is a graph database written in Java, it uses pattern-matching-based query language (“Cypher”) but can use “Gremlin” graph traversal language as well. It is best used for graph-style, rich or complex, interconnected data such as searching routes in social relations, public transport links, road maps, or network topologies. JanusGraph is a scalable graph framework optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. It can use various storage backends: Cassandra, HBase, and BerkeleyDB and provide great support in integration with mainstream frameworks (Hadoop, Spark, ElasticSearch, Solr).
Polyglot Persistence Approach
It is not uncommon these days to see a combination of relational and nonrelational databases within the same project, even a combination of a couple of different databases of the same type. Microservice architecture influenced this a lot since each small service is the owner of its data, everything is behind API, and the separation of services is based upon the use case, so it is only natural to choose the best storage for the job.
Here is an example of an application with multiple storage engines taken from the Martin Fowler blog. User session, as temporary fast data optimized for frequent reads, is placed in Redis. Financial data, which is naturally relational is stored in RDBMS together with reports. The shopping cart has a similar nature to the session, it is temporary fast access data so it is stored in key-value store (Riak this time). Neo4J is a graph database has great support for recommendation engines, user similarities are an important factor for giving recommendations to customers, so this is the prime reason for its choice. The product catalog has the nature of a document, there are a lot of dynamic criteria for querying this data and MongoDB is great with its flexibility and speed for this use case. Cassandra’s killer feature is time-series data, it has great integration with Spark and Solr so it is the best choice for analytics and user activity logs.
Polyglot persistence is not a free ride, it comes with a price and the price is the complexity of the system. Years ago it was enough to know one programming language and one storage engine in order to build a system, but nowadays you need to be a polyglot in every aspect. That’s why you need to be familiar with the cost of each choice and to introduce a new technology only when the benefits from it are much higher than the complexity it brings to the table. The scariest thing now when everybody talks about the polyglot approach, is to use the new technology just for the sake of it.
Because of the nature of NoSQL, which has been built to solve a specific use case, be prepared to dig a bit deeper and learn how a database functions underneath. The problems of Big Data and distributed systems are much different than single instance problems and being unfamiliar with the technology can lead to problems at a much bigger scale.
Do your homework, explore the possible solutions to your problem, make a decision and get familiar with the tool you choose – do not just follow the crowd..
Written by: Nikola Ivančević
June 9, 2021