Ashutosh Bijoor Jan 5, 2023 12:00:00 AM 7 min read

How to choose a NoSQL database by looking at technology capabilities and limitations

The big data technology landscape offers so many choices that the most critical step in successfully implementing a solution is often choosing the right platform: one that addresses the requirements of the problem at hand and is sustainable in the long term.

With so many choices, though, a feature-by-feature comparison of every individual platform is too complex and time-consuming. However, it is possible to group these options based on the data models they support.

The following technology comparison matrix compares the predominant data models: their capabilities, typical applications, and limitations.

DATA MODEL: KEY-VALUE. EXAMPLES: BERKELEY DB, MEMCACHEDB, REDIS, DYNAMODB

Capabilities

The simplest model where each object is retrieved with a unique key, with values having no inherent model

Many engines use in-memory storage to provide fast access, with optional persistence

Other data models are built on top of this model to provide more complex objects

Applications

Applications requiring fast access to a large number of objects, such as caches or queues

Applications that require fast-changing data environments like mobile, gaming, online ads

Limitations

Cannot update a subset of a value

Does not provide querying

As the number of objects becomes large, generating unique keys could become complex
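
As a rough illustration of the key-value model described above, here is a minimal in-memory sketch. The `KeyValueStore` class and its method names are hypothetical and do not correspond to any particular engine's API. Because the store treats each value as an opaque blob, changing one field means reading the whole value, modifying it, and writing it back.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical; not a real engine's API)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Values are stored as opaque bytes; the store never looks inside them.
        self._data[key] = value

    def get(self, key):
        # Lookup is by exact key only; there is no query on value contents.
        return self._data.get(key)


store = KeyValueStore()
store.put("user:42", json.dumps({"name": "Asha", "score": 10}).encode())

# There is no partial update: read the whole value, change it, write it back.
profile = json.loads(store.get("user:42"))
profile["score"] += 5
store.put("user:42", json.dumps(profile).encode())

updated = json.loads(store.get("user:42"))
print(updated["score"])
```

Real engines may offer richer value types or atomic operations, so treat this only as a picture of the basic model.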

DATA MODEL: DOCUMENT-ORIENTED. EXAMPLES: MONGODB, COUCHDB, APACHE SOLR

Capabilities

Extension of a key-value model, where the value is a structured document

Documents can be highly complex, hierarchical data structures without requiring pre-defined “schema”

Support queries on structured documents

Search platforms are also document-oriented

Applications

Applications that need to manage a large variety of objects that differ in structure

Large product catalogs in e-commerce, customer profiles, content management applications

Limitations

No standard query syntax

Query performance not linearly scalable

Join queries across collections not efficient
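
The document model can be pictured with plain Python dictionaries standing in for stored documents. The sketch below is an assumption-laden illustration, not any engine's query API: `get_path` and `find` are hypothetical helpers, and the query is a full scan with no index.

```python
def get_path(doc, path):
    """Follow a dotted path such as 'specs.ram_gb' into nested dicts."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, path, value):
    """Return documents whose field at `path` equals `value` (full scan)."""
    return [doc for doc in collection if get_path(doc, path) == value]


# Documents in one collection need not share a structure.
catalog = [
    {"_id": 1, "type": "laptop", "specs": {"ram_gb": 16, "cpu": "8-core"}},
    {"_id": 2, "type": "shirt", "size": "M", "colors": ["blue", "white"]},
    {"_id": 3, "type": "laptop", "specs": {"ram_gb": 8}},
]

matches = find(catalog, "specs.ram_gb", 16)
print([doc["_id"] for doc in matches])
```

The query reaches into a nested field that only some documents have, which is the flexibility the capabilities list describes.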

DATA MODEL: COLUMN-ORIENTED. EXAMPLES: CASSANDRA, BIGTABLE, HBASE, APACHE ACCUMULO

Capabilities

Extension of the key-value model, where the value is a set of columns (column-family)

A column can have multiple time-stamped versions

Columns can be generated at run-time and not all rows need to have all columns

Applications

Storing large volumes of time-stamped data such as event logs and sensor data

Analytics that involve querying entire columns of data such as trends or time series analytics

Limitations

No join queries or sub-queries

Limited support for aggregation

Ordering is done per partition, specified at table creation time
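
A small sketch may help picture the column-family structure described above: a row key maps to columns, and each column holds time-stamped versions. The `ColumnFamily` class is hypothetical and ignores partitioning, persistence, and everything else a real engine does.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy column-family table (hypothetical; not a real engine's API)."""

    def __init__(self):
        # row key -> column name -> list of (timestamp, value) versions
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        versions = self._rows[row_key][column]
        versions.append((timestamp, value))
        # Keep versions ordered by timestamp so the newest is last.
        versions.sort(key=lambda version: version[0])

    def get(self, row_key, column, timestamp=None):
        """Return the latest value, or the latest at or before `timestamp`."""
        versions = self._rows.get(row_key, {}).get(column, [])
        if timestamp is not None:
            versions = [v for v in versions if v[0] <= timestamp]
        return versions[-1][1] if versions else None

    def columns(self, row_key):
        """List the columns that actually exist for one row."""
        return sorted(self._rows.get(row_key, {}))


table = ColumnFamily()
table.put("sensor-1", "temperature", 21.5, timestamp=100)
table.put("sensor-1", "temperature", 22.0, timestamp=200)
table.put("sensor-2", "humidity", 40, timestamp=150)  # a column sensor-1 lacks

print(table.get("sensor-1", "temperature"))
print(table.get("sensor-1", "temperature", timestamp=150))
print(table.columns("sensor-2"))
```

Columns are created on first write, and each row carries only the columns written to it.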

DATA MODEL: GRAPH-ORIENTED. EXAMPLES: NEO4J, ORIENTDB, APACHE GIRAPH, ALLEGROGRAPH

Capabilities

Models graphs consisting of nodes and edges with properties (meta-data) describing them

Implement very fast graph traversal operations

Also support indexing of metadata, enabling graph traversal combined with search queries

Applications

Applications that deal with objects with a large number of inter-relations

Applications such as social-network friend graphs, hierarchical role-based permissions, complex decision trees, maps, and network topologies

Limitations

Difficult to scale to large data sets for generic graphs

Giraph uses the Bulk Synchronous Parallel model to overcome some of the scalability limitations
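
To make the graph model concrete, the sketch below stores edges with properties and runs a breadth-first traversal filtered by an edge property. It is a single-process illustration with hypothetical function names (`build_adjacency`, `reachable`), not the query language or API of any graph database.

```python
from collections import deque

# Edges carry properties (metadata), here the relationship type.
edges = [
    ("alice", "bob", {"type": "friend"}),
    ("bob", "carol", {"type": "friend"}),
    ("carol", "dave", {"type": "friend"}),
    ("alice", "erin", {"type": "colleague"}),
]


def build_adjacency(edge_list):
    """Index undirected edges by node so each hop is a dictionary lookup."""
    adjacency = {}
    for source, target, props in edge_list:
        adjacency.setdefault(source, []).append((target, props))
        adjacency.setdefault(target, []).append((source, props))
    return adjacency


def reachable(adjacency, start, max_depth, edge_type):
    """Breadth-first traversal following only edges of `edge_type`.

    Returns a dict mapping each reached node to its distance from `start`.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour, props in adjacency.get(node, []):
            if props["type"] == edge_type and neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    del seen[start]
    return seen


adjacency = build_adjacency(edges)
print(reachable(adjacency, "alice", max_depth=2, edge_type="friend"))
```

A "friends of friends" lookup like this follows edges directly from node to node, which is the kind of operation the graph model is built around.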

DATA MODEL: RELATIONAL. EXAMPLES: MYSQL, POSTGRESQL, MARIADB, ORACLE

Capabilities

Conventional RDBMS structure consisting of a fixed schema with ACID properties

Provides well-documented and widely supported SQL syntax

Capable of complex queries including sub-queries and joins

Applications

Transactional data applications such as ERP, CRM, and banking

Applications where data volume is limited and the schema is largely fixed

Limitations

Lacks horizontal scalability and hence limited in handling “big data”

Not efficient at handling complex multi-level nested data

Cannot handle “unstructured” data where the structure is not known at design time
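
For contrast with the NoSQL models, the relational capabilities listed above (fixed schema, transactions, joins, and sub-queries) can be sketched with SQLite from the Python standard library. SQLite is used here only because it needs no server, and the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is fixed up front, before any data is written.
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
    """
)

# Used as a context manager, the connection commits on success
# and rolls back if an exception is raised.
with conn:
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")]
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0)],
    )

# Join plus sub-query: customers whose total spend exceeds the average order.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    HAVING SUM(o.amount) > (SELECT AVG(amount) FROM orders)
    ORDER BY total DESC
    """
).fetchall()
print(rows)
```

A single declarative statement combines the join, the aggregation, and the sub-query, which is the kind of query the limitations lists say the other models handle poorly.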

Choosing the right platform involves mapping the requirements to a data model and to the type of querying and data access patterns required. The table above gives a brief overview of common features supported by various engines and can help shorten the list of options. However, making a final choice for any application requires a much finer analysis of the individual engines.

It may seem like there is a glaring omission of Hadoop in the above table. Hadoop is so widely known in the context of big data and NoSQL that it is often mistaken for a database.

Hadoop fundamentally consists of:

Hadoop Distributed File System (HDFS) – a distributed file storage system with built-in replication and fault tolerance

Hadoop YARN – a framework for job scheduling and cluster resource management

MapReduce – a distributed programming model to process a large number of objects
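
The MapReduce programming model can be sketched in a single process with a word count. This is only an illustration of the map, shuffle, and reduce phases: the function names are hypothetical, and Hadoop itself distributes these tasks across a cluster rather than running them in one Python program.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


lines = ["big data needs big storage", "data models differ"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Each map call sees one line and each reduce call sees one key, so both phases can in principle run in parallel on separate machines.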

Several engines are built on top of Hadoop, such as HBase, Hive, and Giraph. They provide different data models, and some of them are included in the table above.