Choosing a NoSQL Database – Technology Comparison Matrix

Ashutosh Bijoor

The big data technology landscape offers so many choices that the most critical step in successfully implementing a solution is often choosing the right platform: one that addresses the requirements of the problem at hand and is sustainable in the long term.

With so many choices, though, a feature-by-feature comparison of every individual platform is too complex and time-consuming. It is possible, however, to group these options by the data models they support.

The following technology comparison matrix compares the predominant data models: their capabilities, typical applications, and limitations.

 

Data Model Example: Key-Value - Berkeley DB, MemcacheDB, Redis, DynamoDB

Capabilities

  • The simplest model, where each object is retrieved with a unique key and values have no inherent model
  • Often use in-memory storage to provide fast access, with optional persistence
  • Other data models are built on top of this model to provide more complex objects

Applications

  • Applications requiring fast access to a large number of objects, such as caches or queues
  • Applications that require fast-changing data environments like mobile, gaming, online ads

Limitations

  • Cannot update a subset of a value
  • Does not provide querying
  • As the number of objects becomes large, generating unique keys could become complex
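
To make the first limitation concrete, here is a minimal sketch of the key-value model in Python. The KeyValueStore class and the "user:42" key are hypothetical and do not correspond to the client API of any engine listed above. The store treats each value as an opaque blob, so changing one field means reading the whole value, modifying it in the application, and writing it back.

```python
# Minimal in-memory key-value store (hypothetical class, for illustration only).
import json


class KeyValueStore:
    """Maps unique keys to opaque values; the store never inspects a value."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)


store = KeyValueStore()
store.put("user:42", json.dumps({"name": "Asha", "score": 10}).encode())

# No partial update: read the whole value, change it, write the whole value back.
profile = json.loads(store.get("user:42"))
profile["score"] = 11
store.put("user:42", json.dumps(profile).encode())
```

Real engines differ in what they accept as a value and in how they persist it; the sketch only shows the access pattern.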

 

Data Model Example: Document-oriented - MongoDB, CouchDB, Apache Solr

Capabilities

  • Extension of a key-value model, where the value is a structured document
  • Documents can be highly complex, hierarchical data structures without requiring a pre-defined “schema”
  • Support queries on structured documents
  • Search platforms are also document-oriented

Applications

  • Applications that need to manage a large variety of objects that differ in structure
  • Large product catalogs in e-commerce, customer profiles, content management applications

Limitations

  • No standard query syntax
  • Query performance not linearly scalable
  • Join queries across collections not efficient
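
As a rough illustration of querying schema-less documents, the sketch below keeps documents of different shapes in one collection and filters them by field values, including nested fields. The find function and its dotted-path convention are assumptions made for this example, not the query syntax of any particular engine.

```python
# Sketch of a document collection with a simple equality query (hypothetical helper).
def find(collection, criteria):
    """Return documents whose fields equal every value in criteria.

    Dotted paths such as "specs.ram_gb" reach into nested documents.
    """
    def lookup(doc, path):
        for part in path.split("."):
            if not isinstance(doc, dict) or part not in doc:
                return None
            doc = doc[part]
        return doc

    return [d for d in collection
            if all(lookup(d, k) == v for k, v in criteria.items())]


# Documents in the same collection do not need to share a structure.
products = [
    {"_id": 1, "type": "laptop", "specs": {"ram_gb": 16, "ports": ["usb-c", "hdmi"]}},
    {"_id": 2, "type": "book", "author": "A. Writer", "pages": 320},
    {"_id": 3, "type": "laptop", "specs": {"ram_gb": 8}},
]

result = find(products, {"type": "laptop", "specs.ram_gb": 16})
```

This version scans every document, which is one reason real document stores rely on indexes to keep queries fast.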

 

Data Model Example: Column-Oriented - Cassandra, BigTable, HBase, Apache Accumulo

Capabilities

  • Extension of the key-value model, where the value is a set of columns (column-family)
  • A column can have multiple time-stamped versions
  • Columns can be generated at run-time and not all rows need to have all columns

Applications

  • Storing large volumes of time-stamped data such as event logs and sensor data
  • Analytics that involve querying entire columns of data such as trends or time series analytics

Limitations

  • No join queries or sub-queries
  • Limited support for aggregation
  • Ordering is done per partition, specified at table creation time
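
The column-family capabilities above (columns created at run-time, sparse rows, time-stamped versions) can be sketched as a nested mapping. The ColumnFamily class and its method names are hypothetical; the engines listed above expose quite different APIs and storage layouts.

```python
# Sketch of a column family: row key -> column name -> time-stamped versions.
from collections import defaultdict


class ColumnFamily:
    def __init__(self):
        # Each cell keeps a list of (timestamp, value) versions.
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        # Columns are created on first write; no schema lists them up front.
        self._rows[row_key][column].append((timestamp, value))

    def get(self, row_key, column):
        """Return the most recent version of a column, or None if absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return max(versions, key=lambda v: v[0])[1]

    def columns(self, row_key):
        """Return the column names that this particular row actually has."""
        return sorted(self._rows.get(row_key, {}))


readings = ColumnFamily()
readings.put("sensor-1", "temp", 21.5, timestamp=100)
readings.put("sensor-1", "temp", 22.0, timestamp=200)
readings.put("sensor-2", "humidity", 40, timestamp=150)
```

Here "sensor-1" and "sensor-2" hold different columns, and reading "temp" for "sensor-1" returns the version with the latest timestamp.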

 

Data Model Example: Graph-oriented - Neo4j, OrientDB, Apache Giraph, AllegroGraph

Capabilities

  • Models graphs consisting of nodes and edges with properties (meta-data) describing them
  • Implement very fast graph traversal operations
  • Also support indexing of metadata to enable graph traversal combined with search queries
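
A small sketch may help show what traversal combined with a metadata filter looks like. It stores nodes and edges with properties in plain dictionaries and walks the graph breadth-first. The sample data and the within_hops function are invented for illustration; graph databases use their own query languages and far more efficient storage.

```python
# Sketch of a property graph with a breadth-first traversal (illustrative data).
from collections import deque

nodes = {
    "alice": {"city": "Pune"},
    "bob": {"city": "Mumbai"},
    "carol": {"city": "Pune"},
    "dave": {"city": "Delhi"},
}
# Adjacency list: source node -> list of (target node, edge properties).
edges = {
    "alice": [("bob", {"type": "friend"})],
    "bob": [("carol", {"type": "friend"})],
    "carol": [("dave", {"type": "colleague"})],
}


def within_hops(start, max_hops, edge_type):
    """Follow only edges of edge_type; return reachable node -> hop distance."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if seen[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour, props in edges.get(current, []):
            if props["type"] == edge_type and neighbour not in seen:
                seen[neighbour] = seen[current] + 1
                queue.append(neighbour)
    del seen[start]
    return seen


# Traversal combined with a metadata filter: friends within two hops who live in Pune.
friends = within_hops("alice", 2, "friend")
pune_friends = [name for name in friends if nodes[name]["city"] == "Pune"]
```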

Applications

  • Applications that deal with objects with a large number of inter-relations
  • Examples include friend networks in social networking, hierarchical role-based permissions, complex decision trees, maps, and network topologies

Limitations

 

Data Model Example: Relational - MySQL, PostgreSQL, MariaDB, Oracle

Capabilities

  • Conventional RDBMS structure consisting of a fixed schema with ACID properties
  • Provides well-documented and widely supported SQL syntax
  • Capable of complex queries including sub-queries and joins

Applications

  • Transactional data applications such as ERP, CRM, and banking
  • Applications where data volume is limited and the schema is largely fixed

Limitations

  • Lacks horizontal scalability and hence limited in handling “big data”
  • Not efficient at handling complex multi-level nested data
  • Cannot handle “unstructured” data where the structure is not known at design time
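
For contrast with the NoSQL models, the sketch below shows the relational capabilities in miniature: a fixed schema, a transactional insert, and a query that combines a join with a sub-query. It uses Python's built-in sqlite3 module with an in-memory database purely for convenience (SQLite is not one of the engines listed above), and the table and column names are made up for the example.

```python
# Sketch of a fixed schema, a transaction, and a join with a sub-query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
""")

# The with-block commits on success and rolls back if an exception is raised.
with conn:
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "Asha"), (2, "Ravi")])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 1, 250.0), (2, 1, 100.0), (3, 2, 40.0)])

# Customers whose total spend exceeds the average order amount.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    HAVING SUM(o.amount) > (SELECT AVG(amount) FROM orders)
    ORDER BY c.name
""").fetchall()
```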

Choosing the right platform involves mapping the requirements to a data model and to the type of querying and data access patterns required. The table above gives a brief overview of the common features of each group of engines and can help shorten the list of options. Making a final choice for any application, however, requires a much finer analysis of the individual engines.

Hadoop may seem like a glaring omission from the table above. It is so widely known in the context of big data and NoSQL that it is often mistaken for a database.

Hadoop fundamentally consists of:

  • Hadoop Distributed File System (HDFS) – a distributed file storage system with built-in replication and fault tolerance
  • Hadoop YARN – a framework for job scheduling and cluster resource management
  • MapReduce – a distributed programming model to process a large number of objects
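
The MapReduce programming model can be sketched in a few lines of single-process Python using the familiar word-count example. This is only an illustration of the map, shuffle, and reduce phases: the function names are chosen for this example, and none of it is Hadoop's actual API, which distributes these phases across a cluster.

```python
# Single-process sketch of the map, shuffle, and reduce phases (word count).
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data big choices", "data models"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster, each map call can run on the node that holds its input block, and the shuffle moves data across the network.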

Several database engines, such as HBase, Hive, and Giraph, are built on top of Hadoop and provide different data models; HBase and Giraph are included in the table above.