Today it is clear that processing and maintaining the information available online, and above all using it effectively, imposes an enormous load. Appropriate analysis of such data can support important conclusions for a range of purposes. Businesses, for example, try to work out their customers' patterns of behaviour in order to improve sales. Similarly, in the field of security, observing and analysing how people think, act and behave plays a leading role in matters of national defence and protection.
The Unstoppable Growth of Data on the Internet.
A few figures will bring home the reality of the amount of information handled over the Internet:
According to Eric Schmidt, who was chief executive of Google between 2001 and 2011, the volume of data generated every two days on the Internet is equivalent to the total of all data accumulated there up to 2003.
For its part, IBM estimates that 40 zettabytes of data will be generated in 2020, as against the current 3.2. This equates to 43 trillion gigabytes, around 300 times more than in 2005.
Google already permits searches that involve the consultation of databases of immense dimensions.
It is clear that the growth of the Internet, and of the volumes of data to be stored and processed, calls for effective solutions capable of coping with figures of this size. These needs have driven the consolidation of non-relational or Non-SQL databases (often written NoSQL), whose portability, scalability and distribution over multiple systems make the use of such technologies almost unavoidable.
SQL versus Non-SQL Databases.
Relational databases have traditionally concentrated on the reliability of transactions, following the well-known ACID principle, an acronym for Atomicity, Consistency, Isolation and Durability. Atomicity ensures that any transaction on a database is either fully completed or not carried out at all. Consistency keeps data in a valid state at all times. Isolation makes sure that transactions running at the same time do not interfere with one another, so each behaves as if it were executing alone. Durability guarantees that the results of a completed transaction survive even a system failure. The ACID principle brings robustness, but it also reduces performance and operational effectiveness as volumes of data increase.
- The ACID Principle, Typical in Relational Databases -
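As a rough illustration of atomicity and consistency, the following minimal sketch uses Python's built-in sqlite3 module. The table, the account names and the `transfer` helper are assumptions made for this example only; they do not come from any particular system.

```python
import sqlite3

# In-memory database; table and account names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance (consistency),
        # so the credit already applied to dst is undone as well (atomicity).
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # neither update is kept
```

After the second call the balances are the same as after the first one: the partial credit to `bob` does not survive the failed debit.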
Once the volume and rate of change of data grow beyond a certain point, the ACID guarantees of relational models give way to performance, availability and scalability, which are characteristics more typical of Non-SQL databases. Modern data systems on the Internet tend to follow another well-known principle, BASE: Basically Available (availability takes priority), Soft State (the state of the system may change over time even without new input, and keeping data consistent is largely left to mechanisms outside the database engine), and Eventual Consistency (the system converges towards a consistent state over time).
- The BASE Principle, Common in Non-SQL Databases -
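Eventual consistency can be pictured with a toy simulation. The sketch below assumes two replicas that accept writes independently and reconcile later with a simple "highest version wins" rule; the `Replica` class and `anti_entropy` function are hypothetical names for this example, and real systems use more elaborate conflict resolution.

```python
class Replica:
    """One node holding key -> (version, value). Illustrative only."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        current = self.data.get(key)
        # Keep the entry with the highest version number.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(a, b):
    """Exchange entries so both replicas end up with the newest versions."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("cart", ["book"], version=1)          # reaches r1 only
r2.write("cart", ["book", "pen"], version=2)   # a later write reaches r2 only

stale = r1.read("cart")   # r1 stays available but answers with old data
anti_entropy(r1, r2)      # background synchronisation
converged = r1.read("cart") == r2.read("cart")
```

Before synchronisation `r1` still serves the older cart; afterwards both replicas return the same value, which is the trade BASE makes in exchange for availability.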
The principal difference between a relational database (managed by a relational database management system, or RDBMS) and one not based on SQL (Non-SQL) lies in the concept of data structure. In a Non-SQL model, information is stored in more flexible ways that are not restricted to a predefined format, as is the case for relational databases (tables whose structure follows a fixed schema). Thanks to this, it becomes much simpler to distribute data among systems without a complex migration mechanism. Moreover, such a design lends itself to horizontal scalability, in which nodes are simply added to share the load. This is usually more advantageous than the vertical scalability (increasing the processing power and memory of a single machine) on which relational databases typically depend. These are the principal reasons that make Non-SQL databases the system of choice when it is necessary to handle large volumes of data of very varied sorts.
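To make the idea of horizontal distribution concrete, here is a minimal sketch in which records of different shapes are assigned to nodes by hashing their keys. The node names, the sample records and the `node_for` and `distribute` helpers are all assumptions for this example.

```python
import hashlib


def node_for(key, nodes):
    """Choose a node by hashing the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def distribute(records, nodes):
    """Return a mapping node -> list of keys stored on that node."""
    placement = {node: [] for node in nodes}
    for key in records:
        placement[node_for(key, nodes)].append(key)
    return placement


# Records need not share a fixed schema: each carries its own fields.
records = {
    "user:1": {"name": "Ana"},
    "user:2": {"name": "Luis", "email": "luis@example.com"},
    "order:9": {"items": ["book"], "total": 12.5},
}

three_nodes = distribute(records, ["node-a", "node-b", "node-c"])
# Scaling out is a matter of adding a node to the list.
four_nodes = distribute(records, ["node-a", "node-b", "node-c", "node-d"])
```

Modulo placement is the simplest possible choice and relocates many keys whenever the number of nodes changes; distributed stores commonly use techniques such as consistent hashing to limit that movement.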
Types and Classification of Non-SQL Databases
There are some 150 different Non-SQL databases, with varying data structures (based on documents, on keys and values, on objects, on graphs, on columns, and so forth). Among the best known are Cassandra, Hadoop, MongoDB, CouchDB and Redis. Giants like Oracle also offer a Non-SQL implementation.
The Increasingly Widespread Non-SQL Database
Despite this plethora of systems, it is possible to group Non-SQL databases into four main categories, on the basis of the type or model of data storage that they adopt:
Key and Value.
Data are stored, located and identified by means of a unique key associated with a value (a piece of data or a pointer to data). Examples include DynamoDB, Riak and Redis. Amazon and BestBuy, among other enterprises, use this sort of implementation.
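The key and value model can be sketched in a few lines. In this hypothetical `KeyValueStore` the value is an opaque blob: the store never inspects it, and the only access path is the key.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store; values are opaque bytes."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key, default=None):
        return self._items.get(key, default)

    def delete(self, key):
        self._items.pop(key, None)


store = KeyValueStore()
# The application decides how to serialise the value; the store does not care.
store.put("session:42", json.dumps({"user": "ana", "cart": ["book"]}).encode("utf-8"))

session = json.loads(store.get("session:42"))
store.delete("session:42")
missing = store.get("session:42")
```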
Column-Oriented.
These are similar to the key and value form, but the key is based on a combination of row, column and timestamp, and is used to reference sets of columns (families). This implementation is the closest to relational databases. Examples include Cassandra, BigTable and Hadoop/HBase. Companies like Twitter and Adobe use this kind of model.
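A minimal sketch of a key built from row, column and timestamp follows. The `WideColumnTable` class and the "family:qualifier" column naming are assumptions chosen for this example, not the API of any of the products mentioned.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy table: row key -> "family:qualifier" -> list of (timestamp, value)."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        self._rows[row][column].append((timestamp, value))

    def get(self, row, column, at=None):
        """Return the newest value, or the newest one not later than `at`."""
        versions = [
            (ts, v) for ts, v in self._rows[row][column] if at is None or ts <= at
        ]
        return max(versions, key=lambda tv: tv[0])[1] if versions else None

    def family(self, row, family):
        """Return the latest value of every column in one column family."""
        prefix = family + ":"
        columns = [c for c in list(self._rows[row]) if c.startswith(prefix)]
        return {c: self.get(row, c) for c in columns}


table = WideColumnTable()
table.put("user42", "profile:name", "Ana", timestamp=1)
table.put("user42", "profile:name", "Ana M.", timestamp=5)
table.put("user42", "stats:logins", 3, timestamp=2)

latest = table.get("user42", "profile:name")         # newest version
earlier = table.get("user42", "profile:name", at=3)  # version visible at time 3
profile = table.family("user42", "profile")
```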
Document-Oriented.
The data are stored in documents that encapsulate the information in XML, YAML or JSON format. The documents are self-describing: the field names are contained in the document itself, and the information is indexed by utilizing these field names. MongoDB and CouchDB are examples. One instance of the use of this technology is Netflix, a business providing audiovisual content online.
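The way documents carry their own field names, and are indexed by them, can be sketched as follows. The `DocumentCollection` class and the sample films are assumptions for this example; only top-level scalar fields are indexed to keep it short.

```python
import json
from collections import defaultdict


class DocumentCollection:
    """Toy document store with an index on top-level scalar fields."""

    def __init__(self):
        self._docs = {}
        self._index = defaultdict(set)  # (field, value) -> document ids

    def insert(self, doc_id, raw_json):
        doc = json.loads(raw_json)
        self._docs[doc_id] = doc
        for field, value in doc.items():
            # Field names come from the document itself, not from a schema.
            if isinstance(value, (str, int, float)):
                self._index[(field, value)].add(doc_id)

    def find(self, field, value):
        ids = sorted(self._index.get((field, value), ()))
        return [self._docs[i] for i in ids]


films = DocumentCollection()
# The two documents do not share the same set of fields.
films.insert("f1", '{"title": "Alien", "year": 1979, "genres": ["sci-fi", "horror"]}')
films.insert("f2", '{"title": "Heat", "year": 1995, "director": "Michael Mann"}')

from_1979 = films.find("year", 1979)
by_director = films.find("director", "Michael Mann")
```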
Graph-Oriented.
This model stores data as graphs that can extend over multiple machines. It is suited to data whose relationships match this sort of representation, such as transport networks, maps, and the like. Examples include Neo4j and GraphBase.
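For data such as a transport network, the typical query follows relationships rather than looking up a single record. The sketch below runs a breadth-first search over a small adjacency list held on one machine; the station names and the `shortest_route` function are made up for the example.

```python
from collections import deque


def shortest_route(graph, start, goal):
    """Breadth-first search: return a route with the fewest hops, or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                if neighbour == goal:
                    # Walk back through the predecessors to rebuild the route.
                    path = [goal]
                    while previous[path[-1]] is not None:
                        path.append(previous[path[-1]])
                    return path[::-1]
                queue.append(neighbour)
    return None


# Nodes are stations, edges are direct connections (illustrative data).
metro = {
    "Central": ["Museum", "Harbour"],
    "Museum": ["Central", "Park"],
    "Harbour": ["Central", "Airport"],
    "Park": ["Museum", "Airport"],
    "Airport": ["Harbour", "Park"],
}

route = shortest_route(metro, "Central", "Airport")
```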
Security Challenges in Non-SQL Databases
Given the wide variety of Non-SQL databases, attention must be paid to the weaknesses these models have in common. In addition, the measures required by each particular implementation must be applied case by case.
In comparison with relational databases, the following security areas can be highlighted:
Authentication
Strength of authentication is one of the areas where many Non-SQL implementations show weak points. It is common to find that Non-SQL databases ship with default credentials, or even make authentication optional or disable it by default (an example being Redis). In many instances, they rely on trusted environments rather than on user authentication. This is always a crucial point to check, and the difficulty of doing so depends on the specific software in question.
Integrity of Data
The adoption of a philosophy in which availability and performance take priority has a negative impact on the integrity of data. Consequently, it is often necessary to use complementary mechanisms outside the database engine to ensure such integrity.
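One possible complementary mechanism outside the engine is for the application to attach a keyed hash (HMAC) to each record before storing it and to check it on every read. The sketch below uses only the Python standard library; the key, the record layout and the `seal` and `verify` helpers are assumptions for this example.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-key-change-me"  # placeholder; keep real keys outside the code


def _canonical(document):
    """Serialise deterministically so the same document yields the same bytes."""
    return json.dumps(document, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal(document, key=SECRET_KEY):
    """Return the record to store: the document plus its HMAC-SHA256 tag."""
    tag = hmac.new(key, _canonical(document), hashlib.sha256).hexdigest()
    return {"doc": document, "mac": tag}


def verify(record, key=SECRET_KEY):
    """Recompute the tag and compare it in constant time."""
    expected = hmac.new(key, _canonical(record["doc"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["mac"])


record = seal({"account": "alice", "balance": 100})
intact = verify(record)

record["doc"]["balance"] = 1_000_000   # modification made behind the application's back
tampered_ok = verify(record)
```

A check of this kind detects modification of stored records; it does not, on its own, prevent a record from being deleted or replaced by an older sealed copy.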
Confidentiality and Encrypted Storage
In general, data are stored as plain text. With a few exceptions, such as Cassandra with its Transparent Data Encryption technology, no integrated encryption mechanisms are in place. In most instances, encryption has to be delegated to processes in the application layer or to the file system itself.
Auditing
Most Non-SQL databases lack any robust mechanisms of their own for auditing data. This is a considerable gap when it comes to detecting possible attacks by observing events that affect specific entries, as would be done in relational databases.
Security of Communications
Encryption and SSL/TLS protocols are normally used in relational databases. In contrast, in Non-SQL systems these are generally disabled by default, are optional (for example in Cassandra), or may need to be specifically configured during installation (MongoDB).
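On the client side, a reasonable habit is to build an explicit TLS configuration instead of relying on a driver's defaults. The sketch below uses Python's standard ssl module; the `strict_tls_context` helper is an assumption for this example, and how the resulting context is handed to a particular database driver depends on that driver.

```python
import ssl


def strict_tls_context(ca_file=None):
    """Build a client-side TLS context that verifies the server certificate."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context(cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = strict_tls_context()
```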
Classic Database Vulnerabilities. Even More Injection
Finally, consider one of the most widely exploited weaknesses, command injection. In Non-SQL databases, requests are executed through a call to the corresponding API, formatted in accordance with a common convention, normally JSON or XML. At this point, inadequate validation of the input parameters may allow commands to be executed as they are evaluated and processed in the API call in question. The possibilities for injection, and the risks, when using an API from a procedural programming language are even greater than in the case of relational databases, where the SQL language, typically declarative and much more constrained, is used. Non-SQL injection and JavaScript code injection are new vectors that open up a wide front for attacks on databases of this sort.
Non-SQL approaches are increasingly present in today's database technologies. They face considerable security challenges, and sooner or later their defences will have to be reinforced.
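The following sketch shows how an unvalidated JSON parameter can change the meaning of a query. It uses a tiny stand-in matcher that understands a `$ne` ("not equal") operator in the style of MongoDB queries; the matcher, the login functions and the sample user are assumptions for this example and do not talk to a real database.

```python
import json


def matches(document, query):
    """Tiny matcher supporting equality and a `$ne` operator (demonstration only)."""
    for field, condition in query.items():
        value = document.get(field)
        if isinstance(condition, dict):
            if "$ne" in condition and value == condition["$ne"]:
                return False
        elif value != condition:
            return False
    return True


def login_unsafe(users, body):
    """Passes whatever the client sent straight into the query."""
    creds = json.loads(body)
    query = {"user": creds.get("user"), "password": creds.get("password")}
    return any(matches(u, query) for u in users)


def login_checked(users, body):
    """Accepts plain strings only, so operator objects never reach the query."""
    creds = json.loads(body)
    user, password = creds.get("user"), creds.get("password")
    if not isinstance(user, str) or not isinstance(password, str):
        return False
    return any(matches(u, {"user": user, "password": password}) for u in users)


# Plain-text password for brevity only; real systems store password hashes.
users = [{"user": "admin", "password": "s3cret"}]
attack = '{"user": "admin", "password": {"$ne": null}}'

bypassed = login_unsafe(users, attack)   # the operator object alters the query
blocked = login_checked(users, attack)   # rejected by the type check
```

The attacker never learns the password: supplying an object where a string was expected turns "password equals X" into "password is not null". Checking the type of every input before building the query closes this particular hole.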