The mid-2000s were marked by the rapid growth of columnar DBMSs. Vertica, ParAccel, Kognitio, Infobright, SAND and others joined the columnar club and ended the proud solitude of Sybase IQ, which had founded that club back in the early 1990s. In this article I will look at the reasons for the popularity of the column-oriented storage idea, at how columnar DBMSs work, and at where they are used.
Let's start with the fact that the relational databases popular today - Oracle, SQL Server, MySQL, DB2, PostgreSQL and others - are based on an architecture whose history goes back to the 1970s, when transistor radios, long sideburns and flared trousers were in fashion, and hierarchical and network data management systems dominated the database world. The main task of databases back then was to support the massive shift, begun in the 1960s, from paper records of business activity to the computer. Huge amounts of information were transferred from paper documents into the databases of accounting systems, which were supposed to store all incoming information reliably and, when necessary, find it quickly. These requirements shaped the architectural features of relational databases that have remained virtually unchanged to this day: row-based storage, record indexing, and operation logging.
Row-based storage usually means the physical storage of an entire table row as a single record, in which the fields follow one another sequentially, and the last field of one record is, generally speaking, followed by the first field of the next. Something like this:
[A1, B1, C1], [A2, B2, C2], [A3, B3, C3] ...
where A, B, and C are fields (columns), and 1, 2, and 3 are record (row) numbers.
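To make this concrete, here is a minimal Python sketch of the idea (purely illustrative; real engines work with binary pages, not Python lists):

```python
# Purely illustrative row-oriented layout: each record is written as a
# contiguous group of field values, and records follow one another.
rows = [
    ("A1", "B1", "C1"),   # record 1
    ("A2", "B2", "C2"),   # record 2
    ("A3", "B3", "C3"),   # record 3
]

# The on-disk stream is simply record 1's fields, then record 2's, and so on.
row_store = [value for record in rows for value in record]
print(row_store)   # ['A1', 'B1', 'C1', 'A2', 'B2', 'C2', 'A3', 'B3', 'C3']

# Appending a new record only touches the tail of the stream,
# which is why inserts are cheap in this layout.
row_store.extend(("A4", "B4", "C4"))
```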
Such storage is extremely convenient for the frequent operation of adding new rows to a database, which usually lives on a hard drive: a new record can be appended in its entirety in just a single pass of the drive heads. The significant speed limitations imposed by the HDD made it necessary to maintain special indexes that allow the desired record to be found on disk with a minimum number of head movements. Typically several indexes are created, depending on which fields the searches use, and this sometimes increases the volume of the database on disk several times over. In addition, for fault tolerance, traditional DBMSs automatically duplicate transactions in a log, which consumes even more disk space. As a result, for example, an average Oracle database occupies about 5 times more space than the volume of payload data in it; for an average DB2 database this ratio is even higher, about 7:1.
However, in the 1990s, with the spread of analytical information systems and data warehouses used for management analysis of the data accumulated in accounting systems, it became clear that the nature of the load in these two types of systems differs radically.
While transactional applications are characterized by very frequent small transactions that add or change one or a few records (insert/update), analytical systems show the opposite picture: the largest load is created by relatively rare but heavy queries (select) over hundreds of thousands or millions of records, often with grouping and calculation of totals (so-called aggregates). Write operations are few, often less than 1% of the total, and writes usually come in large blocks (bulk load). Analytical queries also have one important feature: they usually involve only a few fields. On average, an analytical SQL query rarely references more than 7-8 of them, because the human mind cannot properly absorb information in more than 5-7 dimensions at once.
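To make the distinction concrete, here is a rough Python sketch of the kind of heavy analytical query described above - a full scan with grouping and totals - over an invented "sales" table (the field names and values are made up for illustration):

```python
from collections import defaultdict

# Invented sample data: a "sales" fact table; a real warehouse would hold
# hundreds of thousands or millions of such rows with dozens of fields.
sales = [
    {"region": "North", "amount": 120.0, "qty": 3},
    {"region": "South", "amount": 80.0,  "qty": 1},
    {"region": "North", "amount": 200.0, "qty": 5},
]

# Roughly the equivalent of:  SELECT region, SUM(amount) FROM sales GROUP BY region
totals = defaultdict(float)
for row in sales:                  # full scan of the fact table
    totals[row["region"]] += row["amount"]

print(dict(totals))                # {'North': 320.0, 'South': 80.0}
```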
But what happens if you select, say, only 3 fields from a table that has 50? Because of the row-based storage in traditional DBMSs (needed, as we remember, for the frequent addition of new records in accounting systems), absolutely all rows are read in full, with all their fields. This means that regardless of whether we need all 50 fields or only 3, they are read from the disk in their entirety, passed through the disk controller and the I/O channels to the processor, which then keeps only what the query actually needs. Unfortunately, disk I/O channels are usually the main performance bottleneck of analytical systems. As a result, the efficiency of a traditional RDBMS on such a query can drop by a factor of 10-15 because of the inevitable reading of unnecessary data. And Moore's law works much more weakly for disk I/O speed than for processor speed and memory capacity, so the situation will apparently only get worse.
Columnar DBMSs were designed to solve this problem. Their basic idea is to store data not by rows, as traditional DBMSs do, but by columns. This means that from the point of view of an SQL client the data is still presented as tables, but physically each table is a collection of columns, each of which is essentially a table with a single field. On the physical disk, the values of one field are stored one after another - something like this:
[A1, A2, A3], [B1, B2, B3], [C1, C2, C3], etc.
With this data organization, a select that references only 3 of the 50 fields in a table physically reads only 3 columns from disk. This means the load on the I/O channel is roughly 50/3 ≈ 17 times lower than for the same query in a traditional row-based database.
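Here is a matching illustrative Python sketch of the column-oriented layout - again purely schematic, not any particular engine's format - showing that a query touching 3 of 50 columns only ever reads those 3:

```python
# Purely schematic column-oriented layout: every field is its own contiguous array.
column_store = {
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2", "B3"],
    "C": ["C1", "C2", "C3"],
    # ... imagine another 47 columns here
}

def scan(columns_needed):
    """Read only the requested columns; the others never leave the disk."""
    return {name: column_store[name] for name in columns_needed}

print(scan(["A", "C"]))            # only 2 of the columns are touched

# The rough I/O saving from the example in the text: 3 columns out of 50.
total_columns, needed_columns = 50, 3
print(f"~{total_columns / needed_columns:.0f}x less data read")   # ~17x
```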
In addition, column-based storage opens up great opportunities for data compression, because the data within a single column of a table is generally of the same type, which cannot be said of a row. The compression algorithms can vary. Here is an example of one of them, the so-called Run-Length Encoding (RLE):
If we have a table with 100 million records accumulated within one year, the "Date" column will actually contain no more than 366 distinct values, since a year has no more than 366 days (counting leap years). So we can replace the 100 million sorted values in this field with 366 pairs of the form <date, number of occurrences> and store them on disk in that form. They will then occupy roughly 100 thousand times less space, which also helps query performance.
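Here is a minimal illustrative Python sketch of that RLE idea on a tiny synthetic date column (real engines, of course, store the pairs in a compact binary form):

```python
from itertools import groupby

# Synthetic sorted "Date" column: many repeated values, few distinct ones.
dates = ["2009-01-01"] * 4 + ["2009-01-02"] * 3 + ["2009-01-03"] * 2

# RLE: replace each run of identical values with a single (value, count) pair.
encoded = [(value, len(list(run))) for value, run in groupby(dates)]
print(encoded)   # [('2009-01-01', 4), ('2009-01-02', 3), ('2009-01-03', 2)]

# Decoding restores the original column, so queries can still be answered
# (and many can even be evaluated directly on the compressed pairs).
decoded = [value for value, count in encoded for _ in range(count)]
assert decoded == dates
```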
From a developer's perspective, columnar DBMSs are usually ACID-compliant and support most of the SQL-99 standard.
Summary
Columnar DBMSs are designed to solve the problem of the inefficiency of traditional databases in analytical systems and in systems where the vast majority of operations are reads. They allow query performance gains of 5x, 10x, and sometimes even 100x on cheaper and less powerful hardware, while, thanks to compression, the data on disk takes up 5-10 times less space than with a traditional DBMS.
Columnar DBMSs also have drawbacks: they are slow at writing, poorly suited to transactional systems, and, as a rule, because of their "youth" they impose a number of limitations on a developer who is used to working with traditional DBMSs.
Columnar DBMSs are usually used in business intelligence analytical systems (ROLAP) and analytical data warehouses. The data volumes can be quite large: there are installations of 300-500 TB and even cases with more than 1 PB of data.