пятница, 26 декабря 2014 г.

Using bitmap indexes in databases

Kelly Sommers posted a great article on the internals of bitmap indexes:
"There are many different data structures used in databases to create indexes used to quickly evaluate queries. Each one has different strengths and weaknesses based on the tradeoffs they make on memory, cpu and storage (if persisted). One of these types of indexes is called a bitmap index. For the purpose of this post I’m going to be using relational database terminology but the same techniques can apply to different database types like column oriented databases or many others non-relational databases.
A bitmap index represents membership. A bitmap index is an array of bits, a bit represents each row where each bit set to 1 represents membership in the set.
To illustrate this better lets create an example table with 4 rows but consider it could be a table with 100 million rows.
Creating a bitmap index on the country column would give us 3 bitmap indexes (3 arrays of bits) that looks like the following.
[1, 0, 0, 0]
[0, 1, 0, 1]
[0, 0, 1, 0]
Now a bitmap index for the Name column which creates 4 bitmap indexes.
[1, 0, 0, 0]
[0, 1, 0, 0]
[0, 0, 1, 0]
[0, 0, 0, 1]
In addition to the bitmap indexes, we need a map that maps index positions to rows in the database (more on why later).
[100, 101, 102, 103]

Types of data

Bitmap indexes do well at indexing categorial data with well defined values like countries, dates and ages as a few examples. Bitmap indexes don’t do well with numerical data with continuous values like prices or kilometres per hour that you may want to sum or average. A 1 in the array of bits represents membership in the set for a unique term which makes them suitable for WHERE A = B type evaluations and not WHERE A > B evaluations.

Memory and storage

If you have 100 million rows in your database, the storage for the Country column index would be 3 bitmap indexes with 100 million bits (12 megabytes) each taking a total 36MB.
Using bitmap indexes means managing a lot of indexes because you need a new array of bits per unique term. That’s one of the cons of a bitmap index. If you have 100 million rows and each has a unique term (say a timestamp) you would create bitmap indexes for each timestamp where only 1 bit is set out of 100 million bits.
Fortunately there are ways to tackle this problem called Compressed Bitmaps which I will cover later.

Evaluating a query

With the indexes and row map defined we can jump back to the high level and explain how a query like this works.
WHERE Name = ‘Julie’ AND Country = ‘USA’
There’s a lot of material and research on how to optimize query operations such as choosing which order to do things based on cardinality and many other aspects. In these examples I’m not explaining the process of optimizing a query execution plan, just explaining the basics.
To evaluate which rows match the criteria we can do bitwise operations like bitwise AND and OR’s.
STEP 1: Evaluate Name = ‘Julie’ AND Country = ‘USA’
We take the 2 bitmap indexes that represent the terms “Julie” and “USA” and we bitwise AND them.
    0001 (Julie)
AND 0101 (USA)
  = 0001
STEP 2: Map results to row map
Take all the set bits and map the row key.
[0, 0, 0, 1]
= rowmap[3]
= rows["103"]
STEP 3: Select results
With the results from the where clause evaluated into row keys now we can use however our database storage works to fetch the rows and the columns selected in the query to return.


Depending on the database implementation, deleting a row in a database may cause a bunch of index rebuilds to occur but luckily with bitmap indexes you can still respond to queries with responses that exclude the deleted rows of data while these indexes rebuild in the background. Create a new bitmap index that represents deleted rows then bitwise AND against this bitmap index with any query results. In a bitmap index that represents deletes, a bit set to 1 represents a valid row and a bit set to 0 represents a deleted row.
Using the query mentioned above, if we deleted Julie the operations would look something like this.
    0001 (Julie)
AND 0101 (USA)
AND 1110 (Deletes)
  = 0000 (No matching rows!)
The bitmap index representing deletes is temporary and can be removed once the indexes affected by the row deletion have been rebuilt.

Group By and Joins

Bitmap indexes can be used for evaluating a lot of different parts of a query such as GROUP BY and JOIN clauses. Evaluating a join is a multi-step process because you need to evaluate the bitmap indexes from the right side join table and then take the resulting bitmap and bitwise AND it against the query result on the left side like we did on the deletes example. One to many to one relationships get even more complicated because there is an intermediary table facilitating the relationship. Explaining joins is a much more involved example I will leave for another blog post.

Bitmap indexes in OLAP databases

An OLAP (online analytical processing) database is typically used for reporting and business intelligence where there is a need to query data in many different ways (dimensions) with quick response times and is sometimes called an “OLAP Cube”. OLAP data is usually stored with a star schema where there is a fact table and multiple dimension tables. If we were analyzing sales, the sales table would be the fact table. Country and day of week would be dimension tables. This star schema would allow us to do analytical queries answering questions like “top 10 sales by country on Saturday” giving you insight on which countries you may want to increase promotions.
Some OLAP databases use bitmap indexes. When the OLAP cube is being generated it creates pre-calculated results so that queries can be extremely fast. When the cube is being generated it’s going through a materialization stage where it is evaluating every possible bitmap index combination for the dimension tables and creates all the resulting bitmaps.
The nice thing about using bitmap indexes in an OLAP database, since everything is an array of bits, you can choose not to materialize everything. If a pre-calculated result explodes the cube due to the size of possible combinations you can do what is called partial materialization where you don’t materialize everything but you can still evaluate bitmaps from parts that are materialized with bitmaps that aren’t materialized.

Accelerating with SIMD & GPU’s

Included in many modern processors is SSE2 which provides SIMD (Single instruction, multiple data). Using SIMD to evaluate the bitwise AND or bitwise OR operations would give the ability to evaluate (depending on CPU) 128 bits at once. With Intel AVX (Advanced Vector Extensions) in Sandy Bridge and AMD BullDozer up to 256 bits is supported. This would be a significant increase in performance for the operations explained earlier.
Using GPGPU could also accelerate processing even further but this poses additional challenges around getting data in and out of the GPU VRAM. If your indexes are larger than the VRAM available you will need to swap data in and out of the GPU many times.

Compressed Bitmap Index

One of the cons to using bitmap indexes is the amount of bitmap indexes that get created. You need to create one bitmap index the size of the row count per unique term. Luckily run-length encoding fits this problem really well. If you have really high cardinality columns, you will likely have most bits set to 0 and these can be compressed very well. Compressed bitmaps allow you to do the bitwise AND and bitwise OR operations in the compressed form avoiding having to decompress these large bitmaps before using them. Sparsity of membership when using run-length encoding will dictate how effective the compression rate is.
You can find some really good compressed bitmap implementations at the following links.
Compressed bitmap indexes can be really useful for evaluating queries on really large datasets quickly and offer a good compromise between memory efficiency and processing speed since in many cases they can be faster than uncompressed bitmaps. I only cover some basic uses of bitmap indexes, there are a lot of different ways to approach data indexing and query execution plans."