Use bitmap indexes for indexing elements in the buffer

Description

Migrated from NovaTec Jira. Original ticket: INSPECTIT-1164

On Friday I spent some time checking out what can be alternative to our current index structure that has a big problem with size in memory (although it's quite fast).. And I stumbled to some Oracle documentation (https://docs.oracle.com/database/121/CNCPT/indexiot.htm#CNCPT1182) on the type of indexes they have and it was very interesting to read about the bitmap indexes.

The idea is simple, for a index on a column (in our case one field) you create a bitmap for each distinct value. Later on when querying you just take the appropriate bitmap and and read the appropriate rows (in our case that would be an array of elements).

However, they said that the bitmap index is suggested for columns that can not have many distinct values, as the number of distinct values rises you need more bitmaps, thus index get too big. However, there is a guy who is questioning this on account of compressed bit sets (http://lemire.me/blog/archives/2008/08/20/the-mythical-bitmap-index/). He says that in fact the size of all bit sets used in one index is approximate to the size of elements indexed in bits: "you have that the compressed size of a bitmap index is at most proportional to the size of your table! Irrespective of the number of distinct values!"

I also found a nice Java library that uses RLE compression scheme (https://code.google.com/p/javaewah/). I would really opt to check this out and see what would be the index size for 1 million elements with several distinct values for example.

The general idea how we could do this I depicted on the attached picture.

Considering the values we have right now, where for ~500.000 elements we have 30MB indexing structure with 3-level indexes, this would be a big gain. Cause then we would actually need only three bitmap indexes each in size of 500Kbits. That's not even one 1 MB..

Biggest challenge for us would be the problem of updating the index when elements are kicked-out from the buffer and maintaining it size to correspond to the number of elements currently in the buffer..

Sure this is for discussion.

Environment

None

Assignee

Unassigned

Reporter

Ivan Senic

Labels

None

Integrator

None

Components

Priority

Low
Configure