
SQL Server 2014: Hekaton memory optimized tables, hash indexes, and bucket counts

For my first blog post at dbi services, I thought it would be a good opportunity to discuss SQL Server 2014, Hekaton memory-optimized tables, and hash indexes. When you create a memory-optimized table, you have to consider the number of buckets to reserve for its associated hash index. This is an important configuration choice, because once the table is created you cannot change the number of buckets without recreating it.

For SQL Server database administrators this is unusual, because neither ordinary indexes (clustered and nonclustered) nor special indexes (XML or spatial) have such a parameter. Furthermore, Microsoft recommends reserving a number of buckets equal to at least two times the number of distinct values in the table! The first time I read the documentation, I asked myself a few questions:

  • Why reserve a fixed number of buckets?
  • Why is the number of buckets rounded to a power of 2?

In order to find the answers, I first had to understand how a hash index works.

Let's start from a basic definition: a hash table is a data structure used to implement an unsorted associative array that maps keys to values. A hash function is applied to each key, and the hash result determines the entry's position in an array of buckets, as shown in figure 1 (simplified schema).

 

[Figure 1: keys hashed into an array of buckets (simplified schema)]

 

Notice that the hashed keys are not sorted. This does not matter for lookup operations, but it means a hash index performs poorly for range scans. In the hash table world, the ideal situation is one unique bucket per key, which allows constant-time lookups. In reality this is not the case, because of hash collisions: two distinct keys can produce the same hash value. A good hash function, however, spreads the keys across the buckets uniformly enough to limit this phenomenon as much as possible.
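
To make the key-to-bucket mapping concrete, here is a small T-SQL illustration. CHECKSUM is only a stand-in here (it is not the hash function Hekaton uses internally), but it shows how arbitrary keys are reduced to positions in a fixed-size bucket array:

declare @bucket_count int = 8;

-- hash each key, then map the hash value to a bucket position
-- (CHECKSUM is a stand-in for the real, internal hash function)
select t.k as [key],
       abs(checksum(t.k)) % @bucket_count as bucket_position
from (values ('alpha'), ('bravo'), ('charlie'), ('delta')) as t(k);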

There are several strategies for resolving hash collisions and a lack of buckets. SQL Server uses separate chaining: when a hash collision occurs, or when the number of buckets is insufficient, the row is appended to the row chain of the selected bucket. Consequently, the cost of a table operation is that of scanning the entries in the row chain of the bucket selected for the index key. If the distribution is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket, also called the load factor: the total number of entries divided by the number of buckets. For example, 3072 entries in 1024 buckets give a load factor of 3. The larger the load factor, the slower the hash table: for a fixed number of buckets, lookup time grows with the number of entries.

 

[Figure 2: a hash collision resolved by chaining rows within the same bucket]

 

Now let's move on to a practical example of a memory-optimized table:

 

CREATE TABLE [dbo].[HekatonTableDurable]
(
       [c1] [int] NOT NULL,
       [c2] [char](100) COLLATE Latin1_General_100_BIN2 NOT NULL
   CONSTRAINT [pk_HekatonTableDurable] PRIMARY KEY NONCLUSTERED HASH
       (
             [c1]
       ) WITH ( BUCKET_COUNT = 1000)
)
WITH
(
       MEMORY_OPTIMIZED = ON ,
       DURABILITY = SCHEMA_AND_DATA
)
GO

 

We create a simple memory-optimized table with two columns and ask for 1000 buckets to store the hash index entries. After creating the table, we can check the number of buckets actually allocated and notice that it is not 1000: it has been rounded up to the next power of two.

 

SELECT
       OBJECT_NAME(object_id) AS table_name,
       name AS index_name,
       index_id,
       type_desc,
       [bucket_count]
FROM sys.hash_indexes
GO

 

[Screenshot: query result showing bucket_count = 1024]

 

This is very interesting, because using a power of two makes determining the position of a bucket pointer in the array very efficient. Generally, hash tables compute the index in the bucket array in two steps:

  • hash value = hash function (key)
  • position in the array = hash value % array size

The modulo operation in this second step can be relatively expensive, but because the array size is a power of two it can fortunately be replaced by a bitwise AND (hash value & (array size - 1)), which reduces the operation to a simple bit mask and improves speed. For example, with 1024 buckets, hash % 1024 equals hash & 1023. To verify, we can run a simple test with a console application in C#:

 

using System;
using System.Diagnostics;
 
namespace ConsoleApplication1
{
   class Program
   {
       static void Main(string[] args)
       {
           int i = 0;
           int result = 0;
           int maxloop = 1000000000;
 
           Stopwatch timer = new Stopwatch();
 
           // Modulo operator
           timer.Start();
 
           while (i < maxloop)
           {
               result = modulo_operation(i, 4);
              
               i++;
           }
 
           timer.Stop();
 
           Console.WriteLine("With modulo : (ms) ");
           Console.WriteLine(timer.ElapsedMilliseconds);
 
           i = 0;
 
           // Bitwise AND
           timer.Restart();
 
           while (i < maxloop)
           {
               result = bitwiseand(i, 4);
 
               i++;
           }
 
           timer.Stop();
 
           Console.WriteLine("With bitwise and : (ms)");
           Console.WriteLine(timer.ElapsedMilliseconds);
 
           Console.ReadLine();
       }
 
       // modulo
       public static int modulo_operation (int x, int y)
       {
           return x % y;
       }
 
       // bitwise
       public static int bitwiseand (int x, int y)
       {
           return (x & (y -1));
       }
   }
}

 

By executing this console application on my dev environment (Intel Core i7-3610QM 2.30 GHz), we can see that the bitwise AND function performs faster than the modulo function over 1 billion iterations:

 

[Screenshot: console output comparing the elapsed time of the modulo and bitwise AND versions]

 

We can now play with different combinations of data rows and hash index buckets to understand why the number of buckets is so important.


Test 1: Filling up the table with 500 distinct rows and 1024 buckets


declare @i int = 0;
 
while @i < 500
begin
       insert HekatonTableDurable values (@i, 'test_' + cast(@i as varchar(50)));
       set @i += 1;
end

 

The DMV sys.dm_db_xtp_hash_index_stats gives us useful information about bucket usage.
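
The screenshot below was produced by querying this DMV. The exact query is not included in the original post; a minimal version over its documented columns would look like this:

select
       total_bucket_count,
       empty_bucket_count,
       avg_chain_length,
       max_chain_length
from sys.dm_db_xtp_hash_index_stats
where object_id = object_id('dbo.HekatonTableDurable');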

 

[Screenshot: DMV output showing total_bucket_count = 1024, empty_bucket_count = 545, max_chain_length = 2]

 

The total number of buckets allocated for the hash index is 1024. We also see 545 empty buckets, and consequently 1024 - 545 = 479 buckets in use. With a perfect distribution of key values across the bucket array we would have 500 buckets in use, so some hash collisions have already occurred. The value of the max_chain_length column confirms this: at least one bucket has a row chain of length 2, in other words at least one bucket contains two rows with the same index hash. Notice that the load factor is 500 / 1024 ≈ 0.49.


Test 2: Filling up the memory-optimized table with a load factor of 3 (3072 rows for a bucket count of 1024)
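
The load script is not shown in the original post; a sketch in the same style as Test 1 (note that TRUNCATE TABLE is not supported on memory-optimized tables, hence the DELETE):

-- empty the table from the first test before inserting the new key range
delete from dbo.HekatonTableDurable;

declare @i int = 0;

while @i < 3072
begin
       insert dbo.HekatonTableDurable values (@i, 'test_' + cast(@i as varchar(50)));
       set @i += 1;
end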


 

[Screenshot: DMV output showing empty_bucket_count = 0, avg_chain_length = 3, max_chain_length = 5]


 

As expected, all buckets are used (empty_bucket_count = 0), and the average row chain length in the bucket array equals the load factor (avg_chain_length = 3). Some row chains have a length of 5, certainly due to hash collisions (max_chain_length = 5).


Test 3: Filling up the memory-optimized table with duplicate rows

 

Before filling up the table, some schema changes are required. Let's recreate the table with a second, non-unique hash index (ix_OrderDate):

 

IF OBJECT_ID('dbo.HekatonTableDurable') IS NOT NULL
       DROP TABLE [dbo].[HekatonTableDurable];
GO
 
CREATE TABLE [dbo].[HekatonTableDurable]
(
       [c1] [int] NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH ( BUCKET_COUNT = 1024) ,
       [c2] [char](100) COLLATE Latin1_General_100_BIN2 NOT NULL INDEX ix_OrderDate HASH WITH (BUCKET_COUNT=1024)
)
WITH
(
       MEMORY_OPTIMIZED = ON ,
       DURABILITY = SCHEMA_AND_DATA
)
GO

 

Then we can fill up the table with 500 rows that include duplicate values:

 

declare @i int = 0;
 
while @i < 500
begin
       insert dbo.HekatonTableDurable values (@i, case @i % 4 WHEN 0 THEN 'test_' + cast(@i as varchar(50)) ELSE 'test' END);
 
       set @i += 1;
end

 

Here is an overview of the number of duplicate rows in the table:

 

select
       c2,
       count(*)
from dbo.HekatonTableDurable
group by c2
order by count(*) desc

 

[Screenshot: query result showing the value 'test' appearing 375 times and each of the 125 'test_N' values appearing once]

 

We can now check how the buckets are used for the two hash indexes:

 

select
       object_name(i.object_id) as table_name,
       i.name as index_name,
       his.total_bucket_count,
       his.empty_bucket_count,
       his.avg_chain_length,
       his.max_chain_length
from sys.dm_db_xtp_hash_index_stats as his
       join sys.hash_indexes as i
             on his.index_id = i.index_id
                    and his.object_id = i.object_id
where i.object_id = object_id('dbo.HekatonTableDurable')

 

For the primary key hash index we retrieve the same bucket usage as in the first test (545 empty buckets, 479 in use). This is because the hash function used by SQL Server is deterministic: the same index key is always mapped to the same bucket in the array, and the keys are relatively well distributed. The second hash index shows a very different picture: only 119 buckets are in use to store 500 rows. Furthermore, the maximum row chain length is very high and corresponds exactly to the number of duplicate values we inserted (375). All duplicates of an index key hash to the same bucket, so they pile up in a single row chain.


Test 4: Filling up the table with an insufficient number of buckets for the hash indexes

 

For this last test, we will use a BUCKET_COUNT of 10, which SQL Server rounds up to 16 buckets, to store 8192 distinct rows (a load factor of 8192 / 16 = 512).

 

IF OBJECT_ID('dbo.HekatonTableDurable') IS NOT NULL
       DROP TABLE [dbo].[HekatonTableDurable];
GO
 
 
CREATE TABLE [dbo].[HekatonTableDurable]
(
       [c1] [int] NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH ( BUCKET_COUNT = 10) ,
       [c2] [char](100) COLLATE Latin1_General_100_BIN2 NOT NULL INDEX ix_OrderDate HASH WITH (BUCKET_COUNT=10)
)
WITH
(
       MEMORY_OPTIMIZED = ON ,
       DURABILITY = SCHEMA_AND_DATA
)
GO
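
As before, the load script is not reproduced in the post; a sketch consistent with the previous tests:

declare @i int = 0;

while @i < 8192
begin
       insert dbo.HekatonTableDurable values (@i, 'test_' + cast(@i as varchar(50)));
       set @i += 1;
end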

 

[Screenshot: DMV output showing empty_bucket_count = 0 and avg_chain_length close to max_chain_length]

 

In this case, no bucket is left empty (empty_bucket_count = 0) and the values of the avg_chain_length and max_chain_length columns are pretty close. Long row chains on every bucket are a clear indication of a lack of buckets.

In this post, we have seen the basic concepts of hash tables, the importance of correctly configuring the number of buckets, and how to use the dynamic management view sys.dm_db_xtp_hash_index_stats to get useful statistics about hash indexes.
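
As a practical takeaway, Microsoft's recommendation (at least twice the number of distinct index key values) can be turned into a small sizing script. This is only a sketch, and it assumes the key data already exists in a table we can sample:

declare @distinct_values bigint;
declare @bucket_count bigint = 1;

-- count the distinct values of the future hash index key
select @distinct_values = count(distinct c1)
from dbo.HekatonTableDurable;

-- double the distinct count, then round up to the next power of two,
-- which matches how SQL Server rounds the BUCKET_COUNT anyway
while @bucket_count < @distinct_values * 2
       set @bucket_count *= 2;

select @distinct_values as distinct_values, @bucket_count as suggested_bucket_count;

The resulting value can then be used directly in the WITH (BUCKET_COUNT = ...) clause.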


David Barbarin is a consultant at dbi services. He has more than ten years of experience with Microsoft solutions and specializes in SQL Server technologies and associated topics such as installation, migration, security audits, performance troubleshooting, and high-availability architectures. He also has many years of experience in .NET development, SSIS package deployment, and database design in several sectors, including retail and health. David is a SQL Server MVP (since 2010), a Microsoft Certified Master (MCM) for SQL Server, and a Microsoft Certified Trainer (MCT). He holds a BTS in electronics from France and a degree in computer science from the CNAM in Lyon. His industry experience covers the public sector, financial services/banking, automotive, health, IT, the watch industry, and more.

Comments

  • Guest
    praveen rathore Friday, 12 September 2014

    [Re-paste of the console application code from the post, garbled by the comment form stripping the "<" characters, e.g. "while (i maxloop)".]
    syntax error....
    correct me if i am wrong
