dbi services Blog
Welcome to the dbi services Blog! This IT blog focuses on database, middleware, and OS technologies such as Oracle, Microsoft SQL Server & SharePoint, EMC Documentum, MySQL, PostgreSQL, Sybase, Unix/Linux, etc. The dbi services blog represents the view of our consultants, not necessarily that of dbi services. Feel free to comment on our blog postings.
Collation changes in Microsoft SQL Server are not that simple
Lots of DBAs are changing the collation of their SQL Server databases without knowing exactly what they do and the consequences of this change. And as they don't know what they do, they often only perform the small and easiest part of it, changing only the collation on the database. In this blog I will explain the methodology that allows to change the collation of all objects (not only the database) and the consequences of these changes on the data. You will see that this is not so easy, it can generate lots of problems and in some cases, information can be lost (wrong convertions).
And this is not all, if you mix collations between objects, databases and instances, you can expect very bad query performances. This is due to internal collation convertions, but this is also a very big topic and will not be explained here in detail. So keep in mind the best practices, try to keep the same collation definition on all objects, databases and instance.
Let's start with a few questions:
What does a Collation exactly define?
The collation determines the rules SQL Server takes to sort and to compare the characters in an object (generaly a table column). But it will also define the underline “code page” used to store physically the data of this object.
SQL Server 2008R2 supports more than 2000 different collations, but these are only stored in 17 different code pages.
Why are different code pages used?
For performance and space (disk-memory) reasons, 16 code pages are single-byte (called non-Unicode) and only one is multi-bytes (Unicode using UCS-2).
Only column types "char", "varchar" and "text" can use non-Unicode code pages (only one of the 16 single-byte code pages).
That means that all language characters (and signs) will use only one byte to be stored on disk (or in the memory). But this has a drawback, because a single-byte can only store 256 (2^8) different values, which means that only 256 different characters can be defined (one value=one character).
For a standard language or when you mix different similar languages (same code page) this is not a problem, but this becomes a problem as soon as you need more than 256 different characters in the different languages you want to use (that use different code pages). In this case only the Unicode (multi-bytes) can resolve the problem.
This is also the reason why the 17th code page (CodePage 0) is multi-byte and is always used for languages that include more than 256 different characters. Keep also in mind that each different character that is stored must be returned exactly as it was given. If this is not the case, then we will lose information...
This is why the underlining code page must make a difference between all characters, and keep the upper/lower case difference and all stored accents.
This means that all code pages (storage) are always case sensitiv and accent sensitiv!
The column types "nchar", "nvarchar" and "ntext" are always using the Unicode code page. And as said before, this is also used for "char", "varchar" and text when a collation uses the code page "0".
For Unicode, SQL-Server uses a “UCS-2” code page (Universal Character Set, using only 2 bytes), the underlining code page is identical for all collations and version of SQL Server (since version 7.0) and data do not need to be convert when the collation is changed as this is Unicode and is indipendent from the collation.
In this particular case the collation is only used for sort and comparison rules.
The drawback of multi-bytes is very easy to imagine, because each character will be stored in 2 bytes, this will need more disk and memory space and more CPU to read, write and compare the characters.
How to find out which collation uses wich code page?
To list all collations supported in a SQL-Server instance with their corresponding code pages use the following query:
> SELECT name, COLLATIONPROPERTY(name, 'CodePage') AS CodePage, description
Here below part of the result of the query (2397 rows), with the different collations we could use in Western Europe:
Definition of the collation name
As you can see, the first part of the collation name design the language that will be used. This will also define the underlining code page. And the last part is only used by the sort and comparison rules.
As said before, the sensitivity or insensitivity will only be used in rules (sorts or comparisons, also used in the indexes):
Case sensitiv "_CS" or Case Insensitiv "_CI"
If the Case Sensitive rule (_CS) is used:
- in comparisons, the upper case character like "A" is not equal to the lower case character like "a"
For example if you search all name starting with "A" you will not get all name starting with "a"!
- and in sort operations all upper cases characters come before all lower cases characaters.
"A" to "Z" and only after "a" to "z".
If the Case Insensitive rules (_CI) is used:
- in comparisons, the character "a" will be treated as "A"
- and in sort operations "a" will be next to the "A"
Accent sensitiv "_AS" or Accent Insensitiv "_AI"
If the Accent Sensitive rule (_AS) is used:
- in comparisons the same character with different accents are not equal, like "a", "à", "â", etc
- and in sort operations all accent characters will come after all other characters.
If the Accent Insensitive rules (_AI) is used:
- in comparisons the accent character like "à" will be treated as a non accent character like "a"
- and in sort operations the accent character like "à" will be next to the non accent character like "a"
Kana Sensitiv "_KS" (or Kana Insensitiv for the default)
This is only used for Japanese, kana characters Hiragana and Katakana are treated differently with "_KS".
Width Sensitiv "_WS" (or Width Insensitiv for the default)
When a single-byte character (Non-Unicode) and the same character represented as a multi-byte character (Unicode) are treated differently then it is width sensitive.
Some more important information about the collations
In the last versions of SQL Server, two different types of collations can be used: “SQL Server” or “Windows” collations.
SQL Server collations with names starting with SQL_* were used in the old SQL Server versions (until 2005).
These collations have different rules for Unicode and non-Unicode data. This means that the sort of comparison operations on Unicode and non-Unicode data are giving different results!
Many Windows collations have been introduced in SQL Server 2008 to support the Windows system locales available for the operating system on which SQL Server instances are installed. These collations provide sorting and comparison behaviors consistent with other applications that are running on Windows computers using the same local settings.
The Windows collations also provide the same rules (sort and comparison) for Unicode and non-Unicode data, which ensure data consistency.
Where is the collation used and what are the consequences
The collation is defined on 3 different levels: instance, database, and object (generaly table-column).
The collation specified during the instance creation will be used for all system databases and their objects. This will also be used as default value when new a databases is created.
If a different collation from the instance collation is specified during the database creation, this will be used for all system objects stored in this particular database, this will also be used as default value for all objects in this database.
If a different collation from the database collation is specified for a column during the table creation, this will be used on this particular column, but must also be used by the indexes using this column.
Change the collation
Now that you have the big picture and the details about the collations, we can start to explain what a collation change will do and what are the consequences. As you have seen before, we can change the collation on 3 different levels: instance, database or column. Changing the collation on the instance level is very complicated, because all system databases and all objects and columns in these system-databases have to change their collation...
It needs a rebuild of the master database and many other steps are needed. We strongly recommand to NOT do that ...
We prefer to recreate the instance with a new collation, reattach the user-databases and recreate all logins. This is faster and will avoid lots of problems!
Attention, as explained before, this will only change the collation on the instance, all system databases, and their objects (objects and columns). It will not change the collation on the user-databases and their objects!
Changing the collation on the database level seems very easy, but this will only change the collation on all internal objects (functions, procedures, etc) and on system-objects of the database! It will not change the user-table-columns and their indexes.
If you want to change the collation on a table-column, all indexes referencing this column must also be changed, because the sort order of the index has to respect the new collation that will be applied. This cannot be done automatically by SQL Server, the only method here is to drop all concerned indexes before the column-collation change, and then to recreated them.
Consequences and problems with collation changes
I want to explain the two different kinds of consequences and problems we face when a collation is changed. The first kind is only related to the change of rules (sort - comparison), where no "code page" change will happen.
Here is an example: when the collation is changed from "French_100_CI_AS" to "Latin1_General_CI_AS", both of them use the code page "1252".
The second kind is more complex, because this will not only change the rules (as the first kind), but it will also implicitely change the underlining code page and convert the included data!
Here is an example: when the collation is changed from "Latin1_General_CI_AS" to "SQL_Latin1_General_CP850_CI_AS" on a column, all characters stored in code page "1252" have to be converted in the second code page "850".
I will start with the second kind, but will not go very deep and in details here. As you will see, this strongly depend on the stored data and the convertion that will take place. I will just give you here below a very small example to show you the complexity of this kind of conversion and maybe give you some usefull solutions.
As an example I will take a collation change from "Latin1_General_CI_AS" to "SQL_Latin1_General_CP850_CI_AS". In one of the original tables, I had a "char" column containing lots of "€" characters, this character is defined with the ASCII value 128 in the code page 1252.
In the new collation "SQL_Latin1_General_CP850_CI_AS" with code page 850, the character "€" is not defined at all! SQL Server will automatically convert all values without sending any warning. And this character will be converted in a "not defined" character under code page 850, in this particular case, the not-defined character in code page 850 uses the ASCII value 63, and this value corresponds to the character "?".
This means that all "€" characters will be converted in "?" and afterwards it will be impossible to make out a difference between them, because this information is lost forever. A possible solution to this problem can be to change all “€” characters with the 4 characters “Euro” before the collation change!
As said, I will stop the example here with the conclusion that if a underlining code page change is needed, a deep analysis must be made, to see all undefined character conversions and check if these are present in the columns or objects that will be converted. Attention: You also need to check the system-objects of your database.
Let's go back to the first kind, when only the rules are changing. I will start here with the lower level, the collation change on a table-column. As soon as one object depending on this column exists, the change will not be allowed by SQL Server.
You must first drop the dependent object (that does not contains data) before changing the column collation. It is only when the collation has been changed that the dependent object can be recreated. These objects can be indexes, functions, procedures, triggers, constraints, user defined types, primary keys, foreign keys, etc.
And on the higher level, the collation on the database can only be changed when:
- all system-table-columns can be changed, this includes that all depend objects can be dropped and recreated.
- and that all database objects (procedures, functions, constraints, etc) can be dropped and recreated with the new collation.
This will be done automatically, but very often leads to errors as soon as procedures, functions, or constraints are used in the database. The "ALTER DATABASE COLLATE" command that is used to change the collation on the database will internaly list all objects that need to be changed and will analyze the dependencies.
If the analysis doesn't see problem, the command will execute the change. If some dependencies lead to problems, the command will return all errors (not only the first one) on all objects that cannot be changed and gives also some information on the problem.
As for the collation change on columns, the problematic objects need to be dropped before the change and should be recreated afterwards. As you can see, depending on the complexity of your database, a lot of prep work (dropping and recreating objects) must be performed before the collation change.
But that is not all, because you can also encounter some problems while recreating the objects or adding/losing some relations (PK-FK).
It is easier if I give you some exemples:
We have a table with a varchar column containing the following values: éric; Eric; Éric; eric; eric€; eric?
On this column, we can have a unique key index (or Primary key index) if the collation is "_CS" and "_AS".
But if the collation is "_CI" or "_AI", the same index cannot be created. This is due to the fact that your index will see duplicated values.
In the collation "_CI" "eric" is equal to "Eric" and "éric" is equal to "Éric", and in the collation "_AI" "eric" is equal to "éric" and "Eric" is equal to "Éric".
In addition, this problem can also come when the undelining code page is changed, because there can be some suprises during the conversion, such as "eric€" converted to "eric?" and other duplicated values.
And this is not all, because all joins between tables can now provide different results.
Take two different tables with the values of the example I gave before and make a join on these two tables on the same column and you will see different results depending on the collation!
If you have _CS_AS, you will get only 4 rows back, where the value is exactly identical.
But with _CI_AI, you will get 16 rows back! A cartesian product of the 4 different values, but all 4 are identical if you take them "Case and Accent Insensitiv"!
This can generate some not so funny behaviors in your application!
After all these explanations, the conclusion seems simple: Changing the collation is not a 5 minutes job. Especially if you have lots of tables and objects in your database. The change has to be well tested and many application tests should be performed.
In the next blog postings, I will write about the methodology used to change the collation properly and I will give some examples of problems that you can test by yourself.