Defining a UCA Collation using LDML Syntax
To add a UCA collation for a Unicode character set without recompiling MySQL, use the following procedure. If you are unfamiliar with the LDML rules used to describe the collation's sort characteristics, see , "LDML Syntax Supported in MySQL".
The example adds a collation named utf8_phone_ci
to the utf8
character set. The collation is designed for a scenario involving a Web application for which users post their names and phone numbers. Phone numbers can be given in very different formats:
+7-12345-67 +7-12-345-67 +7 12 345 67 +7 (12) 345 67 +71234567
The problem raised by dealing with these kinds of values is that the varying permissible formats make searching for a specific phone number very difficult. The solution is to define a new collation that reorders punctuation characters, making them ignorable.
- Choose a collation ID, as shown in , "Choosing a Collation ID". The following steps use an ID of 1029.
- To modify the
Index.xml
configuration file. This file will be located in the directory named by thecharacter_sets_dir
system variable. You can check the variable value as follows, although the path name might be different on your system:mysql>
SHOW VARIABLES LIKE 'character_sets_dir';
+--------------------+-----------------------------------------+ | Variable_name | Value | +--------------------+-----------------------------------------+ | character_sets_dir | /user/local/mysql/share/mysql/charsets/ | +--------------------+-----------------------------------------+ - Choose a name for the collation and list it in the
Index.xml
file. In addition, you'll need to provide the collation ordering rules. Find the<charset>
element for the character set to which the collation is being added, and add a<collation>
element that indicates the collation name and ID, to associate the name with the ID. Within the<collation>
element, provide a<rules>
element containing the ordering rules:<charset name='utf8'> ... <collation name='utf8_phone_ci' id='1029'> <rules> <reset>\u0000</reset> <i>\u0020</i> <!-- space --> <i>\u0028</i> <!-- left parenthesis --> <i>\u0029</i> <!-- right parenthesis --> <i>\u002B</i> <!-- plus --> <i>\u002D</i> <!-- hyphen --> </rules> </collation> ... </charset>
- If you want a similar collation for other Unicode character sets, add other
<collation>
elements. For example, to defineucs2_phone_ci
, add a<collation>
element to the<charset name='ucs2'>
element. Remember that each collation must have its own unique ID. - Restart the server and use this statement to verify that the collation is present:
mysql>
SHOW COLLATION LIKE 'utf8_phone_ci';
+---------------+---------+------+---------+----------+---------+ | Collation | Charset | Id | Default | Compiled | Sortlen | +---------------+---------+------+---------+----------+---------+ | utf8_phone_ci | utf8 | 1029 | | | 8 | +---------------+---------+------+---------+----------+---------+
Now test the collation to make sure that it has the desired properties.
Create a table containing some sample phone numbers using the new collation:
mysql>CREATE TABLE phonebook (
->name VARCHAR(64),
->phone VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_phone_ci
->);
Query OK, 0 rows affected (0.09 sec) mysql>INSERT INTO phonebook VALUES ('Svoj','+7 912 800 80 02');
Query OK, 1 row affected (0.00 sec) mysql>INSERT INTO phonebook VALUES ('Hf','+7 (912) 800 80 04');
Query OK, 1 row affected (0.00 sec) mysql>INSERT INTO phonebook VALUES ('Bar','+7-912-800-80-01');
Query OK, 1 row affected (0.00 sec) mysql>INSERT INTO phonebook VALUES ('Ramil','(7912) 800 80 03');
Query OK, 1 row affected (0.00 sec) mysql>INSERT INTO phonebook VALUES ('Sanja','+380 (912) 8008005');
Query OK, 1 row affected (0.00 sec)
Run some queries to see whether the ignored punctuation characters are in fact ignored for sorting and comparisons: