Defining a UCA Collation using LDML Syntax

To add a UCA collation for a Unicode character set without recompiling MySQL, use the following procedure. If you are unfamiliar with the LDML rules used to describe the collation's sort characteristics, see , "LDML Syntax Supported in MySQL".

The example adds a collation named utf8_phone_ci to the utf8 character set. The collation is designed for a scenario involving a Web application for which users post their names and phone numbers. Phone numbers can be given in very different formats:

+7-12345-67
+7-12-345-67
+7 12 345 67
+7 (12) 345 67
+71234567

The problem raised by dealing with these kinds of values is that the varying permissible formats make searching for a specific phone number very difficult. The solution is to define a new collation that reorders punctuation characters, making them ignorable.

Choose a collation ID, as shown in , "Choosing a Collation ID". The following steps use an ID of 1029.

To modify the Index.xml configuration file. This file will be located in the directory named by the character_sets_dir system variable. You can check the variable value as follows, although the path name might be different on your system:

mysql> SHOW VARIABLES LIKE 'character_sets_dir';
+--------------------+-----------------------------------------+
| Variable_name | Value |
+--------------------+-----------------------------------------+
| character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
+--------------------+-----------------------------------------+

Choose a name for the collation and list it in the Index.xml file. In addition, you'll need to provide the collation ordering rules. Find the <charset> element for the character set to which the collation is being added, and add a <collation> element that indicates the collation name and ID, to associate the name with the ID. Within the <collation> element, provide a <rules> element containing the ordering rules:
```
<charset name='utf8'>
 ...
 <collation name='utf8_phone_ci' id='1029'>
 <rules>
 <reset>\u0000</reset>
 \u0020 
 \u0028 
 \u0029 
 \u002B 
 \u002D 
 </rules>
 </collation>
 ...
</charset>
```
If you want a similar collation for other Unicode character sets, add other <collation> elements. For example, to define ucs2_phone_ci, add a <collation> element to the <charset name='ucs2'> element. Remember that each collation must have its own unique ID.

Restart the server and use this statement to verify that the collation is present:

mysql> SHOW COLLATION LIKE 'utf8_phone_ci';
+---------------+---------+------+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------+---------+------+---------+----------+---------+
| utf8_phone_ci | utf8 | 1029 | | | 8 |
+---------------+---------+------+---------+----------+---------+

Now test the collation to make sure that it has the desired properties.

Create a table containing some sample phone numbers using the new collation:

mysql> CREATE TABLE phonebook (
 ->  name VARCHAR(64),
 ->  phone VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_phone_ci
 -> );
Query OK, 0 rows affected (0.09 sec)
mysql> INSERT INTO phonebook VALUES ('Svoj','+7 912 800 80 02');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Hf','+7 (912) 800 80 04');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Bar','+7-912-800-80-01');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Ramil','(7912) 800 80 03');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Sanja','+380 (912) 8008005');
Query OK, 1 row affected (0.00 sec)

Run some queries to see whether the ignored punctuation characters are in fact ignored for sorting and comparisons:

Defining a UCA Collation using LDML Syntax

Retornar