Defining a UCA Collation using LDML Syntax


To add a UCA collation for a Unicode character set without recompiling MySQL, use the following procedure. If you are unfamiliar with the LDML rules used to describe the collation's sort characteristics, see , "LDML Syntax Supported in MySQL".

The example adds a collation named utf8_phone_ci to the utf8 character set. The collation is designed for a scenario involving a Web application for which users post their names and phone numbers. Phone numbers can be given in very different formats:

+7-12345-67
+7-12-345-67
+7 12 345 67
+7 (12) 345 67
+71234567

The problem raised by dealing with these kinds of values is that the varying permissible formats make searching for a specific phone number very difficult. The solution is to define a new collation that reorders punctuation characters, making them ignorable.

  1. Choose a collation ID, as shown in , "Choosing a Collation ID". The following steps use an ID of 1029.
  2. To modify the Index.xml configuration file. This file will be located in the directory named by the character_sets_dir system variable. You can check the variable value as follows, although the path name might be different on your system:
    mysql> SHOW VARIABLES LIKE 'character_sets_dir';
    +--------------------+-----------------------------------------+
    | Variable_name | Value |
    +--------------------+-----------------------------------------+
    | character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
    +--------------------+-----------------------------------------+
    
  3. Choose a name for the collation and list it in the Index.xml file. In addition, you'll need to provide the collation ordering rules. Find the <charset> element for the character set to which the collation is being added, and add a <collation> element that indicates the collation name and ID, to associate the name with the ID. Within the <collation> element, provide a <rules> element containing the ordering rules:
    <charset name='utf8'>
     ...
     <collation name='utf8_phone_ci' id='1029'>
     <rules>
     <reset>\u0000</reset>
     <i>\u0020</i> <!-- space -->
     <i>\u0028</i> <!-- left parenthesis -->
     <i>\u0029</i> <!-- right parenthesis -->
     <i>\u002B</i> <!-- plus -->
     <i>\u002D</i> <!-- hyphen -->
     </rules>
     </collation>
     ...
    </charset>
    
  4. If you want a similar collation for other Unicode character sets, add other <collation> elements. For example, to define ucs2_phone_ci, add a <collation> element to the <charset name='ucs2'> element. Remember that each collation must have its own unique ID.
  5. Restart the server and use this statement to verify that the collation is present:
    mysql> SHOW COLLATION LIKE 'utf8_phone_ci';
    +---------------+---------+------+---------+----------+---------+
    | Collation | Charset | Id | Default | Compiled | Sortlen |
    +---------------+---------+------+---------+----------+---------+
    | utf8_phone_ci | utf8 | 1029 | | | 8 |
    +---------------+---------+------+---------+----------+---------+
    

Now test the collation to make sure that it has the desired properties.

Create a table containing some sample phone numbers using the new collation:

mysql> CREATE TABLE phonebook (
 ->  name VARCHAR(64),
 ->  phone VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_phone_ci
 -> );
Query OK, 0 rows affected (0.09 sec)
mysql> INSERT INTO phonebook VALUES ('Svoj','+7 912 800 80 02');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Hf','+7 (912) 800 80 04');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Bar','+7-912-800-80-01');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Ramil','(7912) 800 80 03');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Sanja','+380 (912) 8008005');
Query OK, 1 row affected (0.00 sec)

Run some queries to see whether the ignored punctuation characters are in fact ignored for sorting and comparisons:

Retornar