I. What are Character Sets and Collations?
A character set can be understood as the “encoding rule” MySQL uses to store characters. Familiar examples are utf8mb4 and latin1; each defines how MySQL converts text (Chinese, English, emojis, and so on) into binary data for storage.
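To make the “encoding rule” idea concrete, here is a minimal Python sketch (Python rather than MySQL, purely as an illustration) of how many bytes different characters need. It assumes that Python's "utf-8" codec corresponds to what MySQL calls utf8mb4 and that "latin-1" corresponds to latin1; the sample characters are arbitrary.

```python
# How many bytes does each character occupy once encoded?
samples = ["A", "中", "😀"]

# Python's "utf-8" codec is full UTF-8 (up to 4 bytes per character),
# which is what MySQL calls utf8mb4.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(utf8_lengths)  # {'A': 1, '中': 3, '😀': 4}

# latin1 is a single-byte character set: it can encode "A" but not the others.
latin1_ok = {}
for ch in samples:
    try:
        ch.encode("latin-1")
        latin1_ok[ch] = True
    except UnicodeEncodeError:
        latin1_ok[ch] = False
print(latin1_ok)  # {'A': True, '中': False, '😀': False}
```

The same text therefore takes a different amount of space, or cannot be stored at all, depending on the character set chosen.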
A collation is the “comparison rule” of a character set: it determines how text is compared and sorted. For example, utf8mb4_general_ci compares text case-insensitively (the _ci suffix), so 'a' and 'A' are treated as equal, while utf8mb4_bin compares the binary value of each character and is therefore case-sensitive.
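The difference between a case-insensitive and a binary collation can be sketched in Python. This is only a rough analogue, not a model of MySQL's actual collation tables: case folding stands in for a _ci collation and raw byte comparison stands in for a _bin collation. The word list is made up for the example.

```python
words = ["banana", "Apple", "cherry", "Banana", "apple"]

# Rough analogue of a case-insensitive (_ci) collation: compare after case folding.
ci_order = sorted(words, key=str.casefold)

# Rough analogue of a binary (_bin) collation: compare the encoded bytes.
bin_order = sorted(words, key=lambda w: w.encode("utf-8"))

print(ci_order)   # ['Apple', 'apple', 'banana', 'Banana', 'cherry']
print(bin_order)  # ['Apple', 'Banana', 'apple', 'banana', 'cherry']

# Equality follows the same rule as ordering.
ci_equal = "abc".casefold() == "ABC".casefold()              # True
bin_equal = "abc".encode("utf-8") == "ABC".encode("utf-8")   # False
```

In the binary ordering every uppercase letter sorts before every lowercase letter, because uppercase letters have smaller byte values.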
II. Why Configure Character Sets and Collations?
Improper configuration of character sets or collations can lead to garbled text or incorrect sorting:
- Garbled text: The latin1 character set cannot represent Chinese characters, so storing Chinese in a latin1 column loses or mangles the data, and the text comes back garbled when read.
- Incorrect sorting: With the wrong collation, names may not sort in the order users expect. For example, “Zhang San” (张三) may be placed before “Li Si” (李四) even though sorting by pinyin initials would put Li first. Always verify that the order matches expectations.
- Compatibility issues: The legacy utf8 character set (an alias for utf8mb3) stores at most 3 bytes per character, so it covers only part of Unicode and excludes emojis and some rare characters, while utf8mb4 provides full Unicode support. Use utf8mb4 by default.
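One common way garbled text arises is that bytes written in one encoding are read back in another. The Python sketch below illustrates that mismatch under the assumption that UTF-8 bytes are misread as latin1; it is not a reproduction of MySQL's internal conversion, which depending on settings may instead reject the value or store '?' characters that cannot be recovered.

```python
original = "中文"

# The application sends UTF-8 bytes...
stored_bytes = original.encode("utf-8")

# ...but something along the way interprets them as latin1.
garbled = stored_bytes.decode("latin-1")
print(len(original), len(garbled))  # 2 6

# The bytes themselves are intact, so reversing the wrong step recovers the text.
recovered = garbled.encode("latin-1").decode("utf-8")
```

Two Chinese characters turn into six meaningless latin1 characters, which is the typical appearance of this kind of garbling.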
III. Configuration Hierarchy of MySQL Character Sets and Collations
MySQL character sets and collations have multiple effective levels, with priorities from highest to lowest: Column-level > Table-level > Database-level > Server-level. If no specific settings exist at the column, table, or database levels, the server’s default configuration is used.
- Server-level: Globally effective, modifiable via configuration files (e.g., my.cnf), controlling the database server’s default character set.
- Database-level: Specified when creating a database, applying to all tables in the database (unless overridden by table or column settings).
- Table-level: Specified when creating a table, applying to all columns in the table (unless overridden by column settings).
- Column-level: Specified individually for specific columns when creating a table, with the highest priority, overriding table-level configurations.
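The priority order above can be modeled as “the most specific explicit setting wins”. The following Python sketch is a conceptual model only; the function name and parameters are hypothetical, and MySQL itself resolves these defaults when a database, table, or column is created rather than on every query, so changing a database default later does not alter existing tables.

```python
from typing import Optional


def effective_charset(server: str,
                      database: Optional[str] = None,
                      table: Optional[str] = None,
                      column: Optional[str] = None) -> str:
    """Return the most specific character set that was explicitly given."""
    # Check from highest priority (column) to lowest (database).
    for level in (column, table, database):
        if level is not None:
            return level
    # Nothing more specific was set, so the server default applies.
    return server


print(effective_charset("latin1"))                                    # latin1
print(effective_charset("latin1", database="utf8mb4"))                # utf8mb4
print(effective_charset("latin1", database="utf8mb4", column="gbk"))  # gbk
```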
IV. How to View Current Configurations?
Newcomers should first learn to check current configurations to avoid errors. Common commands:
- View server-level character sets:
SHOW VARIABLES LIKE 'character_set_%';
Common results: character_set_server (the server default; utf8mb4 in MySQL 8.0, latin1 in MySQL 5.7 and earlier), character_set_database (the character set of the current default database), etc.
- View server-level collations:
SHOW VARIABLES LIKE 'collation_%';
Common results: collation_server (the server’s default collation, e.g., utf8mb4_general_ci; a stock MySQL 8.0 installation defaults to utf8mb4_0900_ai_ci).
- View database character set:
SHOW CREATE DATABASE your_database;
Results show the character set and collation specified when the database was created.
- View table character set:
SHOW TABLE STATUS LIKE 'your_table';
The Collation column in the results shows the table’s collation.
- View column character set:
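SHOW FULL COLUMNS FROM your_table;
The Collation column shows each column’s collation, and the prefix of the collation name (e.g., utf8mb4 in utf8mb4_general_ci) is the column’s character set. Columns inherit the table-level setting if none is specified. Note that plain SHOW COLUMNS, without FULL, does not display this information.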
V. Essential Configuration Methods for Beginners
1. Use utf8mb4 Character Set
utf8mb4 is MySQL’s full Unicode encoding (supports Chinese, English, emojis, rare characters), while the legacy utf8 (actually utf8mb3) only supports characters of up to 3 bytes and cannot store emojis or other 4-byte characters.
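The 3-byte limit can be checked ahead of time. Below is a minimal Python sketch of such a check; the helper name fits_utf8mb3 is hypothetical, and it assumes that a character fits in utf8mb3 exactly when its UTF-8 encoding is at most 3 bytes long.

```python
def fits_utf8mb3(text: str) -> bool:
    """True if every character encodes to at most 3 UTF-8 bytes."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


print(fits_utf8mb3("Hello, 世界"))  # True
print(fits_utf8mb3("Nice 😀"))      # False
```

Text for which the check returns False needs a utf8mb4 column.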
- Server-level configuration (modify the configuration file):
In my.cnf (Linux) or my.ini (Windows), add:
[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_general_ci
Restart the MySQL service after saving, e.g., systemctl restart mysql on Linux or net stop mysql && net start mysql on Windows (the service name may differ on your installation, such as mysqld).
2. Database-level Configuration
Specify the character set when creating the database (recommended):
CREATE DATABASE your_database
CHARACTER SET utf8mb4
COLLATE utf8mb4_general_ci;
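To modify an existing database (this changes only the default for tables created afterwards; existing tables keep their current character set):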
ALTER DATABASE your_database
CHARACTER SET utf8mb4
COLLATE utf8mb4_general_ci;
3. Table-level Configuration
Specify the character set when creating the table (this overrides the database default):
CREATE TABLE your_table (
id INT,
name VARCHAR(50)
) ENGINE=InnoDB
DEFAULT CHARACTER SET utf8mb4
COLLATE utf8mb4_general_ci;
For case-sensitive sorting, use utf8mb4_bin (compares strictly by binary values):
COLLATE utf8mb4_bin
4. Column-level Configuration (Special Cases)
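For columns that need a different collation from the rest of the table (e.g., an English name that must be compared case-sensitively next to a Chinese name that uses the table default):
CREATE TABLE your_table (
id INT,
-- Column-level override: binary (case-sensitive) comparison
english_name VARCHAR(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin,
-- No override: inherits the table default
chinese_name VARCHAR(50)
) ENGINE=InnoDB;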
Note: Column-level configurations are only necessary for overriding table-level settings; prefer table-level uniformity.
VI. Common Issues and Solutions
- Garbled Text:
  - Cause: Mismatch between the column/table/database character set and the encoding of the inserted data (e.g., storing Chinese with latin1).
  - Solution: Ensure data sources (e.g., application code) use utf8mb4, and verify that the character set is consistent across all levels.
- Incorrect Chinese Sorting:
  - Scenario: utf8mb4_general_ci may not match expectations for Chinese pinyin sorting (e.g., the order of “Li” and “Wang”).
  - Solution: utf8mb4_unicode_ci sorts more precisely than general_ci for complex languages such as German and French, at slightly lower performance, but it does not sort Chinese by pinyin either. On MySQL 8.0, the utf8mb4_zh_0900_as_cs collation is intended for Chinese ordering; test it against the order you expect.
- Emoji Not Displaying:
  - Cause: Using the legacy utf8 (utf8mb3) character set.
  - Solution: Switch to utf8mb4, which supports all Unicode characters including emojis.
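Why Chinese sorting can surprise users is easy to see outside MySQL. The Python sketch below compares code point order, which approximates what a collation not designed for Chinese does, with pinyin order. It is not an exact model of any MySQL collation, and the PINYIN lookup table is a hypothetical stand-in hard-coded for these two names.

```python
names = ["张三", "李四"]  # Zhang San, Li Si

# Code point order: 张 is U+5F20 and 李 is U+674E, so Zhang San comes first.
by_code_point = sorted(names)
print(by_code_point)  # ['张三', '李四']

# Hypothetical pinyin lookup table, hard-coded for this example only.
PINYIN = {"张三": "zhang san", "李四": "li si"}
by_pinyin = sorted(names, key=PINYIN.__getitem__)
print(by_pinyin)  # ['李四', '张三']
```

A user who expects pinyin order sees Li Si first, so a collation that orders by code point looks wrong even though it is behaving as designed.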
VII. Summary and Best Practices
- Always use utf8mb4 instead of the legacy utf8.
- Choose utf8mb4_general_ci or utf8mb4_unicode_ci for collation: the former is faster, the latter is more precise (select based on needs).
- Avoid column-level overrides unless necessary; table-level uniformity suffices.
- Regularly check configurations: Use SHOW VARIABLES and SHOW CREATE DATABASE / TABLE to confirm that the character sets are in effect.
Mastering character set and collation configuration is a foundational MySQL skill that prevents garbled data and sorting errors. Remember: Uniform character sets and collations are critical for data consistency!