I. What are Character Sets and Collations?

A character set can be understood as the “encoding rule” MySQL uses to store characters. Familiar examples are utf8mb4 and latin1; each defines how MySQL converts text (Chinese, English, emojis, and so on) into binary data for storage.

A collation is the “sorting rule” of a character set: it determines how text is compared and sorted. For example, utf8mb4_general_ci compares text case-insensitively (the _ci suffix), while utf8mb4_bin compares the binary value of each character and is therefore case-sensitive.
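As a small illustration of the difference, the query below compares the same two letters under both collations. It is only a sketch: the column aliases are made up for this example, and the _utf8mb4 introducer is used so the result does not depend on how your connection is configured.

```sql
-- Compare 'a' with 'A' under two collations.
-- The _utf8mb4 introducer pins the literals' character set, so the COLLATE
-- clause is valid whatever the connection character set is.
SELECT
  _utf8mb4'a' = _utf8mb4'A' COLLATE utf8mb4_general_ci AS equal_general_ci,  -- 1 (case-insensitive)
  _utf8mb4'a' = _utf8mb4'A' COLLATE utf8mb4_bin        AS equal_bin;         -- 0 (case-sensitive)
```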

II. Why Configure Character Sets and Collations?

An improperly configured character set or collation leads to garbled text or incorrect sorting:
- Garbled text: latin1 cannot represent Chinese characters, so Chinese stored in a latin1 column is either replaced with “?” or saved as bytes that are later decoded incorrectly, and it reads back garbled.
- Incorrect sorting: The wrong collation can return rows in an unexpected order, for example “Zhang San” before “Li Si” when you expected pinyin order, so verify that the order matches expectations.
- Compatibility issues: The legacy utf8 character set supports only part of Unicode (no emojis or rare characters), while utf8mb4 provides full Unicode support. Use utf8mb4 by default.
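To see why a character set mismatch garbles text, it can help to read the same bytes under two character sets. The sketch below uses a hex literal so that it does not depend on your client encoding; the column aliases are made up for this example.

```sql
-- 0xE4B8AD is the UTF-8 encoding of the character 中 (U+4E2D).
SELECT
  CHAR_LENGTH(_utf8mb4 0xE4B8AD) AS chars_read_as_utf8mb4,  -- 1: one Chinese character
  CHAR_LENGTH(_latin1  0xE4B8AD) AS chars_read_as_latin1;   -- 3: three unrelated Latin-1 characters
```

The bytes are identical in both cases; only the interpretation changes, which is what happens when the writer and the reader disagree on the character set.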

III. Configuration Hierarchy of MySQL Character Sets and Collations

MySQL character sets and collations have multiple effective levels, with priorities from highest to lowest: Column-level > Table-level > Database-level > Server-level. If no specific settings exist at the column, table, or database levels, the server’s default configuration is used.

  • Server-level: Globally effective, modifiable via configuration files (e.g., my.cnf), controlling the database server’s default character set.
  • Database-level: Specified when creating a database, applying to all tables in the database (unless overridden by table or column settings).
  • Table-level: Specified when creating a table, applying to all columns in the table (unless overridden by column settings).
  • Column-level: Specified individually for specific columns when creating a table, with the highest priority, overriding table-level configurations.
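The priority order can be observed with a throwaway database. This is a minimal sketch: charset_demo and t are hypothetical names chosen for the example, and the last statement drops the database again, so run it only on a test server.

```sql
-- Database default: latin1
CREATE DATABASE charset_demo CHARACTER SET latin1 COLLATE latin1_swedish_ci;

CREATE TABLE charset_demo.t (
  a VARCHAR(10),                                           -- no setting: inherits the table default
  b VARCHAR(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin  -- column-level override
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci; -- table-level override of the database default

-- Column a reports utf8mb4_general_ci (from the table, not the database);
-- column b reports utf8mb4_bin (its own setting).
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'charset_demo' AND TABLE_NAME = 't'
ORDER BY ORDINAL_POSITION;

DROP DATABASE charset_demo;  -- clean up the demo objects
```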

IV. How to View Current Configurations?

Newcomers should first learn to check current configurations to avoid errors. Common commands:

  1. View server-level character sets:
   SHOW VARIABLES LIKE 'character_set_%';

Common results: character_set_server (the server default; utf8mb4 in MySQL 8.0, latin1 in MySQL 5.7 and earlier), character_set_database (the default database’s character set), etc.

  2. View server-level collations:
   SHOW VARIABLES LIKE 'collation_%';

Common results: collation_server (default collation, e.g., utf8mb4_general_ci).

  3. View database character set:
   SHOW CREATE DATABASE your_database;

Results show the character set and collation specified when the database was created.

  4. View table character set:
   SHOW TABLE STATUS LIKE 'your_table';

The Collation column in the results shows the table’s collation.

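  5. View column character set:
   SHOW FULL COLUMNS FROM your_table;

The Collation column shows each column’s collation, whose prefix is the column’s character set (inherited from the table-level setting if not specified). Plain SHOW COLUMNS omits this column, so the FULL keyword is required.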
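The same information can also be read from information_schema, which is convenient when you want only specific fields or need to filter. The queries below are a sketch that reuses the placeholder names your_database and your_table from the commands above.

```sql
-- Database default character set and collation
SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'your_database';

-- Table default collation (its prefix is the table's character set)
SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database' AND TABLE_NAME = 'your_table';

-- Character set and collation of every column in the table
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'your_database' AND TABLE_NAME = 'your_table';
```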

V. Essential Configuration Methods for Beginners

1. Use utf8mb4 Character Set

utf8mb4 is MySQL’s full Unicode encoding (supports Chinese, English, emojis, rare characters), while the legacy utf8 (actually utf8mb3) only supports 3-byte characters and cannot store emojis or special characters.
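The byte counts behind this limit can be checked directly. The sketch below uses hex literals (the UTF-8 encodings of “A”, 中, and the emoji U+1F600) so the result does not depend on the client encoding; the aliases are made up for this example.

```sql
SELECT
  LENGTH(_utf8mb4 0x41)            AS bytes_letter_a,  -- 1 byte
  LENGTH(_utf8mb4 0xE4B8AD)        AS bytes_chinese,   -- 3 bytes: fits in utf8mb3
  LENGTH(_utf8mb4 0xF09F9880)      AS bytes_emoji,     -- 4 bytes: needs utf8mb4
  CHAR_LENGTH(_utf8mb4 0xF09F9880) AS chars_emoji;     -- still a single character
```

LENGTH counts bytes while CHAR_LENGTH counts characters, which is why the emoji is one character but four bytes.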

  • Server-level configuration (modify the configuration file):
    In my.cnf (Linux) or my.ini (Windows), add:
  [mysqld]
  character-set-server = utf8mb4
  collation-server = utf8mb4_general_ci

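Save the file and restart the MySQL service (for example, systemctl restart mysql on Linux, or net stop mysql followed by net start mysql on Windows; the service name may differ on your system, e.g. mysqld).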
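After the restart, it is worth confirming that the new values are in effect. The following is a sketch; it also sets the character set of the current connection, since the client side has its own settings in addition to the server default.

```sql
-- Confirm the server-level defaults picked up from the configuration file
SELECT @@character_set_server, @@collation_server;

-- Make the current connection send and receive utf8mb4
SET NAMES utf8mb4;

-- Inspect the connection-related settings (client, connection, results)
SHOW SESSION VARIABLES
WHERE Variable_name IN ('character_set_client',
                        'character_set_connection',
                        'character_set_results');
```

Most client libraries expose an equivalent connection option (often named charset), which is usually preferable to issuing SET NAMES by hand.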

2. Database-level Configuration

Specify the character set when creating the database (recommended):

CREATE DATABASE your_database 
CHARACTER SET utf8mb4 
COLLATE utf8mb4_general_ci;

To modify an existing database:

ALTER DATABASE your_database 
CHARACTER SET utf8mb4 
COLLATE utf8mb4_general_ci;
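Note that ALTER DATABASE changes only the default used by tables created afterwards; existing tables and their columns keep their current character set. A sketch of converting one existing table, assuming its data is correctly stored in its current character set, could look like this:

```sql
-- Converts the table default and all character columns of your_table.
-- Back up first: the table is rebuilt, which can take a while on large tables.
ALTER TABLE your_table
  CONVERT TO CHARACTER SET utf8mb4
  COLLATE utf8mb4_general_ci;
```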

3. Table-level Configuration

Specify the character set when creating the table (this overrides the database default):

CREATE TABLE your_table (
  id INT,
  name VARCHAR(50)
) ENGINE=InnoDB 
DEFAULT CHARACTER SET utf8mb4 
COLLATE utf8mb4_general_ci;

For case-sensitive sorting, use utf8mb4_bin (compares strictly by binary values):

COLLATE utf8mb4_bin
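If case sensitivity is needed only occasionally, the collation can instead be overridden in a single query without changing the table. This is a sketch that assumes name is a utf8mb4 column of your_table; 'Alice' is an illustrative value.

```sql
-- Case-sensitive match on a column whose own collation is case-insensitive
SELECT id, name
FROM your_table
WHERE name COLLATE utf8mb4_bin = 'Alice';

-- Case-sensitive ordering for this query only
SELECT id, name
FROM your_table
ORDER BY name COLLATE utf8mb4_bin;
```

Applying COLLATE to a column this way may prevent MySQL from using an index on it, so for frequent case-sensitive lookups the table-level or column-level setting is the better fit.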

4. Column-level Configuration (Special Cases)

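For a column that needs its own character set or collation (e.g., English and Chinese columns with different collation needs):

CREATE TABLE your_table (
  id INT,
  english_name VARCHAR(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  chinese_name VARCHAR(50)
) ENGINE=InnoDB;

Here english_name uses its own collation, while chinese_name inherits the table default. MySQL has no English-specific utf8mb4 collation; utf8mb4_unicode_ci and utf8mb4_general_ci both handle English text.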

Note: Column-level configurations are only necessary for overriding table-level settings; prefer table-level uniformity.
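An existing column can be changed in the same spirit. The statement below is a sketch based on the english_name column from the example above.

```sql
-- MODIFY replaces the whole column definition, so repeat the data type and
-- any other attributes the column already has (NOT NULL, DEFAULT, ...).
ALTER TABLE your_table
  MODIFY english_name VARCHAR(50)
  CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```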

VI. Common Issues and Solutions

  1. Garbled Text:
    - Cause: Mismatch between column/table/database character sets and the encoding of inserted data (e.g., storing Chinese with latin1).
    - Solution: Ensure data sources (application code and the client connection) use utf8mb4, and verify that the character set is consistent across all levels.

  2. Incorrect Chinese Sorting:
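    - Scenario: utf8mb4_general_ci does not sort Chinese by pinyin, so the result may not match expectations (e.g., the order of “Li” and “Wang”).
    - Solution: utf8mb4_unicode_ci follows the Unicode Collation Algorithm and sorts languages such as German and French more precisely, with slightly lower performance than general_ci, but it does not sort Chinese by pinyin either. For pinyin order, a common workaround is to sort on the value converted to gbk; MySQL 8.0 also provides the utf8mb4_zh_0900_as_cs collation.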

  3. Emoji Not Displaying:
    - Cause: Using the legacy utf8 (utf8mb3) character set.
    - Solution: Switch to utf8mb4, which supports all Unicode characters including emojis.
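For the Chinese sorting issue (item 2), two commonly used approaches can be sketched as follows. Both assume that name is a utf8mb4 column of your_table holding Chinese text; check the resulting order against your own data before relying on it.

```sql
-- Approach 1: sort by the GBK form of the value. Common Chinese characters
-- are laid out in pinyin order in GBK, so this approximates pinyin sorting.
-- Characters that GBK cannot represent will not sort meaningfully.
SELECT name
FROM your_table
ORDER BY CONVERT(name USING gbk);

-- Approach 2 (MySQL 8.0 and later only): use the Chinese collation for utf8mb4.
SELECT name
FROM your_table
ORDER BY name COLLATE utf8mb4_zh_0900_as_cs;
```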

VII. Summary and Best Practices

  1. Always use utf8mb4 instead of the legacy utf8.
  2. Choose utf8mb4_general_ci or utf8mb4_unicode_ci for collation: The former is faster, the latter is more precise (select based on needs).
  3. Avoid column-level overrides unless necessary; table-level uniformity suffices.
  4. Regularly check configurations: Use SHOW VARIABLES, SHOW CREATE DATABASE, and SHOW CREATE TABLE to confirm that the intended character sets are in effect.
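A regular check can be as simple as listing every column that is not yet utf8mb4. The query below is one possible sketch; the list of excluded system schemas is an assumption and may need adjusting for your server.

```sql
-- Character columns in user schemas whose character set is not utf8mb4
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME,
       CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE CHARACTER_SET_NAME IS NOT NULL          -- skip non-character columns
  AND CHARACTER_SET_NAME <> 'utf8mb4'
  AND TABLE_SCHEMA NOT IN ('mysql', 'information_schema',
                           'performance_schema', 'sys')
ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;
```

An empty result means every character column in your own schemas already uses utf8mb4.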

Configuring character sets and collations correctly is a foundational MySQL skill that prevents garbled data and sorting errors. Remember: uniform character sets and collations are critical for data consistency!

Xiaoye