Categories
character-encoding mariadb mysql unicode utf-8

Trouble with UTF-8 characters; what I see is not what I stored

98

I tried to use UTF-8 and ran into trouble.

I have tried so many things; here are the results I have gotten:

  • ???? instead of Asian characters. Even for European text, I got Se?or for Señor.
  • Strange gibberish (Mojibake?) such as Señor or 新浪新闻 for 新浪新闻.
  • Black diamonds, such as Se�or.
  • Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
  • Even when I got text to look right, it did not sort correctly.

What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?

    170

    This problem plagues the participants of this site, and many others.

    You have listed the five main cases of CHARACTER SET troubles.

    Best Practice

    Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)

    utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.

    Outside of MySQL, “UTF-8” refers to all size encodings, hence effectively the same as MySQL’s utf8mb4, not utf8.

    I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.

    Overview of what you should do

    • Have your editor, etc. set to UTF-8.
    • HTML forms should start like <form accept-charset="UTF-8">.
    • Have your bytes encoded as UTF-8.
    • Establish UTF-8 as the encoding being used in the client.
    • Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
    • <meta charset=UTF-8> at the beginning of HTML
    • Stored Routines acquire the current charset/collation. They may need rebuilding.

    UTF-8 all the way through

    More details for computer languages (and its following sections)

    Test the data

    Viewing the data with a tool or with SELECT cannot be trusted.
    Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
    So, pick a table and column that has some non-English text and do

    SELECT col, HEX(col) FROM tbl WHERE ...
    

    The HEX for correctly stored UTF-8 will be

    • For a blank space (in any language): 20
    • For English: 4x, 5x, 6x, or 7x
    • For most of Western Europe, accented letters should be Cxyy
    • Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
    • Most of Asia: Exyyzz
    • Emoji and some of Chinese: F0yyzzww
    • More details

    Specific causes and fixes of the problems seen

    Truncated text (Se for Señor):

    • The bytes to be stored are not encoded as utf8mb4. Fix this.
    • Also, check that the connection during reading is UTF-8.

    Black Diamonds with question marks (Se�or for Señor);
    one of these cases exists:

    Case 1 (original bytes were not UTF-8):

    • The bytes to be stored are not encoded as utf8. Fix this.
    • The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
    • Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

    Case 2 (original bytes were UTF-8):

    • The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
    • Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

    Black diamonds occur only when the browser is set to <meta charset=UTF-8>.

    Question Marks (regular ones, not black diamonds) (Se?or for Señor):

    • The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
    • The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
    • Also, check that the connection during reading is UTF-8.

    Mojibake (Señor for Señor):
    (This discussion also applies to Double Encoding, which is not necessarily visible.)

    • The bytes to be stored need to be UTF-8-encoded. Fix this.
    • The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
    • The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
    • HTML should start with <meta charset=UTF-8>.

    If the data looks correct, but won’t sort correctly, then
    either you have picked the wrong collation,
    or there is no collation that suits your need,
    or you have Double Encoding.

    Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.

    é should come back C3A9, but instead shows C383C2A9
    The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
    

    That is, the hex is about twice as long as it should be.
    This is caused by converting from latin1 (or whatever) to utf8, then treating those
    bytes as if they were latin1 and repeating the conversion.
    The sorting (and comparing) does not work correctly because it is, for example,
    sorting as if the string were Señor.

    Fixing the Data, where possible

    For Truncation and Question Marks, the data is lost.

    For Mojibake / Double Encoding, …

    For Black Diamonds, …

    The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

    9

    • If the client, database and tables are in utf8mb4 I seem to be able to store emojis fine. Some blogs suggest also setting collation-server and character-set-server in mysqld. Do I really need to change mysqld what difference does server setting make?

      Jan 15, 2018 at 16:50

    • 1

      @david_adler – There are multiple ways to get the effect of those settings. The best is to use the clients connection parameters. Second best by executing SET NAMES utf8mb4 right after connecting. After all, this is declaring the encoding in the client.

      Jan 15, 2018 at 18:47


    • With MySQL 8.0 (now released), the default is utf8mb4 and utf8mb4_0900_ai_ci. Most users should use them without considering the other charsets and collations.

      May 10, 2018 at 3:56

    • Tips on configuring Python, PHP, and about 40 other languages

      Jun 29, 2018 at 17:06

    • Another note: If a FUNCTION or STORED PROCEDURE is involved, you may have not been using the desired charset when you created it. DROP it, SET NAMES; re-CREATE it.

      Jan 3, 2019 at 22:35

    10

    I had similar issues with two of my projects, after a server migration. After searching and trying a lot of solutions, I came across with this one:

    mysqli_set_charset($con,"utf8mb4");
    

    After adding this line to my configuration file, everything works fine!

    I found this solution for MySQLiPHP mysqli set_charset() Function—when I was looking to solve an insert from an HTML query.

    1

    • Yes, that is one of several things that can cause character set problems. Note: that syntax is valid for PHP, not other app languages, and only if using mysqli, not PDO.

      Mar 26, 2019 at 2:05

    2

    I was also searching for the same issue. It took me nearly one month to find the appropriate solution.

    First of all, you will have to update you database will all the recent CHARACTER and COLLATION to utf8mb4 or at least which support UTF-8 data.

    For Java:

    while making a JDBC connection, add this to the connection URL useUnicode=yes&characterEncoding=UTF-8 as parameters and it will work.

    For Python:

    Before querying into the database, try enforcing this over the cursor

    *
    cursor.execute('SET NAMES utf8mb4')
    cursor.execute("SET CHARACTER SET utf8mb4")
    cursor.execute("SET character_set_connection=utf8mb4")
    *

    If it does not work, happy hunting for the right solution.

    3

    • 1 month? That was fast. It took me more than a year to formulate this Q&A. Java looks right. SETs are not the ‘right’ way for Python; see mysql.rjweb.org/doc.php/charcoll#python Many other languages are discussed elsewhere in that blog.

      Dec 19, 2019 at 17:24

    • 1

      @RickJames But this issue exist with Mysql-Python below 1.2.4, so the SET statements are basically a work around.

      Jul 28, 2020 at 6:33

    • How is the content near “cursor.execute” supposed to be formatted? Separate lines for each one? Or something else? Is “*” literate or intended as italics formatting?

      Apr 12 at 23:40