Archive

Archive for August, 2010

Mysql, load data infile, utf8

August 24, 2010 Leave a comment

We have a need to sync data from oracle to mysql. The extract-data and the dump-data servers run on JBoss.
Steps
1. Dump-data server connects to extract-data server asking for the data
2. Extract-data server connects to oracle db, makes a CSV based file and sends it to dump-data server
3. Dump-data server extracts information from the CSV file for diff tables.
4. Dump-data server uses LOAD DATA INFILE to upload data into the corresponding mysql tables.

Issues
1. UTF8 characters like \u20ae was inserted into mysql as u20ae.
Example row in the CSV file.
“a”,”b”,”\u20ae”
2. Japanese/Chinese etc characters showed up as ? in the CSV files on the dump-server side.

Solutions
1. The default escape character in MySQL is “\”. So, If you have \n its interpreted as ‘newline’ , \t as tab etc etc. Any character that doesnt not fall in the same category will be treated as the normal character. For eg \u will be inserted into the table as ‘u’.
This issue does not occur when you would be inserting via java prepared statement/statement. The MySQL connector takes care of the escape sequencing.
You can turn off the default escape character feature in mysql by using NO_ESCAPE parameter.

2. International characters : There are two ways to solve this issue
a. The character set indicated by the ‘character_set_database’ system variable is used to interpret the information in the CSV file. Set character_set_database=’utf8′ and saving the CSV file in utf8 encoded format does the trick.
b. Save the file in utf8 encode format and use ‘character set’ in load data file command. This feature worked for me in 5.1.45 and doesnt work in earlier versions.