
Harmonizing Character Encoding Between Imported Data and MySQL








Article Posted On Date : Wednesday, February 3, 2010



By Rob Gravelle

My last DatabaseJournal article, All About the Crosstab Query, described how to formulate an SQL statement for generating a cross-tabulation query. My original intention for today's follow-up was to explore the use of stored procedures to make crosstab generation more dynamic. That was until a bug sent me on a search for answers. What I found was intriguing enough to make me set aside my original topic and relate what I discovered about the handling of import data encoding. I think you'll agree that it's a journey well worth taking!

Character Encoding and Collation Described

A character encoding is a way of mapping a character (the letter 'A') to an integer in a character set (the number 65 in the US-ASCII character set). With a limited character set, such as US-ASCII, which includes the twenty-six letters of the English alphabet, both lowercase and uppercase, numbers from 0 to 9, and some punctuation, fitting this into a single byte is not a problem. But when dealing with other languages like German, Swedish, Hungarian, and Japanese, you start to hit the boundaries of the 8-bit byte. This can happen when you try to create a character set to represent two languages, or even a single language like Japanese.
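
The mapping described above can be tried out directly. The following is a minimal Python sketch, not part of the original article; the helper names code_point and encoded_bytes are made up for illustration. It shows that 'A' maps to 65 and that a character such as 'ä' fits in one byte in a Western single-byte set but falls outside US-ASCII.

```python
def code_point(ch: str) -> int:
    """Return the integer that a single character maps to."""
    return ord(ch)


def encoded_bytes(text: str, encoding: str) -> bytes:
    """Return the bytes used to store text under the named encoding."""
    return text.encode(encoding)


print(code_point("A"))                      # 65
print(encoded_bytes("A", "ascii"))          # b'A'
print(encoded_bytes("\u00e4", "latin-1"))   # 'ä' stored as the single byte 0xE4

# US-ASCII has no slot for 'ä', so encoding it fails.
try:
    encoded_bytes("\u00e4", "ascii")
except UnicodeEncodeError as err:
    print("not representable in US-ASCII:", err.reason)
```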

In an effort to account for the profusion of languages and scripts in the modern world, a number of different character encodings have been devised for mapping different characters to integers. For character sets that wouldn't fit in a single byte, double-byte character sets were created, along with multi-byte character sets that use a special character to signal a shift between single-byte and double-byte encoding.

The Unicode Consortium came together to create a specification for a character encoding that would be able to encompass the characters in all written languages. The result was the Unicode character set. Its two most common encodings are UCS-2, which encodes every character as two bytes, and UTF-8, which uses a variable-length multi-byte scheme that extends US-ASCII.
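
To make the difference between the two Unicode encodings concrete, here is a small Python sketch. It rests on one assumption: Python has no codec called UCS-2, so "utf-16-be" stands in for it, which produces the same two bytes per character for the characters used here. The function name byte_lengths is hypothetical.

```python
def byte_lengths(text: str) -> dict:
    """Return the encoded size of text under a UCS-2 stand-in and UTF-8."""
    return {
        "ucs2_like": len(text.encode("utf-16-be")),  # stand-in for UCS-2
        "utf8": len(text.encode("utf-8")),
    }


# ASCII letter, Western accented letter, Thai letter.
for ch in ("A", "\u00e4", "\u0e01"):
    print(repr(ch), byte_lengths(ch))

# UTF-8 extends US-ASCII: pure ASCII text is byte-for-byte identical.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))  # True
```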

ISO-8859-1 is the most common character set used for Western languages, and it is extended by the Windows-1252 character set to include some other characters, such as the Euro (€) and trademark symbol (™). Because Windows-1252 is a superset of ISO-8859-1, the character set is known as latin1 to MySQL. (It does not recognize ISO-8859-1 as being a distinct character set.)
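
The relationship between the two character sets can be checked with Python's built-in codecs, where "cp1252" is Windows-1252 and "latin-1" is ISO-8859-1. This is an illustrative sketch only (the helper decode_both is hypothetical, and Python's "latin-1" is not the same thing as MySQL's latin1): the byte values 0x80 and 0x99 become the Euro and trademark symbols under Windows-1252 but are invisible control codes under ISO-8859-1, while accented letters agree in both.

```python
def decode_both(raw: bytes) -> tuple:
    """Decode the same bytes as Windows-1252 and as ISO-8859-1."""
    return raw.decode("cp1252"), raw.decode("latin-1")


print(decode_both(b"\x80"))  # Euro sign vs. a control character
print(decode_both(b"\x99"))  # trademark sign vs. a control character
print(decode_both(b"\xe9"))  # the same accented 'e' in both
```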

So Why Can't We Just Use UCS-2 or UTF-8 for Everything?

The main reason that it isn't practical to always use Unicode is that it wastes bandwidth when using only a single language. In TIS-620, the single-byte code page for Thai, every character takes up one byte, whereas in UTF-8, Thai characters take up three bytes each. Many people think UTF-8 is efficient because ASCII characters take up only one byte, but in reality, UTF-8 can be highly inefficient when most of your file consists of characters outside of the ASCII set. For example, if half the characters in your file are ASCII and the other half are Thai, then saving the file in UTF-8 makes it take up twice as much space as TIS-620 would.
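
The arithmetic behind that claim can be verified with a short Python sketch. The sample string is invented for illustration: four ASCII letters followed by four Thai letters, written as escapes so the snippet does not depend on the editor's own encoding. The helper encoded_size is hypothetical.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Return how many bytes text occupies under the named encoding."""
    return len(text.encode(encoding))


# Half ASCII, half Thai (four characters of each).
mixed = "ABCD" + "\u0e01\u0e02\u0e04\u0e07"

tis_size = encoded_size(mixed, "tis-620")   # 8 characters x 1 byte
utf8_size = encoded_size(mixed, "utf-8")    # 4 x 1 byte + 4 x 3 bytes

print(tis_size, utf8_size)  # 8 16
```

With an equal split, each pair of one ASCII and one Thai character costs 1 + 3 = 4 bytes in UTF-8 against 2 bytes in TIS-620, which is where the factor of two comes from.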

In situations where storage space is at a premium, and ASCII plus one script is used, using an older single-byte character set can make a lot of sense. Unicode is necessary when using multiple scripts in one file, and two or more of the scripts use different code pages (for example, you can mix Thai and English, because TIS-620 also includes the ASCII symbols, but you cannot mix Thai and Greek without using Unicode, because Thai and Greek require different code pages).
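
The Thai and Greek case can be demonstrated in a few lines of Python. This is a sketch with invented sample strings; can_encode is a hypothetical helper, and "iso8859-7" is Python's name for the single-byte Greek code page.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


thai_english = "Hi \u0e01\u0e02"            # English plus Thai letters
thai_greek = "\u0e01\u0e02 \u03b1\u03b2"    # Thai plus Greek letters

print(can_encode(thai_english, "tis-620"))   # True: TIS-620 includes ASCII
print(can_encode(thai_greek, "tis-620"))     # False: no Greek in the Thai page
print(can_encode(thai_greek, "iso8859-7"))   # False: no Thai in the Greek page
print(can_encode(thai_greek, "utf-8"))       # True: Unicode covers both
```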

A collation comprises the rules governing the proper use of characters for either a language, such as Greek or Polish, or an alphabet, such as Latin1_General. The collation attribute is used by MySQL for the sorting of characters in relation to one another, and not for encoding specifically.

Each SQL Server collation specifies two properties:

    * The sort order to use for nchar, nvarchar, and ntext Unicode data types as well as for non-Unicode character data types (char, varchar, and text). A sort order defines the sequence in which characters are evaluated in comparison operations.
    * The code page used to store non-Unicode character data.
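
As a rough analogy for what a sort order changes, the Python sketch below compares a plain code-point sort with a case-insensitive one. It is only a stand-in: real database collations also define rules for accents and language-specific letters, which str.casefold does not attempt. The function names are hypothetical.

```python
def binary_order(words: list) -> list:
    """Sort by raw code point, so uppercase letters come before lowercase."""
    return sorted(words)


def case_insensitive_order(words: list) -> list:
    """Sort while ignoring case, roughly what a '_ci' collation does."""
    return sorted(words, key=str.casefold)


words = ["cherry", "apple", "Banana"]
print(binary_order(words))            # ['Banana', 'apple', 'cherry']
print(case_insensitive_order(words))  # ['apple', 'Banana', 'cherry']
```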

If you'd like to read up more on collation, there's an informative DatabaseJournal article on collation by Muthusamy Anantha Kumar (aka The MAK).

A Tale of Two Character Sets

MySQL ships with Latin-1 (latin1) as the default character set, with latin1_swedish_ci as its default collation, presumably because it is used by the majority of MySQL customers. In MySQL 4.1.12 or greater, data is imported in UTF8 encoding. While this does allow the maximum number of character codes, it can still present problems when the incoming data is in a different character encoding. What's worse is that you may not know about the problem until a lot of work is required to undo the damage!

I discovered this first-hand when I imported data from an Access database to test my crosstab SQL code in MySQL.  The transfer only ran into one snag, when the date formats were inconsistent.  I fixed that problem by formatting the dates in the universal "yyyy-mm-dd" format.  Here is the Access query used to extract the data:

SELECT TA_CASES.FEE_NUMBER,
       TA_CASES.CASE_TYPE,
       Format([CREATION_DATE],"yyyy-mm-dd") AS [CREATION DATE],
       TA_CASES.REGION_CODE
FROM TA_CASES;


To my surprise, running the query on the imported data produced the following bizarre results, as seen in this screenshot of my HeidiSQL Windows client:

[Screenshot: HeidiSQL Windows client]

It seemed that the MONTHNAME function was misbehaving and returning a HEX number instead of a string. Upon further experimentation, I concluded that any function applied to the dates was returning HEX values. A few Internet searches later, I came to understand that this sort of thing was quite common, as I read account after account of people having to jump through hoops to translate character codes to the correct characters.

In my case, the discrepancy between the two encodings was caused by Access when I saved the data to a .csv (comma-separated values) file. MS Access uses the "Windows-1252" character encoding when exporting to text.  You can test this by exporting data as an HTML page.  In it, there will be a META tag that declares the character encoding for the page:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Windows-1252">
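
To see why a Windows-1252 export clashes with a reader that expects UTF-8, consider the following Python sketch. The sample row and the helper name reencode are invented for illustration, and this is one possible pre-import cleanup step rather than anything the export or MySQL does for you.

```python
def reencode(raw: bytes, source: str = "cp1252", target: str = "utf-8") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return raw.decode(source).encode(target)


# A hypothetical CSV row as a Windows-1252 export would store it.
row = "Montr\u00e9al,\u20ac10\r\n"
exported = row.encode("cp1252")

# Read as UTF-8, the bytes are simply invalid.
try:
    exported.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)

# Converted first, the row survives intact.
print(reencode(exported).decode("utf-8") == row)  # True

# The opposite mistake: UTF-8 bytes read as Windows-1252 give mojibake.
print("\u00e9".encode("utf-8").decode("cp1252"))  # Ã©
```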

The simplest solution that I found was to use the MySQL CONVERT function in the query.  It accepts a value and translates it to the encoding format that you specify following the USING keyword.  Here is how I used the CONVERT function in my SQL code to fix the encoding problem:

mysql> SELECT CONVERT(MONTHNAME(CREATION_DATE) USING latin1) AS 'Month',
...
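
The article does not show what MySQL does internally, so the following Python sketch is only an analogy, under the assumption that the client displayed the value as hex because it arrived as raw bytes with no usable character set attached. The helper names show_as_hex and convert_using are hypothetical; the second one mimics the idea of CONVERT ... USING by stating which character set the bytes are in.

```python
def show_as_hex(raw: bytes) -> str:
    """Display bytes the way a client might when it has no character set."""
    return raw.hex().upper()


def convert_using(raw: bytes, charset: str) -> str:
    """Interpret the same bytes as text in the stated character set."""
    return raw.decode(charset)


month = b"February"
print(show_as_hex(month))               # 4665627275617279
print(convert_using(month, "latin-1"))  # February
```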

The solution that I employed affected only the output of the query. Other solutions can be used at import time or applied at the server, database, table, or field level. We'll be looking at these after we get to our previously scheduled crosstab stored procedure in the next article.






