Wednesday, May 7, 2008

CSCONVERT To The Rescue!

Three under-appreciated tasks: Converting data from one application database to another, building a good application installation or "setup.exe", and dealing with different character sets.

In the world of application development these are often regarded as dull, uninteresting tasks. In management's eyes that means they are unimportant, and therefore easy, and therefore they get left to the last minute, or assigned to the intern, or both.

With, as they say, predictable results.

I am pretty good at the first task, converting data between systems, where "pretty good" means "lots of bitter experience over many years". In some cases a MobiLink setup is really a glorified data conversion exercise, especially when the databases being synchronized have vastly different schema.

On the second task, building a good setup, I can code my way out of a wet paper bag but not much more. What I do know is this: A good setup is hard to build, which may be why (until recent years) there have been so few good ones.

On dealing with character sets, code pages, locales and collations, I'm absolutely hopeless. I can read the documentation (heaven knows I've read it enough times), I just can't put the docs together with real world situations.

That's what this post is about: One teeny tiny real-world problem involving character sets. That I couldn't fix without help. Where "help" means an extended round of emails with my good friend Super Duper Expert.

Here's where it started: "French characters aren't being displayed properly. They look OK in the input XML file, but all messed up when the application displays them."

Here's what "OK" means...

And here's what "messed up" means...

The problem was this:

  • The SQL Anywhere database was created on Windows with the default character set called "cp1252". Or maybe that's a code page. Whatever. It's the default, and defaults are not something to be fiddled with. Especially not THIS default: I never fiddle with defaults I don't understand.

  • The input XML file was (apparently) created with a character set called "utf8".

  • The input utf8 characters were not being converted to the cp1252 character set, but just stored in the database as is. So "OK" becomes "messed up" when the data was displayed as if it was in the cp1252 character set.

  • The XML data was directly read by xp_read_file() inside a stored procedure, not passed across the client-server boundary from an application program. This means SQL Anywhere's automatic client-server character set translation feature didn't apply. Not even when I added CHARSET=UTF8 to the connection string... that was email number three. Or was it email number six?
The solution was to call SQL Anywhere's CSCONVERT() function to force the character set conversion to be performed inside the stored procedure:
SET @xml 
= CSCONVERT
( xp_read_file ( 'filespec-for-XML-data' ),
'cp1252',
'utf8' );

If that code looks familiar, it's from the earlier post OpenXML() Rocks!

No comments: