OEMCP vs. ACP

One of the trepidations I had when starting to work on client software three years ago was dealing with globalization and localization.  I had heard horror stories of how much busy work localization involved (they have teams that deal with that issue alone), and it was a black box for me.  (Yes, I realize that they’re just strings in resource files.)  Fortunately enough, since I have been working on a client platform, instead of the UX on top of the platform, I never had to deal with localization at all.  In fact, I was pleasantly surprised when I first heard that sync’ing files like 好嗎.txt actually worked.  Effortless!

My naïveté lasted up until a month ago, when I investigated a bug report from a customer.  I sat on the bug for a while, as I was a little perplexed by the whole report.  It was most certainly some sort of localization issue, and it took some head-banging before I figured it out.  The details aren’t particularly important here, so let me just get to the gist of the bug.

At the command prompt: (Yes, the command prompt.  Does that still surprise you?  Seven of my 10 previous “code” posts have to do with the command prompt.)

C:>echo Comment ça va?
Comment ça va?

I took French in high school.  That’s a cedilla under the ‘c’.  Anyway, there’s nothing unexpected here.  Echo does its job as asked.  Now for some redirection:

C:>echo Comment ça va? > salut.txt

C:>type salut.txt
Comment ça va?

Again, nothing unexpected here.  Yawn.

But wait!  (There’s more!)

C:>notepad salut.txt

Notepad will open up.  And what does it show?

Comment ‡a va?

What the heck?!  Blink.  Twice.  Huh?  What’s going on?  I did a bunch of digging and found all sorts of interesting information online about code pages.  From Wikipedia:

Code page is the traditional IBM term used to map a specific set of characters to numerical code point values.  …  [T]he term is most commonly associated with the IBM PC code pages. Microsoft, a maker of PC operating systems, refers to these code pages as OEM code pages, and supplements them with its own "ANSI" code pages.

It turns out that on an EN-US operating system, the OEM code page is CP437, whereas the default ANSI code page (ACP) is CP1252.  Apparently the command prompt uses a different code page than do GUI programs, so this must be an encoding issue.  If you look at the file in a hex editor, you’ll see that the character ‘ç’ is encoded as 0x87, as expected from the OEMCP: 0x87 = U+00E7 : LATIN SMALL LETTER C WITH CEDILLA.  But in the ACP, 0x87 is ‘‡’: 0x87 = U+2021 : DOUBLE DAGGER!
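If you don’t have a hex editor handy, the same lookup can be sketched in Python, which ships codecs named cp437 and cp1252.  This is only an illustration of the byte-to-character mapping, not anything the command prompt or Notepad actually runs; the helper name describe is my own invention.

```python
import unicodedata


def describe(byte_value, code_page):
    """Decode a single byte under the given code page and name the result."""
    ch = bytes([byte_value]).decode(code_page)
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The same byte, 0x87, read through the OEM and ANSI code pages.
print(describe(0x87, "cp437"))   # U+00E7 LATIN SMALL LETTER C WITH CEDILLA
print(describe(0x87, "cp1252"))  # U+2021 DOUBLE DAGGER
```

One byte, two code pages, two different characters.  That is the whole bug.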

So does that mean that the issue goes away if you change the code page the command prompt is using to the one UI apps use?

C:>chcp /?
Displays or sets the active code page number.

CHCP [nnn]

  nnn   Specifies a code page number.

Type CHCP without a parameter to display the active code page number.

C:>chcp
Active code page: 437

C:>chcp 1252
Active code page: 1252

C:>echo Comment ça va? > salut.txt

C:>notepad salut.txt

Comment ça va? (in Notepad)

Check that out!  Ah, oui!  Bien sûr!
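To convince myself that the code page switch really is the whole story, here is a rough simulation in Python of the redirect-then-open round trip.  It assumes the console writes the text as bytes in its active code page and that the GUI program reads those bytes back as CP1252; the function name redirect_then_open is made up for this sketch.

```python
def redirect_then_open(text, console_cp, gui_cp="cp1252"):
    """Simulate 'echo text > file' followed by opening the file in a GUI app.

    The console encodes the text with its active code page; the GUI program
    is assumed to decode the resulting bytes with the ANSI code page.
    """
    raw = text.encode(console_cp)  # bytes that end up in salut.txt
    return raw.decode(gui_cp)      # what the GUI program would display


greeting = "Comment ça va?"

# Console on the OEM code page: the cedilla does not survive the trip.
print(redirect_then_open(greeting, "cp437") == greeting)

# Console switched with 'chcp 1252': both sides agree on the encoding.
print(redirect_then_open(greeting, "cp1252") == greeting)
```

With cp437 on the writing side the comparison comes out False, and with cp1252 it comes out True, which matches what Notepad showed before and after the chcp.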

Raymond Chen has a bit more on the details and the historical reasons behind the schism.  Thanks to Michael Kaplan for a series of excellent posts on the subject.  I still know next to nothing about localization, but it’s pretty cool when you figure things out, even if those things were discussed ad nauseam years before you even came across them.  (And just to wrap this up, the Win32 function GetACP() will allow you to get the current ANSI code page, but that isn’t scriptable unless you wrap it in an exe.)
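If a scripting runtime happens to be installed, the wrapper doesn’t have to be a hand-built exe.  As a minimal sketch, assuming Python is available on the machine, ctypes can call GetACP() and its sibling GetOEMCP() from kernel32 directly.  The function name get_code_pages is hypothetical, and on anything other than Windows it simply returns None.

```python
import ctypes
import sys


def get_code_pages():
    """Return (ANSI code page, OEM code page) on Windows, or None elsewhere."""
    if sys.platform != "win32":
        return None  # ctypes.windll only exists on Windows
    kernel32 = ctypes.windll.kernel32
    return kernel32.GetACP(), kernel32.GetOEMCP()


pages = get_code_pages()
if pages is None:
    print("Not running on Windows; no code pages to report.")
else:
    acp, oemcp = pages
    print(f"ACP: {acp}, OEMCP: {oemcp}")
```

On the EN-US machine described above I would expect this to report 1252 and 437, but I haven’t run it on every configuration.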

I’m floating the idea of doing a “Bugs of Live Mesh/Framework” series.  (Bonus points for anyone who can figure out how this bug relates to Live Mesh.  Hint: It’s related to a previous post I made here over a year ago.)  I was thinking of covering some of the more ‘interesting’ bugs: why they exist, what fixes were/will be made, what workarounds there are, or even to solicit feedback.  What say you?
