In the first tip, I tried to establish why you, the ONIX file sender, have to test your file: simply to ensure that the file’s content (all the characters) will be recognized by the aggregator’s software. The “encoding” declaration in the first line of the file tells the recipient what to expect, and your job is to ensure that the file matches that.
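As a rough illustration of what "the file matches the declaration" means, here is a minimal Python sketch. The function names `declared_encoding` and `matches_declaration` are made up for this example, and it assumes the file is small enough to read into memory as bytes. It is not a substitute for loading the file in real XML software.

```python
import re

def declared_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the XML default."""
    # The declaration itself is plain ASCII in all three encodings discussed
    # in this post, so peeking at the first bytes as ASCII is safe here.
    head = raw[:200].decode("ascii", errors="replace")
    match = re.search(r"""<\?xml[^>]*encoding=["']([A-Za-z0-9._-]+)["']""", head)
    return match.group(1) if match else default

def matches_declaration(raw: bytes) -> bool:
    """True if the whole file decodes under the encoding it declares."""
    try:
        raw.decode(declared_encoding(raw))
        return True
    except (UnicodeDecodeError, LookupError):
        # LookupError covers a declared encoding name Python does not know.
        return False

record = '<?xml version="1.0" encoding="UTF-8"?><Title>Café</Title>'
good = record.encode("utf-8")       # bytes really are UTF-8
bad = record.encode("iso-8859-1")   # declares UTF-8, bytes are Latin-1
print(matches_declaration(good), matches_declaration(bad))
```

One caveat: every possible byte is a valid character in ISO-8859-1, so a file that declares iso-8859-1 will always pass this check, even when it displays the wrong characters. The check is mostly useful for catching files that claim UTF-8 and aren't.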
If you’re trading files in English-speaking North America, you’ve got a choice of three encodings that will almost certainly be considered acceptable by aggregators. (There are lots of others, but my assumption is that you’re trading files largely in English, with some French and/or Spanish thrown in.)
The default encoding in ONIX is UTF-8. It’s the most commonly used in English North America for XML and the most supported by XML software. For the English-language keyboard characters it’s identical to what was called ASCII (but not extended ASCII), so any text document in English will almost certainly be largely valid UTF-8 without any work on your part.
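A quick way to see the ASCII and UTF-8 relationship for yourself is the small Python sketch below. The sample strings are invented for the example. Plain English text produces the same bytes either way, and the difference only shows up once an accented character appears.

```python
plain = "Anne of Green Gables"
# Plain keyboard characters: ASCII and UTF-8 give exactly the same bytes.
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

accented = "Émile Zola"
utf8_bytes = accented.encode("utf-8")         # É becomes two bytes
latin1_bytes = accented.encode("iso-8859-1")  # É becomes one byte

print(same_bytes, len(utf8_bytes), len(latin1_bytes))
```

This is why a mostly English file can look fine under several declarations and then fall over on the one author name with an accent in it.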
The other common encoding is ISO-8859-1, what might be called ‘extended ASCII’ or Latin-1. It supports the common accented characters in French, Spanish and German. BISG has identified this as the preferred encoding for the US supply chain. We in Canada are more demure and think it slightly impolite to discuss, but are OK with it too.
And then there is “windows-1252.” This is what, in desperation, your trading partners will use when they hope you’re on the Windows operating system and your file is screwing up when they load it. It’s the Windows version of ISO-8859-1. I think. I don’t really know… Who could possibly care about this?!
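For anyone who does care: as far as I can tell, the two agree on the accented letters and differ in the byte range 0x80 to 0x9F, where windows-1252 puts things like curly quotes and ISO-8859-1 puts invisible control codes. A minimal Python sketch of that difference, using Python's built-in codec names and a made-up byte string:

```python
# Bytes a Windows program might write for a curly-quoted word.
raw = b"\x93ONIX\x94"

as_cp1252 = raw.decode("windows-1252")  # curly quotes, U+201C and U+201D
as_latin1 = raw.decode("iso-8859-1")    # control characters U+0093 and U+0094

# An accented letter such as e-acute is the same byte (0xE9) in both.
same_accent = b"\xe9".decode("windows-1252") == b"\xe9".decode("iso-8859-1")

print(repr(as_cp1252), repr(as_latin1), same_accent)
```

So text pasted from a word processor, full of "smart" quotes, is the usual reason a trading partner reaches for windows-1252.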
Here’s the dummy version: When you hit a computer key, some code is generated, interpreted, and shown on your screen. Conventions and standards control all this, and when you bought your computer the salesperson, if awfully knowledgeable, might have been able to tell you which conventions your computer follows. If you’re on a PC with a number pad, try this: hold the ALT key down and type 80 on the number pad. If you did that you made a big pee, and I’m really, really pleased with myself for getting you to do it. My only point is that there really isn’t a way to know what your computer is doing, except that:
- If you bought your computer in English-speaking North America;
- and no one said it wasn’t a standard keyboard;
- and you’ve not really thought much about it;
then what happens when you make simple keystrokes is almost certainly UTF-8 (unless some piece of software is screwing with what you type). Can you cut and paste into a text document or email and it (usually) doesn’t turn to gibberish? Then it’s more or less UTF-8.
XML software doesn’t care. It’s up to you to tell it what your characters are, and as a start assume that you’re typing largely in UTF-8. You don’t really have a choice. But here’s a quick fix for when you’re testing your ONIX and it’s not loading properly because of unrecognized characters: change the encoding declaration to encoding="iso-8859-1" and hope. It may be all that you need, but more likely you’ll have a small number of unrecognized types of characters in your file.
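If you'd rather look than hope, one option is to list which non-keyboard characters are actually in the file. The Python sketch below is only an illustration: `non_ascii_inventory` is a name invented here, and the sample text is made up. Anything that cannot be decoded under the encoding you pass in is counted as the Unicode replacement character, U+FFFD.

```python
from collections import Counter

def non_ascii_inventory(raw: bytes, encoding: str) -> Counter:
    """Count each non-ASCII character seen when raw is decoded as encoding."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of failing,
    # so problem spots show up in the tally rather than stopping the run.
    text = raw.decode(encoding, errors="replace")
    return Counter(ch for ch in text if ord(ch) > 127)

sample = "caf\u00e9 \u201cquoted\u201d caf\u00e9".encode("utf-8")
inventory = non_ascii_inventory(sample, "utf-8")
for char, count in inventory.most_common():
    print(f"U+{ord(char):04X} {char!r} x{count}")
```

A short list like this tells you whether you are dealing with a handful of accents and curly quotes or with something stranger.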
To summarize: You must test all XML files before sending them, and the initial point of testing XML files is to ensure that the contents are recognized and defined. There are some secondary data quality and validation issues that will come up when the actual ONIX standard is discussed, but the first step is always a coherent recognized file acceptable to XML software.
The next post offers some practical tips on cleaning files, and the one after that covers what to do with special characters outside of your encoding statement, so don’t worry about your weekly excitement just yet.