Most publishers have more than 100 books on their list, and a few of them will be by non-European authors or reference some atypical symbol. And wasn’t the production manager proud when the cover copy spelled it right?
ONIX is a bibliographic data exchange standard, and it behooves those of us who toil in publishing to spell the author’s name, the book title and the review journal correctly. Truly foreign scripts will (probably) allow use of some transliteration system, so it’s unlikely you’ll need Chinese, Arabic or Cyrillic script; I have arbitrarily decreed those to be beyond my scope (or knowledge). But “similar” Latin alphabets, like Czech, Croatian and the other Romanized Slavic languages, are not covered by iso-8859-1, and even the Western European languages it does cover have gaps (French needs œ, for one). Common symbols like the trademark sign are missing too… At some point you’ll need to put something in an ONIX file that’s outside of the common encoding schemes, and for that you’ll use an escaped entity.
I’m going to make one of my daring generalizations here to help you recognize an entity: it starts with an ampersand, “&”, has simple keyboard characters in between and ends with a semicolon, “;”. XML software recognizes it, and browsers render it as the character it stands for. Here’s a link to one of my favorite sites with lists of entities:
http://htmlhelp.com/reference/html40/entities/
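If you want to know what’s already lurking in one of your files, a few lines of Python (my choice of tool, not anything ONIX requires) will list every entity it contains. The file name and encoding below are placeholders; substitute your own:

    import re
    from collections import Counter

    # Matches named (&eacute;), decimal (&#233;) and hex (&#xE9;) entities.
    ENTITY = re.compile(r'&[A-Za-z][A-Za-z0-9]*;|&#[0-9]+;|&#x[0-9A-Fa-f]+;')

    with open('onix.xml', encoding='iso-8859-1') as f:  # placeholder name and encoding
        counts = Counter(ENTITY.findall(f.read()))

    for entity, count in counts.most_common():
        print(count, entity)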
For example, an e with an acute accent (é) can be “escaped” as &eacute; or &#233; or &#xE9;. And the entity’s ampersand is special in XML: it is the one ampersand you do not escape. You should never see a “double escaped” entity like &amp;eacute; in an ONIX file.
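If you want to see the difference for yourself, here’s a small sketch using Python’s html module (a browser-ish view of things, not an XML parser): the three escaped forms all come out as the same letter, and the double-escaped one doesn’t:

    from html import unescape

    # All three forms render as the same letter:
    for form in ('&eacute;', '&#233;', '&#xE9;'):
        print(form, '->', unescape(form))        # each prints ... -> é

    # The "double escaped" mistake does not; it renders as the literal entity text:
    print('&amp;eacute;', '->', unescape('&amp;eacute;'))   # -> &eacute;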
Where’s the dividing line between what can go into the file directly and what has to be escaped? It depends on the declared encoding. A file declared as iso-8859-1 can hold characters like é, ç, à, è, ô, ö, û and ñ directly, but not characters like ů, ũ or š, and not “smart” quotes or em and en dashes; all of those have to be escaped. A file declared as utf-8 can, in principle, hold any character directly, although plenty of producers escape everything beyond simple keyboard characters anyway. In practice the dividing line is wherever the software complains, and just like the previous post on file cleaning, the solution is simple substitution by find and replace. What? You haven’t kept track of how your characters are stored in your source file? Oh dear… understanding what’s in your source file is the first step. But basically: the XML software complains and you fix the problem, just like the encoding problems.
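That find-and-replace can be automated. Here’s a sketch of the idea in Python, assuming your source text is already correct Unicode and your target is iso-8859-1; it leaves alone anything the encoding can hold and escapes the rest as decimal entities (markup characters like & and < are assumed to be handled separately):

    def escape_for_encoding(text, encoding='iso-8859-1'):
        """Escape, as decimal entities, any character the target encoding can't hold."""
        out = []
        for ch in text:
            try:
                ch.encode(encoding)
                out.append(ch)                    # the encoding can hold it: leave it alone
            except UnicodeEncodeError:
                out.append('&#%d;' % ord(ch))     # it can't: escape it
        return ''.join(out)

    print(escape_for_encoding('Hélènne Ővēn'))    # Hélènne &#336;v&#275;n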
There are considerations: First, while a basic validation in most XML software accepts all entity types, XML schema languages are stricter, and they don’t accept the ‘html’ type like &eacute;. (XML itself only predefines five named entities, &amp;, &lt;, &gt;, &apos; and &quot;; names like &eacute; only work when a DTD declares them, and schema validation doesn’t use a DTD.) So using html style entities will cause you a problem with BookNet Canada’s BiblioShare and ONIX 3.0. Schema validation is the future, so the prudent administrator should avoid html entities. The ‘decimal’ style &#233; is accepted everywhere and is the one I recommend. I seldom see files using the ‘hex’ escaped entities like &#xE9;, so I suggest not using them, but I can’t defend that prejudice. Ideally you should use one system, consistently, in your file, and if you need to change it at some later point it won’t be hard.
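And if you’ve inherited a file full of html style entities, converting them to the decimal style is mechanical. A sketch, again in Python; the standard library’s table of named entities does the lookup, and the file names and encoding are placeholders:

    import re
    from html.entities import name2codepoint

    def to_decimal(match):
        """Rewrite a named entity like &eacute; as its decimal form &#233;."""
        name = match.group(1)
        if name in ('amp', 'lt', 'gt', 'quot', 'apos'):
            return match.group(0)                 # XML predefines these five; leave them alone
        if name in name2codepoint:
            return '&#%d;' % name2codepoint[name]
        return match.group(0)                     # unrecognized name: leave it for a human

    with open('onix.xml', encoding='iso-8859-1') as f:        # placeholder name and encoding
        text = f.read()

    with open('onix-decimal.xml', 'w', encoding='iso-8859-1') as f:
        f.write(re.sub(r'&([A-Za-z][A-Za-z0-9]*);', to_decimal, text))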
A second consideration is the on-line companies that get your data. In descriptions and biographies you can certainly spell things correctly, since the entities will appear correctly in browsers, but what about “searchable” fields like Contributor and Title? What happens to your author Hélènne Ővēn if you submit her name ‘correctly’ (according to whom? me?) in encoding=’iso-8859-1’, with the out-of-range letters escaped as “Hélènne &#336;v&#275;n”, or in utf-8 simply as “Hélènne Ővēn”? It may render properly in a browser, but will it affect how easily you can search for her name on Amazon or Indigo? Can a consumer search the obvious keyboard bastardization “Oven” and find the book? It’s a problem; that’s about all I can tell you. The on-lines are way better about this than they were a few years ago, when anything outside of simple keyboard characters wasn’t acceptable in a searchable field, but there are no guidelines here. I’d say that “Hélènne”, with its fairly normal “special characters” all within iso-8859-1, wouldn’t be much of a stretch, but “Ővēn” is likely to cause problems. It might also matter whether you submitted “Hélènne” directly, as “H&eacute;l&egrave;nne” or as “H&#233;l&#232;nne”. They are all different strings, clearly, and programmers at every aggregator and on-line retailer would have to handle all of them to index the name. Did they? Will your own website? Oh dear!
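For what it’s worth, what an aggregator has to do to make the keyboard spelling findable is some kind of accent folding on its index; whether any given retailer actually does it is exactly what I can’t tell you. A sketch of the idea, nothing more:

    import unicodedata

    def fold(text):
        """A naive index key: decompose accented letters and drop the accents."""
        decomposed = unicodedata.normalize('NFKD', text)
        return ''.join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()

    # True if the index folds accents; no promise that any particular retailer's does.
    print(fold('Hélènne Ővēn') == fold('Helenne Oven'))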
This is, sort of, what it means when BISG says iso-8859-1 is the recommended encoding for the US supply chain: Aggregators should accommodate at least the special characters in it. And maybe they do more, maybe they do less, but it’s reasonable to think they’ll do that much. And when I say Canada hasn’t made a recommendation it means, well, we haven’t gone that far.
If you really have a lot of special characters that are critical and you don’t yet know what you and your trading partners are doing, well, that’s beyond the scope of this blog. I’m aiming here at practical help for largely English-language ONIX producers. But mostly I want to say: take advantage of ONIX!! You can, and should, update your records. So spell the name right, submit your data as early as you can and then check the on-line records. If it doesn’t look right or the searches fail, ask them about it or judiciously misspell the name to compensate and re-submit your data. You should have six months before publication to work it out. Maybe Amazon and Indigo will be OK but it’ll be wrong on Barnes & Noble. Maybe it’s only Walmart who can’t get it right. And maybe Walmart is the only one that matters to you. It’s your call, but the author will probably understand why you made your choice. Try again in two years and the answers will have changed.
And that advice should make anyone who cares even a little about the accuracy of their records cringe.