[UPDATE: The most recent version of the ONIX codelists may be found here.]
I can be genuinely excited by the oddest things in ONIX, but at the moment what has me going is that EDItEUR has been dropping codes from ONIX Code List 34 Text format. Eleven since Issue 37 and finalized in Issue 42! This is good news, and while technically this change applies only to ONIX 3.0 (ONIX 2.1 being “locked” at Issue 36), this is a change everyone can profitably implement in either version.
Everyone should use this list of 5 codes for Text format.
ONIX 3.0’s schema – the XML definition document used by computers for validation – made the change with Issue 37. This is something I am familiar with as ONIX 3.0 files regularly fail validation because of what appear to be random format codes. This suggests to me that software developers should make note of this code list change as it applies to the XML attributes their programming inserts into ONIX. An example of an XML attribute would be <Text textformat="05"><p>blurb</p></Text>.
I see evidence in BiblioShare that makes me think that software may be creating "default" values, which is also part of the problem. An example of that would be files failing ONIX 3.0 because of the “01” code being used as Text format. That's the code for SGML and it's safe to say it's never been a format used in anyone's ONIX feed. Ever. Could it be the "01" is a programmer's placeholder value? I only know what's in the file and no client has explained.
ONIX Code Lists Issue 42, July 2018
List 34: Text format
Value | Description | Notes |
02 | HTML | Other than XHTML |
03 | XML | Other than XHTML |
05 | XHTML | |
06 | Default text format | Default: text containing no tags of any kind, except for the tags &amp; and &lt; that XML insists must be used to represent ampersand and less-than characters in text, and in the encoding declared at the head of the message or in the XML default (UTF-8 or UTF-16) if there is no explicit declaration |
07 | Basic ASCII text | Plain text containing no tags of any kind, except for the tags &amp; and &lt; that XML insists must be used to represent ampersand and less-than characters in text, and with the character set limited to the ASCII range, i.e. valid UTF-8 characters whose character number lies between 32 (space) and 126 (tilde) |
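For developers, a check along these lines could catch stray codes before a file goes out. It is only a minimal sketch and no substitute for schema validation: the function name `find_bad_textformats` and the cut-down sample fragment (not a valid ONIX record) are made up for illustration, and it looks at nothing but the `textformat` attribute.

```python
import xml.etree.ElementTree as ET

# The five List 34 codes that remain as of Issue 42 (see the table above).
ALLOWED_TEXT_FORMATS = {"02", "03", "05", "06", "07"}

def find_bad_textformats(onix_xml: str) -> list[tuple[str, str]]:
    """Return (element name, code) pairs for textformat values not in the list."""
    root = ET.fromstring(onix_xml)
    problems = []
    for elem in root.iter():
        code = elem.get("textformat")
        if code is not None and code not in ALLOWED_TEXT_FORMATS:
            # Drop any "{namespace}" prefix ElementTree puts on the tag name.
            name = elem.tag.rsplit("}", 1)[-1]
            problems.append((name, code))
    return problems

# Hypothetical fragment: one placeholder-style "01" and one proper "05".
sample = """<TextContent>
  <Text textformat="01">Placeholder blurb</Text>
  <Text textformat="05"><p>blurb</p></Text>
</TextContent>"""

print(find_bad_textformats(sample))  # [('Text', '01')]
```

Run against real files, a report like this would show quickly whether a "01" is a one-off or a value the software is inserting everywhere.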
Why is this change important?
The transition from ONIX 2.1 to ONIX 3.0 should be used to make the metadata better, and a core function of ONIX is to allow publishers to provide online retailers with content to display for consumers. The format of this content is important, therefore the code that describes the text format is also important.
This highlights an important difference between use of attributes in the two versions. Attributes have seldom been applied in 2.1, partly because they were expected to be used in limited and specific ways, yet they were broadly defined within the XML schema, which made them hard to monitor. This sounds confusing, but it just means that the 2.1 schema definition (used by computers) allowed most attributes to be applied to most elements when the specification (used by people) was more specific. ONIX software implementation typically followed the computer's definitions and allowed broad and unnecessary use. It's possible that the data entry staff may not have had control or were presented with unnecessary choices.
I can confirm (based on what I've seen over the years) that attributes would be used in elements they didn't belong in, with values of dubious or neutral meaning. Even if you feel I'm overstating things (I like a good story), you'd agree that lack of use contributed to poor data and lack of adoption by retailers. This describes a problem of a decade ago but the lack of good and intentional use of attributes lingers to this day.
The point is attributes are a normal part of XML data exchange and ONIX 3.0 is designed to use them. And it fully controls where they can be used and limits them to appropriate elements. The change to code list 34 really clarifies the use of Text format. It's a genuine improvement and I am thrilled.
The Text format attribute has a primary purpose: to say something about the format of the contents of XHTML enabled elements in ONIX. That's true of both 2.1 and 3.0 and I've provided the appropriate lists below as Appendix 2 and 3. They're just copied from the manual, so I give thanks to EDItEUR and their detailed and endlessly helpful, accurate documentation.
The reason you provide this and all the other attributes is to trigger processing choices by end users. Let me restate that for Text format: Display requires formatting support and while we rightly spend a lot of time talking about correct use of HTML in ONIX, Code List 34 enables you to warn end users how you expect them to process it. The Text format code should match the formatting of the text's contents, the same way that the Text type code identifies what it is – the description, bio, and so on. Retailers should be able to base their processing on either value and can only do so if the values are accurate. You should use the same care in labeling the Text format as you do in crafting the actual formatting you supply.
What happens when it's done badly and not implemented by retailers? Publishers complain about lack of adoption of new ONIX structure and an inability by retailers to accommodate changes in their files. One reason is that retailers' processing is often done account by account – you handled your data this way, not that way, and when you improve it, it only screws things up for them. Do you see where this is going? Text format should be a part of the solution. The retailer's programming should be triggered by proper coding, not by previous poor practice. Are they ready now to do that? I don't know. Do your text format codes mean anything? Can we sing "Chain of Fools" together?
That's why this change is important – it makes it clear how simple it really is. The old List 34 with 16 formats contained codes that shouldn't be used in ONIX and it made it hard to see the simplicity of your choices – and the simplicity of its processing.
First off, the Text format code is used to identify the type of markup provided in the XHTML enabled fields. Retailers should expect to find formatting there, but not everyone provides it, so none is also an option:
NO MARKUP in the XHTML Enabled element? Use the Code 06 Default text format
MARKUP USED in the XHTML Enabled element? Use Codes 02 or 05 – HTML or XHTML
I'll ignore codes 03 XML and 07 Basic ASCII for the moment as they aren't typically needed but, if it's not obvious, XML is a type of markup as well.
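The two rules above can be sketched as a small labeling helper. This is a rough heuristic under stated assumptions, not an official test: `suggest_textformat` is a hypothetical name, "has markup" is guessed with a simple tag pattern, and "05" is suggested only when the text parses as well-formed XML, which is weaker than checking it against the XHTML subset the ONIX schema allows.

```python
import re
import xml.etree.ElementTree as ET

# Rough guess at "contains a tag": "<" followed by a letter, up to the next ">".
TAG_PATTERN = re.compile(r"</?[A-Za-z][^>]*>")

def suggest_textformat(text: str) -> str:
    """Suggest a List 34 code for the contents of an XHTML-enabled element."""
    if not TAG_PATTERN.search(text):
        return "06"  # no markup: default text format
    try:
        # Assumption: well-formed once wrapped in a dummy element = usable as XHTML.
        ET.fromstring(f"<wrapper>{text}</wrapper>")
    except ET.ParseError:
        return "02"  # markup that is not well-formed XML: plain HTML
    return "05"

print(suggest_textformat("A plain blurb."))                      # 06
print(suggest_textformat("<p>A <em>marked-up</em> blurb.</p>"))  # 05
print(suggest_textformat("<p>First line<br>second line"))        # 02
```

The point of the sketch is that the code is derived from the text itself, so the label and the contents cannot drift apart.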
Second – and this should explode your mind – the only place you should need to consider whether you need to use CDATA tags, or if you should be escaping (X)HTML tags, is where you've used the Text format code for markup tagging. Let me put that another way: the only place retailers should expect to see CDATA or escaped (X)HTML is where you've used "02" or "05" as a Text format attribute.
If you don't understand CDATA or escaping, check out Appendix 1 on HTML resources. However, the easiest way to do this is without either CDATA or escaping, but instead by using the simple XHTML (textformat="05") openly as allowed by the ONIX schema and letting the XML processing confirm the integrity of your data. The rest of the ONIX file (excluding the Content composite that no one uses) is delivered in the file's character encoding.
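To make the three delivery options concrete, here is a minimal sketch using Python's standard XML parser. The sample blurb, the choice of "02" for the CDATA and escaped versions, and the helper name `recover_markup` are all assumptions for illustration; the point is only that a receiver ends up with the same markup each way, while the openly embedded version is the one the XML parser actually checks.

```python
import xml.etree.ElementTree as ET

# 1. XHTML embedded openly: the tags are real XML, so the parser checks them.
embedded = '<Text textformat="05"><p>A <em>fine</em> book</p></Text>'
# 2. Markup wrapped in CDATA: the parser treats it as plain characters.
cdata = '<Text textformat="02"><![CDATA[<p>A <em>fine</em> book</p>]]></Text>'
# 3. Markup with escaped tags: also plain characters once parsed.
escaped = ('<Text textformat="02">'
           '&lt;p&gt;A &lt;em&gt;fine&lt;/em&gt; book&lt;/p&gt;</Text>')

def recover_markup(text_xml: str) -> str:
    """Return the markup a receiver would see inside the element."""
    elem = ET.fromstring(text_xml)
    # Leading character data plus any child elements serialized back to text.
    return (elem.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in elem
    )

for version in (embedded, cdata, escaped):
    print(recover_markup(version))  # <p>A <em>fine</em> book</p> each time
```

If a closing tag were missing, only the first version would fail to parse; the CDATA and escaped versions would pass the error along to the retailer unnoticed.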
If there isn't any markup used in one of the XHTML enabled fields, provide a Text format attribute of "06" to say that. Retailers will understand there's no special processing required.
Isn't this easier than you thought? Simple (X)HTML delivered in a limited number of fields properly identified should trigger a processing choice.
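On the receiving side, "trigger a processing choice" might look something like the sketch below. It is an illustration under assumptions, not any retailer's actual pipeline: `prepare_for_display` is a hypothetical helper, a missing attribute is treated as "06", and codes "03" and "07" beyond plain text handling are left to whatever the trading partners have agreed.

```python
import html
import xml.etree.ElementTree as ET

def prepare_for_display(text_xml: str) -> str:
    """Pick a processing path from the textformat code (hypothetical helper)."""
    elem = ET.fromstring(text_xml)
    # Assumption: a missing attribute is treated as "06" (no markup).
    code = elem.get("textformat", "06")
    if code in ("06", "07"):
        # No markup promised: escape everything and wrap it for the web page.
        return "<p>" + html.escape(elem.text or "") + "</p>"
    if code in ("02", "05"):
        # Markup promised: keep it (a real pipeline would sanitize it first).
        return (elem.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in elem
        )
    raise ValueError(f"No processing agreed for textformat {code!r}")

print(prepare_for_display('<Text textformat="06">Cats &amp; dogs</Text>'))
print(prepare_for_display('<Text textformat="05"><p>Cats &amp; dogs</p></Text>'))
```

Both calls print `<p>Cats &amp; dogs</p>`, which is the idea: when the code matches the contents, the retailer can reach the same clean display from either kind of input.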
ONIX 3.0 is simpler than ONIX 2.1. The choices haven't changed that much, but the structure makes the simplicity of your choices clearer, while the XML schema – what the computer and software understand – supports that.
I also promised an explanation on codes "03" XML and "07" ASCII. Essentially, there's no practical reason for them to be used without an agreement from the end user. They have value for identification in a proprietary feed (and they might have value in other markets where a convention may be agreed to). But in North America? Don't use them in generic feeds because there's no practical use for them. If you would like more information, you should pose the question to the ONIX Implementation Group organized by EDItEUR or contact me and we can nerd out.
Reality check
I think most retailers don't use attribute metadata because I think most data senders provide it without any intent. But attributes should be used because the people processing data need cues.
The transition from ONIX 2.1 to ONIX 3.0 means using the data better and more meaningfully. Weird things that didn't make sense in ONIX 2.1 have been fixed and improved, and this improvement continues.
What we need now is more meaningful use to promote better processing.
Appendix 1
HTML resource:
EDItEUR: HTML markup in ONIX
BookNet's page: HTML in ONIX
Appendix 2
The ONIX 3.0 data elements within which (X)HTML markup may be used are:
<AncillaryContentDescription> | <MarketPublishingStatusNote> |
<AudienceDescription> | <PrizeJury> |
<BiographicalNote> | <PromotionCampaign> |
<BookClubAdoption> | <PromotionContact> (deprecated) |
<CitationNote> | <PublishingStatusNote> |
<CopiesSold> | <ReissueDescription> (deprecated) |
<ConferenceTheme> (deprecated) | <ReligiousTextFeatureDescription> |
<ContributorDescription> | <ReprintDetail> |
<ContributorStatement> | <SalesRestrictionNote> |
<EditionStatement> | <Text> |
<FeatureNote> | <TitleStatement> |
<IllustrationsNote> | <WebsiteDescription> |
<InitialPrintRun> |
Appendix 3
The ONIX 2.1 data elements within which (X)HTML markup may be used are:
<Annotation> | <MainDescription> |
<BiographicalNote> | <PrizeJury> |
<DownloadCaption> | <ProductWebsiteDescription> |
<DownloadCopyrightNotice> | <ReviewQuote> |
<DownloadCredit> | <TextWithDownload> |
<DownloadTerms> | <WebsiteDescription> |