BookNet Canada Blog

Archive for the ‘Biblio’ Category

BNC 101: What is XML?

Thursday, June 17th, 2010 by Meghan MacDonald

What is XML? is the first in a series of BNC 101 blog posts where we’re going to try our best to break down some of the complex tech concepts we talk about all the time into plain language. Wish us luck!

Have an idea for a BNC 101 blog post? Leave a comment below to let us know.

XML is a term that gets thrown around the publishing industry a lot, but what does it actually mean?

First, XML stands for Extensible Markup Language. XML doesn’t do anything; instead, it lets you describe what something is. It is a text format that lets you define information for computer-to-computer communication. Basically, it’s a way to let two programs that speak different languages talk to each other.

I find it’s best to think of XML as content without form: XML is what is in the background describing what everything is; how it looks is then determined by where that information is being used. Some familiar examples of XML-based languages include XHTML for the web, IDML for InDesign, and ONIX for book information.

Elements

Elements are the building blocks of XML. Think of these elements like descriptors, adjectives attributed to the content. Each bit of content gets described by the element. Elements are made up of opening and closing tags, and the content goes between these tags.

Elements look like this:
<tag>content</tag>

Or, for something publishing-specific, like this:
<Title>Canadian Book Market</Title>

XML allows you to describe infinite amounts of information, but it is the receiving program that decides what to do with it. If an online store receives your file, it will take that title tag and know to post the title as the title on its website. Some programs act on more of the described information than others, so to be on the safe side it’s best to provide more rather than less information to avoid blanks.

For example, if a book has a Canadian author, you would want to add an element that says the author is Canadian. Even if some receivers of the XML file won’t process it, others will, and that will be to your advantage.
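To make that concrete, here is a minimal sketch in Python of one file being read by two different receivers. The element names (Book, Title, Author, AuthorCountry) and the author name are made up for illustration; they are not ONIX tags, and real receivers will have their own rules.

```python
import xml.etree.ElementTree as ET

# A made-up record: these element names are illustrative, not real ONIX tags.
record = """
<Book>
  <Title>Canadian Book Market</Title>
  <Author>Jane Example</Author>
  <AuthorCountry>CA</AuthorCountry>
</Book>
"""

book = ET.fromstring(record)


def simple_store(book):
    # This receiver only knows about the title and ignores everything else.
    return {"title": book.findtext("Title")}


def detailed_store(book):
    # This receiver also acts on the author's country, if the sender supplied it.
    return {
        "title": book.findtext("Title"),
        "canadian_author": book.findtext("AuthorCountry") == "CA",
    }


print(simple_store(book))
print(detailed_store(book))
```

Both receivers get the same file; the second simply acts on more of what was described. If the AuthorCountry element were missing, the second receiver would have nothing to go on and the book would not be flagged.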

XML always sounds big and scary, but really it’s just another version of something we’ve been doing for years in publishing: marking up documents in the same way you would a manuscript.

The publishing-specific XML language for transferring information about your books is called ONIX. Next week, we’ll post BNC 101: What is ONIX? — your introduction to online information exchange.

You can find all of our introductory blog posts in the BNC 101 category.


ISBNs — D’oi!! It’s 11PM — Do you know where your product records are?

Wednesday, June 16th, 2010 by tom richardson

You’d think you were at Market Day in Ganges, there’s such PANDEMONIUM out there. Everyone’s talking about Product Identifiers — live Twitter discussions, position papers staked out by major organizations, and the BISG has no fewer than two dedicated industry groups (maybe three) discussing issues around ISBNs, e-Books and ISTC to make sure that their supply chain perspective is heard.

I’m lucky enough to be able to participate in some of the BISG discussion, quietly, as a good Canadian will, but I know the Canadian supply chain wants to know what to do. Right now, the answer is: participate. But to give a flavour of what’s being talked about, here are some jottings from a recent meeting — points that came up that are being thought about, things that aren’t answered yet:

  • Using product identifiers in a closed system like Kindle or Nook, vs open systems where the identifier is actually traded to identify the product. What’s the best practice for each?
  • Libraries need to be able to buy for specific e-readers that are supported by them institutionally and are unable to because of a lack of unique identification and data. Library wholesalers are developing kludges to add the missing link.
  • Sales tracking — if nothing else, amalgamated sales for Best Seller lists? How can this info be combined… (ISTC discussions ensue)
  • Identifier BLOAT! No, wait: there’s DATA bloat too… (think about it: they are different and both real). It all costs and costs too much — this is about driving sales (or decreasing costs)
  • One record = One Product Identifier OR Hierarchical data? Who says we have to repeat everything all the time, endlessly just because version 2.8 of Dirty Sock Reader was released?
  • Sales rights: the responsibilities of 3rd parties to not sell when they shouldn’t, and the responsibilities of publishers to inform others about a book’s status.
  • Are we just building a giant disconnect? What is a product anyway — it really isn’t that clear in the digital supply chain…
  • GDSN / DOI / ISTC — Does DOI belong on this list? Discuss!

You get the idea. This is important stuff even if it tastes like thick dust. BookNet Canada is Canada’s publishing supply chain organization — but we answer to your pulls on it. This is how your business does business with other businesses — EDI, your records, your royalties are all wrapped up here. If what’s happening doesn’t support your need to communicate with your business partners, let us know.

LINKS

Twitter hashtag (live, Friday noon EST): #ISBNhour
Dec 2009 position paper from BIC
Feb 2010 position paper from the Int ISBN Agency
Book Industry Study Group (BISG)

New ONIX Tools…now with 3.0

Wednesday, June 2nd, 2010 by Noah Genner

Today we are happy, no ecstatic, to announce the release of one new ONIX converter and an upgraded BNC Excel Template to ONIX converter. This might not be exciting to everyone, but to us metadata junkies this is a little like a Data Festivus. So exactly what do we have:

- BookNet has released a new converter that will convert ONIX 2.1 (following the Canadian Bibliographic Standard) to ONIX 3.0. Yup, 2.1 in…3.0 out.

- We have also refreshed the BNC Standard Excel Template to ONIX Converter to output either ONIX 2.1, or ONIX 3.0.

Did I mention that they’re free to use? No? Well, they are. Go ahead and check them out. Any comments? We’d love to hear them - biblio@booknetcanada.ca

Caveats:

- This should likely not be used in a production environment until you have thoroughly tested things. While the conversion works well there may be cases where it doesn’t (tell us about those cases). Also, there are very few, if any, data recipients accepting ONIX 3.0 yet. Don’t start sending ONIX 3.0 to anyone until you know it will work for them. Definitely don’t stop sending data in the format you are now without some discussion with said recipients.

- The converters can only convert data; they don’t add data. So you are missing out on many of the new, and important, changes that ONIX 3.0 brings to the table. Go read about some of them here and get the standard.

- Most important! Garbage in, garbage out! Validate your data, send it into BiblioShare and get a free data quality report.

ONIXEdit vs EXA Editor

Monday, May 17th, 2010 by tom richardson

EXA Editor is the venerable XML editor for use with ONIX that BookNet Canada made available as a free download for the past few years. It was an excellent product with a not-so-excellent GUI (nerdese for your ability to find and click on things) and a useful tool for learning about ONIX. A number of publishers’ ONIX programs are, or were, based on it.

EXA Editor was software developed by a Montreal firm called GPG Solutions, and it was different from the database solutions offered by Bob Houghton at HiPoint, or Doug Plant’s PeXod, in that it was basically a way to work directly in an XML file. A database interface lets you edit the fields in a database whose contents can then be exported as an ONIX file — your data entry is once removed from XML. EXA Editor was an XML editor, an interface that made it easier to directly edit the ONIX file — and the interface also provided the ONIX code lists, enforcement of mandatory fields and similar data quality basics.

Canada really is blessed with some excellent software!

ONIXEdit is GPG’s next generation of ONIX software, with a vastly improved user interface, an ability to check files against different standards including BookNet’s, our Quebec equivalent BTLF, the US’s BISG and the UK’s BIC, as well as a number of other improvements.

BookNet Canada, through various arrangements, was able to offer EXA Editor for free. And we continue to commission work from GPG Solutions (our template and ONIX converters are their work), but BookNet Canada has no more linkage to them than we do with any other development firm we work with. But this does allow me to address one question publishers have:

BookNet Canada does not offer ONIXEdit for free — EXA Editor is no longer being offered by GPG Solutions and BookNet Canada has not made any arrangements to offer the new product. GPG does offer a limited version of it for “free” for files of fewer than 50 records, as well as 30-day trial periods to assess the software, and the cost is pretty reasonable. You can look it up yourself, but I’d estimate the initial cost to be less than the cost of a single — a single — seasonal catalogue page in black and white for most publishers. The world of metadata for less than the cost of an outdated-before-it’s-published b/w page. Cheap in my book.

BookNet Canada, as a rule of thumb, does not review software, but 2 future blog posts will focus on the converters we are about to release, and have a look at ONIXEdit and why one might edit (or not) in XML in more detail. ONIXEdit still offers all the advantages that made our arrangement for EXA Editor of mutual benefit — and it’s an excellent product whose history with BookNet makes it a special case.

BNC Visits the Espresso Book Machine at McMaster University

Friday, May 14th, 2010 by Meghan MacDonald

Earlier this week, the BookNet team took a field trip to Titles McMaster University Bookstore to check out their Espresso Book Machine (EBM). Mark Lefebvre, BNC Board member and our gracious host for the day, took us on a tour of Titles and gave us a live demo of the EBM (with some help from Laura the EBM magician).

Don’t know what an EBM is?

"The Espresso Book Machine is a fully integrated patented book making machine which can automatically print, bind and trim on demand at point of sale perfect bound library quality paperback books with 4-color cover indistinguishable from their factory made versions." - On Demand Books

It really is as quick as they say. It only takes a few minutes from the time you select and order the book for you to have your shiny new POD book in your hands.

While I was hoping for something like this:

[Image: Charlie and the Chocolate Factory]

It actually looks like this:

[Image: Espresso Book Machine]

But it manages to make books anyway. Success!

[Image: POD Book and Tim]

Now, it wouldn’t be a BNC Blog post if I didn’t remind you about how important metadata is. Hilarious moment of the day as described by Mark:

we selected a title from the catalog of just under 1 million titles to show them how we order from the EspressNet Catalog. We picked a public domain Google Book of Shakespeare — a "King Lear" search result that was listed as 120 pages. We figured it would be a nice short book that could be completed in about 3 minutes, as part of demonstrating the quickness of this process.

Of course, it took an unexpectedly long time for the book to load to our system and start printing. And once it started, the print queue was showing a gigantic page count, well beyond 120. So we let it run its course and out came a 1000 page book.

The BNC folks grinned at this and stated something they often say, and something I’m familiar with given my previous job role as data wrangler at Chapters/Indigo between 1999 and 2006.

"See," Tom, the Bibliographic Manager at BNC said. "It all comes down to the quality of the metadata."

The EBM has been a huge success for Titles. It has opened up new business opportunities that a university bookstore would normally not be able to tap into, and makes it so that millions of books are available at the click of a mouse.

Want your books to be available on the EBM? Comment below — I’m sure Mark would be happy to pass on some info.

Happy Birthday, ONIX!

Tuesday, May 4th, 2010 by Meghan MacDonald

May 2010 is the 10th anniversary of ONIX for Books Release 1.0. The first release was a result of a need to provide full, accurate data to online retailers. That need was first expressed through the Association of American Publishers who worked with EDItEUR to develop release 1.0.

Benefits of ONIX for Books from the EDItEUR website:

For publishers, experience has shown that ONIX for Books brings two important business benefits. As a communications format, it makes it possible to deliver rich product information into the supply chain in a standard form, to wholesalers and distributors, to larger retailers, to data aggregators, and to affiliate companies. And by providing a template for the content and structure of a product record, ONIX has helped to stimulate the introduction of better internal information systems, capable of bringing together all the "metadata" needed for the description and promotion of new and backlist titles.


Facebook’s Open Graph has huge potential for books

Thursday, April 22nd, 2010 by Meghan MacDonald

Facebook announced its Open Graph earlier this week, opening up to the rest of the web and making the user experience more social in the process. But what does this mean for books?

The two features that will have the biggest impact for publishers are the Open Graph and Like buttons.

Facebook Open Graph

The Open Graph is Facebook’s attempt to aggregate all of a user’s social activities from across the web into the Facebook platform. In doing this, they’re mapping connections between users and elements. There is HUGE potential here: 400 million users makes for a lot of social data, and if we can tap into that it will change the way we market and sell books.

Facebook Like Button

The Like button, specifically, is a simple way to open Facebook up to the rest of the web — something that has been a huge problem for marketers trying to use Facebook to their advantage. Before this, Facebook users were only on Facebook and it was really hard to get them to either leave or bring information in. Now, Like buttons can be added to any web page, the same way that you would add a share button.

Make the most of it now:
Add a Like button to a webpage about one of your books. When a Facebook user clicks that they like that book, it will get pulled into their Facebook profile and show up to their friends, acting as a personal recommendation. This also creates a connection between that user and that book. (feel free to replace ‘book’ with ‘author’ in that last paragraph — same deal)

Plan for the future:
The way Facebook will be creating these connections can change the way we sell books. Having access to the data that tells you that User A likes Book A, B, C & D and is friends with Users B, C & D who like Books B, D, E, F & H the most, means that you suddenly have information about potential readers that you never had access to before. With this information, you can make smarter auto-generated recommendations based on their current likes and the likes of their friends.
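As a rough illustration of the idea, here is a minimal sketch in Python of a friend-based recommendation. The likes data is invented to mirror the example above (how the friends' likes are split between Users B, C and D is my own assumption), and the ranking rule, counting how many friends like each book, is just one simple choice. It is not how Facebook or any retailer actually does it.

```python
from collections import Counter

# Hypothetical "likes" data, loosely mirroring the example in the text.
likes = {
    "User A": {"Book A", "Book B", "Book C", "Book D"},
    "User B": {"Book B", "Book D", "Book E"},
    "User C": {"Book D", "Book E", "Book F"},
    "User D": {"Book B", "Book F", "Book H"},
}
friends = {"User A": ["User B", "User C", "User D"]}


def recommend(user):
    """Rank books the user's friends like that the user hasn't liked yet."""
    counts = Counter()
    for friend in friends.get(user, []):
        # Only count books the user doesn't already like.
        for book in likes.get(friend, set()) - likes[user]:
            counts[book] += 1
    # Most-liked first; ties broken alphabetically for a stable order.
    return sorted(counts, key=lambda b: (-counts[b], b))


print(recommend("User A"))  # ['Book E', 'Book F', 'Book H']
```

A real system would weigh far more than raw counts, but even this toy version only works if each book can be identified and described consistently.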

What that means, though, is that you need to have good metadata for your books! (totally didn’t see that one coming, did you?) Let’s say that a user has a lot of connections to Canadian authors who write mystery & detective fiction, but you don’t have the contributor flagged as Canadian and got lazy and used Fiction General as your subject category. Your book probably isn’t going to get pulled up as a recommendation for that user — and you’ve lost a potential reader.

It doesn’t have to be that way, though. You can improve your ONIX files (it’s easy — I promise! — and BNC BiblioShare can help), which in turn will help readers discover your books that are right for them.

The BiblioShare Certification Challenge

Tuesday, April 20th, 2010 by tom richardson

BNC BiblioShare is going gang-busters, with over 230,000 EANs in the system, over 12,000 Canadian author markers, and aggregators starting to experiment with the data. It seems like a good time to take stock.

One of the first things we found is anyone who wants the data really wants – and I mean really really wants – Canadian indicators, as much as they can get. I’m required at every opportunity to list these, so to review:

  • Canadian authorship is shown using the Contributor composite’s country code;
  • BISAC Regional Codes (code 11 in the Subject composite) can be used to draw the attention of a region’s book buyers or libraries;
  • if you’re still using an old BISAC code list, well, you know: a number of Canada specific codes were added 3 or 4 years ago and you should update yearly.
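For anyone who wants to spot-check their own records, here is a minimal sketch in Python that looks for the first two markers in a product record. The fragment is trimmed and made up, the subject code is a placeholder rather than a real BISAC regional code, and the tag names follow my reading of the ONIX 2.1 reference names, so check them against the standard before relying on this.

```python
import xml.etree.ElementTree as ET

# Trimmed, made-up fragment. Tag names are assumed from ONIX 2.1 reference
# names; the subject code below is a placeholder, not a real BISAC code.
sample = """
<Product>
  <Contributor>
    <PersonName>Jane Example</PersonName>
    <CountryCode>CA</CountryCode>
  </Contributor>
  <Subject>
    <SubjectSchemeIdentifier>11</SubjectSchemeIdentifier>
    <SubjectCode>PLACEHOLDER-REGION-CODE</SubjectCode>
  </Subject>
</Product>
"""


def canadian_indicators(product):
    """Report which Canadian markers a product record carries."""
    has_canadian_contributor = any(
        c.findtext("CountryCode") == "CA" for c in product.iter("Contributor")
    )
    # Scheme 11 is the BISAC regional code scheme mentioned in the list above.
    has_regional_code = any(
        s.findtext("SubjectSchemeIdentifier") == "11"
        for s in product.iter("Subject")
    )
    return {
        "canadian_contributor": has_canadian_contributor,
        "regional_code": has_regional_code,
    }


print(canadian_indicators(ET.fromstring(sample)))
```

This only checks that the markers are present, not that they are correct, and it is no substitute for a proper validation or data quality report.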

The industry wants to promote your books, so help them. There have even been some feelers about, maybe, getting publishers to use Country of Publication more – and Country of Manufacture can be in the mix too: with Code Issue 11 you can include it in ONIX 2.1 files using the familiar Other Text composite.

The other piece of feedback that we’re getting from aggregators is: Why is the data so inconsistent (actual adjectives have been standardized in the interest of simplicity and taste)? Even BookNet Canada’s own programmer has wondered, the programmer from the ACP’s Bookshelf (as experienced an ONIX producer as exists in Canada) has been caustic, and Library and Archives Canada has been close to un-librarian-like.

This is not a problem unique to Canada: BIC’s most recent annual report said the same thing and questioned the effectiveness of certifying publishers. It was discussed last year at the ONIX meetings at the London Book Fair and, if not for volcanoes, would have been again.

When it gets to e-pub metadata the universally agreed descriptor is crap.

It’s a feature of BNC BiblioShare that it’s the publisher’s data and that rather than changing it we work with publishers to make it better.

Just to be clear, BNC’s aggregated dataset – the stuff we’ll be serving up over the web and issuing as aggregated files from our own database – will have some level of standardization in it, but we aren’t staffed or funded to do an intensive remediation of files. Actually, that’s how the industry got to where it is. Big Publishers submitted files and Big Retail’s programmers wrote scripts, publisher by publisher, to clean them up to match, as best they could, their individual standard. Bowker and other aggregators invested heavily in systems to compensate so that their output is as clean as they can make it. And it’s lovely! But the smaller the publisher the less economic that model is, and everyone whose data has been changed knows it’s a mixed blessing: The fix can be its own problem.

So, I say again: It’s a feature of BNC BiblioShare that it’s the publisher’s data. The size of file is irrelevant to us because all files are processed the same – the only limit to getting in is they have to pass a strict validation. But once that’s passed we work with publishers to make it better. Each and every file has its own Detailed Report – it’s not perfect or the final word on data quality but it’s very good. Mind you, it’s only good if publishers use it and think that these are things they should fix.

And this leads to a question for the Canadian publishing industry: What do you want from certification? The aggregators we’re working with seem to think that a publisher file that’s been “certified” should be usable without work on their part. And by that they mean: quality bibliographic information that really matches the book, full utilization of all the correct ONIX elements, no glitchy characters, Canadian identifiers up the wazoo, as much enhanced content as possible… They’re pretty demanding given they are trying to provide publishers with free discoverability and support.

This seems like a no-brainer – of course a certified file should be usable! But I’m not sure. Does Canada want a hard-assed data certification system? What are the boundaries of failure to certify? What should that mean?

Here’s a chance to talk about it and comment below…

ISTC: A duffer’s perspective

Thursday, April 1st, 2010 by tom richardson

I had the good fortune to attend BISG’s meeting on Tuesday “Focus on ISTC” which was a supplement to Michael Holdsworth’s recent paper:
The International Standard Text Code (ISTC): A Work in Progress / A Supply Chain Perspective

Holdsworth reviewed the paper’s content and then a panel including senior representatives from BN, Hachette, Bowker and BISG discussed the paper and its implications with the room, which was chock-a-block with senior representatives of the US publishing and data industries.

The short message for Canadian publishers and data suppliers is that the US and UK supply chain is taking the ISTC seriously and that Canadian publishers should be familiarizing themselves with it too. You might need to be using it soon.

The good news is that the ISTC and its implications are all pretty much good news. Holdsworth’s paper is far more detailed than I want to be here, but basically the ISTC is an identifier for text. Just text, similar in a way to what you might say has a copyright – a string of characters in a specific order is more or less what it identifies. But, it’s also simple: don’t go getting too specific about it (though there are stakeholders out there who will want to make it so). And unlike copyright it’s applied only to material that is in the publication process – not necessarily published but to be published.

The simplest case is that a publisher has a manuscript and has decided to publish it. Before anything else a simple registration of that is done and an ISTC number is obtained for it as an original work. That number can then be used, in ONIX 3.0, to identify all the various formats: hardcover, various paper, e-books in whatever format, audio. The text – the order of the words — remains the same so it’s the same work and carries the same ISTC. Kazaam: Suddenly Amazon and Indigo can get their records right.

OK: You know and I know that different formats might not be exactly the same string of characters – reprints are published with corrections, e-books are enhanced… Well there are groups who would like the ISTC to be meaningful in that regard – but the “supply chain” perspective is more “be realistic.” A Harry Potter book may be Americanized and carry a different title – but essentially, it’s the same text and should carry the same ISTC. Will this be how it works, necessarily? Time will tell but hopefully.

The second important concept here is that ISTCs can be for a source – the top of the chain — or they can be “derived.” A derived text would have its own unique ISTC, but it would also have a “source” ISTC. The clearest example would be a translation. This text, ISTC number X, is supplied to a translator who produces a clearly unique text string identified by ISTC number Y. But the derived ISTC Y would carry the source ISTC X and a standard explanation of how it’s derived. That’s the simplest case, but just as assigning a new ISBN to a revised edition is a judgment call, what’s a derived ISTC and when to use it will be as well. Holdsworth’s paper has an excellent discussion of this. Again, it’s detail at that point – very important and being discussed – but in the broad scheme, simple.
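The source/derived relationship can be pictured as a simple chain of links. Here is a minimal sketch in Python, using placeholder strings rather than real ISTC numbers. The class and field names are my own invention for illustration and are not taken from the ISTC standard or from ONIX.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextWork:
    istc: str                          # placeholder strings, not real ISTCs
    title: str
    source_istc: Optional[str] = None  # set only for derived works
    derivation: Optional[str] = None   # e.g. "translation"


# A hypothetical registry: text X is the source, text Y is its translation.
works = {
    "ISTC-X": TextWork("ISTC-X", "Original text"),
    "ISTC-Y": TextWork(
        "ISTC-Y", "Translated text",
        source_istc="ISTC-X", derivation="translation",
    ),
}


def trace_to_source(istc):
    """Follow source links from a work up to the top of the chain."""
    chain = [istc]
    while works[chain[-1]].source_istc is not None:
        chain.append(works[chain[-1]].source_istc)
    return chain


print(trace_to_source("ISTC-Y"))  # ['ISTC-Y', 'ISTC-X']
```

Every format of the translation would carry ISTC-Y, and anyone holding that number could walk back to the original text.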

But consider: What if everyone simply started, tomorrow, to add this easily obtained number to their publishing process, for every “new” product, right at the start? And what if, when they sold rights, or assigned distribution to another publisher, or what have you, it was referenced in the contracts and appeared on the products and in the various metadata — all that hard-to-track stuff — so that the number was simply part of the deal? Well, wouldn’t it offer a huge improvement to everyone? Wouldn’t there be more clarity?

Don’t get bogged down thinking about backlist – yes, the implications there are huge and messy. I’m not denying that. Just consider how simple it would be for retailers, rights holders, academics and libraries to know which book is which going forward. It’s simple, clean and effective, and while all published products are intended to carry it, it’s really only for the ones that appear a lot in contracts (the ones you make money on) that its importance is fully realized. And then, it really is important.

Even if you’re not willing to accept it as part of your business model you should be prepared for other businesses to be using it. So expect to see contracts referencing it and think about adding it to your dataset as a reference.

The BISG Product Identifier committee will be discussing all of this in detail. What a wonderful opportunity to really get down and dirty in the bibliographic trenches!!

Data Exchange Tips #8: Up(date) Your ONIX, or: What code should the White Rabbit use when late for Tea, and will his invitation change in response?

Wednesday, February 10th, 2010 by tom richardson

If your company uses EDI you probably understand the need to maintain consistent and accurate ‘transactional’ records. But EDI is a limited number of fields, and the company comptroller probably has it all under his or her thumb and allocates resources to its maintenance because it makes money (or, more importantly, loses money when done badly). EDI doesn’t really require “updates” because of the question/response nature of the transaction flow. It’s all pretty simple unless you’re responsible for it, because an error means something is shipped or not, paid for or not. You and your trading partner notice quickly.

If only it were so for ONIX files! Just like EDI, an ONIX file is an electronic resource, used to support business to business communication. While it’s not used to communicate transactions, retailers use the information in it to sell and expect the information in the ONIX file to be correct and updated. The price on their on-line retail site is probably sourced from your ONIX file, as well as the publication date and availability status (not to mention the author and title!), so if you cancel a title and don’t update or simply stop sending the title’s ONIX record, the retailer may be trying to sell a product that will never be available.

Consumers seem to think that the on-line record should match the book they order, shipping departments think that the book weight should too, and the buyer thinks the carton quantity should be accurate.

Well, duh! Like you didn’t know that? And you probably know why it’s not quite right, too. I mean I know some links on our website are down — I really need to fix them and that’s not happening either!

There are not really clear guidelines on what constitutes a proper update routine, and the answer changes radically between a poetry publisher with a slow growth list and nothing OPed in 20 years, and a multinational whose books can come in and out of print in a season, but here are some guidelines:

Retailers rely on book metadata be it for the initial buy (6 months before publication), transactions (active titles), and customer relations (accurate titles and descriptions). Each on-line page describing a book is like a little contract with the consumer – maybe not written quite in stone but it should be treated that way.

ONIX suppliers, by supplying book metadata, should make a commitment to accuracy, and be willing and able to update it if it changes. This is separate from including enhanced data, which is just as necessary to maximize sales. I mean the basics: title, author, subject, imprint, publisher, status, pub date, supplier, availability and expected ship date should all be accurate and maintained, fully utilizing the relevant ONIX composites so that name and title are parsed out, the publisher and supplier names are consistent, etc. If it’s wrong it should be fixed and re-sent.

Frequency should be as often as it’s needed – for big companies that might be weekly “deltas” (changes only) with monthly full files; medium companies can probably do monthly files and small companies quarterly. But everyone should make an effort to realistically maintain and update their ONIX records and resend them regularly. A full file of all active records should be submitted to your supply chain trading partners at least once a year – and more often probably makes sense too. It’s not enough to issue a record once and expect 5 years later that retailers know it’s still active. It’s not enough to never clean your file of the books you no longer support either.
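The difference between a delta and a full file is easy to show in code. Here is a minimal sketch in Python, assuming records are kept as simple dictionaries keyed by identifier. The identifiers, fields and values are all invented for illustration, and real ONIX records carry far more than this.

```python
def build_delta(previous, current):
    """Return only the records that are new or changed since the last send."""
    return {
        isbn: record
        for isbn, record in current.items()
        if previous.get(isbn) != record
    }


# Hypothetical snapshots: what was sent last time and what is on file today.
last_sent = {
    "ISBN-1": {"title": "First Book", "price": "19.95", "status": "active"},
    "ISBN-2": {"title": "Second Book", "price": "24.95", "status": "active"},
}
today = {
    "ISBN-1": {"title": "First Book", "price": "19.95", "status": "active"},
    "ISBN-2": {"title": "Second Book", "price": "22.95", "status": "active"},
    "ISBN-3": {"title": "Third Book", "price": "14.95", "status": "active"},
}

delta = build_delta(last_sent, today)  # the weekly "changes only" file
full_file = today                      # the periodic full file

print(sorted(delta))  # ['ISBN-2', 'ISBN-3']
```

Note that a record that simply vanishes from the current snapshot never appears in the delta, which is one more reason to update a title's status rather than silently stop sending it.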

And yes, when a book is out-of-stock and reprinting you really should tell retailers when it’s due to be available again. If you really don’t know then fake it: maybe give a date 3 months away and keep updating it – it’s better than saying nothing until the reprint is ordered and restocking is 2 weeks away.

When books go inactive – Out-of-Print or No-Longer-Carried, what-have-you – you should also maintain records and release them appropriately (aren’t I cunningly vague about that?). There’s no need, once the supply chain knows their status, to continue to send them, but give everyone a chance to update their records before you drop them. And then maintain a file so that it’s available on request. Whoa! A lot of this information would be useful internally too!

It’s not easy to do and it takes thought to set up the internal communications – but it’s not necessarily that hard either. It’s a breeze in comparison to what agency pricing is going to be like. And how were you planning to communicate that? (New codes in ONIX should be in place by March by-the-way.) It’s only going to get faster.

The more astute of you might be thinking, I bet this harangue about the need to do the obvious well is a lead up to BNC BiblioShare. And you’re right!! There’s a webinar “Introduction to BiblioShare” on the 24th of this month… 2 to 3:
http://biblioshareintro.eventbrite.com/
and it’ll be available after the fact too.

BIC Releases Recommendations for eBook IDs

Monday, December 21st, 2009 by Morgan Cowie

Book Industry Communication (the UK equivalent of BNC or the BISG) has released their recommendations on dealing with ISBN assignment for eBooks.

In a nutshell:

Publishers should not assign ISBNs to non-product source or production files which are not being traded in the supply chain.

So what does that mean? It means that BIC’s position is that ISBNs are assigned at the format level, i.e. hardcover, trade paper, eBook. Different iterations of the same format file, as you get with proprietary treatments like retailer-specific DRM, should not be treated like different books as they are not being traded in the supply chain. The source file is the tradeable ancestor and thus the smallest unit that should be ISBN-stamped.

BIC suggests that for single-channel formats (the aforementioned DRM marked retailer files) or chapters/fragments, identifiers other than the ISBN would better serve the purposes of the publisher. In the former case, it would be an internal system and in the latter, digital object identifiers (DOI). This keeps the purpose of the ISBN intact while still allowing for practical sorting and filing.

Thoughts from the peanut gallery? This is a controversial issue which is far from settled. Do these recommendations address the reasons why ISBNs are being used to identify proprietary files? Are the suggestions for alternate practice practical in real life? How do these guidelines fit with what is actually happening? Let’s hear it!

Data Exchange Tips #7: Nic Boshart on Mac based XML solutions: Using oXygen

Friday, December 4th, 2009 by tom richardson

XML on Mac is a rare bird, or if not rare, seriously undercooked. There aren’t as many options as on a PC, and certainly no ONIX-based editor such as ONIXEdit. There are a couple of free XML editors available; however, Smultron is no longer being updated and Serna Free XML Editor is not available for commercial use, instead deferring to Serna Enterprise for businesses. Still, Serna Free is a good tool for getting used to XML.

OXygen is a WYSIWYG XML editor with a lot of powerful tools to do complex XML development, but using it to validate ONIX files is simple. It’s Java based, so it’s able to run on Windows, Linux or Mac, and it edits files up to 70 MB. You can use the large file viewer in the Tools menu to look at larger files, but you won’t be able to edit them due to the constraints of Java.

Caveats: there aren’t many. OXygen will do a basic or “DTD” validation on an ONIX file with the standard declaration. To do a strict or “schema” validation, you’ll need to follow the same procedure detailed in the post about XML Notepad: the normal ONIX declaration needs to be replaced with declaration information set up for pulling schema information from your files.

But that’s quite easy. Follow the exact same steps as you would setting up XML Notepad. Download the schema, name it well, and replace the declaration with the same as in the “Create a Schema Specific File” portion of the BNC blog post Data Exchange Tips #6: A DIY Guide to Schema Validation on a PC: XML Notepad 2007. This part is a bit easier in oXygen, as you do not have to replace the last line of the declaration with the local address of your XML schema; the program will do that for you.

Setting Up to Use Schemas Using OXygen

Once you’ve downloaded the XML schema, open up your file in oXygen and replace the declaration. The hard part is over (well, depending on the quality of the metadata, anyway).

  • Top Menu Bar: Click on Document
  • From the dropdown: Choose XML Document
  • From the second dropdown: Choose Associate Schema

You should have opened a dialog box with several tabs at the top. XML Schema should be the first tab, already selected. The empty bar below is labeled URL — don’t be fooled, you want to open a local file. Click the folder to the right and find your schema accordingly.

Now you’re ready to validate. The declaration should have been changed accordingly. If not, it will tell you.

OXygen is a good system: it offers a lot of useful tools, including track changes to keep a record of who did what to which file. It’s also useful as an ePub editor as you can open the full file without extracting it. The blog Instant InDesign has a good article on this:
http://instantindesign.com/index.php?view=412

Nic Boshart is Research and Communications Coordinator at the ACP and one of the organizers of next week’s The Canadian Publisher’s Digital Workshop on December 9 – 10, 2009.


Data Exchange Tips #6: A DIY Guide to Schema Validation on a PC: XML Notepad 2007

Thursday, November 26th, 2009 by tom richardson

For the purist, those who want their XML validation without the added benefits of what some programmer thinks would improve their ONIX file, there is a lovely generic XML software product called XML Notepad 2007. It’s free and written by a Microsoft programmer, Chris Lovett, so the freeware is from a safe source. It’s easy to set up for a schema validation and robust with files as large as 20,000 records. The only problem is that you’ll need to use a file with its XML declaration information set up for a schema validation rather than using the normal ONIX declaration. That just means replacing the first few lines of the ONIX file with a different script, a simple cut and paste that only takes a few seconds. Just be sure to use the correct ONIX declaration on the file you send to trading partners.

The software requires that you have .NET Framework v2.0 or above installed (you’ll likely have it already on your computer, but it’s another Microsoft product) and you can download XML Notepad here:
http://www.codeplex.com/xmlnotepad
Just follow the links to the installer.

Getting the ONIX Schema

You’ll need the ONIX Schema on your computer, which is available from www.editeur.org:

  • Navigate through Standards to ONIX for Books, Previous releases (not ONIX 3.0)
  • Scroll down to “Download Release 2.1 XML Schema” and click on it.
  • Click on the “Release 2.1 (revision XX) XML Schema” and save the zip to your hard drive.

Unpacking the zip will give you a directory with 7 files in it: 6 xsd ‘schema’ files and a ‘read me’. You’ll need to put a location reference to these files into your schema declaration, so make it easy on yourself and store this on your computer in an easily named location. Avoid spaces in your directory names (spaces can confuse the XML software’s ability to find the file). As an example, if you were to create a directory in your top level C drive named XML with a subdirectory XSD and put the contents of the zip there, then naming the local file reference would not only be easy but have a long tradition behind it:
C:\XML\XSD\
However you choose to name the location, put the 7 files into it.

The schema file ONIX_BookProduct_CodeLists.xsd includes the ONIX codes, so every time the code list is updated, this file needs to be updated as well. New code lists are announced and listed by BookNet, but if you add a new code to your ONIX and your file fails your validation process, an outdated XSD file might be the problem.

Create a Schema Specific File

The last hurdle is creating a schema specific file with your ONIX in it. The ONIX file you send to your trading partners has to have the “declaration” — the first lines before the Header tags — as defined by the ONIX for Books XML Message Specification. In order to schema validate using XML Notepad (and other XML software) you’ll need to replace that declaration with another one.

While you can modify your ONIX file with a new declaration, what I find easier is to create a file just for schema validation and then to copy the ONIX data section into it. Create a file, say “schema.xml,” with either the Reference or Short tag declarations as below. Then copy and paste the ONIX file in, using everything from the tag <Header> (Reference) or <header> (Short) to the bottom of the ONIX file (including the closing </ONIXMessage> or </ONIXmessage> tag).

Or alternatively, you can copy and paste a declaration from below into your ONIX file. You just have to be sure to replace it with the correct ONIX Message declaration before you send the file to your trading partners.

But one way or another, everything from the first line <?xml version… through to <ONIXMessage> (for reference tag files) or <ONIXmessage> (for short tag files), such as in this example:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ONIXMessage SYSTEM "http://www.editeur.org/onix/2.1/02/reference/onix-international.dtd">
<ONIXMessage>

needs to be replaced with the following:

For ONIX files using Reference Tags:

<?xml version="1.0" encoding="utf-8"?>
<ONIXMessage refname="ONIXMessage" shortname="ONIXmessage" release="2.1"
xmlns="http://www.editeur.org/onix/2.1/reference"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.editeur.org/onix/2.1/reference
C:\XML\XSD\ONIX_BookProduct_Release2.1_reference.xsd">

For ONIX files using Short Tags:

<?xml version="1.0" encoding="utf-8"?>
<ONIXmessage refname="ONIXMessage" shortname="ONIXmessage" release="2.1"
xmlns="http://www.editeur.org/onix/2.1/short"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.editeur.org/onix/2.1/short
C:\XML\XSD\ONIX_BookProduct_Release2.1_short.xsd">

Note that the last line in the declaration contains the “local address” on your computer and one of the Schema files you downloaded from Editeur.org. So unless you’ve used the directory naming suggested above, you’ll need to modify the above script to use the local computer address you used. Note also that if there are spaces in the directory name (and possibly with long directory names), the address may not be read properly by the XML software. This is one of the most likely things to go wrong, so keeping it simple and direct makes your life easier.

That’s it. Save the file (always save the file as the software works with the last saved version and won’t include unsaved changes) with your data and the new declarations.
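If you’d rather not cut and paste by hand every time, the swap can be scripted. Here is a minimal Python sketch, assuming a reference-tag file with a DTD-style declaration like the example above and the C:\XML\XSD\ location suggested earlier. The names SCHEMA_ROOT and use_schema_declaration are made up for illustration, and the sketch leaves the first <?xml …?> line (and so your encoding) untouched.

```python
import re

# Schema-style root tag for reference-tag files, copied from the post.
# Assumption: the XSD files live in C:\XML\XSD\ ; change the last line if not.
SCHEMA_ROOT = (
    '<ONIXMessage refname="ONIXMessage" shortname="ONIXmessage" release="2.1"\n'
    'xmlns="http://www.editeur.org/onix/2.1/reference"\n'
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n'
    'xsi:schemaLocation="http://www.editeur.org/onix/2.1/reference\n'
    'C:\\XML\\XSD\\ONIX_BookProduct_Release2.1_reference.xsd">'
)

# An optional DOCTYPE line followed by the opening <ONIXMessage ...> tag.
DECLARATION = re.compile(r"(?:<!DOCTYPE\b[^>]*>\s*)?<ONIXMessage\b[^>]*>")


def use_schema_declaration(onix_text):
    """Swap the DOCTYPE and opening root tag for the schema-style root tag."""
    # A function is used as the replacement so the backslashes in the
    # Windows path are inserted literally.
    new_text, count = DECLARATION.subn(lambda m: SCHEMA_ROOT, onix_text, count=1)
    if count == 0:
        raise ValueError("no opening <ONIXMessage> tag found")
    return new_text
```

Run it on a copy of your file and save the result under a new name, so the version with the correct ONIX declaration is still the one you send to trading partners.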

Setting up to use Schemas using XML Notepad

Start up XML Notepad 2007. You’ll need to let the software know where to find the schema by creating an internal link to the XSD schema files from Editeur. So:

  • Top menu bar: Click on View
  • From the dropdown: Click on Schema
  • From the dialog box: Click on the box with the ellipsis (three dots) at the far right

This opens a standard Windows file finding user interface — use it to navigate to where the schemas are stored and select:

For Reference tags: ONIX_BookProduct_Release2.1_reference.xsd

For Short tags: ONIX_BookProduct_Release2.1_short.xsd

(You can do this step twice and put both reference and short links here. The program will work fine if you do, but it’s only necessary if you use both short and reference tag files.)

Note: Whenever you first start XML Notepad you’ll need to let it know where the local schemas are — it’s just a matter of opening the Schema dialog and clicking “OK.”

Your First Validation using XML Notepad

Using XML Notepad go to its top menu bar: Click on File / Open and from the dropdown choose the file you created using the schema declaration and your ONIX data in it and open it.

Assuming it opens (if not, skip back to previous blog posts on File Cleaning), in the upper left, you’ll find a tree that corresponds to the file’s XML structure — the composite and tag names are displayed.

In the upper right is the data within those tags.

And on the bottom half — the Error List which may (or may not) be displaying errors.

First: Did you open the Schema dialogue and click OK? Are you sure? XML Notepad will display errors until you confirm the schema that it should use. So, just to be sure go to the top menu bar: Click on View, Choose Schema, (and assuming the ONIX schemas are there as described in the previous section) Click on OK.

In the error section double click on one of the errors. If you have a large file it may take a while to respond, but it will take you directly to the error identified. If you have no idea why it’s a problem (and why would you?) I’ll try to put some advice together in future blogs, but the problem will either be a violation of the XML Standard or a violation of the ONIX Standard.
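To see what the “violation of the XML Standard” half looks like outside of XML Notepad, here is a small Python sketch using the standard library’s ElementTree parser. It only checks that the file is well-formed XML; it knows nothing about the ONIX schema, so a file can pass this and still fail a schema validation. The function name check_well_formed is my own.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_data):
    """Return None if the data parses as XML, else (line, column, message)."""
    try:
        ET.fromstring(xml_data)
    except ET.ParseError as err:
        # err.position is a (line, column) pair pointing at the problem.
        line, column = err.position
        return line, column, str(err)
    return None
```

Like the Error List, it hands back a line and position you can jump to in a text editor.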

There’s a webinar available on this:
http://connectpro95248216.acrobat.com/p33362696/
It’s a little rough, as we’re just learning the software (and truth be told, I’m not at Michael Tamblyn’s level of presentation), but if you find this sort of webinar helpful let us know and we’ll try to do more.

For your convenience, here are two files set up for Schema validation. Limitations on our website prevent me from using files with a .xml extension, so you’ll have to change the file from .txt to .xml. There are two, one for Reference Tags and one for Short Tags: if your tags are largely in English then it’s Reference, and if the tags are largely a character plus 3-digit codes then it’s Short:

Reference Tag Schema

Short Tag Schema
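If you’re not sure which kind of file you have, the rule of thumb above (mostly English words versus mostly a character plus 3 digits) is easy to turn into a rough check. This Python sketch is only a guess based on counting opening tags; the function name and the “more than half” cut-off are my own assumptions, not anything from the ONIX documentation.

```python
import re

OPENING_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)[\s>]")
SHORT_CODE = re.compile(r"[a-z]\d{3}")  # one letter plus three digits


def guess_tag_style(onix_text):
    """Guess 'short' or 'reference' from the tag names in the file."""
    names = OPENING_TAG.findall(onix_text)
    if not names:
        raise ValueError("no opening tags found")
    coded = sum(1 for name in names if SHORT_CODE.fullmatch(name))
    # Assumption: a short-tag file has more coded tags than named ones.
    return "short" if coded * 2 > len(names) else "reference"
```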

Data Exchange Tips #5: Some Basics: Tools before validation

Monday, November 16th, 2009 by tom richardson

An XML file is simply text — nothing very special — except that in order for XML software to read and interpret it, everything needs to be just so. XML is, loosely, a computer formatting language, and as such is a low type of computer code — if not quite as finicky as a proper programming language it has much stricter rules than HTML.

Every part of the structure of the file, and aspects of the contents, must match two defining documents: An ONIX file is validated by using XML software to compare your file against the rules of the XML standard (www.W3.org) and the schema written by the ONIX developers at Editeur (www.editeur.org). So an ONIX validation is both something that applies to all XML documents and is specific to the ONIX data exchange standard — and validation errors might be from either. You shouldn’t confuse the XML validation process with the Certification report generated by BiblioShare. Every file accepted into BiblioShare, after it passes the XML validation discussed here, gets a quality assessment that looks for data issues. This is a distinct and separate process from XML validation.

You probably should research and try to understand as much about XML as time, energy and inclination allow you. You’ll be happier producing ONIX if you do, and possibly more comfortable using Epub too. There are good resources on the web; Wikipedia and www.w3schools.com/XML/ are recommended as a start.

What gets validated?

The ONIX file, the file you send to BiblioShare and your other trading partners, is what gets validated. In solving validation problems you might make corrections to your original dataset and re-output the ONIX file, or you might just manually modify the file itself, but it’s the file, whatever.xml, that we’re working with here. Validation is always the last step: before sending XML files to anyone they should be checked.

Taking Stock

First off: Do you have any XML software — an XML editor or development suite like XML Spy or oXygen? If you’ve inherited this job, look at your program list, ask! It can’t hurt and you may as well use what you’ve got or paid for. I will be recommending some specific software and one is free, but there’s nothing special about it. You should consider getting more than one validation tool (you can never have enough validation).

You can find software through a web search on “XML editor” or look at the “XML Resources” at O’Reilly’s www.XML.com.

Text Editor

As noted above, an XML file is just text and XML files can be opened in a text editor. If you’ve got an XML editor you might use that as it’s designed for the work, but you absolutely can view and edit an ONIX file in a text editor. Your only concern is ensuring the editor does not change the file. For example you can use MS-Word to open an xml file, but don’t do it!! Word is set up to “help” you run XSLT transformation scripts and will make any number of assumptions and changes to the file content, none of which will be good for our purpose of using the XML standard to exchange data. (This warning about software changing files applies to a lot of XML software. Until you’re sure it doesn’t, assume any software might be making changes to an XML file.)

What you want to use is as simple a text editor as possible: on a PC, Notepad or WordPad; on a Mac, TextEdit or SimpleText. Make gentle changes to the text you can see and save it without rendering the document unreadable to XML software. Really, it’s just a matter of using the keyboard or cutting and pasting text, and exiting using the most straightforward options. If forced to choose format options on saving, first try the one labeled “ASCII US” or “ASCII text”.

There are a number of text editors available designed to be used by programmers. They tend to have better “Go To Line” features, are usually tag sensitive (you’ll understand that when you see it; very handy in XML), and they don’t muck with the code. I’m fond of Notepad2, http://www.flos-freeware.ch/notepad2.html, but Notepad++ http://notepad-plus.sourceforge.net/uk/site.htm might be worth checking out.

As always all work should be done on a copy of your ONIX file — experiment but don’t trash your work.

PC vs MAC

Macs are better for a lot of things but you have more options (and more free options) for XML software on a PC. If your Mac has a Windows emulation or operating system boot area any PC solution should work. Mac solutions are typically Java based, and there’s nothing wrong with that (PC software usually relies on the .NET Framework), but they are more likely to have fees associated with them.

I would really appreciate feedback from Mac users as I’m not very familiar with what’s out there. oXygen seems to be the clear favorite but I’m sure there’s some good freeware too.

File Size

XML software is typically processor intensive and requires a lot of RAM. Some software fails at large file sizes, and almost all of it will be more difficult to work with when handling large files. You’ll find it faster and easier to understand if you do this on a smaller file (below 1000 records, and below 100 records would be even better), at least while doing your first validations. When you’re familiar with the software and its responses try using larger sizes; most XML software has an upper limit at which it’s unresponsive. How would you know if you haven’t done it successfully?

How do you cut a file down to size? Use a text editor, open the ONIX file and remove individual product records by starting with the <Product> tag (or <product> for short tags) and including the corresponding </Product> tag (or </product>). So long as you remove whole product records (<Product> to </Product>) and leave the other tags alone you can take out as many as you want.
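For a file with thousands of records, deleting them by hand gets old fast. The trimming can be sketched in Python as below, assuming whole product records run from an opening Product tag (product for short tags) to its matching closing tag and are never nested inside each other. The function name keep_first_records is made up for illustration.

```python
import re

# One whole product record, reference or short tags, plus trailing whitespace.
# The \b stops <ProductIdentifier> and friends from being mistaken for <Product>.
PRODUCT_RECORD = re.compile(r"<(Product|product)\b[^>]*>.*?</\1>\s*", re.DOTALL)


def keep_first_records(onix_text, limit):
    """Keep the first `limit` product records and drop the rest, leaving other tags alone."""
    seen = 0

    def keep_or_drop(match):
        nonlocal seen
        seen += 1
        return match.group(0) if seen <= limit else ""

    return PRODUCT_RECORD.sub(keep_or_drop, onix_text)
```

As always, do this on a copy: read the file in, call keep_first_records(text, 100), and save the result under a new name.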

Internet Access

XML software usually needs internet access to work — do this on a computer hooked up to it.

The ONIX Documentation

It’s big, it’s dull and you need it on your computer: www.editeur.org ONIX / ONIX for Books / Previous releases / Release 2.1 Downloads / Download Release 2.1 format specifications. You’ll need to get the current release so I’ve not provided a direct link. Having a copy of the Product Manual and the Message Specifications is invaluable. The PDF is linked to the code lists and it’s the easiest way to look up something.

Data Exchange Tips #4: Escaping Entities: Påvøl breeches Checkpoint Charlie

Tuesday, November 10th, 2009 by tom richardson

Most publishers have more than 100 books on their list, and a few of them will be by non-European authors or reference some atypical symbol. And wasn’t the production manager proud when the cover copy spelt it right?

ONIX is a bibliographic data exchange standard, and it behooves us who toil in publishing to spell the author’s name, book title or their review journal correctly. Really foreign scripts will (probably) allow use of some transliteration system, so it’s unlikely you’ll need to use Chinese, Arabic or Cyrillic scripts. I have arbitrarily decreed this to be beyond my scope (or knowledge) — but “similar” alphabets like Norwegian or Romanized Slavic are not part of iso-8859-1. And even for the Western European languages it does cover (French, Spanish and German) there are missing letters. There are common symbols missing like trademark and copyright… At some point you’ll need to put something in an ONIX file that’s outside of the common encoding schemes, and for that you’ll use an escaped entity.

I’m going to make one of my daring generalizations here to help you recognize an entity: It starts with an ampersand, “&”, has simple keyboard characters in between and ends with a semicolon, “;”. It’s recognized by XML software and rendered as characters by browsers. Here’s a link to one of my favorite sites with lists of entities:
http://htmlhelp.com/reference/html40/entities/

For example an e with an acute accent — é — can be “escaped” as &eacute; or &#233; or &#xE9; — and further, an entity is special in XML because the ampersand should not be itself escaped. You should never see a “double escaped” entity like &amp;#233; in an ONIX file.
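Double-escaped entities are easy to hunt for because they always start with &amp; followed straight away by what looks like the rest of an entity. A rough Python sketch, with a made-up function name, that lists the suspects so you can look at each one:

```python
import re

# "&amp;" followed immediately by the tail of a decimal, hex or named entity.
DOUBLE_ESCAPED = re.compile(r"&amp;(#\d+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def find_double_escaped(text):
    """Return every string that looks like a double-escaped entity."""
    return [match.group(0) for match in DOUBLE_ESCAPED.finditer(text)]
```

It’s a pattern match and not a judgment, so check the hits by eye before running a find and replace on them.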

A file encoded as utf-8 has everything that can’t be expressed as a simple keyboard character escaped, while iso-8859-1 can have characters like é, ç, à, è, ô, ö, û, ñ, etc. but not characters like ů, ũ, or š, etc. Neither of these encodings can accept “smart” characters, em or en dashes, etc., although there are escaped versions of these. Where is the dividing line? When the software complains. Just like in the previous post on file cleaning, the solution is simple substitution by find and replace. What? You haven’t kept track of how your characters are stored in your source file? Oh dear… understanding what’s in your source file is the first step. But basically: the XML software complains and you fix a problem, just like the encoding problems.

There are considerations: First, while a basic validation in most XML software accepts all entity types, XML schema languages are stricter (I don’t know why), and they don’t accept the ‘html’ type like &eacute;. So using html style entities will cause you a problem with BookNet Canada’s BiblioShare and ONIX 3.0. Schema validation is the future, so the prudent administrator should avoid html entities. The ‘decimal’ style &#233; is the most common one supported by schema languages, and the one I recommend. I seldom see files using the ‘hex’ escaped entities like &#xE9; so I suggest not using it but I can’t defend that prejudice. Ideally you should use one system, consistently, in your file, and if you need to change it at some later point it won’t be hard.
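If your file is already full of the ‘html’ style, switching to the ‘decimal’ style is a mechanical job. Here’s a minimal Python sketch that leans on the standard library’s table of HTML entity names (html.entities.name2codepoint). The function name is mine, and I’ve assumed you want to leave alone the five entities that XML itself defines (&amp;, &lt;, &gt;, &quot;, &apos;) and anything the table doesn’t recognize.

```python
import re
from html.entities import name2codepoint

XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_decimal(text):
    """Rewrite html-style entities such as &eacute; as decimal ones such as &#233;."""
    def convert(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave it exactly as it was
        return "&#%d;" % name2codepoint[name]

    return NAMED_ENTITY.sub(convert, text)
```

So named_to_decimal("H&eacute;l&egrave;nne") gives back "H&#233;l&#232;nne", which is the form I recommend above.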

A second consideration is the on-line companies that get your data. In the descriptions and biographies you can certainly spell things correctly as the entities will appear correctly in browsers, but what about “searchable” fields like Contributor and Title? What happens to your author Hélènne Ővēn if you submit her name ‘correctly’ (according to whom? me?) in encoding="iso-8859-1" and escaped as “Hélènne &#336;v&#275;n” or in utf-8 as “H&#233;l&#232;nne &#336;v&#275;n”? It may render properly in a browser, but will it affect how easily you can search for her name in Amazon or Indigo? Can a consumer search the obvious keyboard bastardization of “Oven” and find the book? It’s a problem, that’s about all I can tell you. The on-lines are way better about this than they were a few years ago when anything outside of simple keyboard characters wasn’t acceptable in a searchable field, but there are no guidelines here. I’d say that “Hélènne” with its pretty normal “special characters” within iso-8859-1 wouldn’t be much of a stretch, but “Ővēn” is likely to cause problems. It might matter whether you submitted “Hélènne”, “H&eacute;l&egrave;nne” or “H&#233;l&#232;nne”. They are all different, clearly, and programmers at every aggregator or on-line would have to set up to process all of these to index. Did they? Will your own website? Oh dear!
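The two escaped spellings of the author’s name above can be produced mechanically. In Python, encoding text with the "xmlcharrefreplace" error handler turns anything the target encoding can’t hold into a decimal entity. This is only a sketch with a made-up function name; since these posts treat a utf-8 file as plain keyboard characters, I’ve used Python’s "ascii" codec to reproduce that second form.

```python
def escape_for(text, encoding):
    """Escape every character the encoding can't hold as a decimal entity."""
    return text.encode(encoding, "xmlcharrefreplace").decode(encoding)
```

With the name from the example, escape_for(name, "iso-8859-1") keeps é and è and escapes only the Ő and ē, while escape_for(name, "ascii") escapes all four.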

This is, sort of, what it means when BISG says iso-8859-1 is the recommended encoding for the US supply chain: Aggregators should accommodate at least the special characters in it. And maybe they do more, maybe they do less, but it’s reasonable to think they’ll do that much. And when I say Canada hasn’t made a recommendation it means, well, we haven’t gone that far.

If you really have a lot of special characters that are critical and you don’t yet know what you and your trading partners are doing, well, that’s beyond the scope of this blog. I’m trying here for practical help to largely English language ONIX producers. But mostly I want to say: Take advantage of ONIX!! You can, and should, update your records. So spell the name right, submit your data as early as you can and then check the on-line records. If it doesn’t look right or the searches fail, then ask them about it or judiciously misspell the name to compensate and re-submit your data. You should have 6 months before publication to work it out. Maybe Amazon and Indigo will be OK but it’ll be wrong on Barnes & Noble. Maybe it’s only Walmart who can’t get it right. And maybe Walmart is the only one that matters to you. It’s your call but the author will probably understand why you made your choice. Try again in 2 years and the answers will have changed.

And that advice should make anyone who cares even a little about the accuracy of their records cringe.

Data Exchange Tips #3: File cleaning, not just for your nails.

Tuesday, November 3rd, 2009 by tom richardson

In later posts I’ll look at and recommend XML software (if anyone has favorite software — particularly for Macs as I don’t use them — let me know), but for this I’m assuming you have some and you’ve loaded into it an ONIX file that uses one of the two most common XML encoding declarations:
<?xml version="1.0" encoding="utf-8"?> (the file contains only standard keyboard characters)
or
<?xml version="1.0" encoding="iso-8859-1"?> (the file contains only standard keyboard characters plus basic French, Spanish, or some German accented characters)
and this being XML, the software is giving back some sort of statement saying that on Line X, position Y there’s an unrecognized character, or has possibly shown some sort of box listing 5 or 6 gibberish values that it’s converted to an underscore. Or maybe the software just craps out and won’t load.

This is what XML software does when it looks at your file and finds something in it that doesn’t match the encoding declaration, and this is what will happen when the file is loaded at Bowker, Indigo or Amazon. The aggregators are probably fixing minor problems because it’s faster to do that than complain, but if there are a lot of problems your file may well get shuffled to one side and never loaded. So you can rely on the kindness of retail to fix and maybe load your file, or you can do what you can to make sure that the file loads properly. If you make the effort I can assure you retailers will know and will be much more likely to contact you if they have problems.

The first (but not only) step in file cleaning is finding and correcting encoding issues because they usually prevent XML software from working. Because not all XML software is the same it helps to use more than one piece of software when trying to clean files. File cleaning is pretty simple conceptually, and simple in practice too. An XML file is just a text file, the simplest type of computer output possible. XML software needs all the characters in the file to be recognized in order to work, so to fix problems the easiest thing to do is open the ONIX file in a simple text editor (Notepad, WordPad, SimpleText, etc.) or, if it’s available, in the “text” view of your XML software.
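You can get the same “Line X, position Y” complaint out of a few lines of Python, which is handy as a second opinion. This is a sketch under a couple of assumptions of my own: it reads the encoding from the first line of the file, and for iso-8859-1 it also flags bytes in the 0x80 to 0x9F range, which are control codes in that encoding and in practice usually mean Windows “smart” characters got in. The function name is made up.

```python
import re

DECLARED = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def first_encoding_problem(data):
    """data is the raw bytes of the file. Return None or (line, position, byte)."""
    match = DECLARED.match(data)
    encoding = match.group(1).decode("ascii").lower() if match else "utf-8"
    offset = None
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        offset = err.start
    if offset is None and encoding in ("iso-8859-1", "latin-1"):
        # Every byte decodes as iso-8859-1, so look for the suspicious range.
        hit = re.search(rb"[\x80-\x9f]", data)
        offset = hit.start() if hit else None
    if offset is None:
        return None
    line = data.count(b"\n", 0, offset) + 1
    position = offset - data.rfind(b"\n", 0, offset)
    return line, position, data[offset:offset + 1]
```

Open the file in binary mode (open(path, "rb").read()) so Python doesn’t try to interpret the characters before the check does.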

What you don’t want to do is open the file in something like MS-Word that will recognize it as an XML document and start modifying it based on what it thinks you’re doing. ONIX is a data exchange standard and Word will think you’re trying to do XSLT transformations.

Use the “Go to line” function (ctrl G) to go to the line specified and look around (if the Go to Line function isn’t available, I’ll have some suggested text editors in the software discussion). You’ll probably see some glitchy text, a “smart” character, or possibly an accented character. If it’s the latter and your encoding is UTF-8, change the declaration to iso-8859-1 and try loading it into the software again. The game you’re playing is matching the characters in your file to what the XML software expects, so changing the encoding statement to the appropriate one is allowed (but no aggregator accepts every possible encoding and the two recommended here are the most common). The next blog post on “Escaping Entities” will deal with leaving your encoding as UTF-8 or using special characters outside of iso-8859-1.

But let’s say it’s a glitchy character (incoherent text strings or symbols) or possibly a “smart” character: curly apostrophes, special dashes and the like that are pretty and work in their source software but are not part of the encoding. The first test is whether you can copy and paste them into the “Find” dialog box. If you can’t, then whatever they are they’re so not-text that the text editor is not willing to work with them (Bones might say: “It’s a letter Jim, but not as we know it.”). At a guess they are hex (witchcraft?) characters and you may be forced to clean such issues one at a time. I’ve never seen a file with a lot of this problem, but cleaning them in the source (what your ONIX was created from) is the way to go.

The second test is: Does the character you’re searching for recur consistently in the file, and in each case does it represent the same thing? If not, you’re again looking at manual cleaning. There’s no easy way to do this, but if there’s too much to fix manually, maybe you need to go to the source of the character and do some tests there. This is why encoding is so important: it’s so fundamental to the file that everything hinges on it. You may need to change how you create your documents in order to prevent problems. But the point is that XML software won’t care about anything more than the XML file in question. What came before doesn’t matter to it, so only send files that match the encoding statement.

The most likely outcome when you copy and paste the problem into “Find” (ctrl F) is that it recurs numerous times in your file and that it’s consistently the same problem. If it’s a big file try to test at several spots in the file because it’s just possible that data loaded at a different time will be different.

This is a copy of your ONIX file, right? So no harm in experimenting — use find and replace to transform the glitch to what it should be — the simplest possible keyboard character or an escaped entity (next blog post). Make a copy of the two values for future reference. You’ll possibly find that there are hundreds or thousands of instances of the problem in your file. Save the file and go back to the XML software and attempt to load it again (remember, the software will load the file’s last saved state so be sure to save your work). You’ll probably get another problem. Repeat the process.
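Once you’ve collected your list of “bad value, good value” pairs, the repeated find and replace can be done in one pass. A minimal Python sketch follows; the six pairs in it are just the usual suspects and are my assumption, so swap in whatever your own file turns up. The names SMART_TO_PLAIN and replace_smart_characters are made up.

```python
# Bad character -> replacement. Quotes become plain keyboard characters;
# dashes become decimal escaped entities.
SMART_TO_PLAIN = {
    "\u2018": "'",        # left curly single quote
    "\u2019": "'",        # right curly single quote / curly apostrophe
    "\u201c": '"',        # left curly double quote
    "\u201d": '"',        # right curly double quote
    "\u2013": "&#8211;",  # en dash
    "\u2014": "&#8212;",  # em dash
}


def replace_smart_characters(text):
    """Return the cleaned text and a count of each character replaced."""
    counts = {}
    for bad, good in SMART_TO_PLAIN.items():
        found = text.count(bad)
        if found:
            counts[bad] = found
            text = text.replace(bad, good)
    return text, counts
```

The counts are worth keeping: they tell you how widespread each problem is, which helps when deciding whether to fix the source.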

While this may seem futile there are probably only a limited number of such problems in your file. Five to ten types are normal: smart quotes and apostrophes (several types) and dashes. Depending on the encoding and sensitivity of the XML software you may also find accented characters, trademarks and other special characters similarly noted. My next post on Escaping Entities will be a fuller discussion of these.

You’re going to need to make a decision at this point. If your ONIX file is generated from a database or other source, you can go back to that source and fix the characters there so the problem won’t exist in any future ONIX output. Or you can just fix the problem in the ONIX file and do this as a step every time you create and send it. Which makes the most sense probably depends on the number of problems and how easily you can change the source file. Some database software allows you to do find and replace on multiple entries while other content management systems (CMS) only allow you to open up individual records.

It’s common for publishers to clean each ONIX file prior to sending, and the whole point of storing information as XML in text files is to make it easily transformed, but clearly having the source clean is preferable. And given the source probably supports other uses like your website and catalogues, it likely is worth the effort. The very best choice is to ensure material added to your source is clean; if over time you also clean up problems in the existing records, eventually the whole source will be. If your system only lets you edit one record at a time and there’s no way to convert every example of a bad character across records, explaining what you need might be enough to get the developer to help you clean the contents as a one-time project.

I do this with trepidation as such documents may lead you astray, but here are a couple of documents with the most common “smart” characters I see, with alternatives and a spreadsheet that lists the most common escaped characters. The problem is that because these documents list characters that are encoding problems they may well not render the same way on your computer as mine — you may be better off creating such lists in the environment that you work in. So with that caveat, that what you see may not be what I see, here are a couple of files that might be helpful:

sample_bad_characters.doc
special_character_list.xls

Data Exchange Tips #2: So what(’s) encoding, anyway?

Tuesday, October 27th, 2009 by tom richardson

In the first tip, I tried to establish why you, the ONIX file sender, have to test your file, and that’s simply to ensure that the file’s content (all the characters) would be recognized by the aggregator’s software. The “encoding” declaration in the first line of the file tells the recipient what to expect, and your job is to ensure that the file matches that.

If you’re trading files in English-speaking North America you’ve got a choice of three encodings that will almost certainly be considered acceptable by aggregators. (There are lots of others, but my assumption is that you’re trading files largely in English, with some French and/or Spanish thrown in.)

The default encoding in ONIX is UTF-8. It’s the most commonly used in English North America for XML and the most supported by XML software. For the English-language keyboard characters it’s identical to what was called ASCII (but not extended ASCII). Any text document in English will almost certainly be largely in UTF-8 encoding without any work on your part.

The other common encoding is ISO-8859-1, what might be called ‘extended ASCII’ or Latin-1. It supports the common accented characters in French, Spanish and German. BISG has identified this as the preferred encoding for the US supply chain. We in Canada are more demure and think it slightly impolite to discuss, but are OK with it too.

And then there is “windows-1252.” This is what, in desperation, your trading partners will use when they hope you’re on the Windows operating system and your file is screwing up when they load it. It’s the Windows version of ISO-8859-1. I think. I don’t really know… Who could possibly care about this!!!!
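To make the difference between these three a little more concrete, the following Python sketch shows the bytes each encoding produces for the same character. The helper name encode_hex is an assumption made for illustration; the encoding names are the ones Python’s standard codecs accept.

```python
def encode_hex(char: str, encoding: str) -> str:
    """Return the bytes for char in the given encoding as hex, if it exists there."""
    try:
        return char.encode(encoding).hex()
    except UnicodeEncodeError:
        return "not available"


if __name__ == "__main__":
    for char in ("P", "\u00e9", "\u2019"):  # P, e-acute, curly apostrophe
        for encoding in ("utf-8", "iso-8859-1", "windows-1252"):
            print(repr(char), encoding, encode_hex(char, encoding))
```

A plain “P” is the single byte 50 (hex) in all three. An e-acute is one byte (e9) in ISO-8859-1 and windows-1252 but two bytes (c3a9) in UTF-8. The curly apostrophe is byte 92 in windows-1252, three bytes in UTF-8, and has no place at all in ISO-8859-1, which is one reason a file full of “smart” punctuation may load under windows-1252 when it won’t under the other two.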

Here’s the dummy version: When you hit a computer key some code is generated and interpreted and appears on your screen. There are conventions and standards that control all this, and when you bought your computer, if the salesperson was awfully knowledgeable, they might have been able to tell you what conventions your computer follows. If you’re on a PC with a number pad try this: Hold the ALT key down and type 80 on the number pad. If you did that you made a big pee, and I’m really, really pleased with myself for getting you to do it. My only point is that there really isn’t a way to know what your computer is doing, except that:

  • If you bought your computer in English speaking North America;
  • and no one said it wasn’t a standard keyboard;
  • and you’ve not really thought much about it;

then what happens when you make simple keystrokes is almost certainly UTF-8 (unless some piece of software is screwing with what you type). Can you cut and paste into a text document or email and it (usually) doesn’t turn to gibberish? Then it’s more or less UTF-8.

XML software doesn’t care. It’s up to you to tell it what your characters are, and as a start assume that you’re typing largely in UTF-8. You don’t really have a choice. But here’s a quick fix to try if you’re testing your ONIX and it’s not loading properly because of unrecognized characters: change the encoding declaration to encoding="iso-8859-1" and hope. It may be all that you need, but more likely you’ll still have a small number of unrecognized types of characters in your file.
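Before changing the declaration and hoping, it can help to see exactly which bytes are the problem. The Python sketch below lists the position and value of every byte that will not decode under a given encoding. The function name find_undecodable is hypothetical, and the approach is deliberately simple rather than fast.

```python
def find_undecodable(data: bytes, encoding: str = "utf-8") -> list[tuple[int, int]]:
    """Return (offset, byte value) pairs for bytes that fail to decode."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode(encoding)
            break  # the rest of the data decoded cleanly
        except UnicodeDecodeError as err:
            bad = pos + err.start  # offset of the offending byte in the whole file
            problems.append((bad, data[bad]))
            pos = bad + 1  # skip it and keep looking
    return problems
```

One thing this makes visible: under iso-8859-1 every byte decodes to something, so the list is always empty. That is why switching the declaration often gets a file to load, and also why it can leave you with characters that load but are not what you meant.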

To summarize: You must test all XML files before sending them, and the initial point of testing XML files is to ensure that the contents are recognized and defined. There are some secondary data quality and validation issues that will come up when the actual ONIX standard is discussed, but the first step is always a coherent recognized file acceptable to XML software.

The next post is some practical tips on cleaning files, and the one after that is on what to do with special characters outside of your encoding statement, so don’t worry about your weekly excitement just yet.

Data Exchange Tips #1: Why XML?

Wednesday, October 21st, 2009 by tom richardson


I’m going to do a series of blog posts on some of the very basic issues in file trading: what needs to be done before you submit an ONIX file (or an e-book, if your e-book is in XML). In doing this I’m hoping that publishers will comment about software they like (and don’t), problems they have, and with any luck their successes.

So, for the first post: Why XML?

Any discussion about file exchange has to start with why XML works, which is because of its underlying assumptions and the software that supports them. The main assumption is that all the characters, line returns, visible and hidden content — all of it — are recognized in every file. XML software tests for this and it’s so important that information about it normally appears in the first line of an XML file as an encoding statement, right after you identify that this is an XML document:
<?xml version="1.0" encoding="utf-8"?>
or
<?xml version="1.0" encoding="iso-8859-1"?>

Think about that for a moment: How obvious and how could it be otherwise? And then think about just how unlikely it is to be true about a publisher’s ONIX file, built up over long periods of time through cut and paste from who knows what source documents. You don’t really know where all the millions of characters in your ONIX file came from, do you? And that’s why trading delimited files or database files doesn’t work. None of these test the incoming data. But XML software does and it won’t work with less than “well encoded” data.

Publishers can think of it this way: You’ve probably heard of or published a book where an “incompetent freelance designer didn’t use the right font” (or used “outdated software,” or provided “bad thingies”) and the files screwed up when they went to the printer. And your production manager “fixed that file” with a lot of overtime and foul language. That’s an encoding problem: What you sent to someone else didn’t appear as you intended it to. If you were trading files in XML and did it right that wouldn’t happen. All sorts of other things might, but not that.

The trick to the encoding statement is that it doesn’t really matter where the characters came from; it’s not about your ability to answer the Zen koan: “What is the encoding of the letter you’re typing now?” What matters is what happens when someone else loads the file. Does their software recognize all the characters? You may have software designed to create an ONIX file, but does it monitor what’s going into it? Does it prevent you from loading dashes from Word 97 or WP5.1 with an error message? Does it ask you what you want the output encoding to be and prevent anything else going in? It would be surprising if it did.

So the first rule of data exchange is that you must test the ONIX output every time you create it. You test your data with XML software before you send it. The XML standard demands it. The ONIX standard depends on it.
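If you don’t have a dedicated XML editor to hand, even a general-purpose XML parser will do this first test. Here is a minimal Python sketch using the standard library’s ElementTree parser; the function name is_well_formed is an assumption for illustration. It only checks that the file is well-formed XML whose characters match its encoding declaration. It does not validate the file against the ONIX standard.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes: bytes) -> bool:
    """True if an XML parser can read the bytes using their declared encoding."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False
```

Pass it the raw bytes of the file (opened in binary mode) so that the parser, not your text editor, decides what the characters are.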

That’s why XML works. The XML standard and software are designed to enforce things like this. You may think you can trade data using Excel or delimited formats, but none of these will do a good job of ensuring that what you send can be read at the other end. XML does (somewhat; don’t think it’ll be perfect), and that’s the main reason it’s better for data transfer.

BISG Presentations

Monday, September 21st, 2009 by sberes

The Book Industry Study Group’s Annual Meeting of Members 2009 was held on Wednesday, September 9, 2009 in NYC. The presentations covered some very interesting and thought-provoking topics and are available on Slideshare. When the Book Rights Registry’s presentation is available, it will be posted at the link below.

You can view them here.

And of course, you can also follow them on Twitter @BISG.

Future of the ISBN - Free Webcast from BISG

Wednesday, August 12th, 2009 by Morgan Cowie

The Book Industry Study Group is hosting a free webcast on Identification and Digital Publications: Exploring the Emerging Standards Landscape. The cast takes place on Tuesday, September 15, 2009 from 11:00 AM to 12:00 PM.

The book industry has had the ISBN for nearly 40 years; there has been little cause for excitement. Now, suddenly the whole subject of “identifiers” has become a hot topic, particularly when it comes to digital books and other online resources. This BISG Webcast will explore why the book industry has standard identifiers, and consider the future of the ISBN (International Standard Book Number), as well as the role of newer identification standards like ISTC (International Standard Text Code) and ISNI (International Standard Name Identifier). What do you need to know to make informed decisions about how — and whether — to use them?

You can find out more and register here.

ONIX 3.0 - Free Web Seminars from BISG and EDItEUR

Monday, July 27th, 2009 by Morgan Cowie

Though it’s still early in the life and times of ONIX 3.0, the new release is starting to generate some questions. Thanks to the BISG and EDItEUR, you can get some answers in two free web seminars that are coming up later this month.

The first, ONIX for Books 3.0: An Introduction, answers four primary questions:

  • Why did the book industry need a new ONIX release?
  • How does ONIX 3.0 provide new support for digital publishing?
  • What are other important benefits of ONIX 3.0?
  • How should publishers and other ONIX users respond to the new release?

The second, ONIX for Books 3.0: Best Practices for Implementation, features Richard Stark (Director of Product Data for Barnes & Noble, Inc. and Chair of BISAC’s Metadata Committee) and is an hour on how best to implement ONIX 3.0 at your firm.