At last year's ebookcraft conference, Teresa Elsey, the Digital Managing Editor of the Trade & Reference Division at Houghton Mifflin Harcourt, shared her experiences maintaining backlist ebooks with her team.
Besides the technology advances that affect the quality of those earlier conversions, she explained that those backlist ebooks aren't getting any more semantic, aren't getting any more legal, aren't getting any more accurate, and have increasingly irrelevant metadata.
In this thought-provoking talk, Teresa discusses the challenges and questions that arise when books turn into software and ebooks don't go away after you've made them.
Learn more about ebookcraft: ebookcraft.booknetcanada.ca
Transcript
Zalina Alvi: We now live in a world where almost two out of every five books purchased by Canadian consumers are ebooks. What that means for most publishers is that frontlist titles are increasingly being published in both digital and print formats, while hundreds, if not thousands, of backlist titles are also being converted and maintained as ebooks as quickly as ebook designers can create them. But this new world we're living in is just that, new. And many of the unique challenges and considerations of a digital plus print publishing landscape are still making themselves apparent. Take for example the conundrum of the backlist ebook. Whereas print books used to, you know, go out of print eventually, ebooks never have to. And while this may seem like a dream come true for many publishers, it also comes with a whole array of editorial, production, and technical obstacles that may surprise you.
At last year's ebookcraft conference, Teresa Elsey, the Digital Managing Editor of the Trade and Reference Division at Houghton Mifflin Harcourt, shared her experiences creating and maintaining backlist ebooks with her team. Besides the technology advances that affect the quality of those earlier conversions, she explained that those backlist ebooks aren't getting any more semantic, aren't getting any more legal, aren't getting any more accurate, and have increasingly irrelevant metadata. In this thought-provoking talk, Teresa discusses the challenges and questions that arise when books turn into software and ebooks don't go away after you've made them.
Teresa Elsey: So a lot of what we've talked about today and a lot of what we talk about at these conferences is always about making brand-new ebooks, often ones that go along with brand-new print books. So we talk about how you're gonna storyboard them out, how you're gonna enhance them, what kinds of tools and workflows you can use if you're starting fresh from the very beginning of the ebook-making process to produce the most accurate semantic, accessible, high-quality, wonderful ebooks. And when I started this job at Houghton Mifflin four years ago, that was primarily what I thought I would be doing, working with my group to make these brand new ebooks. And in the present, I think that's still what most of the people working at my company think that we're doing. But when I started counting the titles that my group was actually working on, it turned out that working on these brand new ebooks is really just a very small part of what we do.
In 2015, my group sent 1,400 ebook files for distribution, and about 300 of those were brand new frontlist ebooks, almost all ebooks that were going along with brand-new print books, some of them maybe being digital-only products. And about 150 more were titles that were appearing in ebook for the first time, but the print book was old. We were making new ebooks of old print books that for some reason didn't have an ebook yet. This is us filling in holes in our backlist. But still the remaining 950 titles that we worked on in 2015 were updates or redos of ebooks that were already on sale. So when I look at these numbers, I see that only 20% of the files that we worked on are the brand new ebook files that people think of as the primary work that our groups do. And about 70% were actually reworks of ebooks that already existed, of books that were already on sale.
So why do these ebooks keep coming back to us? Why two-thirds or three-quarters of the work that my group does is on these old ebooks is what I wanna talk about today. And my group saw an early version of this slide, and they asked me to tell you that there are only four of them, so just to give you an idea how quickly we're working. Now, when I look at our entire ebook list, I think of it as looking something like this, where we have 300 or 400 new ebooks that we're making in a given year up here. And this is the part of our work that everyone sees and everyone's aware of. And this is where our friends the editors are and the designers, and the marketing team.
But then down here, we have approximately 5,000 ebooks, and we'll come back to why that's approximate. But those are the ebooks that we've already made that we now have to take care of. And like an iceberg, it's this underwater part that fills those of us who are aware of it with terror or at least, you know, some significant professional concern. And if I can diverge a little from actual facts about icebergs, let's also mention that this underwater portion is growing every year. The 300 new ebooks that we made in 2015 have now become part of this underwater mass that we have to maintain. And the company is moving on to 300 new ebooks that we'll make in 2016.
Now, this is not entirely different than the situation with print books, where you have a large backlist that's supporting your smaller frontlist. And there are plenty of people in my company who work on the print backlist doing things like make sure our print books get reprinted as needed, and stay in stock at the retailers, and come back to the public's attention when they're relevant. But because we're here at a digital conference, what I actually wanna talk about is how this is totally different from print books. And I'll point out a few ways that this is true as we move through the talk, but here are some to start with.
So one major difference is that many of the ebooks we have already made are not very good. And I think many publishers are in this boat, that they converted a lot of their backlist books to ebooks in an awful big hurry when it started looking like ebooks were gonna be a thing. And so all of a sudden you have hundreds or maybe thousands of backlist ebooks. And they've all been made to old standards with poor-quality OCR, and they're in outdated formats. And they have low-res art, and they've never been proofread, and they're made by processes that are incompatible with your current ebook workflow. And perhaps, ironically, it's your most important backlist books that are the worst because they were the first ones that somebody thought they should convert.
So this is very unlike print books, where I hope anyway, you're not looking at books that you designed and typeset in 2010 and saying, "Oh my gosh, these are so terrible. We didn't know anything about how to make books back then." But that's the situation with a large share of our ebooks. And so one of my group's priorities has been to triage this 5,000-title ebook backlist and to think about how are we gonna fix these? What kind of criteria do we use for deciding which ones we should try to improve and which ones we should just try to remove from sale? And how do we balance the time we spend working on cleaning up that backlist with also getting the frontlist out and continuing to innovate in our processes, and learning some JavaScript, and doing storyboarding, and all the wonderful stuff that we've been talking about here? And then also, because we're not the only ones who have noticed this about our older ebooks, how do we respond to the customer and retailer complaints that are rolling in about these 5,000 old poor-quality ebooks? So that's one thing.
I think another major difference is that we expect print books to go out of print. When we make ebooks by contrast, there seems to be this idea that we're making them to last forever, that ideally every single ebook we make is gonna stay on sale for eternity, that it's gonna become part of this perpetual money-making long tail for our company. And I think there's also a sense that by digitising these books, we're adding them to sort of the world's cultural digital repository of all books ever published.
With print books, on the other hand, I feel like there's more of an understanding that not all books are meant to last forever, that there are books that we could cease to publish when they cease to make money for us or when there's more cost to keeping them up to date than we want to invest. And because there are costs to printing physical books and to storing them in a warehouse, and to shipping them where they're needed, there's a much higher bar for whether it makes sense to keep your print books in print and available.
So the situation we find ourselves in now is that many of our backlist ebooks no longer have a print equivalent. The print book has gone out of print or it's been switched to print-on-demand. And what that means is that there's no longer a physical reminder of the book around our office. There aren't copies. There aren't cover proofs. It's not coming in on reprint agendas. And so that just means that the book is no longer coming up in a lot of my colleagues' consciousnesses. These are books that they've stopped thinking about, even though we, as ebook managers, still have to think about keeping them up to date.
One more difference, I think of all books as going through a sort of life cycle where first the book is published. And for a while, it's new and interesting, and relevant. And then it passes into being less interesting and less relevant, and less correct. And some books might move into other stages where they become classics, or they become vintage kitsch, or they become interesting as historical records, or as the subject of textual study. And these steps become relevant to us as we think about what kind of changes we might make, what kind of choices we're gonna make as we continue to maintain these ebooks. If we're trying to keep an ebook in the relevant bucket for longer, we might do things like update the facts in it, correct mistakes, add new figures, add a new afterword. If we think a book is destined for textual studies, it might be important to preserve the author's exact intentions as they were in the first printing of that book.
And now with print books, though, as they move through this life cycle, they pick up markers that show their age. We can look at a stack of books like this and judge which ones are relevant, or up to date, or likely to contain outmoded language or incorrect facts, either through their package or through their cover design, through their typography, through the wear on the actual physical book. And there are just a million other clues that you don't even consciously know when you're handling a print book that set your expectations for what you're gonna find inside that book.
Ebooks, on the other hand, don't have the packaging to show you the difference. And this is a little bit of an exaggeration. I'm aware that it's possible to put covers on ebooks at this point. But my point is that every ebook, in some sense, looks like a brand new ebook. You're not confronted necessarily with age-appropriate typographical or design cues. There's no wear on the book. There's no yellowing of the book. There's no old book smell. And you're reading the ebook on an e-reader, or an iPad, or a phone that's maybe two years old if you're a slow adopter. So nothing about the interface or the reading experience in an ebook suggests that the content you're consuming might be 5, or 10, or 20, or 50 years old.
So what I think we lose with ebooks is a chance to set expectations for the currency of their content, that in the same way we expect that our iDevices are gonna just work, we expect that our ebooks are gonna just be perfectly up to date, and accurate, and timely, and customised to our needs at the moment that we're reading them.
So if you're with me, the ebooks are a little bit of a different beast here. We can talk about you've got this ebook backlist, how are you going to maintain it? And I think a lot of us come from companies that are rightfully proud of their high quality print product and that they're not necessarily aware of what's happening in their ebooks in the same way.
Now, Sanders asked this morning, whether those of us who work on ebooks consider ourselves primarily coders. And I don't dispute his point that ebooks are made of code and that the work that we do when we're making them is software development because Sanders is very smart, and I don't argue with him. But my group's work doesn't end once we've made the book. We also have to attend to all the tasks that are necessary to keep them up to date and on sale. So we're not just coders. My group is also editors. You know, we work with code and we do technical work. But we also work with content, and we do editorial work. And so that's kind of what I wanna talk about here, that intersection of technical and editorial work. What happens when, like at my company, the ebook developers work for managing editorial? What happens at that place where code meets books?
And so when we talk about maintaining these backlist books, what are we talking about? What are the kinds of issues that come up? And I'm just gonna zip through a whole bunch of them here without suggesting that these are all of them or that all of them necessarily apply to your particular list of books, but just to map out a large range of what the possibilities are. So starting with technical issues, one of the ways that ebooks are different from print books, of course, is that they get technically out of date. They start looking out of date faster.
So as soon as you've made an ebook, you have the clock running. One of the ideas I have about backlist ebooks is that they're basically all ticking time bombs. The clock is running until that ebook is gonna start looking out of date or embarrassing to you. And again, I think that's not the case with print books, that your designers are not looking at books that they made 5 years ago, or 10, or 20, and just saying, "These are now unacceptable to have on the market."
Some of the difference here is purely cosmetic and I can see some of you thinking, you just need to swap in the new CSS. That's no big deal. You'll have your new 2015 model ebook, but some of what we're doing here poses new problems. I don't think you can probably see, but we moved from using straight quotation marks in 2010 to curly quotation marks in 2015. And though you can use your tools to automatically figure out which way, let's say 99% of those should be turned, you still have the problem of finding and correcting the remaining 1%.
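To make that 99%/1% split concrete, here is a minimal Python sketch. It is not the tooling Teresa's team uses; the function name `curl_quotes` and the neighbouring-character heuristic are assumptions chosen for illustration. It turns a straight double quote into a curly one only when the surrounding characters make the direction clear, and it reports the positions it cannot decide so that a person can review them.

```python
OPEN_QUOTE, CLOSE_QUOTE = "\u201c", "\u201d"  # curly double quotes


def curl_quotes(text):
    """Return (converted_text, ambiguous_positions).

    Hypothetical heuristic: a quote after whitespace or an opening bracket
    opens; a quote before whitespace or closing punctuation closes.
    Anything else is left straight and flagged for manual review.
    """
    out = []
    ambiguous = []
    for i, ch in enumerate(text):
        if ch != '"':
            out.append(ch)
            continue
        prev_ch = text[i - 1] if i > 0 else " "
        next_ch = text[i + 1] if i + 1 < len(text) else " "
        looks_open = prev_ch.isspace() or prev_ch in "([{"
        looks_close = next_ch.isspace() or next_ch in ")]}.,;:!?"
        if looks_open and not looks_close:
            out.append(OPEN_QUOTE)
        elif looks_close and not looks_open:
            out.append(CLOSE_QUOTE)
        else:
            out.append(ch)       # leave as-is; a human has to decide
            ambiguous.append(i)
    return "".join(out), ambiguous


converted, todo = curl_quotes('She said, "Hello." Then 5" of rain " fell.')
print(converted)
print("needs review at offsets:", todo)
```

The design choice worth noting is that the function never guesses: the undecided quotes stay untouched and come back as a list, which is the part that still needs a human reader.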
So we actually need to add new semantic information to our books to be able to do this update. And this is a trivial example, but it also applies to updates you wanna make to add accessibility, or if you wanna add epub:type semantics, or you wanna make more sophisticated design or some kind of enhancements: retrofitting old ebooks up to modern standards actually requires human beings to think about the content of each book and what it means. The information that you need to make these updates is not necessarily already in the ebook or available in some kind of digital form. There's no automated way to write good alt text or to know whether a sidebar is part of the main content, or whether it's an aside. Sometimes we just need to update because we get better at what we do. Sometimes we need to update our ebooks to respond to technical spec updates, to make our EPUB 2s into EPUB 3s, or because an e-reader software update has surprisingly broken something. And sometimes what e-readers are able to do has actually changed.
But anyway, all of these things are reasons why you need to be consistently reviewing and updating your ebook backlist. And so if we take for a moment our hypothetical 5,000-ebook backlist, how often do we need to revisit each one of those titles? Based on the current pace of change in the examples I've brought here, it seems like five years was much too long. At the other extreme, sometimes my team will look at a book they made six months ago and say, "This looks terrible. Let us redo it." So if we compromise on something like every three years, then the question is, can you reissue a third of your entire ebook backlist every year? For us, that would be about 1,600 ebooks. And like I said, last year we did 950. So not quite there yet, but that's what we're aiming for.
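The arithmetic behind that target can be sketched in a few lines of Python. The figures are the ones quoted in the talk (a backlist of roughly 5,000 ebooks and 950 reworked files last year); the function name `titles_per_year` is a hypothetical one chosen for illustration.

```python
import math


def titles_per_year(backlist_size, review_cycle_years):
    """Titles to reissue each year so every title is revisited once per cycle."""
    return math.ceil(backlist_size / review_cycle_years)


backlist = 5000        # approximate backlist size quoted in the talk
done_last_year = 950   # files reworked in 2015, as quoted in the talk

for cycle in (1, 3, 5):
    needed = titles_per_year(backlist, cycle)
    shortfall = max(0, needed - done_last_year)
    print(f"{cycle}-year cycle: {needed} titles/year, shortfall {shortfall}")
```

A strict three-year cycle on 5,000 titles works out to 1,667 titles a year, which the talk rounds to about 1,600.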
Another set of issues that take up a lot of my time are those involved with keeping those 5,000 backlist ebooks on sale legally. And this is not a concern that's different from print books, except that we have more of these old books and more impetus to keep them on sale. And unfortunately, fewer editors and authors invested in doing this complicated legal work for the relatively small returns they're seeing on ebooks. And as I mentioned, I sometimes think of backlist ebooks as ticking time bombs. And I think there's no place where that ticking time bomb is more obvious than in the case of rights and permissions for your ebook, making sure you still have the legal right to publish and sell that ebook.
So to start, there's the question of whether you have the rights to make an ebook at all, and this can be complicated for a book that was published before the invention of ebooks. Your contract may not clearly specify whether you have the right to sell an ebook or how much you pay in royalties if you do. And this has been a huge effort for our contracts department to go back for every single one of those 5,000 backlist ebooks to determine whether it was legally permissible for us to make and sell that ebook. And this is ongoing work. I get a report in my email every single day of new books that our contracts department has determined it is legal for us to make an ebook for. And you may have the right to make an ebook, but that comes with certain stipulations, like the illustrator gets to approve it or you have to disable text to speech, or you can make an ebook but not one that has audio or video in it. And that right doesn't necessarily last forever. Another set of emails I get every week are titles where the rights are reverting to the author. And if we have an ebook, we need to remove it from sale.
So easy enough, 5,000 ebook contracts to keep track of to keep your 5,000 ebooks on sale. But then the piece that we've come to think about a lot since then is the rights and permissions for all the separate assets that exist in the ebook. So first, the cover, which often includes a photo or a piece of artwork that we might have the right to use for a certain number of copies or for a certain amount of time. And again, if the cover was designed before the invention of ebooks, we probably didn't get the right to use it on one. And that's why you'll see generic covers on a lot of backlist ebooks. And maybe your book interior contains art, or it contains photographs, or it contains a map, or it contains quotations from poetry or from song lyrics. And again, you may not have requested permission to use those in an ebook when you were arranging for the print book. So those may each individually need to be recleared for your ebook use.
And again, each of those permissions may have its own terms or expiration, where again, it's good for a certain number of copies, or a certain amount of time, or includes certain restrictions, like you're gonna print it at a certain size, or with a certain credit line, or only below a certain dpi, or in an ebook that has some kind of copy protection. And then specific to the ebook also is the right to embed any fonts you've decided to use, which is a separate thing from the rights to use them in print or on the web. And the way this works at my company is we have embedding agreements with font foundries that allow us to embed their fonts across a certain number of ebooks in our backlist in total. So let's also be keeping track of how many fonts from which foundries we've used and in how many books.
So now, for our approximately 5,000 backlist ebooks, we have the rights for the book itself and the rights for the cover and the art, and the rights for, let's say, between 0 and 50 other assets that are in that ebook. So now you're talking about 10,000 or 100,000 separate contracts and agreements that you need to be keeping track of just to legally keep your ebook backlist on sale.
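One way to picture that bookkeeping is as a list of per-asset records, each with its own time limit or copy limit. The Python sketch below is purely illustrative: the `Permission` record, its field names, the `needs_attention` helper, and the sample ISBNs are all assumptions, not a description of any system the speaker's company uses.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Permission:
    isbn: str                           # ebook the permission belongs to
    asset: str                          # e.g. "cover photo", "map", "embedded font"
    expires: Optional[date] = None      # None means no time limit
    copy_limit: Optional[int] = None    # None means no limit on copies


def needs_attention(permissions, today, copies_sold):
    """Return permissions that have expired or reached their copy limit.

    copies_sold maps an ISBN to the number of copies sold so far.
    """
    flagged = []
    for p in permissions:
        expired = p.expires is not None and p.expires < today
        over_limit = (
            p.copy_limit is not None
            and copies_sold.get(p.isbn, 0) >= p.copy_limit
        )
        if expired or over_limit:
            flagged.append(p)
    return flagged


# Made-up example records, for illustration only.
permissions = [
    Permission("9780000000001", "cover photo", expires=date(2015, 12, 31)),
    Permission("9780000000001", "song lyric", copy_limit=10_000),
    Permission("9780000000002", "map"),
]
sales = {"9780000000001": 4_200}

for p in needs_attention(permissions, date(2016, 3, 1), sales):
    print(p.isbn, p.asset)
```

Even this toy version shows why the count multiplies: every ebook contributes one record per cleared asset, and each record can lapse on its own schedule.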
Now, to the editorial issues and I think this is a type of maintenance that seems straightforward, the editorial error, the spelling error, the grammatical error, the factual error. These are the kinds of things that are really easy to fix in an ebook and so we do quite a lot of fixing of these. And with things like these, the question is not whether to fix them, but just how quickly and efficiently we can get these out of our books.
So if we say it takes on average maybe 10 hours to proofread an ebook fully, and we have 5,000 ebooks, that will only take 50,000 hours. And if you have someone working full time, 40 hours a week on that, that would be 25 years. And as those of you who have worked as proofreaders know, there's really no substitute. You can use all kinds of technical means to get closer and closer to 100% fidelity in your books. But to feel fully confident that there's not a single instance of feces in your ebooks, you still need to do the full 10 hours of proofreading.
Now, ebooks can have automated spell checking applied to them, and this does help a lot for certain errors. My team now manually reviews every instance of feces that occurs in one of our ebooks. And you would be surprised how many times it does accurately occur in our books. But of course, our retailers are also interested in the quality of our ebooks, and they will helpfully flag things that they think are errors for us.
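A manual review like that usually starts from a list of every place a watched word occurs, with enough surrounding text for a person to judge it. Here is a minimal Python sketch under stated assumptions: the function name `find_hits`, the watchlist contents, and the sample sentence are hypothetical, and a real workflow would first extract the text from the ebook's content files.

```python
import re


def find_hits(text, watchlist, context=30):
    """Return (matched_word, offset, snippet) for each watchlist word in text.

    Matching is case-insensitive and on whole words only.
    """
    alternatives = "|".join(re.escape(word) for word in watchlist)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        hits.append((match.group(0), match.start(), text[start:end]))
    return hits


sample = "Their feces lit up with joy. The lab tested feces samples."
for word, offset, snippet in find_hits(sample, ["feces"]):
    print(offset, repr(snippet))
```

The code only finds candidates. Deciding that the first hit is an OCR error and the second is correct is still the reviewer's job.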
And so we actually spend a really considerable amount of staff time going through processing these errors that are not really errors. Things like, this is a book that has a quotation from a 19th-century primary source that has original 19th-century spellings in it. This is a book called "Flowers for Algernon," and these are not mistakes in the book. This is how the book is written, but it's one that is continually, continually being reported to me by retailers as having spelling mistakes when they do automated checking on it. And you could say, "Well, just look at the print source." And I wish that's something that our retailers would start doing instead of calling me every time they see this. But, you know, unfortunately, my company does occasionally publish a print book with an error in it as well. And then what do you do in the ebook?
Thinking, again, of the lifecycle of the book. Is this book a classic, where the textual errors are gonna become important enough to study and we're taking something away from the ebook text by making it not match the print book or by correcting that error silently?
And so one set of books my company publishes are "The Hobbit" and "The Lord of the Rings." And these are books where I'm just petrified of making any changes to the ebook text, even when we see something that is obviously an error because the integrity of that text is something that people care about. It's something that they write papers about. And here's a paper about the text of "The Lord of the Rings" that quotes Tolkien himself complaining about "impertinent compositors that have taken it upon themselves to correct as they suppose, my spelling and grammar." And so I wonder if in 2015, we can predict which books we're publishing now that are going to be the Tolkiens of 10, or 20, or 50 years from now if we wanna set ourselves up to be this decade's impertinent compositors by going in and fixing all those things that Amazon thinks are typos in our books. And sometimes we're just not sure if something is an error.
This is the print book of a science fiction novel by Philip K. Dick, and you can see he has some creative use of language in there. But in the third line, it says, "He zomed in." And I think that should probably be zoomed. But that's what's in the print book, and we just don't have any way to know if this is an error we should correct, or if this is an intentional use of language on the part of the author. So what do you do for your ebook?
Then there's a case where the ebook is completely accurate at the time of publication, but becomes factually inaccurate over time. So this is a children's book we've published about time-telling. And it has this sentence about Daylight Savings Time, which was accurate until recently, but we've changed the dates. And similarly, you may have books that say things like, "Pluto is a planet," or social science books where the statistics have become out of date, or maybe you publish travel guides where literally every piece of the information becomes out of date after 5 or 10 years.
And again, because every ebook looks like a brand new ebook and because you're reading it on your brand-new eighth-generation Kindle Fire, these kinds of factual errors, I think, are a lot more jarring than they would be in a print book that has cues that tell you how old that book is and what you could expect from it. And again, are these errors that we can fix or that we should fix? Do we need the author's input to make a fix like this? And what do you do if the author is dead, and the agent is retired, and the editor has left the company? Are these errors that we can fix silently? Do we need some kind of editorial note when we change them? Great questions.
Then there's the case where the content of the book is not incorrect per se, but it's become outmoded or it's become offensive over time. So this is an example from a diet book from the 1980s, which in addition to the diet advice being completely the opposite of what we would give you for diet advice today, it's out of date, or I think we could even call it racist in the way that it refers to Asian Americans as Orientals. And then it goes on to suggest that Orientals who live in the United States somehow are not Americans. But so what do you do here? This is language that's incidental to the content of the book. And does it matter if you know who's buying it? Do you think it's people who want the diet advice or is it people who are doing research about how Asian Americans have been represented in diet literature throughout the 1980s in the United States?
Similarly, this is a lovely classic children's book. It's from 1960. It's about two little girls and there's a witch, and the witch somehow is a baby. And there's a spelling bee, like a bee that actually flies around spelling words. And there's just this minor part of the story where the girls are putting on their Halloween costumes and one is gonna be a witch, and the other is a little Chinese girl, and she had makeup on her face. And I'm not really sure what's happening. I think we agree that a nationality is not a super cool Halloween costume at this point. But I'm not sure we're actually claiming that she's putting on yellowface or that she's just, you know, borrowed her mother's lipstick here. And I don't know who can decide. Is this fine or what do we do about it? You know, this book is not Huckleberry Finn. It's not a book about race where we're gonna talk about the history of the book, and we're gonna talk about the controversy. It's just a little incident in a book that's about something else entirely.
So is this something we should be concerned about in an ebook that we're selling today that looks like a brand new ebook, maybe just like these more enlightened other children's ebooks that we're publishing in 2015?
If you want to avoid all these questions, you can be a strict constitutionalist and say, "The answer to the question is just follow the print book." But even if we believe for a moment that our only goal for ebooks is that they be a perfect replica of the print book, we're still gonna run into issues because print books themselves contain some ambiguity, particularly when we don't have digital source files. And I think this is something that we don't talk about a lot, that the physical print book actually does not have all the answers that we need in order to make a perfect ebook.
So some of these things have to do with the intention of the author or the designer, which we may not know. And a common case of this is a print book index or anywhere that a print book gives a cross-reference to a particular page. If we look at this example and we wanna learn more about Frank Alexander on page 99, we know that the content the indexer is pointing us to is somewhere between here and here, where page 99 was in the print book. But if we're working from a printed index, we don't have any information to tell us where exactly on that page the indexer meant to point us, even though now, in the ebook, we have the technology to take you to that more precise location.
Another case of ambiguity in the print book is the location of a piece of art, which is often chosen to work with the layout of the print page. So the designer may put a photo at the bottom of the page and for the ebook, that turns out to be in the middle of the paragraph. We'd like to move it out of that paragraph, but we don't know if that picture was meant to go with the paragraph before or the paragraph after. And this is a print book that we've resisted converting to ebook for a long time, just because nobody can seem to answer the question of whether these tarot cards in the margins throughout the book belong specifically to certain paragraphs or passages, or whether they're merely decorative.
As well, there are conventions of print design that just don't work correctly in an ebook. So this is a print book and you probably don't see an error in the first line here. But in the ebook, it's a lot more obvious that the open quotation mark is missing. That's a design convention to omit the open quotation mark before the drop cap. And it's not something that makes sense in the digitised text, especially if I'm gonna remove the drop cap. And I seem to be getting a lot of advice on Twitter today that I should be taking drop caps out of my books. So add the quotation mark when you take the drop cap out.
Similarly, you can see a book that has a text ornament at the bottom of the page and that's something that's easy enough to replicate in an ebook. But then when you go through and review the whole ebook, and you notice that this ornament only appears once in the entire book. And why is that? It's because the designers only used an ornament in a space break when it's at the bottom of a print page. That's a convention that no longer makes sense in the ebook, but there's also no clue to tell us that we should take it out in this ebook, but not in another ebook.
Another kind of ambiguity we get a lot from print books is the hyphen that falls in a line break. Is that a hard hyphen or is it a soft hyphen? Similarly, you might have a poem that runs several pages long and you can't tell if there's a stanza break at the bottom of that page or not. So all these are just to say that as we're cleaning up and improving our backlist ebooks, we're continually finding places where we, as the ebook developers, need to make decisions and judgments, where the source material just doesn't provide everything we need to create the completely accurate ebook. And as I mentioned before, a major way that old books, print books, backlist books are ambiguous is in their semantics. As we begin to use things like epub:type, or HTML5 markup, or try to add accessibility to our electronic text, the information that we need to do that is not automatically extractable from the print source.
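The line-break hyphen question mentioned above can be partly mechanised with a word list, though only partly. The following Python sketch is an assumption-laden illustration: the function name `resolve_line_break_hyphen` and the tiny `known_words` set are hypothetical, and a real conversion would need a proper dictionary plus the book's own vocabulary.

```python
def resolve_line_break_hyphen(left, right, known_words):
    """Decide whether a hyphen at a line break is soft or hard.

    left/right are the word fragments on either side of the break.
    Returns (text, verdict), where verdict is "soft" (drop the hyphen),
    "hard" (keep it), or "ambiguous" (a person must check the source).
    """
    joined = (left + right).lower()
    hyphenated = (left + "-" + right).lower()
    joined_known = joined in known_words
    hyphenated_known = hyphenated in known_words
    if joined_known and not hyphenated_known:
        return left + right, "soft"
    if hyphenated_known and not joined_known:
        return left + "-" + right, "hard"
    # Both forms exist, or neither does: keep the hyphen and flag it.
    return left + "-" + right, "ambiguous"


# Tiny made-up word list, for illustration only.
known_words = {"conversion", "well-known", "recreation", "re-creation"}

for left, right in [("con", "version"), ("well", "known"), ("re", "creation")]:
    print(resolve_line_break_hyphen(left, right, known_words))
```

Cases like "re-creation" versus "recreation" are exactly where the print page gives no answer, so the sketch flags them instead of guessing.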
Onto metadata, I think there are two types of metadata that are important to your backlist. The first is the customer-facing metadata, the kind of stuff that you see on the Amazon product page for the book. And we find that the metadata describing an ebook grows out of date over time and needs constant attention to be maintained. And this is true for the print book as well, of course, but, again, it's exacerbated by these never-out-of-print ebook backlists perhaps that have become orphaned or unnoticed by their authors and editors.
So first, as new types of metadata come into use, you need to go back to your 5,000 books and add those. And those might be things like adding keywords, or the BISAC subject codes for juvenile and YA titles have just changed, so you'll go back and update those 5,000 titles. And then you need to update the existing metadata as needed. One place we see this a lot is in the author bio. So the author bio may say something like, "This is the novelist's first book," but she's actually published six books since then, or the author lives in California, but she's actually moved to Paris, or it might suggest that the author is living when she no longer is. And that's been a real question for us, whether we should leave the biography as it was when we received it, true as of the time of the publication of the original book: "Will Shakespeare is a 30-year-old actor from Stratford-upon-Avon. This is his first play." Or should we change it so that it's true as of the current moment?
So this gets tricky. Now, if you're updating, whose responsibility is it to keep track of what's going on in your authors' and former authors' lives? It gets personal. You're gonna have to ask where they live, whether they're married, how their two dogs are doing, what they've written since then. And I've been seriously asked the question, you know, "The bio says she lives with her son, but that was 20 years ago. Can we assume that the son has grown up and left the nest?"
The customer description of the ebook is another thing that can grow stale quickly. You can imagine the sort of breathless rhetoric that you use to sell books to say, "This is new. It's revolutionary. It's cutting edge. This is the most up-to-date. It's the best. It's the only." But 2, or 5, or 20 years later, that kind of language just seems silly. But revising the description is not just removing those telling words but actually recontextualising what is relevant or important about the book today, trying to tell the story to the prospective reader of why this book is still worth buying.
So besides your customer-facing metadata, you also have internal metadata. And I've been telling you all along that we have 5,000 backlist ebooks with an asterisk because we're really not sure how many backlist ebooks we have. Internal metadata and record-keeping for ebooks is tricky because your internal record-keeping systems were probably not designed to answer the kinds of questions that we have now. As an example, you may be making different ebook file formats for different retailers. And the technically correct way to handle that would be to assign a different ISBN to each format. And I don't believe that anyone actually does that. And if they do, it creates its own problems in terms of metadata that should be common between those files.
So anyway, somehow, under your single ebook ISBN, you all of a sudden need to be capturing information about things like what formats of this book have we made? How did we make them? Which retailers are selling this ebook? And who's selling which format? How many times have we updated that ebook file? What changes did we make? What version of our ebook-making tools did we use to make this book? Can we make it again by running it through our current standard tools? Are there versions of this book that have been removed from sale? Why was that? If we've made changes to this book from the print edition, what were they and why did we do that?
And maybe you've been capturing this data, but it's in sort of a freeform way because you don't have the right fields for it in your title database, which is fine until it turns out one day that you need that data to be searchable and reportable on because someone's gonna ask you something like, "How many backlist ebooks do you have? How many of those backlist ebooks are EPUB 2? How many ebooks do we have on sale at Apple that we don't have at Barnes & Noble? Why does this one ebook file exist in an EPUB 2 version and an EPUB 3 version, and another EPUB 3 with audio, and a separate Kindle file, and a custom sample?" Great questions.
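As a rough illustration of what searchable and reportable could mean here, the following is a minimal sketch of a structured record for each ebook file, plus two of the questions above expressed as queries. The field names, the `EbookFile` class, and the placeholder ISBN values are all hypothetical; they are not taken from any real title database.

```python
from dataclasses import dataclass, field


@dataclass
class EbookFile:
    """One deliverable file made under a single ebook ISBN (hypothetical schema)."""
    isbn: str
    file_format: str                                  # e.g. "EPUB 2", "EPUB 3", "Kindle"
    retailers: set[str] = field(default_factory=set)  # where this file is on sale
    toolchain_version: str = ""                       # which build tools produced it
    revision_notes: list[str] = field(default_factory=list)  # what changed and why


def count_by_format(files: list[EbookFile], file_format: str) -> int:
    """How many files exist in a given format?"""
    return sum(1 for f in files if f.file_format == file_format)


def isbns_at_but_not_at(files: list[EbookFile], present: str, absent: str) -> set[str]:
    """ISBNs on sale at one retailer that have no file on sale at another."""
    at_present = {f.isbn for f in files if present in f.retailers}
    at_absent = {f.isbn for f in files if absent in f.retailers}
    return at_present - at_absent


# Placeholder records for illustration only.
catalog = [
    EbookFile("ISBN-A", "EPUB 2", {"Apple", "Barnes & Noble"}),
    EbookFile("ISBN-B", "EPUB 3", {"Apple"}, revision_notes=["Removed dead hyperlink"]),
    EbookFile("ISBN-B", "Kindle", {"Amazon"}),
]

print(count_by_format(catalog, "EPUB 2"))
print(isbns_at_but_not_at(catalog, "Apple", "Barnes & Noble"))
```

Once the notes live in fields like these rather than in freeform text, each of those questions becomes a short filter instead of a manual audit.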
And I have just one more thing for you. Many of our ebooks have URLs in them, particularly nonfiction books or adult books that have a lot of sources, and end notes have a lot of URLs. And in ebooks, we make those URLs into hyperlinks. And as we know, URLs can stop working. This is something the web community has gotten really good at talking about and they call it link rot. And there's some scholarly research on the prevalence of link rot, which is when URLs stop working, and also what they call reference rot, which is when the information at the given URL changes from what it was when the author cited it. And that's really hard to study. So I don't even wanna talk about that right now. But anyway, this particular study found that more than half of the URLs cited in U.S. Supreme Court decisions suffered from either link rot or reference rot. And that's a pretty difficult thing for the state of American jurisprudence that all of these sources that contribute to your body of case law are now no longer available to anyone.
Here's another study suggesting that the half-life of the average URL is two years. So what does that look like in a book? This is a typical nonfiction book from our backlist. It has 275 URLs in it, mostly in the endnotes. Thank you to FlightDeck for helping me count those URLs. And it was published in 2014. So if we use the model that says the half-life of a URL is two years, we would expect that this summer, in August 2016, maybe 50% of those URLs are gonna still be good. By August 2018, maybe 25% of the URLs will still be good. By August 2024, which is 10 years after pub, a totally reasonable amount to have this book still in print and available, this model would predict that only 3% of the URLs are still going to work. And I don't know about all of you, but I'm actually not expecting to be retired by 2024. So this is like a real problem for me I see coming up.
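The arithmetic behind those percentages can be sketched in a few lines. This assumes a plain exponential-decay model with a two-year half-life, which is the model described in the talk; the function names are made up for illustration.

```python
def surviving_fraction(years_since_pub: float, half_life_years: float = 2.0) -> float:
    """Fraction of URLs expected to still work under a simple half-life model."""
    return 0.5 ** (years_since_pub / half_life_years)


def expected_working(total_urls: int, years_since_pub: float,
                     half_life_years: float = 2.0) -> int:
    """Expected number of working URLs, rounded to the nearest whole URL."""
    return round(total_urls * surviving_fraction(years_since_pub, half_life_years))


# The book in the example: 275 URLs, published in August 2014.
for year in (2016, 2018, 2024):
    elapsed = year - 2014
    print(year, f"{surviving_fraction(elapsed):.1%}", expected_working(275, elapsed))
```

Ten years is five half-lives, so the surviving fraction is 0.5 to the fifth power, or a little over 3%.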
So I went ahead and actually tested the 275 URLs in this book. And these are the results that I got. I found that about 47 of the URLs or 17% were not working. And to be fair, some of these are errors that we actually introduced in making the print or ebooks. So these are URLs that actually never worked, not ones that have stopped working since we made the ebook, but that was still good information to know. And 83% of the URLs in the book are still working. So a year and a half after publication, that's running maybe a little ahead of the model, a little ahead of where people who are trying to sell us link rot solutions would predict we would be, but still a problem.
And as usual, there's a do it right from the beginning solution for this, which certain parts of the web and scholarly communities are embracing, using things like DOIs to identify electronic documents or a service like Perma.cc that lets you archive cited online content. And this is something that I haven't really seen trade publishing take on. Although as a shameless plug, my husband works on Perma.cc, which is being used to make legal web citations permanent, and he would love to talk to publishers about how he can help them.
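A test like this can be scripted. The sketch below is not the tool used for the count in the talk; it is one possible approach using only the Python standard library. It collects absolute `http` and `https` links from an ebook content file and tallies how many respond. The function names and the `User-Agent` string are assumptions made for illustration, and because some servers reject `HEAD` requests, anything reported as broken should still be checked by a person.

```python
import http.client
import urllib.request
from collections import Counter
from html.parser import HTMLParser
from typing import Callable, Iterable


class LinkCollector(HTMLParser):
    """Collects absolute http(s) URLs from <a href="..."> attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.urls.append(value)


def extract_urls(xhtml: str) -> list[str]:
    """Return every external link found in one content document."""
    collector = LinkCollector()
    collector.feed(xhtml)
    return collector.urls


def check_url(url: str, timeout: float = 10.0) -> str:
    """Return "ok" if the server answers without an error, otherwise "broken"."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "backlist-link-check"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return "ok" if response.status < 400 else "broken"
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, DNS failures, timeouts, and malformed URLs.
        return "broken"


def tally(urls: Iterable[str], checker: Callable[[str], str] = check_url) -> Counter:
    """Count results per status; the checker is injectable so it can be stubbed."""
    return Counter(checker(url) for url in urls)
```

Running `tally(extract_urls(text))` over each content file in an unzipped EPUB would give an ok/broken count comparable to the figures above. It would not detect reference rot, where the page still loads but no longer says what the author cited.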
But because what we're talking about here is backlist, the problem is that we don't have the advantage of going back in time to make our authors archive their websites properly. So what's happening to us is we're inheriting tens of thousands of URLs that were not deliberately preserved in any way. And I would say that not all of our books have 275 URLs in them. Some have none, but we might say 20 on average. And so hypothetical, 5,000-ebook backlist, 20 URLs per book, 100,000 slowly decaying URLs that are now our problem.
And this is a problem in the print books as well as in the ebooks, of course, but I think we're more content to let the URLs in print books function essentially as decoration, as like a sign that there is scholarly research behind this claim being made. And we also assume that a very motivated reader who's going to type out an entire URL to get to a source document also has some basic web literacy around using a search engine to find a replacement source if the one they're looking for isn't there. But because the URLs in ebooks have become hyperlinks that imply the information is just a click away when it isn't, it doesn't just seem like the link is broken, but it seems like your ebook is broken.
So this is the scale at which we're dealing with this problem right now. Our backlist has 100,000 slowly decaying URLs, again, these little time bombs all through our backlist, and the retailer response is to send me an individual email every time they notice one. So fortunately, they're not looking very hard right now. But when this happens, and this is the interesting part to me, what are we going to do to fix this? There are a few options, but they all have some sort of textual implications.
So if the site's just been reorganised, you could find a new correct URL and replace it. If the point is just to point the reader to a certain resource, you could find a replacement URL that takes you there. You could just remove the URL. In the case where maybe the citation to the print source is complete enough that we think the reader can still find the print source if they need it, we might just take the URL out entirely.
Our solution that seems to work for Apple, but is not my favourite, is removing the hyperlink from the URL so the URL is still in the book, but it's not clickable. And that to me just looks like a mistake, but it's satisfying enough to someone who's doing automated checking of our books and rejecting ones where they think there are URLs that are broken. So what do you do? I've now told you about all the insurmountable issues that exist in your backlist ebooks. How are you gonna handle that problem?
So what we've done in my group first is we've developed a process by which our backlist books are getting periodically reviewed. We have two production reports. One is for our frontlist books and one is what we call updates. And those updates have schedules, and they have deadlines the same way that our frontlist books do. So every week, we're dredging up and working on some of our older books. And there are a few ways that ebooks get on that update list. It may be that a vendor's alerted us to a problem. It may be that there's a new edition coming out in print. We have a paperback that's gonna have a new cover and corrigenda. The ebook might be scheduled to be promoted and might be related to a holiday. There might be news or events that are gonna lead to new interest in that ebook.
We'll work through a list of our best-selling ebooks from top to bottom, or we'll work through a set of related books like a series from beginning to end. And that's not a systematic approach in the sense that some of our backlist titles never make it to that list and other titles cycle on to it over and over again like a holiday title that gets promoted every year. So we're talking about how to find and address all our oldest ebooks, but we're at least comfortable that through this process, the titles that are selling the most, and getting the most attention are also getting re-reviewed the most frequently.
It might be ideal for us to update 1,600 ebooks a year and we can't do that. But if we just threw up our hands and didn't do any, we wouldn't have done 950 last year. We don't have time to fully proofread most of our ebooks, but any backlist title we reissue gets an hour or two of review. So I think just because the problem is hard, because the problem is insurmountable, doesn't mean you shouldn't do what you can and be doing that regularly.
There are also technological and workflow tools that can make this process easier. So you might have scripts to search for common errors. You might be using SASS to manage your CSS. One of my group's projects last year was to replace our primary ebook-making tool and that reduced the time that it took to build an ebook from 3 minutes to 30 seconds, which didn't add up to a lot in terms of the time it saved each week, but it really lowered the barrier to getting an ebook rebuilt.
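As one example of a script that searches for a common error, here is a minimal sketch that flags the drop-cap problem described earlier in the talk: paragraphs where a closing quotation mark appears before any opening one. It is a heuristic built for illustration, not a description of any particular team's tooling. It only looks at curly double quotes, it uses regular expressions rather than a full XHTML parser, and its output is a list of candidates for a person to review.

```python
import re

# Rough paragraph and tag matchers; adequate for a scan, not a full XHTML parse.
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p>", re.DOTALL | re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")
DOUBLE_QUOTE = re.compile("[\u201c\u201d]")  # curly opening or closing double quote


def paragraphs_missing_open_quote(xhtml: str) -> list[str]:
    """Return the text of paragraphs whose first double quote is a closing one."""
    flagged = []
    for match in PARAGRAPH.finditer(xhtml):
        text = TAG.sub("", match.group(1)).strip()
        first_quote = DOUBLE_QUOTE.search(text)
        if first_quote and first_quote.group() == "\u201d":
            flagged.append(text)
    return flagged


sample = (
    '<p><span class="dropcap">I</span>t was late,\u201d she said.</p>'
    "<p>\u201cFine,\u201d he said.</p>"
)
for paragraph in paragraphs_missing_open_quote(sample):
    print("Check opening quote:", paragraph)
```

A check like this is cheap enough to run on every rebuild, which fits the goal of lowering the barrier to updating an ebook.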
So my advice to you is to make it very easy to update and rebuild an ebook because that's something you're gonna wanna be doing and you're gonna do it frequently. Along the lines of making it easy to update an ebook, as you design your frontlist processes, I think a big piece of your brain should be devoted to thinking about how much you're gonna wanna redo that process over and over, and over, and over again as you maintain the ebooks. My group has a process that's fairly standard, where you build the ebook, and then optionally, you can crack that ebook open to add manual customisations. And every time they do that, I want them to be thinking whether the value of that manual customisation is worth the cost of having to maintain it, again, potentially forever.
There's a record-keeping piece where we're trying to keep good notes about the decisions that we make for each ebook. So when we come back to remake them in 5 years or 10 years, we know why that book had custom CSS and what the rights are for the embedded fonts, and what kinds of workarounds we employ that may no longer be needed because five years from now, we're gonna be in a perfect standards-compliant future. And what I've seen in my group is that the need to do this record-keeping suddenly starts to seem very important when you inherit someone else's ebook to work on where they didn't provide that critical information for you.
But I think there's a piece of this that is also just hard, and it requires smart human beings to continually engage with the types of problems that we've just discussed, and to make these case-by-case calls on what you need to do in order to keep your ebooks available and representing your company the way you want them to. And I think that's a process that it's really important that you're not just delegating to a lone digital production intern you have sitting in a corner or to your ebook outsourcing vendor. But that actually requires everyone at your company who cares about the content of your products to get involved.
And what I hope I've expressed is that I find these issues really interesting and engaging, as well as just challenging. But figuring out how to tackle these sorts of novel problems at scale on this large backlist while still respecting the individual needs of different titles is what makes what we do really interesting and engaging work.
And this last one is why I'm here today because I think convincing people to notice this backlist work and to acknowledge that it's important work and to help them be thoughtful about how the ebooks they're making are gonna gracefully become a part of your perpetual backlist is a really important part of whatever you're doing to make frontlist ebooks. And so if you can get the rest of your company to care about this stuff, to care about the content of the ebooks, you'll also be getting them involved in the digital future of their words and their ideas. And I think that's what makes the work we do here make a lot more sense and be a lot more meaningful.
Zalina: Thanks to Teresa for giving this presentation at ebookcraft 2016. If you're interested in hearing more talks like this and learning more about today's challenges for digital publishing, visit ebookcraft.booknetcanada.ca for information on this year's conference. We gratefully acknowledge the financial support of the Government of Canada through the Canada Book Fund for this project and, of course, thanks to you for listening.