Podcast: Using data to predict bestselling and prize-winning books

Can quantitative analysis be used to predict if your book will become a bestseller or win a prestigious award? Andrew Piper, associate professor at McGill University and director of .txtLAB, a digital humanities laboratory focused on applying computational approaches to literature and culture, certainly wanted to find out.

With a reported 75% prediction rate for bestsellers, what can Andrew teach us about the bestseller formula, and what are the practical implications for the publishing industry?

(Scroll down for a transcript of the conversation.)

Want to make sure you never miss an episode of the podcast? You can subscribe for free on iTunes, Stitcher, Pocket Casts, TuneIn, or SoundCloud.

Transcript

Zalina Alvi: Hi, and welcome to the BookNet Canada Podcast. I'm Zalina Alvi, the community manager here at BookNet. In this episode, I'll be talking with Andrew Piper, an associate professor at McGill University. He's also the director of .txtlab, a digital humanities laboratory focused on using computational approaches to study literature and culture.

Since we're all about using data to better understand books, publishing, and the readers who love them, we jumped at the chance to have Andrew on the podcast so we can share some of the findings from their research. I'll be talking about the patterns they found after analyzing a decade's worth of bestsellers and prize winners, how the lab's findings can be applied in the real world, and what else we can learn about books by using quantitative analysis. So, let's start with how this whole project started.

Andrew Piper: This project, it actually began when we were starting to look at what makes a novel win a prize. So, our first question wasn't initially into bestsellers, but more about whether we could predict prize-winning novels. And as we started going down that road, we began to think about prizes as kind of one way that novels are socially important, are socially valued, and bestsellers then become a kind of another way of looking at that problem.

So, for us, what we really began to think more about was a larger kind of literary field of what we might call social value. So, how books become important to people and in different ways. So, winning a prize is one way that that's true. Selling a lot of copies is another way. And there might be other kind of metrics that one could come up with to think about how books become important, culturally.

So that's kind of where we started. And I think that's been an important background for the stuff that we're finding and the questions we're asking, because what we're really interested in is how these groups relate to each other, not in any kind of absolute sense, but more in a relative sense.

So, we looked at a sample of about 200 bestsellers from the past decade, and these were the works that stayed on the bestseller list for the longest aggregate period of time. And again, this is in an American context. We were using the "New York Times" bestseller list. And that ranges from books that have been strongly, strongly present for several hundred weeks, "Fifty Shades of Grey," for example, down to books that have been there only for about 9 or 10 weeks.

So, the range is quite strong, and one of the things we're learning is that there's probably kind of subgroups within this bigger category called bestsellers. Not all bestsellers are alike and some do different things, and that can connect a little bit with just how present or how popular those books are.

But so that was our samples drawn from the "New York Times" bestseller list, about 200 novels. We have an overall collection of a couple of thousand that we compare those to within a larger context of different types of writing to understand them a little bit better.

So, when we began to look at this and began to compare bestselling novels to their prize-winning counterparts, what we might call the more serious fiction, the things that we found were principally related to being more people-centred, generally speaking. So, there's a lot more dialogue. There's about 50% more dialogue in bestselling novels than in prize-winning novels.

There are considerably more characters, probably about 30% more characters, we found. And a lot of the grammatical formulations that are unique to bestsellers revolve around people, so you just see a lot more proper names and verbs attached to each other, whereas in serious literature, nouns will do a lot more of the work: these kinds of abstractions or objects, things that we're describing or thinking about.

And so, if you're giving advice to people about how to write a bestseller, what you really want to focus on is orienting as much of your writing as possible around people, around the experiences of people, the way they interact with each other, and the way they do things in the world. That was probably the most important thing.

We also found very clear thematic differences, too, which most people are sort of familiar with intuitively when they look at bestseller lists. So, a lot more focus on crime and legal dramas, and a lot more focus on technology.

And one of the things that surprised us, that came out over the course of the research and would probably be less intuitively clear to readers or to people setting out to try and write a novel, is the emphasis on time. That was something we found to be a very strong predictor.

So, serious literature tends to be much more focused on things in the past, much more retrospective, much more nostalgic, whereas bestselling writing tends to be much more written to the moment. So, the timeframes of bestsellers really are the single-day sort of theme: tonight, tomorrow, yesterday, this afternoon.

They'll mention specific timeframes: next, now. There's a sort of vocabulary of urgency that drives the action of a bestseller, whereas serious literature tends to be much more reflective and based in the past.
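To make the kinds of features Andrew describes a little more concrete, here is a minimal sketch of how one might compute rough proxies for them, dialogue share, day-level time vocabulary, and cast size, from the plain text of a novel. The word list, heuristics, and function names are illustrative assumptions for this post, not .txtlab's actual pipeline.

```python
# Illustrative proxies for the features discussed above (dialogue share,
# urgent time vocabulary, people-centredness). Word lists and heuristics
# are assumptions, not .txtlab's method.
import re

URGENT_TIME_WORDS = {"tonight", "tomorrow", "yesterday", "now", "next",
                     "today", "afternoon", "morning"}

def dialogue_share(text: str) -> float:
    """Rough fraction of characters that fall inside double-quoted speech."""
    quoted = re.findall(r'"[^"]*"', text)
    return sum(len(q) for q in quoted) / max(len(text), 1)

def urgency_rate(text: str) -> float:
    """Day-level, 'written to the moment' time words per 1,000 words."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for w in words if w in URGENT_TIME_WORDS)
    return 1000 * hits / max(len(words), 1)

def rough_character_count(text: str) -> int:
    """Very rough proxy for cast size: capitalized tokens that also appear
    mid-sentence (so they are probably names, not sentence starts)."""
    tokens = set(re.findall(r"\b[A-Z][a-z]+\b", text))
    mid_sentence = set(re.findall(r"(?<=[a-z,;] )[A-Z][a-z]+", text))
    return len(tokens & mid_sentence)

sample = 'Tonight, Anna said, "We leave tomorrow." Marco nodded at Anna.'
print(dialogue_share(sample), urgency_rate(sample), rough_character_count(sample))
```

A real feature set would be far richer, for example using part-of-speech tagging and named-entity recognition to capture the proper-name-plus-verb constructions Andrew mentions, but crude proxies like these show the general shape of the approach.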

Zalina: Can you talk a little bit to the purpose and the goal of this kind of data analysis?

Andrew: Yeah, that's a good question. I mean, I think we have different aims, or I personally have just kind of different interests. Some of this is related to my own background.

So, my background was actually more in the history of reading, history of books fields. And so my last book, which is called "Book Was There," was really about trying to understand how reading has changed over time, depending on different types of reading technologies.

And so I was really drawn to computation as just a sort of new way of engaging with the same documents that have always been interesting to me, which is literature. And so that's kind of where I started. My background is in literary studies, not in computer science, and I've gradually kind of learned and gotten more comfortable with the computational side of things.

And it's really a kind of process of translation to begin to figure out: how do we map the questions that we traditionally ask about things like novels or poetry, and how can we use computation to try and understand those things in meaningful ways?

In terms of what we're hoping to do, I mean, I think it has sort of two aspects. One is, I think we're addressing a big evidence gap in our field. So, traditionally, if we want to make assessments about large categories, like a period such as Romanticism or Modernism, or the 20th century, or a genre like the novel, there are really only so many books we can read.

And so, I think we're actually quite adept at generalizing from the small set of examples. But we really don't know how well those generalizations hold up. So, on the one hand, it's really kind of research-based, let's bring more evidence into our discussions, as a field. And I think computation gives us a way to do that, and I think it's very, very important from the research perspective.

I think in terms of readers, students, people who are interested in literature and reading, I think computation gives us a kind of different way of thinking about books. It can be more analytical, and obviously more quantitative, and that's important.

That is to say, books have very strong quantitative dimensions. This is kind of, that's what they're composed of, words repeating themselves over and over again, and that's in some sense, how they become meaningful to us.

And so I think what's interesting about computation is that it lets us see those dimensions of texts which, using our traditional methods, we really couldn't see. And so it kind of profiles a different aspect of culture and creativity that really interests me.

Zalina: But what about practical applications? An article in the "Globe" recently mentioned that your lab has a 75% prediction rate for bestsellers. So could publishers, in theory, use your findings to choose or commission manuscripts that have a high probability of becoming bestsellers?

Andrew: Yeah, so I think we... I mean, we kind of see different audiences for this kind of research. So, I think publishers are one natural audience. I think writers are another, which I can talk about in a little bit, and so are readers.

So, in terms of publishers, it may be the case that some publishers are beginning to experiment with this kind of work. I haven't heard much about it. I'd be curious to know the extent to which it has become active within publishing circles. I'm sort of surprised it hasn't.

It makes sense that having a more analytical understanding of one's backlist, or even of the new books one is commissioning, is just a really practical way to go about predicting whether something may catch fire with the public. And we know there are a lot of extrinsic factors that go into making a bestseller, not least of which is whether the person has already written a bestseller.

But we also know from this research that there are a lot of intrinsic factors. There are a lot of aspects that are very predictable about whether a novel will or will not be a bestseller, and having that information, as a publisher, could be very useful for helping select manuscripts from your submissions, figuring out how to gauge your advance, or making judgments about the marketing budget that should go with something.

So, it's not going to change everything, but I suspect it could be a very useful tool in trying to make assessments about where to make these investments and to make choices which have traditionally been sort of aesthetic or intuitive. And I think data can once again be very helpful in that domain.
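As an illustration of how a publisher might, in principle, combine intrinsic features like those above into a single score, here is a minimal sketch using a logistic regression classifier. The training rows, feature names, and labels are entirely hypothetical; this is not the lab's model, and it says nothing about its reported 75% accuracy.

```python
# Minimal sketch: score a manuscript with intrinsic text features.
# Data and feature values are hypothetical, for illustration only.
from sklearn.linear_model import LogisticRegression

# Each row: [dialogue_share, urgency_rate, rough_character_count]
X_train = [[0.42, 3.1, 61], [0.37, 2.8, 55], [0.18, 0.9, 30], [0.22, 1.1, 34]]
y_train = [1, 1, 0, 0]  # 1 = bestseller, 0 = not

model = LogisticRegression().fit(X_train, y_train)

new_manuscript = [[0.40, 2.5, 58]]
print(model.predict_proba(new_manuscript)[0][1])  # estimated bestseller probability
```

In practice such a score would sit alongside, not replace, the editorial and marketing judgments Andrew describes.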

Zalina: It does seem like publishers already try to do that, looking at the industry, seeing what's working, and trying to replicate it, but they don't have much accurate data or anything concrete to base those predictions on. So, this would be a way of making it a little more scientific, but in tandem, obviously, with the other factors that go into that decision-making.

Andrew: I think what they haven't been able to do yet really successfully is get inside the books. I have no doubt there are all sorts of metrics which try to look at readerships, or social media, or the genre or keywords that might be associated with the book, the author's background, these kinds of things.

And I'm sure those are all very useful ways of trying to think about how a book might do. And what this kind of research, what our research, really starts to let people begin to study is, you know, what's going on inside that book.

Which books is this book going to connect up with? Which types of readers is this book going to connect with based on what it's talking about? And I think that's a really, really powerful new tool to think about how to reach readers.

Zalina: Do you think your research could ever be extended to any of those external variables?

Andrew: I think they're definitely there. And in fact, what makes studying contemporary literature, in some ways, more interesting or dynamic than historical literature, which is also one of my focuses, is that this extrinsic or sort of social data, as we might call it, is just so much more available and abundant, and changing.

So, in our lab, we've talked a lot about, you know, what happens when you start to add in features about social media, so Goodreads scores, Goodreads ratings, and again, not just quantitative things, like ratings numbers, or how many reviews there are. But actually, what are those reviews saying, what are they focusing on, and what is that telling us about what readers are looking for?

So, there's a lot of information there, too, that could be mined, using the same kind of techniques. And so when you begin to connect up these features, the book has these kinds of topics, or tendencies, or aspects that it's focusing on.

And we know readers are kind of interested in these types of focuses and features, and so forth. You really can get a more robust picture of how, again, how a book is going to connect with readers, what kind of audiences it's going to be most suitable for, how large those audiences are.

So, yeah, I think understanding the inside of books is a piece of the puzzle. But the same techniques can give you a really rich portrait, I think, also of your readerships, which are getting more public and visible. In the old days, the myth was the sort of lonely reader, silent on a couch, and you didn't know anything about that person.

I think people are much more public and invested in sharing their reading experiences. And that obviously gives all of us more of an understanding about what motivates readers, what makes readers enthusiastic, what makes them get attached to books.

Zalina: And I think, I mean, ebooks, or e-reader manufacturers are already collecting a lot of that data on how people read books, what they're reading, when they stop reading something, how quickly they're reading, things they highlight. So, there's definitely...we're getting a larger segment of data to look at.

So, your research on bestsellers focused on American bestsellers, ones that were on the "New York Times" fiction list. Do you have any research or speculation on variance between countries? I mean, obviously, we would be interested in Canadian bestsellers, but also maybe nonfiction or juvenile titles?

Andrew: It's a very good question. One of the things we're finding is that the Canadian bestseller lists, at least the ones we've looked at so far, seem to track quite closely with the US ones, which probably won't come as a surprise to many listeners. There were a few slight differences, but overall, they really seem to be fairly close to one another.

We don't know too much about the rest of the world, which is actually a kind of new pilot project we're trying to get off the ground, to figure out ways to study those differences. Because I think it is a very interesting question about different cultural orientations to bestselling writing, and how different that is. But we haven't gone too far on that road and right now, we're focusing exclusively on fiction.

Zalina: Can you identify patterns changing over time? It sounds like, I mean, you had a dataset covering a decade, and you identified patterns from that, but it would be interesting to extrapolate the data in some way to see how things might be changing over time.

Andrew: Yeah, we became really interested in this. And so far, honestly, we haven't seen much change. So, we looked at this past year. So, our dataset stopped in 2014. So then we looked at 2015. And if anything, we saw the most dominant trends from the previous decade only getting stronger.

So, more crime, more violence, more legal drama, and a lot more technology. In fact, it was really the technology that seemed to be the feature that jumped out the most. So, more screens, more texting, more phones, more videos; anything you can think of that has to do with communications technology seemed to be even more prevalent this past year than in previous years.

One of the other...I mean, I feel like we are interested in digging a little bit further down into that past decade to see if that itself breaks up by time. As I said before, I think it probably is more of an example where you're going to find different types of writing within the bestselling group that aren't going to be quite so much tied to time.

I've heard some people say things along the lines that bestselling writing is very much of its moment. What interests me about the uniformity of the features we're finding, the continuity, is whether that moment is longer than we tend to think, whether year-to-year may not be the best timeframe, or whether there's just a whole lot more thematic and stylistic stability to bestsellers than we've previously thought. So, that's something we want to look more into.

Zalina: Because I guess I would think that major social movements, or current events, or the zeitgeist would affect these patterns, but maybe that's something that would be more applicable to, say, nonfiction titles, and fiction is less affected, or at least steadier.

Andrew: It's possible. Again, I wonder whether the timeframes, when we think about these sorts of historical moments, whether actually, we often register them in broader scales so that a decade may be too small of a window almost, and that really, these things happen in 20 or 30-year cycles.

Right now, what it really looks like is that kind of CSI effect, the sort of just detective, technology-based, action-driven drama story. Even when you reframe it as a historical novel, or put some other window dressing on it, that's really what continues to jump out, and that hasn't gone anywhere. That just seems to be really dominant, and obviously, from what we know, from popular culture, it's only getting stronger, too.

Zalina: Well, you can collect data for the next century or so. I think that will be fascinating, if I'm around to see it.

Andrew: Me too.

Zalina: So, I'd like to talk a bit about something I read recently about a challenge the lab put to writers in Quebec, where they had to write short stories based on guidelines produced by this research. Can you tell me a little bit more about that: what the mission behind it was, the idea, and what it was like reading those stories?

Andrew: Yeah, the idea was actually generated in a conversation with this year's Giller Prize winner, André Alexis. So, you may or may not have read it on my blog, but I had the strange honour of actually predicting this year's Giller Prize winner.

But then when I went ahead and read the novels on the shortlist, I changed my answer, because I thought the committee would never vote for his novel because it's very challenging. It's a wonderful book. It's completely interesting. But it's very experimental. And so I just didn't see it happening.

And so I went ahead and reconfigured the algorithm to take into account extreme outliers and that meant, it ended up predicting a different novel. So, I had this very sort of dubious distinction of getting it right, and then using my own judgment to ruin my own success. Sorry, go ahead.

Zalina: The data doesn't lie.

Andrew: Yeah, well, it's really not an unfamiliar story to many people. It's maybe a side story, but what it really tells us is that what we're doing is modelling human behaviour.

And we can do that in many different ways. And so I had an initial model trying to understand which novel might be successful based on how different it was from previous years' winners. So, the idea was they may be looking for something that's fresh and new.

And then I had a kind of second model, where I thought about committee behaviour, and I said, "Oh, well, but they won't vote for something that's really too different." So, I added another kind of filter into the algorithm.

And so what we always have to keep reminding ourselves is that the data is not human-independent. It's very much tied to our values, and our judgments, and our beliefs about how the world works. And those beliefs can be more or less correct. And in my case, it started as a better model and got worse as I thought about it more. But it's an important kind of thing to keep in mind.
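Andrew's two models can be sketched in a few lines: score each shortlisted novel by how far it sits from the average of past winners, then optionally add a committee-behaviour filter that drops extreme outliers before choosing. Everything below, the feature vectors, the titles, the outlier threshold, is made up purely to illustrate how adding that second assumption can change the prediction.

```python
# Toy sketch of the two-stage idea described above; all numbers are invented.
import numpy as np

past_winners = np.array([[0.2, 1.0, 30], [0.25, 1.2, 28], [0.22, 0.8, 33]])
shortlist = {"Novel A": np.array([0.3, 1.1, 35]),
             "Novel B": np.array([0.8, 2.9, 70]),   # an extreme outlier
             "Novel C": np.array([0.35, 1.5, 40])}

centre = past_winners.mean(axis=0)
distances = {title: np.linalg.norm(vec - centre) for title, vec in shortlist.items()}

# Model 1: pick the freshest (most different) novel.
print(max(distances, key=distances.get))

# Model 2: a committee-behaviour filter that drops extreme outliers
# (here, anything more than twice the median distance) before picking.
median = np.median(list(distances.values()))
filtered = {t: d for t, d in distances.items() if d <= 2 * median}
print(max(filtered, key=filtered.get))
```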

So, on the idea of the Le Devoir challenge: Alexis and I began corresponding about this, and he expressed a lot of interest in this bestseller research. He said, "I'd really like to learn more about it."

And I initially thought he wanted to know how to sell more books, and maybe a prize wasn't enough for him, or something like that. And he corrected me and said, "No, no, this is really a creative exercise. I want to use it as an opportunity to think about writing a novel and what the implications would be for me as a writer to try and write like a bestseller."

I thought that was a really interesting idea and also really inspiring for me that this kind of research could be useful for creative writers. I hadn't anticipated that, and I thought that was a really nice feature.

And so once the word began to travel around, we ended up talking with the books editor at Le Devoir, who got super interested in it and wanted to ask Quebec writers if they wanted to participate, and they jumped at the opportunity.

I think people had a lot of fun with it. And again, for me, it was just a really nice sign that, again, data is not this kind of empirically, cold, analytical thing, but it can also be very creative and generative.

Zalina: Did you read all the stories? What was it like reading them?

Andrew: Yeah, they're fun. I mean, most of them were ironic to a degree. So, they were very self-conscious about trying to write, like, a bestseller. So the task was, we gave them this sort of cheat sheet of, like, "These are the features you should try and emphasize."

And they had to write the short stories along those lines. And then we went ahead and kind of rated them and said, "How well could we predict whether this thing would be a bestseller or not?" And the two things that jumped out at me: one was that most people failed, which maybe from their point of view was a good thing.

But the stories were also, like I said, sort of parodies or ironic, but at some level they were in many ways about some aspect of identity, something they were uncertain about. The protagonist in each of the stories was upset or unstable about something: one character was a transgender character, another was imagining herself in some sort of "Star Wars"-like fantasy that was actually about a conflict with her parents.

Another person was a potentially murderous inheritor of a big estate. And they each kind of wrestled with something that they couldn't fully understand. And the one I liked the most was about a writer who couldn't finish her own story. And she was wrestling with these constraints that clearly sounded like the tables we had sent along. And so it clearly influenced some of their creative thinking.

Zalina: Interesting. Well, it definitely sounds like a very interesting challenge for the participating writers and any other writers who would want to do something similar. So, that's just writers. What about readers? Could this research be used to benefit them?

Andrew: We think so. We think so. I mean, I think the idea is, as I was saying before, that we're hoping to serve different populations with this. So, it could be interesting for the industry to have access to this information.

It's hopefully going to be interesting for creative writers to think about their writing, to think about the markets that they might want to be writing for, the audiences they might want to be writing for and to learn more about those audiences and those audience expectations.

And I think we're also hoping, and this is sort of a new aspect that I've recently become more interested in, to think about that kind of reader experience. And I think, traditionally, we thought about books as these kinds of blank slates, these sort of pristine objects, and you read it, and you have this very imaginative engagement with the text, and it's very absorptive, and you sort of lose yourself in the book.

And I think that's one kind of reading, and it's certainly an important one. But there are many other ways that we are asked to read and that we need to read, sometimes in educational settings, sometimes in workplace settings, and that can be much more analytical. And I think these kinds of tools, this kind of data, this kind of research can actually be really useful at that level.

So, you can imagine, in an educational setting, whether it's in secondary school or even at the level of higher education, where instead of just this kind of blank slate, this kind of clean text, what students are presented with is a little more of an analytical overview of the text.

So, if it's a secondary school, you're providing students with character lists that they can annotate, or maybe those characters are already annotated with kind of words or themes that are associated or values that are associated with those characters.

You can create the social network, so you have an understanding of who's connected to whom and where much of the social tension in the storylines lies, or even focus on distinctive vocabulary or phrases, or stylistic features that are unique to that book.

It gives students a way into the book that is more analytical and self-conscious, as a kind of conversation starter, rather than hoping that they just have these ideas on their own, which, as someone who's been in education for over a decade now, I know doesn't happen very successfully all by itself.

And so we're really hoping that adding these features into the reading experience isn't, for me, a distraction, so much as it provides new kinds of insight and conversation starters for beginning to think critically and analytically about something you're reading.
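The character-network idea Andrew mentions can be illustrated with a toy sketch: treat characters that appear in the same sentence as connected, and build a weighted graph of who interacts with whom. The character names, sample text, and sentence-level co-occurrence heuristic are assumptions for the example, not the lab's method.

```python
# Rough illustration of a character social network built from co-occurrence.
# Names, text, and the sentence-level heuristic are assumptions for the sketch.
import itertools
import re
import networkx as nx

characters = {"Anna", "Marco", "Lucia"}
text = ("Anna met Marco at the station. Marco waited for Lucia. "
        "Anna and Lucia argued about the letter.")

graph = nx.Graph()
graph.add_nodes_from(characters)

for sentence in re.split(r"[.!?]", text):
    present = {c for c in characters if c in sentence}
    for a, b in itertools.combinations(sorted(present), 2):
        # Count each shared sentence as one unit of connection strength.
        weight = graph.get_edge_data(a, b, {}).get("weight", 0) + 1
        graph.add_edge(a, b, weight=weight)

print(graph.edges(data=True))  # who is connected to whom, and how strongly
```

In a classroom, a graph like this could be the annotated character list Andrew describes: a conversation starter about the relationships and tensions in the story rather than a replacement for reading it.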

Zalina: While we've talked a lot about the benefits and the potential uses of this kind of quantitative analysis and the application of this research, there does seem to be a lot of resistance to applying quantitative analysis to works of art. Have you received any criticism around your research, that it devalues literature, or anything like that?

Andrew: Oh, lots. There's no end of criticism in the world. No, I think, yes, I mean, I think people are resistant to it for some understandable reasons and other reasons that are probably a little more negative. There was a sociologist, Pierre Bourdieu, who said, "The question we need to be asking ourselves here is, in whose interest is it to say that art is fundamentally unknowable?"

And so, what he was trying to point at was that, oftentimes, these statements or these beliefs are there to sort of shore up somebody's authority. So, if this thing is kind of an unknowable object and needs to be venerated, only those people who have special access to it can talk about it.

And our hope is that when you add this more analytical, computational approach to these documents, to culture and literature, you kind of open up the process and make it, I would argue, a little more democratic. You make more explicit the questions you're asking, the data you're looking at, the models you're using, the way you're constructing your analysis.

So, it puts a lot more of that on the table as something explicit and shareable, something we can talk about, as opposed to saying, "These things are off-limits. These are things we can't talk about." And when people say, "We can't talk about it," they usually mean, "Only I can talk about it, and only in this particular way."

And so we kind of feel like analytics can be a useful way of expanding who can have that conversation and what can be talked about, making a little more explicit what people are talking about when they do talk about these things. So, I think it's a really important new direction in thinking about how we study art and culture.

I actually think there's a very strong public interest in it. And I don't want to just emphasize the negative at the end, but also really confirm with people that... I think people sense that this is a really interesting way to think about art and literature and culture that fits within the kind of moment that we're in right now, where we're using computation to study all sorts of things: how the brain works, how our behaviour works, how the universe works.

And then it makes sense to begin to use these resources to study how culture works. And so for me, it's actually a very exciting moment, and one that connects a lot, I think with the sort of general public interest about sort of enlivening the humanities and enlivening literary studies. So for me, it's a very positive experience actually in many ways, despite the concerns or criticisms that are sometimes raised.

Zalina: Well, it will definitely be interesting to see what else comes out of this, well, out of .txtlab and the work you're doing. Where can listeners go to learn more about the projects you're working on?

Andrew: Well, they should follow us at our blog, which is textlab.org. And we post regular updates there. I also occasionally write for a blog called "Culture After Computation," which also has information about new articles and new essays, or thoughts that we're posting about how this is beginning to change our thinking about culture.

Zalina: I'd like to offer a big thank you to Andrew for taking the time to join us this month. You can check out the podcast description for links to his blog so you can learn more about the work they're doing at .txtlab. And if you'd like to learn more about the work BookNet does, you can find us at booknetcanada.ca.

We gratefully acknowledge the financial support of the Government of Canada through the Canada Book Fund. And of course, thanks to you for listening.