Much recent attention has been paid to the addressability of book content on the web, with a “Publishing Hackathon” in New York and HarperCollins’ creation of an API-fueled hackathon, the “Programming Challenge.” Both received a mix of criticism and praise; nonetheless, they are a good start. But in the rush to entice a more technically savvy audience, I think publishers are missing a more elemental approach: borrowing simple and well-established web standards.
Technically, EPUB books (particularly EPUB3) and Kindle books can be considered constrained, packaged web sites, or content bundles containing a variety of web expressions. Seen that way, a publisher need not conceive of its books as assets to be addressed via an API, as if they were inventory on a grocery shelf; it becomes possible to bring more lightweight tools to bear on discovery and access. As is so often the case, lessons can be learned from the years of maturation of plain old web site discovery standards, and from practices in the academic journal community.
A simple example illustrates the case. At the not-for-profit where I work, Hypothes.is, we’re building software to support annotation of web documents. One of the obvious challenges is to make sure that comments on one version of an item appear on a related version; the problem is akin to a reader wanting their comments on one edition of Huck Finn to appear on a second printing of that title, even if it has a different ISBN.
One of our contributors, Ed Summers of the Library of Congress, penned an excellent introduction to some of the core prescriptions, “Guidance for Web Publishers.” Substitute “ebook” for “web” and the utility is remarkably the same. The post should be instructive for a wide range of publishers.
For example, consider a reader’s annotations or comments on a New York Times article. Any comments made in single-page mode should also appear on the paginated version, where an article by default spans two, three, or four separate web pages, each with a different URL. The Times makes this easy because it uses a common standard promulgated by Google and other search engines called a “canonical URL,” which is now an IETF RFC (RFC 6596). Here’s one from a sample article that has two pages:
<link rel="canonical" href="http://www.nytimes.com/2013/06/12/world/europe/greece.html?pagewanted=all">
This line appears in both the paginated and full-text versions. It tells search engines to treat the canonical URL as the one that “counts” for SEO, and it provides a common addressable location. Book publishers should be using canonical URLs to highlight the URL for the book that they want to anchor not just SEO, but social commentary, online references in reviews, library catalogs, and other uses. This is true even if the book is not openly readable on the web: simply having a single “resource identifier” provides an anchorage that both software and users can point to in common. It is the equivalent of a post office’s standardized address.
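To make the idea concrete, here is a minimal sketch of how annotation software might use that line, written in Python with only the standard library. It is an illustration under stated assumptions, not how Hypothes.is or the Times actually does it: the function name `canonical_url`, the `annotations` dictionary, and the two "served" URLs are hypothetical; only the canonical URL itself comes from the example above.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may carry several space-separated tokens, in any letter case.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attributes.get("href")


def canonical_url(html, fallback_url):
    """Return the page's declared canonical URL, or fallback_url if none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical or fallback_url


# The canonical URL is the one quoted in the article; the served URLs
# below are illustrative assumptions about what a reader might have open.
CANONICAL = "http://www.nytimes.com/2013/06/12/world/europe/greece.html?pagewanted=all"
PAGE_HEAD = '<head><link rel="canonical" href="%s"></head>' % CANONICAL

annotations = {}  # hypothetical store: canonical URL -> list of notes
for served_url in (
    "http://www.nytimes.com/2013/06/12/world/europe/greece.html",
    "http://www.nytimes.com/2013/06/12/world/europe/greece.html?pagewanted=2",
):
    key = canonical_url(PAGE_HEAD, served_url)
    annotations.setdefault(key, []).append("note made while viewing " + served_url)
```

Because both views declare the same canonical URL, both notes end up filed under one key, which is the behavior a reader would expect. A page with no canonical link simply falls back to the URL it was served from.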
Another example of a common web standard that book publishers should adopt is the alternate link statement. Many academic journals use alternate links to associate a PDF version of an article, for example, with a native HTML version of the same article. This makes it far easier for annotation software to show annotations made on one version alongside another. Most newer academic journals, such as eLife, PLoS One, and PeerJ (which turned one year old this week), also use simple web naming conventions that augment discoverability.
For example, the first article PeerJ published is “How long is a piece of loop” – rather more technical than the title might suggest to a lay reader. The URL for the web content is: https://peerj.com/articles/1, and the URL for the PDF content is: https://peerj.com/articles/1.pdf. But even more importantly, the HTML version of the article contains this string:
<link rel="alternate" type="application/pdf" href="http://peerj.com/articles/1.pdf">
That “alternate link” is a standard part of HTML, and should be used by every publisher who wants to alert readers that an alternative format of a book exists in another location. Canonical URLs and alternate links can be used to tremendous effect in allowing a publisher to tie together many different instances of books.
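As a rough sketch of how a tool could follow those pointers, the Python below collects every alternate link on a page and indexes it by media type. The names `AlternateLinkParser` and `alternate_formats` are my own, and the choice to keep only the first link per media type is an assumption made for brevity; the PeerJ markup is the string quoted above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AlternateLinkParser(HTMLParser):
    """Collects <link rel="alternate"> targets, keyed by their media type."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.alternates = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        media_type = attributes.get("type")
        if "alternate" in rel_tokens and href and media_type:
            # Resolve relative hrefs against the page URL; keep the first
            # link seen for each media type (a simplifying assumption).
            self.alternates.setdefault(media_type, urljoin(self.base_url, href))


def alternate_formats(html, base_url):
    """Return {media type: absolute URL} for the page's alternate links."""
    parser = AlternateLinkParser(base_url)
    parser.feed(html)
    return parser.alternates


peerj_head = '<link rel="alternate" type="application/pdf" href="http://peerj.com/articles/1.pdf">'
formats = alternate_formats(peerj_head, "https://peerj.com/articles/1")
pdf_url = formats.get("application/pdf")
```

An annotation tool that finds a PDF listed this way can treat the PDF and the HTML page as two renditions of the same work, rather than as unrelated documents.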
Using standards like these in web-addressable catalogs (such as OPDS catalogs), on publisher web sites, and inside EPUB files can help a wide range of web discovery tools and programs, without having to resort to heavier-weight APIs, which often carry naive licensing restrictions.
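On the publishing side, emitting these links takes very little machinery. The following is a hedged sketch of a helper a publisher's site template might call for a book's landing page; the function name `book_link_elements` and the example.com addresses are hypothetical, and it covers only HTML pages, not OPDS or EPUB packaging.

```python
from html import escape


def book_link_elements(canonical_url, alternates):
    """Build the <link> lines for a book page's <head>.

    canonical_url: the single address the publisher wants cited.
    alternates: iterable of (media type, URL) pairs for other formats.
    """
    lines = ['<link rel="canonical" href="%s">' % escape(canonical_url, quote=True)]
    for media_type, href in alternates:
        lines.append(
            '<link rel="alternate" type="%s" href="%s">'
            % (escape(media_type, quote=True), escape(href, quote=True))
        )
    return "\n".join(lines)


# Hypothetical book page on a placeholder domain.
head_links = book_link_elements(
    "https://example.com/books/huck-finn",
    [
        ("application/epub+zip", "https://example.com/books/huck-finn.epub"),
        ("application/pdf", "https://example.com/books/huck-finn.pdf"),
    ],
)
```

Escaping the attribute values matters mostly for URLs that carry query strings, where a bare ampersand would otherwise produce invalid markup.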
In fact, the weight of focus among higher-end library and publisher discovery services is on “index-based search” and away from parochial API-based solutions, where multiple sources of information must be queried and then inter-related – exactly the problem these web standards help to address. NISO is supporting an Open Discovery Initiative to help foster further agreement among publishers and content search providers serving the library market. There’s much that both trade publishers and retailers could learn from, and contribute to, this effort.
Thinking about ebooks as web sites, and considering the kinds of simple metadata and link references that can be embedded within them, can help address a range of discovery issues with simple solutions. The necessity of adopting the breadth of web standards will only become more pressing, as publishers begin to recognize the desirability of providing a greater number of independent options for consuming ebooks beyond a handful of retailers.