So much foundational work is happening now, ensuring the solidity of the bedrock of the Library 2.0+ meme. The future of publishing is indeed wide open.
Here’s a reading list of links to summarize the discussion at the Reading 2.0 summit held today in San Francisco:
- Organizer Peter Brantley of California Digital Library opened the meeting. His blog.
- I talked about Web 2.0 as it applies to the future of publishing. In particular, I talked about how ideas such as harnessing collective intelligence, the perpetual beta (dynamic content), the long tail, remixing, and open formats are being used on the consumer internet, and how these ideas ultimately need to be applied to the world of books and scholarship as well. I also talked about the idea that “worse is better” systems seem to propagate better than carefully designed systems that cross all the ‘t’s and dot all the ‘i’s. I used Bill Joy’s phrase “the future doesn’t need us” to remind participants that existing models have no guaranteed persistence, and suggested that competition to the book will come from very different forms of content that do similar “jobs” as particular types of books.
- John Kunze of California Digital Library talked about Archival Resource Keys used by the CDL, also known as ARK identifiers. In brief: “The ARK identifier is a specially constructed, globally unique, actionable URL….Entering the ARK in a web browser produces an object. Entering the ARK followed by a single question mark (“?”) produces the metadata only. Entering the ARK followed by a double question mark (“??”) produces a CDL commitment statement.” I like the ? and ?? idea as a hack to overload traditional URLs more than I like the whole idea of a registry for persistent URLs. (Related, but not discussed at the conference: the way Connotea bookmarks are always followed by an (info) link that provides metadata.) A “commitment statement” is metadata about the permanence of the item — a lot of librarians are concerned about preservation, and this is a way to share the commitment to maintaining an item.
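The `?`/`??` overloading above is simple enough to sketch in a few lines. This is a toy illustration, not CDL's resolver code; the resolver hostname and the ARK itself are made up for the example.

```python
def ark_urls(naan: str, name: str, resolver: str = "https://example.org"):
    """Return the three URL forms of the ARK convention described above."""
    base = f"{resolver}/ark:/{naan}/{name}"
    return {
        "object": base,             # plain ARK: returns the object itself
        "metadata": base + "?",     # single "?": returns metadata only
        "commitment": base + "??",  # double "??": returns the commitment statement
    }

# Hypothetical NAAN and name, for illustration only.
urls = ark_urls("13030", "tf5p30086k")
print(urls["metadata"])
```

The appeal of the hack is that all three forms remain ordinary URLs, resolvable by any browser without special client software.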
- Herbert van de Sompel of Los Alamos National Laboratory presented his OpenURL framework. Briefly: “Citations to other works are familiar to any scholar: they ground a work of scholarship to a field of study, put new research into context, and often give credit where credit is due. The essence of citation is to identify the previous work with a set of metadata: author, title, and particulars of publication. The idea behind OpenURL is to provide a web-based mechanism to package and transport this type of citation metadata so that users in libraries can more easily access the cited works. Most typically, OpenURL is used by subscription-based abstracting and indexing databases to provide linking from abstracts to fulltext in libraries. An institutional subscription profile is used together with a dynamic customization system to target links at a user’s OpenURL linking service.” But more specifically, one of the types of services enabled by OpenURL metadata is navigation of the various permissions granted to various institutions by their subscriptions to scholarly content. The OpenURL services thus create a kind of transparent rights management layer. Here’s a list of the metadata formats most often used with OpenURL.
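To make the “package and transport citation metadata” idea concrete, here is a minimal sketch of an OpenURL 1.0 key/encoded-value query string aimed at a hypothetical institutional link resolver (the resolver hostname is an assumption; the `rft.*` keys follow the journal-article format named in `rft_val_fmt`):

```python
from urllib.parse import urlencode

# Citation metadata packaged as OpenURL 1.0 key/encoded-value pairs.
citation = {
    "url_ver": "Z39.88-2004",                       # OpenURL 1.0 version tag
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",  # "referent" is a journal article
    "rft.genre": "article",
    "rft.aulast": "van de Sompel",
    "rft.jtitle": "D-Lib Magazine",
    "rft.date": "1999",
}

# The base URL belongs to the user's own institution's resolver (hypothetical here),
# which is what lets the same citation resolve to different copies at different schools.
resolver = "https://resolver.example.edu/openurl"
link = resolver + "?" + urlencode(citation)
print(link)
```

The institution-specific resolver base is the key design choice: the citation travels as metadata, and each library decides which licensed copy to hand its own users.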
COinS, or ContextObjects in Spans, are a way of embedding OpenURL citations directly in HTML without using the associated lookup service.
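The embedding convention is just an empty `span` with class `Z3988` whose `title` attribute carries the same key/encoded-value string an OpenURL would. A small sketch of generating one (helper name is mine, not from any COinS library):

```python
import html
from urllib.parse import urlencode

def coins_span(metadata: dict) -> str:
    """Wrap OpenURL citation metadata in a COinS span (class "Z3988")."""
    kev = urlencode(metadata)  # same key/encoded-value form as an OpenURL query
    return f'<span class="Z3988" title="{html.escape(kev)}"></span>'

span = coins_span({"rft.genre": "article", "rft.date": "1999"})
print(span)
```

A browser plugin or aggregator that finds such spans can route the citation to the reader's own link resolver, without the page author having to know which resolver that is.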
- Adam Smith, product manager for Google Book Search and Google Scholar, points out how Google Scholar supports OpenURL, and notes that Scholar really shows a possible future for GBS if publishers do start providing their own digital content — the current scanning initiative is to bootstrap the acquisition of content that’s not currently in digital form. But in the scholarly world, the content is already generally digital, and Google has taken on its traditional role as a search engine, albeit in a world that requires something like OpenURL to navigate the permissions that are required for much of this content.
- Chad Dickerson, now at Yahoo!, lives on the consumer side. He agrees: worse is better. He talks about Yahoo!’s use of microformats, simple conventions for embedding semantic markup in HTML documents. Examples discussed: hCalendar as used by upcoming.org, Creative Commons Search, rel=nofollow, and rel=tag.
- Herbert was back up, talking about OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting. This is a protocol for managing metadata about additions, deletions, and updates to XML collections, allowing for the easy creation of mirrors. A harvester first asks which metadata formats are supported (Dublin Core, MARC/XML, MPEG-21, or METS). Once a format is chosen, it asks for all time-stamped updates to the metadata, then potentially makes selective data requests.
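The two-step exchange described above maps onto two OAI-PMH verbs. A sketch, building the request URLs only (no network calls; the repository base URL is hypothetical):

```python
from urllib.parse import urlencode

BASE = "https://repository.example.org/oai"  # hypothetical repository endpoint

def oai_request(verb: str, **args) -> str:
    """Build an OAI-PMH request URL for the given protocol verb."""
    return BASE + "?" + urlencode({"verb": verb, **args})

# Step 1: ask which metadata formats the repository supports.
formats_url = oai_request("ListMetadataFormats")

# Step 2: harvest all records updated since a given date, in the chosen format.
# ("from" is a Python keyword, so it is passed via a dict.)
harvest_url = oai_request("ListRecords", metadataPrefix="oai_dc",
                          **{"from": "2006-01-01"})
print(harvest_url)
```

Because everything rides on plain HTTP GETs with time-stamped incremental updates, a mirror can stay current by re-running step 2 with a newer `from` date.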
Brewster Kahle notes that Yahoo! monitors the activity on the Internet Archive via OAI-PMH. But more importantly, it’s a method for accessing “the deep web.” Lorcan Dempsey also talks about the way that OAI is used to share info about repositories of theses and dissertations.
- Lorcan Dempsey of OCLC talked about FRBR, Functional Requirements for Bibliographic Records, and FictionFinder as a demonstration of its utility. Which “Huck Finn” do you want? Very cool. You have a work, realized through an expression (e.g. an illustrated edition, a Spanish edition, an abridged edition, a spoken word edition), embodied in a manifestation (the 1954 Penguin edition), and an item (an actual copy of that manifestation). An adaptation, on the other hand, would be considered a different work. Only items actually exist. Everything else is a concept, metadata about that item that helps us to categorize it. FictionFinder not only lets you find different manifestations of a work, it also lets you do cool metadata search: find me bildungsromane taking place on the Mississippi, or detective novels taking place in Columbus, OH. Not just the usual metadata! Apparently an improved version of FictionFinder will be out in a month or so.
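The FRBR hierarchy is easy to picture as nested data. A minimal sketch using the Huck Finn example; the field names are illustrative, not taken from the FRBR specification:

```python
from dataclasses import dataclass

@dataclass
class Work:              # the abstract creation, e.g. "Huckleberry Finn"
    title: str

@dataclass
class Expression:        # a realization of the work, e.g. a spoken-word version
    work: Work
    form: str            # "text", "spoken word", "illustrated", "abridged", ...

@dataclass
class Manifestation:     # a published embodiment, e.g. the 1954 Penguin edition
    expression: Expression
    publisher: str
    year: int

@dataclass
class Item:              # an actual copy -- the only level that physically exists
    manifestation: Manifestation
    barcode: str

huck = Work("Adventures of Huckleberry Finn")
spoken = Expression(huck, form="spoken word")
penguin = Manifestation(Expression(huck, "text"), publisher="Penguin", year=1954)
copy1 = Item(penguin, barcode="39015012345678")  # hypothetical barcode
```

An adaptation would be a separate `Work` entirely, not another `Expression`, which is exactly the distinction FictionFinder exploits when it groups editions.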
In a takeoff on my language, Lorcan notes that “OCLC is a way of harnessing the collective intelligence of catalogers.” It’s a database of about 47 million works (28 million print books), in 60 million manifestations. On average, there are 1.3 manifestations of a work — i.e. most works get only one manifestation. 87% have only one; 12% between two and five, and only 1% have more than 5.
In some ways, the number of libraries holding a book can be seen as a kind of PageRank for the popularity of books (at least among librarians, if not among their customers :-). Some services based on this idea: Top 1000 works by library holding (The top ten: The Bible, the US Census, Mother Goose, the Divine Comedy, the Odyssey, The Iliad, Huck Finn, Lord of the Rings, Hamlet, Alice in Wonderland.) Audience Level, a service that estimates the audience level of a book by measuring the type and number of libraries that hold it. (For example, a book held by a public library receives a different score than one held by a research library.)
Given an ISBN, the xISBN service finds alternate editions of the same work. Nice.
- Bill McCoy, Adobe: We shape our product thinking by understanding that there are four levels of content: level 0 represents content as actual bits such as ink on paper or pixels on a screen, level 1 gives final form representation (e.g. pdf) that is faithful to the bits but may be scalable etc., level 2 gives reflowable presentation (e.g. html + css), and level 3, data, gives a real separation of content and presentation. He presents some non-bloggable ideas from the Adobe labs about how Adobe is thinking about moving content through those levels.
- John Mark Ockerbloom of the University of Pennsylvania discusses his work on communities around preservation of content, and particularly focuses on the importance of preserving evanescent content such as periodicals, which give a window on an era. Hopes to see everything in the public domain becoming available online. In order to do that, we need to know what’s out of copyright.
He describes his work on the catalog of copyright entries, and points to a directory of first copyright renewals for periodicals: “Most periodicals published in the US prior to 1964 had to renew their issue copyrights after 28 years in order to retain copyright on the issue….Below is a list of periodicals and their first copyright renewals, if any. The list below should include all of the more than 1000 periodicals that renewed between 1950 and 1977, and selected periodicals that renewed between 1978 and 1992. (After 1992, copyright renewal was no longer required.)”
- Juliet Sutherland of Distributed proofreaders: “This site provides a web-based method of easing the proofreading work associated with the digitization of Public Domain books into Project Gutenberg e-books. By breaking the work into individual pages, many proofreaders can be working on the same book at the same time. This significantly speeds up the proofreading/e-book creation process….When a proofreader elects to proofread a page of a particular book, the text and image file are displayed on a single web page. This allows the page text to be easily reviewed and compared to the image file, thus assisting the proofreading of the page text.” Volunteers have produced over 8100 titles (periodicals or books); another 4000 in progress. We need to get beyond scanned images of works to plain text, which can be remixed and used in other ways. This is one way to get there. Juliet says: “Come do your page a day!”
There are a lot of similarities to Amazon’s Mechanical Turk in the way tasks are broken into atomic units (one random page at a time) rather than larger units. 400-500 unique volunteers per day. Each physical page gets looked at 4 times. (This is good prior art if Amazon tries to patent the Turk. In addition, Yochai Benkler discusses the issues in splitting up work like this in Coase’s Penguin.)
- Dale Flecker of Harvard talks about the lack of norms in citation: “When you pick up a pointer, there’s no standardized expectation of what you’re going to get.” He points to a bunch of sites that do interesting, but different things. Wishes for a system that gives multi-valued pointers — showing options for reaching different versions of an item.
- John P. Wilkin, University of Michigan: There are a lot of rights, and no easy answers. Describes the methodology used by the University of Michigan to determine the rights status of works being digitized by the UMich library. Seven status categories: public domain (pd), in copyright (ic), out of print and brittle (opb), orphaned because no copyright owner can be found (orph), undetermined (und), open to U Mich by contract (UMall), open to the world by contract (world). Reason codes: bibliographically by copyright date (bib), no copyright notice (ncn), by contract (con), or by due diligence (ddd). Some interesting discussion about the problem of embedded rights — e.g. does the work include other works (e.g. a photo) that have different rights? (At O’Reilly, we’re struggling with this problem right now. We want to put out some of our content under open source licenses, without actually granting our competitors the right to copy the format. Think Hacks books or cookbooks.)
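The UMich vocabulary above amounts to two small controlled lists. A sketch transcribing them as lookup tables (codes from the talk notes; the `describe` helper and its output format are my own, the workflow that assigns codes is not shown):

```python
# Rights status categories from the UMich methodology described above.
RIGHTS_STATUS = {
    "pd":    "public domain",
    "ic":    "in copyright",
    "opb":   "out of print and brittle",
    "orph":  "orphaned (no copyright owner can be found)",
    "und":   "undetermined",
    "UMall": "open to U Mich by contract",
    "world": "open to the world by contract",
}

# Reason codes: how a status determination was reached.
REASON_CODES = {
    "bib": "bibliographically, by copyright date",
    "ncn": "no copyright notice",
    "con": "by contract",
    "ddd": "by due diligence",
}

def describe(status: str, reason: str) -> str:
    """Render a status/reason pair as a human-readable determination."""
    return f"{RIGHTS_STATUS[status]} ({REASON_CODES[reason]})"

print(describe("pd", "ncn"))
```

Separating the *status* from the *reason* it was assigned is the useful design move: it lets a later reviewer revisit every "ddd" (due diligence) determination without disturbing the bibliographically certain ones.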
- Ammy Vogtlander of Elsevier also talks about rights. Because they have a small community, infringement is easy to determine, and rights are determined by contracts between institutions. And because readers are also the authors, managing DRM violations is often as much a PR and community relations issue as a legal one. Would love to see some mechanism developed for search engines to recognize and skip content that has been discovered outside its permitted domain.
- Jon Noring of OpenReader talks about “Open DRM.” He says that the acronym DRM is coming to be a lot like the acronym UFO, where an “unidentified” flying object is actually identified in the popular mind as an alien spacecraft, and thus the term has come to mean something very different from its literal meaning. A copyright notice is a type of DRM. DRM should not be confused with “TPM,” or “technical protection measures.” And TPM doesn’t really work. If something is popular, it will be pirated. Within 48 hours of a Harry Potter book being published [12 hours for the latest book], people have retyped it (correcting errors along the way) and put it into distribution. Meanwhile, the protection mechanisms irritate ordinary users. Publishers are also worried about protection systems controlled by a single vendor. There’s a desire for an open source type system that isn’t controlled by any vendor, and that allows flexible permissions. In discussion, Cliff Lynch points out that Creative Commons is in fact just that: a mechanism of asserting rights, without inconveniencing readers.
- Brewster Kahle of the Internet Archive and the Open Content Alliance offers to be a place to bring together best practices and a place for experiment. He requests more collaboration around issues and concrete projects that show what the digital future can look like. We can use the out of copyright materials as a testbed.
- Cliff Lynch of CNI spoke about the value in the future of collections of content that we can not just read but compute on. Finding and clustering are just the beginning. What new services will we be able to build? We’ll be moving beyond individual people interacting with small groups of texts. He also talks about the difference between knowledge and information sources, which are somewhat fungible, and works of artistic expression, which are much less so. We may end up needing very different methods for dealing with these two types of work.
He also talks about sharing the burden of cleaning up the mess we’ve made in losing track of who owns what. This is so expensive we really don’t want to do it more than once.
Overall, a fascinating meeting. I learned a lot, and urge readers to follow some of the links above and learn about some of the amazing work being done by the library community.