Link List: Reading 2.0 Summit

March 17, 2006

So much foundational work is happening now that is ensuring the solidity of the bedrock of the Library 2.0+ meme. The future of publishing is indeed wide open.

O’Reilly Radar > Link List: Reading 2.0 Summit

By tim on March 16, 2006

Here’s a reading list of links to summarize the discussion at the Reading 2.0 summit held today in San Francisco:

  • Organizer Peter Brantley of California Digital Library opened the meeting. His blog.
  • I talked about Web 2.0 as it applies to the future of publishing. In particular, I talked about how ideas such as harnessing collective intelligence, the perpetual beta (dynamic content), the long tail, remixing, and open formats are being used on the consumer internet, and how these ideas ultimately need to be applied to the world of books and scholarship as well. I also talked about the idea that “worse is better” systems seem to propagate better than carefully designed systems that cross all the ‘t’s and dot all the ‘i’s. I used Bill Joy’s phrase “the future doesn’t need us” to remind participants that existing models have no guaranteed persistence, and suggested that competition to the book will come from very different forms of content that do similar “jobs” as particular types of books.
  • John Kunze of California Digital Library talked about Archival Resource Keys used by the CDL, also known as ARK identifiers. In brief: “The ARK identifier is a specially constructed, globally unique, actionable URL….Entering the ARK in a web browser produces an object. Entering the ARK followed by a single question mark (“?”) produces the metadata only. Entering the ARK followed by a double question mark (“??”) produces a CDL commitment statement.” I like the ? and ?? idea as a hack to overload traditional URLs more than I like the whole idea of a registry for persistent URLs. (Related, but not discussed at the conference: the way Connotea bookmarks are always followed by an (info) link that provides metadata.) A “commitment statement” is metadata about the permanence of the item — a lot of librarians are concerned about preservation, and this is a way to share the commitment to maintaining an item.
  • Herbert van de Sompel of Los Alamos National Laboratory presented his OpenURL framework. Briefly: “Citations to other works are familiar to any scholar: they ground a work of scholarship to a field of study, put new research into context, and often give credit where credit is due. The essence of citation is to identify the previous work with a set of metadata: author, title, and particulars of publication. The idea behind OpenURL is to provide a web-based mechanism to package and transport this type of citation metadata so that users in libraries can more easily access the cited works. Most typically, OpenURL is used by subscription-based abstracting and indexing databases to provide linking from abstracts to fulltext in libraries. An institutional subscription profile is used together with a dynamic customization system to target links at a user’s OpenURL linking service.” But more specifically, one of the types of services enabled by OpenURL metadata is navigation of the various permissions granted to various institutions by their subscriptions to scholarly content. The OpenURL services thus create a kind of transparent rights management layer. Here’s a list of the metadata formats most often used with OpenURL.

    COinS, or ContextObjects in Spans, are a way of embedding OpenURL citations directly in HTML without using the associated lookup service.

    Adam Smith, product manager for Google Book Search and Google Scholar, points out how Google Scholar supports OpenURL, and notes that Scholar really shows a possible future for GBS if publishers do start providing their own digital content — the current scanning initiative is to bootstrap the acquisition of content that’s not currently in digital form. But in the scholarly world, the content is already generally digital, and Google has taken on its traditional role as a search engine, albeit in a world that requires something like OpenURL to navigate the permissions that are required for much of this content.

  • Chad Dickerson, now at Yahoo!, lives on the consumer side. He agrees: Worse is better. He talks about Yahoo’s use of microformats, simple conventions for embedding semantic markup in HTML documents. Discussed as examples: hCal, Creative Commons Search, rel=nofollow, and rel=tag.
  • Herbert was back up, talking about OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting. This is a protocol for managing metadata about additions, deletions, and updates to XML collections, allowing for the easy creation of mirrors. A harvester first asks which metadata formats are supported: Dublin Core, MARC/XML, MPEG-21, or METS. Once a format is chosen, it asks for all time-stamped updates to the metadata, and can then make selected data requests.

    Brewster Kahle notes that Yahoo! monitors the activity on the Internet Archive via OAI-PMH. But more importantly, it’s a method for accessing “the deep web.” Lorcan Dempsey also talks about the way that OAI is used to share info about repositories of theses and dissertations.

  • Lorcan Dempsey of OCLC talked about FRBR, Functional Requirements for Bibliographic Records, and FictionFinder as a demonstration of its utility. Which “Huck Finn” do you want? Very cool. You have a work, realized through an expression (e.g. an illustrated edition, a Spanish edition, an abridged edition, a spoken word edition), embodied in a manifestation (the 1954 Penguin edition), and exemplified by an item (an actual copy of that manifestation). An adaptation, on the other hand, would be considered a different work. Only items actually exist. Everything else is a concept, metadata about that item that helps us to categorize it. FictionFinder not only lets you find different manifestations of a work, it also lets you do cool metadata search: find me bildungsromane taking place on the Mississippi, or detective novels taking place in Columbus OH. Not just the usual metadata! Apparently an improved version of FictionFinder will be out in a month or so.

    In a takeoff on my language, Lorcan notes that “OCLC is a way of harnessing the collective intelligence of catalogers.” It’s a database of about 47 million works (28 million print books), in 60 million manifestations. On average, there are 1.3 manifestations of a work — i.e. most works get only one manifestation. 87% have only one; 12% between two and five, and only 1% have more than 5.

    In some ways, the number of libraries holding a book can be seen as a kind of PageRank for the popularity of books (at least among librarians, if not among their customers :-). Some services based on this idea: Top 1000 works by library holding (The top ten: The Bible, the US Census, Mother Goose, the Divine Comedy, the Odyssey, The Iliad, Huck Finn, Lord of the Rings, Hamlet, Alice in Wonderland.) Audience Level, a service that estimates the audience level of a book by measuring the type and number of libraries that hold it. (For example, a book held by a public library receives a different score than one held by a research library.)

    Given an ISBN, the xISBN service finds alternate editions of the same work. Nice.

  • Bill McCoy, Adobe: We shape our product thinking by understanding that there are four levels of content: level 0 represents content as actual bits such as ink on paper or pixels on a screen, level 1 gives final form representation (e.g. PDF) that is faithful to the bits but may be scalable etc., level 2 gives reflowable presentation (e.g. HTML + CSS), and level 3 is data, with a real separation of content and presentation. He presents some non-bloggable ideas from the Adobe labs about how Adobe is thinking about moving content through those levels.
  • John Mark Ockerbloom of the University of Pennsylvania discusses his work on communities around preservation of content, and particularly focuses on the importance of preserving evanescent content such as periodicals, which give a window on an era. Hopes to see everything in the public domain becoming available online. In order to do that, we need to know what’s out of copyright.

    He describes his work on the catalog of copyright entries, and points to a directory of first copyright renewals for periodicals: “Most periodicals published in the US prior to 1964 had to renew their issue copyrights after 28 years in order to retain copyright on the issue….Below is a list of periodicials and their first copyright renewals, if any. The list below should include all of the more than 1000 periodicals that renewed between 1950 and 1977, and selected periodicals that renewed between 1978 and 1992. (After 1992, copyright renewal was no longer required.)”

  • Juliet Sutherland of Distributed Proofreaders: “This site provides a web-based method of easing the proofreading work associated with the digitization of Public Domain books into Project Gutenberg e-books. By breaking the work into individual pages, many proofreaders can be working on the same book at the same time. This significantly speeds up the proofreading/e-book creation process….When a proofreader elects to proofread a page of a particular book, the text and image file are displayed on a single web page. This allows the page text to be easily reviewed and compared to the image file, thus assisting the proofreading of the page text.” Volunteers have produced over 8100 titles (periodicals or books); another 4000 are in progress. We need to get beyond scanned images of works to plain text, which can be remixed and used in other ways. This is one way to get there. Juliet says: “Come do your page a day!”

    A lot of similarities to Amazon’s Mechanical Turk in the way tasks are broken into atomic units (one random page at a time) rather than larger units. 400 to 500 unique volunteers per day. Each physical page gets looked at 4 times. (This is good prior art if Amazon tries to patent the Turk. In addition, Yochai Benkler discusses the issues in splitting up work like this in Coase’s Penguin.)

  • Dale Flecker of Harvard talks about the lack of norms in citation: “When you pick up a pointer, there’s no standardized expectation of what you’re going to get.” He points to a bunch of sites that do interesting, but different things. Wishes for a system that gives multi-valued pointers — showing options for reaching different versions of an item.
  • John P. Wilkin, University of Michigan: There are a lot of rights, and no easy answers. Describes the methodology used by the University of Michigan to determine the rights status of works being digitized by the UMich library. Seven status categories: public domain (pd), in copyright (ic), out of print and brittle (opb), orphaned because no copyright owner can be found (orph), undetermined (und), open to U Mich by contract (UMall), open to the world by contract (world). Reason codes: bibliographically by copyright date (bib), no copyright notice (ncn), by contract (con), or by due diligence (ddd). Some interesting discussion about the problem of embedded rights, e.g. does the work include other works (such as a photo) that have different rights. (At O’Reilly, we’re struggling with this problem right now. We want to put out some of our content under open source licenses, without actually granting our competitors the right to copy the format. Think Hacks books or cookbooks.)
  • Ammy Vogtlander of Elsevier also talks about rights. Because they have a small community, infringement is easy to determine, and rights are determined by contracts between institutions. And because readers are also the authors, managing DRM violations is often as much a PR and community relations issue as a legal one. Would love to see some mechanism developed for search engines to recognize and skip content that has been discovered outside its permitted domain.
  • Jon Noring of OpenReader talks about “Open DRM.” He says that the acronym DRM is coming to be a lot like the acronym UFO, where an “unidentified” flying object is actually identified in the popular mind as an alien spacecraft, and thus the term has come to mean something very different from its literal meaning. A copyright notice is a type of DRM. DRM should not be confused with “TPM” or “technical protection measures.” And TPM doesn’t really work. If something is popular, it will be pirated. 48 hours after Harry Potter is published [12 hours for the latest book], people have retyped it in (correcting errors along the way) and put it into distribution. Meanwhile, the protection mechanisms irritate ordinary users. Publishers are also worried about protection systems controlled by a single vendor. There’s a desire for an open source type system that isn’t controlled by any vendor, and that allows flexible permissions. In discussion, Cliff Lynch points out that the Creative Commons is in fact just that: a mechanism of asserting rights, without inconveniencing readers.
  • Brewster Kahle of the Internet Archive and the Open Content Alliance offers to be a place to bring together best practices and a place for experiment. He requests more collaboration around issues and concrete projects that show what the digital future can look like. We can use the out of copyright materials as a testbed.
  • Cliff Lynch of CNI spoke about the value in the future of collections of content that we can not just read but compute on. Finding and clustering are just the beginning. What new services will we be able to build? We’ll be moving beyond individual people interacting with small groups of texts. He also talks about the difference between knowledge and information sources, which are somewhat fungible, and works of artistic expression, which are much less so. We may end up needing very different methods for dealing with these two types of work.

    He also talks about sharing the burden of cleaning up the mess we’ve made in losing track of who owns what. This is so expensive we really don’t want to do it more than once.
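
The ARK convention from John Kunze's item above, where a trailing "?" asks for the metadata and "??" for the commitment statement, amounts to adding a suffix to the object's URL. A minimal Python sketch, assuming a made-up resolver host and ARK value (neither is a real CDL identifier) and a hypothetical function name:

```python
def ark_requests(ark_url: str) -> dict:
    """Return the three URL forms described for an ARK identifier."""
    # Strip any trailing question marks so the function accepts any form.
    base = ark_url.rstrip("?")
    return {
        "object": base,             # the object itself
        "metadata": base + "?",     # metadata only
        "commitment": base + "??",  # the commitment statement
    }


if __name__ == "__main__":
    # Hypothetical ARK, for illustration only.
    for kind, url in ark_requests("http://example.org/ark:/12345/x9example").items():
        print(f"{kind:<10} {url}")
```

Nothing in this sketch contacts a server; it only shows how the one identifier yields three distinct requests.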

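The OAI-PMH exchange in Herbert's item above can be sketched as a short sequence of HTTP GET requests. The Python below only builds the request URLs and contacts no server. The repository address, record identifier, date, and function name are placeholders chosen for illustration; ListMetadataFormats, ListRecords, GetRecord, metadataPrefix, and from are the protocol's own names, and oai_dc is its Dublin Core prefix.

```python
from urllib.parse import urlencode


def oai_request(base_url: str, verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    args = {"verb": verb}
    for key, value in params.items():
        # 'from' is a Python keyword, so callers pass it as from_.
        args[key.rstrip("_")] = value
    return f"{base_url}?{urlencode(args)}"


BASE = "http://repository.example.org/oai"  # placeholder endpoint

# 1. Ask which metadata formats the repository supports.
formats_url = oai_request(BASE, "ListMetadataFormats")
# 2. Ask for every record changed since a given date, in Dublin Core.
updates_url = oai_request(BASE, "ListRecords",
                          metadataPrefix="oai_dc", from_="2006-03-01")
# 3. Fetch one selected record (the identifier is made up).
record_url = oai_request(BASE, "GetRecord", metadataPrefix="oai_dc",
                         identifier="oai:repository.example.org:1234")

for url in (formats_url, updates_url, record_url):
    print(url)
```
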
Overall, a fascinating meeting. I learned a lot, and urge readers to follow some of the links above and learn about some of the amazing work being done by the library community.



An engineer’s guide to the Matrix

March 13, 2006

If you have seen all 3 movies, been overwhelmed by deja vu and
thought more than you would like to admit about what attraction this
movie holds over you…read on. This is a very long read, and while I
despise overly wordy rants in general, this one is definitely worth the


Wikis and Blogs to Ease Administration

March 5, 2006

Empowering the average user with tools that are freely available. Short application generation cycles in a culture with cheap access to information are making room for cheap coordination. Hopefully the ease of setup and customization of these tools will soon come more in line with their ease of use.

Using Wikis and Blogs to Ease Administration | Linux Journal

System administration can be like sailing a ship. You must keep your engines running smoothly, keep your crew and the harbors notified and up to date and also maintain your Captain’s log. You must keep your eye on the horizon for what is coming next. Two technologies have emerged over the past few years that could help keep you on course, wikis and blogs.

Maintaining Good Documentation
I find that one of the most difficult aspects of system administration is keeping documentation accurate and up to date. Documenting how you fixed a pesky problem today will help you remember how to fix it months later when it occurs again. If you ever have worked with others, you realize how critical good documentation is. Even if you are the only system administrator, you still will reap the benefits of good documentation, even more so if another sysadmin is ever brought on board.

Some goals of a good documentation system should be:

  • Make it easy for you and your coworkers to find relevant information.
  • Make it easy for new employees to come up to speed quickly.
  • Make it easy to create, edit and retire documentation.
  • Keep revisions of changes and who made them.
  • Limit who sees or edits the documentation with an authentication system.

Unfortunately, keeping your documentation up to date can be a full-time job in itself. Documenting, though not a very glamorous task, certainly will pay off in the long run.

Why a Wiki?
This is where a wiki comes in. From Wikipedia: “a wiki is a type of Web site that allows users to add and edit content and is especially suited for constructive collaborative authoring.”

What this means is a wiki allows you to keep and edit your documentation in a central location. You can access and edit that documentation regardless of the platform you are using. All you need is a Web browser. Some wikis have the ability to keep track of each revision of a changed document, so you can revert to a previous version if some errant changes are made to a document. The only obstacle a new user must overcome is learning the particular markup language of your wiki, and sometimes even this is not completely necessary.

One of a wiki’s features is also one of its drawbacks. Wikis are pretty free flowing, and although this allows you to concentrate on getting the documentation written quickly, it can make organization of your wiki rapidly spiral out of control. Thought needs to be put into how the wiki is organized, so that topics do not get stranded or lost. I have found that making the front page a table of contents of all the topics is very handy. However you decide to organize your wiki, make sure it is well understood by everyone else. In fact, a good first document might be the policy describing the organization of the wiki!

There are several open-source wikis available, such as MediaWiki [see Reuven M. Lerner’s article on page 62 for more information on MediaWiki] and MoinMoin, each with its own philosophy on markup and layout, but here we concentrate on TWiki.

Some of TWiki’s benefits are:

  • A notion of webs that allows the wiki administrator to segregate areas of collaboration into their own areas, each with its own set of authorization rules and topics.
  • A modular plugin and skin system that allows you to customize easily.
  • A well-established base of users and developers.
  • Revision control based on RCS.
  • It is Perl-based and mod_perl or FastCGI can be used.
  • Authentication is handled outside the wiki by mechanisms such as Apache htpasswd.

Installing TWiki is relatively easy, but still needs work. I hope, as the beta progresses, we will see improvements in ease of installation and upgrading along with clearer documentation.

First, you must create the directory where you want to install TWiki, say /var/www/wiki. Next, untar the TWiki distribution in that directory. Then you must make sure that the user with rights to run CGI scripts (usually apache or www-data) owns all of the files and is able to write to all of them.

# install -d -o apache /var/www/wiki
# cd /var/www/wiki
# tar zxf /path/to/TWikiRelease2005x12x17x7873beta.tgz
# cp bin/LocalLib.cfg.txt bin/LocalLib.cfg
# vi bin/LocalLib.cfg lib/LocalSite.cfg
# chown -R apache *
# chmod -R u+w *

Now copy bin/LocalLib.cfg.txt to bin/LocalLib.cfg, and edit it. You need to edit the $twikiLibPath variable to point to the absolute path of your TWiki lib directory, /var/www/wiki/lib in our case. You also must create lib/LocalSite.cfg to reflect your specific site information.

Here is a sample of what might go into LocalSite.cfg:

# This is LocalSite.cfg. It contains all the setups for your local
# TWiki site.
$cfg{DefaultUrlHost} = "";
$cfg{ScriptUrlPath} = "/wiki/bin";
$cfg{PubUrlPath} = "/wiki/pub";
$cfg{DataDir} = "/var/www/wiki/data";
$cfg{PubDir} = "/var/www/wiki/pub";
$cfg{TemplateDir} = "/var/www/wiki/templates";
$TWiki::cfg{LocalesDir} = '/var/www/wiki/locale';

Here is a sample section for your Apache configuration file that allows this wiki to run:

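ScriptAlias /wiki/bin/ "/var/www/wiki/bin/"
Alias /wiki "/var/www/wiki"
<Directory "/var/www/wiki/bin">
    Options +ExecCGI -Indexes
    SetHandler cgi-script
    AllowOverride All
    Allow from all
</Directory>
<Directory "/var/www/wiki/pub">
    Options FollowSymLinks +Includes
    AllowOverride None
    Allow from all
</Directory>
# Never serve these directly (directory names assumed from the TWiki layout)
<Directory "/var/www/wiki/data">
    deny from all
</Directory>
<Directory "/var/www/wiki/lib">
    deny from all
</Directory>
<Directory "/var/www/wiki/templates">
    deny from all
</Directory>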

TWiki comes with a configure script that you run to set up TWiki. This script is used not only on initial install but also when you want to enable plugins later. At this point, you are ready to configure TWiki, so point your browser to your TWiki configure script. You might be particularly interested in the Security section, but we will visit this shortly. Until you have registered your first user, you should leave all settings as they are. If the configure script gives any warnings or errors, you should fix those first and re-run the script. Once you click Next, you are prompted to enter a password. This password is used whenever the configure script is run in the future to help ensure no improper access.

Once you have completed the configuration successfully, it is time to enter the wiki. Point your browser to your wiki, and you are presented with the Main web. In the middle of the page is a link for registration. Register yourself as a user. Be sure to provide a valid e-mail address, as the software uses it to validate your account. Once you have verified your user account, you need to add yourself to the TWikiAdminGroup. Return to the Main web and click on the Groups link at the left, and then choose the TWikiAdminGroup. Edit this page, and change the GROUP variable to include your new user name:

   * Set GROUP = %MAINWEB%.TiLeggett
   * Set ALLOWTOPICCHANGE = %MAINWEB%.TWikiAdminGroup

(The three blank spaces at the beginning of each of these two lines are critical.)

These two lines add your user to the TWikiAdminGroup and allow only members of the TWikiAdminGroup to modify the group. We are now ready to enable authentication for our wiki, so go back to the configure script. Several options provided under the Security section are useful. You should make sure the options {UseClientSessions} and {Sessions}{UseIPMatching} are enabled. Also set the {LoginManager} option to TWiki::Client::TemplateLogin and {PasswordManager} to TWiki::Users::HtPasswdUser. If your server supports it, you should set {HtPasswd}{Encoding} to sha1. Save your changes and return to the wiki. If you are not logged in automatically, there is a link at the top left of the page that allows you to do so.
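
To make the sha1 setting concrete: Apache password files can store a password as the string {SHA} followed by the base64-encoded SHA-1 digest of the password. The Python sketch below builds such an entry, on the assumption that this is the format the {HtPasswd}{Encoding} sha1 option refers to. The password is a placeholder and the function name is hypothetical.

```python
import base64
import hashlib


def htpasswd_sha1_entry(user: str, password: str) -> str:
    """Return a 'user:{SHA}digest' line in Apache's SHA-1 password format."""
    digest = hashlib.sha1(password.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    return f"{user}:{{SHA}}{encoded}"


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(htpasswd_sha1_entry("TiLeggett", "password"))
```

This scheme is unsalted, so identical passwords produce identical entries; that is one more reason to keep the password file itself unreadable to ordinary users.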

Now that you have authentication working, you may want to tighten down your wiki so that unauthorized people do not turn your documentation repository into an illicit data repository. TWiki has a pretty sophisticated authorization system that is tiered from the site-wide preferences all the way down to a specific topic. Before locking down the Main web, a few more tasks need to be done. Once only certain users can change the Main web, registering new users will fail. That is because part of the user registration process involves creating a topic for that user under the Main web. Dakar has a user, TWikiRegistrationAgent, that is used to do this. From the Main web, use the Jump box at the top left to jump to the WebPreferences topic.

Edit the topic to include the following four lines and save your changes:


This allows only members of the TWikiAdminGroup to make changes or rename the Main web or update the Main web’s preferences. It also allows the TWikiRegistrationAgent user to create new users’ home topics when new users register.

Once you have verified that the Main web is locked down, you should do the same for the TWiki and Sandbox webs.

When you are done configuring TWiki, you should secure the files’ permissions:

# find /var/www/wiki/ -type d -exec chmod 0755 {} ';'
# find /var/www/wiki/ -type f -exec chmod 0400 {} ';'
# find /var/www/wiki/pub/ -type f -exec chmod 0600 {} ';'
# find /var/www/wiki/data/ -type f -exec chmod 0600 {} ';'
# find /var/www/wiki/lib/LocalSite.cfg -exec chmod 0600 {} ';'
# find /var/www/wiki/bin/ -type f -exec chmod 0700 {} ';'
# chown -R apache /var/www/wiki/*
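
After running the commands above, it can be worth checking that nothing was missed. The Python sketch below walks a directory tree and reports regular files that group or other users can still read, write, or execute, which none of the file modes above (0400, 0600, 0700) allow. The function name is hypothetical, and the path is the install directory assumed throughout this article.

```python
import os
import stat


def loose_files(root: str) -> list:
    """Return regular files under root with any group/other permission bits."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            # S_IRWXG | S_IRWXO covers every group and other permission bit.
            if stat.S_ISREG(mode) and mode & (stat.S_IRWXG | stat.S_IRWXO):
                found.append(path)
    return sorted(found)


if __name__ == "__main__":
    for path in loose_files("/var/www/wiki"):
        print(path)
```

An empty result means every file is restricted to its owner. Directories are deliberately skipped, since the commands above leave them at 0755.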

As I mentioned before, TWiki has a plugin system that you can use. Many plugins are available from the TWiki Web site. Be sure the plugins you choose have been updated for Dakar before you use them.

Keeping Your Users in the Know
One important aspect of system administration that is sometimes overlooked is keeping users informed. Most users like to know when there is new functionality available or when resources are down or not available. Not only does it make users happier to be kept informed, but it also can make your life easier as well. The last thing you want to do when the central file server is down is reply to users’ questions about why they cannot get to their files. If you have trained your users to look at a central location for status of the infrastructure first, all you have to do after notification of a problem is post to this central place that there is a problem. Mailing lists also are good for this, but what if the mail server is down? Some people, for instance your boss or VP of the company, might like to know what the status is of things as they happen. These updates might not be suitable to send out to everyone daily via e-mail. You could create yet another mailing list for these notifications, but you also might consider a blog.

If you are not familiar with a blog, let us refer back to Wikipedia: “a blog is a Web site in which journal entries are posted on a regular basis and displayed in reverse chronological order.”

The notion of a blog has been around for centuries in the form of diaries, but blogs recently have had an explosion on the Internet. Many times a blog is started as someone’s personal journal or as a way to report news, but blogs can be extremely useful for the sysadmin.

Blogs can help a sysadmin give users an up-to-the-minute status of what they are doing and what the state of the infrastructure is. If you faithfully update your blog, you easily can look back on what you have accomplished so you can make your case for that raise you have been hoping for. It also will help you keep track of what your coworkers are doing. And, with many blog software packages providing RSS feeds, users can subscribe to the blog and be notified when there are new posts.
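
As a rough illustration of what subscribing involves, the Python sketch below parses an RSS 2.0 document and lists its posts. The feed is an inline, made-up sample and not output from a real blog, and the function name is hypothetical. A real feed reader would fetch the blog's feed on a schedule and compare it against the posts it has already seen.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed, newest post first, as a blog would publish it.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sysadmin Status (sample)</title>
    <item>
      <title>File server back up</title>
      <pubDate>Mon, 06 Mar 2006 09:30:00 GMT</pubDate>
    </item>
    <item>
      <title>File server down for repair</title>
      <pubDate>Mon, 06 Mar 2006 08:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def list_posts(feed_xml: str) -> list:
    """Return (title, pubDate) pairs for each item in an RSS 2.0 feed."""
    channel = ET.fromstring(feed_xml).find("channel")
    return [(item.findtext("title"), item.findtext("pubDate"))
            for item in channel.findall("item")]


if __name__ == "__main__":
    for title, published in list_posts(SAMPLE_FEED):
        print(f"{published}  {title}")
```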

There are a lot of blog software packages out there today, but here we cover WordPress. WordPress is fast and has a nice plugin and skin interface to allow you to customize it to your heart’s content. The only requirements for running WordPress are Apache, MySQL and PHP. I don’t go into how to install WordPress, because the on-line documentation is very clear and easy to follow. Instead, I start where the installation leaves off and introduce some useful plugins. I suggest starting with WordPress v1.5.2 even though v2.0 is currently out. There have been some problems with the initial 2.0 release that warrant waiting for v2.0.1. Also, many of the plugins have not had a chance to update to the new system.

The first thing you should do after installing WordPress is log in as the admin user. Once logged in, you are presented with the Dashboard. At the top of the page is a menu of options named Write, Manage, Links and so on. You should first create an account for yourself by clicking on the Users option. Once that has loaded, two tabs labeled Your Profile and Authors & Users are available under the main menu. Click on Authors & Users, and scroll down to the Add New User section and fill in the text fields. Once your user has been added, it appears in the Registered Users section above. There are several columns of data, and one is Promote, which you should click on. Promoting a user makes that user an author and also allows that user to have more privileges based on its level. Once your user has been promoted, it will have a level of one. There are plus and minus signs on either side of the level to use to increase your user’s level. Increase it to nine, which is the highest level a non-admin user can be. Should you ever need to delete users that have been promoted to authors, all you need to do is decrease their level below one and then delete them. I have included a link to a more in-depth description of the privileges of each user level in the on-line Resources.

There are a few other options you might consider changing. In General Options, there are check boxes to allow anyone to register to become a blog user and to require users to be logged in to add comments. You may or may not want these options enabled, depending on your security concerns and the openness of your blog. At our site, users cannot register themselves, though anyone can post comments without being logged in. You should explore all the menus and all their options to tweak them for your site.

WordPress Plugins
WordPress has a very modular plugin system, and a lot of people have written many plugins. WordPress also has a notion of categories. Categories can have many uses, but one might be to create mini-blogs for different communities of users or to group posts about a specific aspect of the infrastructure. But, you might not want all users to be able to see every category. The Userextra plugin, in conjunction with the Usermeta plugin, allows you to control exactly this sort of thing. Once you have followed these plugins’ installation instructions, two more menus are available under Options and one more under Manage that allow you to refine access.

Another plugin you may find useful is the HTTP Authentication plugin. This plugin lets you use an external authentication mechanism, such as Apache’s BasicAuth, as a means to authenticate to WordPress. This is great if you already have an LDAP directory or Kerberos realm that you use for authentication and you have mod_auth_ldap or mod_auth_kerb up and running.

Many more plugins are available for WordPress from the WordPress Codex and the WordPress Plugin DB. If you feel some functionality is missing, there are plenty of examples and documentation available from the WordPress Web site, and these plugin repositories can help you write your own plugin.

Wrapping Up
I hope that after this whirlwind tour of wikis and blogs you have come to see how they can be beneficial to help your shop run a smoother ship and provide your users with all the information they might want. Just as there are many different sails to keep your ship sailing, there are many different wiki and blog software packages out there. The right package for you is the one that keeps your users happy and you productive.

Resources for this article: