Monday, November 30, 2009

ZIP vs. PAF: Has Database Copyright Enabled Postcode Data Business?

Have you ever noticed that there is no such field as "Legal Science"? That's because the scientific method is hard to apply to the development of laws. Just imagine applying an experimental law to one population while giving a placebo law to a control population. Occasionally a circumstance appears where we can look for the effect of some particular bit of jurisprudence. Today's example is the database copyright. In the UK and other European countries, there is a special type of copyright (lawyers call it sui generis) that applies to databases. In the US, there has been no copyright for databases as such since 1991, even if they are the product of substantial investment.

In the US, databases can only be protected by copyright if they are expressions of human creativity. This is intended to be a fairly low bar. If the selection of data, for example, represents the judgement of a person, then the database can be protected by copyright. What isn't protected is the mindless labor, or "sweat of the brow" effort that someone has made to accumulate the data. The 1991 Supreme Court decision that established this rule was a unanimous one written by Justice Sandra Day O'Connor. In retrospect, the opinion seems prescient, as if the Court had anticipated a day when sweating brows would be banished by scraping computers and global networks of information.

Rob Styles has a post on his blog that got me reading and thinking about these database copyrights. His key point is a suggestion that distributed Linked Data will disrupt database intellectual property rights as profoundly as P2P distribution networks have disrupted the music and entertainment media businesses.

Like all great blog posts, Styles' is at the same time obviously true and obviously wrong- i.e., thought provoking. First, the obviously true part. When technology makes it trivial to reaggregate data that is readily available in a dispersed state, then businesses that rely on exclusive access to the aggregate become untenable. The example discussed by Styles is the Royal Mail's Postcode Address File. It turns out that in the UK, the Royal Mail has made a modest business of selling access to this file, which lists every address in the country that receives mail, together with geographical coordinates. This arrangement has recently been in the news because of Ernest Marples Postcodes Ltd., a small company which attempted to provide free API access to postcode data but was shut down by a threat of legal action from the Royal Mail. Apparently the Royal Mail won't let websites use the postcode data without paying a £3750 license fee. They also offer per-click licenses which cost about 2p per click. To all appearances, the Royal Mail supports a healthy ecosystem of postcode data users- they list 217 "solutions providers" on their web site.

Styles' point is that the facts contained in the postcode file are in the public domain, and with Semantic Web technology, a large fraction of these facts could be made available as Linked Data without infringing the Royal Mail's copyrights. Once the data has entered the cloud, it would seem impractical for the Royal Mail to further assert its copyright. My posts on copyright salami attempted (unsuccessfully, I think) to construct a similar evasion for books; Rob's suggested postcode copyright evasion is clean because the "slices" are truly in the public domain, rather than simply being fairly used, as in my scenario.

How does the US differ in the availability of postcode data? In the US, the data file that corresponds most closely with the Royal Mail's PAF file is the USPS Topologically Integrated Geographic Encoding and Referencing/ZIP + 4® File (TIGER/ZIP+4). In the US, not only is there no database right, but works of the government are considered to be in the public domain. In general, government agencies are only allowed to charge for things like TIGER/ZIP+4 to cover distribution costs. Thus, it's not so surprising that the USPS doesn't even list a price for the TIGER/ZIP+4 file. I called up to ask, and found out that it costs $700 to get a full dump of the file. USPS does not offer updates; I was told that the release file is updated "every 2-3 years". The USPS, unlike the Royal Mail, seems uninterested in helping anyone use their data.

Since the USPS doesn't put any license conditions on the file, companies are free to resell the file in most any way they wish, resulting in a wide variety of services. For example, ZipInfo.com will sell you a license to their version of the Zip+4 file, suitable for use on a website, for $1998, updated quarterly. This is about 1/3rd of the price of the similar offering by the Royal Mail. Zip-codes.com has a similar product for $2000, including updates. On the low end, "Zip code guy" says he'll send you a file for free (the data's a bit old) if you link to his map site. On the high end, companies like Maponics provide the data merged with mapping information, analysis and other data sets.

The purpose of copyright has historically been "for the Encouragement of Learning" according to the Statute of Anne and "To promote the Progress of Science and Useful Arts" according to the US Constitution. The different copyright regimes used for the UK and US now present us with an experiment that's been running for over 18 years as to the efficacy of database copyrights. In which country, the UK or the US, have the "Useful Arts" surrounding postcode databases flourished the best?

After a bit of study, I've concluded that in the case of postcodes, database copyright has so far been more or less irrelevant to the development of the postcode data business. And even though the governmental organizations have completely different orientations towards providing their data, the end result- what you can easily buy and what it costs- is not all that different between countries. Although it's argued that the shutdown of ErnestMarples.com and the higher cost of data in the UK are a result of database copyright, there is clearly more at play.

In theory, one way that copyright promotes commerce is by providing a default license to cover standard use of protected material. In fact, there are very few database providers that rely solely on copyrights to govern usage terms. In both the US and UK, the "good" postcode databases are only available with a license agreement attached. These licenses preserve the business models of postcode data merchants; it's not clear that ErnestMarples.com was complying with license agreements even if it wasn't infringing a database copyright.

Since UK database copyrights don't have effect in the US, we might imagine setting up a Royal Mail postcode business in the US to exploit the absence of copyright. Would we be able to do something that we couldn't do in the UK? Well, not really. We'd probably still want to get a license from the Royal Mail, because £3750 is not a lot of money. It would cost us more to ask a lawyer whether we'd run into any problems. And at least in theory, the Royal Mail would have the freshest data. This is the reason I think Styles' post is "obviously wrong"- the distributed availability of data won't have a big effect on the core business of the Royal Mail or any other database business. It would have exactly the same effect as the absence of copyright protection in the US has had on the UK postcode market. In other words, nil.

My main worry about licensing from the Royal Mail would be in the area of allowed uses; I don't really trust an organization with the words "royal" and "mail" in its name to be able to understand and fairly price all the crazy mashed-up uses I might invent. Database copyrights give producers like the Royal Mail the ability to arbitrarily disallow new uses. Since it's hard to prove that any given fact has been obtained free of database copyright, the threat of an infringement lawsuit by the Royal Mail could even stifle non-infringing postcode applications.

What I don't see in the postcode data "experiment" is evidence that database copyright has had any great benefit for "the useful arts" in the UK compared to the US. If that's true, then why bother having a special copyright for databases?

As data lives more and more on the web, and becomes enhanced, entailed, and enmeshed, it makes less and less sense to draw arbitrary lines around blocks of data by granting copyright to arbitrary aggregations. Although we need innovative licensing tools to build sustainable business models for data production, maintenance, and reuse in a global data network, we don't really need the database copyright.


Tuesday, November 24, 2009

Publish-Before-Print and the Flow of Citation Metadata

Managing print information resources is like managing a lake. You need to be careful about what flows into your lake and you have to keep it clean. Managing electronic information resources is more like managing a river- it flows through many channels, changing as it goes, and it dies if you try to dam it up.

I have frequently applied this analogy to libraries and the challenges they face as their services move online, but the same thing is true for journal publishing. A journal publisher's duties are no longer finished when the articles are bound into issues and put into the mail. Instead, publication initiates a complex set of information flows to intermediaries that help the information get to its ultimate consumer. Metadata is sent to indexing services, search engines, information aggregators, and identity services. Mistakes that occur in these channels will prevent customer access just as profoundly as the loss of a print issue, and are harder to detect, as well.

A large number of journals have made the transition from print distribution to dual (print+electronic) distribution; many of those journals are now considering the transition to online-only distribution. As they plan these transitions, publishers are making decisions that may impact the distribution chain. Will indexing services be able to handle the transition smoothly? Will impact factors be affected? Will customer libraries incur unforeseen management costs?

I was recently asked by the steering committee of one such journal to look into some of these issues, in particular to find out about the effects of the "publish-before-print" model on citations. I eagerly accepted the charge, as I've been involved with citation linking in one way or another for over 10 years and it gave me an opportunity to reconnect with a number of my colleagues in the academic publishing industry.

"Publish-before-print" is just one name given to the practice of publishing an article "version of record" online in advance of the compilation of an issue or a volume. This allows the journal to publish fewer, thicker issues, thus lowering print and postage costs, while at the same time improving speed-to-publication for individual articles. Publish-before-print articles don't acquire volume, issue and page metadata until the production of the print version.

Before I go on, I would like to recommend the NISO Recommended Practice document on Journal Article Versions (pdf, 221KB). It recommends "Version of Record" as the term to use instead of "published article", which gets used loosely in a number of circumstances:
  1. Version of Record (VoR) is also known as the definitive, authorized, formal, or published version, although these terms may not be synonymous.
  2. Many publishers today have adopted the practice of posting articles online prior to printing them and/or prior to compiling them in a particular issue. Some are evolving new ways to cite such articles. These “early release” articles are usually [Accepted Manuscripts], Proofs, or VoRs. The fact that an “early release” article may be used to establish precedence does not ipso facto make it a VoR. The assignment of a DOI does not ipso facto make it a VoR. It is a VoR if its content has been fixed by all formal publishing processes save those necessary to create a compiled issue and the publisher declares it to be formally published; it is a VoR even in the absence of traditional citation data added later when it is assembled within an issue and volume of a particular journal. As long as some permanent citation identifier(s) is provided, it is a publisher decision whether to declare the article formally published without issue assignment and pagination, but once so declared, the VoR label applies. Publishers should take extra care to correctly label their “early release” articles. The use of the term “posted” rather than “published” is recommended when the “early release” article is not yet a VoR.
"Version of Record before Print" is a bit of a mouthful, so I'll continue to use "publish-before-print" here to mean the same thing.

It's worth explaining "Assignment of a DOI" further, since it's a bit complicated in the case of publish-before-print. Crossref-issued DOIs are the identifiers used for articles by a majority of scholarly journal publishers. To assign the DOI, a publisher has to submit a set of metadata for the article, along with the DOI that it wants to register. The Crossref system validates the metadata and stores it in its database so that other publishers can discover the DOI for citation linking. In the case of publish-before-print, the submitted metadata will include the journal name, the names of the authors, the article's title, and the article's URL, but will be missing volume, issue and page numbers. After the article has been paginated and bound into an issue, the publisher must resubmit the metadata to Crossref, with the added metadata and the same DOI.
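
To make the two-stage deposit concrete, here is a minimal sketch of how a downstream system might check whether a DOI's Crossref metadata has acquired its volume, issue and page numbers yet. It assumes Crossref's public REST API at api.crossref.org and the third-party requests library; the DOI below is a made-up placeholder, not a real registration.

```python
import requests  # assumes the third-party requests library is installed


def pagination_status(doi):
    """Fetch Crossref metadata for a DOI and report whether the
    volume/issue/page fields have been filled in yet."""
    resp = requests.get("https://api.crossref.org/works/" + doi)
    resp.raise_for_status()
    record = resp.json()["message"]
    missing = [f for f in ("volume", "issue", "page") if not record.get(f)]
    if missing:
        return "publish-before-print? still missing: " + ", ".join(missing)
    return "fully paginated: vol. {volume}, no. {issue}, pp. {page}".format(**record)


# hypothetical DOI, used purely for illustration
print(pagination_status("10.1234/example.doi"))
```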

What happens if the online article is cited in an article in another journal during the time between the version of record going online and the full bibliographic data being assigned? This question is of particular importance to authors whose citation rates may factor into funding or tenure decisions. Since the answer depends on the processes being used to publish the citing article and produce the citation databases, I had to make a few calls to get some answers.

As you might expect, journal production processes vary widely. Some journals, particularly in the field of clinical medicine, are very careful to check and double check the correctness of citations in their articles. For these journals, it's highly likely that the editorial process will capture updated metadata. Other publishers take a much more casual approach to citations, and publish whatever citation data the author provides. Most journals are somewhere in the middle.

Errors can creep into citations in many ways, including import of incorrect citations from another source, misspelling of author names, or simple miskeying. DOIs are particularly vulnerable to miskeying, due to their length and meaninglessness. One of my sources estimates that 20% of author-keyed DOIs in citations are incorrect! If you have the opportunity to decide on the form of a DOI, don't forget to consider the human factor.
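
As an illustration of the kind of defensive check a manuscript system might apply to author-keyed DOIs, here is a rough sketch. The regex reflects the common "10.prefix/suffix" shape, and DOI matching is case-insensitive, so normalizing case is harmless. Note that DOIs carry no checksum, so a string can pass this test and still be wrong; the only real verification is resolving it (for example, against Crossref). The DOI strings in the example are made up.

```python
import re

DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")  # the usual "10.prefix/suffix" form


def normalize_doi(raw):
    """Strip the decoration authors typically add, lower-case the result,
    and return None if what's left doesn't even look like a DOI."""
    doi = raw.strip()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:", "DOI:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    doi = doi.rstrip(".,;)")  # trailing punctuation picked up from the citation
    return doi.lower() if DOI_SHAPE.match(doi) else None


# made-up DOI strings, purely for illustration
print(normalize_doi("doi:10.1234/ABC.456."))  # -> 10.1234/abc.456
print(normalize_doi("10.1234/ ABC456"))       # -> None (stray space, likely miskeyed)
```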

It's hard to get estimates of the current error rate in citation metadata; when I was producing an electronic journal ten years ago, my experience was consonant with industry lore that said that 10% of author-supplied citations were incorrect in some way. My guess, based on a few conversations and a small number of experiments, is that a typical error rate in published citations is 1-3%. A number of processes are pushing this number down, most of them connected with citation linking in some way.

Reference management and sharing tools such as RefWorks, Zotero, and Mendeley now enable authors to acquire article metadata without keying it in and to link citations before they even submit manuscripts for publication; this can't help but improve citation accuracy. Citation linking in the copy editing process also improves the accuracy of citation metadata. By matching citations against databases such as Crossref and PubMed, unlinked citations can be highlighted for special scrutiny by the author.
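
Here, for example, is a rough sketch of the kind of lookup a copyediting tool might run on a free-form reference. It assumes Crossref's public REST API and its query.bibliographic parameter; the score threshold and the reference string are arbitrary placeholders, not recommendations.

```python
import requests  # assumes the third-party requests library is installed


def match_citation(free_text_citation, min_score=75):
    """Ask Crossref for its best guess at the DOI for a free-form citation.
    Returns (doi, score) or None if nothing scores above the threshold."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": free_text_citation, "rows": 1},
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    if items and items[0].get("score", 0) >= min_score:
        return items[0]["DOI"], items[0]["score"]
    return None  # unlinked: flag this reference for the author to check


# a made-up reference string, for illustration only
print(match_citation("Smith J, Jones A. A study of citation accuracy. J Imag Res. 2008;12:34-56."))
```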

Integration of citation linking into publishing workflow is becoming increasingly common. In publishing flows hosted by HighWire Press' Bench>Press manuscript submission and tracking system, Crossref and PubMed can be used at various stages to help copyeditors check and verify links. Similarly, ScholarOne Manuscripts, a manuscript management system owned by Thomson Reuters, integrates with Thomson Reuters' Web of Science and EndNote products. Inera's eXtyles, software that focuses specifically on citation parsing and is integrated with Aries Systems' Editorial Manager, has recently added an automatic reference correction feature that not only checks linking, but also pulls metadata from Crossref and PubMed to update and correct citations. I also know of several publishers that have developed similar systems internally.

In most e-journal production flows, there is still a publication "event", at which time the content of the article, including citations, becomes fixed. The article can then flow to third parties that make the article discoverable. Of particular interest are citation databases such as Thomson Reuters' Web of Science (this used to be ISI Science Citation Index). The Web of Science folks concentrate on accurate indexing of citations; they've been doing this for almost 50 years.

Web of Science will index an article and its citations once it has acquired its permanent bibliographic data. The article's citations will then be matched to source items that have already been indexed. Typically there are cited items that don't get matched - these might be unpublished articles, in-press articles, and private communications. Increasingly, the dangling items include DOIs. In the case of a cited publish-before-print article, the citation will remain in the database until the article has been included in an issue and indexed by Web of Science. At that point, if the DOI, journal name, and first author name all match, the dangling citation is joined to the indexed source item so that all citations of the article are grouped together.

Google's PageRank is becoming increasingly important for electronic journals, so it's important to help Google group together all the links to your content. The method supported by Google for grouping URLs is the rel="canonical" link element. By putting a DOI-based link into this element on article web pages, publishers can ensure that the electronic article will be ranked optimally in Google and Google Scholar.
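
As a trivial sketch, the element itself looks like this; the DOI is made up, and the dx.doi.org resolver form is just one stable DOI URL form a publisher might choose.

```python
def canonical_link(doi):
    """Build the <link rel="canonical"> element for an article page,
    pointing at the DOI resolver rather than any one local URL."""
    return '<link rel="canonical" href="http://dx.doi.org/%s" />' % doi


# made-up DOI, for illustration only
print(canonical_link("10.1234/jxyz.2009.447"))
# <link rel="canonical" href="http://dx.doi.org/10.1234/jxyz.2009.447" />
```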

An increasingly popular alternative to publish-before-print is print-oblivious article numbering. Publishers following this practice do not assign issue numbers or page numbers, and instead assign article numbers when the version-of-record is first produced. Downstream bibliographic systems have not universally adjusted to this new practice; best practices for article numbers are described in an NFAIS Report on Publishing Journal Articles (pdf 221KB).

In summary, the flow of publish-before-print articles to end users can be facilitated by proper use of DOIs and Crossref.
  1. Prompt, accurate and complete metadata deposit at the initial online publication event and subsequent pagination is essential.
  2. DOIs should be constructed with the expectation that they will get transcribed by humans.
  3. Citation checking and correction should be built into the article copyediting and production process.
  4. Use of the DOI in rel="canonical" link elements will help in search engine rankings.

Friday, November 20, 2009

Putting Linked Data Boilerplate in a Box

Humans have always been digital creatures, and not just because we have fingers. We like to put things in boxes, in clearly defined categories. Our brains so dislike ambiguity that when musical tones are too close in pitch, the dissonance almost hurts.

The aesthetics of technical design frequently ask us to separate one thing from another. It's often said that software should separate code from content and that web-page mark-up should separate presentation from content. XML allows us to separate element content from attribute data; well designed XML schemas make clear and consistent decisions about what should go where.

In ontology design, the study of description logics has given us boxes for two types of information, which have been not-so-helpfully named the "A-Box" and the "T-Box". The T-Box is for terminology and the A-Box is for assertions. When you're designing an ontology, an important decision is how much information should be built into your terminology and how much should be left for users of the terminology to assert.

It's not always easy to decide where to draw the terminology vs. assertion line. For example, if you're building a dog ontology, you might want to have a BlackDog class for dogs that are black. Users of your ontology could then make a single assertion that Fido is a BlackDog, saving them the trouble of making the pair of assertions that Fido is a Dog and Fido is colored black. The audience, on the other hand, would have to understand the added terminology to be able to understand what you've said. In one case, the binding of color to dogs is done in the T-Box, in the second, the A-Box. The A/B box choice boils down to a question of whether users would rather have a concise assertion box and a complex terminology box, or a verbose assertion box and a simple terminology.
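
Here's a small sketch of the two alternatives using Python's rdflib; the namespace and class names are invented, and the T-Box version uses a plain subClassOf axiom as a stand-in for the full OWL restriction that would actually tie BlackDog to the color black.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/dogs#")  # invented namespace
g = Graph()
g.bind("ex", EX)

# A-Box-heavy style: simple terminology, two assertions per dog
g.add((EX.Fido, RDF.type, EX.Dog))
g.add((EX.Fido, EX.color, EX.black))

# T-Box-heavy style: richer terminology, one assertion per dog
g.add((EX.BlackDog, RDFS.subClassOf, EX.Dog))  # stand-in for an OWL color restriction
g.add((EX.Rex, RDF.type, EX.BlackDog))

print(g.serialize(format="turtle"))
```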

Although I designed my first RDF Schema over ten years ago, I had not had a chance to try out OWL for ontology design. Since OWL 2 has just become a W3C Recommendation, I figured it was about time for me to dive in. I was also curious to find out what kind of ontology designs are preferred for linked data deployment, and I'd never even heard of description logic boxes.

Since I gave the New York Times an unfairly hard time for the mistakes it made in its initial Linked Data release, I felt somewhat obligated to do what I could to participate helpfully in their Linked Open Data Community. (Good stuff is going on there- if you're interested, go have a look!) The licensing and attribution metadata in the Times' Linked Data struck me as highly repetitive, and I wondered if this boilerplate metadata could be cleaned up by moving it into an OWL ontology. It could; if you're interested in details, go to the Times Data Community site and see.

It's not obvious which box this boilerplate information should be in. It's really context information, or assertions about other assertions. The Times wants people to know that it has licensed the data under a creative commons license, and that it wants attribution. If it's really the same set of assertions for everything the Times wants to express (i.e. it's boilerplate) then one would think there would be a better way than mindless repetition.

My ontology for New York Times assertion and licensing boilerplate had the effect of compacting the A-Box at the cost of making the T-Box more complex. I asked if that was a desirable thing or not, and the answer from the community was a uniform NOT. The problem is that there are many consumers of linked data who are reluctant to do the OWL reasoning necessary to unveil the boilerplate assertions embedded in the ontology. Since a business objective for the Times is to enable as many users as possible to make use of its data and ultimately to drive traffic to its topic pages, it makes sense to keep technical barriers as low as possible. Mindlessness is a feature.
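
For the record, the repetition the community preferred looks something like this: a sketch with rdflib in which the resource identifiers are invented, and the Creative Commons properties are meant only to suggest the sort of license and attribution triples attached to each resource.

```python
from rdflib import Graph, Literal, Namespace, URIRef

CC = Namespace("http://creativecommons.org/ns#")
g = Graph()
g.bind("cc", CC)

license_uri = URIRef("http://creativecommons.org/licenses/by/3.0/us/")

# every resource carries its own copy of the boilerplate - no OWL reasoning required
for slug in ("topic_one", "topic_two", "topic_three"):  # invented identifiers
    topic = URIRef("http://data.nytimes.com/" + slug)
    g.add((topic, CC.license, license_uri))
    g.add((topic, CC.attributionName, Literal("The New York Times Company")))

print(g.serialize(format="turtle"))
```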

I could only think of one reason that a real business would want to use my boilerplate-in-ontology scheme. Since handling an ontology may require some human intervention, the use of a custom ontology could be a mechanism to enforce downstream consideration of and assent to license terms, analogous to "click-wrap" licensing. Yuck!

The conclusion, at least for now, is that for most linked data publishing it is desirable to keep the terminology as simple as possible. Linked Data Pidgin is better than Linked Data Creole.

Saturday, November 14, 2009

The Book Rights Registry Unclaimed Works Fiduciary: Powerful Regent or Powerless Figurehead?

In college, I did physics problem sets with a study group that called themselves the "Fish Heads" after a song frequently played on the radio by Dr. Demento. We would start work after dinner on the night before the problem set was due, and we'd work till we were done, which was seldom before midnight and more usually like 3 or 4 AM.

I thought of the Fish Heads late last night while racing through the newly revised settlement agreement of the Google Book Search lawsuit. The parties to the lawsuit had already asked for, and received, a four-day extension, and you just knew they were going to stretch out their work to meet the midnight deadline with not much room to spare. Sure enough, at 11:45 PM EST came word that the revised agreement had been filed. A few minutes after midnight, I was racing through the document to find out what the changes were, tweeting along the way. James Grimmelmann and Ken Crews were doing the same thing, each of us in our own way. It was really nerdy. Danny Sullivan was reporting on the conference call with Dan Clancy, Paul Aiken and Richard Sarnoff.

Here's your basic reading list for Google Book Search Settlement Agreement 2.0:
  1. Start with the New York Times summary (Brad Stone and Miguel Helft)
  2. Then read Danny Sullivan's report on the conference call.
  3. Having gotten the big picture, read James Grimmelmann's instant analysis of the revised agreement.
  4. Then graze through the coverage overview at Gary Price's Resource Shelf.
Having slept on it and having had some time to think it through, I have a bunch of questions, and they mostly focus on the one demon that has not been exorcised from the agreement, orphan works.

The revised agreement attempts to address the peculiar situation of orphan works by introducing a new entity, the Unclaimed Works Fiduciary (UWF) which, as part of the Book Rights Registry, is to act as a spokesman for the rightsholders of the unclaimed works. The key question for your problem set is this: is this new regime a powerful Regency over Orphandom, or is it a powerless Figurehead masking a Google Autocracy of Zombies?

Here is how the revised agreement defines the UWF:
Unclaimed Works Fiduciary. The Charter will provide that the Registry’s power to act with respect to the exploitation of unclaimed Books and Inserts under the Amended Settlement will be delegated to an independent fiduciary (the “Unclaimed Works Fiduciary”) as set forth in [other sections of the Agreement] and otherwise as the Board of Directors of the Registry deems appropriate. The Unclaimed Works Fiduciary will be a person or entity that is not a published book author or book publisher (or an officer, director or employee of a book publisher). The Unclaimed Works Fiduciary (and any successor) will be chosen by a supermajority vote of the Board of Directors of the Registry and will be subject to Court approval.
The section about the Registry Charter provides that
in the case of unclaimed Books and Inserts, the Unclaimed Works Fiduciary may license to third parties the Copyright Interests of Rightsholders of unclaimed Books and Inserts to the extent permitted by law.
James Grimmelmann calls that last sentence "words of equivocation". The reason is that he and other commentators think there is almost nothing that the law, absent an act of Congress, would allow the UWF to license to a third party. The rule of "Nemo dat" should apply- you can't give something away that isn't yours to give.

The Open Book Alliance goes even further. In a post somehow released earlier than the revised agreement, it calls the revised agreement a "sleight of hand" meant to distract people from Google's monopoly grab, its usurpation of Congress, its shredding of contracts, its destruction of libraries, its bioterror weapons stockpile and its threatening the sanctity of marriage.

Michael Healy, who has been named Executive Director of the Book Rights Registry, which would be the home of the UWF, seems to have a different perspective. In a post on the Publishing Point website, Healy notes:
  • The Registry will now include a Court-approved fiduciary who will represent rightsholders of unclaimed books, act to protect their interests, and license their works to third parties, to the extent permitted by law.
  • The new version of the settlement removes the “most favored nation” clause contained in the previous version. The Registry will now be able to license unclaimed works to other parties without ever extending the same terms to Google.
"Extent permitted by law" is a hard phrase to argue with. How could a settlement go any further? Grimmelmann's theory is that the phrase is meant to be an enticement to Congress to pass a narrow law aimed at neutralizing Google's exclusive access to orphan works exploitation.

A closer look at the UWF suggests that its other powers may be less constrained. Here's what it will be able to do, as enumerated by the revised agreement:
  1. UWF may direct Google to change the classification of a Book to a Display Book or to a No Display Book or to include in, or exclude any or all Unclaimed Works from, one or more of the Display Uses (note added- see comments).
  2. UWF may allow Google to
    • alter the text of a Book or Insert when displayed to users;
    • add hyperlinks to any content within a page of a Book or facilitate the sharing of Book Annotations
    and may exclude from Advertising Uses one or more unclaimed Books if Google displays animated, audio or video advertisements in conjunction with those Books.
  3. UWF may approve the use of additional or different Pricing Bins for unclaimed Books
  4. UWF may:
    • dispute Google’s categorization of a Book as Fiction
    • allow Google to offer to users copy/paste, print or Book Annotation functionalities as part of Preview Uses; allow Google to conduct tests to determine if another Preview Use category increases sales and revenues of such Books
    • adjust the Preview Use setting for a particular Book in exceptional circumstances for good cause shown.
  5. UWF may authorize Google to make special offers of Books available through Consumer Purchases at reduced prices from the List Price.
  6. the Unclaimed Works Fiduciary and Google may agree to one or more of the following additional Revenue Models for unclaimed works:
    • Print on Demand (“POD”) - This service would permit purchasers to obtain a print copy of a non-Commercially Available Book distributed by third parties. A Book’s availability through such POD program would not, in and of itself, result in the Book being classified as Commercially Available.
    • File Download. This service would permit purchasers of Consumer Purchase for a Book to download a copy of such Book in an appropriate file format such as PDF, EPUB or other format for use on electronic book reading devices, mobile phones, portable media players and other electronic devices (“File Download”).
    • Consumer Subscription Models – This service would permit the purchase of individual access to the Institutional Subscription Database or to a designated subset thereof (“Consumer Subscription”).
  7. UWF may license to third parties the Copyright Interests of Rightsholders of unclaimed Books and Inserts to the extent permitted by law. (discussed above.)
  8. UWF may allow the Registry to use up to twenty-five percent (25%) of Unclaimed Funds earned in any one year that have remained unclaimed for at least five (5) years for the purpose of attempting to locate the Rightsholders of unclaimed Books.
  9. UWF can challenge the classification of its Book or a group of its Books as In-Print or as Out-of-Print
All in all, it seems to me that the most significant power of the UWF is not the theoretical power to deal with third parties, but rather the power to control the display status of unclaimed works. (note added- see comments).

Under what circumstances might the UWF turn off display uses? Since the UWF is subject to the approval of the court, the court could, in principle, direct UWF to manage the unclaimed works to minimize antitrust issues. If that happened, Google's monopoly would not go much further than a release of liability for uses that might be considered fair use. Or, the UWF could use its leverage to force Google to open its unclaimed works scans to competitors.

On the other hand, the UWF, being selected by the Registry Board, and being dependent on the Registry for support, would have built-in incentives to enable revenue generating use by Google, not to mention its responsibilities to the orphan rights-holders.

In the end, whether the Unclaimed Works Fiduciary becomes a powerful Regent or a powerless Figurehead depends to a great extent on the Court's willingness to wield power. Good Luck, Denny Chin!
Ask a fish head anything you want to
they won't answer they can't talk.

Friday, November 13, 2009

The New York Times Gets It Right; Does Linked Data Need a CrossRef or an InfoChimps?

I've been saying this long enough that I don't remember whether I was quoting someone else: whenever the internet disintermediates a middleman, two new intermediaries pop up somewhere else. It's disintermediation whack-a-mole, if you will. The reasons for this are:
  1. The old middlemen became fat on mark-ups an order of magnitude larger than needed by internet-enabled middlemen.
  2. Internet-enabled middlemen add value in ways that the old ones didn't.
My last business functioned as an intermediary that aggregated linking data. We'd get data from publishers, clean it up and add it to our collection, then provide feeds of that data to our customers (libraries and library systems vendors). Our customers got good data and support if there was a problem. The companies who provided the data didn't have to deal with hundreds of libraries or system vendors, and they came to understand that we would help their customers link to their content.

Some companies, especially the large ones, were initially uncomfortable with the knowledge that we were selling feeds of data that they were giving out for free. They felt that somehow there was money left on the table. Other companies were fearful of losing control of the information, even though they didn't really have control of it in the first place. Once we explained to them how their data contained mangled character encodings, fictitious identifiers, stray column separators and Catalan month names, they began to see the value we provided.

While my company focused on the data needs of libraries (and did pretty well), a group of the largest academic publishers put up some money and formed a consortium to pool a different type of linking data in a way that let the publishers have more control of the data distribution. This consortium, known as Crossref, just celebrated its 10th anniversary. Crossref has not only paid back the money that its founders invested in it; it has arguably done more to push academic publishing into the 21st century than any other organization on the planet.

As academic publishing companies began to understand the benefits of distributing linking data through Crossref, my company, and others like it, they became more comfortable opening up their content and reaping the financial benefits. Despite the global recession, and despite predictions of its impending collapse, STM publishing has been financially healthy with companies such as Elsevier reporting increased profits. This is rather unlike the newspaper industry, for example.

Before I get to the newspaper industry, I should note yesterday's news that InfoChimps is publishing a collection of token data harvested from Twitter.
Today we are publishing a few items collected from our large scrape of Twitter’s API. The data was collected, cleaned, and packaged over twelve months and contains almost the entire history of Twitter: 35 million users, one billion relationships, and half a billion Tweets, reaching back to March 2006.
InfoChimps is positioning itself as a marketplace to buy, sell, and share data sets of any size, topic or format. Yet another intermediary has popped up!

Two weeks ago, I wrote a somewhat alarmist article about problems in an exciting set of Linked Data being released by the New York Times. I am pleased to be able to report that the New York Times is now getting it right! The most important thing that they're doing right is that they're listening to the people who want to consume their data. They've started a Google Group-based community for the specific purpose of understanding how best to deliver their data. They've also corrected the problems pointed out by myself and others. It's not perfect, but it's not reasonable to expect perfect. The New York Times has set a very hopeful example for other companies that want to start publishing semantic linking information on the open web.

If, as many of us hope, many publishers decide to follow the lead of the Times and make more data collections available, will more intermediaries such as InfoChimps arise to facilitate data distribution, as happened with linking data in scholarly publishing? Will ad hoc groups such as "the Pedantic Web" become key participants in a less centralized data distribution environment? Or maybe large companies will turn off the spigots as "the suits" grow increasingly worried about their ability to control data once it is let out into the web of data.

Perhaps the time is ripe for a set of forward-looking publishers to emulate the nervous-but-smart journal publishers who started Crossref 10 years ago and start a similar consortium for the distribution of Linked Data.

Wednesday, November 11, 2009

The Uniqueness of Sentences and J. K. Rowling's (Non)Infringement of Tanya Tucker

Have you ever heard someone say something unusual and wonder to yourself if anyone in the history of humanity had ever said that before, ever? It happens a lot more than you might think.

In the discussion of my article on copyright salami, I suggested that copyright based on content as short as a sentence would not be very robust. I had reasoned that if the sentences were short enough, there would be a high probability that the same sentence had already appeared in a copyrighted work, or even in a work that was in the public domain. I imagined building huge databases of sentences that had already been used so as to clear them for reuse.

I decided to do some testing first. I chose a page at random (p. 447) from my (print) copy of J. K. Rowling's Harry Potter and the Deathly Hallows. I extracted the sentences, and put each sentence into Google and into Google Book Search. The results surprised me.

My first test sentence was
"Get - off - her!" Ron shouted.
With only 5 words, none of them uncommon, I expected to get a few close matches. The book search produced zero hits, and no results even close. The general Google search was more interesting. Of the 7 hits, all of them exact matches, the top two appear to be properly attributed fair use quotations from the book. Two other hits were to complete, unauthorized copies of the book. One of these, on SlideShare, offers this disclaimer:
"hey here i got this book in pdf format .. am i violating anything after .. uploading this stuff over here ... just let me know .. if any issue come in existence, will remove it
Although the item has had 34,000 views, the pdf itself appears to have been removed from SlideShare. The pdf posted by a Filipino web designer on his web site, though, is still available (and has been since August) and is of quite good quality.

The oddest hits are to a site that masquerades as a "game ranking" portal.
RPGRank is a real-time online game ranking system which provide a best MMORPG ranking portal for both players and games of all genre with the exclusive news, press release, review, preview, interview, trailer and vedio. RPGRank strive to provide all gamers things that they never experienced before by newest game beta keys, live-event, and online tournamentsa with attractive giveaways from games.
It appears that this site generates pages of random text for the benefit of search engines by extracting sentences from books and feeding the sentences to Google in a random order. This site has convinced Google to index "about 318,000" pages of its meaningless "content", and offers to sell "background" advertising space on the site at $1200 per month.

The last hit appears to be to a site which is presenting a Vietnamese translation of the book alongside the complete English text. Although I can't read Vietnamese, I doubt very much that it is an authorized use. Vietnam joined the Berne convention only 5 years ago, so this is almost certainly infringing.

Of the 26 sentences on page 447, I could find only three that had been used in places that Google knows about. The first, "Leave him alone, leave him alone!" is a line from a Tanya Tucker song. The second, "Harry's stomach turned over.", has been used in James Edward Amesbury's "bloody but weakly conceived thriller", A Sporting Chance and in D. Edwards Bradley's Harry's War.

The third,"Harry did not answer immediately." is firmly in the public domain, having done duty as a complete sentence in Smith Hempstone's A Tract of Time, as a fragment in Frances Elizabeth G. Carey-Brock's 1867 My father's Hand: and Other Stories, and in Adam Williams' 2007 gripping adventure of modern China, The Dragon's Tail.


Three sentences comprising bits of dialog: "Been Stung", "And your first name?", and "Vernon Dudley", turned up numerous matches to fragments of sentences in Google. It was also amusing to see matches for the sentence "What happened to you, ugly?" This phrase matched two people-search sites which specialize in feeding Google pages with text like "What happened to Joe Smith?" Apparently there is someone who uses the screen name "you_ugly", and the people search engines just leapt to the wrong conclusions!

Most of the sentences on page 447 appear to be purely original to J. K. Rowling. Was she lucky, or were the odds stacked in her favor? Word frequencies for English have been measured, so we can easily generate a simplistic estimate of sentence occurrence rates. Ignoring the proper name "Ron", the words "Get", "off", "her" and "shout" have occurrence frequencies of 0.22%, 0.046%, 0.22%, and 0.0055%, respectively. Multiplying these rates gives an occurrence probability for this combination of about 1 in 8 trillion. If you had the entire population of earth speaking random four-word English sentences, they might come up with this combination in a day or two. Add "Ron" into the mix, and they might take the greater part of a year to generate the sentence J. K. Rowling wrote.
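
The arithmetic behind that "1 in 8 trillion" figure is just a product of the quoted frequencies:

```python
# occurrence frequencies quoted above, expressed as fractions
freqs = {"get": 0.0022, "off": 0.00046, "her": 0.0022, "shout": 0.000055}

p = 1.0
for f in freqs.values():
    p *= f  # treat the words as independent draws - a simplistic model

print("about 1 in %.0f trillion" % (1 / p / 1e12))  # -> about 1 in 8 trillion
```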

For context, it's interesting to guess at the total number of sentences that humanity has written or spoken. It's estimated that 100 billion humans have lived so far. If those humans spent 16 hours a day for an average of 65 years generating 3 sentences per minute, we'd be up to about 7 million trillion sentences. The real number is probably a factor of 100 to a thousand less (half of us are men, after all!). This estimate roughly agrees with estimates of others that all the words ever spoken could be archived using 10 exabytes of storage.

Ten exabytes is not as much storage as it used to be. The Internet Archive currently has 0.003 exabytes; although Google is quite secretive about its hardware deployment, it seems likely that their current storage capacity is in excess of 10 exabytes. Yesterday, Google announced a pricing plan where they'll rent you 0.000016 exabytes for $4096 per year. I'll do the math for you. If you want to store everything anyone has ever said, Google will rent you the space for only $2.5 billion per year!
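
Here's the back-of-envelope arithmetic for the last two paragraphs, using the assumptions stated there:

```python
humans_ever = 100e9                      # humans who have ever lived
sentences_each = 65 * 365 * 16 * 60 * 3  # 65 years, 16 hours/day, 3 sentences/minute
sentences_ever = humans_ever * sentences_each
print("sentences ever: about %.0f million trillion" % (sentences_ever / 1e18))  # ~7

# renting 10 exabytes at Google's quoted $4096 per 0.000016 exabytes per year
cost = 10 / 0.000016 * 4096
print("annual rent for 10 exabytes: $%.2f billion" % (cost / 1e9))  # the "$2.5 billion" above
```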

Given that Google will soon have digitized a large fraction of the world's books, there are a few things we can learn from this exercise.
  • It will soon be very easy for Google to detect unauthorized copies of books in its index, and presumably to remove them. The benefit to publishers of doing this would hugely outweigh any damages they're suffering from the Google Books digitization program. Why have publishers overlooked getting this to happen as part of the agreement settling their lawsuit?
  • It will not be difficult for Google to accurately de-duplicate the Google Books index.
  • J.K. Rowling's hesitancy to release her books in ebook format is really, really stupid.
Before you get distracted with something useful, do this: pick about 5 random words, make a sentence from them, and become the first human ever to say that sentence. Depending on what you do next, you may also be the last!

Thursday, November 5, 2009

The Blank Node Bother and the RDF Copymess

There were many comments on my post about the problems in the Linked Data released by the New York Times, including some back and forth by Kingsley Idehen, Glenn MacDonald, Cory Casanave and Tim Berners-Lee that many readers of this blog may have found to be somewhat inexplicable. On the surface, the comments appeared to be about how to deal with the potentially toxic scope of "owl:sameAs". At a deeper level, the comments surround the issue of how to deal with a limitation of RDF. A better understanding of this issue will also help you understand difficulties faced by the New York Times and other enterprises trying to benefit from the publication of Linked Data.

Let's suppose that you have a dataset that you want to publish for the world to use. You've put a lot of work into it, and you want the world to know who made the data. This can benefit you by enhancing your reputation, but you might also benefit from others who can enhance the data, either by adding to it or by making corrections. You also may want people to be able to verify the status of facts that you've published. You need a way to attach information about the data's source to the data. Almost any legitimate business model that might support the production and maintenance of datasets depends on having some way to connect data with its source.

One way to publish a dataset is to do as the New York Times did, publish it as Linked Data. Unfortunately, RDF, the data model underlying Linked Data and the Semantic Web, has no built-in mechanism to attach data to its source. To some extent, this is a deliberate choice in the design of the model, and also a deep one. True facts can't really have sources, so a knowledge representation system that includes connections of facts to their sources is, in a way, polluted. Instead, RDF takes the point of view that statements are asserted, and if you want to deal with assertions and how they are asserted in a clean logic system, the assertions should be reified.

I have previously ranted about the problems with reification, but it's important to understand that the technological systems that have grown up around the Semantic Web don't actually do reification. Instead, these systems group triples into graphs and keep track of data sets using graph identifiers. Because these identified graphs are not part of the RDF model they tend to be implemented differently from system to system and thus the portability of statements made about the graph as a whole, such as those that connect data to their source, is limited.
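
As a concrete sketch of the graph-identifier approach, here is how it looks with Python's rdflib; the graph name, namespace and source URI are all invented. Other triple stores expose the same idea through their own interfaces, which is exactly the portability problem described above.

```python
from rdflib import ConjunctiveGraph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

EX = Namespace("http://example.org/")                        # invented namespace
dataset_uri = URIRef("http://example.org/graphs/postcodes")  # invented graph name

store = ConjunctiveGraph()

# the data itself goes into a named graph...
data = store.get_context(dataset_uri)
data.add((EX.SW1A_1AA, EX.locality, Literal("London")))

# ...and statements about the graph (its source, its license) hang off the graph's name
store.add((dataset_uri, DCTERMS.source, URIRef("http://example.org/royal-mail-paf")))

print(store.serialize(format="nquads"))
```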

At last week's International Semantic Web Conference, Pat Hayes gave an invited talk about how to deal with this problem. I've discussed Pat's work previously, and in my opinion, he is able to communicate a deeper understanding of RDF and its implications than anyone else in the world. In his talk (I wasn't there, but his presentation is available) he argues that when an RDF graph is moved about on the Web, it loses its self-consistency.

To see the problem, ask yourself this: "If I start with one fact, and copy it, how many facts do I have?" The answer is one fact. "One plus one equals two" is a single fact no matter how many times you copy it! You can think of this as a consequence of the universality of the concepts labeled by the English words "one" and "two".

I haven't gotten to the problem yet. As Pat Hayes points out, the problem is most clearly exposed by blank nodes. Blank nodes are parts of a knowledge representation that don't have global identity; they're put in as a kind of glue that connects parts of a fact. For example, let's suppose that we're representing a fact that's part of the day's semantic web numerical puzzle: "number x plus number y equals two". "Number x" and "number y" are labels we're assigning to numbers that semantic web puzzle solvers around the world might attempt to map to universal concepts. Now suppose I copy this fact into another puzzle. How many facts do I have? This time, the answer is two, because "number x" might turn out to be a different number in the second puzzle. So what happens if I copy a graph with a blank node a hundred times? Do the blank nodes multiply while the universally identified nodes don't? Nobody knows!
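
A concrete way to see this, sketched with Python's rdflib (the URIs are invented): parse the same little document twice into one store, which stands in for copying the graph.

```python
from rdflib import Graph

# the same document "copied" twice: two triples use only URIs, two use a blank node
doc = """
@prefix ex: <http://example.org/> .
ex:one ex:plus ex:one ; ex:equals ex:two .
[] ex:plus ex:y ; ex:equals ex:two .
"""

g = Graph()
g.parse(data=doc, format="turtle")
g.parse(data=doc, format="turtle")  # copy the same facts in again

print(len(g))  # 6, not 4: the URI triples dedupe, the blank-node triples multiply
```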

I hope you can see that making copies of knowledge elements and moving them to different contexts is much trickier than you would have imagined. To be able to manage it properly you need more than just the RDF model. In his talk, Pat Hayes proposes something he calls "Blogic" which adds the concept of "surfaces" to provide the context for a knowledge representation graph. If we had RDF surfaces, or something like that, then the connections between data and its source would be much easier to express and maintain across the web. Similarly, it would be possible to limit the scope of potentially toxic but useful assertions such as "owl:sameAs".

There are of course other ways to go about "fixing up" RDF, but I'm guessing the main problem is a lack of enthusiasm from W3C for the project. The view of Kingsley Idehen and Tim Berners-Lee appears to be that existing machinery, perhaps bolstered by graph IDs or document IDs is good enough and that we should just get on with putting data onto the web. I'm not sure, but there may be a bit of "information just wants to be free" ideology behind that viewpoint. There may be a feeling that information should be disconnected from its source to avoid entanglements, particularly of the legal variety. My belief is a bit different- it's that knowledge just wants to be worth something. And that providing solid context for data is ultimately what gives it the most value.

P.S. Ironically, in the very first comment on my last post, Ed Summers hints at a very elegant way that the Times could have avoided a big part of the problem- they could have used entailed attribution. It's probably worth another post just to explain it.
