Fascinating article here on the transient nature of web information, which is a real problem in the research biz.
In research described in the journal Science last month, the team looked at footnotes from scientific articles in three major journals — the New England Journal of Medicine, Science and Nature — at three months, 15 months and 27 months after publication. The prevalence of inactive Internet references grew during those intervals from 3.8 percent to 10 percent to 13 percent. […] In one recent study, one-fifth of the Internet addresses used in a Web-based high school science curriculum disappeared over 12 months.
This isn’t just a matter of having to update documents. The idea of the “world-wide web” was the interconnectivity of data and the sites that host them. If data vanishes — or moves without a forwarding address — the whole enterprise is in danger.
It’s as though every day I went into the library, threw out a dozen books, and rebound a dozen more under new titles. Ultimately, not only would the card catalog be useless; even the bibliographies within the books would be.
Of course, even conventional footnotes often lead to dead ends. Some experts have estimated that as many as 20 percent to 25 percent of all published footnotes have typographical errors, which can lead people to the wrong volume or issue of a sought-after reference, said Sheldon Kotzin, chief of bibliographic services at the National Library of Medicine in Bethesda.
But the Web’s relentless morphing affects a lot more than footnotes. People are increasingly dependent on the Web to get information from companies, organizations and governments. Yet, of the 2,483 British government Web sites, for example, 25 percent change their URL each year, said David Worlock of Electronic Publishing Services Ltd. in London.
That matters in part because some documents exist only as Web pages — for example, the British government’s dossier on Iraqi weapons. “It only appeared on the Web,” Worlock said. “There is no definitive reference where future historians might find it.”
While there are efforts to archive the Web, the data volumes involved are tremendous. Efforts are underway to “fingerprint” documents with persistent codes that could survive migration from one page to another, but it’s still a constant game of catch-up.
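The fingerprinting idea can be illustrated with a content hash: derive the identifier from the document’s bytes rather than its address, so the same text carries the same code no matter where it moves. This is only a sketch of the general principle, assuming a plain SHA-256 digest; real persistent-identifier schemes (DOIs, handles) work through registries instead, since a pure content hash changes whenever the document is revised.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Derive a location-independent identifier from content.

    The digest depends only on the bytes of the document, so it
    stays the same no matter what URL the document migrates to.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# The same content always yields the same fingerprint...
doc = "Some document text"
assert fingerprint(doc) == fingerprint(doc)

# ...while even a one-character change yields a different one,
# which is why registry-based schemes are needed for revised documents.
assert fingerprint(doc) != fingerprint(doc + "!")
```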
Speaking personally, I know that almost every time I go back to an older page on this blog and try to follow the link, I have only about a 35% chance of actually getting to what I linked to. News stories, of course, tend to be transient (especially at national news organizations), and links to stories on Yahoo are good for a few weeks at best.
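That informal 35% figure could be measured rather than guessed: a small script can walk a list of old links and report how many still resolve. A minimal sketch, assuming Python’s standard library, treating any 4xx/5xx response or connection failure as a dead link; real link rot is messier, since a page can answer 200 OK while its content has long since been replaced.

```python
import urllib.error
import urllib.request

def is_dead_status(status: int) -> bool:
    """Classify an HTTP status code: 4xx/5xx counts as a dead link."""
    return 400 <= status < 600

def check_link(url: str, timeout: float = 10.0) -> bool:
    """Return True if the link still resolves, False if it appears dead."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return not is_dead_status(resp.status)
    except urllib.error.HTTPError as err:
        return not is_dead_status(err.code)
    except (urllib.error.URLError, OSError):
        return False  # DNS failure, refused connection, timeout, etc.

# Survey a list of old links and report the survival rate, e.g.:
# links = ["https://example.com/old-story", ...]
# alive = sum(check_link(u) for u in links)
# print(f"{alive}/{len(links)} links still resolve")
```

A HEAD request keeps the survey cheap, since only headers come back; a stricter check would fetch the body and compare it against a saved fingerprint to catch pages whose content has silently changed.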
I’ve taken, because of that, to quoting on my page whatever key data I want to retain. It would be easier, sometimes, to just throw up a link (and often it makes for a better joke), but it’s also a lot more hazardous; I have several posts that say only, “Hey, this is keen, I’ll have to remember this in the future,” and the link goes nowhere. Oh, well …
The most stable links, in many cases, are to blogs and personal web sites — perhaps because most folks tend to stick with (or be stuck with) a single page structure for a much longer time, given the tremendous effort it can take to migrate information around.
But even there, it happens. People change domain names (through personal preference, fleeing from spam, or because they change ISPs), or change technology types (.html pages suddenly becoming .shtml), etc.
The incredible power of the Internet is the ability to rapidly churn out and reconfigure information. Its greatest weakness may turn out to be just the same thing.
(via Tyler Cowen)