
The Web Isn’t Forever: New Research Findings from “Not Your Parents' Web” Project

Oct 10, 2024

Today, we’re excited to share the release of findings from the "Not Your Parents' Web" project –– an extensive study that explores the lifespan of millions of web pages across the last 26 years. The collaborative project from Internet Archive, Old Dominion University’s Web Science & Digital Libraries Research Group, and Filecoin Foundation (FF) analyzed data from the Wayback Machine to reveal key insights into the lifespan of URLs and the ephemeral nature of the web.

The study examined 27.3 million archived URLs across 7 million unique hosts, spanning from 1996 to 2021. The scale and depth of this analysis provide a fresh perspective on the transience of web content, highlighting both the fragility and longevity of the digital world.

In uncovering the alarming rate at which web content disappears, this study underscores the critical role that decentralized storage solutions like the InterPlanetary File System (IPFS) and Filecoin can play in the future of digital preservation. Decentralized technology allows web content to be safely and redundantly stored across a distributed network, reducing the risks of centralization, such as single points of failure and content takedowns –– ensuring our digital heritage remains accessible for future generations.

Key Findings from "Not Your Parents' Web" Project

The findings reveal a stark reality about the fleeting nature of most online content. The median lifespan of a web page is a mere 2.3 years.

Two factors indicate that reality might be even worse. First, the research didn’t analyze page content, just HTTP status codes, meaning some pages that responded could just be parked domain names. Second, the Wayback Machine is less likely to archive URLs that were short-lived or unpopular.

Recent data from Pew Research Center supports these findings, showing that 38% of web pages that existed in 2013 are no longer accessible today. That report indicates that over the past decade, 25% of all web pages sampled from 2013 to 2023 have become inaccessible, with older content being particularly vulnerable to disappearance.

Deep Links and Root URLs: A Tale of Two Lifespans

The "Not Your Parents' Web" project also sheds light on the varying lifespans of different types of URLs. Deep links –– URLs pointing to content deep within a website –– have a median lifespan of just 1.3 years, which means half of them disappear within about 16 months. By contrast, root URLs, such as a website's homepage, show less fragility:

  • 10% disappear within a year
  • 50% have a lifespan of 10 years or more
  • 20% have lasted for over 20 years
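To make the statistics above concrete, here is a minimal Python sketch of how a median lifespan and the survival shares could be computed from a list of per-URL lifespans. The function name `lifespan_summary` and the sample values are hypothetical and invented for illustration; they are not the study's data or code, and the sketch makes no assumption about how the study measured each lifespan.

```python
from statistics import median


def lifespan_summary(lifespans_years):
    """Summarize a list of URL lifespans, each expressed in years."""
    n = len(lifespans_years)
    if n == 0:
        raise ValueError("need at least one lifespan")
    return {
        # Half of the URLs lived shorter than this, half lived longer.
        "median_years": median(lifespans_years),
        # Share of URLs that disappeared in under one year.
        "share_gone_within_1y": sum(1 for y in lifespans_years if y < 1) / n,
        # Share of URLs that lasted ten years or more.
        "share_lasting_10y_plus": sum(1 for y in lifespans_years if y >= 10) / n,
        # Share of URLs that lasted more than twenty years.
        "share_lasting_over_20y": sum(1 for y in lifespans_years if y > 20) / n,
    }


# Hypothetical lifespans for ten root URLs (illustrative only, not study data).
sample = [0.5, 2, 4, 7, 9, 11, 12, 15, 21, 23]
summary = lifespan_summary(sample)
print(summary)
```

The sample was chosen so that the shares echo the shape of the root-URL figures listed above; real data would of course involve millions of values rather than ten.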

These findings align with Pew’s research, which revealed that 23% of news web pages and 21% of government web pages contain at least one broken link, indicating the widespread issue of link rot across various sectors. Additionally, a 2021 study by Harvard Law School and the New York Times, which looked at 2.2 million URLs, found that 25% of links in New York Times articles were completely broken and no longer pointed to their intended sources.

The implications of these findings are significant for the integrity and accessibility of online information. As deep links and root URLs vanish at an alarming rate, vast amounts of web-based content become lost to time, raising concerns about the long-term availability of critical resources, especially in sectors like journalism, government, and academia.

The prevalence of broken links, as highlighted by studies from Pew and Harvard Law School and the New York Times, not only undermines trust in digital archives but also jeopardizes historical accuracy, transparency, and the public’s ability to access reliable information. This issue reinforces the need for resilient solutions, like decentralized storage, to safeguard knowledge in the digital age.

The Enduring Early Web and the Reality of Digital Decay

Interestingly, while the modern web is marked by rapid turnover, early web pages have shown remarkable resilience. Nearly half of all URLs archived between 1996 and 2000 were still active in 2023, serving as digital relics of the internet’s formative years. Despite these pockets of longevity, the broader picture remains one of decay. The half-life of a URL is roughly two years, meaning that half of all web pages disappear within that time frame.

This digital decay is not confined to old content; even more recent pages from 2021 show a significant rate of disappearance, with one in five becoming inaccessible within just two years.

The State of the Web in 2023

As part of the "Not Your Parents' Web" project, a fresh web crawl was conducted in 2023 to determine the current status of the sampled URLs. The results are sobering:

About 60% of the 27.3 million URLs analyzed are no longer accessible on the live web:

  • 27% returned HTTP response errors (e.g. a 404 error).
  • 9% were unable to connect to the web server.
  • 23% encountered DNS failures, indicating that the domain name could no longer be resolved.
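The three failure categories above (which together account for 59%, hence "about 60%") correspond to different stages at which a request can fail. The following is a minimal sketch, using only the Python standard library, of how a liveness check might sort outcomes into those buckets. The names `classify_failure` and `check_url` are hypothetical, and this is not the crawler used in the study, whose implementation is not described here.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc):
    """Map an exception raised by urlopen to one of three failure buckets."""
    # The server answered, but with an error status such as 404 or 500.
    if isinstance(exc, urllib.error.HTTPError):
        return "http_error"
    # URLError wraps the underlying socket error in its .reason attribute;
    # for any other exception, inspect the exception itself.
    reason = getattr(exc, "reason", exc)
    # The domain name could not be resolved.
    if isinstance(reason, socket.gaierror):
        return "dns_failure"
    # The name resolved, but no usable connection was made
    # (refused, reset, timed out, and so on).
    return "connection_failure"


def check_url(url, timeout=10):
    """Return 'ok' or a failure bucket for a single URL (performs a live request)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "ok"
    except (urllib.error.URLError, OSError) as exc:
        return classify_failure(exc)
```

Calling `check_url("https://example.com/some/page")` would perform a real network request. Note that, like the study's status-code approach, this sketch does not inspect page content, so a parked domain that responds successfully would still be counted as "ok".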

For users, this means encountering broken links, missing content, and gaps in the digital record –– experiences that illustrate the importance of robust web preservation efforts to ensure that critical information remains accessible, despite the web's inherent instability.

Pew Research’s examination of broken links in various online spaces supports this narrative, revealing that 54% of Wikipedia pages contain at least one link that points to a page that no longer exists, further highlighting the scope of digital decay.

Why This Matters

The findings from the "Not Your Parents' Web" project offer critical insights into the lifecycle of web pages and raise important questions about the preservation of digital heritage. As the internet continues to evolve, much of its history is at risk of disappearing. Without the work of archiving projects, the early web — and even more recent digital content — would likely be lost forever.

Several projects supported by FF and Filecoin Foundation for the Decentralized Web (FFDW) focus on archiving and digital preservation:

  • Flickr Foundation focuses on preserving digital heritage by archiving millions of photographs shared on the Flickr platform, ensuring visual culture remains accessible for future generations.
  • Internet Archive, best known for its Wayback Machine, is instrumental in preserving the web’s history by creating an extensive archive of web pages, allowing users to access snapshots of websites over time.
  • Prelinger Archives specializes in preserving historical films and ephemeral media, contributing to a comprehensive audiovisual record of 20th-century culture.
  • Starling Lab pioneers the use of decentralized technologies to ensure the permanence and integrity of digital records, tackling both the technological and ethical challenges of preserving digital content.

These findings highlight the urgent need for broader awareness of the fragility of online content and the importance of web preservation. As digital archivists, developers, and users, we must continue to support initiatives that safeguard the web’s legacy. The "Not Your Parents' Web" project serves as a powerful reminder of how quickly the web can change, and how easily its history can be lost if we do not act.

Do you want to dig further into the research findings? Read more about the methodology and see a detailed breakdown of findings in the ODU Web Science and Digital Libraries Research Group blog post.
