Europe’s Open Web Index Pilot Unleashes Nearly 1 Petabyte of Public Web Data, Challenging Tech Giants


In a landmark move for digital transparency, the European Open Web Index (OWI) project has launched its pilot phase, granting researchers and developers access to nearly 1 petabyte of publicly crawled web data. Dubbed *MS-C931*, the initiative marks a critical step toward redefining how the internet is indexed, shared, and governed, free from the grip of commercial search engine monopolies.

Funded by the European Union’s Horizon Europe program, the OWI aims to democratize access to web data, fostering innovation in search technologies while prioritizing ethical standards and data sovereignty. Unlike the proprietary indexes behind commercial search engines such as Google and Microsoft’s Bing, the OWI’s dataset is open-access, transparent, and built collaboratively by a consortium of European academic institutions, including the Centrum Wiskunde & Informatica (CWI), Fraunhofer Society, and multiple universities.

A New Era of Open Search

The MS-C931 pilot, announced at a June event hosted by the OWI consortium, represents the largest public release of crawled web data to date. To put its scale into perspective: 1 petabyte is roughly equivalent to 500 billion pages of text, or about 13 years of HD video. The dataset spans 12 European languages and includes content from over 100 million domains, curated to comply with EU privacy laws such as the GDPR.
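
As a rough sanity check on the comparison above, the short Python sketch below works backward from the quoted figures. It assumes decimal units (1 petabyte = 10^15 bytes) and a 365.25-day year. Neither assumption comes from the OWI consortium, and the implied page size and video bitrate are only what the quoted numbers suggest, not published specifications.

```python
# Back-of-envelope check of the scale comparison.
# Assumption: decimal (SI) units, so 1 petabyte = 10**15 bytes.
PETABYTE = 10**15

# Figure quoted in the article: 500 billion pages of text.
pages = 500 * 10**9
bytes_per_page = PETABYTE / pages  # implied average size of one page

# Figure quoted in the article: roughly 13 years of HD video.
# Assumption: a 365.25-day year.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
video_seconds = 13 * SECONDS_PER_YEAR
implied_bitrate_mbps = PETABYTE * 8 / video_seconds / 10**6

print(f"Implied page size: {bytes_per_page:.0f} bytes")
print(f"Implied video bitrate: {implied_bitrate_mbps:.1f} Mbit/s")
```

Under these assumptions, the quoted figures imply an average of about 2 KB of text per page and a video bitrate of a little under 20 Mbit/s.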

“This isn’t just about data—it’s about reclaiming the internet as a public good,” said Dr. Elena Müller, lead researcher at CWI. “For decades, a handful of corporations have controlled what we see online. With the OWI, we’re handing the keys back to the people, enabling startups, academics, and policymakers to build fairer, more accountable tools.”

Why the Open Web Index Matters

Commercial search engines rely on proprietary algorithms and closed datasets, shaping what information users access—and what remains hidden. Critics argue this centralization stifles competition, amplifies biases, and leaves users vulnerable to opaque content moderation practices. The OWI seeks to counter this by providing a neutral, auditable foundation for next-generation search engines, fact-checking systems, and AI models.

Early adopters have already experimented with the dataset. A team at the University of Helsinki used it to train a multilingual AI tool that detects climate misinformation, while a Berlin-based startup developed a prototype of a privacy-focused search engine. “The OWI lets us innovate without begging for access to walled gardens,” said the startup’s co-founder, Markus Vogel.

Challenges Ahead

Despite its promise, the OWI faces hurdles. Maintaining a petabyte-scale index requires significant infrastructure spending, and the consortium is exploring sustainable funding models, including public-private partnerships. There’s also the question of keeping the data current. “Crawling the web is like painting a bridge: you finish, and it’s already time to start again,” laughed Dr. Müller.

Privacy advocates have praised the project’s strict adherence to the GDPR but warn against potential misuse. “Open data must not come at the expense of user rights,” cautioned Lucia Mariani, a digital ethicist at the University of Bologna. The OWI team emphasizes that all data is sourced from publicly accessible websites, with mechanisms to remove sensitive content upon request.

What’s Next?

The pilot will run through 2024, with plans to expand the index to 5 petabytes by 2025. The consortium also aims to integrate real-time crawling and enhance multilingual coverage, particularly for underrepresented languages like Estonian and Maltese.

For now, the MS-C931 dataset is available to approved researchers and organizations via the OWI’s secure portal. As the project grows, its architects envision a future where the web’s map is owned not by a corporation but by everyone. “This is just the beginning,” said Dr. Müller. “The open web deserves an open index.”


The Open Web Index project invites developers, academics, and civil society groups to explore its dataset and join its mission. Learn more about its goals and upcoming initiatives at openwebsearch.eu.
