Data Streaming for AI: From Extractive Training to Sovereign Infrastructure
DWeb Camp 2026
2026-07-11 13:30–14:30, AI Barn

AI systems are consuming the world's content without compensating its creators. This session explores data streaming as a new paradigm — where content flows to AI in real time, with built-in rights management, usage tracking, and fair compensation — and asks what it would take to make this infrastructure decentralized, sovereign, and governed by the communities it serves.

The generative AI revolution has a dirty secret: it is built on content that was taken without consent, used without compensation, and consumed without traceability. The lawsuits are piling up (Getty v. Stability AI, Authors Guild v. OpenAI), lump-sum licensing deals are being signed behind closed doors, and the vast majority of content creators — from independent journalists to national libraries — have no seat at the table.
This is not just a legal problem. It is an infrastructure problem. And it is one that the decentralized web community is uniquely positioned to help solve.

The shift from training to inference
Most of the public debate about AI and content rights focuses on training data: who scraped what, and whether it was legal. But the real value chain is shifting. The rise of Retrieval-Augmented Generation (RAG) and the Model Context Protocol (MCP) — now backed by Anthropic, OpenAI, Google, and Microsoft, with 97 million monthly SDK downloads — means that AI systems increasingly access content in real time, at inference time, rather than memorizing it during training.
This changes everything. If AI accesses your content at the moment a user asks a question, that access can be metered, traced, and billed — just like streaming music. The content doesn't need to be copied or stored. It can remain exactly where it is: on your servers, under your control, governed by your rules.
This is data streaming for AI. And it opens the door to a fundamentally different relationship between content creators and AI systems — one based on consent, compensation, and sovereignty rather than extraction.
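To make the streaming-music comparison concrete, here is a minimal sketch of per-access metering and billing. It is an illustration under stated assumptions, not a description of any deployed system: the `AccessEvent` fields, the identifiers, and the flat per-access prices are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class AccessEvent:
    """One inference-time retrieval of a document by an AI platform."""
    consumer: str      # hypothetical AI platform identifier
    owner: str         # hypothetical content owner identifier
    document_id: str
    price: Decimal     # per-access price set by the owner (assumed flat)


def bill(events):
    """Sum what each consumer owes each owner across metered accesses."""
    totals = defaultdict(Decimal)  # Decimal() starts every total at 0
    for event in events:
        totals[(event.consumer, event.owner)] += event.price
    return dict(totals)


# Illustrative events only; none of these names come from a real deployment.
events = [
    AccessEvent("assistant-a", "library-x", "doc-1", Decimal("0.002")),
    AccessEvent("assistant-a", "library-x", "doc-2", Decimal("0.002")),
    AccessEvent("assistant-b", "journal-y", "doc-9", Decimal("0.010")),
]
invoice = bill(events)
```

Because each access is a discrete event, the same log can support both billing and attribution. `Decimal` is used instead of floats so that many tiny per-access amounts add up exactly.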

What data streaming looks like in practice
At its simplest, data streaming means deploying a secure connector on the content owner's infrastructure that makes their content accessible to AI platforms in real time. The content is used for inference only — never stored, never copied, never used for training. Every query is metered and traced. The content owner sets the price, the conditions, and who gets access.
This model is already in production. France's national library (BnF) has a live data streaming connector serving 40 million documents to AI agents. OpenAIRE, the European open science infrastructure, streams 600 million research papers through 28 MCP tools. The European Legal Data Space is building sovereign AI agents for lawyers that access legal content through this kind of infrastructure.
These are not prototypes. They are real systems serving real users, with real content flowing through them — and the content owners maintain full control.
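The connector idea described in this section can be sketched in a few lines. This is a toy model under assumptions, not the implementation used by any of the deployments named above: the `Policy` and `Connector` classes, their field names, and the in-memory document store are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Policy:
    """Terms chosen by the content owner (all field names are illustrative)."""
    allowed_consumers: set
    price_per_query: float
    allowed_purpose: str = "inference"


@dataclass
class Connector:
    """Runs on the owner's side and answers one query at a time."""
    documents: dict
    policy: Policy
    trace: list = field(default_factory=list)

    def query(self, consumer, purpose, document_id):
        allowed = (
            consumer in self.policy.allowed_consumers
            and purpose == self.policy.allowed_purpose
            and document_id in self.documents
        )
        # Every request is traced, whether or not it is served.
        self.trace.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "consumer": consumer,
            "purpose": purpose,
            "document_id": document_id,
            "served": allowed,
            "charged": self.policy.price_per_query if allowed else 0.0,
        })
        if not allowed:
            raise PermissionError(
                f"{consumer} may not access {document_id} for {purpose}"
            )
        return self.documents[document_id]


connector = Connector(
    documents={"doc-1": "Full text of a digitized manuscript."},
    policy=Policy(allowed_consumers={"assistant-a"}, price_per_query=0.002),
)
text = connector.query("assistant-a", "inference", "doc-1")
```

Note the limit of this sketch: it checks only the purpose the consumer declares. Code on the owner's side cannot by itself prevent a consumer from retaining what it receives, so the "never stored" guarantee would have to come from contracts, audits, or other mechanisms outside this example.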

Why this matters for the DWeb community
The current trajectory of AI content licensing is heading toward a deeply centralized outcome. A handful of large publishers are signing exclusive deals with a handful of large AI companies. Everyone else — independent creators, cultural institutions, scientific publishers, niche content providers — is locked out.
Data streaming offers an alternative path, but only if the infrastructure is built on the right principles. This is where the DWeb community's expertise is critical.

Three open questions that I believe this community can help answer:

1. Can data streaming infrastructure be decentralized?
Today, most data streaming deployments use Kubernetes clusters connected via mTLS tunnels. This works well for institutional publishers, but it requires significant technical infrastructure. Could this be done with peer-to-peer protocols instead? Could a small publisher or an independent journalist run a data streaming node from their laptop? What would a lightweight, decentralized version of this infrastructure look like?
The technical building blocks may already exist in the DWeb ecosystem: content-addressed storage, decentralized identity, peer-to-peer networking, cryptographic attestation. The question is how to compose them into something that serves the specific needs of AI-era content distribution.
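As one possible way to compose two of these building blocks, the sketch below derives a content address from a document's bytes and attaches a verifiable tag to a usage receipt. It rests on assumptions: the receipt layout is hypothetical, and the shared-key HMAC from the Python standard library is only a stand-in for the public-key signatures and decentralized identifiers a real peer-to-peer design would more likely use.

```python
import hashlib
import hmac
import json


def content_id(data: bytes) -> str:
    """Content address: the SHA-256 digest of the bytes, in hex."""
    return hashlib.sha256(data).hexdigest()


def make_receipt(key: bytes, consumer: str, data: bytes) -> dict:
    """Build a usage receipt whose tag covers the consumer and content ID."""
    body = {"consumer": consumer, "content_id": content_id(data)}
    # Sorted keys give a stable byte encoding to authenticate.
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {**body, "tag": tag}


def verify_receipt(key: bytes, receipt: dict) -> bool:
    """Recompute the tag and compare it in constant time."""
    body = {k: v for k, v in receipt.items() if k != "tag"}
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["tag"])


key = b"demo-key-not-for-production"  # illustrative shared secret
article = b"An independent journalist's article."
receipt = make_receipt(key, "assistant-a", article)
is_valid = verify_receipt(key, receipt)
```

The receipt names the content by its hash rather than by its location, so it stays meaningful wherever the bytes are hosted. Changing either the consumer name or the content ID after the fact makes verification fail.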

2. What does "Copyfair" look like in practice?
I have been developing a licensing model called Copyfair, a new paradigm for content licensing in the age of AI. Copyfair is to AI what copyleft is to software: a way to ensure that content can flow freely while protecting the rights and revenues of its creators.
The core principle: content should be accessible to AI systems for inference, but every use should be traced, attributed, and compensated. Unlike traditional copyright (which restricts access by default) or Creative Commons (which grants standardized permissions up front), Copyfair creates a dynamic, usage-based framework where the terms of access are embedded in the infrastructure itself.
The DWeb community has deep experience with machine-readable licenses (Creative Commons), programmable rights (Story Protocol's PIL), and decentralized governance. I would love to explore how these traditions can inform the design of Copyfair — and whether a community-governed license framework could become a public good that serves all content creators, not just those who can afford to negotiate with OpenAI.
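To illustrate what "terms embedded in the infrastructure" could mean, here is a minimal sketch of machine-readable terms and a check against them. The schema is entirely hypothetical and is not an actual Copyfair specification: the field names, the price, and the currency are assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Terms:
    """Hypothetical machine-readable terms; not an actual Copyfair schema."""
    permitted_uses: tuple
    attribution_required: bool
    price_per_use: str      # decimal string, to avoid float rounding
    currency: str


def check(terms: Terms, use: str, will_attribute: bool) -> tuple:
    """Return (allowed, reason) for a proposed use under the given terms."""
    if use not in terms.permitted_uses:
        return False, f"use '{use}' is not permitted"
    if terms.attribution_required and not will_attribute:
        return False, "attribution is required"
    return True, f"allowed at {terms.price_per_use} {terms.currency} per use"


terms = Terms(
    permitted_uses=("inference",),
    attribution_required=True,
    price_per_use="0.002",
    currency="EUR",
)
# The same terms can travel with the content as plain JSON.
serialized = json.dumps(asdict(terms), sort_keys=True)
decision = check(terms, "inference", will_attribute=True)
```

Keeping the terms declarative means any party can evaluate them in the same way, which is one precondition for a community-governed framework rather than one negotiated contract at a time.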

3. Who governs the data economy?
The most important question is not technical — it is political. If data streaming becomes the standard way that AI accesses content, who decides the rules? Who sets the default terms? Who resolves disputes?
Today, this is decided by bilateral contracts between publishers and AI companies. But there are alternatives. In Denmark, a single collective licensing organization (DPCMO) represents 99% of national news publishers and negotiates on their behalf. In France, organizations like the CFC (Centre Français d'exploitation du droit de Copie) and the SNE (Syndicat National de l'Édition) could play a similar role.
Could the DWeb community help design governance structures for the AI data economy that are more democratic, more transparent, and more inclusive than the current system? Could DAOs, cooperatives, or other decentralized governance models be applied to collective content licensing?

What I hope to explore together
This session is an invitation to think together about what the infrastructure for a fair AI data economy should look like — and to consider whether the principles and technologies of the decentralized web can play a foundational role in building it.
I will share what I have learned from building and deploying data streaming infrastructure with national libraries, scientific publishers, and legal institutions across Europe. But more importantly, I want to hear from this community: what would you build differently? What are we missing? What exists in the DWeb ecosystem that we should be building on?
The window is narrow. The standards are being set now. The protocols are being chosen now. If we wait for the centralized platforms to define the rules, we will end up with a data economy that looks exactly like the platform economy we already have — extractive, opaque, and governed by the few.
Let's build something better.

About the speaker
Primavera De Filippi is a Research Director at the French National Centre for Scientific Research (CNRS) and a Faculty Associate at the Berkman Klein Center for Internet & Society at Harvard University, working at the intersection of law, technology, and governance. She is the author of "Blockchain and the Law" (Harvard University Press, 2018) and "Blockchain Governance" (MIT Press, 2024), and the creator of the Copyfair licensing model. She is a member of the WIPO Technical Exchange Network and serves on the Global Tech Thinkers advisory group to the French Presidency.
