Why is this harder than normal data privacy?
Traditional SaaS privacy is well-understood. Client A's data goes in Client A's database. Row-level security or separate schemas keep it isolated. Auditors can verify the boundary.
AI systems introduce three complications that traditional privacy architecture doesn't handle.
First, vector embeddings encode information that can't be surgically deleted. When you embed a document into a vector database, the embedding captures semantic meaning. Deleting the source document doesn't remove the information encoded in the vector. Under GDPR's right to erasure (Article 17), a user can request their data be deleted. But "their data" now lives in an embedding that's mathematically entangled with the model's understanding.
Second, the value of AI increases with data volume, creating pressure to share across tenants. An AI system that has seen 50 client codebases is better at code review than one that has seen 1. But sharing those codebases, or even sharing embeddings derived from them, violates the isolation boundary.
Third, AI outputs can inadvertently leak input data. If an agent trained on Client A's architecture patterns generates similar patterns for Client B, is that a data leak or a learned methodology? The regulatory answer isn't clear, which means the architectural answer needs to be conservative.
What does the regulatory landscape actually require?
Three regulations matter for most enterprise AI deployments, and they layer on top of each other.
GDPR compliance covers roughly 80% of POPIA requirements, so building GDPR-first is the practical approach. The EU AI Act adds a new layer: if your AI system is used for employment decisions (hiring, firing, performance reviews), it's classified as high-risk under Annex III, triggering conformity assessments and additional documentation requirements. The critical enforcement date is August 2026.
The NIST AI Risk Management Framework and its Generative AI Profile (NIST AI 600-1) provide the governance structure, noting that "LLMs are already able to discover vulnerabilities in systems and write code to exploit them." The framework structures risk across four core functions: govern, map, measure, and manage.
Sweden's IMY (data protection authority) fined two pharmacies a combined SEK 45M (~€3.8M) in 2025 for improper use of Meta analytics. The Netherlands' AP fined Clearview AI €30.5M in 2024. These aren't theoretical penalties. They're active enforcement in the markets where enterprise AI is deploying.
How do you delete someone's data from a vector embedding?
This is the question that keeps AI privacy architects up at night. You can't surgically remove one person's contribution from a high-dimensional embedding. Two practical approaches exist:
Option A: Re-embed the entire collection after deletion. Delete the source documents, then rebuild the vector index from scratch. This is thorough but expensive, especially for large collections. Best used as a periodic cleanup (monthly or quarterly).
Option B: Deletion log with query-time filtering. Maintain an append-only deletion log. When a deletion request comes in, mark the affected embeddings. At query time, filter results to exclude marked embeddings. The vectors technically still exist, but they're never surfaced. This is more practical for ongoing operations and can be combined with periodic re-embedding.
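A minimal sketch of Option B, assuming a generic vector store whose query results carry the source document ID; the `store.query` call and field names are illustrative, not a specific vendor API.

```python
from dataclasses import dataclass, field


@dataclass
class DeletionLog:
    """Append-only record of erased source documents (Option B)."""
    erased_doc_ids: set[str] = field(default_factory=set)

    def record_erasure(self, doc_id: str) -> None:
        # GDPR Article 17 request: mark every embedding derived from this document.
        self.erased_doc_ids.add(doc_id)

    def filter(self, results: list[dict]) -> list[dict]:
        # Query-time filter: never surface vectors derived from erased documents.
        return [r for r in results if r["source_doc_id"] not in self.erased_doc_ids]


def search(store, log: DeletionLog, query_vector: list[float], top_k: int = 5) -> list[dict]:
    # Over-fetch, filter, then trim to top_k so erased documents
    # don't silently shrink the result set.
    raw = store.query(vector=query_vector, top_k=top_k * 2)  # hypothetical client call
    return log.filter(raw)[:top_k]
```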
Most production systems use Option B for day-to-day compliance and Option A on a schedule. The key is that both approaches require you to maintain a mapping between source documents and their derived embeddings, which means your ingestion pipeline needs to track provenance from day one.
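Provenance tracking can be as simple as persisting a document-to-vector mapping at ingestion time. A sketch under that assumption; the chunking, ID scheme, and `store.upsert` call are illustrative.

```python
import hashlib
import json
from pathlib import Path

PROVENANCE_FILE = Path("provenance.jsonl")  # append-only doc -> vector-ID mapping


def chunk(text: str, size: int = 1000) -> list[str]:
    # Naive fixed-size chunking; real pipelines usually chunk semantically.
    return [text[i:i + size] for i in range(0, len(text), size)]


def ingest(doc_id: str, text: str, tenant: str, store) -> None:
    """Embed a document and record exactly which vector IDs it produced."""
    vector_ids = []
    for n, piece in enumerate(chunk(text)):
        vid = hashlib.sha256(f"{tenant}:{doc_id}:{n}".encode()).hexdigest()[:16]
        store.upsert(id=vid, text=piece, tenant=tenant)  # hypothetical client call
        vector_ids.append(vid)
    # This mapping is what makes both targeted filtering (Option B)
    # and scheduled re-embedding (Option A) possible later.
    with PROVENANCE_FILE.open("a") as f:
        f.write(json.dumps({"tenant": tenant, "doc_id": doc_id, "vector_ids": vector_ids}) + "\n")
```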
What does multi-tenant isolation look like for AI systems?
Three models, each with a different cost and risk profile:
Silo: a separate database per tenant. Highest isolation, highest cost.
Bridge: separate schemas in a shared database. Medium on both counts.
Pool: shared tables with row-level security. Lowest cost, and the highest risk if the row-level security policy has a bug.
For vector databases specifically, Weaviate's native multi-tenancy gives each tenant its own physical shard, supporting 50,000+ active tenants per node with an architectural guarantee of zero cross-tenant search results. This is the silo model applied to embeddings.
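A minimal sketch using the Weaviate Python client (v4), assuming a local instance with a vectorizer module configured. Every read and write is scoped to one tenant's shard, so a cross-tenant result is impossible by construction rather than by filter.

```python
import weaviate
from weaviate.classes.config import Configure
from weaviate.classes.tenants import Tenant

client = weaviate.connect_to_local()

# Each tenant gets its own physical shard inside the collection.
docs = client.collections.create(
    name="ClientDocuments",
    multi_tenancy_config=Configure.multi_tenancy(enabled=True),
)
docs.tenants.create([Tenant(name="client-a"), Tenant(name="client-b")])

# All reads and writes are scoped to a single tenant; there is no unscoped query path.
client_a_docs = docs.with_tenant("client-a")
client_a_docs.data.insert({"text": "Client A architecture decision record"})
results = client_a_docs.query.near_text(query="authentication flow", limit=5)

client.close()
```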
The critical principle: no data path should exist between tenants. Not "we filter at query time" (that's a policy, and policies have bugs). The architecture itself should make cross-tenant queries impossible: separate embedding namespaces, domain-scoped agent sessions locked to a single tenant, and output classification watermarking, where outputs inherit the classification of the highest-classified input so confidential data can't be silently laundered into public outputs.
For organizations running multiple client engagements, intra-tenant isolation matters too. Information barriers (borrowed from financial services "Chinese wall" patterns) enforce bidirectional access restrictions between groups within the same organization, with every cross-barrier access attempt logged.
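A sketch of that barrier check, assuming an in-memory barrier table and standard-library logging; a real deployment would back this with the access-control layer, but the shape is the same: deny by default across the barrier and log every attempt.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("information_barrier")

# Bidirectional barriers between engagement groups inside one organization.
BARRIERS: set[frozenset[str]] = {
    frozenset({"client-a-team", "client-b-team"}),
}


def can_access(actor_group: str, resource_group: str) -> bool:
    """Deny and log any access attempt that crosses a declared barrier."""
    if actor_group == resource_group:
        return True
    if frozenset({actor_group, resource_group}) in BARRIERS:
        log.warning("BLOCKED cross-barrier access: %s -> %s", actor_group, resource_group)
        return False
    log.info("cross-group access (no barrier declared): %s -> %s", actor_group, resource_group)
    return True
```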
How do you improve AI across clients without sharing their data?
The architecture separates patterns from data.
Client data stays physically isolated. Embeddings, documents, conversation histories, and generated outputs never leave the tenant boundary. No cross-tenant retrieval, even if semantically relevant.
Patterns get extracted anonymously. What you learn isn't "Client A's authentication architecture" but "projects with microservice architectures tend to benefit from this prompting strategy" or "code review agents produce better output when given the dependency graph alongside the diff." These are methodology-level insights: which agent configurations work, which prompt structures produce better results, which tool descriptions reduce task completion time.
This is not federated learning (which operates on model weights). It's simpler. You track which operational decisions (model routing, context strategies, prompt structures) correlate with better outcomes across your anonymized performance data. The client's IP never leaves their silo. What crosses the boundary is the equivalent of "4-door sedans get better highway mileage than 2-door coupes," not any individual car's GPS history.
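A sketch of what actually crosses the boundary: outcome scores keyed by operational configuration, with no documents, embeddings, or tenant identifiers attached. The configuration keys and scoring are illustrative.

```python
from collections import defaultdict
from statistics import mean

# What leaves each silo: (configuration, outcome score) pairs.
# No documents, no embeddings, no tenant identifiers.
observations: dict[str, list[float]] = defaultdict(list)


def record_outcome(config_key: str, outcome_score: float) -> None:
    """Log how well an operational choice worked, stripped of tenant context."""
    observations[config_key].append(outcome_score)


def best_configs(min_samples: int = 20) -> list[tuple[str, float]]:
    """Rank configurations by mean outcome, ignoring thinly supported ones."""
    ranked = [
        (key, mean(scores))
        for key, scores in observations.items()
        if len(scores) >= min_samples
    ]
    return sorted(ranked, key=lambda kv: kv[1], reverse=True)


# e.g. record_outcome("diff+dependency-graph", 0.87)
#      record_outcome("diff-only", 0.74)
```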
Anthropic's API data policy helps here: API data is retained for 7 days only, never used for model training, and not shared with third parties. This means client data sent through the API doesn't end up in the foundation model's weights.
What the research says
"LLMs are already able to discover vulnerabilities in systems and write code to exploit them."
"Providers and deployers of AI systems shall take measures to ensure, to the best extent possible, a sufficient level of AI literacy of their staff."
"Organizations implementing ABAC experience 73% fewer access-related security incidents compared to those using RBAC alone."
Our take
The teams that get privacy right treat it as an architectural constraint, not a compliance checkbox. The isolation boundary should be enforced by infrastructure (separate embedding namespaces, physically sharded vector stores, no cross-tenant query paths), not by policy alone. Policies have bugs. Infrastructure has guarantees.
What we've found is that data minimization, the GDPR requirement to only process the minimum necessary data, is also the best cost optimization strategy. Every token you don't send to the API is both a privacy improvement and a cost saving. Code review that sends only changed files plus direct dependencies instead of the entire codebase achieves 80-95% token reduction while being more compliant and more accurate.
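A sketch of that minimization step, assuming the diff comes from git and a dependency graph is already available as a mapping from file to its direct dependencies; only changed files plus those direct dependencies get sent to the model.

```python
import subprocess


def changed_files(base: str = "main") -> set[str]:
    """Files touched by the current branch relative to base."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    )
    return {line for line in out.stdout.splitlines() if line}


def review_context(dependency_graph: dict[str, set[str]], base: str = "main") -> set[str]:
    """Changed files plus their direct dependencies: the minimum a reviewer needs."""
    changed = changed_files(base)
    direct_deps = (
        set().union(*(dependency_graph.get(f, set()) for f in changed)) if changed else set()
    )
    return changed | direct_deps
```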
The improvement-without-leakage problem is solvable because the most valuable learnings aren't client-specific. They're operational: which prompting strategies work, which agent configurations produce better results, which tool designs reduce errors. You don't need Client A's data to improve Client B's experience. You need the anonymized pattern that emerged from serving both.
Key takeaway
Client data stays in its silo. Embeddings stay in their namespace. No cross-tenant query paths exist architecturally, not just by policy. What crosses boundaries are anonymized operational patterns, not data. And with EU AI Act enforcement beginning August 2026, the window to get this architecture right is closing.
FAQ
Can you delete someone's data from a vector embedding?
Not surgically. You can delete the source document and either re-embed the entire collection (thorough but expensive) or maintain a deletion log that filters out affected embeddings at query time (practical for daily operations). Most production systems use filtering day-to-day and periodic re-embedding on a schedule.
Does GDPR apply to AI-generated outputs?
Yes. If an AI system generates output that contains or is derived from personal data, that output is subject to GDPR. Purpose limitation means data collected for one task (code review) can't be reused for another (model training) without separate legal basis.
What is the EU AI Act and when does it take effect?
The EU AI Act classifies AI systems by risk level. Prohibited practices (social scoring, real-time biometric surveillance) have been enforceable since February 2025. High-risk AI system requirements (including systems used for employment decisions) take effect in August 2026. Fines range from €15M or 3% of global annual turnover (high-risk violations) to €35M or 7% (prohibited practices), whichever is higher.
How do you handle data residency requirements?
Deploy compute and storage in the required jurisdiction. EU data stays in EU regions. South African data can stay in SA or EU (POPIA recognizes EU as having adequate protection). Cross-border transfers to LLM providers (like Anthropic's US-based API) require Standard Contractual Clauses and a Transfer Impact Assessment.
What's the difference between silo, bridge, and pool isolation?
Silo: separate database per tenant (highest isolation, highest cost). Bridge: separate schemas in shared database (medium). Pool: shared tables with row-level security (lowest cost, highest risk if RLS has a bug). For AI embeddings specifically, silo is recommended because a single misconfigured query in a pool model could expose all tenants' semantic data.
Does Anthropic train on API data?
No. Anthropic's API data policy specifies 7-day retention with no model training use and no sharing with third parties. This is a meaningful privacy advantage for enterprise deployments.
How much does compliance cost?
Budget roughly $80K-$150K for the first 18 months covering SOC 2 Type II, ISO 27001, DPAs with SCCs, DPIAs, and POPIA registration. Annual ongoing costs are $25K-$50K. A compliance automation platform (Vanta, Drata) reduces staff time by 60-70%.