Encrypted Vector Databases Without Sacrificing Search


Mar 9, 2026


Modern AI products don't search for words anymore; they search for meaning. Rather than indexing documents by keywords, they embed text, images, code, and user activity into vectors in $\mathbb{R}^d$, then retrieve the nearest neighbors to a query vector. That single primitive, "find the top-$k$ closest vectors", quietly powers semantic search, recommendation systems, and retrieval-augmented generation (RAG) pipelines that feed relevant context into large language models.

To be useful, a vector database must compute similarity on the server side. And embeddings are not harmless "compressed representations." In many applications they encode deeply sensitive signals, including medical notes, internal documents, user messages, and proprietary code. The geometry of the embedding space, meaning who is close to whom and how clusters form, can itself be revealing. Even if raw text is never stored, allowing an untrusted service to observe embeddings and similarity structure can enable membership inference, semantic leakage, and, in some settings, partial reconstruction of the original inputs.

So the core question is not simply "can we encrypt embeddings?" It is: can we enable nearest-neighbor search while leaking as little as possible about the vectors and their geometry, and can we do it without destroying the efficiency properties that made vector databases attractive in the first place?

To understand why this question is hard, it helps to first understand how vector databases work, why embeddings are sensitive, and what "distance-preserving encryption" even means.

What a Vector Database Is

The AI wave of recent years has forced a rethink of something that once seemed solved: data retrieval. Large language models and generative AI systems need to fetch the right information quickly, often from enormous corpora, often in real time. The bottleneck is no longer just storing data, but finding relevant context efficiently so models can reason, answer, and act.

The key enabler is the vector embedding, a representation of data as a point in $\mathbb{R}^d$ that encodes semantic information. These representations are produced by AI models, including LLMs, and their many dimensions capture patterns, relationships, and latent structure, precisely the kind of "meaning" that powers similarity search. But that same richness makes embeddings awkward for traditional databases. They are high-dimensional, they do not fit cleanly into scalar indexing, and "querying" them means nearest-neighbor search rather than exact lookup.

Vector databases are purpose-built for this workload. They provide optimized similarity search over high-dimensional vectors and the operational features one expects from a mature database system, including metadata filters, updates, durability, and horizontal scaling. Standalone vector indexes can do fast nearest-neighbor retrieval, but they typically lack database ergonomics. Traditional scalar databases have mature infrastructure, but they were not designed for high-dimensional similarity workloads. Vector databases bridge that gap.

How Similarity Search Works in Practice

The formal task is $K$-Nearest Neighbor Search (K-NNS). Given a distance function and a query, the goal is to find the $K$ elements in the dataset that minimize the distance to that query. The naïve approach computes distances from the query to every stored element, which scales linearly with dataset size and quickly becomes infeasible. Exact K-NNS methods offer only limited speedups in high dimensions due to the "curse of dimensionality," which motivates Approximate Nearest Neighbor Search (K-ANNS): algorithms that allow small, bounded errors in exchange for dramatically faster retrieval.
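To make the linear cost of the naïve approach concrete, here is a minimal brute-force K-NNS in NumPy (an illustrative sketch; the function name and array shapes are my own):

```python
import numpy as np

def knn_exact(data: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Exact K-NNS: score every stored vector against the query.

    Each query costs O(n * d), which is what makes brute force
    infeasible at scale and motivates approximate methods.
    """
    dists = np.linalg.norm(data - query, axis=1)  # n distance computations
    return np.argsort(dists)[:k]                  # indices of the k closest

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 64))   # 10k stored vectors in R^64
query = rng.normal(size=64)
top5 = knn_exact(data, query, k=5)
```

Every one of the 10,000 stored vectors is touched per query; a K-ANNS index exists precisely to avoid this full scan.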

Among the most widely deployed approaches is Hierarchical Navigable Small World (HNSW), a fully graph-based, incremental K-ANNS structure. HNSW organizes embeddings into a multi-layer hierarchy of proximity graphs over nested subsets of elements. Each element is assigned a maximum layer drawn from an exponentially decaying probability distribution, and search proceeds by greedy routing from the top layer downward. Starting from an entry point, the algorithm repeatedly moves to the neighbor that minimizes distance to the query, tracking the best candidates discovered along the way. This hierarchical scale separation gives HNSW logarithmic complexity scaling and strong performance across a wide range of workloads.
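The greedy descent at the heart of HNSW can be sketched on a single proximity-graph layer (a toy illustration only: one layer, brute-force graph construction, no candidate list; not the full multi-layer algorithm):

```python
import numpy as np

def build_knn_graph(points: np.ndarray, k: int) -> dict[int, list[int]]:
    """Toy proximity graph: connect each point to its k nearest neighbors."""
    graph = {}
    for i, p in enumerate(points):
        dists = np.linalg.norm(points - p, axis=1)
        graph[i] = [int(j) for j in np.argsort(dists)[1 : k + 1]]
    return graph

def greedy_search(points, graph, entry: int, query: np.ndarray) -> int:
    """HNSW-style greedy routing: hop to whichever neighbor is closest
    to the query, and stop at a local minimum."""
    current = entry
    best = np.linalg.norm(points[current] - query)
    improved = True
    while improved:
        improved = False
        for nb in graph[current]:
            d = np.linalg.norm(points[nb] - query)
            if d < best:
                best, current, improved = d, nb, True
    return current

rng = np.random.default_rng(1)
pts = rng.normal(size=(500, 16))
g = build_knn_graph(pts, k=8)
q = rng.normal(size=16)
found = greedy_search(pts, g, entry=0, query=q)
```

HNSW runs this routine on each layer of the hierarchy, using the result of one layer as the entry point of the next, which is where the logarithmic scaling comes from.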

The Security Problem with Embeddings

Embeddings are often assumed to be safe because they map raw inputs to abstract, floating-point vectors. That assumption is wrong. These vectors retain, and can leak, sensitive information about the original data through at least three distinct classes of attack.

Embedding Inversion. The fact that embedding models encode information about exact words, not just abstract semantics, means adversaries can attempt to recover the original input. In white-box scenarios, where the adversary has access to the model's architecture and parameters, this is done by solving an optimization problem: find the sequence of words whose embedding minimizes distance to the target vector. In black-box scenarios, where the adversary has only query access, they can train a multi-label classification or multi-set prediction model on an auxiliary dataset, using embeddings as input and original word sets as targets. Both approaches can reconstruct a significant fraction of the original text.

Sensitive Attribute Inference. Embeddings can unintentionally expose attributes of the input that are entirely independent of the downstream task, such as the specific author of a text fragment or the demographic characteristics of a user. A simple inference classifier trained on a small number of labeled embedding vectors can often reveal these attributes. This leakage is exacerbated by unsupervised training frameworks such as contrastive learning, which maximize similarity for data appearing in the same context and thereby group data sharing latent sensitive classes, like the same author, closer together in embedding space.

Membership Inference. Embedding models also leak information about whether specific data was used to train them. Because models are trained to maximize the semantic similarity of data appearing in the same context, adversaries can use similarity scores as a signal. For word embeddings, measuring similarity scores across a sliding window of words can reveal whether that specific context appeared in the training corpus. For sentence embeddings, evaluating pairwise similarity scores, or aggregating similarity across a collection of sentences from a single user, can accurately determine whether that data was part of the training set. Rare or infrequent training inputs are particularly susceptible to this kind of memorization.

Why Exact Distance-Preserving Encryption Fails

In the classical distance-preserving perturbation setting, the data owner releases a transformed dataset

$$Y = M^\top X,$$

where $X \in \mathbb{R}^{n\times m}$ is the private data matrix (columns are records) and $M \in O(n)$ is a secret orthogonal map, so all Euclidean distances are preserved exactly. This is the dream scenario for utility: anything that depends only on distances, such as $k$-NN or clustering, behaves identically on $Y$ and $X$.
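A quick NumPy check of this property (a random orthogonal matrix built via QR decomposition stands in for the secret $M$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 16, 100
X = rng.normal(size=(n, m))                   # columns are private records

Q, _ = np.linalg.qr(rng.normal(size=(n, n)))  # random orthogonal map
Y = Q.T @ X                                   # released, "perturbed" dataset

# every pairwise Euclidean distance is preserved exactly
orig = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
pert = np.linalg.norm(Y[:, :, None] - Y[:, None, :], axis=0)
```

Any distance-based algorithm run on `Y` therefore produces exactly the same answers as on `X`, which is the utility upside and, as the attacks below show, the privacy downside.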

But from an attacker's point of view, exact distance preservation collapses privacy because it preserves too much rigid structure. The paper makes this precise by giving two realistic prior-knowledge models under which an attacker can recover the original records to small error, and sometimes perfectly.

Attack 1: Known Input–Output

Attacker capability. The attacker knows:

  • the full released perturbed database $Y$
  • a subset of $k$ original records $X_k$ together with their corresponding perturbed versions $Y_k$, meaning she knows which column maps to which.

Think of situations where a few users disclose their own vectors, a few public records exist in both spaces, or some rows leak internally. This is exactly the kind of auxiliary information that shows up in practice.

Perfect recovery when $k = n$

If the attacker knows $n$ linearly independent original records, meaning $X_k$ spans $\mathbb{R}^n$, then the orthogonal map is fully determined and every other record can be recovered exactly. Concretely, the attacker can solve for the transform and invert it on all remaining columns. In other words, $n$ linearly independent input–output pairs are enough to decrypt the entire database.
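In code, the $k = n$ case is a single linear solve (a sketch on synthetic data; here the known pairs are the first $n$ columns):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 50
X = rng.normal(size=(n, m))                    # private records (columns)
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # secret orthogonal M
Y = Q.T @ X                                    # released dataset

# attacker knows n input-output pairs with linearly independent inputs
Xk, Yk = X[:, :n], Y[:, :n]

# M^T Xk = Yk  =>  M^T = Yk Xk^{-1}; then invert on every remaining column
M_T = Yk @ np.linalg.inv(Xk)
X_recovered = M_T.T @ Y                        # M Y = M M^T X = X
```

The whole database falls out of one matrix inversion: nothing beyond the $n$ leaked pairs is needed.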

The interesting case: $k < n$ still leaks a lot

When $k < n$, there are many orthogonal matrices consistent with the known pairs:

$$\mathcal{M}(X_k, Y_k) = \{\, M \in O(n) \ :\ M X_k = Y_k \,\}.$$

The attacker does not know which one is the true $M^\top$, but she can still recover some unknown records with high probability.

A natural attack strategy in the paper is:

  1. Sample $\hat M$ uniformly from $\mathcal{M}(X_k, Y_k)$.
  2. For some unknown perturbed column $\hat y_i$ from $Y_{m-k}$, output the estimate $\hat x = \hat M^\top \hat y_i$.
  3. Choose which target record to attack by maximizing the probability of an $\varepsilon$-accurate reconstruction.
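Step 1 can be implemented by fixing the map on the known subspace and randomizing it on the orthogonal complement (a sketch; the released data is written $Y = TX$ with $T$ orthogonal to match the set $\mathcal{M}$ above, and the function name is my own):

```python
import numpy as np

def sample_consistent_map(Xk: np.ndarray, Yk: np.ndarray, rng) -> np.ndarray:
    """Sample an orthogonal T_hat satisfying T_hat @ Xk == Yk.

    On Col(Xk) the map is forced by the known pairs; on the orthogonal
    complement any rotation is consistent, so one is drawn at random
    (Haar-uniform up to QR sign conventions).
    """
    n, k = Xk.shape
    B_full, R = np.linalg.qr(Xk, mode="complete")
    B, B_perp = B_full[:, :k], B_full[:, k:]      # Col(Xk) and its complement
    C = Yk @ np.linalg.inv(R[:k, :])              # forced image of B
    C_perp = np.linalg.qr(C, mode="complete")[0][:, k:]
    O, _ = np.linalg.qr(rng.normal(size=(n - k, n - k)))
    return C @ B.T + C_perp @ O @ B_perp.T

rng = np.random.default_rng(0)
n, m, k = 10, 40, 4
X = rng.normal(size=(n, m))                   # private records (columns)
T, _ = np.linalg.qr(rng.normal(size=(n, n)))  # secret orthogonal map
Y = T @ X                                     # released dataset
Xk, Yk = X[:, :k], Y[:, :k]                   # leaked input-output pairs

T_hat = sample_consistent_map(Xk, Yk, rng)

# a record lying in the span of the known records is recovered exactly
x_in_span = Xk @ rng.normal(size=k)
x_est = T_hat.T @ (T @ x_in_span)
```

Records in (or near) the span of the known records are recovered exactly (or nearly so) by any consistent $\hat M$, which is precisely the geometric effect the breach-probability formula below quantifies.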

The paper defines an $\varepsilon$-breach event as

$$\|\hat x - x_i\| \le \varepsilon \|x_i\|$$

and derives a closed-form breach probability $\rho(x_i,\varepsilon)$ that depends on a single geometric quantity: the distance of $x_i$ from the subspace spanned by the known plaintext records.

Let $\mathrm{Col}(X_k)$ be the $k$-dimensional subspace spanned by the known records, and define

$$d(x, X_k) := \mathrm{dist}(x,\mathrm{Col}(X_k)).$$

Then the breach probability behaves like

$$\rho(x,\varepsilon) \approx \frac{2}{\pi}\arcsin\left(\frac{\varepsilon\|x\|}{2\,d(x,X_k)}\right)$$

when $\varepsilon\|x\| < 2\,d(x,X_k)$, and it becomes $1$ once $\varepsilon\|x\| \ge 2\,d(x,X_k)$.

Interpretation. As the victim's record gets closer to the subspace spanned by the known records, the probability of a breach skyrockets.
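The closed-form probability is easy to evaluate numerically (an illustrative sketch; the projection distance is computed via QR, and `near`/`far` are hypothetical victim records):

```python
import numpy as np

def breach_probability(x: np.ndarray, Xk: np.ndarray, eps: float) -> float:
    """Epsilon-breach probability from the arcsin formula above."""
    Qk, _ = np.linalg.qr(Xk)            # orthonormal basis of Col(Xk)
    proj = Qk @ (Qk.T @ x)              # projection of x onto Col(Xk)
    d = np.linalg.norm(x - proj)        # d(x, Xk)
    r = eps * np.linalg.norm(x)
    if r >= 2 * d:
        return 1.0                      # breach is certain
    return (2 / np.pi) * float(np.arcsin(r / (2 * d)))

rng = np.random.default_rng(0)
Xk = rng.normal(size=(20, 5))           # 5 known records in R^20
far = rng.normal(size=20)               # generic record, far from the span
near = Xk @ rng.normal(size=5) + 0.01 * rng.normal(size=20)  # almost in span
```

Evaluating `breach_probability(near, Xk, 0.05)` versus `breach_probability(far, Xk, 0.05)` shows the effect: the near-span record is essentially guaranteed to be breached, while the generic one is not.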

Attack 2: Known Sample

The second attack does not assume the attacker knows any specific plaintext record. Instead, it assumes she has samples from the same distribution as the private dataset.

Attacker capability. The attacker knows:

  • the released dataset $Y = M^\top X$
  • another dataset $S$ consisting of $p$ independent samples drawn from the same underlying random vector $V$ that generated the columns of $X$.

Think of situations where the attacker scrapes a similar population, has historical logs, buys third-party data, or simply uses the fact that embeddings of typical text from your domain have similar statistics.

Why PCA works against exact distance preservation

Let $V$ be a random vector of plaintext records with covariance matrix $\Sigma_V := \operatorname{Cov}(V)$. Assume the released (perturbed) vectors are obtained by an orthogonal distance-preserving map

$$Y = M^\top V \qquad\text{with } M \in O(n)\text{ and } M^\top M = I.$$

Then the transformed covariance satisfies

$$\begin{aligned} \operatorname{Cov}(M^\top V) &= \mathbb{E}\left[\left(M^\top V-\mathbb{E}[M^\top V]\right)\left(M^\top V-\mathbb{E}[M^\top V]\right)^\top\right] \\ &= \mathbb{E}\left[\left(M^\top (V-\mathbb{E}[V])\right)\left(M^\top (V-\mathbb{E}[V])\right)^\top\right] \\ &= M^\top \mathbb{E}\left[(V-\mathbb{E}[V])(V-\mathbb{E}[V])^\top\right] M \\ &= M^\top \operatorname{Cov}(V)\, M. \end{aligned}$$

Because $M$ is orthogonal, $\Sigma_{M^\top V}$ is orthogonally similar to $\Sigma_V$. By the spectral theorem, we can write

$$\Sigma_V = Z\Lambda Z^\top \quad\Longrightarrow\quad \Sigma_{M^\top V} = (M^\top Z)\Lambda(M^\top Z)^\top,$$

so $\Sigma_V$ and $\Sigma_{M^\top V}$ share the same eigenvalues, and their eigenvectors are related by multiplication with $M^\top$. Under the common assumption that $\Sigma_V$ has distinct eigenvalues, each eigenspace is one-dimensional, so eigenvectors are uniquely determined up to sign. PCA can therefore recover the principal directions in both plaintext and ciphertext space, and the hidden orthogonal transform is determined up to a diagonal sign-flip matrix.

Concretely, the attacker proceeds as follows:

  1. From auxiliary plaintext-like samples $S$, estimate $\Sigma_V$ and run PCA to obtain its eigenvector matrix $Z$.
  2. From the released perturbed dataset $Y$, estimate $\Sigma_{M^\top V}$ and run PCA to obtain its eigenvector matrix $W$.
  3. By the covariance conjugation identity, the two eigenbases are related by the hidden orthogonal map up to an unknown diagonal sign matrix $D_0 \in \{\pm1\}^{n\times n}$, yielding $M^\top = W D_0 Z^\top$.

Breaking the scheme then reduces to identifying the correct sign pattern and inverting:

$$\hat X = M Y = (Z D_0 W^\top)\, Y.$$

The paper resolves the sign ambiguity by searching over candidate sign matrices $D$ and scoring how well the mapped auxiliary samples align with the observed distribution of $Y$, selecting the $D$ that maximizes the match. As the number of samples grows, the PCA estimates concentrate and the reconstruction becomes highly accurate.
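Here is the attack end-to-end on synthetic data (a sketch under simplifying assumptions: a diagonal plaintext covariance with well-separated, distinct eigenvalues, a nonzero mean, and a sign search scored by matching the empirical mean of $Y$, which is one simple instance of distribution matching):

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 4, 5000, 5000

# plaintext distribution: nonzero mean, distinct covariance eigenvalues
mu = np.array([3.0, -2.0, 1.5, 4.0])
L = np.diag([1.0, 2.0, 3.0, 4.0])                # Sigma_V = diag(1, 4, 9, 16)
X = mu[:, None] + L @ rng.normal(size=(n, m))    # private dataset
S = mu[:, None] + L @ rng.normal(size=(n, p))    # attacker's auxiliary samples

Q, _ = np.linalg.qr(rng.normal(size=(n, n)))     # secret orthogonal M
Y = Q.T @ X                                      # released dataset

# PCA on both sides (eigh sorts eigenvectors by ascending eigenvalue)
Z = np.linalg.eigh(np.cov(S))[1]
W = np.linalg.eigh(np.cov(Y))[1]

# resolve the 2^n sign ambiguity by matching the mean of Y
best_score, M_T_hat = np.inf, None
for signs in product([-1.0, 1.0], repeat=n):
    cand = W @ np.diag(signs) @ Z.T
    score = np.linalg.norm(cand @ S.mean(axis=1) - Y.mean(axis=1))
    if score < best_score:
        best_score, M_T_hat = score, cand

X_hat = M_T_hat.T @ Y                            # estimate X = M Y
rel_err = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
```

With a few thousand samples on each side, the relative reconstruction error drops to a few percent, even though the attacker never saw a single plaintext record from the private dataset itself.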

The Takeaway for Embeddings

Exact distance-preserving "encryption" is not encryption in any cryptographic sense. It is closer to a rigid geometric disguise. If even a modest number of plaintext embeddings and their ciphertext counterparts leak, an attacker can reconstruct additional embeddings with high probability and can target the easiest victims using geometric proximity. Even without direct plaintext leakage, auxiliary data from the same distribution enables PCA-based attacks that effectively recover the hidden orthogonal transform. Exact distance preservation is fundamentally incompatible with strong privacy guarantees because it preserves precisely the global structure an attacker needs to reverse-engineer the space.

Approximate Distance-Comparison-Preserving Encryption

The solution to these problems is to relax exactness. Rather than guaranteeing that all relative distance comparisons are preserved, Approximate Distance-Comparison-Preserving Encryption ($\beta$-DCPE) guarantees only that comparisons are preserved when the difference in distances exceeds a specified approximation factor $\beta$. Formally, if three plaintext vectors $x$, $y$, and $z$ satisfy $\|x - y\| < \|x - z\| - \beta$, then after encryption the ciphertext of $x$ remains closer to the ciphertext of $y$ than to that of $z$. When distances are too similar to distinguish given the noise budget, no guarantee is made.

This relaxation is not just a convenient concession; it is a deliberate and well-motivated design choice. Even on plaintext data, approximate nearest neighbor search is the norm in high-dimensional settings, since the concept of exact distances becomes less meaningful as dimensionality grows. $\beta$-DCPE simply brings the same approximation tolerance into the encrypted setting, and the paper shows formally that any nearest-neighbor algorithm run over $\beta$-DCP encrypted vectors returns a point whose plaintext distance from the query is at most $\beta$ larger than the distance to the true nearest neighbor. In other words, the approximation factor of the search algorithm running on top is preserved up to an additive $\beta$.

This relaxation also unlocks something that exact schemes cannot have: security parameters that are independent of the underlying data. Exact distance-comparison-preserving encryption requires parameters tailored to specific datapoints in the message space, making the system rigid. Because $\beta$ is a fixed constant bounding the permissible perturbation, $\beta$-DCPE has no such dependency. The bounds on perturbation are set once and apply uniformly across the entire dataset, regardless of what the data looks like.

The Scale-and-Perturb (SAP) Scheme

The core algorithm that achieves $\beta$-DCPE in practice is the Scale-and-Perturb (SAP) scheme. Its intuition follows directly from the structure of distance-comparison-preserving functions. The paper establishes that any DCP function must be approximately distance-preserving, and that isometries of Euclidean space (rotations, reflections, translations, and scaling) are the natural candidates for building such functions. Scaling is the key operation here: multiplying all vectors by a constant factor scales all pairwise distances by the same amount, leaving relative distance comparisons perfectly unchanged. Adding a bounded noise vector to the result then obscures the exact transformation without violating the $\beta$-DCP guarantee, provided the noise is kept within a strict radius.

Parameters and Primitives

Before encryption, the scheme defines:

  • Message space $\mathcal{M}$: a discrete $d$-dimensional space $[-M, M]^d$
  • Approximation factor $\beta$ with $\beta < 2M\sqrt{d}$
  • Keyspace $\mathcal{S}$ from which the scaling factor is drawn
  • Pseudorandom function $\text{PRF}: \{0,1\}^k \times \{0,1\}^l \rightarrow \{0,1\}^*$, used to derandomize the perturbation so the same noise can be reconstructed at decryption time

The master key is a tuple $(s, K)$, where $s$ is sampled uniformly from $\mathcal{S}$ and $K$ is a PRF key sampled from $\{0,1\}^k$.

Encryption

To encrypt a plaintext vector $m$:

  1. Sample a random nonce $n \in \{0,1\}^l$.
  2. Compute $\texttt{coins}_1 \| \texttt{coins}_2 = \text{PRF}(K, n)$.
  3. Sample the noise vector $\lambda_m$ uniformly from a ball of radius $\frac{s\beta}{4}$:
    • sample $u \sim \mathcal{N}(0, I_d)$
    • sample $x' \sim U(0,1)$
    • compute $x = \frac{s\beta}{4}(x')^{1/d}$
    • set $\lambda_m = \frac{x}{\|u\|}u$; this ensures uniform sampling over the volume of the ball
  4. Compute the ciphertext $c = sm + \lambda_m$.
  5. Output $(c, n)$.

Decryption

Given $(c,n)$ and key $(s,K)$:

  1. Recompute the same pseudorandom coins with the PRF.
  2. Reconstruct $\lambda_m$.
  3. Recover $m = \frac{c-\lambda_m}{s}$.
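Putting encryption and decryption together, the scheme fits in a few lines (a simplified sketch: HMAC-SHA256 stands in for the PRF, seeding NumPy's generator from the PRF output is my own shortcut for reconstructing the noise deterministically, and the function names and parameters are illustrative):

```python
import hashlib
import hmac
import os
import numpy as np

def _prf_noise(key: bytes, nonce: bytes, d: int, radius: float) -> np.ndarray:
    """Derive lambda_m deterministically from PRF(K, n): a point drawn
    uniformly from the d-ball of the given radius."""
    seed = int.from_bytes(hmac.new(key, nonce, hashlib.sha256).digest(), "big")
    prg = np.random.default_rng(seed)
    u = prg.normal(size=d)                      # direction uniform on the sphere
    x = radius * prg.uniform() ** (1.0 / d)     # radius uniform over the volume
    return (x / np.linalg.norm(u)) * u

def sap_encrypt(m_vec: np.ndarray, s: float, key: bytes, beta: float):
    nonce = os.urandom(16)
    lam = _prf_noise(key, nonce, len(m_vec), s * beta / 4)
    return s * m_vec + lam, nonce               # ciphertext is (c, n)

def sap_decrypt(c: np.ndarray, nonce: bytes, s: float, key: bytes, beta: float):
    lam = _prf_noise(key, nonce, len(c), s * beta / 4)
    return (c - lam) / s

key = os.urandom(32)
s, beta = 12.0, 0.5
m_vec = np.random.default_rng(0).normal(size=64)
c, nonce = sap_encrypt(m_vec, s, key, beta)
recovered = sap_decrypt(c, nonce, s, key, beta)
```

Because the noise is a deterministic function of the key and the nonce, decryption subtracts exactly the perturbation that encryption added, and the ciphertext never strays more than $\frac{s\beta}{4}$ from the scaled plaintext.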

The Mathematical Guarantee

Because the perturbation radius is bounded by $\frac{s\beta}{4}$, the triangle inequality ensures that if

$$\|x-y\| < \|x-z\| - \beta$$

in plaintext space, the same comparison holds after encryption.
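Spelling the argument out (a short derivation consistent with the noise bound above): each ciphertext satisfies $\|c_u - su\| \le \frac{s\beta}{4}$, so for any pair $\|c_x - c_y\| \le s\|x-y\| + \frac{s\beta}{2}$ and $\|c_x - c_z\| \ge s\|x-z\| - \frac{s\beta}{2}$. Chaining these:

```latex
\|c_x - c_y\|
  \le s\|x-y\| + \tfrac{s\beta}{2}
  < s\bigl(\|x-z\| - \beta\bigr) + \tfrac{s\beta}{2}
  = s\|x-z\| - \tfrac{s\beta}{2}
  \le \|c_x - c_z\|,
```

so the ciphertext of $x$ indeed stays closer to the ciphertext of $y$ than to that of $z$.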

Strengthening SAP with Preprocessing

SAP, as described so far, encrypts individual vectors on-the-fly given only the secret key. This is already sufficient for the core $\beta$-DCP guarantee, but it leaves two residual attack surfaces open. The first is identity leakage: even without knowing plaintext values, an adversary who observes a sequence of ciphertexts can potentially match them to known records based on position, timing, or other contextual signals. The second is frequency leakage: because SAP is a deterministic function of the plaintext and key (up to the nonce), an adversary who can observe many ciphertexts may be able to infer which plaintext values appear frequently by building a histogram of ciphertext statistics, the kind of attack that has broken order-preserving and order-revealing encryption schemes in practice.

The paper addresses both of these with two preprocessing techniques applied to the dataset before encryption. Crucially, the two techniques target distinct threats and are designed to compose cleanly.

Shuffling

Before encrypting, the entire dataset is randomly permuted according to a uniformly sampled permutation $\pi$:

$$(m_1, \dots, m_n) \mapsto (m_{\pi(1)}, \dots, m_{\pi(n)}).$$

The shuffled dataset is then encrypted and returned in that permuted order. Because the permutation is secret, the adversary receives an unordered multiset of ciphertexts with no reliable correspondence between positions and identities. Even if the adversary knows some plaintext values in advance, they cannot determine which ciphertext encodes which record.

This is the step that enables the paper's main indistinguishability result. The security notion used is called Real-or-Replaced (RoR): an adversary submits a dataset, and the challenger either encrypts and shuffles it as-is, or first replaces one randomly chosen point with a nearby point sampled uniformly within distance $\delta$, then shuffles and encrypts. The adversary's inability to tell these two worlds apart corresponds directly to resistance against membership inference, meaning it cannot determine whether a particular individual's data is present in the database. The paper proves concrete bounds on the RoR advantage as a function of $\beta$, $\delta$, and the dataset size, and the shuffling step is essential to achieving those bounds; without it, ciphertexts remain implicitly labeled by position and the security argument breaks down.

Normalization

The second preprocessing step converts the plaintext distribution to approximate multivariate normality, for example using the Box-Cox transform. This is not a correctness requirement for SAP (the encryption algorithm works correctly on any input), but it is a requirement for the paper's positive security results against frequency-finding attacks.

The concern with frequency leakage is well-established in the property-preserving encryption literature. An adversary who can estimate the empirical distribution of ciphertexts can construct an approximate histogram of plaintext values, and from there partially or completely reconstruct the data. The paper's defense against this is a security notion called Attribute Window One-Wayness (AWOW), which guarantees that it is computationally hard to guess an approximate frequency for any plaintext attribute. The proof of AWOW for SAP, however, assumes that the plaintext vectors are sampled i.i.d. from a multivariate normal distribution. Normalization is the step that makes this assumption true in practice.

If the embeddings being encrypted already follow a distribution close to multivariate normal, which is not uncommon for high-dimensional embedding models trained with certain objectives, this step can be skipped. Otherwise, it serves as the bridge between the messy statistical reality of real embedding data and the distributional assumptions needed for the formal security proofs to apply.
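A minimal Box-Cox fit chooses the exponent $\lambda$ by maximum likelihood over a grid, one coordinate at a time (an illustrative sketch, equivalent in spirit to `scipy.stats.boxcox`; the function names are my own):

```python
import numpy as np

def boxcox_fit(x: np.ndarray, lambdas=np.linspace(-2, 2, 81)):
    """Pick the lambda maximizing the Box-Cox log-likelihood, then transform.
    Requires strictly positive inputs."""
    assert (x > 0).all()
    n = len(x)
    best_ll, best_lam, best_y = -np.inf, None, None
    for lam in lambdas:
        y = np.log(x) if abs(lam) < 1e-12 else (x**lam - 1) / lam
        # profile log-likelihood of the transformed sample under normality
        ll = -0.5 * n * np.log(y.var()) + (lam - 1) * np.log(x).sum()
        if ll > best_ll:
            best_ll, best_lam, best_y = ll, lam, y
    return best_lam, best_y

def skewness(y: np.ndarray) -> float:
    z = (y - y.mean()) / y.std()
    return float((z**3).mean())

rng = np.random.default_rng(0)
raw = np.exp(rng.normal(0.0, 0.5, size=5000))   # heavily right-skewed data
lam, transformed = boxcox_fit(raw)
```

On this lognormal input the fitted $\lambda$ lands near zero (the log transform), and the transformed sample is close to symmetric, which is exactly the shape the AWOW proof assumes.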

Practical Demo: Encrypting Embeddings for Pinecone with IronCore Alloy

In this section we’ll walk through an end-to-end integration using Pinecone as the vector store and IronCore Alloy as the client-side encryption layer. The workflow mirrors what you would deploy in production: embeddings are generated (or loaded) on the client, encrypted locally, and only then uploaded to Pinecone. Queries follow the same rule: encrypt the query embedding locally, run the ANN search over ciphertext vectors, and decrypt the returned metadata back on the client.

IronCore Alloy gives you two primitives:

  1. Vector encryption (approximate search still works)
  2. Standard encryption (AES-style encryption for metadata)

import os
import time
import base64
import asyncio

import pinecone
from pinecone_datasets import load_dataset

import torch
from sentence_transformers import SentenceTransformer

import ironcore_alloy as alloy


# -----------------------------
# 1) Alloy SDK setup (Standalone)
# -----------------------------

def make_alloy_sdk(*, master_key_b64: str, approximation_factor: float) -> tuple[alloy.Standalone, alloy.AlloyMetadata]:
    """
    Standalone mode: you manage the master secret locally.
    - approximation_factor controls the accuracy/privacy tradeoff for vector search.
    """
    key_bytes = base64.b64decode(master_key_b64)
    if len(key_bytes) != 32:
        raise ValueError("ALLOY_MASTER_KEY_B64 must decode to exactly 32 bytes")

    # Vector secrets are scoped by secret_path ("namespace") and derivation_path ("vector type").
    vector_secrets = {
        "quora": alloy.VectorSecret(
            approximation_factor,
            alloy.RotatableSecret(
                alloy.StandaloneSecret(1, alloy.Secret(key_bytes)),
                None,
            ),
        )
    }

    # Standard (non-vector) encryption secrets for metadata / documents.
    standard_secrets = alloy.StandardSecrets(
        1,
        [alloy.StandaloneSecret(1, alloy.Secret(key_bytes))],
    )

    deterministic_secrets = {}  # fill only if you need equality-searchable encrypted filters

    config = alloy.StandaloneConfiguration(standard_secrets, deterministic_secrets, vector_secrets)
    sdk = alloy.Standalone(config)

    # Tenant metadata: used for key scoping in multi-tenant apps.
    # If you have tenants/users/orgs, put an ID here (e.g., "org_123").
    tenant = alloy.AlloyMetadata.new_simple("")
    return sdk, tenant


# -----------------------------
# 2) Helper: encrypt one record
# -----------------------------

async def encrypt_record(
    sdk: alloy.Standalone,
    tenant: alloy.AlloyMetadata,
    *,
    vec: list[float],
    text: str,
    secret_path: str = "quora",
    derivation_path: str = "sentence",
):
    # Encrypt the embedding for similarity search.
    plaintext_vec = alloy.PlaintextVector(
        plaintext_vector=vec,
        secret_path=secret_path,
        derivation_path=derivation_path,
    )
    enc_vec = await sdk.vector().encrypt(plaintext_vec, tenant)

    # Encrypt the metadata (here, the original text). This is normal encryption (not searchable).
    enc_doc = await sdk.standard().encrypt({"text": text.encode("utf-8")}, tenant)

    # Pinecone metadata must be JSON-serializable → store bytes as base64 strings.
    meta = {
        "text": base64.b64encode(enc_doc.document["text"]).decode("ascii"),
        "edek": base64.b64encode(enc_doc.edek).decode("ascii"),
    }
    return enc_vec.encrypted_vector, meta


# -----------------------------
# 3) Main: encrypt → upsert → encrypted query → decrypt metadata
# -----------------------------

async def main():
    # ----- Dataset -----
    dataset = load_dataset("quora_all-MiniLM-L6-bm25")
    docs = dataset.documents.copy()

    # the dataset stores original text in `blob`; rename to `metadata` to match your script style
    docs.drop(["metadata"], axis=1, inplace=True)
    docs.rename(columns={"blob": "metadata"}, inplace=True)

    # use a slice (80k rows)
    docs = docs.iloc[240_000:320_000].reset_index(drop=True)

    dim = len(docs.iloc[0]["values"])

    # ----- Pinecone -----
    pinecone.init(
        api_key=os.environ["PINECONE_API_KEY"],
        environment=os.getenv("PINECONE_ENV", "gcp-starter"),
    )

    index_name = "semantic-search-fast-encrypted"

    if index_name not in pinecone.list_indexes():
        pinecone.create_index(name=index_name, dimension=dim, metric="cosine")
        time.sleep(2)

    index = pinecone.GRPCIndex(index_name)

    # ----- Alloy -----
    # IMPORTANT: set this env var to a secure random 32-byte key (base64-encoded)
    # e.g. python -c "import os,base64; print(base64.b64encode(os.urandom(32)).decode())"
    sdk, tenant = make_alloy_sdk(
        master_key_b64=os.environ["ALLOY_MASTER_KEY_B64"],
        approximation_factor=2.0,
    )

    # ----- Embedding model (only used for queries here; dataset already has vectors) -----
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

    # ----- Encrypt + upsert in batches -----
    print("Encrypting and uploading encrypted vectors + encrypted metadata...")

    batch_size = 200
    to_upsert = []

    for i, row in enumerate(docs.itertuples(index=False), start=1):
        vec = row.values
        text = row.metadata["text"]

        enc_vec, meta = await encrypt_record(sdk, tenant, vec=vec, text=text)
        # Use dataset's existing id field (commonly `id`). If your dataset uses a different field, change here.
        doc_id = str(row.id)

        to_upsert.append((doc_id, enc_vec, meta))

        if len(to_upsert) >= batch_size:
            index.upsert(vectors=to_upsert)
            to_upsert.clear()

        if i % 5000 == 0:
            print(f"  processed {i}/{len(docs)} records")

    if to_upsert:
        index.upsert(vectors=to_upsert)

    # ----- Encrypted query + decrypt results -----
    async def run_query(q: str, top_k: int = 5):
        print(f"\nQuery: {q}")

        q_vec = model.encode(q).tolist()

        # IMPORTANT: you do NOT call `encrypt` for queries.
        # You call `generate_query_vectors`, which may output 1+ encrypted query vectors depending on scheme.
        plaintext_q = alloy.PlaintextVector(
            plaintext_vector=q_vec,
            secret_path="quora",
            derivation_path="sentence",
        )
        qvs = (await sdk.vector().generate_query_vectors({"q": plaintext_q}, tenant))["q"]
        enc_q = qvs[0].encrypted_vector

        res = index.query(vector=enc_q, top_k=top_k, include_metadata=True)

        for match in res["matches"]:
            # Rebuild EncryptedDocument from base64 fields
            enc_doc = alloy.EncryptedDocument(
                edek=base64.b64decode(match["metadata"]["edek"]),
                document={"text": base64.b64decode(match["metadata"]["text"])},
            )
            dec = await sdk.standard().decrypt(enc_doc, tenant)
            print(f"{match['score']:.3f}: {dec['text'].decode('utf-8')}")

    await run_query("which city has the highest population in the world?")
    await run_query("which metropolis has the highest number of people?")

    # Optional cleanup:
    # pinecone.delete_index(index_name)


if __name__ == "__main__":
    asyncio.run(main())

The output looks something like

Encrypting and uploading encrypted vectors + encrypted metadata...

Query: which city has the highest population in the world?
0.65:  What's the world's largest city?
0.6:  What is the biggest city?
0.56:  What are the world's most advanced cities?
0.54:  Where is the most beautiful city in the world?
0.54:  What is the greatest, most beautiful city in the world?

Query: which metropolis has the highest number of people?
0.51:  What is the biggest city?
0.5:  What is the most dangerous city in USA?
0.49:  How many people to in the United States?
0.49:  What's the world's largest city?
0.48:  What are some of the most dangerous cities in America?

Conclusion

Vector databases have become core infrastructure for modern AI systems, but they introduce new privacy risks that are easy to underestimate. Embeddings are not harmless abstractions. They can be inverted to recover original text, mined for sensitive attributes, and used to infer training membership. And exact distance-preserving encryption, the most intuitive response to this problem, fails because it preserves the global geometric structure that makes these attacks possible in the first place.

Approximate Distance-Comparison-Preserving Encryption provides a principled alternative. By tolerating a bounded approximation error β\beta and adding uniform spherical noise to scaled plaintexts, the Scale-and-Perturb scheme breaks the rigid structure that enables reconstruction while preserving enough relational information to support approximate nearest-neighbor search. Paired with shuffling and normalization as preprocessing steps, it achieves provable resistance against both membership inference and frequency-finding attacks, and does so with security parameters that are dataset-independent and bit-security that compares favorably to order-preserving encryption. The result is a scheme that enables privacy-preserving retrieval for AI systems without sacrificing the practical performance that makes vector databases worth using in the first place.