Privacy-Preserving LLMs in Practice: A Full-Stack Approach with dstack and GPU TEE

Writer: Harry Jeon

Intro: Why do we need privacy-preserving LLMs?

AI services are now woven into daily life, and users routinely pour highly personal information into their prompts. Because state-of-the-art AI models need clusters of high-end GPUs, they usually run in centralized clouds rather than on local devices. This cloud-first architecture gives the provider complete visibility into every token a user submits - unless strong technical and contractual barriers are in place.

Traditional safeguards - on-premise deployment, data siloing, and rigorous regulatory compliance - can all be layered onto an LLM stack, but they address only part of the threat surface. To guarantee that an AI system cannot leak user data even if its operators act maliciously, cryptographic protections must be embedded in the infrastructure itself. When you can offer a mathematical proof that ‘the provider cannot see a user’s sensitive data’, individuals and enterprises alike can trust that their most valuable information remains private.

These cryptography-based protections deliver that level of assurance and provide several decisive advantages over conventional methods, as detailed below:

(Comparison of cryptography-based protections and conventional safeguards)

We’ve built Panda, a privacy-preserving LLM inference service that never compromises the confidentiality of user data, using Trusted Execution Environments (TEEs). This article outlines how we achieved this with GPU TEEs and verifiable infrastructure.

Technical Foundation: GPU TEE & dstack

Panda is a privacy-first LLM service that ensures that interactions remain entirely private and inaccessible to any third party, including us, the platform provider. This is achieved by combining end-to-end encryption with TEE, enabling AI models to run securely on the server side without ever exposing user data in plaintext.

CPU TEE

The foundation of TEE technologies used in Panda starts from CPU TEEs. These are designed to create secure enclaves inside the processor that protect both code execution and data storage, ensuring the privacy and integrity of any program running within. These enclaves provide strong isolation from the host operating system, hypervisor, and even the cloud provider itself. A well-known example is Intel SGX, which offers a protected memory region called the Enclave Page Cache (EPC). This encrypted and isolated space enables confidential computing by safeguarding both the data and control flow inside the enclave.

To ensure not just privacy but also trust in what’s running, CPU TEEs support remote attestation. This process allows the enclave to produce a signed proof of its own identity and integrity - verifying exactly which code is loaded and that the enclave hasn’t been tampered with. The attestation report includes a hash of the enclave’s code and configuration (called the measurement), which external parties can check before sending sensitive data inside.
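
In practice, a relying party checks such a report in two steps: it validates the vendor-rooted signature chain, then compares the reported measurement against an allowlist of known-good builds. A minimal sketch of that second step (the report shape and values here are hypothetical simplifications):

```typescript
// Hypothetical, simplified attestation-report shape; real SGX/TDX
// reports are binary structures checked against Intel's cert chain.
interface AttestationReport {
  measurement: string;     // hex hash of the enclave's code and config
  signatureValid: boolean; // result of the vendor signature-chain check
}

// Only send secrets to enclaves whose measurement we recognize.
const KNOWN_GOOD_MEASUREMENTS = new Set<string>([
  "9f2b4c0e…", // placeholder: hash of an audited enclave build
]);

function isTrustedEnclave(report: AttestationReport): boolean {
  return report.signatureValid && KNOWN_GOOD_MEASUREMENTS.has(report.measurement);
}
```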

However, SGX imposes several limitations: its enclave memory is small, and code inside the enclave cannot directly execute syscalls, forcing developers to refactor applications and manage complex proxying mechanisms.

To address these limitations, hardware vendors such as Intel and AMD have introduced VM-based TEEs such as Intel TDX and AMD SEV-SNP. Rather than securing a small isolated region, these technologies encrypt the entire guest virtual machine’s memory, conceal the CPU state using a separate privilege layer, and treat the host hypervisor as untrusted. This allows for a more scalable confidential computing model compared to enclave-based approaches.

GPU TEE

One of the most viable technologies for ensuring complete confidentiality of user data in LLM services, even from the service provider, is the GPU TEE.

(NVIDIA H100 Confidential Computing initialization process | Source: NVIDIA Technical Blog)

GPU TEE builds on the security foundation of CPU TEE to extend data protection into the GPU, allowing sensitive workloads to remain confidential across the full compute stack. One of the most advanced and practical implementations of this today is NVIDIA Confidential Computing (CC), which currently stands as the only viable TEE technology for GPUs.

NVIDIA CC begins with a secure setup process based on a hardware Root-of-Trust (RoT), which is physically embedded on-die within the GPU itself. This RoT securely holds cryptographic secrets and the unique identity of the GPU, serving as the anchor for establishing trust. To create a secure connection between the CPU and the GPU, the system uses a protocol called SPDM (Security Protocol and Data Model). SPDM performs mutual authentication and sets up encrypted communication using a session key between the GPU and a Confidential Virtual Machine (CVM) running inside a CPU TEE, such as AMD SEV-SNP or Intel TDX.

Data transfer between the CPU and GPU is handled by Direct Memory Access (DMA) engines built into the GPU. When CC mode is activated, these DMA engines are restricted to only read from and write to a specific region of GPU-accessible memory called the “encrypted bounce buffer.” This memory region is explicitly designated and isolated by the GPU’s memory controller and is not accessible by any other host processes or applications. As a result, any data moving between the CPU and GPU stays within a secure, encrypted channel, ensuring that it cannot be observed or tampered with by the host system or external software.

By combining CPU-based encryption and attestation with the GPU’s hardware Root-of-Trust, NVIDIA’s GPU TEE creates a secure, end-to-end environment where data remains protected throughout its entire lifecycle, even from infrastructure operators or hardware providers themselves.

Dstack & private-ml-sdk

One way to implement a privacy-preserving AI solution with GPU TEEs is to use private-ml-sdk, which is built on top of dstack.

Dstack, built by Phala and Flashbots, is an intuitive SDK that streamlines the deployment of arbitrary containerized apps within TEEs. It offers the tooling needed for seamless operation and for the transparency and verifiability of applications running in TEEs.

With dstack, users can launch multiple CVMs, each defined by a single docker compose file. These CVMs run a custom, reproducible OS image that is pre-integrated with dstack’s runtime components. Upon boot, the system automatically launches the docker containers specified in the compose file.

Verifiability of Dstack

(Architecture of dstack-os to ensure verifiability | Source: Phala Blog)

To support full verifiability of the overall architecture, including the application running inside the CVM, dstack generates an attestation that includes the hash of the docker compose file. Intel TDX, currently the only TEE hardware supported by dstack, provides multiple cryptographic measurements in its remote attestation to verify the launch configuration and runtime integrity of a TDX guest VM. Dstack leverages the following measurements:

  1. MRTD: Measurement of Trust Domain. It provides a static measurement of the guest VM build process and the initial contents of the guest VM, serving as the base trust anchor.

  2. RTMR: Run-Time Measurement Register. An array of general-purpose measurement registers for measuring additional logic and data loaded into the guest VM at run-time.

There are four RTMRs, RTMR0 through RTMR3, in which dstack stores the following information:

  • RTMR0 - VM hardware setup

  • RTMR1 - Linux kernel image measurement

  • RTMR2 - kernel cmdline and initrd measurements

  • RTMR3 - Dstack app details

RTMR3 includes the hash of the docker compose file that the operator injected into the CVM, along with other necessary information about the application. Verifying an attestation generated inside a dstack CVM therefore ensures that neither the hardware nor the application running inside it has been manipulated by a malicious actor.
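
As an illustration, a verifier can recompute the compose-file hash locally and check it against the value replayed into RTMR3 (the field names on the parsed quote are assumptions; consult dstack’s verifier tooling for the actual layout):

```typescript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Recompute the hash of the compose file we expect the CVM to run.
const composeText = readFileSync("docker-compose.yaml", "utf8");
const expected = createHash("sha256").update(composeText).digest("hex");

// `rtmr3Events` is a stand-in for however the verified quote exposes
// the app details that dstack extended into RTMR3.
function composeHashMatches(quote: { rtmr3Events: { composeHash: string }[] }): boolean {
  return quote.rtmr3Events.some((e) => e.composeHash === expected);
}
```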

TDX also allows arbitrary custom data, called ReportData, to be embedded in the attestation quote. Dstack provides an API, exposed through a guest agent, to request attestation generation with custom ReportData.
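
For example, an app inside the CVM might bind its public key to a quote roughly like this (the guest-agent endpoint and payload shape are assumptions for illustration):

```typescript
import { createHash } from "node:crypto";

// Hash the application's public key into the 64-byte ReportData field
// (SHA-512 output is exactly 64 bytes).
async function quoteWithPubkey(pubkeyPem: string): Promise<string> {
  const reportData = createHash("sha512").update(pubkeyPem).digest("hex");
  // Hypothetical guest-agent endpoint; dstack exposes quote generation
  // via a local API inside the CVM.
  const res = await fetch("http://localhost:8090/attest", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ report_data: reportData }),
  });
  const { quote } = await res.json();
  return quote; // hex-encoded TDX quote embedding our ReportData
}
```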

dstack-kms

(Resilient, decentralized operation of TEE application using dstack-kms | Source: Phala Blog)

In addition to its attestation capabilities, dstack also ensures the transparency and resilience of applications running within the TEEs by integrating with blockchain infrastructure. The central component responsible for this feature is dstack-kms, a key management service designed specifically for confidential computing use cases.

One of the attack vectors in privacy-sensitive applications is code substitution. A malicious service provider could temporarily replace the legitimate application with one that exfiltrates sensitive user data to an external server. Such an attack may leave no trace if it occurs between attestation checks, and detecting it is extremely difficult unless clients frequently and consistently verify the attestation reports of the TDX-enabled environment. Dstack solves this problem by enforcing application transparency through an onchain commitment scheme: every application running inside the TEE must submit a hash of its container configuration to a smart contract known as AppAuth. Since this contract is publicly visible, any observer, including users, can audit which application hashes are registered and look for discrepancies. If a hash appears that doesn’t correspond to a known or trusted version, users can immediately suspect foul play, effectively creating a decentralized watchdog mechanism.
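
For instance, an auditor could watch registered hashes with a simple contract read (the ABI fragment and address below are placeholders, not the real AppAuth interface):

```typescript
import { ethers } from "ethers";

// Placeholder ABI fragment; see dstack's contracts for the real interface.
const APP_AUTH_ABI = [
  "function isAppAllowed(bytes32 composeHash) view returns (bool)",
];

const provider = new ethers.JsonRpcProvider("https://mainnet.optimism.io");
const appAuth = new ethers.Contract("0xAppAuth…", APP_AUTH_ABI, provider); // placeholder address

// Flag any compose hash that is not registered onchain.
async function audit(composeHash: string): Promise<void> {
  if (!(await appAuth.isAppAllowed(composeHash))) {
    console.warn(`Unregistered app hash detected: ${composeHash}`);
  }
}
```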

Beyond integrity, dstack also addresses the problem of key compromise and recovery. Dstack operates under the strict assumption that ‘even a TEE can be compromised’. When the key used to encrypt data inside a TEE is compromised, recovering the service is hard because keys are tied to specific hardware. Dstack solves this through dstack-kms, which manages cryptographic keys and the corresponding derivation function inside a separate CVM instance. It can derive the same keys across multiple TEE instances or rotate them.
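
Conceptually, deriving the same key on multiple instances means keying the derivation on stable application identity rather than hardware identity. A rough HKDF sketch (the inputs are illustrative; dstack-kms defines its own scheme):

```typescript
import { hkdfSync } from "node:crypto";

// Same (rootSecret, appId, purpose) always yields the same key,
// no matter which TEE instance requests it.
function deriveAppKey(rootSecret: Buffer, appId: string, purpose: string): Buffer {
  return Buffer.from(hkdfSync("sha256", rootSecret, appId, purpose, 32));
}

const root = Buffer.from("root-secret-held-by-kms"); // placeholder root secret
const keyV1 = deriveAppKey(root, "app-1234", "storage-encryption/v1");
const keyV2 = deriveAppKey(root, "app-1234", "storage-encryption/v2"); // rotation = new info string
```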

To do this, dstack-kms interacts with another smart contract called KmsAuth. This contract emits events whenever the application code changes or a new TEE instance is registered, notifying KMS nodes when a key needs to be provisioned or rotated. Upon such an event, the KMS node securely transmits the appropriate key to the target TEE instance. The communication channel between the KMS and the TEE is protected using RA-TLS, a remote attestation-based TLS protocol in which each endpoint verifies that the other is a genuine TEE instance before proceeding. This guarantees end-to-end confidentiality even in the presence of network-level adversaries.

To further mitigate the risks associated with KMS compromise, dstack supports MPC-based operation of multiple KMS nodes. This removes the single point of failure inherent in centralized key management, ensuring that even if one KMS TEE is compromised, the attacker cannot reconstruct the full key without compromising a threshold of additional nodes.
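
Dstack’s actual MPC protocol is more involved, but the threshold property can be illustrated with textbook Shamir secret sharing, where any t of n shares reconstruct the key and fewer than t reveal nothing (a toy sketch, not production crypto):

```typescript
// Toy Shamir secret sharing over a prime field: t-of-n reconstruction.
const P = 2n ** 127n - 1n; // a Mersenne prime; fine for illustration

const mod = (a: bigint): bigint => ((a % P) + P) % P;

function powmod(b: bigint, e: bigint): bigint {
  let r = 1n;
  for (b = mod(b); e > 0n; e >>= 1n, b = mod(b * b)) if (e & 1n) r = mod(r * b);
  return r;
}
const inv = (a: bigint): bigint => powmod(a, P - 2n); // Fermat inverse

// Split `secret` into n points on a random degree-(t-1) polynomial.
function split(secret: bigint, t: number, n: number): [bigint, bigint][] {
  const coeffs = [secret];
  for (let i = 1; i < t; i++) coeffs.push(BigInt(Math.floor(Math.random() * 1e15))); // toy randomness
  return Array.from({ length: n }, (_, i) => {
    const x = BigInt(i + 1);
    let y = 0n;
    coeffs.forEach((c, j) => { y = mod(y + c * powmod(x, BigInt(j))); });
    return [x, y] as [bigint, bigint];
  });
}

// Lagrange interpolation at x = 0 recovers the secret from any t shares.
function combine(shares: [bigint, bigint][]): bigint {
  return shares.reduce((acc, [xi, yi]) => {
    let num = 1n, den = 1n;
    for (const [xj] of shares) {
      if (xj !== xi) { num = mod(num * -xj); den = mod(den * (xi - xj)); }
    }
    return mod(acc + yi * num * inv(den));
  }, 0n);
}

const shares = split(123456789n, 2, 3);                // 2-of-3 split
console.log(combine(shares.slice(1)) === 123456789n);  // any 2 shares suffice -> true
```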

private-ml-sdk

Through remote-attestation-based verification and blockchain integration, dstack provides a foundation for running privacy-preserving applications in a secure and verifiable way. However, it does not provide the full environment needed for large-scale privacy-preserving LLMs, particularly those leveraging GPUs.

For applications that utilize the GPU, private-ml-sdk - an extension of dstack developed by Phala and Near - can be used. It extends dstack to support GPU TEEs by adding the components needed for GPU operation, such as NVIDIA drivers and CUDA libraries, to the OS image.

Panda’s Technical Design

We’ve built Panda leveraging private-ml-sdk. The following sections describe the key considerations behind its design.

Using GPU TEE & Dstack

We believe that GPU TEEs and dstack form the strongest foundation currently available for a privacy-preserving AI inference service.

Other approaches to privacy-preserving LLM inference outside of TEEs have yet to achieve production-grade performance. Relevant cryptography-driven research includes:

  • FHE (Fully Homomorphic Encryption)

    • Polynomial Transformers: introduces the first polynomial transformer, converting models into polynomial form so that inference can run under homomorphic encryption.

    • NEXUS: a non-interactive protocol for secure transformer inference.

  • MPC (Multi-Party Computation)

    • CrypTen: uses secure MPC, where multiple nodes compute on encrypted data. Researched by Meta.

    • Fission: enhances CrypTen by assigning linear and non-linear operations to different parties, with privacy still preserved by the underlying MPC. Researched by Meta and Nillion.

Although these techniques operate directly on encrypted user inputs - and thus provide strong privacy guarantees - their latency remains prohibitive. FHE inference still requires minutes even for mid-sized BERT-class models. MPC-based Fission lowers latency to roughly 14 seconds on Llama 3.1, which is still far from mainstream AI chat apps that deliver responses in a few hundred milliseconds. By contrast, GPU TEE inference achieves ChatGPT-class responsiveness with only about a 10 percent performance overhead.

Dstack currently appears to be the most mature SDK for GPU TEEs.

  • Traditional TEE offerings like AWS Nitro have been used in projects such as op-enclave, but they are not compatible with NVIDIA’s CC mode.

  • Some projects targeting general TEE use cases - like confidential-containers and cc-api - are still in development and not yet ready for production use.

Dstack and private-ml-sdk have also demonstrated powerful GPU TEE use cases, such as Redpill.

Extending Chain of Trust to the Client

Panda extends the chain of trust established by dstack all the way to the client, allowing users to verify in real time that they are communicating with a genuine, attested TEE instance.

(Chain of Trust at Panda)

Client verification proceeds through the following steps:

  1. TLS Certificate Generation

    When the inference server starts, it generates a TLS certificate for its registered domain (e.g., panda.chat) and an ECDSA private key.

  2. Embedding Public Key in TDX Quote

    The public key, derived from the ECDSA private key, is embedded in the TDX quote as ReportData. The inference server then registers both the quote and the public key with the external application server, which runs outside the TEE.

  3. Onchain Attestation Verification

    The application server submits a transaction containing the quote to the AutomataDcapAttestationFee contract on OP Mainnet (developed by Automata). This smart contract performs onchain verification of the Intel TDX quote.

  4. Storing Verification Result

    If the attestation succeeds, the application server stores the resulting transaction hash.

  5. Client Fetches Valid Hashes

    The client fetches the list of valid quote verification transaction hashes from the AppAuth contract on OP Mainnet.

  6. Challenge-Response Setup

    The client sends an inference request to the TDX instance, including a Panda-Challenge header containing a random 32-byte hex string. This string will be signed by the inference server.

  7. Server Signs the Challenge

    The inference server constructs the following message to sign (a verification sketch follows this list):

    PROOF_PREFIX | SERVER_TIMESTAMP | RANDOM_SALT | CLIENT_CHALLENGE
    
    • SERVER_TIMESTAMP and RANDOM_SALT help prevent signature replay attacks.

    • The server signs this message using the private TLS key generated in Step 1.

    • The signature, along with the SERVER_TIMESTAMP and RANDOM_SALT, is returned in the response headers.

  8. Client Retrieves Attestation Hash

    The client queries the application server to get the transaction hash associated with the server’s attestation.

  9. Final Verification

    Using the hash, the client verifies on OP Mainnet that:

    • The attestation transaction exists and is valid.

    • The app hash retrieved from the AppAuth contract matches the app hash in the quote.

    • The public key in the verified TDX quote matches the public key used by the server.
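
As a sketch of steps 6 through 9, the client can check the returned signature against the public key extracted from the verified quote. The exact byte layout of the signed message and the header encoding below are assumptions for illustration:

```typescript
import { createVerify, randomBytes } from "node:crypto";

const PROOF_PREFIX = "panda-proof-v1"; // hypothetical prefix constant

// Verify the server's response to our Panda-Challenge header.
function verifyChallengeResponse(
  attestedPubkeyPem: string, // public key recovered from the verified TDX quote
  challenge: string,         // the random 32-byte hex string we sent
  headers: { timestamp: string; salt: string; signature: string },
): boolean {
  // Reject stale timestamps to block replay of previously captured signatures.
  const ageMs = Date.now() - Number(headers.timestamp);
  if (ageMs < 0 || ageMs > 60_000) return false;

  const message = [PROOF_PREFIX, headers.timestamp, headers.salt, challenge].join("|");
  const verifier = createVerify("sha256");
  verifier.update(message);
  return verifier.verify(attestedPubkeyPem, Buffer.from(headers.signature, "base64"));
}

// A fresh challenge is generated per request (step 6).
const challenge = randomBytes(32).toString("hex");
```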

Through this mechanism, the client can ensure:

  1. The TEE server is genuine and untampered

    The Intel TDX quote is cryptographically verified onchain, proving that the server is running inside a trusted and unmodified TEE.

  2. The server the client is connected to is the verified TEE server

    By verifying the TLS public key against the attested quote, the client ensures that it is communicating with the legitimate TEE server, under the assumption that attestations for all TLS certificates of the domain are provided and can be verified by the user.

Enhancing UX without Sacrificing Privacy

To provide a useful AI inference service, a few additional features need to be supported:

  • User chat history

  • Vector database for advanced RAG features (e.g. web search, PDF analysis)

Panda follows a strict security posture on these features.

Encryption by user-side secret

One requirement for AI inference services like Panda is management of each user’s chat history. Since these transcripts include sensitive information, they should also be end-to-end encrypted at the application level.

The client-side encryption workflow is as follows (a Web Crypto sketch follows the list):

  1. A user creates a master password which is used to encrypt and decrypt every chat history and associated user data.

  2. This password is used to derive an encryption key.

  3. Each chat message is encrypted with this key, and only the resulting ciphertext is uploaded to remote storage.

  4. The encryption key never leaves the client; Panda’s servers only see opaque ciphertext.
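
A minimal Web Crypto sketch of steps 2 and 3 (the PBKDF2 parameters are illustrative assumptions, not Panda’s exact choices):

```typescript
// Derive an AES key from the master password and encrypt one message.
async function deriveKey(password: string, salt: Uint8Array): Promise<CryptoKey> {
  const material = await crypto.subtle.importKey(
    "raw", new TextEncoder().encode(password), "PBKDF2", false, ["deriveKey"]);
  return crypto.subtle.deriveKey(
    { name: "PBKDF2", salt, iterations: 600_000, hash: "SHA-256" },
    material, { name: "AES-GCM", length: 256 }, false, ["encrypt", "decrypt"]);
}

async function encryptMessage(key: CryptoKey, plaintext: string) {
  const iv = crypto.getRandomValues(new Uint8Array(12)); // fresh IV per message
  const ciphertext = await crypto.subtle.encrypt(
    { name: "AES-GCM", iv }, key, new TextEncoder().encode(plaintext));
  return { iv, ciphertext }; // only this opaque blob is uploaded
}
```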

Because Panda cannot decrypt these blobs, even privileged operators or a compromised backend gain no insight into users’ private data. Lost keys simply render the data unreadable, mirroring the security model of modern end-to-end encrypted messengers.

To prevent XSS-driven key theft or any other client-side mischief at runtime, we off-load all cryptographic work - key unwrapping, encryption, decryption - into a sandboxed iframe. Because this sandboxed iframe is hosted on an opaque origin, the main application gets zero direct reads into it, it has zero ambient storage, and every cookie is hidden from JavaScript. Every encryption and decryption is done via explicit message passing between the main app and the sandbox, where each request is protected by a strict CSP.
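
In outline, the main app never touches key material; it only exchanges messages with the sandbox. A simplified sketch (the message shape and the sandbox-local key handling are assumptions; the real protocol also carries request IDs and origin checks):

```typescript
// --- main app (outside the sandbox) --------------------------------
const sandbox = document.querySelector("iframe#crypto-sandbox") as HTMLIFrameElement;

function requestDecrypt(ciphertext: ArrayBuffer, iv: Uint8Array): void {
  // A sandboxed iframe has an opaque origin, so "*" is the usable target.
  sandbox.contentWindow?.postMessage({ op: "decrypt", ciphertext, iv }, "*");
}

// --- inside the sandboxed iframe ------------------------------------
let sessionKey!: CryptoKey; // unwrapped during sandbox init; never leaves this frame

window.addEventListener("message", async (event) => {
  if (event.data?.op !== "decrypt") return;
  const plaintext = await crypto.subtle.decrypt(
    { name: "AES-GCM", iv: event.data.iv }, sessionKey, event.data.ciphertext);
  (event.source as Window)?.postMessage(
    { op: "result", text: new TextDecoder().decode(plaintext) }, "*");
});
```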

Also, to avoid forcing the user to re-enter the password every time client memory is cleared (e.g. due to a page refresh or tab closure), Panda encrypts the user password with a server-provided, periodically rotating key and stores the encrypted password in the browser's local storage. The server provides the old_key and new_key when a rotation happens, so the client can decrypt the saved password using the old_key and re-encrypt it with the new_key (sketched below). These key rotations are also performed inside the sandboxed iframe, significantly increasing the security of these operations while improving the UX by avoiding frequent password prompts. The server-side rotating key acts as a safeguard: even if the saved encrypted password is retrieved through a client-side attack, long-term access to the user’s chat history remains protected.
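
A sketch of that rotation step as it might run inside the sandbox (the AES-GCM wrapping and storage key name are illustrative assumptions):

```typescript
// Re-wrap the cached master password when the server rotates its key.
async function rotateSavedPassword(oldKey: CryptoKey, newKey: CryptoKey): Promise<void> {
  const saved = JSON.parse(localStorage.getItem("wrapped-password")!);
  const password = await crypto.subtle.decrypt(
    { name: "AES-GCM", iv: new Uint8Array(saved.iv) }, oldKey,
    new Uint8Array(saved.data));
  const iv = crypto.getRandomValues(new Uint8Array(12));
  const data = await crypto.subtle.encrypt({ name: "AES-GCM", iv }, newKey, password);
  localStorage.setItem("wrapped-password",
    JSON.stringify({ iv: Array.from(iv), data: Array.from(new Uint8Array(data)) }));
}
```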

Strict TTL at the vector DB

A vector DB is a purpose-built system designed to store and search machine-generated embeddings. Unlike traditional databases that organize data in rows and columns, a vector DB represents each document chunk as a high-dimensional numeric vector and indexes these vectors using an approximate nearest neighbor (ANN) algorithm. These systems are optimized to efficiently compare vectors using predefined similarity metrics.
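
For intuition, retrieval ultimately reduces to comparing embedding vectors under such a metric, for example cosine similarity:

```typescript
// Cosine similarity between two embedding vectors: 1 = same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```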

This can be used to enhance the accuracy of LLM inference service. When prior chats or uploaded documents are embedded and retrieved at inference time, LLMs gain rich, user-specific context, dramatically improving answer quality.

Although the embeddings themselves are numerically transformed, they can still leak private information if an attacker extracts and decodes them. If the root key of the TEE were ever to be compromised, an adversary could scrape the entire vector store.

To mitigate this risk, Panda enforces a strict one-day time-to-live (TTL) policy on its vector DB. This minimizes the attack surface by ensuring that, even in the event of a breach of the TEE instance, only data added within the past 24 hours can leak.
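
Enforcement depends on the database; as a sketch, a periodic sweep can delete anything older than 24 hours based on a stored insertion timestamp (the client interface below is a hypothetical stand-in; many vector DBs also offer native TTL):

```typescript
const TTL_MS = 24 * 60 * 60 * 1000; // 1-day TTL on every stored vector

// Hypothetical vector-DB client; real systems expose equivalent
// delete-by-filter or native expiry mechanisms.
interface VectorStore {
  deleteWhere(filter: { field: string; before: number }): Promise<number>;
}

async function sweepExpired(store: VectorStore): Promise<void> {
  const cutoff = Date.now() - TTL_MS;
  const removed = await store.deleteWhere({ field: "insertedAt", before: cutoff });
  console.log(`TTL sweep removed ${removed} expired embeddings`);
}
```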

Conclusion

TEEs provide hardware-enforced isolation that allows models to run without exposing plaintext data to infrastructure operators, system software, or external observers. System architectures that extend these guarantees to GPUs have now reached a level of maturity where they can support production-grade confidential LLM inference in real-world deployments.

Building such systems, however, is non-trivial. It requires deep familiarity with both the capabilities and limitations of the underlying hardware - ranging from attestation protocols and root-of-trust initialization to the secure orchestration of containerized environments. It also demands practical solutions for challenges like runtime verifiability, rollback resistance, and integration with transparency and auditability mechanisms.

Panda, built on top of dstack and modern GPU TEE infrastructure, demonstrates that secure, production-grade confidential LLM inference is not only possible, but operationally viable. It extends the chain of trust from the Intel TDX hardware enclave all the way to the client, ensuring that every layer of the system is verifiably secure. By leveraging custom ReportData fields and onchain attestation verification, Panda delivers end-to-end confidentiality and integrity across the full inference lifecycle.
