Scaling Real-Time AI Decisioning: High-Throughput Architecture for 15K+ RPS

A deep dive into engineering a distributed, event-driven system capable of sub-100ms latency at scale using Kafka, Bloom filters, and in-memory caching.

8 min read · Advanced

Designing a system to handle 10,000+ requests per second (RPS) with sub-100ms latency is a significant engineering challenge, particularly in the high-stakes environment of AdTech, where every millisecond of delay directly impacts revenue and user experience.

At this scale, the bottleneck shifts from network I/O to memory bandwidth, garbage collection (GC) pauses, and cache contention.

Engineering for Ultra-High Throughput: Lessons from a 10K+ RPS AdTech Engine

In high-frequency bidding environments, "fast" is a moving target. At 10,000+ RPS with a sub-100ms P99 target, standard thread-per-request Spring Boot or blocking REST patterns collapse under the weight of thread context switching and GC overhead.

We architected a solution that prioritizes mechanical sympathy—aligning software design with the underlying hardware and network realities of GCP and Aerospike.

1. The Reactive Manifesto: Backpressure and Event Loops

At 10K RPS, the "thread-per-request" model is a liability: with platform threads, the per-thread memory overhead and context-switching latency are prohibitive. We instead run a non-blocking event-loop model (Netty) with explicit backpressure, so demand is signalled downstream rather than absorbed into ever-growing thread pools.
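The backpressure mechanic can be sketched with the JDK's own Reactive Streams API (`java.util.concurrent.Flow`); this is an illustrative stand-in, not our production Netty pipeline — the class and method names below are invented for the sketch:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.atomic.AtomicInteger;

public class BackpressureDemo {
    /** Publishes n items; the subscriber pulls them one at a time via request(1). */
    static int run(int n) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(1);
        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(new Flow.Subscriber<Integer>() {
                private Flow.Subscription subscription;

                @Override public void onSubscribe(Flow.Subscription s) {
                    subscription = s;
                    s.request(1);             // demand exactly one item: backpressure
                }
                @Override public void onNext(Integer item) {
                    processed.incrementAndGet();
                    subscription.request(1);  // pull the next only when ready
                }
                @Override public void onError(Throwable t) { done.countDown(); }
                @Override public void onComplete() { done.countDown(); }
            });
            for (int i = 0; i < n; i++) {
                publisher.submit(i);          // submit() blocks once the buffer saturates
            }
        } // close() delivers onComplete after buffered items drain
        done.await();
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("processed=" + run(100));
    }
}
```

The key property is that the producer cannot outrun the consumer: demand flows upstream one `request(1)` at a time, and the bounded buffer applies pressure instead of an unbounded queue silently eating memory.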

2. Aerospike: Beyond Simple Key-Value Lookups

While many use Aerospike as a simple cache, we treated it as a Real-Time Data Platform: instead of pulling records back to the application for evaluation, we pushed filter logic to the server with Aerospike Expressions and used atomic server-side operations, saving round trips and avoiding read-modify-write races.
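We won't reproduce the Aerospike `operate()`/Expression calls here, but the underlying idea — push the operation to where the data lives instead of fetching, mutating, and writing back — can be shown in miniature with pure Java. The "impressions" counter below is a hypothetical example:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AtomicOpsDemo {
    /** Increments a counter atomically from many threads — the in-process
     *  analogue of an Aerospike server-side add() replacing a client-side
     *  read-modify-write (which would race and cost two round trips). */
    static long countAtomically(int threads, int perThread) throws InterruptedException {
        ConcurrentHashMap<String, Long> store = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    // One atomic operation "at the server": no fetch, no race, no lost update.
                    store.merge("impressions", 1L, Long::sum);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return store.get("impressions");
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countAtomically(8, 10_000)); // always 80000, never a lost update
    }
}
```

A naive `get` → increment → `put` sequence under the same contention would drop updates; the atomic form is both correct and, in the distributed case, one network hop instead of two.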

3. The "Silent Killer": Solving GC Pressure with Heap-Bypass

At 15K+ RPS, the primary bottleneck isn't just network I/O; it’s the Garbage Collection (GC) Tax. Creating short-lived objects for every request leads to frequent "Young Gen" pauses. Even a 20ms "Stop-the-World" event causes a request pile-up that breaches a 100ms SLA.

The Virtual Zero-Copy Strategy

In our architecture, we couldn't use standard hardware-level sendfile() because data was fetched from a remote Aerospike cluster. Instead, we implemented a Heap-Bypass approach: request and response payloads live in direct (off-heap) ByteBuffers owned by the Netty event loop, and the decision logic reads fields lazily at known offsets, so payload bytes never become short-lived heap objects for the collector to chase.
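The heap-bypass pattern reduces to two JDK primitives: `ByteBuffer.allocateDirect`, which places the buffer outside the GC-managed heap, and absolute-offset reads, which extract only the fields a decision needs. The wire layout and field names below are invented for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class HeapBypassDemo {
    /** Reads one field out of a request held in off-heap memory without
     *  materialising the payload as a heap String or byte[]. */
    static long readBidPrice(ByteBuffer direct) {
        // Pointer-style access: an 8-byte price at a known offset.
        // The rest of the payload stays untouched off-heap (lazy evaluation).
        return direct.getLong(16);
    }

    public static void main(String[] args) {
        // allocateDirect places the buffer outside the GC-managed heap.
        ByteBuffer request = ByteBuffer.allocateDirect(64);
        request.put("HDR:0123456789AB".getBytes(StandardCharsets.US_ASCII)); // 16-byte header
        request.putLong(250L); // hypothetical bid price, written at offset 16

        System.out.println("direct=" + request.isDirect()
                + " bidPrice=" + readBidPrice(request));
    }
}
```

Because no intermediate objects are allocated per request, the Young Gen allocation rate — and with it the frequency of stop-the-world pauses — drops sharply.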

The Niche Tip: Serialization Overhaul

For the data that did need to reach the application layer, we eliminated reflection-based overhead. We replaced standard Jackson with Jackson Afterburner and moved toward Protobuf for internal service communication, drastically reducing the CPU cycles spent on metadata inspection.
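For the Protobuf migration, a hypothetical internal message might look like the following — the package, message, and field names are invented for illustration, not our actual schema:

```protobuf
syntax = "proto3";

package decisioning;

// Compact binary wire format replacing reflection-heavy JSON
// for internal service-to-service calls.
message DecisionRequest {
  string campaign_id        = 1;
  int64  bid_price_micros   = 2;  // price in micros avoids floating point
  repeated string segments  = 3;  // audience segments used for targeting
}
```

Generated accessors read and write fields positionally by tag number, so no per-request reflection or field-name metadata inspection is needed on the hot path.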

4. Kubernetes Topology-Aware Routing on GCP

Standard Kubernetes Service routing is often "random," which can introduce unnecessary cross-AZ (Availability Zone) latency.
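Enabling topology-aware routing is a Service-level switch. A minimal sketch, assuming a recent GKE/Kubernetes version (the service name is hypothetical; on clusters older than 1.27 the annotation is `service.kubernetes.io/topology-aware-hints: auto` instead):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: decision-engine          # hypothetical service name
  annotations:
    # Ask kube-proxy to prefer endpoints in the caller's own zone.
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: decision-engine
  ports:
    - port: 8080
      targetPort: 8080
```

With the hint in place, the EndpointSlice controller distributes zone hints and traffic stays inside the originating AZ whenever capacity allows, removing the cross-zone hop from the latency budget.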

5. Efficient State Injection via Batching

To keep the cache hydrated without locking, we built a Sidecar Batcher. Data from our upstream OLAP systems (BigQuery/Spark) was streamed into a Kafka topic and consumed by a dedicated "Writer" service.
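The Writer's core loop is a micro-batcher: accumulate records from the Kafka topic, then flush them as one batch write instead of one RTT per record. This is a simplified, dependency-free sketch (a production version would also flush on a time threshold and handle retries); the class name is ours for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Accumulates records and flushes them as one batch once a size threshold
 *  is reached — the shape of a Kafka-consuming "Writer" service that issues
 *  Aerospike batch writes instead of one write per record. */
public class MicroBatcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> flushFn;  // e.g. an Aerospike batch-write call
    private final List<T> buffer = new ArrayList<>();

    public MicroBatcher(int batchSize, Consumer<List<T>> flushFn) {
        this.batchSize = batchSize;
        this.flushFn = flushFn;
    }

    public synchronized void add(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    public synchronized void flush() {
        if (!buffer.isEmpty()) {
            flushFn.accept(new ArrayList<>(buffer)); // hand off a defensive copy
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        MicroBatcher<String> batcher = new MicroBatcher<>(100, b -> batchSizes.add(b.size()));
        for (int i = 0; i < 250; i++) batcher.add("profile-" + i);
        batcher.flush(); // drain the partial tail batch
        System.out.println(batchSizes); // [100, 100, 50]
    }
}
```

Batching amortises the per-write network overhead and keeps the hot read path free of write-induced lock contention, since hydration arrives in a few large, predictable bursts.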


```mermaid
graph TD
    %% Define User
    User([User Request])

    %% Main Architecture Flow
    LB[GCP Global Load Balancer]

    subgraph GKE ["GKE Cluster: Heap-Bypass Application"]
        direction TB
        Netty[Netty Event Loop]
        DirectBuffer[(Direct ByteBuffer<br/>Off-Heap Memory)]
        AppLogic{App Decision Logic<br/>Pointer-Access}
        %% Internal flow
        Netty -.->|1. Virtual Zero-Copy DMA| DirectBuffer
        DirectBuffer ==>|2. Lazy Eval| AppLogic
    end

    subgraph AS ["Aerospike: Real-Time Data Layer"]
        direction TB
        ExpEngine[Aerospike Expression Engine]
        ASNodes[(AS Data Nodes<br/>Hybrid Memory)]
        %% Internal flow
        ExpEngine --- ASNodes
    end

    subgraph Pipeline ["Offline Hydration"]
        BQ[(BigQuery/OLAP)]
        BatchWriter[Batch Writer Service]
        BQ -.-> BatchWriter
    end

    %% Request Path
    User --> LB
    LB -->|Topology-Aware Routing| Netty

    %% Data Path (The 'Niche' Logic)
    AppLogic <==>|3. Remote Fetch| ASNodes
    AppLogic == "4. Push Logic (Expressions)" ==> ExpEngine
    ExpEngine == "5. True/False Result" ==> AppLogic

    %% Response Path
    AppLogic --> LB
    LB -->|Response < 100ms P99| User

    %% Hydration Path
    BatchWriter -->|Aerospike Batch Writes| ASNodes

    %% Styling
    style DirectBuffer fill:#fff3e0,stroke:#e65100,stroke-width:2px,stroke-dasharray: 5 5
    style AppLogic fill:#003366,color:#fff
    style Netty fill:#003366,color:#fff
    style ExpEngine fill:#d32f2f,color:#fff
    style LB fill:#003366,color:#fff
```

Summary Table for Architects

| Component  | Standard Approach   | Our High-Scale Approach  | Why?                                               |
|------------|---------------------|--------------------------|----------------------------------------------------|
| I/O Model  | Imperative/Blocking | Reactive (Non-blocking)  | Maximizes CPU utilization per core.                |
| Data Store | Redis (RAM only)    | Aerospike (HMA)          | Persistence with RAM-like speed at 1/5th the cost. |
| Network    | Cross-AZ Routing    | Topology-Aware Routing   | Eliminates inter-zone latency overhead.            |
| Updates    | Read-Modify-Write   | Atomic Server-side Ops   | Prevents race conditions and saves RTTs.           |