Architecture & Design Content on InfoQ
-
Understanding ML Model Poisoning: How It Happens and How to Detect It
In this article, the author explores data poisoning as a threat to machine learning systems, covering techniques such as label flipping, backdoors, clean-label poisoning, and gradient manipulation. The article reviews real-world incidents, discusses the challenges of detecting poisoned data, and presents practical defenses, tools, and operational practices for securing ML training pipelines.
-
Designing Continuous Authorization for Sensitive Cloud Systems
Most cloud systems make a single authorization decision at login; everything afterward runs on the trust established at authentication time. For systems handling regulated data, that gap is where breaches happen. This article presents a continuous authorization architecture covering risk-tiered evaluation, behavioral baselines, privacy-preserving audit trails, and a phased rollout.
-
Governing AI in the Cloud: a Practical Guide for Architects
In this article, the author outlines a practical approach to AI governance in the cloud, covering discovery of shadow AI, data classification at creation, IAM-based enforcement, policy-as-code, and operational controls. The article shows how organizations can embed governance into delivery pipelines, balancing security, compliance, and developer productivity without relying on manual processes.
-
The Technology Adoption Curve, Twenty Years On
Today, June 8th, InfoQ celebrates 20 years. This is not a comprehensive history, but a deliberately selective look at the technologies and practices InfoQ identified early, where they sit on the adoption curve in 2026, and how that curve may evolve over the next five to ten years.
-
Architectural Change Cases: a Practical Tool for Evolutionary Architectures
Architectural change cases extend architecture decision record (ADR) thinking by evaluating how decisions may evolve over time. Change cases expose hidden assumptions and help teams estimate the reversibility and cost of change.
-
Two Misconfigurations That Caused Spark OOM Failures on Kubernetes
After migrating Spark pipelines to Azure Kubernetes Service, two infrastructure settings interacted destructively: spark.kubernetes.local.dirs.tmpfs=true backed shuffle spill with RAM instead of disk, and a hard podAffinity rule forced all executors onto one node. Together, they caused repeated OOM kills invisible to standard diagnostics.
-
Stragglers, Not Failures: How Adaptive Hedged Requests Reduce p99 Latency by 74 Percent
In fan-out microservice architectures, slow-but-completing requests accumulate across services and drive p99 latency far higher than per-service metrics suggest. This article presents an adaptive hedging mechanism that uses DDSketch for real-time quantile estimation, windowed rotation to handle distribution drift, and a token-bucket budget to prevent load amplification.
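As a rough illustration of the budget idea mentioned in this summary, a token bucket can cap how many hedged (duplicate) requests are sent relative to primary traffic. The sketch below is not taken from the article: the class name `HedgeBudget`, the `ratio` and `capacity` parameters, and the choice to earn tokens per primary request are all assumptions made for illustration.

```python
class HedgeBudget:
    """Hypothetical token-bucket budget for hedged requests.

    Each primary request earns `ratio` tokens (up to `capacity`);
    each hedge spends one whole token. With ratio=0.1, hedges add
    at most roughly 10% extra load in steady state.
    """

    def __init__(self, ratio: float = 0.1, capacity: float = 10.0) -> None:
        self.ratio = ratio
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket

    def on_primary_request(self) -> None:
        # Refill a fraction of a token for every primary request sent.
        self.tokens = min(self.capacity, self.tokens + self.ratio)

    def try_acquire(self) -> bool:
        # Allow a hedge only if a whole token is available.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check `try_acquire()` once a request exceeds the hedging threshold and skip the hedge when it returns `False`, so a burst of slow requests cannot multiply load on an already struggling backend.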
-
Architecting Cloud-Native Kafka: from Tiered Storage towards a Diskless Future
This article explores Kafka's transition toward a cloud-native architecture, examining how tiered storage, FinOps telemetry, elastic consumer scaling, virtual clusters, and Share Groups reshape the operational and economic model of event streaming platforms. It also analyzes emerging diskless-storage proposals and their architectural trade-offs.
-
The Schema Proliferation Problem in Kafka and Flink Pipelines: How to Solve It
Schema proliferation builds slowly and gets expensive fast. One schema per event type feels right until there are ten tables, union queries spanning all of them, and a single field rename touching every schema. Discriminator-based schema consolidation collapses that to two tables, turning multi-table unions into a single query, while new variants are additive and don't break existing consumers.
-
The Mathematics of Backlogs: Capacity Planning for Queue Recovery
Backlogs in distributed systems are arithmetic problems, not mysteries. This article provides practical formulas for calculating backlog drain time, sizing consumer headroom, and setting auto-scaling triggers. It covers key failure modes — retry amplification, metastable states, and cascading pipeline bottlenecks — plus when to shed load instead of draining.
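The basic drain-time arithmetic can be sketched as follows. This is a minimal example of the standard relationship (backlog divided by net drain rate), not the article's own formulas; the function name and parameters are hypothetical.

```python
import math


def drain_time_seconds(backlog: float, consume_rate: float, arrival_rate: float) -> float:
    """Estimate the time needed to drain a backlog.

    backlog:      messages currently queued
    consume_rate: messages consumed per second
    arrival_rate: new messages arriving per second

    Returns math.inf when consumers do not outpace arrivals,
    because the backlog never shrinks in that case.
    """
    net_rate = consume_rate - arrival_rate
    if net_rate <= 0:
        return math.inf
    return backlog / net_rate


# 1,000,000 queued messages, consuming 1,500/s while 1,000/s still arrive:
print(drain_time_seconds(1_000_000, 1_500, 1_000))  # 2000.0 seconds
```

The example shows why headroom matters: only the 500 messages per second above the arrival rate reduce the backlog, so doubling that headroom halves the recovery time.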
-
Three Pillars of Platform Engineering: a Virtuous Cycle
Platform engineering succeeds when reliability and ergonomics reinforce each other rather than compete. This article explores three foundational pillars: automated reliability, developer ergonomics, and operator ergonomics. Together, they establish a virtuous cycle that strengthens system stability, reduces operational burden, and empowers teams to scale infrastructure with confidence.
-
Securing Autonomous AI Agents on Kubernetes: Trust Boundaries, Secrets, and Observability for a New Category of Cloud Workload
Autonomous AI agents break Kubernetes security assumptions with dynamic dependencies, multi-domain credentials, and unpredictable resource use. This article covers production-tested patterns: Job-based isolation, Vault for scoped short-lived credentials, a four-phase trust model from shadow mode to autonomous operation, and observability for non-deterministic reasoning cycles.