Infrastructure Content on InfoQ
-
AWS Replaces Fat-Tree Data Center Networks with Random Graph Theory, Cutting Routers by 69%
AWS disclosed that Resilient Network Graphs, a flat network architecture based on quasi-random graph theory, is now the default for most new data center builds. The design replaces fat-tree hierarchies with direct ToR-to-ToR mesh connections using passive optical ShuffleBoxes, cutting routers by 69%, boosting throughput by 33%, and reducing network power consumption by 40%.
-
Inside Google’s System for Coordinated A/B Testing across its Global Service Fleet
Google has shared details of its fleet-wide, large-scale A/B experimentation system, designed to standardize experiment assignment, exposure logging, and configuration propagation across distributed services. The approach enables consistent measurement across products, reduces experiment conflicts, and improves the reliability of data-driven decision-making at scale.
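Standardized experiment assignment is commonly built on deterministic hashing, so that every service computes the same variant for the same unit without coordination. The following is a minimal sketch of that general technique, not Google's system; the function name `assign_variant` and its parameters are hypothetical.

```python
import hashlib


def assign_variant(experiment_id: str, unit_id: str, variants: list[str]) -> str:
    """Deterministically map a unit (for example a user ID) to one variant.

    Hypothetical helper: any service that calls it with the same inputs
    gets the same answer, so no shared state is needed.
    """
    # Salting the hash with the experiment ID keeps a unit's assignment in one
    # experiment independent of its assignment in another.
    digest = hashlib.sha256(f"{experiment_id}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Because the result depends only on the inputs, exposure logs produced by different services can be joined on the experiment and unit identifiers.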
-
AI-Assisted Migration Tool Helps Teams Move from ingress-nginx to Higress in Minutes
The Cloud Native Computing Foundation has highlighted a new AI-assisted migration approach that enabled engineers to migrate 60 ingress-nginx resources to Higress in roughly 30 minutes, demonstrating how artificial intelligence is increasingly being applied to modernize Kubernetes networking and gateway infrastructure.
-
Swiggy Improves Search Autocomplete Using Real Time Machine Learning Ranking
Swiggy detailed a real-time machine-learning ranking system for autocomplete built on OpenSearch. The architecture separates candidate generation from ranking, uses feature stores for real-time signals, and applies learning-to-rank models for improved relevance. It replaces heuristic ranking while maintaining strict latency constraints and enabling continuous model updates from user behavior signals.
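The separation of candidate generation from ranking can be illustrated with a small two-stage sketch. This is not Swiggy's implementation: the `autocomplete` function, the prefix-match retrieval, and the pluggable `score` callable (standing in for a learned model) are all assumptions made for illustration.

```python
from typing import Callable, Iterable


def autocomplete(
    prefix: str,
    catalog: Iterable[str],
    score: Callable[[str], float],
    k: int = 5,
    pool: int = 50,
) -> list[str]:
    """Return the top-k suggestions for a prefix using two stages."""
    # Stage 1: cheap candidate generation, capped at `pool` items so the
    # more expensive ranking stage has a bounded amount of work.
    p = prefix.lower()
    candidates = [term for term in catalog if term.lower().startswith(p)][:pool]
    # Stage 2: rank the candidates with the supplied scoring function,
    # which stands in for a learning-to-rank model here.
    return sorted(candidates, key=score, reverse=True)[:k]
```

Keeping the scorer behind a plain callable means the ranking model can be swapped or retrained without touching retrieval.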
-
Cloudflare Optimizes Edge Stack for High-Core CPUs instead of Large Cache
Cloudflare recently introduced its Gen 13 servers, marking a shift in how its network handles traffic. Instead of relying on large CPU caches for speed, the company redesigned its software to leverage many more processor cores working in parallel in its latest AMD-based servers.
-
Dropbox Collaborates with GitHub to Reduce Monorepo Size from 87GB to 20GB
Dropbox reduced its backend monorepo from 87GB to 20GB by optimizing Git delta compression in collaboration with GitHub. The changes improved clone times, CI performance, and developer velocity, highlighting how repository storage inefficiencies can impact large-scale engineering workflows.
-
From Minutes to Seconds: Uber Boosts MySQL Cluster Uptime with Consensus Architecture
Uber redesigned its MySQL fleet using a consensus-driven architecture based on MySQL Group Replication, reducing cluster failover time from minutes to seconds. By moving leader election and failure detection into the database layer, Uber improved availability, simplified external orchestration, and strengthened consistency across thousands of production clusters.
-
How CNAME Ordering in RFC Specs Caused Cloudflare 1.1.1.1 Outage
In a recent article titled "What came first: the CNAME or the A record?", Cloudflare explains how an ambiguous RFC specification led to breakage in its popular 1.1.1.1 service. After identifying the breakage and the ambiguity in older DNS standards regarding record order, Cloudflare proposes a clarified specification.
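One way a client can avoid depending on record order is to index the answer section first and only then follow the CNAME chain. The sketch below illustrates that idea under simplifying assumptions: answers are plain `(name, type, value)` tuples, only A records are collected, and the function name `resolve_addresses` is hypothetical. It is not Cloudflare's code or any particular resolver library's API.

```python
def resolve_addresses(qname: str, answers: list[tuple[str, str, str]]) -> list[str]:
    """Collect A records for qname, following CNAMEs regardless of record order."""

    def norm(name: str) -> str:
        # DNS names are case-insensitive; also ignore a trailing root dot.
        return name.lower().rstrip(".")

    cnames: dict[str, str] = {}
    addresses: dict[str, list[str]] = {}
    # Pass 1: index every record, so the order of the answer section is irrelevant.
    for name, rtype, value in answers:
        if rtype == "CNAME":
            cnames[norm(name)] = norm(value)
        elif rtype == "A":
            addresses.setdefault(norm(name), []).append(value)

    # Pass 2: walk the alias chain from the queried name, guarding against loops.
    current = norm(qname)
    seen: set[str] = set()
    while current in cnames and current not in seen:
        seen.add(current)
        current = cnames[current]
    return addresses.get(current, [])
```

A parser that instead scans the answers once, expecting each CNAME to appear before the records it points to, will return nothing when the same records arrive in a different order.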
-
GitHub Reworks Layered Defenses after Legacy Protections Block Legitimate Traffic
GitHub engineers recently traced user reports of unexpected “Too Many Requests” errors to abuse-mitigation rules that had accidentally remained active long after the incidents that prompted them.
-
Cloudflare Scales Infrastructure as Code with Shift-Left Security Practices
Cloudflare has eliminated manual configuration errors across hundreds of production accounts by implementing Infrastructure as Code with automated policy enforcement, processing approximately 30 merge requests daily while catching security violations before deployment rather than after incidents occur.
-
Benchmarking beyond the Application Layer: How Uber Evaluates Infrastructure Changes and Cloud SKUs
Uber’s Ceilometer framework automates infrastructure performance benchmarking beyond applications. It standardizes testing across servers, workloads, and cloud SKUs, helping teams validate changes, identify regressions, and optimize resources. Future plans include AI integration, anomaly detection, and continuous validation.
-
NVIDIA Dynamo Addresses Multi-Node LLM Inference Challenges
Serving Large Language Models (LLMs) at scale is complex. Modern LLMs now exceed the memory and compute capacity of a single GPU or even a single multi-GPU node. As a result, inference workloads for models with 70B+ or 120B+ parameters, or for pipelines with large context windows, require multi-node, distributed GPU deployments.
-
Azure API Management Premium v2 GA: Simplified Private Networking and VNet Injection
Microsoft has launched API Management Premium v2, redefining security and ease-of-use in cloud API gateways. This new architecture enhances private networking by eliminating management traffic from customer VNets. With features like Inbound Private Link, availability zone support, and custom CA certificates, users gain unmatched networking flexibility, resilience, and significant cost savings.
-
Airbnb Adds Adaptive Traffic Control to Manage Key Value Store Spikes
Airbnb upgraded Mussel, its multi-tenant key-value store, replacing static per-client rate limits with an adaptive, resource-aware traffic control system. The redesign ensures resilience during traffic spikes, protects critical workflows, and maintains fair usage across thousands of tenants while scaling efficiently.
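The contrast with a static per-client limit can be sketched with a token bucket whose refill rate shrinks as the backend gets busier and whose requests carry a cost. This is an illustrative sketch only, not Mussel's implementation; the class `AdaptiveBucket`, the `load` signal in the range 0 to 1, and the per-request `cost` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveBucket:
    """Token bucket whose refill rate adapts to a backend load signal."""

    capacity: float        # maximum tokens a tenant can accumulate
    base_rate: float       # tokens per second when the backend is idle
    tokens: float = 0.0    # tokens currently available
    last: float = 0.0      # timestamp of the previous call, in seconds

    def allow(self, now: float, cost: float, load: float) -> bool:
        # A static limiter would always refill at base_rate; here the rate
        # falls linearly to zero as load (0.0 to 1.0) approaches saturation.
        rate = self.base_rate * max(0.0, 1.0 - load)
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * rate)
        self.last = now
        # Expensive requests consume more tokens than cheap ones.
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False
```

Passing the timestamp in explicitly keeps the sketch deterministic; a real service would typically read a monotonic clock and derive `load` from its own resource metrics.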
-
KubeCon NA 2025 - Salesforce’s Approach to Self-Healing Using AIOps and Agentic AI
AIOps and Agentic AI technologies can help in developing solutions to intelligently analyze Kubernetes cluster health, automatically diagnose problems, and orchestrate issue resolutions with minimal human intervention. Vikram Venkataraman and Srikanth Rajan spoke at KubeCon + CloudNativeCon NA 2025 Conference about Salesforce’s approach to self-healing systems using AIOps and AI Agents.