Why this team left App Service for AKS
A mid-size ed-tech company ran five ASP.NET Core APIs on Azure App Service. It worked until enrollment spikes caused noisy-neighbor scaling limits and deployment slots became coordination bottlenecks. They needed independent scaling for a video metadata service, tighter network policies for PII, and consistent deployments across dev/staging/prod. Azure Kubernetes Service (AKS) was chosen—not because Kubernetes is trendy, but because their SRE team already operated clusters for other products.
This case study walks through what they actually built: cluster topology, .NET container images, GitHub Actions deploy pipelines, and the surprises in week three of production.
Target architecture
- Gateway API — ASP.NET Core YARP or dedicated BFF; TLS termination at Application Gateway Ingress Controller (AGIC).
- Identity service — OpenID Connect wrapper around existing user store; issues internal JWTs.
- Catalog service — Courses, modules, read-heavy; Redis cache fronting SQL.
- Enrollment service — Writes with outbox pattern to Service Bus for notifications.
- Media service — Blob storage SAS generation; CPU-light, IO-heavy.
Each service is a .NET 8 minimal API in its own repo, multi-stage Docker image under 120 MB, health endpoints at /health/live and /health/ready.
AKS cluster design decisions
Node pools
System pool: three D4s_v5 nodes for kube-system. User pool: autoscaling 3–12 nodes for workloads. Spot pool for batch reindex jobs—never for payment-adjacent paths. Azure CNI overlay saved IP exhaustion in a crowded hub-spoke VNet.
Identity and secrets
Workload Identity federated each service account to a managed identity. Azure Key Vault Provider for Secrets Store CSI mounted connection strings and API keys as volumes—no secrets in Helm values committed to git. Rotation became an ops runbook instead of redeploying twelve apps.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
name: catalog-api-secrets
spec:
provider: azure
parameters:
usePodIdentity: "false"
clientID: "${WORKLOAD_IDENTITY_CLIENT_ID}"
keyvaultName: "prod-edtech-kv"
objects: |
array:
- objectName: SqlCatalogConnection
objectType: secret
Deploying .NET services with Helm
One Helm chart template parameterized image, replicas, resources, and probes. Example values for catalog-api in production:
replicaCount: 4
image:
repository: acredtech.azurecr.io/catalog-api
tag: "1.14.2"
resources:
requests:
cpu: 250m
memory: 512Mi
limits:
memory: 1Gi
env:
ASPNETCORE_ENVIRONMENT: Production
livenessProbe:
httpGet:
path: /health/live
port: 8080
readinessProbe:
httpGet:
path: /health/ready
port: 8080
They pinned CPU requests after discovering throttling inflated p95 latency without obvious CPU metrics in Application Insights—always correlate Kubernetes cgroup limits with .NET GC pauses.
Networking and ingress
AGIC integrated with an existing Application Gateway WAF. Internal-only services used ClusterIP; only the gateway and BFF had public listeners. Network policies denied east-west traffic except from the gateway namespace to service ports explicitly labeled. mTLS between services was deferred to phase two; JWT validation at the BFF was sufficient for their threat model.
Observability stack
OpenTelemetry exporters sent traces to Azure Monitor. Structured logging via Serilog with trace_id in every log line. Dashboards per service: request rate, 5xx ratio, dependency duration to SQL and Redis. Alerts fired on SLO burn for enrollment checkout—not on CPU alone. During the first incident, traces showed a downstream catalog timeout; without them, the team would have scaled enrollment pods uselessly.
CI/CD from GitHub Actions
Build matrix per service: dotnet test, docker build, Trivy scan, push to ACR. Deploy job used azure/k8s-set-context and helm upgrade --install with image digest tags—never :latest in production. Canary via Flagger was planned; they shipped blue-green with manual promotion for the first quarter to reduce moving parts.
Costs and operational load
AKS control plane fee plus nodes exceeded App Service spend by roughly thirty percent initially. Autoscaling and spot batch work brought it closer after rightsizing. They hired one platform engineer half-time; without that capacity, managed App Service would have been cheaper overall. The win was deployment frequency and blast-radius isolation, not raw dollar savings month one.
AI-assisted operations (carefully)
They experimented with Azure Monitor intelligent insights and an internal GPT runbook assistant that suggests probable causes from trace excerpts. Human approval remained mandatory for scale events and config changes. AI shortened MTTR on known failure modes; it did not replace runbooks for etcd or ingress certificate expiry.
What they would do differently
- Start with a platform chart and golden Dockerfile before service five.
- Load-test enrollment with realistic JWT validation and WAF rules enabled.
- Document rollback: Helm history and database migration compatibility per service.
- Keep a modular monolith path for features that do not need separate scale curves.
AKS for .NET microservices is viable when you have platform skills, clear service boundaries, and observability Day One—not when you want Kubernetes on a résumé. This team's production cutover succeeded because they treated cluster operations as a product, not a weekend experiment.