AWS Projects
Most of my early career leaned Azure, but the deep end of my current work is AWS, running a production EKS cluster, the Prometheus stack that watches over it, and a fleet of bastions across client environments. The bigger projects:
EKS 1.29 to 1.35 upgrade. Five Kubernetes versions across a multi-session sprint. Recurring eu-west-2c CPU saturation during node group upgrades, eventually resolved by temporarily scaling out for the duration. AWS Load Balancer Controller had its own adventures along the way (broken --watch-namespace flags, missing IAM permissions) which I documented as I went. The whole experience eventually became a Python CLI called eks-upgrade-check, which is open source and lives under Personal Projects.
GitLab 16 to 18.x. An eight-stop migration path because GitLab is allergic to skipping versions. PostgreSQL 14 on RDS is the gating factor for the 18 jump, so that one's still on the runway. The pattern across stops became routine after the first few: delete jobs, annotate the IngressClass, use explicit values files (never --reuse-values alone), restore IngressClass ownership at the end.
GitLab Pages on EKS. Set up internal docs hosting backed by S3 with an ACM wildcard cert. Mid-build I caught Pages reachable via the public NLB, which was not the plan, and re-routed its DNS to a VPN-restricted classic ELB with a dual-SAN cert. The whole technical-docs site now sits behind the VPN, MkDocs Material with custom navy branding and a Jurassic Park 404 page for good measure.
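The explicit-values rule from the GitLab upgrade stops can be sketched in Python. This is a minimal illustration, not the tooling actually used: the helper name build_helm_upgrade, the release name, the namespace, and the X.Y.Z chart version are all hypothetical placeholders. It only assembles a helm upgrade command line and refuses to do so when no values file is given.

```python
from typing import List, Sequence


def build_helm_upgrade(
    release: str,
    chart: str,
    version: str,
    values_files: Sequence[str],
    namespace: str = "gitlab",  # assumed namespace, adjust to your cluster
) -> List[str]:
    """Build a `helm upgrade` argv that always passes explicit values files."""
    if not values_files:
        # Relying on --reuse-values alone is the failure mode this guards against.
        raise ValueError("refusing to upgrade without explicit values files")
    cmd = [
        "helm", "upgrade", release, chart,
        "--namespace", namespace,
        "--version", version,
    ]
    for path in values_files:
        cmd += ["-f", path]  # later files override earlier ones
    return cmd


if __name__ == "__main__":
    # "X.Y.Z" stands in for the chart version of the next upgrade stop.
    print(" ".join(build_helm_upgrade("gitlab", "gitlab/gitlab", "X.Y.Z", ["values.yaml"])))
```

Returning the argument list rather than running it keeps each stop reviewable before anything touches the cluster.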
kube-prometheus-stack rebuild. Cleaned up the observability stack from the inside out. Node-exporter enabled, the PrometheusOutOfOrderTimestamps issue fixed via tsdb.outOfOrderTimeWindow, scrape configs slimmed from 2,139 lines to 837. Blackbox exporter got a separate scrape job for Vercel after their bot detection started returning 403s on the default User-Agent.
The DiskPressure incident. A node hit DiskPressure, which cascaded into 29 stale autossh tunnels and 5 GitLab pod crash loops on top of it. Cordoning, eviction cleanup, autossh restart, NLB target group re-registration for every new node ID, and a CTO-approved bump from 30GB to 50GB on the root volumes that became the new fleet baseline. The kind of incident you only have to live through once.
Bastion fleet & Ansible migration. Roughly 28 bastions across client environments, currently mid-migration from Puppet (made proprietary at v8 by Perforce, with EOL imminent) to Ansible. Dry-run parity first, then parallel runs alongside Puppet, no decommissioning until Ansible has been stable for a sustained period.
Multi-account TLS audit. Swept four AWS accounts for certificate ownership and expiry after a flagged WAF alert. Found a DigiCert wildcard expiring on an NLB, an expired wildcard on a CloudFront distribution, and a small graveyard of orphaned ACM certs. Cleanup ongoing.
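The triage step of an audit like this can be sketched as a small classifier. It is an illustration under assumptions, not the script used for the sweep: the function name classify_cert, the category labels, and the 30-day warning window are hypothetical. The input dict mirrors the NotAfter and InUseBy fields that ACM's DescribeCertificate returns under "Certificate".

```python
from datetime import datetime, timedelta, timezone


def classify_cert(cert: dict, now: datetime, warn_days: int = 30) -> str:
    """Bucket one certificate record as expired, orphaned, expiring, or ok."""
    not_after = cert.get("NotAfter")
    if not_after is not None and not_after <= now:
        return "expired"  # past its end date, whether attached or not
    if not cert.get("InUseBy"):
        return "orphaned"  # valid but attached to no resource
    if not_after is not None and not_after - now <= timedelta(days=warn_days):
        return "expiring"  # in use and inside the warning window
    return "ok"


if __name__ == "__main__":
    now = datetime(2025, 1, 1, tzinfo=timezone.utc)  # fixed date for the demo
    sample = {
        "NotAfter": now + timedelta(days=10),
        "InUseBy": ["example-load-balancer-arn"],  # placeholder value
    }
    print(classify_cert(sample, now))
```

Checking for expiry before attachment is a deliberate choice here, so that an expired certificate still attached to a distribution is reported as expired rather than hidden among the in-use ones.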
Alert enrichment for Slack. A Python FastAPI service that intercepts Prometheus alerts in Slack and adds first-pass diagnostic context (current pod state, recent log lines, related metrics) before I open the thread. Helm chart with Vault static secret integration. Phase 1 targets the k8s-high-priority receive

