blog

Personal notes on site reliability engineering, infrastructure, and building things that last.

Incident Review Without Theater

By Hoang-Long Nguyen · May 3, 2026 · SRE, Incidents, Postmortems, Reliability

A useful incident review turns a messy production event into fewer surprises next time.

Cloud Cost Automation Field Notes

By Hoang-Long Nguyen · April 24, 2026 · Cloud, GCP, FinOps, Automation

Cost control gets easier when billing signals show up where engineers already work.

Observability That Helps On-Call

By Hoang-Long Nguyen · April 17, 2026 · Observability, OpenTelemetry, Alerts, On-call

Dashboards and alerts should reduce decision time, not decorate a wall of screens.

GitOps Rollout Lessons from Non-Prod to Production

By Hoang-Long Nguyen · April 6, 2026 · Platform, GitOps, ArgoCD, Delivery

Notes on introducing GitOps gradually without turning every deployment into a process migration.

GKE Upgrade Runbook Notes

By Hoang-Long Nguyen · March 29, 2026 · Kubernetes, GKE, Runbooks

The upgrade checklist I want nearby before moving Kubernetes node pools through production versions.

SLO Design for Small Platform Teams

By Hoang-Long Nguyen · March 18, 2026 · SRE, SLO, Reliability, Operations

A practical way to choose service level objectives when the platform team is small, busy, and still responsible for production confidence.

How I Set Up This Site with Astro

By Hoang-Long Nguyen · March 2, 2026 · Platform, Astro, Static Sites, Cloudflare

A practical look at how this site is built: Astro Content Collections, static output, Cloudflare Pages deployment, and the few decisions that made the setup worth it.

Hello, LONG R&D

By Hoang-Long Nguyen · March 1, 2026 · Infrastructure, Reliability, Notes

An introduction to LONG R&D, a personal space for writing about site reliability engineering, infrastructure, and building systems that last.