About errorbudget
A practitioner's notes on running production AI infrastructure in regulated environments. Written by an operator, for operators.
Who writes this
An infrastructure architect with 10+ years operating production systems in fintech. Day-to-day work: 10 VMware clusters, Dell VxRail and HPE Synergy 12000 hardware, and, increasingly, NVIDIA A100 and H100 deployments for AI workloads.
The systems I run get audited annually for PCI DSS, ISO 27001, and SBV (State Bank of Vietnam) compliance. I sit on the auditee side of the table: supplying evidence, proving controls, surviving findings.
The publication is intentionally anonymous. This protects employer relationships and ensures content stays vendor-neutral. The work speaks for itself.
Why this exists
Most AI infrastructure content assumes greenfield deployments. Spin up some H100s in the cloud, configure Kubernetes, ship a model. Done in a weekend.
This is not how it works in fintech, insurance, healthcare, or any regulated environment. Real deployments involve:
- Existing investments in VMware, VxRail, or Synergy
- Compliance frameworks that prohibit certain architectural patterns
- Audit cycles that demand evidence for every control
- Change management that takes months, not minutes
- Vendor licensing that does not fit cloud-first assumptions
There is surprisingly little practical content for this reality. NVIDIA documentation assumes greenfield. VMware marketing speaks in abstractions. Compliance consultants write frameworks, not deployment guides.
This site fills that gap — from the operator perspective.
What you will find here
Articles focus on what actually works in production:
- VMware Private AI patterns: NVIDIA AI Enterprise on vSphere, Tanzu Kubernetes, vGPU configurations
- Hardware integration: VxRail and Synergy GPU expansion in existing fleets
- NVIDIA stack: AI Enterprise licensing math, DCGM monitoring, MIG strategies
- Audit-ready operations: what auditors actually ask, evidence preparation that scales
- Production lessons: incidents survived, gotchas discovered, patterns refined over years
- NCA-AIIO (NVIDIA-Certified Associate: AI Infrastructure and Operations) preparation: currently studying, sharing what I learn along the way
Editorial principles
- Production-tested only. Every recommendation comes from systems I have operated, not vendor whitepapers.
- Auditee perspective. I write from the operator side — building systems that pass audit, not conducting audits.
- Honest about scope. I am not a CISSP, not a QSA, not an ISO Lead Auditor. I am an architect who has supported many audit cycles and shares what works.
- Vendor-neutral. Affiliate disclosures are explicit. Recommendations follow technical merit, not commercial relationships.
- Long-form depth. Most articles are 4,000 to 6,000 words with diagrams, configurations, and worked examples.
What this is not
This is not compliance consulting content. I do not write checklists pretending to be an auditor. I do not sell audit prep services. I do not promise regulatory blessing.
What I share: patterns that worked when my systems faced auditors, configurations that scaled in production, decisions I would make differently in hindsight.
Get in touch
For content suggestions or technical discussions, reach me at hello@errorbudget.io.
Anonymity does not mean inaccessibility. Genuine technical conversations are always welcome.