About | errorbudget

A practitioner's notes on running production AI infrastructure in regulated environments. Written by an operator, for operators.

Who writes this

An infrastructure architect with 10+ years operating production systems in fintech. Day-to-day work: 10 VMware clusters, Dell VxRail and HPE Synergy 12000 hardware, and, increasingly, NVIDIA A100 and H100 deployments for AI workloads.

The systems I run are audited annually for PCI DSS, ISO 27001, and SBV (State Bank of Vietnam) compliance. I sit on the auditee side of the table: supplying evidence, proving controls, surviving findings.

The publication is intentionally anonymous. This protects employer relationships and helps keep the content vendor-neutral. The work speaks for itself.

Why this exists

Most AI infrastructure content assumes greenfield deployments. Spin up some H100s in the cloud, configure Kubernetes, ship a model. Done in a weekend.

This is not how it works in fintech, insurance, healthcare, or any other regulated environment. Real deployments involve:

Existing investments in VMware, VxRail, or Synergy
Compliance frameworks that prohibit certain architectural patterns
Audit cycles that demand evidence for every control
Change management that takes months, not minutes
Vendor licensing that does not fit cloud-first assumptions

There is surprisingly little practical content for this reality. NVIDIA documentation assumes greenfield. VMware marketing speaks in abstractions. Compliance consultants write frameworks, not deployment guides.

This site fills that gap — from the operator perspective.

What you will find here

Articles focus on what actually works in production:

VMware Private AI patterns: NVIDIA AI Enterprise on vSphere, Tanzu Kubernetes, vGPU configurations
Hardware integration: VxRail and Synergy GPU expansion in existing fleets
NVIDIA stack: AI Enterprise licensing math, DCGM monitoring, MIG strategies
Audit-ready operations: what auditors actually ask, evidence preparation that scales
Production lessons: incidents survived, gotchas discovered, patterns refined over years
NCA-AIIO preparation: currently studying for NVIDIA's AI Infrastructure and Operations associate certification, sharing what I learn along the way

Editorial principles

Production-tested only. Every recommendation comes from systems I have operated, not vendor whitepapers.
Auditee perspective. I write from the operator side — building systems that pass audit, not conducting audits.
Honest about scope. I am not a CISSP, not a QSA, not an ISO Lead Auditor. I am an architect who has supported many audit cycles and shares what works.
Vendor-neutral. Affiliate disclosures are explicit. Recommendations follow technical merit, not commercial relationships.
Long-form depth. Most articles are 4,000 to 6,000 words with diagrams, configurations, and worked examples.

What this is not

This is not compliance consulting content. I do not write checklists pretending to be an auditor. I do not sell audit prep services. I do not promise regulatory blessing.

What I share: patterns that worked when my systems faced auditors, configurations that scaled in production, decisions I would make differently in hindsight.

Get in touch

For content suggestions or technical discussions, reach me at hello@errorbudget.io.

Anonymity does not mean inaccessibility. Genuine technical conversations are always welcome.