About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who enjoy challenging themselves and thrive on curiosity. We operate with a flat organizational structure; all employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. A strong work ethic and sharp prioritization skills are essential, as are communication skills: every employee should be able to share knowledge with teammates concisely and accurately.

ABOUT THE ROLE:
You will build the tooling that turns a hardware listing and a deployment profile into a complete, self-contained software bundle capable of standing up xAI's full AI inference platform, from bare-metal provisioning through GPU workloads, at any site, in any environment, with no internet access required.
xAI operates GPU infrastructure across public cloud, on-premise, and classified environments. Today, these targets are served by separate codebases that drift with every release. You will build the unified deployment platform that eliminates this divergence: a single generator that reads a thin profile (site topology, compliance requirements, connectivity model) and produces everything needed to deploy — Kubernetes manifests, switch configurations, OS provisioning configs, monitoring stacks, signed container image bundles, and acceptance tests. One source, every target.
You work on the unclassified (low) side. You build the tooling; cleared engineers at classified sites execute it. The quality of what you build directly determines how effectively those engineers can operate in environments where they cannot call you for help. Your tooling must be deterministic, complete, well-tested, and foolproof.
WHAT YOU WILL DO:
Design and build the deployment generator: a Go CLI that reads a YAML profile (six deployment axes plus site topology) and produces a fully resolved deployment manifest with pinned image digests, rendered Helm values, switch configs, OSP inventory, network telemetry configuration, and AlertManager grouping/inhibition rules computed from the site topology.
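To make the idea concrete, here is a minimal sketch of topology-driven rule generation. The `Profile` struct, its fields, and the single-rack-vs-multi-rack heuristic are all illustrative assumptions, not xAI's actual schema; the point is that the generator derives alert grouping deterministically from the site topology in the profile.

```go
package main

import "fmt"

// Profile is a hypothetical slice of the deployment profile; the real
// profile has six axes plus full site topology, which are not public.
type Profile struct {
	Target       string   // e.g. "public-cloud", "enterprise-onprem", "classified-airgap"
	Connectivity string   // e.g. "online", "airgapped"
	Racks        []string // rack identifiers from the site topology
}

// groupingLabels derives AlertManager grouping labels from topology:
// a single-rack site groups by alertname only, while a multi-rack site
// also groups by rack so one rack's failures don't mask another's.
func groupingLabels(p Profile) []string {
	labels := []string{"alertname"}
	if len(p.Racks) > 1 {
		labels = append(labels, "rack")
	}
	return labels
}

func main() {
	p := Profile{Target: "classified-airgap", Racks: []string{"r1", "r2"}}
	fmt.Println(groupingLabels(p)) // prints [alertname rack]
}
```

Because the output is a pure function of the profile, two runs over the same profile produce byte-identical rules, which is what makes the generated bundle auditable and reproducible on the classified side.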
Build the bundle pipeline: collect all referenced container images, Helm charts, OS boot images, NVIDIA drivers, and model weights into a signed, self-contained tarball with CycloneDX SBOM and cosign signatures. Build the update bundle pipeline for delta-only updates: diff against the previously shipped baseline manifest, package only changed artifacts, sign, and include apply-update scripts and machine-readable changelogs.
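The core of a delta-only update is a diff of artifact digests against the previously shipped baseline manifest. This sketch assumes manifests reduce to a name-to-digest map, which is a simplification; the real pipeline would also handle removals, signing, and changelog emission.

```go
package main

import (
	"fmt"
	"sort"
)

// deltaArtifacts returns the artifacts that are new or whose digest
// changed relative to the previously shipped baseline manifest.
// Unchanged artifacts are omitted, keeping the update bundle small.
func deltaArtifacts(baseline, current map[string]string) []string {
	var changed []string
	for name, digest := range current {
		if baseline[name] != digest {
			changed = append(changed, name)
		}
	}
	sort.Strings(changed) // deterministic ordering for reproducible bundles
	return changed
}

func main() {
	baseline := map[string]string{
		"inference": "sha256:aaa",
		"driver":    "sha256:bbb",
	}
	current := map[string]string{
		"inference": "sha256:ccc", // updated
		"driver":    "sha256:bbb", // unchanged
		"weights":   "sha256:ddd", // new
	}
	fmt.Println(deltaArtifacts(baseline, current)) // prints [inference weights]
}
```

Pinning by content digest rather than tag is what makes the diff trustworthy: a tag can move, a digest cannot.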
Implement profile-driven rendering: the same model deployment YAML, the same operator charts, the same monitoring stack produce correct output for public cloud (ArgoCD), enterprise on-prem (Pulumi), and classified air-gap (static manifests) targets based on profile selection.
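The target-to-delivery-mechanism mapping described above can be sketched as a simple dispatch. The profile names and return strings are illustrative stand-ins; only the three targets and their mechanisms (ArgoCD, Pulumi, static manifests) come from the role description.

```go
package main

import "fmt"

// renderTarget maps a profile's deployment target to its delivery
// mechanism. The same upstream inputs (model YAML, operator charts,
// monitoring stack) feed each branch; only the output format differs.
func renderTarget(profile string) (string, error) {
	switch profile {
	case "public-cloud":
		return "argocd-application", nil
	case "enterprise-onprem":
		return "pulumi-program", nil
	case "classified-airgap":
		return "static-manifests", nil
	}
	return "", fmt.Errorf("unknown profile %q", profile)
}

func main() {
	for _, p := range []string{"public-cloud", "enterprise-onprem", "classified-airgap"} {
		out, _ := renderTarget(p)
		fmt.Println(p, "->", out)
	}
}
```

Rejecting unknown profiles with an error, rather than falling through to a default, is the kind of fail-closed behavior air-gapped operation demands.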
Build the testing and validation framework: manifest validation against CRD schemas (kubeconform), profile-specific constraint checks (no external dependencies in air-gap profiles, FIPS requirements for gov profiles), acceptance test generation, and shadow cluster pre-transfer testing.
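A profile-specific constraint check can be as direct as scanning every image reference for hosts outside the bundled registry. The `registry.local/` prefix is a placeholder for the site's in-bundle registry, and this string check is a sketch; a production validator would parse references properly and cover charts, boot images, and chart-embedded URLs too.

```go
package main

import (
	"fmt"
	"strings"
)

// checkAirgap flags image references that point outside the bundled
// local registry: an air-gapped site cannot pull from the internet,
// so any such reference is a deployment-time failure waiting to happen.
func checkAirgap(images []string) []string {
	var violations []string
	for _, img := range images {
		if !strings.HasPrefix(img, "registry.local/") {
			violations = append(violations, img)
		}
	}
	return violations
}

func main() {
	images := []string{
		"registry.local/xai/inference:v1",
		"docker.io/library/busybox:latest", // external pull: not allowed air-gapped
	}
	fmt.Println(checkAirgap(images)) // prints [docker.io/library/busybox:latest]
}
```

Catching this on the low side, before the bundle ships, is the whole point: a cleared engineer at a classified site cannot debug a failed image pull against the public internet.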
Own the monitoring stack migration: transition the on-prem K8s monitoring from Prometheus (kube-prometheus-stack) to VictoriaMetrics (VM Operator + VMSingle + VMAgent + VMAlert) to align with the public baseline and network telemetry stack. Ensure dashboards, alert rules, and ServiceMonitors work unchanged after migration.
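One piece of such a migration is mapping kube-prometheus-stack CRD kinds to their VictoriaMetrics operator equivalents. The kind names below are the VM operator's documented CRDs, and the operator can also consume the Prometheus CRDs directly; an explicit rename table like this sketch is just one way to make the conversion auditable.

```go
package main

import "fmt"

// vmKind maps a kube-prometheus-stack CRD kind to its VictoriaMetrics
// operator counterpart. Returns false for kinds with no VM equivalent.
func vmKind(promKind string) (string, bool) {
	m := map[string]string{
		"ServiceMonitor": "VMServiceScrape",
		"PodMonitor":     "VMPodScrape",
		"PrometheusRule": "VMRule",
		"Probe":          "VMProbe",
	}
	k, ok := m[promKind]
	return k, ok
}

func main() {
	k, _ := vmKind("ServiceMonitor")
	fmt.Println(k) // prints VMServiceScrape
}
```

Keeping ServiceMonitors working unchanged, as the role requires, likely means leaning on the operator's built-in Prometheus-CRD support during the transition and converting kinds only once the old stack is retired.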
COMPENSATION AND BENEFITS: $180,000 - $440,000 USD
Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks. xAI is an equal opportunity employer. For details on data processing, view our Recruitment Privacy Notice.