- Cluster Design: Architect scalable GPU cluster topologies including compute nodes, interconnect (InfiniBand, Ethernet), storage, and control planes
- Performance Modeling: Analyze AI/ML workloads (e.g., LLM training, inference) to inform design tradeoffs across latency, bandwidth, and GPU density
- Network Architecture: Align designs with network architects and validate low-latency, high-throughput interconnects (e.g., InfiniBand HDR/NDR, RoCEv2) at pod and data-center scale
- Storage Integration: Work with storage teams to optimize performance for training datasets, checkpointing, and other I/O-intensive workloads
- Reliability & Monitoring: Analyze signals from monitoring systems to detect flaws in the design
- Collaboration: Partner with site reliability, networking, storage, and DC engineering teams to operationalize and scale your architecture