Lerna Ekmekcioglu
Lerna is a Senior Solutions Engineer at Clockwork Systems, where she helps customers meet their performance goals with software solutions built on Clockwork.io’s foundational research. Prior to this, she was a Senior Solutions Architect at AWS for 3 years. Before that, Lerna spent 17 years as an infrastructure engineer at large financial services companies, working on authentication systems, distributed caching, and multi-region deployments using IaC and CI/CD, to name a few. In her spare time, she enjoys hiking, sightseeing, and backyard astronomy.
Session
Race cars are engineered for peak performance, designed to push the limits of speed while maintaining control and stability. Similarly, AI/ML jobs need a high-speed, reliable network fabric to deliver results efficiently. Just as a race car depends on a well-maintained race track to perform at its best, AI/ML jobs rely on network fabrics, such as RoCE and InfiniBand, to ensure fast, reliable communication.
In this session, I’ll cover key networking challenges impacting the performance and reliability of AI/ML jobs, such as NIC flapping and network contention. These challenges are like debris on a race track, causing slowdowns, disruptions, and even crashes that not only lengthen job completion times but also eat into ROI through costly rollbacks to previous checkpoints.
Through demos, I’ll illustrate how these networking challenges directly affect the reliability and performance of AI/ML jobs. Just as a pit crew stays ahead of problems by constantly monitoring and tuning the race car, DevOps teams must understand these critical network challenges to ensure AI/ML jobs perform at full speed!