In this video from the 2019 OpenFabrics Workshop in Austin, Xiaoyi Lu from Ohio State University presents "Accelerating TensorFlow with RDMA for High-Performance Deep Learning."
Google’s TensorFlow is one of the most popular Deep Learning (DL) frameworks. In distributed TensorFlow, gradient updates are a critical step governing the total model training time. These updates incur a massive volume of data transfer over the network. In this talk, we first present a thorough analysis of the communication patterns in distributed TensorFlow. Then, we propose a unified way of achieving high performance by enhancing the gRPC runtime with Remote Direct Memory Access (RDMA) technology on InfiniBand and RoCE. With our proposed RDMA-gRPC design, TensorFlow needs to run only over the gRPC channel to get optimal performance. Our design includes advanced features such as Message Pipelining, Message Coalescing, and Zero-Copy Transmission. The performance evaluations show that our proposed design can speed up gRPC throughput by up to 2.6x compared to the default gRPC design. By integrating our RDMA-gRPC with TensorFlow, we are able to achieve up to 56% performance improvement for TensorFlow training with CNN models.
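For readers unfamiliar with the term, the general idea behind message coalescing can be sketched in a few lines of Python. This is only an illustration of the concept, not the RDMA-gRPC implementation described in the talk: the function names `coalesce` and `split`, the 4-byte length-prefix framing, and the `max_bytes` threshold are all assumptions made for this example.

```python
import struct

# Hypothetical framing: each message is preceded by a 4-byte big-endian length.
HEADER = struct.Struct("!I")


def coalesce(messages, max_bytes=4096):
    """Pack small messages into as few buffers as possible.

    Each buffer holds at most max_bytes, except that a single message
    larger than max_bytes still gets a buffer of its own.
    """
    buffers = []
    current = bytearray()
    for msg in messages:
        framed = HEADER.pack(len(msg)) + msg
        # Flush the current buffer if this message would overflow it.
        if current and len(current) + len(framed) > max_bytes:
            buffers.append(bytes(current))
            current = bytearray()
        current += framed
    if current:
        buffers.append(bytes(current))
    return buffers


def split(buffer):
    """Recover the individual messages from one coalesced buffer."""
    messages = []
    offset = 0
    while offset < len(buffer):
        (length,) = HEADER.unpack_from(buffer, offset)
        offset += HEADER.size
        messages.append(buffer[offset:offset + length])
        offset += length
    return messages
```

Sending one larger buffer instead of many small ones amortizes per-message overhead on the wire; the receiver calls `split` on each buffer to get the original messages back in order.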
Xiaoyi Lu is a Research Scientist in the Department of Computer Science and Engineering at The Ohio State University. He is currently working with Prof. Dhabaleswar K. (DK) Panda in the Network-Based Computing Lab. His research interests include High Performance Interconnects and Protocols, Big Data Computing (Hadoop/Spark Ecosystem), Parallel Computing (MPI/PGAS), Virtualization, and Cloud Computing.