Neural Network Training
- Feedforward, backpropagation, and gradient descent (see the sketch below)
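A minimal PyTorch sketch of one training step, with an assumed toy model and dummy data, showing feedforward, backpropagation, and the gradient-descent update:

```python
import torch
import torch.nn as nn

# Illustrative toy model and batch (not from the slides).
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(64, 10), torch.randn(64, 1)  # dummy batch
loss = nn.functional.mse_loss(model(x), y)      # feedforward pass
loss.backward()                                 # backpropagation: compute gradients
opt.step()                                      # gradient-descent parameter update
opt.zero_grad()
```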
Batch Gradient Descent
Stochastic Gradient Descent
Minibatch-Based SGD
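A minimal NumPy sketch contrasting the three update rules, assuming a simple linear model with squared loss (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w, lr = np.zeros(5), 0.01

def grad(w, Xb, yb):
    # Gradient of mean squared error for a linear model on a batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Batch gradient descent: one update per pass over the full dataset.
w -= lr * grad(w, X, y)

# Stochastic gradient descent: one update per single example.
i = rng.integers(len(y))
w -= lr * grad(w, X[i:i+1], y[i:i+1])

# Minibatch SGD: one update per small random subset (e.g., 32 examples).
idx = rng.choice(len(y), size=32, replace=False)
w -= lr * grad(w, X[idx], y[idx])
```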
Model Training Cost
Distributed Training is Necessary
- Developers' and researchers' time is more valuable than hardware.
- If a training run takes 10 GPU-days, parallelize it with distributed training:
- 1024 GPUs can (ideally) finish it in about 14 minutes: 10 GPU-days = 14,400 GPU-minutes, and 14,400 / 1024 ≈ 14 minutes.
- The development and research cycle is greatly accelerated.
Introduction to Distributed Training
Data Parallelism
- Train by splitting the training data across multiple GPUs; each GPU holds a full model replica and computes gradients on its own shard (see the sketch below)
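A hedged sketch of data-parallel training with PyTorch DistributedDataParallel, assuming a single node launched via torchrun (so the rank equals the local GPU index) and a dummy model and batch:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")          # torchrun supplies RANK / WORLD_SIZE
rank = dist.get_rank()
torch.cuda.set_device(rank)              # assumes single node: rank == local GPU

model = DDP(torch.nn.Linear(10, 1).cuda(rank), device_ids=[rank])
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# Each rank works on its own shard of the data (dummy local batch here).
x, y = torch.randn(64, 10).cuda(rank), torch.randn(64, 1).cuda(rank)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()                          # DDP averages gradients across all ranks here
opt.step()                               # every replica applies the same update
```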
Scaling Distributed Machine Learning with the Parameter Server
- All worker nodes synchronize through a single point (the parameter server)
- Two roles in the framework (see the sketch after this list):
  - Parameter server: receives gradients from the workers and sends back the aggregated result (updated parameters)
  - Workers: compute gradients on their split of the dataset and send them to the parameter server
- Problem: the single parameter server becomes a communication bottleneck as the number of workers grows
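A conceptual Python sketch (not the paper's actual API) of one synchronous parameter-server setup: each worker computes a gradient on its own shard, and the server aggregates the gradients, applies an SGD step, and returns the updated parameters:

```python
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.01):
        self.w, self.lr = np.zeros(dim), lr

    def update(self, worker_grads):
        # Aggregate gradients from all workers, apply one SGD step,
        # and return the new parameters to the workers.
        self.w -= self.lr * np.mean(worker_grads, axis=0)
        return self.w

def worker_step(w, X_shard, y_shard):
    # Each worker computes a gradient on its own data shard (linear model, MSE).
    return 2 * X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)

rng = np.random.default_rng(0)
shards = [(rng.normal(size=(100, 5)), rng.normal(size=100)) for _ in range(4)]
server = ParameterServer(dim=5)
w = server.w
for _ in range(10):                                    # training rounds
    grads = [worker_step(w, X, y) for X, y in shards]  # workers (sequential here)
    w = server.update(grads)                           # server aggregates + sends back
```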
Distributed Communication
Note: the last two are typically used in industry at the moment
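For illustration, a minimal sketch of one common collective, all-reduce, using torch.distributed (assuming the process group is already initialized as in the data-parallel sketch above):

```python
import torch
import torch.distributed as dist

# Each rank contributes its own tensor; after all_reduce every rank
# holds the elementwise sum across all ranks (divide to get the mean).
t = torch.ones(4) * dist.get_rank()
dist.all_reduce(t, op=dist.ReduceOp.SUM)
t /= dist.get_world_size()               # averaged result, identical on every rank
```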