Scale Neural Network Training with SageMaker Distributed

As a machine learning practitioner, you might find yourself in one of the following situations. You have found the perfect state-of-the-art transformer-based model, only to hit out-of-memory errors the moment you try to fine-tune it. You have added billions of parameters to your model, which should improve its performance, but this too ends in an out-of-memory error. Or you can comfortably fit your model on a single GPU, yet struggle to take advantage of all of your GPUs and find that training still takes days.

Should you just accept the status quo and limit your applications to models that fit within the existing hardware capacity or that train within an acceptable time? Of course not! Enter model parallelism and data parallelism on Amazon SageMaker. Read More
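To give a flavor of what this looks like in practice, here is a minimal sketch of launching a training job with SageMaker's data parallelism library enabled through the SageMaker Python SDK. The entry point script, IAM role, S3 path, and version numbers below are placeholders, not values from the article; substitute your own.

```python
# Minimal sketch: a SageMaker training job with the distributed data
# parallelism library turned on. Names and versions are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                # hypothetical training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",   # placeholder IAM role ARN
    instance_type="ml.p3.16xlarge",                        # multi-GPU instance type
    instance_count=2,                                      # scale out across instances
    framework_version="1.12.0",                            # example version, adjust to your setup
    py_version="py38",
    # Enable SageMaker's distributed data parallelism library for this job
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit({"training": "s3://my-bucket/training-data"})  # hypothetical S3 input
```

Model parallelism is enabled through the same `distribution` argument, using the `modelparallel` key and its configuration parameters instead of `dataparallel`.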

#mlaas