Community Note

- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
- Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request.
- If you are interested in working on this issue or have submitted a pull request, please leave a comment.
What is the outcome that you are trying to reach?
A tutorial that shows how to launch a distributed PyTorch Lightning (PTL) neuronx-distributed pre-training job on a Ray cluster with multiple Trn1 nodes within an Amazon Elastic Kubernetes Service (EKS) cluster. Many customers are looking for examples that combine these technologies (Ray + PTL + Neuron) on AWS AI accelerators; a sketch of the launch step follows below.
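As a minimal sketch of what the launch step could look like, assuming a KubeRay-managed Ray cluster is already running in the EKS cluster: the head-service address, the `train.py` entrypoint, and the pip package list are placeholders, while the Ray Job Submission SDK calls themselves are part of Ray's public API.

```python
# Sketch only: submit a pre-training job to a Ray cluster running on EKS.
# The head-service address and entrypoint are hypothetical placeholders.
from ray.job_submission import JobSubmissionClient

# KubeRay typically exposes the Ray dashboard on port 8265 of the head service.
client = JobSubmissionClient("http://raycluster-head-svc:8265")

job_id = client.submit_job(
    # train.py would contain the PTL + neuronx-distributed training loop
    # (see the sketch in the next section).
    entrypoint="python train.py",
    runtime_env={
        "working_dir": ".",
        # neuronx-distributed is installed from the AWS Neuron pip repository;
        # exact package pins depend on the Neuron SDK release in use.
        "pip": ["pytorch-lightning", "neuronx-distributed"],
    },
)
print(f"Submitted Ray job: {job_id}")
```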
Describe the solution you would like
The integration of Ray, PyTorch Lightning (PTL), and AWS Neuron brings together PTL's intuitive model-development API, Ray Train's distributed computing capabilities for scaling across multiple nodes, and AWS Neuron's hardware optimization for Trainium. Together they significantly simplify the setup and management of distributed training environments for large-scale AI projects, particularly computationally intensive workloads such as large language model pre-training. A sketch of how these pieces might fit together is shown below.
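As a rough sketch of the shape such a tutorial might take, the snippet below wires a toy PyTorch Lightning module into Ray Train's TorchTrainer. `TorchTrainer` and `ScalingConfig` are Ray's public APIs; `NeuronXLAStrategy` (from `neuronx_distributed.lightning`) and the `"neuron_cores"` resource key follow the Neuron and Ray documentation but should be treated as assumptions to verify against the installed SDK versions.

```python
# Sketch only: a Ray Train driver that runs a PyTorch Lightning training
# loop on Trainium workers. Verify NeuronXLAStrategy and the "neuron_cores"
# resource key against the Neuron SDK / Ray versions in use.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


class ToyModule(pl.LightningModule):
    """Stand-in for a real neuronx-distributed model definition."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def train_loop_per_worker(config: dict):
    # Assumed API: the Neuron SDK ships Lightning integrations under
    # neuronx_distributed.lightning (e.g. NeuronXLAStrategy).
    from neuronx_distributed.lightning import NeuronXLAStrategy

    data = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
    trainer = pl.Trainer(
        max_epochs=config["max_epochs"],
        strategy=NeuronXLAStrategy(),  # routes collectives over Neuron/XLA
        enable_checkpointing=False,
    )
    trainer.fit(ToyModule(), DataLoader(data, batch_size=32))


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"max_epochs": 1},
    scaling_config=ScalingConfig(
        num_workers=2,  # e.g. one worker per Trn1 node
        # Assumed resource key for Ray's AWS Neuron accelerator support.
        resources_per_worker={"neuron_cores": 32},
    ),
)
result = trainer.fit()
```

Ray handles worker placement and rendezvous across the Trn1 nodes, so each worker only builds its own Lightning trainer; a real tutorial would replace ToyModule with a neuronx-distributed model and add checkpointing.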