This module orchestrates submission of a Ray training job to the Ray Cluster and an inference job using AWS Step Functions.
- namespace - Kubernetes namespace name
- eks_cluster_admin_role_arn - ARN of EKS admin role to authenticate kubectl
- eks_handler_role_arn - ARN of EKS admin role to authenticate kubectl
- eks_cluster_name - Name of the EKS cluster to deploy to
- eks_cluster_endpoint - EKS cluster endpoint
- eks_oidc_arn - ARN of EKS OIDC provider for IAM roles
- eks_cert_auth_data - Auth certificate
- step_function_timeout - Step function timeout in minutes. Defaults to 360.
- data_bucket_name - Name of the bucket to grant service account permissions to
- pvc_name - Persistent volume claim name. Empty by default. If no PVC is provided, the volume will not be mounted.
- dra_export_path - Persistent volume mount path. Defaults to /ray/export/. Must start with a /.
- tags - A dictionary of additional tags to apply to all resources. Defaults to None.
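
The documented defaults and constraints for these parameters can be sketched as a small validation helper. This is only an illustration: the `OrchestratorParams` class and its `mounts_volume` property are hypothetical names, not part of the module, and the positive-timeout check is an assumption beyond what is stated above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class OrchestratorParams:
    """Hypothetical container mirroring a subset of the module's inputs."""

    namespace: str
    eks_cluster_name: str
    data_bucket_name: str
    step_function_timeout: int = 360  # minutes (documented default)
    pvc_name: str = ""  # empty by default: no volume is mounted
    dra_export_path: str = "/ray/export/"  # documented default
    tags: Optional[Dict[str, str]] = None

    def __post_init__(self) -> None:
        # Documented constraint: the mount path must start with "/".
        if not self.dra_export_path.startswith("/"):
            raise ValueError("dra_export_path must start with '/'")
        # Assumption (not stated in the parameter list): timeout is positive.
        if self.step_function_timeout <= 0:
            raise ValueError("step_function_timeout must be positive")

    @property
    def mounts_volume(self) -> bool:
        """True only when a persistent volume claim name was provided."""
        return bool(self.pvc_name)
```

Constructing `OrchestratorParams` with a relative `dra_export_path` such as `ray/export/` raises `ValueError`, which catches the mistake before deployment.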
- Navigate to AWS Step Functions and find the step function whose name starts with "TrainingOnEks"
- Start a new Step Function execution
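
The same two console steps could also be scripted. The sketch below assumes a boto3 Step Functions client is passed in (for example `boto3.client("stepfunctions")`) and that the state machine accepts an execution without explicit input; neither is confirmed by this module. The helper names `find_state_machine_arn` and `start_training` are hypothetical.

```python
from typing import Any, Dict, Iterable, Optional


def find_state_machine_arn(
    machines: Iterable[Dict[str, Any]], prefix: str = "TrainingOnEks"
) -> Optional[str]:
    """Return the ARN of the first state machine whose name starts with prefix."""
    for machine in machines:
        if machine["name"].startswith(prefix):
            return machine["stateMachineArn"]
    return None


def start_training(sfn_client: Any, prefix: str = "TrainingOnEks") -> str:
    """Find the matching state machine and start a new execution.

    sfn_client is expected to behave like boto3.client("stepfunctions").
    """
    paginator = sfn_client.get_paginator("list_state_machines")
    machines = (m for page in paginator.paginate() for m in page["stateMachines"])
    arn = find_state_machine_arn(machines, prefix)
    if arn is None:
        raise LookupError(f"No state machine found with prefix {prefix!r}")
    # Assumption: no execution input is required by the state machine.
    response = sfn_client.start_execution(stateMachineArn=arn)
    return response["executionArn"]
```

Passing the client in as an argument keeps the lookup logic independent of AWS credentials, so it can be exercised with a stub before running against a real account.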
To observe the progress of the job using the Ray Dashboard:
- Connect to the EKS cluster:
aws eks update-kubeconfig --region us-east-1 --name eks-cluster-xxx
- Get Ray service endpoint:
kubectl get endpoints -n ray
NAME ENDPOINTS AGE
kuberay-head-svc ...:8080,...:10001,...:8000 + 2 more... 98s
kuberay-operator ...:8080 6m37s
- Start port forwarding:
kubectl port-forward -n ray --address 0.0.0.0 service/kuberay-head-svc 8265:8265
- Access the Ray Dashboard at http://localhost:8265
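
Once port forwarding is active, job status can also be polled from a script instead of the browser. The sketch below assumes the dashboard exposes the Ray Jobs REST endpoint at `/api/jobs/` and that each returned job object carries a `status` field; check both against the Ray version running in your cluster. The function names `fetch_jobs` and `summarize_jobs` are hypothetical.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen


def summarize_jobs(jobs: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count jobs per status; jobs without a status are counted as UNKNOWN."""
    counts: Dict[str, int] = {}
    for job in jobs:
        status = job.get("status", "UNKNOWN")
        counts[status] = counts.get(status, 0) + 1
    return counts


def fetch_jobs(
    base_url: str = "http://localhost:8265", timeout: float = 10.0
) -> List[Dict[str, Any]]:
    """Fetch the job list from the port-forwarded dashboard.

    Assumption: GET {base_url}/api/jobs/ returns a JSON list of job objects.
    """
    with urlopen(f"{base_url}/api/jobs/", timeout=timeout) as response:
        return json.load(response)
```

With the port forward running, `summarize_jobs(fetch_jobs())` gives a quick per-status count of the jobs known to the cluster.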
name: ray-orchestrator
path: modules/eks/ray-orchestrator
parameters:
  - name: Namespace
    valueFrom:
      parameterValue: rayNamespaceName
  - name: EksClusterAdminRoleArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterMasterRoleArn
  - name: EksHandlerRoleArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksHandlerRoleArn
  - name: EksClusterName
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterName
  - name: EksClusterEndpoint
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterEndpoint
  - name: EksOidcArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksOidcArn
  - name: EksCertAuthData
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterCertAuthData
  - name: DataBucketName
    valueFrom:
      moduleMetadata:
        group: base
        name: buckets
        key: ArtifactsBucketName
  - name: PvcName
    valueFrom:
      moduleMetadata:
        group: integration
        name: lustre-on-eks
        key: PersistentVolumeClaimName
  - name: DraExportPath
    valueFrom:
      parameterValue: draExportPath
- EksServiceAccountName: Service Account Name.
- EksServiceAccountRoleArn: Service Account Role ARN.
- StateMachineArn: Step Function ARN.
- LogGroupArn: Log group ARN.