fix: orchestrator logs (#266)
* do not return full training log

* revert 0.01 epochs

* changelog

* update cluster resources
kukushking authored Nov 11, 2024
1 parent 24de2ba commit 35bac20
Showing 5 changed files with 12 additions and 11 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -18,7 +18,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### **Changed**
- changed `ray-image` to pull from AWS Public ECR to avoid docker pull rate limits
-- changed `ray-orchestrator` sample script epochs to 0.01 to support demo scenarios
+- changed `ray-orchestrator` to not retrieve full training job logs and avoid `States.DataLimitExceeded`
+- update `ray-on-eks` manifest cluster resources

## v1.7.0

4 changes: 2 additions & 2 deletions manifests/ray-on-eks/core-modules.yaml
@@ -44,9 +44,9 @@ parameters:
eks_node_labels:
usage: core
- eks_ng_name: ng-gpu
-eks_node_quantity: 1
+eks_node_quantity: 5
eks_node_max_quantity: 10
-eks_node_min_quantity: 1
+eks_node_min_quantity: 5
eks_node_disk_size: 400
eks_node_instance_type: "g4dn.4xlarge"
eks_node_labels:
12 changes: 6 additions & 6 deletions manifests/ray-on-eks/ray-cluster-modules.yaml
@@ -31,11 +31,11 @@ parameters:
- name: HeadResources
value:
requests:
-cpu: "1"
-memory: "8G"
+cpu: "8"
+memory: "24G"
limits:
-cpu: "4"
-memory: "16G"
+cpu: "8"
+memory: "24G"
- name: WorkerReplicas
value: 1
- name: WorkerMinReplicas
@@ -45,8 +45,8 @@ parameters:
- name: WorkerResources
value:
requests:
-cpu: "4"
-memory: "8G"
+cpu: "14"
+memory: "60G"
limits:
cpu: "14"
memory: "60G"
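
Note on the resource changes above: with this commit, `HeadResources` and `WorkerResources` both set requests equal to limits for CPU and memory. In Kubernetes, a pod whose containers all have CPU and memory limits with matching requests is assigned the Guaranteed QoS class, while a pod with some requests or limits set but not matching is Burstable. The sketch below is a simplified single-container illustration of that rule, not code from this repository. `qos_class` is a hypothetical helper, and it compares quantity strings literally instead of parsing Kubernetes quantities.

```python
def qos_class(requests: dict, limits: dict) -> str:
    """Approximate the Kubernetes QoS class for a single-container pod.

    Assumption: quantities are compared as plain strings ("8" == "8"),
    so equivalent spellings such as "1" and "1000m" are not normalized.
    """
    if not requests and not limits:
        return "BestEffort"
    for resource in ("cpu", "memory"):
        limit = limits.get(resource)
        if limit is None:
            # Guaranteed requires a limit for both cpu and memory.
            return "Burstable"
        # A missing request defaults to the limit in Kubernetes.
        if requests.get(resource, limit) != limit:
            return "Burstable"
    return "Guaranteed"


# Values taken from the manifest diff above.
old_head = qos_class({"cpu": "1", "memory": "8G"}, {"cpu": "4", "memory": "16G"})
new_head = qos_class({"cpu": "8", "memory": "24G"}, {"cpu": "8", "memory": "24G"})
new_worker = qos_class({"cpu": "14", "memory": "60G"}, {"cpu": "14", "memory": "60G"})
print(old_head, new_head, new_worker)
```

Under this simplified rule, the previous head configuration is Burstable and the new head and worker configurations are Guaranteed.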
2 changes: 1 addition & 1 deletion modules/eks/ray-orchestrator/ray_orchestrator_stack.py
@@ -200,7 +200,7 @@ def __init__(
"Namespace": namespace_name,
"CertificateAuthority": eks_cert_auth_data,
"Endpoint": eks_cluster_endpoint,
-"LogOptions": {"RetrieveLogs": True},
+"LogOptions": {"RetrieveLogs": False},
"Job": training_body,
},
},
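
Note on the `RetrieveLogs` change above: when an EKS job task in Step Functions retrieves logs, the job's log text becomes part of the task output, and Step Functions limits task input and output to 256 KiB, failing with `States.DataLimitExceeded` beyond that. The sketch below illustrates why a long training log can trip that limit. It is a minimal illustration under stated assumptions, not the stack's actual code: `run_job_parameters`, `exceeds_payload_limit`, the placeholder cluster values, and the shape of the simulated output are all hypothetical.

```python
import json

# Step Functions quota for task/state input or output, as a UTF-8 string.
MAX_STATE_PAYLOAD_BYTES = 256 * 1024


def run_job_parameters(job: dict, retrieve_logs: bool = False) -> dict:
    """Build a parameters dict shaped like the one in the diff above.

    All literal values are placeholders for illustration only.
    """
    return {
        "ClusterName": "example-cluster",
        "Namespace": "example-namespace",
        "CertificateAuthority": "example-ca-data",
        "Endpoint": "https://example.invalid",
        "LogOptions": {"RetrieveLogs": retrieve_logs},
        "Job": job,
    }


def exceeds_payload_limit(output: dict) -> bool:
    """Return True if the JSON-encoded output is larger than 256 KiB."""
    return len(json.dumps(output).encode("utf-8")) > MAX_STATE_PAYLOAD_BYTES


# Simulated task outputs (hypothetical shape): one carrying a large
# training log, one carrying only a short status.
with_logs = {"status": "succeeded", "logs": "step loss=0.1\n" * 30_000}
without_logs = {"status": "succeeded"}

params = run_job_parameters({"kind": "Job"}, retrieve_logs=False)
print(params["LogOptions"])
print(exceeds_payload_limit(with_logs), exceeds_payload_limit(without_logs))
```

If the full log is still needed, reading it from the cluster's own log destination instead of the state output keeps the state payload small.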
2 changes: 1 addition & 1 deletion modules/eks/ray-orchestrator/scripts/training-6B.py
@@ -205,7 +205,7 @@ def compute_metrics(eval_pred):
trainer = TorchTrainer(
train_loop_per_worker=train_func,
train_loop_config={
-"epochs": 0.01,
+"epochs": 1,
"batch_size": batch_size, # per device
"steps_per_epoch": steps_per_epoch,
},
