🐛 Bug
After upgrading to DGL 2.4.0, I encounter serialization errors when using GraphBolt's OnDiskNpyArray in distributed training. The error occurs when the dataset objects are serialized to be passed to multiple GPU worker processes.
To Reproduce
Steps to reproduce the behavior:
1. Create an on-disk dataset using `gb.OnDiskDataset`
2. Initialize the distributed training environment
3. Attempt to distribute the data across GPUs
Pseudo code sample:
```python
import dgl.graphbolt as gb
from pyspark.ml.torch.distributor import TorchDistributor

def load_data(graph_path):
    dataset = gb.OnDiskDataset(graph_path).load(tasks="link_prediction")
    graph = dataset.graph
    features = dataset.feature
    train_set = dataset.tasks[0].train_set
    validation_set = dataset.tasks[0].validation_set
    test_set = dataset.tasks[0].test_set
    return graph, features, train_set, validation_set, test_set

graph, features, train_set, validation_set, test_set = load_data(GRAPH_PATH)

# Error happens when TorchDistributor tries to serialize these objects
# to pass them to run_instance
distributor = TorchDistributor(
    num_processes=world_size,
    local_mode=False,
    use_gpu=True,
)
distributor.run(
    run_instance,
    -1,
    world_size,
    graph,  # These objects trigger the serialization error
    features,
    train_set,
    validation_set,
    test_set,
)
```
This gives the following error:

```
RuntimeError: Tried to serialize object __torch__.torch.classes.graphbolt.OnDiskNpyArray which does not have a __getstate__ method defined!
```
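The failure pattern does not need DGL to reproduce: TorchDistributor pickles every argument before shipping it to the worker processes, and any object wrapping an OS-level resource (as OnDiskNpyArray wraps an on-disk file) cannot survive that. Below is a minimal self-contained sketch of the pattern, and of the usual workaround of passing only the path and loading inside the worker; the class and function bodies here are illustrative stand-ins, not DGL APIs:

```python
class DiskBackedArray:
    """Stand-in for an on-disk array: it holds an open OS file handle,
    which the pickle module refuses to serialize."""

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "rb")  # unpicklable OS resource


def run_instance(graph_path):
    # Workaround pattern: ship only the (picklable) path string to each
    # worker and load the on-disk data inside the worker process, so the
    # handle-backed objects never cross a process boundary.
    data = DiskBackedArray(graph_path)
    return data.path
```

With this pattern, `distributor.run(run_instance, GRAPH_PATH, ...)` only serializes the path, not the loaded dataset objects.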
Expected behavior
GraphBolt's OnDiskNpyArray should be properly serializable to support distributed training scenarios, as it worked in DGL 2.3.0.
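For context on the message itself: pickle found no `__getstate__` hook on the TorchScript class. In plain Python, the `__getstate__`/`__setstate__` pair is exactly how a handle-holding object stays picklable — drop the handle on save, reopen it on load. A hedged sketch (`ReopenableArray` is a hypothetical stand-in, not a DGL class):

```python
class ReopenableArray:
    """Hypothetical handle-holding object that survives pickling by
    reopening its file on the deserializing side."""

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "rb")

    def __getstate__(self):
        # Serialize only the path; the OS handle cannot cross processes.
        return {"path": self.path}

    def __setstate__(self, state):
        # Reopen the file in the deserializing process.
        self.path = state["path"]
        self.handle = open(self.path, "rb")
```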
Environment
DGL Version (e.g., 1.0): 2.4.0
Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): torch==2.3.1, cuda 12.1
OS (e.g., Linux): Linux
How you installed DGL (conda, pip, source): pip (via Databricks wheel)
Build command you used (if compiling from source): N/A
Python version: 3.11
CUDA/cuDNN version (if applicable): 12.1
GPU models and configuration (e.g. V100): AWS g5.48xlarge (A10G)
Any other relevant information: