
OnDiskNpyArray RuntimeError when upgrading dgl 2.3.0 -> 2.4.0 #7859

Open
ceeskaan opened this issue Jan 20, 2025 · 1 comment
🐛 Bug

After upgrading to DGL 2.4.0, I encounter a serialization error when using GraphBolt's OnDiskNpyArray in a distributed training setup. The error occurs when the loaded dataset objects are serialized so they can be passed to multiple GPU worker processes.

To Reproduce

Steps to reproduce the behavior:

  1. Create an on-disk dataset using gb.OnDiskDataset
  2. Initialize distributed training environment
  3. Attempt to distribute data across GPUs

Pseudo code sample (imports added for completeness):

import dgl.graphbolt as gb
from pyspark.ml.torch.distributor import TorchDistributor

def load_data(graph_path):
    dataset = gb.OnDiskDataset(graph_path).load(tasks="link_prediction")
    graph = dataset.graph
    features = dataset.feature
    train_set = dataset.tasks[0].train_set
    validation_set = dataset.tasks[0].validation_set
    test_set = dataset.tasks[0].test_set
    return graph, features, train_set, validation_set, test_set

graph, features, train_set, validation_set, test_set = load_data(GRAPH_PATH)

# Error happens when TorchDistributor tries to serialize these objects
# to pass them to run_instance
distributor = TorchDistributor(
    num_processes=world_size, 
    local_mode=False,
    use_gpu=True
)

distributor.run(
    run_instance,
    -1,
    world_size,
    graph,  # These objects trigger serialization error
    features,
    train_set,
    validation_set,
    test_set
)

This gives the following error:
RuntimeError: Tried to serialize object __torch__.torch.classes.graphbolt.OnDiskNpyArray which does not have a __getstate__ method defined!
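The error message points at a missing __getstate__: pickle cannot serialize an object that wraps a process-local resource (here, an on-disk/mmap-backed array) unless the class says how to reduce itself to picklable state. A minimal, DGL-free sketch of the pattern, using an open file handle as a hypothetical stand-in for the on-disk array:

```python
# Minimal sketch (not DGL code): pickling an object that wraps an
# unpicklable resource works only if __getstate__/__setstate__ drop the
# resource and reopen it by path in the receiving process.
import os
import pickle
import tempfile

class DiskBackedArray:
    """Hypothetical stand-in for an on-disk array wrapper."""
    def __init__(self, path):
        self.path = path
        self._fh = open(path, "rb")  # process-local, unpicklable resource

    def __getstate__(self):
        # Serialize everything except the open handle.
        state = self.__dict__.copy()
        del state["_fh"]
        return state

    def __setstate__(self, state):
        # Reopen the file in the process that deserializes the object.
        self.__dict__.update(state)
        self._fh = open(self.path, "rb")

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"data")
    path = f.name

arr = DiskBackedArray(path)
clone = pickle.loads(pickle.dumps(arr))  # succeeds thanks to __getstate__
print(clone.path == arr.path)  # True
os.unlink(path)
```

Without the two dunder methods, pickling the example above fails the same way the OnDiskNpyArray does.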

Expected behavior

GraphBolt's OnDiskNpyArray should be serializable so it can support distributed training scenarios, as it did in DGL 2.3.0.

Environment

  • DGL Version (e.g., 1.0): 2.4.0
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): torch==2.3.1, cuda 12.1
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): pip (via Databricks wheel)
  • Build command you used (if compiling from source): N/A
  • Python version: 3.11
  • CUDA/cuDNN version (if applicable): 12.1
  • GPU models and configuration (e.g. V100): g5.48xlarge (A10G)
  • Any other relevant information:
@ceeskaan (Author) commented:

@mfbalin Any thoughts on this?
