
OnDiskNpyArray RuntimeError when upgrading dgl 2.3.0 -> 2.4.0 #7859

Open
ceeskaan opened this issue Jan 20, 2025 · 1 comment
🐛 Bug

After upgrading to DGL 2.4.0, I encounter a serialization error when using GraphBolt's OnDiskNpyArray in a distributed training setup. The error occurs when the loaded dataset objects are serialized so they can be passed to multiple GPU worker processes.

To Reproduce

Steps to reproduce the behavior:

  1. Create an on-disk dataset using gb.OnDiskDataset
  2. Initialize distributed training environment
  3. Attempt to distribute data across GPUs

Pseudo code sample (imports added for completeness):

import dgl.graphbolt as gb
from pyspark.ml.torch.distributor import TorchDistributor

def load_data(graph_path):
    dataset = gb.OnDiskDataset(graph_path).load(tasks="link_prediction")
    graph = dataset.graph
    features = dataset.feature
    train_set = dataset.tasks[0].train_set
    validation_set = dataset.tasks[0].validation_set
    test_set = dataset.tasks[0].test_set
    return graph, features, train_set, validation_set, test_set

graph, features, train_set, validation_set, test_set = load_data(GRAPH_PATH)

# Error happens when TorchDistributor tries to serialize these objects
# to pass them to run_instance
distributor = TorchDistributor(
    num_processes=world_size, 
    local_mode=False,
    use_gpu=True
)

distributor.run(
    run_instance,
    -1,
    world_size,
    graph,  # These objects trigger serialization error
    features,
    train_set,
    validation_set,
    test_set
)

This gives the following error:
RuntimeError: Tried to serialize object __torch__.torch.classes.graphbolt.OnDiskNpyArray which does not have a __getstate__ method defined!
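The error message points at a missing __getstate__: pickle cannot serialize an object that wraps a process-local resource (here, an on-disk/mmap-backed array) unless the class says how to reduce itself to picklable state. A minimal, DGL-free sketch of the pattern, using an open file handle as a hypothetical stand-in for the on-disk array:

```python
# Minimal sketch (not DGL code): pickling an object that wraps an
# unpicklable resource works only if __getstate__/__setstate__ drop the
# resource and reopen it by path in the receiving process.
import os
import pickle
import tempfile

class DiskBackedArray:
    """Hypothetical stand-in for an on-disk array wrapper."""
    def __init__(self, path):
        self.path = path
        self._fh = open(path, "rb")  # process-local, unpicklable resource

    def __getstate__(self):
        # Serialize everything except the open handle.
        state = self.__dict__.copy()
        del state["_fh"]
        return state

    def __setstate__(self, state):
        # Reopen the file in the process that deserializes the object.
        self.__dict__.update(state)
        self._fh = open(self.path, "rb")

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"data")
    path = f.name

arr = DiskBackedArray(path)
clone = pickle.loads(pickle.dumps(arr))  # succeeds thanks to __getstate__
print(clone.path == arr.path)  # True
os.unlink(path)
```

Without the two dunder methods, pickling the example above fails the same way the OnDiskNpyArray does.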

Expected behavior

GraphBolt's OnDiskNpyArray should be serializable so it can support distributed training scenarios, as it did in DGL 2.3.0.

Environment

  • DGL Version (e.g., 1.0): 2.4.0
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): torch==2.3.1, cuda 12.1
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): pip (via Databricks wheel)
  • Build command you used (if compiling from source): N/A
  • Python version: 3.11
  • CUDA/cuDNN version (if applicable): 12.1
  • GPU models and configuration (e.g. V100): g5.48xlarge (A10G)
  • Any other relevant information:
@ceeskaan (Author) commented:

@mfbalin Any thoughts on this?
