OnDiskDataset Preprocessing crashes with graph more than 2B edges #7850

Open
byingyang opened this issue Dec 31, 2024 · 3 comments

@byingyang

byingyang commented Dec 31, 2024

🐛 Bug

I created the edge files for an OnDiskDataset with all src and dst IDs cast to int32 (we do not have billions of nodes yet). The graph has more than 2^31 - 1 edges, and the preprocessing stage crashed with an int32 overflow error:

The on-disk dataset is re-preprocessing, so the existing preprocessed dataset has been removed.
Start to preprocess the on-disk dataset.

RuntimeError: [20:25:19] /opt/dgl/src/array/cpu/spmat_op_impl_coo.cc:749: Check failed: (coo.row->shape[0]) <= 0x7FFFFFFFL (2283022784 vs. 2147483647) : int32 overflow for argument coo.row->shape[0].
Stack trace:
  [bt] (0) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(+0x61fbc4) [0x7f34bc81fbc4]
  [bt] (1) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(dgl::aten::CSRMatrix dgl::aten::impl::COOToCSR<(DGLDeviceType)1, int>(dgl::aten::COOMatrix)+0x121) [0x7f34bc82ac81]
  [bt] (2) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(dgl::aten::COOToCSR(dgl::aten::COOMatrix)+0x451) [0x7f34bc5b43a1]
  [bt] (3) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::COOToCSC(std::shared_ptr<dgl::sparse::COO> const&)+0x17d) [0x7f3394a77f2d]
  [bt] (4) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::_CreateCSC()+0x14d) [0x7f3394a7c14d]
  [bt] (5) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::CSCPtr()+0x5d) [0x7f3394a7c24d]
  [bt] (6) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::CSCTensors()+0x13) [0x7f3394a7ce63]
  [bt] (7) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(std::_Function_handler<void (std::vector<c10::IValue, std::allocator<c10::IValue> >&), torch::class_<dgl::sparse::SparseMatrix>::defineMethod<torch::detail::WrapMethod<std::tuple<at::Tensor, at::Tensor, std::optional<at::Tensor> > (dgl::sparse::SparseMatrix::*)()> >(std::string, torch::detail::WrapMethod<std::tuple<at::Tensor, at::Tensor, std::optional<at::Tensor> > (dgl::sparse::SparseMatrix::*)()>, std::string, std::initializer_list<torch::arg>)::{lambda(std::vector<c10::IValue, std::allocator<c10::IValue> >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocator<c10::IValue> >&)+0x82) [0x7f3394a65802]
  [bt] (8) /databricks/python/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0xa80f7e) [0x7f357f678f7e]

----> 2 dataset = gb.OnDiskDataset(base_dir, force_preprocess=True).load()
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:688, in OnDiskDataset.__init__(self, path, include_original_edge_id, force_preprocess, auto_cast_to_optimal_dtype)
    678 def __init__(
    679     self,
    680     path: str,
   (...)
    685     # Always call the preprocess function first. If already preprocessed,
    686     # the function will return the original path directly.
    687     self._dataset_dir = path
--> 688     yaml_path = preprocess_ondisk_dataset(
    689         path,
    690         include_original_edge_id,
    691         force_preprocess,
    692         auto_cast_to_optimal_dtype,
    693     )
    694     with open(yaml_path) as f:
    695         self._yaml_data = yaml.load(f, Loader=yaml.loader.SafeLoader)
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:407, in preprocess_ondisk_dataset(dataset_dir, include_original_edge_id, force_preprocess, auto_cast_to_optimal_dtype)
    404 if "graph" not in input_config:
    405     raise RuntimeError("Invalid config: does not contain graph field.")
--> 407 sampling_graph = _graph_data_to_fused_csc_sampling_graph(
    408     dataset_dir,
    409     input_config["graph"],
    410     include_original_edge_id,
    411     auto_cast_to_optimal_dtype,
    412 )
    414 # 3. Record value of include_original_edge_id.
    415 output_config["include_original_edge_id"] = include_original_edge_id
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:166, in _graph_data_to_fused_csc_sampling_graph(dataset_dir, graph_data, include_original_edge_id, auto_cast_to_optimal_dtype)
    161 sparse_matrix = spmatrix(
    162     indices=torch.stack((coo_src, coo_dst), dim=0),
    163     shape=(total_num_nodes, total_num_nodes),
    164 )
    165 del coo_src, coo_dst
--> 166 indptr, indices, edge_ids = sparse_matrix.csc()
    167 del sparse_matrix
    169 if auto_cast_to_optimal_dtype:
File /databricks/python/lib/python3.11/site-packages/dgl/sparse/sparse_matrix.py:201, in SparseMatrix.csc(self)
    172 def csc(self) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    173     r"""Returns the compressed sparse column (CSC) representation of the
    174     sparse matrix.
    175 
   (...)
    199     (tensor([0, 0, 0, 1, 2, 3]), tensor([1, 1, 2]), tensor([0, 2, 1]))
    200     """
--> 201     return self.c_sparse_matrix.csc()
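
The check that fails in the trace compares the number of COO entries (`coo.row->shape[0]`) against `0x7FFFFFFF`. A minimal Python sketch of the same comparison, using the edge count from the error message; the helper name is hypothetical and not part of DGL:

```python
import numpy as np

# Largest value a signed 32-bit integer can hold: 2147483647 (0x7FFFFFFF).
INT32_MAX = int(np.iinfo(np.int32).max)


def edge_count_fits_int32(num_edges: int) -> bool:
    """Hypothetical helper mirroring the `<= 0x7FFFFFFF` check in the trace."""
    return num_edges <= INT32_MAX


num_edges = 2283022784  # edge count reported in the error message
print(edge_count_fits_int32(num_edges))  # False
print(num_edges - INT32_MAX)             # 135539137 edges over the limit
```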

To Reproduce

Steps to reproduce the behavior:

  1. Create an OnDiskDataset whose edges are stored in .npy files with all IDs cast to int32 and whose edge count exceeds the int32 maximum (2147483647).
  2. Load the dataset and run preprocessing.

Expected behavior

Preprocessing should work with int32 node IDs even when the number of edges exceeds the int32 range. To get around this issue I have to leave the IDs uncast as int64, which doubles my CPU memory usage, so we see no memory savings from switching to GraphBolt.
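
The doubling can be estimated with a rough back-of-the-envelope calculation. This sketch counts only the raw src and dst arrays for the edge count in the error message; it assumes no temporary copies, so the real peak during preprocessing is likely higher. The function name is made up for illustration:

```python
def coo_bytes(num_edges: int, bytes_per_id: int) -> int:
    """Bytes needed for the src and dst arrays of a COO edge list."""
    # Two arrays (src and dst), one ID per edge in each.
    return 2 * num_edges * bytes_per_id


num_edges = 2283022784  # edge count reported in the error message
int32_gib = coo_bytes(num_edges, 4) / 2**30
int64_gib = coo_bytes(num_edges, 8) / 2**30
print(f"int32: {int32_gib:.1f} GiB, int64: {int64_gib:.1f} GiB")
```

For this graph that is roughly 17 GiB with int32 IDs versus roughly 34 GiB with int64 IDs.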

Environment

  • DGL Version (e.g., 1.0):
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
  • OS (e.g., Linux):
  • How you installed DGL (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context

@mfbalin
Collaborator

mfbalin commented Jan 25, 2025

This is not expected. We are successfully using int32 for the ogbn-papers100M dataset, which has over 3B edges. @Rhett-Ying what do you think is the core issue here?

@mfbalin
Collaborator

mfbalin commented Jan 25, 2025

Since there is a preprocessing step, cast your data to int64, then let the preprocessing run. After preprocessing, when you load the gb.CSCSamplingGraph, the dtype of the edges should be back to int32.
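
One way to apply this workaround without holding both copies in RAM is to rewrite each edge file in chunks through memory-mapped arrays. This is only a sketch: the helper name, the paths, and the chunk size are assumptions, and it uses plain NumPy rather than any GraphBolt API. It does not reduce the memory that preprocessing itself needs for int64 data.

```python
import numpy as np


def cast_npy_to_int64(src_path, dst_path, chunk=1 << 24):
    """Hypothetical helper: rewrite an integer .npy file as int64 in chunks."""
    # Memory-map the input so it is not read into RAM all at once.
    src = np.load(src_path, mmap_mode="r")
    # Create the output .npy file on disk with the same shape but int64 dtype.
    dst = np.lib.format.open_memmap(
        dst_path, mode="w+", dtype=np.int64, shape=src.shape
    )
    # Copy along the last axis so both (N,) and (2, N) layouts are chunked.
    num_items = src.shape[-1]
    for start in range(0, num_items, chunk):
        stop = min(start + chunk, num_items)
        dst[..., start:stop] = src[..., start:stop]
    dst.flush()
    del dst  # close the memory map
```

A call might look like `cast_npy_to_int64("edges_int32.npy", "edges_int64.npy")`, where both file names are placeholders; the dataset's metadata would then need to point at the new file.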

@mfbalin
Collaborator

mfbalin commented Jan 25, 2025

The preprocessing steps use DGL underneath, which does not support mixed dtypes for the indptr and indices tensors.
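
To illustrate why the two tensors could in principle use different dtypes: in CSC, indptr holds cumulative edge counts, so its values go up to the number of edges, while indices holds node IDs, so its values stay below the number of nodes. The NumPy sketch below is a toy conversion on a four-edge graph and is not DGL's implementation; the function name is made up for illustration.

```python
import numpy as np


def coo_to_csc(src, dst, num_nodes):
    """Toy COO -> CSC conversion; not DGL's implementation."""
    # Group edges by destination (column); stable sort keeps input order per column.
    order = np.argsort(dst, kind="stable")
    # Row (source) IDs per column: values are bounded by num_nodes.
    indices = src[order]
    # Column offsets: cumulative edge counts, so values are bounded by num_edges.
    counts = np.bincount(dst, minlength=num_nodes)
    indptr = np.zeros(num_nodes + 1, dtype=np.int64)
    indptr[1:] = np.cumsum(counts)
    return indptr, indices, order


src = np.array([0, 1, 2, 0], dtype=np.int32)
dst = np.array([1, 2, 0, 2], dtype=np.int32)
indptr, indices, edge_ids = coo_to_csc(src, dst, num_nodes=3)
print(indptr, indptr.dtype)    # [0 1 2 4] int64
print(indices, indices.dtype)  # [2 0 1 0] int32
```

Here the last entry of indptr equals the edge count, which is the value that no longer fits in int32 once a graph has more than 2147483647 edges, even when every node ID does.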
