I created the edge files for an OnDiskDataset with all src and dst IDs cast to int32 (we do not have billions of nodes yet), and the preprocessing stage crashed with an int32 overflow error:
The on-disk dataset is re-preprocessing, so the existing preprocessed dataset has been removed.
Start to preprocess the on-disk dataset.
RuntimeError: [20:25:19] /opt/dgl/src/array/cpu/spmat_op_impl_coo.cc:749: Check failed: (coo.row->shape[0]) <= 0x7FFFFFFFL (2283022784 vs. 2147483647) : int32 overflow for argument coo.row->shape[0].
Stack trace:
[bt] (0) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(+0x61fbc4) [0x7f34bc81fbc4]
[bt] (1) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(dgl::aten::CSRMatrix dgl::aten::impl::COOToCSR<(DGLDeviceType)1, int>(dgl::aten::COOMatrix)+0x121) [0x7f34bc82ac81]
[bt] (2) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(dgl::aten::COOToCSR(dgl::aten::COOMatrix)+0x451) [0x7f34bc5b43a1]
[bt] (3) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::COOToCSC(std::shared_ptr<dgl::sparse::COO> const&)+0x17d) [0x7f3394a77f2d]
[bt] (4) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::_CreateCSC()+0x14d) [0x7f3394a7c14d]
[bt] (5) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::CSCPtr()+0x5d) [0x7f3394a7c24d]
[bt] (6) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::CSCTensors()+0x13) [0x7f3394a7ce63]
[bt] (7) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(std::_Function_handler<void (std::vector<c10::IValue, std::allocator<c10::IValue> >&), torch::class_<dgl::sparse::SparseMatrix>::defineMethod<torch::detail::WrapMethod<std::tuple<at::Tensor, at::Tensor, std::optional<at::Tensor> > (dgl::sparse::SparseMatrix::*)()> >(std::string, torch::detail::WrapMethod<std::tuple<at::Tensor, at::Tensor, std::optional<at::Tensor> > (dgl::sparse::SparseMatrix::*)()>, std::string, std::initializer_list<torch::arg>)::{lambda(std::vector<c10::IValue, std::allocator<c10::IValue> >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocator<c10::IValue> >&)+0x82) [0x7f3394a65802]
[bt] (8) /databricks/python/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0xa80f7e) [0x7f357f678f7e]
----> 2 dataset = gb.OnDiskDataset(base_dir, force_preprocess=True).load()
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:688, in OnDiskDataset.__init__(self, path, include_original_edge_id, force_preprocess, auto_cast_to_optimal_dtype)
678 def __init__(
679 self,
680 path: str,
(...)
685 # Always call the preprocess function first. If already preprocessed,
686 # the function will return the original path directly.
687 self._dataset_dir = path
--> 688 yaml_path = preprocess_ondisk_dataset(
689 path,
690 include_original_edge_id,
691 force_preprocess,
692 auto_cast_to_optimal_dtype,
693 )
694 with open(yaml_path) as f:
695 self._yaml_data = yaml.load(f, Loader=yaml.loader.SafeLoader)
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:407, in preprocess_ondisk_dataset(dataset_dir, include_original_edge_id, force_preprocess, auto_cast_to_optimal_dtype)
404 if "graph" not in input_config:
405 raise RuntimeError("Invalid config: does not contain graph field.")
--> 407 sampling_graph = _graph_data_to_fused_csc_sampling_graph(
408 dataset_dir,
409 input_config["graph"],
410 include_original_edge_id,
411 auto_cast_to_optimal_dtype,
412 )
414 # 3. Record value of include_original_edge_id.
415 output_config["include_original_edge_id"] = include_original_edge_id
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:166, in _graph_data_to_fused_csc_sampling_graph(dataset_dir, graph_data, include_original_edge_id, auto_cast_to_optimal_dtype)
161 sparse_matrix = spmatrix(
162 indices=torch.stack((coo_src, coo_dst), dim=0),
163 shape=(total_num_nodes, total_num_nodes),
164 )
165 del coo_src, coo_dst
--> 166 indptr, indices, edge_ids = sparse_matrix.csc()
167 del sparse_matrix
169 if auto_cast_to_optimal_dtype:
File /databricks/python/lib/python3.11/site-packages/dgl/sparse/sparse_matrix.py:201, in SparseMatrix.csc(self)
172 def csc(self) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
173 r"""Returns the compressed sparse column (CSC) representation of the
174 sparse matrix.
175
(...)
199 (tensor([0, 0, 0, 1, 2, 3]), tensor([1, 1, 2]), tensor([0, 2, 1]))
200 """
--> 201 return self.c_sparse_matrix.csc()
To Reproduce
Steps to reproduce the behavior:
Create an OnDiskDataset whose edges are stored in npy files with all IDs cast to int32, and whose number of edges exceeds the int32 maximum (2147483647).
Load the dataset and let it preprocess.
Expected behavior
To get around this issue, I have to double my CPU memory usage by not casting the IDs to int32. As a result, we see no memory savings from switching to GraphBolt.
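For reference, the numbers in the failed check and the memory cost of the edge list can be sketched as below. This is only back-of-the-envelope arithmetic, assuming `coo.row->shape[0]` is the edge count and that the edge list is stored as one src and one dst ID per edge; the helper names are made up for illustration.

```python
# Back-of-the-envelope arithmetic for the failed check and the edge-list size.
# Helper names are hypothetical; they are not part of DGL or GraphBolt.

INT32_MAX = 2**31 - 1  # 2147483647, the 0x7FFFFFFF bound in the error message


def fits_int32(count: int) -> bool:
    """Return True if `count` can be represented as a signed 32-bit integer."""
    return count <= INT32_MAX


def edge_index_bytes(num_edges: int, bytes_per_id: int) -> int:
    """Bytes needed for a COO edge list: one src and one dst ID per edge."""
    return 2 * num_edges * bytes_per_id


# Edge count reported in the error message above.
num_edges = 2_283_022_784

print("edge count fits in int32:", fits_int32(num_edges))
print("int32 edge list bytes:", edge_index_bytes(num_edges, 4))
print("int64 edge list bytes:", edge_index_bytes(num_edges, 8))
```

So the node IDs can fit in int32 while the number of edges does not, and that is the quantity the check rejects.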
Environment
DGL Version (e.g., 1.0):
Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
OS (e.g., Linux):
How you installed DGL (conda, pip, source):
Build command you used (if compiling from source):
Python version:
CUDA/cuDNN version (if applicable):
GPU models and configuration (e.g. V100):
Any other relevant information:
Additional context
This is not expected. We are successfully using int32 for the ogbn-papers100M dataset, which has over 3B edges. @Rhett-Ying what do you think is the core issue here?
Since there is a preprocessing step, cast your data to int64, then let the preprocessing run. After preprocessing, when you load the gb.CSCSamplingGraph, the dtype of the edges should be back to int32.
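A rough sketch of the cast step suggested above, in case it helps. It assumes the edges are stored in a `.npy` file with the edge dimension last, for example shape `(2, num_edges)`, and it copies in chunks through memory maps so the whole array never has to sit in RAM. The function name, the chunk size, and the paths are placeholders, not anything from GraphBolt.

```python
import numpy as np


def cast_npy_to_int64(src_path, dst_path, chunk=1 << 24):
    """Copy an integer .npy file to a new int64 .npy file, chunk by chunk.

    Hypothetical helper, not a GraphBolt API. Assumes the array has at
    least one dimension and that the last axis is the long (edge) axis.
    """
    # Memory-map the source so only the slices being copied are read.
    src = np.load(src_path, mmap_mode="r")
    # Create the destination .npy file with an int64 header and same shape.
    dst = np.lib.format.open_memmap(
        dst_path, mode="w+", dtype=np.int64, shape=src.shape
    )
    n = src.shape[-1]
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # Assignment upcasts the int32 values to int64.
        dst[..., start:stop] = src[..., start:stop]
    dst.flush()
    del dst
```

You would point it at each edge file (for example `cast_npy_to_int64("edges_int32.npy", "edges_int64.npy")`, both names made up), update the dataset's metadata to reference the new files, and then rerun preprocessing.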