Hash configuration information for the dedup performance test of DataJuicer 2.0 #546

Open
3 tasks done
cist opened this issue Jan 14, 2025 · 3 comments
Labels
question Further information is requested

Comments

cist commented Jan 14, 2025

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code from the main branch and run it again, and the problem still exists.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

[image]
Hi, do you have the MinHash configuration for this deduplication performance test? How many hashes were calculated, and how many bands were used?

Additional

No response

cist added the question label Jan 14, 2025
@chenyushuo
Collaborator

Hi @cist, thanks for your interest in Data-Juicer and our paper!

In our experiment, 256 hashes were calculated and 32 bands were used.
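For readers tuning their own runs, here is a minimal sketch of what those two numbers imply under standard MinHash LSH banding. It assumes the 256 hashes are split evenly across the 32 bands (8 rows per band), which the reply does not state explicitly. The function names are hypothetical and are not part of the Data-Juicer API.

```python
def rows_per_band(num_hashes: int, num_bands: int) -> int:
    """Rows per band, assuming the hashes are split evenly across bands."""
    if num_bands <= 0 or num_hashes % num_bands != 0:
        raise ValueError("num_hashes must be a positive multiple of num_bands")
    return num_hashes // num_bands


def candidate_probability(similarity: float, num_bands: int, rows: int) -> float:
    """Probability that two documents with the given Jaccard similarity
    agree on every row of at least one band: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - similarity ** rows) ** num_bands


def approximate_threshold(num_bands: int, rows: int) -> float:
    """Common rule-of-thumb similarity where the S-curve rises steeply: (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / rows)


if __name__ == "__main__":
    # Values from the reply above; the even split into 8 rows is an assumption.
    bands = 32
    rows = rows_per_band(256, bands)
    print("rows per band:", rows)
    print("approximate threshold:", round(approximate_threshold(bands, rows), 3))
    for s in (0.5, 0.6, 0.7, 0.8, 0.9):
        print(s, round(candidate_probability(s, bands, rows), 4))
```

Under that assumption, the steep part of the curve sits at a Jaccard similarity of roughly 0.65, so pairs well above that value are very likely to become candidates and pairs well below it rarely do.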

Author

cist commented Jan 16, 2025

Are detailed hardware specifications available for this test, such as memory, disk, and CPU? The same test runs much slower on our own cluster.

Collaborator

Cathy0908 commented Jan 16, 2025

Hi @cist, we used a PAI (https://help.aliyun.com/zh/pai/) cluster with the Ray framework supported on PAI DLC. A single node has 160 vCPUs and 1800 GiB of memory. The data was stored on CPFS with a bandwidth of 12 GB/s.
