Negative scalability nearly removed #14
Comments
ping @hcho3 for code checking: there is no test at the end, so I cannot really check whether it is computing correctly or not.
This is great work @Laurae2! Do you have the individual changes in a repo so we can check the diff? It will be easier, at least for me, to figure out how your changes work.
@thvasilo Check this: https://github.com/Laurae2/xgboost-fast-hist-perf-lab With the float transformation of @SmirnovEgorRu in #15, I am getting 0.75s for 2 threads, 0.3s for 9 threads, and 0.8s for 36 threads.
With the double-to-float trick for gradient / hessian, it is now up to 100x the original speed; negative scaling is still there past 10 threads, but much lower than before. The chart depicts strange behavior past 30 threads. Truncated table:
Full table:
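The double-to-float idea mentioned above can be sketched as follows. `GradPairF` and `sumGrad` are illustrative names, not the actual xgboost types; the point is that an 8-byte pair halves the memory traffic of the histogram build, while accumulation can stay in double to limit rounding error:

```cpp
#include <vector>

// Illustrative stand-in for xgboost's gradient pair, stored in float:
// 8 bytes per row instead of 16.
struct GradPairF {
  float grad;
  float hess;
};

// Accumulate in double so the sum over many rows does not lose precision,
// even though the stored per-row values are float.
double sumGrad(const std::vector<GradPairF>& gpair) {
  double acc = 0.0;
  for (const auto& g : gpair) acc += static_cast<double>(g.grad);
  return acc;
}
```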
@Laurae2, as I understand it, this code can't be executed in a parallel for, because each iteration here builds one node, and the iterations have dependencies: for example, you can't start building the 2nd-level nodes before the root is built.

```cpp
#pragma omp parallel for num_threads(nthread) schedule(dynamic)
for (int i = 0; i < num_instance_set; ++i) {
  hist_builder.BuildHist(gpair, instance_set[i], gmat, &histogram[i]);
}
```

Building the tree can (and should) be parallelized over nodes. However, an efficient implementation of that in real code will be much harder.
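A minimal sketch of the node-wise parallelism described above, under assumed toy types (`Node`, `Hist`, `buildHist` are made up; the real hist_builder machinery is far more involved): levels stay sequential because children depend on their parents, while nodes inside one level are independent and can be built in parallel.

```cpp
#include <cstddef>
#include <vector>

struct Node {
  int parent;             // -1 for the root
  std::vector<int> rows;  // rows routed to this node
};

// Dummy "histogram": just the count of rows seen, to keep the sketch runnable.
using Hist = std::size_t;

Hist buildHist(const Node& node) { return node.rows.size(); }

// Levels are processed sequentially; the inner loop over the nodes of one
// level has no dependencies, so it can be an OpenMP parallel for.
std::vector<Hist> buildByLevel(const std::vector<std::vector<Node>>& levels) {
  std::vector<Hist> out;
  for (const auto& level : levels) {            // sequential across levels
    std::vector<Hist> hists(level.size());
    #pragma omp parallel for schedule(dynamic)  // parallel within a level
    for (int i = 0; i < static_cast<int>(level.size()); ++i) {
      hists[i] = buildHist(level[i]);
    }
    out.insert(out.end(), hists.begin(), hists.end());
  }
  return out;
}
```

Without OpenMP the pragma is simply ignored and the loop runs sequentially, which keeps the sketch correct either way.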
@Laurae2 @SmirnovEgorRu I really appreciate the work you have done so far. I've lately been extremely busy with work at my organization; I will get back to it in a week. We should discuss how the improvement can be merged into the main XGBoost codebase.
@SmirnovEgorRu If I understand correctly what you said:
I was suspecting this, but Intel Advisor contradicted me by saying there was no dependency in the loop.
@Laurae2 perhaps it would help to look at the original codebase: this update is done from updater_quantile_hist.cc::Update, which proceeds to call BuildHist, which in turn calls BuildBlockHist or BuildHist, which is what we are focusing on. As far as I can tell, the only parallel code is in BuildHist, where the histogram building is parallel over the number of data points, with a parallel reduction after that. I'm more familiar with other builders (updater_histmaker in particular), but this one seems similar.

At the beginning of the iteration we assign each data point to a single tree leaf. When we create the histograms we aggregate them per leaf, i.e. each leaf will have one collection of histograms, one per feature. When we parallelize over data points, we first retrieve the leaf id of the data point (I think that's what row_indices does), then use that index to update the corresponding histogram set. Not sure how to interpret the
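The row-parallel build plus reduction described above can be sketched like this. All names are illustrative, and `std::thread` stands in for xgboost's OpenMP code; the point is why per-thread histogram copies are needed: two rows landing in the same bin would otherwise race on the shared accumulator.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each thread accumulates gradients into its own histogram copy over a
// strided partition of the rows; the copies are then summed (the reduction).
std::vector<double> buildHistParallel(const std::vector<int>& bin_of_row,
                                      const std::vector<double>& grad,
                                      std::size_t n_bins, unsigned n_threads) {
  std::vector<std::vector<double>> partial(n_threads,
                                           std::vector<double>(n_bins, 0.0));
  std::vector<std::thread> workers;
  const std::size_t n = bin_of_row.size();
  for (unsigned t = 0; t < n_threads; ++t) {
    workers.emplace_back([&, t] {
      for (std::size_t i = t; i < n; i += n_threads)  // strided row partition
        partial[t][bin_of_row[i]] += grad[i];         // private copy: no race
    });
  }
  for (auto& w : workers) w.join();
  // Reduction: sum the per-thread copies into the final histogram.
  std::vector<double> hist(n_bins, 0.0);
  for (const auto& p : partial)
    for (std::size_t b = 0; b < n_bins; ++b) hist[b] += p[b];
  return hist;
}
```

The per-thread copies are also one source of the negative scaling discussed in this thread: the reduction cost and memory footprint grow with the thread count.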
Using the following tricks (numbered as in the table headers below):

1. Outer Loop Parallel
2. Pre-store Thread nb.
3. Loop Fill
4. Private data_ Pre-alloc

The performance becomes the following.
Speed table, from left to right accumulation (column headers):

| Raw | 1.1. Depend Check | 1.2. Bad Loop Removal | Outer Loop Parallel | +1 Pre-store Thread nb. | +1+2 Loop Fill | +1+2+3 Private data_ Pre-alloc | +1+2+3 |
|---|---|---|---|---|---|---|---|
Average speed:
Average speed, low thread count:
Parallel efficiency:
Inverted parallel efficiency:
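The charts above use parallel efficiency; as a reminder of the usual definitions (not stated in the thread, and the "inverted" chart presumably plots the reciprocal): with T1 the one-thread time and Tp the p-thread time, speedup is S(p) = T1/Tp and efficiency is E(p) = S(p)/p, where 1.0 means perfect scaling.

```cpp
// Parallel efficiency E(p) = (T1 / Tp) / p; 1.0 is perfect scaling,
// values below 1.0 mean threads are partly wasted.
double parallelEfficiency(double t1, double tp, int p) {
  return (t1 / tp) / p;
}
```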
build_hist.cc, lots of changes:
main.cc, only 2 changes (2 pragmas added):
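The two pragmas themselves are not reproduced in this thread, so as an illustration only (a made-up `normalize` function, not the real main.cc diff), adding two OpenMP pragmas to otherwise unchanged loops has this general shape:

```cpp
#include <vector>

std::vector<double> normalize(std::vector<double> v) {
  double sum = 0.0;
  // Pragma 1: parallel sum; the reduction clause avoids a race on `sum`.
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < static_cast<int>(v.size()); ++i) sum += v[i];
  // Pragma 2: element-wise scaling; iterations are fully independent.
  #pragma omp parallel for
  for (int i = 0; i < static_cast<int>(v.size()); ++i) v[i] /= sum;
  return v;
}
```

Compiled without OpenMP, both pragmas are ignored and the function behaves identically, which is what makes pragma-only changes attractive as minimal diffs.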
Full table of performance (column headers):

| Raw | 1.1. Depend Check | 1.2. Bad Loop Removal | Outer Loop Parallel | +1 Pre-store Thread nb. | +1+2 Loop Fill | +1+2+3 Private data_ Pre-alloc | +1+2+3 |
|---|---|---|---|---|---|---|---|