1. Description
`WeightedPercentileFun` in `src/objective/regression_objective.hpp` computes the weighted percentile used for the initial score (`BoostFromScore`) and for leaf value refit (`RenewTreeOutput`) when the regression_l1, quantile, or mape objectives are used with sample weights. The current implementation interpolates over the wrong CDF segment, so it can return values outside [min(y), max(y)] and does not match the weighted quantile definition.
PR #5848 fixed the unweighted `PercentileFun` (issue #5847) by correcting the position and segment used for interpolation. `WeightedPercentileFun` was not updated and still has the same class of bug: it interpolates over the segment [pos, pos+1] using `(threshold - weighted_cdf[pos]) / (weighted_cdf[pos+1] - weighted_cdf[pos])`, whereas the threshold lies between `weighted_cdf[pos-1]` and `weighted_cdf[pos]`. The correct segment for interpolation is [pos-1, pos].
Current logic (simplified):
- `pos = upper_bound(weighted_cdf, threshold)` → first index such that `weighted_cdf[pos] > threshold`.
- So `weighted_cdf[pos-1] <= threshold < weighted_cdf[pos]`, i.e. the threshold is in the segment [cum[pos-1], cum[pos]).
- The code then uses `(threshold - weighted_cdf[pos]) / (weighted_cdf[pos+1] - weighted_cdf[pos]) * (v2 - v1) + v1` with `v1 = value[pos-1]`, `v2 = value[pos]`. Because threshold < weighted_cdf[pos], the numerator is negative, so the result can be less than v1, i.e. outside [min(y), max(y)].
Correct behavior: interpolation should use the segment where the threshold actually lies, between `weighted_cdf[pos-1]` and `weighted_cdf[pos]`, i.e. `value[pos-1] + (threshold - weighted_cdf[pos-1]) / (weighted_cdf[pos] - weighted_cdf[pos-1]) * (value[pos] - value[pos-1])`. This is always in [value[pos-1], value[pos]] and therefore in [min(y), max(y)].
2. Example & Analysis
Example: labels `y = [2, 3, 4, 5]`, weights `w = [4, 3, 2, 1]`, total weight = 10. We compute the weighted median (alpha = 0.5), so threshold = 5.0.
Current C++ behavior:
- The cumulative weights are `weighted_cdf = [4, 7, 9, 10]`, so `upper_bound` gives pos = 1.
- The code uses `v1 = value[pos-1] = 2`, `v2 = value[pos] = 3`.
- `(threshold - weighted_cdf[pos]) / (weighted_cdf[pos+1] - weighted_cdf[pos]) * (v2 - v1) + v1` = `(5 - 7) / (9 - 7) * (3 - 2) + 2` = `(-2/2)*1 + 2` = 1.0.
So the returned "weighted median" is 1.0, which is below min(y) = 2. This is incorrect.
Correct formula:
- Use the segment between cum[0] and cum[1].
- `value[0] + (threshold - cum[0]) / (cum[1] - cum[0]) * (value[1] - value[0])` = `2 + (5 - 4) / (7 - 4) * (3 - 2)` = `2 + 1/3` ≈ 2.333, which lies in [2, 5].
Minimal reproduction (Python, to observe init score):

```python
import numpy as np
import lightgbm as lgb

X = np.zeros((4, 1))  # no meaningful features
y = np.array([2., 3., 4., 5.])
w = np.array([4., 3., 2., 1.])
train = lgb.Dataset(X, label=y, weight=w)
params = {
    "objective": "regression_l1",
    # LightGBM requires num_leaves > 1; min_data_in_leaf=10 still prevents any split on 4 rows.
    "num_leaves": 2,
    "min_data_in_leaf": 10,
    "verbosity": 1,
}
model = lgb.train(params, train)
# Log will show e.g. "Start training from score 1.000000" instead of a value in [2, 5].
```

3. Impact
Initial score (BoostFromScore)
With weighted L1/quantile/MAPE, the initial prediction can be outside [min(y), max(y)], which is not a valid weighted quantile and can worsen the first iteration and convergence.
Leaf value refit (RenewTreeOutput)
Every leaf of every tree uses `WeightedPercentileFun` to set the leaf output. The same wrong segment is used, so leaf predictions can fall outside the range of (residual) values in that leaf, distorting the tree outputs and final predictions.
Consistency
The unweighted path was fixed in PR #5848 (fix percentile computation for regression objectives); the weighted path remains incorrect and inconsistent with it.
Affected objectives
- regression_l1 and quantile, when sample weights are provided
- mape (always uses the weighted percentile via `label_weight_`)
4. Proposed Fix
Use the correct segment
After `pos = upper_bound(weighted_cdf, threshold) - weighted_cdf.begin()` and clamping `pos` as needed:
- Treat the segment containing the threshold as the one between `weighted_cdf[pos-1]` and `weighted_cdf[pos]` (with pos ≥ 1 when threshold ≥ weighted_cdf[0]).
- Use `lower_bound` semantics if preferred: take the first index i such that `weighted_cdf[i] >= threshold`, then the segment is (cum[i-1], cum[i]]; interpolation on that segment keeps the result in [min(y), max(y)].
Interpolation formula
- If threshold is strictly between `weighted_cdf[pos-1]` and `weighted_cdf[pos]`: `value[pos-1] + (threshold - weighted_cdf[pos-1]) / (weighted_cdf[pos] - weighted_cdf[pos-1]) * (value[pos] - value[pos-1])`.
- If threshold == weighted_cdf[pos-1] (on a knot): optionally return the midpoint `(value[pos-1] + value[pos]) / 2` for stability, or the left endpoint, consistent with the unweighted behavior.
Boundary cases
- When pos == 0 or threshold <= weighted_cdf[0]: return value[0].
- When pos >= cnt_data - 1 or threshold >= weighted_cdf[cnt_data-1]: return value[cnt_data-1].
So the result is always within [min(y), max(y)].
Alignment with PercentileFun
Apply the same segment and interpolation logic as in the fixed `PercentileFun` (PR #5848, fix percentile computation for regression objectives), extended to the weighted CDF and `weighted_cdf` indices.
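The difference between the two segments can be checked without building LightGBM. The sketch below is a pure-Python model of the simplified logic described in this issue, not the actual C++ macro. The function names `weighted_percentile_current` and `weighted_percentile_fixed` are hypothetical, the boundary handling in the "current" version is an assumption, and strictly positive weights are assumed.

```python
from bisect import bisect_right
from itertools import accumulate


def _sorted_values_and_cdf(values, weights):
    """Sort values ascending and build the cumulative weight array."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    vals = [values[i] for i in order]
    cdf = list(accumulate(weights[i] for i in order))
    return vals, cdf


def weighted_percentile_current(values, weights, alpha):
    """Model of the interpolation described as the current behavior (segment [pos, pos+1])."""
    vals, cdf = _sorted_values_and_cdf(values, weights)
    n = len(vals)
    threshold = cdf[-1] * alpha
    # bisect_right mirrors upper_bound: first index with cdf[pos] > threshold.
    pos = min(bisect_right(cdf, threshold), n - 1)
    if pos == 0 or pos == n - 1:  # assumed boundary handling
        return vals[pos]
    v1, v2 = vals[pos - 1], vals[pos]
    return (threshold - cdf[pos]) / (cdf[pos + 1] - cdf[pos]) * (v2 - v1) + v1


def weighted_percentile_fixed(values, weights, alpha):
    """Interpolation on the segment between cdf[pos-1] and cdf[pos], as proposed above."""
    vals, cdf = _sorted_values_and_cdf(values, weights)
    n = len(vals)
    threshold = cdf[-1] * alpha
    pos = bisect_right(cdf, threshold)
    if pos == 0:
        return vals[0]
    if pos >= n - 1:
        return vals[-1]
    lo, hi = cdf[pos - 1], cdf[pos]  # lo <= threshold < hi, so hi - lo > 0
    return vals[pos - 1] + (threshold - lo) / (hi - lo) * (vals[pos] - vals[pos - 1])


y = [2.0, 3.0, 4.0, 5.0]
w = [4, 3, 2, 1]
print(weighted_percentile_current(y, w, 0.5))  # 1.0, below min(y)
print(weighted_percentile_fixed(y, w, 0.5))    # about 2.333, inside [2, 5]
```

On the example from section 2 the first function reproduces the out-of-range value 1.0 and the second gives 2 + 1/3. The on-knot midpoint option is left out of the sketch, so a threshold that falls exactly on a knot returns the left endpoint.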
5. Environment (optional)
6. References
`src/objective/regression_objective.hpp`: macro `WeightedPercentileFun`, and its use in `RegressionL1loss::BoostFromScore`, `RegressionL1loss::RenewTreeOutput`, `RegressionQuantileloss::BoostFromScore`, `RegressionQuantileloss::RenewTreeOutput`, and `RegressionMAPELOSS::BoostFromScore` / `RenewTreeOutput`.