WeightedPercentileFun uses wrong segment and can return values outside [min(y), max(y)] #7151

@hanataba0217

1. Description

WeightedPercentileFun in src/objective/regression_objective.hpp computes the weighted percentile used for the initial score (BoostFromScore) and the leaf value refit (RenewTreeOutput) when training with the regression_l1, quantile, or mape objectives and sample weights. The current implementation interpolates on the wrong CDF segment, so it can return values outside [min(y), max(y)], which no valid weighted quantile can do.

PR #5848 fixed the unweighted PercentileFun (issue #5847) by correcting the position and segment used for interpolation. WeightedPercentileFun was not updated and still has the same class of bug: it interpolates with (threshold - weighted_cdf[pos]) / (weighted_cdf[pos+1] - weighted_cdf[pos]), i.e. on the CDF segment [pos, pos+1], whereas upper_bound guarantees that the threshold lies in [weighted_cdf[pos-1], weighted_cdf[pos]). The correct segment for interpolation is [pos-1, pos].

Current logic (simplified):

  • pos = upper_bound(weighted_cdf, threshold) → first index such that weighted_cdf[pos] > threshold.
  • So weighted_cdf[pos-1] <= threshold < weighted_cdf[pos] (threshold is in the segment [cum[pos-1], cum[pos])).
  • The code then uses (threshold - weighted_cdf[pos]) / (weighted_cdf[pos+1] - weighted_cdf[pos]) * (v2 - v1) + v1 with v1 = value[pos-1], v2 = value[pos]. Because threshold < weighted_cdf[pos], the numerator is negative, so the result can be less than v1, i.e. outside [min(y), max(y)].

Correct behavior: Interpolation should use the segment where the threshold actually lies, [weighted_cdf[pos-1], weighted_cdf[pos]), i.e. value[pos-1] + (threshold - weighted_cdf[pos-1]) / (weighted_cdf[pos] - weighted_cdf[pos-1]) * (value[pos] - value[pos-1]). The interpolation weight is then in [0, 1), so the result is in [value[pos-1], value[pos]] and therefore always in [min(y), max(y)].


2. Example & Analysis

Example: labels y = [2, 3, 4, 5], weights w = [4, 3, 2, 1], total weight = 10. We compute the weighted median (alpha = 0.5), so threshold = 5.0.

  • Sorted by value: values = [2, 3, 4, 5], weights = [4, 3, 2, 1].
  • Cumulative weights: cum = [4, 7, 9, 10].
  • upper_bound(cum, 5) returns the first index where cum[i] > 5 → pos = 1 (since cum[1] = 7 > 5).
  • So threshold 5 lies in [cum[0], cum[1]) = [4, 7); the correct segment is indices 0 and 1, i.e. values 2 and 3.

Current C++ behavior:
The code uses:

  • v1 = value[pos-1] = 2, v2 = value[pos] = 3
  • Interpolation: (threshold - weighted_cdf[pos]) / (weighted_cdf[pos+1] - weighted_cdf[pos]) * (v2 - v1) + v1
    = (5 - 7) / (9 - 7) * (3 - 2) + 2 = (-2/2)*1 + 2 = 1.0.

So the returned “weighted median” is 1.0, which is below min(y) = 2. This is incorrect.
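
The arithmetic above can be reproduced outside LightGBM with a small Python emulation. This is only a sketch of the simplified logic described in this issue, not a transcription of the C++ macro: the function name is hypothetical, bisect_right stands in for std::upper_bound, and the endpoint handling is an assumption made so that cdf[pos + 1] always exists.

```python
from bisect import bisect_right
from itertools import accumulate


def weighted_percentile_current(values, weights, alpha):
    """Emulate the interpolation described above (values must be sorted)."""
    cdf = list(accumulate(weights))
    threshold = cdf[-1] * alpha
    # bisect_right has the same contract as std::upper_bound:
    # first index whose element is strictly greater than threshold.
    pos = min(bisect_right(cdf, threshold), len(values) - 1)
    # Assumption: endpoints are returned as-is, so cdf[pos + 1] is in range below.
    if pos == 0 or pos == len(values) - 1:
        return values[pos]
    v1, v2 = values[pos - 1], values[pos]
    # Wrong segment: the threshold is measured against cdf[pos] and cdf[pos + 1].
    return (threshold - cdf[pos]) / (cdf[pos + 1] - cdf[pos]) * (v2 - v1) + v1


result = weighted_percentile_current([2.0, 3.0, 4.0, 5.0], [4.0, 3.0, 2.0, 1.0], 0.5)
print(result)  # 1.0, which is below min(y) = 2
```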

Correct formula:
Use the segment [cum[0], cum[1]) = [4, 7), which contains the threshold:
value[0] + (threshold - cum[0]) / (cum[1] - cum[0]) * (value[1] - value[0])
= 2 + (5 - 4) / (7 - 4) * (3 - 2) = 2 + 1/3 ≈ 2.333, which lies in [2, 5].
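
The corrected arithmetic can be checked in a few lines of Python. This is a sketch of the formula above for this one example only; the variable names are illustrative and bisect_right is used as a stand-in for std::upper_bound.

```python
from bisect import bisect_right
from itertools import accumulate

values = [2.0, 3.0, 4.0, 5.0]    # labels, already sorted ascending
weights = [4.0, 3.0, 2.0, 1.0]
cum = list(accumulate(weights))  # [4.0, 7.0, 9.0, 10.0]
threshold = 0.5 * cum[-1]        # 5.0

pos = bisect_right(cum, threshold)  # 1, because cum[1] = 7 is the first value > 5
# Interpolation weight on the segment that contains the threshold.
t = (threshold - cum[pos - 1]) / (cum[pos] - cum[pos - 1])  # 1/3
corrected = values[pos - 1] + t * (values[pos] - values[pos - 1])
print(round(corrected, 3))  # 2.333
```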

Minimal reproduction (Python, to observe init score):

import numpy as np
import lightgbm as lgb

X = np.zeros((4, 1))  # no meaningful features
y = np.array([2., 3., 4., 5.])
w = np.array([4., 3., 2., 1.])

train = lgb.Dataset(X, label=y, weight=w)
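params = {
    "objective": "regression_l1",
    "num_leaves": 2,         # smallest allowed value; LightGBM requires num_leaves > 1
    "min_data_in_leaf": 10,  # more than the 4 rows available, so no split is made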
    "verbosity": 1,
}
model = lgb.train(params, train)
# Log will show e.g. "Start training from score 1.000000" instead of a value in [2, 5].

3. Impact

  1. Initial score (BoostFromScore)
    With weighted L1/quantile/MAPE, the initial prediction can be outside [min(y), max(y)], which is not a valid weighted quantile and can worsen the first iteration and convergence.

  2. Leaf value refit (RenewTreeOutput)
    Every leaf of every tree uses WeightedPercentileFun to set the leaf output. The same wrong segment is used, so leaf predictions can fall outside the range of (residual) values in that leaf, distorting the tree outputs and final predictions.

  3. Consistency
    The unweighted path was fixed in PR #5848 (fix percentile computation for regression objectives); the weighted path remains incorrect and inconsistent with it.

  4. Affected objectives

    • regression_l1 (with sample weights)
    • quantile (with sample weights)
    • mape (always uses weighted percentile via label_weight_)

4. Proposed Fix

  1. Use the correct segment
    After pos = upper_bound(weighted_cdf, threshold) - weighted_cdf.begin() and clamping pos as needed:

    • Treat the segment containing the threshold as [weighted_cdf[pos-1], weighted_cdf[pos]) (with pos ≥ 1 when threshold >= weighted_cdf[0]).
    • Use lower_bound semantics if preferred: first index such that weighted_cdf[i] >= threshold, then the segment is (cum[i-1], cum[i]]; interpolation on that segment keeps the result in [min(y), max(y)].
  2. Interpolation formula

    • If threshold is in (weighted_cdf[pos-1], weighted_cdf[pos]):
      value[pos-1] + (threshold - weighted_cdf[pos-1]) / (weighted_cdf[pos] - weighted_cdf[pos-1]) * (value[pos] - value[pos-1]).
    • If threshold == weighted_cdf[pos-1] (on a knot): optionally return midpoint (value[pos-1] + value[pos]) / 2 for stability, or the left endpoint, consistent with unweighted behavior.
  3. Boundary cases

    • When pos == 0 or threshold <= weighted_cdf[0]: return value[0].
    • When pos >= cnt_data - 1 or threshold >= weighted_cdf[cnt_data-1]: return value[cnt_data-1].
      So the result is always within [min(y), max(y)].
  4. Alignment with PercentileFun
    Apply the same segment and interpolation logic as in the fixed PercentileFun (PR #5848, fix percentile computation for regression objectives), extended to the weighted CDF and weighted_cdf indices.
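
The four steps above can be sketched as a Python reference function. This is a hypothetical illustration of the proposal, not LightGBM code: the function name is made up, bisect_right stands in for std::upper_bound, and a threshold that falls exactly on a knot returns the left endpoint (one of the two options listed in step 2). The loop at the end is a randomized check that the result stays within [min(y), max(y)].

```python
from bisect import bisect_right
from itertools import accumulate
import random


def weighted_percentile_fixed(values, weights, alpha):
    """Hypothetical reference for the proposed fix (non-empty input assumed)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    vals = [values[i] for i in order]
    cdf = list(accumulate(weights[i] for i in order))
    threshold = cdf[-1] * alpha
    pos = bisect_right(cdf, threshold)  # first index with cdf[pos] > threshold
    # Step 3: boundary cases return the extreme values directly.
    if pos == 0:
        return vals[0]
    if pos >= len(vals) - 1:
        return vals[-1]
    # Steps 1 and 2: interpolate on [cdf[pos - 1], cdf[pos]), so 0 <= t < 1.
    # On a knot (threshold == cdf[pos - 1]) t is 0 and the left endpoint is returned.
    t = (threshold - cdf[pos - 1]) / (cdf[pos] - cdf[pos - 1])
    return vals[pos - 1] + t * (vals[pos] - vals[pos - 1])


# Randomized check: the result never leaves the range of the labels.
random.seed(0)
for _ in range(1000):
    n = random.randint(1, 8)
    y = [random.uniform(-10.0, 10.0) for _ in range(n)]
    w = [random.uniform(0.1, 5.0) for _ in range(n)]
    q = weighted_percentile_fixed(y, w, random.random())
    assert min(y) <= q <= max(y)
```

The denominator cannot be zero here, because bisect_right guarantees cdf[pos - 1] <= threshold < cdf[pos].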


5. Environment (optional)

  • LightGBM version: e.g. 4.6.0
  • macOS, Python 3.12.12

6. References

  • Issue #5847 – Median wrongly computed (unweighted).
  • PR #5848 – Fix percentile computation for regression objectives (PercentileFun only).
  • src/objective/regression_objective.hpp – macro WeightedPercentileFun, and its use in RegressionL1loss::BoostFromScore, RegressionL1loss::RenewTreeOutput, RegressionQuantileloss::BoostFromScore, RegressionQuantileloss::RenewTreeOutput, and RegressionMAPELOSS::BoostFromScore / RenewTreeOutput.
