WeightedPercentileFun uses wrong segment and can return values outside [min(y), max(y)] #7151

## 1. Description

**WeightedPercentileFun** in `src/objective/regression_objective.hpp` computes the weighted percentile for initial score (BoostFromScore) and leaf value refit (RenewTreeOutput) when using **regression_l1**, **quantile**, or **mape** objectives with sample weights. The current implementation uses the **wrong CDF segment** for linear interpolation, which can produce results **outside the range [min(y), max(y)]**, and is inconsistent with the correct weighted quantile definition.

PR [#5848](https://github.com/microsoft/LightGBM/pull/5848) fixed the **unweighted** PercentileFun (issue [#5847](https://github.com/microsoft/LightGBM/issues/5847)) by correcting the position and segment used for interpolation. **WeightedPercentileFun was not updated** and still has the same class of bug: it interpolates with the factor `(threshold - weighted_cdf[pos]) / (weighted_cdf[pos+1] - weighted_cdf[pos])`, which belongs to the segment **[pos, pos+1]**, whereas the threshold lies in **[weighted_cdf[pos-1], weighted_cdf[pos])**. The correct segment for interpolation is **[pos-1, pos]**.

**Current logic (simplified):**
- `pos = upper_bound(weighted_cdf, threshold)` → first index such that `weighted_cdf[pos] > threshold`.
- So `weighted_cdf[pos-1] <= threshold < weighted_cdf[pos]` for `pos >= 1`, i.e. the threshold lies in the segment **[cum[pos-1], cum[pos])**.
- The code then uses `(threshold - weighted_cdf[pos]) / (weighted_cdf[pos+1] - weighted_cdf[pos]) * (v2 - v1) + v1` with `v1 = value[pos-1]`, `v2 = value[pos]`. Because **threshold < weighted_cdf[pos]**, the numerator is **negative**, so the result falls **below v1** whenever `v2 > v1` and can therefore land **outside [min(y), max(y)]**, for example when `v1` is already the smallest value.

**Correct behavior:** Interpolation should use the segment where the threshold actually lies: **[weighted_cdf[pos-1], weighted_cdf[pos])**, i.e. `value[pos-1] + (threshold - weighted_cdf[pos-1]) / (weighted_cdf[pos] - weighted_cdf[pos-1]) * (value[pos] - value[pos-1])`, which always lies between `value[pos-1]` and `value[pos]` and therefore in **[min(y), max(y)]**.
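
The two claims above (the proposed formula stays in range, the current one undershoots) each follow in one step. A short derivation, writing $C_i$ for `weighted_cdf[i]`, $v_i$ for the sorted `value[i]`, $\tau$ for `threshold` and $p$ for `pos`; these symbols are shorthand introduced here, and strictly positive weights are assumed so that every denominator is positive:

```math
\begin{aligned}
\text{upper\_bound:}\quad & C_{p-1} \le \tau < C_p \\
\text{proposed:}\quad & t = \frac{\tau - C_{p-1}}{C_p - C_{p-1}} \in [0, 1)
  \;\Rightarrow\; v_{p-1} \le v_{p-1} + t\,(v_p - v_{p-1}) \le v_p \\
\text{current:}\quad & s = \frac{\tau - C_p}{C_{p+1} - C_p} < 0
  \;\Rightarrow\; v_{p-1} + s\,(v_p - v_{p-1}) \le v_{p-1}
\end{aligned}
```

The last inequality is strict whenever $v_p > v_{p-1}$, and nothing bounds $s$ from below, so the current result is not tied to any neighbouring value.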

---

## 2. Example & Analysis

**Example:** labels `y = [2, 3, 4, 5]`, weights `w = [4, 3, 2, 1]`, total weight = 10. We compute the weighted median (alpha = 0.5), so threshold = 5.0.

- Sorted by value: values = [2, 3, 4, 5], weights = [4, 3, 2, 1].
- Cumulative weights: cum = [4, 7, 9, 10].
- **upper_bound(cum, 5)** returns the first index where cum[i] > 5 → **pos = 1** (since cum[1] = 7 > 5).
- So threshold 5 lies in **[cum[0], cum[1]) = [4, 7)**; the correct segment is indices 0 and 1, i.e. values 2 and 3.
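
The steps above can be checked in a few lines of plain Python. This is only a sketch of the bookkeeping, not LightGBM code: `bisect_right` stands in for `std::upper_bound`, and the variable names are chosen here for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

values = [2.0, 3.0, 4.0, 5.0]   # labels, already sorted ascending
weights = [4.0, 3.0, 2.0, 1.0]  # weights in the same order
alpha = 0.5                     # weighted median

cum = list(accumulate(weights))     # cumulative weights
threshold = alpha * cum[-1]         # alpha times the total weight
pos = bisect_right(cum, threshold)  # first index with cum[pos] > threshold

print(cum, threshold, pos)
print(cum[pos - 1] <= threshold < cum[pos])  # threshold sits in segment [pos-1, pos]
```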

**Current C++ behavior:**  
The code uses:
- `v1 = value[pos-1] = 2`, `v2 = value[pos] = 3`
- Interpolation: `(threshold - weighted_cdf[pos]) / (weighted_cdf[pos+1] - weighted_cdf[pos]) * (v2 - v1) + v1`  
  = `(5 - 7) / (9 - 7) * (3 - 2) + 2` = `(-2/2)*1 + 2` = **1.0**.

So the returned “weighted median” is **1.0**, which is **below min(y) = 2**. This is incorrect.

**Correct formula:**  
Use segment [cum[0], cum[1]):  
`value[0] + (threshold - cum[0]) / (cum[1] - cum[0]) * (value[1] - value[0])`  
= `2 + (5 - 4) / (7 - 4) * (3 - 2)` = `2 + 1/3` ≈ **2.333**, which lies in [2, 5].
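
Both calculations can be reproduced side by side. The snippet below is a plain-Python restatement of the two formulas quoted in this issue, with the example's numbers hard-coded; it does not call LightGBM.

```python
values = [2.0, 3.0, 4.0, 5.0]
cum = [4.0, 7.0, 9.0, 10.0]
threshold, pos = 5.0, 1
v1, v2 = values[pos - 1], values[pos]

# Formula described in this issue as the current behavior:
# the factor is taken from segment [pos, pos+1].
current = (threshold - cum[pos]) / (cum[pos + 1] - cum[pos]) * (v2 - v1) + v1

# Proposed formula: the factor is taken from segment [pos-1, pos],
# which is where the threshold lies.
proposed = v1 + (threshold - cum[pos - 1]) / (cum[pos] - cum[pos - 1]) * (v2 - v1)

print(current)   # below min(values)
print(proposed)  # between values[0] and values[1]
```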

**Minimal reproduction (Python, to observe init score):**

```python
import numpy as np
import lightgbm as lgb

X = np.zeros((4, 1))  # no meaningful features
y = np.array([2., 3., 4., 5.])
w = np.array([4., 3., 2., 1.])

train = lgb.Dataset(X, label=y, weight=w)
params = {
    "objective": "regression_l1",
    "num_leaves": 2,  # LightGBM requires num_leaves > 1; min_data_in_leaf below still prevents any split
    "min_data_in_leaf": 10,
    "verbosity": 1,
}
model = lgb.train(params, train)
# Log will show e.g. "Start training from score 1.000000" instead of a value in [2, 5].
```

---

## 3. Impact

1. **Initial score (BoostFromScore)**  
   With weighted L1/quantile/MAPE, the initial prediction can be outside [min(y), max(y)], which is not a valid weighted quantile and can worsen the first iteration and convergence.

2. **Leaf value refit (RenewTreeOutput)**  
   When sample weights are present (always, for **mape**), the output of every leaf in every tree is refit through WeightedPercentileFun. The same wrong segment is used, so leaf values can fall outside the range of the residuals in that leaf, distorting tree outputs and final predictions.

3. **Consistency**  
   The unweighted path was fixed in PR #5848; the weighted path remains incorrect and inconsistent.

4. **Affected objectives**  
   - **regression_l1** (with sample weights)  
   - **quantile** (with sample weights)  
   - **mape** (always uses weighted percentile via label_weight_)
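
To get a rough feel for how often items 1 and 2 could matter, the formula can be run over random inputs. The function below is a hypothetical Python model of the current logic as described in this issue, not the C++ macro itself. The early return at the two ends is an assumption added so that `cum[pos + 1]` is always a valid index, and the value and weight ranges are arbitrary.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def described_current_formula(values, weights, alpha):
    """Model of the current logic as described in this issue (not the C++ source)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    v = [values[i] for i in order]
    cum = list(accumulate(weights[i] for i in order))
    threshold = alpha * cum[-1]
    pos = min(bisect_right(cum, threshold), len(v) - 1)
    if pos == 0 or pos == len(v) - 1:
        # Assumed guard: no neighbour on one side, so return the value as is.
        return v[pos]
    # Factor taken from segment [pos, pos+1]; always negative here.
    t = (threshold - cum[pos]) / (cum[pos + 1] - cum[pos])
    return t * (v[pos] - v[pos - 1]) + v[pos - 1]


rng = random.Random(0)  # fixed seed so the scan is repeatable
trials, escaped = 1000, 0
for _ in range(trials):
    n = rng.randint(3, 20)
    y = [rng.uniform(-10.0, 10.0) for _ in range(n)]
    w = [rng.uniform(0.1, 5.0) for _ in range(n)]
    alpha = rng.choice([0.1, 0.25, 0.5, 0.75, 0.9])
    result = described_current_formula(y, w, alpha)
    if not (min(y) <= result <= max(y)):
        escaped += 1

print(f"{escaped} of {trials} random cases fall outside [min(y), max(y)]")
```

The count depends on the chosen distributions, so it is only indicative; the point is that out-of-range results are not limited to hand-picked inputs.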

---

## 4. Proposed Fix

1. **Use the correct segment**  
   After `pos = upper_bound(weighted_cdf, threshold) - weighted_cdf.begin()` and clamping `pos` as needed:
   - Treat the segment containing the threshold as **[weighted_cdf[pos-1], weighted_cdf[pos])** (pos ≥ 1 whenever threshold >= weighted_cdf[0]).
   - Alternatively, use **lower_bound** semantics: take the first index `i` such that `weighted_cdf[i] >= threshold`, so the segment is (cum[i-1], cum[i]]. Interpolating on that segment also keeps the result in [min(y), max(y)].

2. **Interpolation formula**  
   - If threshold is in **(weighted_cdf[pos-1], weighted_cdf[pos])**:  
     `value[pos-1] + (threshold - weighted_cdf[pos-1]) / (weighted_cdf[pos] - weighted_cdf[pos-1]) * (value[pos] - value[pos-1])`.
   - If threshold == weighted_cdf[pos-1] (on a knot): optionally return midpoint `(value[pos-1] + value[pos]) / 2` for stability, or the left endpoint, consistent with unweighted behavior.

3. **Boundary cases**  
   - When pos == 0 or threshold <= weighted_cdf[0]: return value[0].  
   - When pos >= cnt_data - 1 or threshold >= weighted_cdf[cnt_data-1]: return value[cnt_data-1].  
   So the result is always within **[min(y), max(y)]**.

4. **Alignment with PercentileFun**  
   Apply the same segment and interpolation logic as in the fixed PercentileFun (PR #5848), extended to the weighted CDF and weighted_cdf indices.
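
Steps 1 to 3 can be sketched as a small reference function. This is a plain-Python illustration under stated assumptions, not a patch for the C++ macro: the name `weighted_percentile` is made up here, `bisect_right` stands in for `std::upper_bound`, a threshold exactly on a knot returns the left endpoint (one of the two options in step 2), and the last segment is interpolated rather than short-circuited to `value[cnt_data-1]`.

```python
from bisect import bisect_right
from itertools import accumulate


def weighted_percentile(values, weights, alpha):
    """Weighted percentile with interpolation on the segment [pos-1, pos]."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    v = [values[i] for i in order]
    cum = list(accumulate(weights[i] for i in order))
    threshold = alpha * cum[-1]

    # Boundary cases: clamp to the smallest / largest value.
    if threshold <= cum[0]:
        return v[0]
    if threshold >= cum[-1]:
        return v[-1]

    # upper_bound semantics: cum[pos-1] <= threshold < cum[pos],
    # and the two checks above guarantee 1 <= pos <= len(v) - 1.
    pos = bisect_right(cum, threshold)
    t = (threshold - cum[pos - 1]) / (cum[pos] - cum[pos - 1])  # in [0, 1)
    return v[pos - 1] + t * (v[pos] - v[pos - 1])


# Worked example from section 2: about 2.333 instead of 1.0.
print(weighted_percentile([2.0, 3.0, 4.0, 5.0], [4.0, 3.0, 2.0, 1.0], 0.5))
```

Because `t` stays in [0, 1), the return value is always between two neighbouring sorted values, which is the property step 3 asks for.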

---

## 5. Environment (optional)

- LightGBM version: e.g. 4.6.0
- macOS, Python 3.12.12

---

## 6. References

- Issue [#5847](https://github.com/microsoft/LightGBM/issues/5847) – Median wrongly computed (unweighted).
- PR [#5848](https://github.com/microsoft/LightGBM/pull/5848) – Fix percentile computation for regression objectives (PercentileFun only).
- `src/objective/regression_objective.hpp` – macro `WeightedPercentileFun`, and its use in `RegressionL1loss::BoostFromScore`, `RegressionL1loss::RenewTreeOutput`, `RegressionQuantileloss::BoostFromScore`, `RegressionQuantileloss::RenewTreeOutput`, and `RegressionMAPELOSS::BoostFromScore` / `RenewTreeOutput`.

