Skip to content

[improve][misc] Migrate Yahoo DataSketches to Apache DataSketches#25965

Merged
lhotari merged 1 commit into
apache:masterfrom
lhotari:lh-migrate-to-apache-datasketches
Jun 8, 2026
Merged

[improve][misc] Migrate Yahoo DataSketches to Apache DataSketches#25965
lhotari merged 1 commit into
apache:masterfrom
lhotari:lh-migrate-to-apache-datasketches

Conversation

@lhotari
Copy link
Copy Markdown
Member

@lhotari lhotari commented Jun 8, 2026

Motivation

Pulsar still depends on com.yahoo.datasketches:sketches-core:0.8.3. The DataSketches project moved to the Apache Software Foundation and was renamed to org.apache.datasketches:datasketches-java; the old Yahoo coordinates have been unmaintained for years.

This mirrors the motivation of apache/bookkeeper#4774. Pulsar master is still on BookKeeper 4.17.3 but will soon migrate to 4.18.0, which uses Apache DataSketches — and there is an intent to drop Yahoo DataSketches on the Pulsar side as well. Migrating now removes the unmaintained dependency and aligns Pulsar with BookKeeper.

The Apache DataSketches maintainers recommend the KLL sketch over the classic quantiles sketch for new code (datasketches-java#398): it is faster and more memory efficient with comparable accuracy. This change therefore ports to KllDoublesSketch rather than a 1:1 replacement with the classic DoublesSketch.

Modifications

  • Version catalog (gradle/libs.versions.toml): removed com.yahoo.datasketches:sketches-core:0.8.3; added org.apache.datasketches:datasketches-java:7.0.1 and datasketches-memory:4.1.0. Build files now reference libs.datasketches.java.
  • Sketch usages ported to org.apache.datasketches.kll.KllDoublesSketch in DataSketchesOpStatsLogger (broker + bookkeeper-prometheus-metrics-provider), DataSketchesSummaryLogger, ThreadLocalAccessor, and the client ProducerStatsRecorderImpl:
    • On-heap throughout (newHeapInstance()), matching prior behavior.
    • KLL has no DoublesUnion; aggregation uses merge() into a freshly allocated heap sketch.
    • KLL has no reset(); per-thread sketches are reassigned to a fresh newHeapInstance() under the existing StampedLock write lock.
    • KllDoublesSketch.getQuantile()/getQuantiles() throw on an empty sketch (the classic sketch returned NaN), so callers now guard with isEmpty() to preserve the previous NaN return.
  • Shading: updated relocation/include rules in pulsar.client-shade-conventions.gradle.kts and pulsar-functions/localrun-shaded from com.yahoo* to org.apache.datasketches.
  • LICENSE (distribution/server, distribution/shell): updated the bundled jar entries to the Apache coordinates/versions.
  • Tests: ThreadLocalAccessorTest ported to KllDoublesSketch.

Note: metric names and formats are unchanged; only the internal sketch algorithm changes, so approximate quantile values may differ slightly.

Verifying this change

This change is already covered by existing tests (e.g. ThreadLocalAccessorTest and the broker stats/metrics tests). Checkstyle and spotless pass locally.

Does this pull request potentially affect one of the following parts:

  • Dependencies (add or upgrade a dependency)

@lhotari lhotari added this to the 5.0.0-M1 milestone Jun 8, 2026
@lhotari lhotari requested a review from liangyepianzhou June 8, 2026 08:56
@lhotari lhotari merged commit c78f832 into apache:master Jun 8, 2026
74 of 79 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants