Broken data in edits_hourly Druid data cube
Closed, Resolved (Public)


The edits_hourly data cube in Druid contains data that, when queried, results in errors in both Superset and Turnilo.

In Superset, querying edits_hourly for data that spans 2021-06-12 to 2021-06-30 (inclusive on both ends) will throw this error (as reported by @MNeisler in T346636#9179412):

java.lang.RuntimeException: net.jpountz.lz4.LZ4Exception: Error decoding offset 53533 of input buffer

In Turnilo, querying edits_hourly for data at any point within June 2021 results in an error:

Query error
Request failed with status code 500

At this point, we don't know whether other date ranges will also result in errors.

Event Timeline

VirginiaPoundstone triaged this task as High priority. (Edited Sep 20 2023, 1:31 PM)

Thanks @nettrom_WMF, this is under review as part of our ops week.

Initial checks:

  • The error is reproducible in Turnilo for the month 2021-06.
  • This month is the only one generating an error in Turnilo (other time periods show data).
  • The data can be queried in the Hive edit_hourly table.

An interesting, counter-intuitive result:

  • SQL queries on the Druid dataset from SQL Lab work!
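For reference, a query of roughly the shape below runs without error from SQL Lab against the Druid datasource. This is a sketch: the aggregation and column names other than __time are illustrative and have not been checked against the cube's actual schema.

```sql
-- Hypothetical SQL Lab query against the Druid edits_hourly datasource.
-- Only __time is a known column; the rest of the query shape is illustrative.
SELECT FLOOR(__time TO DAY) AS "day",
       COUNT(*) AS row_count
FROM edits_hourly
WHERE __time >= TIMESTAMP '2021-06-01'
  AND __time <  TIMESTAMP '2021-07-01'
GROUP BY 1
ORDER BY 1;
```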

Continuing investigations

The issue is with the __time Druid field, but whether the error shows up depends on how the field is accessed:
when filtering at month level only, there is no error, while filtering at day level errors for days between 2021-06-11 and 2021-06-30 (both included).
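To illustrate the distinction, the two filter shapes below behave differently. These are hypothetical sketches of the queries involved, not the exact ones that were run:

```sql
-- Month-level filter on __time: no error observed.
SELECT COUNT(*)
FROM edits_hourly
WHERE FLOOR(__time TO MONTH) = TIMESTAMP '2021-06-01';

-- Day-level filter on __time: errors for days in 2021-06-11 .. 2021-06-30.
SELECT COUNT(*)
FROM edits_hourly
WHERE FLOOR(__time TO DAY) = TIMESTAMP '2021-06-15';
```

A plausible reading is that the day-level filter forces Druid to decompress a column block that the month-level access path never touches, which is consistent with a corrupted segment file.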
Given the error from druid (decompression error), this feels like a corrupted file from indexation.

I will launch a re-indexation of the dataset, which should hopefully fix the issue.
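For context, a re-indexation in Druid can be expressed as a native batch ingestion spec that reads the datasource's existing segments back and rewrites them. The sketch below is a generic Druid reindexing spec, not necessarily the job the actual loading pipeline uses; the dimension handling in particular is simplified:

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "edits_hourly",
        "interval": "2021-06-01/2021-07-01"
      }
    },
    "dataSchema": {
      "dataSource": "edits_hourly",
      "timestampSpec": { "column": "__time", "format": "millis" },
      "dimensionsSpec": {},
      "granularitySpec": {
        "segmentGranularity": "month",
        "queryGranularity": "hour",
        "intervals": ["2021-06-01/2021-07-01"]
      }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
```

Rewriting the affected interval replaces the corrupted segment files with freshly compressed ones, which is why re-indexation is a reasonable fix for an LZ4 decompression error.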