
Bug: User History has mismatching order of fields in Parquet vs. Hive
Closed, Resolved | Public | 3 Estimated Story Points | BUG REPORT

Description

Data Engineering Bug Report or Data Problem Form.

Please fill out the following
For a data related problem:
  • Is this a data quality issue? Yes
  • What datasets and/or dashboards are affected? wmf.mediawiki_user_history
  • What are the observed vs expected results? Please include information such as location of data, any initial assessments, sql statements, screenshots.

caused_by_user_text and caused_by_anonymous_user appear in a different order in the Hive schema than in the Parquet files (via Spark); this was verified by inspecting the actual Parquet files.

As a result, the query select * from wmf.mediawiki_user_history where snapshot='2022-08' limit 10; fails in Presto and Hive, because the schema in the Hive metastore doesn't match the actual Parquet files.
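The mismatch described above can be illustrated with a minimal sketch that compares two ordered column lists position by position. The function name, the placeholder columns first_col and last_col, and which system holds which order are assumptions made for illustration; only the two swapped field names come from the report.

```python
from typing import List, Tuple


def column_order_mismatches(
    metastore_cols: List[str], parquet_cols: List[str]
) -> List[Tuple[int, str, str]]:
    """Return (position, metastore_name, parquet_name) for each position that differs."""
    return [
        (i, m, p)
        for i, (m, p) in enumerate(zip(metastore_cols, parquet_cols))
        if m != p
    ]


# Hypothetical, shortened column lists: only the two swapped names come from
# the report; first_col and last_col are placeholders.
hive_schema = ["first_col", "caused_by_user_text", "caused_by_anonymous_user", "last_col"]
parquet_schema = ["first_col", "caused_by_anonymous_user", "caused_by_user_text", "last_col"]

for position, hive_name, parquet_name in column_order_mismatches(hive_schema, parquet_schema):
    print(f"position {position}: Hive has {hive_name!r}, Parquet has {parquet_name!r}")
```

In practice the two lists would come from the metastore table definition and from the footer of one of the Parquet files; how they are obtained is left out of this sketch.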

For the DE Team to fill out
Which systems does this affect?
  • Hive
  • Druid
  • Superset
  • Turnilo
  • WikiDumps
  • Wikistats
  • Airflow
  • HDFS
  • Goblin
  • Sqoop
  • Dashiki
  • DataHub
  • Spark
  • Jupyter
  • Modern Event Platform
  • Event Logging
  • Other - Presto
Impact Assessment:

Does this problem qualify as an incident?

  • Yes
  • No

Does this violate an SLO?

  • Yes
  • No
Value Calculator | Rank
Will this improve the efficiency of a team's workflow? | 1
Does this have an effect on our Core Metrics? | 1
Does this align with our strategic goals? | 2
Is this a blocker for another team? | 2

Event Timeline

EChetty updated the task description.
EChetty moved this task from Backlog to Pipelines on the Data-Engineering-Planning board.
EChetty added a project: Data Pipelines.
EChetty set the point value for this task to 3. (Nov 7 2022, 4:32 PM)

Change 862295 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Use parquet column names to order results from hive in presto

https://gerrit.wikimedia.org/r/862295

Change 862295 merged by Btullis:

[operations/puppet@production] Use parquet column names to order results from hive in presto

https://gerrit.wikimedia.org/r/862295
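The patch title suggests switching from position-based to name-based column resolution when Presto reads Parquet files through the Hive catalog. Presto's Hive connector documents a hive.parquet.use-column-names property for this purpose; whether this patch sets exactly that property is an assumption based on the title. The difference between the two strategies can be sketched as follows, with hypothetical function names, sample values, and column types:

```python
from typing import Any, Dict, List


def read_by_position(table_cols: List[str], file_row: List[Any]) -> Dict[str, Any]:
    """Position-based: the i-th table column takes the i-th value stored in the file."""
    return {col: file_row[i] for i, col in enumerate(table_cols)}


def read_by_name(
    table_cols: List[str], file_cols: List[str], file_row: List[Any]
) -> Dict[str, Any]:
    """Name-based: each table column is looked up by its name in the file schema."""
    index = {name: i for i, name in enumerate(file_cols)}
    return {col: file_row[index[col]] for col in table_cols}


# Hypothetical data: the metastore and the file list the two fields in opposite
# order. The types (string vs. boolean) are assumed for illustration.
table_cols = ["caused_by_user_text", "caused_by_anonymous_user"]
file_cols = ["caused_by_anonymous_user", "caused_by_user_text"]
file_row = [False, "ExampleUser"]

# Values land in the wrong columns when resolved by position.
print(read_by_position(table_cols, file_row))
# Values land in the right columns when resolved by name.
print(read_by_name(table_cols, file_cols, file_row))
```

With position-based resolution a boolean ends up in a column the metastore declares as text, which is the kind of type conflict that can make a select * fail; name-based resolution is insensitive to column order.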

Mentioned in SAL (#wikimedia-analytics) [2022-11-30T16:19:37Z] <btullis> restarting presto-server on an-coord1001 for T321960 and T321231

Mentioned in SAL (#wikimedia-analytics) [2022-11-30T16:20:41Z] <btullis> roll-restarting presto workers for T321960 and T321231

Change 862305 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the presto server catalogs with new parquet settings

https://gerrit.wikimedia.org/r/862305

Change 862305 merged by Btullis:

[operations/puppet@production] Update the presto server catalogs with new parquet settings

https://gerrit.wikimedia.org/r/862305

Mentioned in SAL (#wikimedia-analytics) [2022-11-30T16:44:58Z] <btullis> roll-restarting presto workers again for T321960 and T321231

Hopefully this is fixed now. @Milimetric are you able to confirm? Thanks.

[Attached image: image.png (331×1 px, 72 KB)]

Is this still an issue for Hive?

BTullis claimed this task.