
[BUG] Presto Error: parquet file declares column as wrong type for wmf..mediawiki_history
Closed, Resolved · Public · Bug Report

Description

When querying wmf.mediawiki_history for the field revision_tags in the SQL Editor on Superset, I get this Presto error:
"presto error: The column revision_tags is declared as type array<string>, but the Parquet file declares the column as type BOOLEAN"

I also get a different Presto error when querying wmf.mediawiki_history for the field revision_id:
"presto error: The column revision_id is declared as type bigint, but the Parquet file declares the column as type BINARY"

This happens on all 2020 snapshots.

Event Timeline

SNowick_WMF renamed this task from "Presto Error: parquet file declares column as wrong type for wmf..mediawiki_history" to "[BUG] Presto Error: parquet file declares column as wrong type for wmf..mediawiki_history". · Aug 26 2020, 1:57 AM
Restricted Application changed the subtype of this task from "Task" to "Bug Report". · Aug 26 2020, 1:57 AM

Change 622474 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add page-artificial-id to mediawiki-history hive schema

https://gerrit.wikimedia.org/r/622474

Change 622474 merged by Joal:
[analytics/refinery@master] Add page-artificial-id to mediawiki-history hive schema

https://gerrit.wikimedia.org/r/622474

Indeed, it was a bug!
We had deliberately left the artificial-page-id field out of the mediawiki-history Hive schema.
That field makes it possible to link events to a single page when they can't be linked to a real page-id.
However, Presto doesn't reconcile the Hive schema with the Parquet schema: it assumes the two match column for column, which was not the case here.
The field has now been added, and Presto queries for revision_tags work.

@elukey to update the setting in presto that assumes order on columns on hive metastore when looking fields up on parquet as this is likely to cause a problem in the future as tables evolve.
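
For readers landing on this task later, here is a minimal sketch of why a single column missing from the Hive schema produces the errors reported above. It is not Presto's code. Apart from revision_id and revision_tags, the column names, the spelling page_artificial_id, and the Parquet types are hypothetical stand-ins chosen only to reproduce the reported mismatches.

```python
# Illustrative model only, NOT Presto's implementation.
# Hypothetical stand-ins: event_entity, hypothetical_flag, the spelling
# "page_artificial_id", and all Parquet type labels below.

# Columns as physically written in the Parquet file (name, parquet type).
parquet_columns = [
    ("event_entity", "BINARY"),
    ("page_artificial_id", "BINARY"),  # field left out of the Hive schema
    ("revision_id", "INT64"),
    ("hypothetical_flag", "BOOLEAN"),
    ("revision_tags", "LIST"),
]

# Columns as declared in the Hive metastore before the fix (name, hive type).
hive_columns = [
    ("event_entity", "string"),
    ("revision_id", "bigint"),
    ("hypothetical_flag", "boolean"),
    ("revision_tags", "array<string>"),
]


def resolve_by_position(hive, parquet):
    """Map each Hive column to the Parquet column at the same index."""
    return {name: parquet[i] for i, (name, _type) in enumerate(hive)}


def resolve_by_name(hive, parquet):
    """Map each Hive column to the Parquet column with the same name."""
    by_name = {name: (name, ptype) for name, ptype in parquet}
    return {name: by_name[name] for name, _type in hive}


positional = resolve_by_position(hive_columns, parquet_columns)
named = resolve_by_name(hive_columns, parquet_columns)

# Positional lookup is shifted by one after the missing field, so
# revision_id lands on a BINARY column and revision_tags on a BOOLEAN one.
print(positional["revision_id"])
print(positional["revision_tags"])

# Name-based lookup is unaffected by the extra Parquet column.
print(named["revision_id"])
print(named["revision_tags"])
```

In this model, adding the missing field to the Hive schema realigns the positions, while resolving by name tolerates the extra column regardless of order.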

Change 622598 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] presto: set hive.parquet.use-column-names to true

https://gerrit.wikimedia.org/r/622598

Change 622598 merged by Elukey:
[operations/puppet@production] presto: set hive.parquet.use-column-names to true

https://gerrit.wikimedia.org/r/622598

Change 622607 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::coordinator: set hive.parquet.use-column-names to true

https://gerrit.wikimedia.org/r/622607

Change 622607 abandoned by Elukey:
[operations/puppet@production] role::analytics_cluster::coordinator: set hive.parquet.use-column-names to true

Reason:
not needed!

https://gerrit.wikimedia.org/r/622607

> @elukey to update the setting in presto that assumes order on columns on hive metastore when looking fields up on parquet as this is likely to cause a problem in the future as tables evolve.

Done! @SNowick_WMF can you re-test and tell us if the issue is fixed?

Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
Milimetric moved this task from Operational Excellence to Incoming on the Analytics board.
Milimetric subscribed.

@SNowick_WMF please reopen if you see something wrong

Looks good, queries are working now, thanks!