
Default hive table creation to parquet - needs hive 2.3.0
Closed, Resolved (Public)

Description

To make it easier for our users, we should set up Hive so that it defaults to Parquet when creating tables.

Event Timeline

JAllemandou renamed this task from Default hive table creation to parquet to Default hive table creation to parquet - needs hive 2.3.0. Jun 29 2017, 8:37 AM
JAllemandou moved this task from Operational Excellence Future to Blocked on the Analytics board.

Yes! The only concern I have is the null-value-in-struct bug we've hit, which seems related to Parquet. I think we should do it and possibly revert if too many problems show up :)

0: jdbc:hive2://analytics-test-hive.eqiad.wmn> set hive.default.fileformat;
going to print operations logs
printed operations logs
Getting log thread is interrupted, since query is done!
set
hive.default.fileformat=TextFile
1 row selected (0.223 seconds)
0: jdbc:hive2://analytics-test-hive.eqiad.wmn> set hive.default.fileformat=Parquet;
going to print operations logs
printed operations logs
Getting log thread is interrupted, since query is done!
No rows affected (0.013 seconds)
0: jdbc:hive2://analytics-test-hive.eqiad.wmn> set hive.default.fileformat.managed=Parquet;
going to print operations logs
printed operations logs
Getting log thread is interrupted, since query is done!
No rows affected (0.01 seconds)
0: jdbc:hive2://analytics-test-hive.eqiad.wmn> set hive.default.fileformat;
going to print operations logs
printed operations logs
Getting log thread is interrupted, since query is done!
set
hive.default.fileformat=Parquet
1 row selected (0.025 seconds)
0: jdbc:hive2://analytics-test-hive.eqiad.wmn> set hive.default.fileformat.managed;
going to print operations logs
printed operations logs
Getting log thread is interrupted, since query is done!
set
hive.default.fileformat.managed=Parquet
1 row selected (0.019 seconds)

I confirm that changing the property has the expected effect: a newly created table with no explicit format is stored as Parquet. We need to implement the change in the main hive-site.xml. Doing it.
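
For reference, a minimal sketch of the kind of check behind that confirmation (the table name and columns are hypothetical; the SerDe and formats listed are what Hive reports for Parquet-backed tables):

  -- Create a table without an explicit STORED AS clause; with
  -- hive.default.fileformat=Parquet it should end up Parquet-backed.
  CREATE TABLE test_default_format (id INT, name STRING);

  -- Inspect the storage section of the output: a Parquet table reports
  -- org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe as the SerDe and
  -- MapredParquetInputFormat / MapredParquetOutputFormat as input/output formats.
  DESCRIBE FORMATTED test_default_format;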

Change 665054 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] Use parquet as Hive default file format

https://gerrit.wikimedia.org/r/665054

Change 665054 merged by Elukey:
[operations/puppet@production] Use parquet as Hive default file format

https://gerrit.wikimedia.org/r/665054

Change 665072 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Move analytics-hive.eqiad.wmnet to an-coord1002

https://gerrit.wikimedia.org/r/665072

Change 665072 merged by Elukey:
[operations/dns@master] Move analytics-hive.eqiad.wmnet to an-coord1002

https://gerrit.wikimedia.org/r/665072

elukey triaged this task as Medium priority.
elukey added a project: Analytics-Kanban.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.

Change 665425 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Make hive temporary-tables storage format explicit

https://gerrit.wikimedia.org/r/665425
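
For context, a hedged sketch of what making the storage format explicit looks like for a temporary table (the table name and column are illustrative, not the actual refinery change):

  -- Before this patch, a temporary table created without a STORED AS clause
  -- relied on hive.default.fileformat, so its on-disk format silently changed
  -- from TextFile to Parquet. Stating the format explicitly keeps the job
  -- independent of the cluster-wide default.
  CREATE TEMPORARY TABLE tmp_pageview_intermediate (line STRING)
  STORED AS TEXTFILE;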

The change has broken some jobs:

  • mobile_apps-uniques-daily-coord (job failed on 2021-02-18)
  • pageview-daily_dump-coord (job succeeded but produced incorrect data - to be rerun for 2021-02-18)
  • pageview_historical (not running, used for backfilling old data)
  • pageview_historical_raw (not running, used for backfilling old data)

Change 665425 merged by Milimetric:
[analytics/refinery@master] Make hive temporary-tables storage format explicit

https://gerrit.wikimedia.org/r/665425