To make it easier for our users, we should set up Hive so that it defaults to Parquet when creating tables.
Description
Details
Status | Assigned | Task
---|---|---
Resolved | JAllemandou | T168554 Default hive table creation to parquet - needs hive 2.3.0
Resolved | elukey | T203693 Update to CDH 6 or other up-to-date Hadoop distribution
Resolved | elukey | T273711 Upgrade the Analytics Hadoop cluster to Apache Bigtop
Resolved | elukey | T274345 Move the puppet codebase from cdh to bigtop
Resolved | Milimetric | T274322 Clean up issues with jobs after Hadoop Upgrade
Resolved | Ottomata | T274384 Repackage spark without hadoop, use provided hadoop jars
Resolved | elukey | T276121 asoranking timer failed on stat1007
Resolved | JAllemandou | T260409 Establish what data must be backed up before the HDFS upgrade
Resolved | elukey | T260411 Create a temporary hadoop backup cluster
Resolved | JAllemandou | T272846 Backup HDFS data before BigTop upgrade
Resolved | elukey | T244499 Upgrade the Hadoop test cluster to BigTop
Duplicate | elukey | T263814 Create temporary cluster to hold a copy of data for backup purposes
Event Timeline
Parquet as the default file format was only added in Hive 2.3.0 (see https://stackoverflow.com/questions/44038151/hive-how-to-set-parquet-orc-as-default-output-format).
Yes! The only concern I have is with the null-value in struct bug we've hit. It seems related to parquet. I think we should do it and possibly revert if too many problems show up :)
```
0: jdbc:hive2://analytics-test-hive.eqiad.wmn> set hive.default.fileformat;
set hive.default.fileformat=TextFile
1 row selected (0.223 seconds)
0: jdbc:hive2://analytics-test-hive.eqiad.wmn> set hive.default.fileformat=Parquet;
No rows affected (0.013 seconds)
0: jdbc:hive2://analytics-test-hive.eqiad.wmn> set hive.default.fileformat.managed=Parquet;
No rows affected (0.01 seconds)
0: jdbc:hive2://analytics-test-hive.eqiad.wmn> set hive.default.fileformat;
set hive.default.fileformat=Parquet
1 row selected (0.025 seconds)
0: jdbc:hive2://analytics-test-hive.eqiad.wmn> set hive.default.fileformat.managed;
set hive.default.fileformat.managed=Parquet
1 row selected (0.019 seconds)
```
I confirm that changing the property has an effect: a newly created table without an explicit format is stored as Parquet. We need to implement the change in the main hive-site.xml. Doing it.
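For reference, the puppet change presumably ends up rendering the two properties tested above into hive-site.xml, along these lines (the exact surrounding file layout is an assumption; the property names match the beeline session):

```xml
<!-- hive-site.xml fragment: make Parquet the default storage format
     for both external and managed tables (requires Hive >= 2.3.0) -->
<property>
  <name>hive.default.fileformat</name>
  <value>Parquet</value>
</property>
<property>
  <name>hive.default.fileformat.managed</name>
  <value>Parquet</value>
</property>
```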
Change 665054 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] Use parquet as Hive default file format
Change 665054 merged by Elukey:
[operations/puppet@production] Use parquet as Hive default file format
Change 665072 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Move analytics-hive.eqiad.wmnet to an-coord1002
Change 665072 merged by Elukey:
[operations/dns@master] Move analytics-hive.eqiad.wmnet to an-coord1002
Change 665425 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Make hive temporary-tables storage format explicit
The change has broken some jobs:
- mobile_apps-uniques-daily-coord (job failed on 2021-02-18)
- pageview-daily_dump-coord (successful job but incorrect data - to be rerun for 2021-02-18)
- pageview_historical (not running, used for backfilling old data)
- pageview_historical_raw (not running, used for backfilling old data)
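The refinery fix presumably pins the storage format at table-creation time instead of inheriting the new cluster-wide default; a minimal HiveQL sketch of the idea (the table and column names here are hypothetical, not the actual refinery tables):

```sql
-- Before: no format declared, so the temporary table silently picked up
-- the new cluster default (Parquet), breaking jobs that expected text output.
-- CREATE TEMPORARY TABLE tmp_pageview_agg (page STRING, views BIGINT);

-- After: the storage format is stated explicitly, so the job's output
-- format no longer depends on hive.default.fileformat.
CREATE TEMPORARY TABLE tmp_pageview_agg (page STRING, views BIGINT)
STORED AS TEXTFILE;
```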
Change 665425 merged by Milimetric:
[analytics/refinery@master] Make hive temporary-tables storage format explicit