Currently, transformed XML dumps are stored in parquet on HDFS.
This leads to memory errors in Spark, as the read-optimizations that make parquet perform best for analytics data are applied at scan time.
It seems avro would be a better fit than parquet here.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Move wikitext_history from parquet to avro | analytics/refinery | master | +2 -2
Related Objects
Event Timeline
I have generated avro files for the 2019-09 dumps and ran quite a few queries on them, with a limited amount of RAM needed per Spark executor, and without issue in Hive. Parquet read-optimizations were definitely what was causing issues here.
However, I hit an interesting finding:
- In Spark
```
spark.sql("SELECT count(1) FROM joal.mediawiki_wikitext_history_avro where snapshot = '2019-09' and wiki_db = 'frwiki' and lower(revision_text) rlike '#redirect \\\\[\\\\[.*\\\\]\\\\].*'").show()
+--------+
|count(1)|
+--------+
| 1223458|
+--------+

spark.sql("SELECT count(1) FROM joal.mediawiki_wikitext_history_avro where snapshot = '2019-09' and wiki_db = 'frwiki' and lower(revision_text) like '#redirect [[%]]%'").show()
+--------+
|count(1)|
+--------+
| 1073455|
+--------+
```
- In Hive
```
SELECT count(1) FROM joal.mediawiki_wikitext_history_avro where snapshot = '2019-09' and wiki_db = 'frwiki' and lower(revision_text) rlike '#redirect \\[\\[.*\\]\\].*';
_c0
1223458

SELECT count(1) FROM joal.mediawiki_wikitext_history_avro where snapshot = '2019-09' and wiki_db = 'frwiki' and lower(revision_text) like '#redirect [[%]]%';
_c0
982415
```
If someone has any idea, I'd be glad to hear it!
Change 547184 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Move wikitext_history from parquet to avro
I think the rlike comparison does not require the matched string to start with the pattern, unless you anchor it with ^.
The like comparison, however, does.
So the extra counts in the rlike results could be records whose revision text has something before #redirect.
Still a mystery why Hive's result is even lower than Spark's.
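The anchoring difference described above can be sketched in Python (a hypothetical illustration using `re`, not Hive's or Spark's actual regex engines):

```python
import re

# Two toy revision texts: one starts with the redirect pattern,
# one has extra content before it.
texts = [
    "#redirect [[Foo]] rest",           # starts with the pattern
    "comment\n#redirect [[Foo]] rest",  # pattern appears later in the text
]

redirect_re = re.compile(r"#redirect \[\[.*\]\].*")

def rlike(text):
    # Hive/Spark RLIKE: true if the regex matches anywhere in the string
    return redirect_re.search(text) is not None

def like_prefix(text):
    # LIKE '#redirect [[%]]%': the match is anchored at the start
    return redirect_re.match(text) is not None

matches_rlike = [t for t in texts if rlike(t)]
matches_like = [t for t in texts if like_prefix(t)]
# rlike also counts the second text, which would explain the extra rows
```

Under this sketch, `rlike` matches both texts while the anchored `like` matches only the first, which is consistent with rlike returning the higher count.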
Change 547184 merged by Joal:
[analytics/refinery@master] Move wikitext_history from parquet to avro
Table is filled with manually converted data, and the new job seems configured correctly. We'll know for sure once the dumps are fully imported.
This bothered me, so I looked into it more. It can be explained by the fact that Spark's LIKE matches newlines with % and Hive's does not:
```
scala> spark.sql("select '#redirect [[blah\n]] blah' like '#redirect [[%]]%'").show() --> true
hive(milimetric)> select '#redirect [[blah\n]] blah' like '#redirect [[%]]%'; --> false
```
(ps: yes I did delete the space between hive and (milimetric) to get those two lines to line up :))
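For what it's worth, the newline difference can be mimicked with Python's `re` module (a sketch under the assumption that LIKE's `%` compiles down to `.*`, with Spark effectively applying DOTALL semantics):

```python
import re

# The newline-containing revision text from the example above
text = "#redirect [[blah\n]] blah"

# LIKE '#redirect [[%]]%' rewritten as a regex: each % becomes .*
like_as_regex = r"#redirect \[\[.*\]\].*"

# Hive-like: . does not cross newlines, so the full match fails
hive_result = re.fullmatch(like_as_regex, text) is not None

# Spark-like: % also matches newlines, mimicked here with re.S (DOTALL)
spark_result = re.fullmatch(like_as_regex, text, re.S) is not None
```

This reproduces the observed split: false on the Hive-like side, true on the Spark-like side.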