
Check Avro as potential better file format for wikitext-history
Closed, Resolved · Public · 3 Estimated Story Points

Description

Currently, transformed XML dumps are stored in Parquet on HDFS.
This leads to memory errors in Spark, because Parquet's read optimizations, which are tuned to perform best on analytics data, backfire on rows carrying very large wikitext blobs.
Avro would probably be a better fit than Parquet here.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Oct 28 2019, 1:37 PM
JAllemandou added a project: Analytics-Kanban.
JAllemandou set the point value for this task to 3.
fdans triaged this task as High priority. · Oct 28 2019, 3:46 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

I have generated Avro files for the 2019-09 dumps and ran quite a few queries on them, with a limited amount of RAM needed per Spark executor and without issue in Hive. Parquet read optimizations were definitely what was causing issues here.
However, I hit an interesting finding:

  • In Spark
spark.sql("SELECT count(1) FROM joal.mediawiki_wikitext_history_avro where snapshot = '2019-09' and wiki_db = 'frwiki' and lower(revision_text) rlike '#redirect \\\\[\\\\[.*\\\\]\\\\].*'").show()
+--------+                                                                      
|count(1)|
+--------+
| 1223458|
+--------+

spark.sql("SELECT count(1) FROM joal.mediawiki_wikitext_history_avro where snapshot = '2019-09' and wiki_db = 'frwiki' and lower(revision_text) like '#redirect [[%]]%'").show()
+--------+                                                                      
|count(1)|
+--------+
| 1073455|
+--------+
  • In Hive
SELECT count(1) FROM joal.mediawiki_wikitext_history_avro where snapshot = '2019-09' and wiki_db = 'frwiki' and lower(revision_text) rlike '#redirect \\[\\[.*\\]\\].*';
_c0
1223458

SELECT count(1) FROM joal.mediawiki_wikitext_history_avro where snapshot = '2019-09' and wiki_db = 'frwiki' and lower(revision_text) like '#redirect [[%]]%';
_c0
982415
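As an aside on the different escaping in the two queries: in Spark the pattern passes through a Scala string literal before reaching the SQL parser, so each regex backslash is written four times, while in the Hive CLI it is written twice. Once each client strips its escaping layers, both queries hand the engine the same regex. A minimal Python sketch of that effective pattern (reproducing it with Python's `re` is an assumption for illustration):

```python
import re

# The effective regex after all escaping layers are resolved:
# backslash-bracket tokens make [[ and ]] literal characters.
pattern = re.compile(r"#redirect \[\[.*\]\].*")

# A well-formed redirect line matches anywhere in the string.
assert pattern.search("#redirect [[accueil]]") is not None

# Without the literal double brackets there is no match.
assert pattern.search("#redirect accueil") is None
```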

If someone has any idea, I'd be glad to hear it!

Change 547184 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Move wikitext_history from parquet to avro

https://gerrit.wikimedia.org/r/547184

mforns added a subscriber: mforns. · Oct 30 2019, 4:57 PM

@JAllemandou

I think the rlike comparison does not require the matched string to start with the pattern, unless you use ^.
The like comparison, however, does.
So it can be that the extra counts in the rlike results are records whose revision text has something before #redirect.
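This anchoring difference can be sketched in Python (an assumption for illustration: modelling rlike as an unanchored search and LIKE as a match anchored at both ends):

```python
import re

# The effective redirect pattern from the queries above.
redirect_re = re.compile(r"#redirect \[\[.*\]\].*")

# A revision text with something before #redirect.
text = "{{stub}} #redirect [[target]]"

# rlike: an unanchored search, so the pattern may match anywhere -> counted.
assert redirect_re.search(text) is not None

# LIKE '#redirect [[%]]%': anchored at the start, emulated here with
# fullmatch -> this record is NOT counted.
assert redirect_re.fullmatch(text) is None

# Anchoring the regex with ^ makes rlike agree with LIKE on this record.
assert re.search(r"^#redirect \[\[.*\]\].*", text) is None
```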

Still a mystery why Hive's result is even lower than Spark's.

Created T236985 about spark/hive like difference

Change 547184 merged by Joal:
[analytics/refinery@master] Move wikitext_history from parquet to avro

https://gerrit.wikimedia.org/r/547184

The table is filled with manually converted data, and the new job seems configured correctly. We'll know for sure when the dumps are fully imported.

Nuria closed this task as Resolved. · Nov 7 2019, 11:13 PM

This bothered me, so I looked into it more. It can be explained by the fact that Spark matches newlines with % and Hive does not:

scala> spark.sql("select '#redirect [[blah\n]] blah' like '#redirect [[%]]%'").show() --> true
hive(milimetric)> select '#redirect [[blah\n]] blah' like '#redirect [[%]]%';         --> false

(ps: yes I did delete the space between hive and (milimetric) to get those two lines to line up :))
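The newline behavior can be reproduced in Python by translating the two LIKE dialects into regexes (an assumption for illustration: Spark's % behaves like a DOTALL wildcard, Hive's like plain `.*`):

```python
import re

text = "#redirect [[blah\n]] blah"

# Spark-style LIKE: '%' crosses newlines, emulated with (?s:.*).
spark_like = re.compile(r"#redirect \[\[(?s:.*)\]\](?s:.*)")

# Hive-style LIKE: '%' stops at newlines, emulated with plain .*
hive_like = re.compile(r"#redirect \[\[.*\]\].*")

assert spark_like.fullmatch(text) is not None  # Spark: true
assert hive_like.fullmatch(text) is None       # Hive: false
```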