
Check Avro as potential better file format for wikitext-history
Closed, Resolved · Public · 3 Estimated Story Points

Description

Currently, transformed XML dumps are stored in Parquet on HDFS.
This leads to memory errors in Spark, because Parquet's read optimizations, which are tuned to perform best on analytics data, backfire on rows carrying very large wikitext blobs.
Avro would probably be a better fit than Parquet here.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Oct 28 2019, 1:37 PM
JAllemandou added a project: Analytics-Kanban.
JAllemandou set the point value for this task to 3.
fdans triaged this task as High priority. · Oct 28 2019, 3:46 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

I have generated Avro files for the 2019-09 dumps and ran quite a few queries on them, with a limited amount of RAM needed per Spark executor and without issue in Hive. Parquet read optimizations were definitely what was causing issues here.
However, I hit an interesting finding:

  • In Spark
spark.sql("SELECT count(1) FROM joal.mediawiki_wikitext_history_avro where snapshot = '2019-09' and wiki_db = 'frwiki' and lower(revision_text) rlike '#redirect \\\\[\\\\[.*\\\\]\\\\].*'").show()
+--------+                                                                      
|count(1)|
+--------+
| 1223458|
+--------+

spark.sql("SELECT count(1) FROM joal.mediawiki_wikitext_history_avro where snapshot = '2019-09' and wiki_db = 'frwiki' and lower(revision_text) like '#redirect [[%]]%'").show()
+--------+                                                                      
|count(1)|
+--------+
| 1073455|
+--------+
  • In Hive
SELECT count(1) FROM joal.mediawiki_wikitext_history_avro where snapshot = '2019-09' and wiki_db = 'frwiki' and lower(revision_text) rlike '#redirect \\[\\[.*\\]\\].*';
_c0
1223458

SELECT count(1) FROM joal.mediawiki_wikitext_history_avro where snapshot = '2019-09' and wiki_db = 'frwiki' and lower(revision_text) like '#redirect [[%]]%';
_c0
982415
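As an aside on the different escaping in the two queries: in Spark the pattern passes through a Scala string literal before reaching the SQL parser, so each regex backslash is written four times, while in the Hive CLI it is written twice. Once each client strips its escaping layers, both queries hand the engine the same regex. A minimal Python sketch of that effective pattern (reproducing it with Python's `re` is an assumption for illustration):

```python
import re

# The effective regex after all escaping layers are resolved:
# backslash-bracket tokens make [[ and ]] literal characters.
pattern = re.compile(r"#redirect \[\[.*\]\].*")

# A well-formed redirect line matches anywhere in the string.
assert pattern.search("#redirect [[accueil]]") is not None

# Without the literal double brackets there is no match.
assert pattern.search("#redirect accueil") is None
```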

If someone has any idea, I'd be glad to hear it!

Change 547184 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Move wikitext_history from parquet to avro

https://gerrit.wikimedia.org/r/547184

mforns added a subscriber: mforns. · Oct 30 2019, 4:57 PM

@JAllemandou

I think the rlike comparison does not require the matched string to start with the pattern, unless you use ^.
The like comparison, however, does.
So it can be that the extra counts in the rlike results are records whose revision text has something before #redirect.
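This anchoring difference can be sketched in Python (an assumption for illustration: modelling rlike as an unanchored search and LIKE as a match anchored at both ends):

```python
import re

# The effective redirect pattern from the queries above.
redirect_re = re.compile(r"#redirect \[\[.*\]\].*")

# A revision text with something before #redirect.
text = "{{stub}} #redirect [[target]]"

# rlike: an unanchored search, so the pattern may match anywhere -> counted.
assert redirect_re.search(text) is not None

# LIKE '#redirect [[%]]%': anchored at the start, emulated here with
# fullmatch -> this record is NOT counted.
assert redirect_re.fullmatch(text) is None

# Anchoring the regex with ^ makes rlike agree with LIKE on this record.
assert re.search(r"^#redirect \[\[.*\]\].*", text) is None
```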

Still a mystery why Hive's result is even lower than Spark's.

Created T236985 about spark/hive like difference

Change 547184 merged by Joal:
[analytics/refinery@master] Move wikitext_history from parquet to avro

https://gerrit.wikimedia.org/r/547184

The table is filled with manually converted data, and the new job seems configured correctly. We'll know for sure when the dumps are fully imported.

Nuria closed this task as Resolved. · Nov 7 2019, 11:13 PM

This bothered me, so I looked into it more. It can be explained by the fact that Spark matches newlines with % and Hive does not:

scala> spark.sql("select '#redirect [[blah\n]] blah' like '#redirect [[%]]%'").show() --> true
hive(milimetric)> select '#redirect [[blah\n]] blah' like '#redirect [[%]]%';         --> false

(ps: yes I did delete the space between hive and (milimetric) to get those two lines to line up :))
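The newline behavior can be reproduced in Python by translating the two LIKE dialects into regexes (an assumption for illustration: Spark's % behaves like a DOTALL wildcard, Hive's like plain `.*`):

```python
import re

text = "#redirect [[blah\n]] blah"

# Spark-style LIKE: '%' crosses newlines, emulated with (?s:.*).
spark_like = re.compile(r"#redirect \[\[(?s:.*)\]\](?s:.*)")

# Hive-style LIKE: '%' stops at newlines, emulated with plain .*
hive_like = re.compile(r"#redirect \[\[.*\]\].*")

assert spark_like.fullmatch(text) is not None  # Spark: true
assert hive_like.fullmatch(text) is None       # Hive: false
```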