Provide a spark job processing history and text to extract citations diffs
Open, Needs Triage, Public

Assigned To

None

Authored By

	JAllemandou
	Feb 23 2017, 7:22 PM

Description

Create a Spark job replicating @Halfak's citation extraction [1] over wikitext.

[1] https://github.com/mediawiki-utilities/python-mwrefs

Event Timeline

JAllemandou created this task. Feb 23 2017, 7:22 PM

Restricted Application added a subscriber: Aklapper. Feb 23 2017, 7:22 PM

I produced this dataset with mwrefs for comparison to the spark job: https://datasets.wikimedia.org/public-datasets/all/wp10/20170101/frwiki-20170101.diffs.json.bz2

Comparing the two datasets (in spark-shell):

spark.read.json("/user/joal/frwiki-20170101.diffs.json.bz2").createOrReplaceTempView("halfs")
spark.read.json("/user/joal/wmf/data/wmf/mediawiki_fr/citations").createOrReplaceTempView("joals")

spark.table("halfs").cache()
spark.table("joals").cache()

spark.sql("SELECT COUNT(1) FROM halfs WHERE SIZE(references_added) > 0 OR SIZE(references_removed) > 0").collect
spark.sql("SELECT COUNT(1) FROM joals WHERE SIZE(references_added) > 0 OR SIZE(references_removed) > 0").collect

dataset	revisions with reference changes	share of halfs
halfs	7164493	100%
joals	7122404	99.41%

spark.sql("""
SELECT
  -- Sort the arrays so the comparison ignores element order
  (sort_array(h.references_added) = sort_array(j.references_added)) as added_equals,
  (sort_array(h.references_removed) = sort_array(j.references_removed)) as removed_equals,
  COUNT(1) as c
FROM joals j
  JOIN halfs h
    ON (j.rev_id = h.rev_id)
GROUP BY
  (sort_array(h.references_added) = sort_array(j.references_added)),
  (sort_array(h.references_removed) = sort_array(j.references_removed))
""").collect.foreach(println)

added_equals	removed_equals	c	share
false	false	2509	0.04%
true	false	2314	0.03%
false	true	80029	1.12%
true	true	7041219	98.81%
Total		7126071	100%
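
For readers without access to the cluster, the grouping logic of the query above can be mirrored on plain Scala collections. This is only a minimal sketch under stated assumptions: `RefDiff`, `sameRefs`, `compare` and `percent` are hypothetical names introduced for illustration, references are assumed to be plain strings, and `rev_id` is assumed to be unique within each dataset. It is not the Spark job itself.

```scala
// Hypothetical in-memory stand-in for one row of either dataset.
final case class RefDiff(revId: Long, added: Seq[String], removed: Seq[String])

// Order-insensitive equality, mirroring sort_array(a) = sort_array(b).
def sameRefs(a: Seq[String], b: Seq[String]): Boolean = a.sorted == b.sorted

// Inner join on revId (assumed unique per dataset), then count the rows
// falling into each (addedEquals, removedEquals) combination.
def compare(halfs: Seq[RefDiff], joals: Seq[RefDiff]): Map[(Boolean, Boolean), Int] = {
  val halfsById = halfs.map(h => h.revId -> h).toMap
  joals
    .flatMap(j => halfsById.get(j.revId).map(h => (h, j)))
    .groupBy { case (h, j) => (sameRefs(h.added, j.added), sameRefs(h.removed, j.removed)) }
    .map { case (key, rows) => key -> rows.size }
}

// Share of a total in percent, rounded to two decimals as in the tables.
def percent(count: Long, total: Long): Double =
  math.round(count * 10000.0 / total) / 100.0

percent(7122404L, 7164493L) // 99.41
```

Sorting before comparing matters because the two extractors may emit the same references in a different order; without it, such rows would be counted as mismatches.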

Remark: the joined dataset contains 7126071 - 7122404 = 3667 rows that have no added or removed references in the joal dataset (so they are not included in the first count) and still differ from Halfak's.
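
One way to look at the rows this remark refers to is to run the same join and keep only the revisions where the joal dataset reports no reference change. The query below is a sketch for spark-shell that assumes the `halfs` and `joals` temp views registered earlier are still available. The `COALESCE` guard (treating a null array like an empty one) and the `LIMIT 20` are my own assumptions, not part of the original analysis.

```scala
spark.sql("""
SELECT
  j.rev_id,
  h.references_added AS halfs_added,
  h.references_removed AS halfs_removed
FROM joals j
  JOIN halfs h
    ON (j.rev_id = h.rev_id)
-- Negation of the filter used in the first count, applied to joals only
WHERE COALESCE(SIZE(j.references_added), 0) <= 0
  AND COALESCE(SIZE(j.references_removed), 0) <= 0
LIMIT 20
""").show(false)
```

Dropping the `LIMIT` and replacing the column list with `COUNT(1)` should give the number of such rows, which can then be checked against the 3667 figure.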

Nuria moved this task from Incoming to Radar on the Analytics board. Mar 2 2017, 5:26 PM

leila edited projects, added Research-Freezer; removed Research. Jul 11 2019, 12:20 AM

Removing myself as assignee - will get back to this when we tackle text-processing intermediate datasets.

JAllemandou removed JAllemandou as the assignee of this task. Mar 19 2020, 1:42 PM

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:33 AM

Aklapper added a project: Data-Engineering-Icebox. Feb 10 2023, 5:44 PM
