
Provide a Spark job that processes history and text to extract citation diffs
Open, Needs Triage, Public

Description

Create a Spark job replicating @Halfak's citation extraction [1] over wikitext.

[1] https://github.com/mediawiki-utilities/python-mwrefs

Event Timeline

Comparing the two datasets (in spark-shell):

spark.read.json("/user/joal/frwiki-20170101.diffs.json.bz2").createOrReplaceTempView("halfs")
spark.read.json("/user/joal/wmf/data/wmf/mediawiki_fr/citations").createOrReplaceTempView("joals")

spark.table("halfs").cache()
spark.table("joals").cache()

spark.sql("SELECT COUNT(1) FROM halfs WHERE SIZE(references_added) > 0 OR SIZE(references_removed) > 0").collect
spark.sql("SELECT COUNT(1) FROM joals WHERE SIZE(references_added) > 0 OR SIZE(references_removed) > 0").collect

dataset   revisions with reference changes   share of halfs
halfs     7164493                            100%
joals     7122404                            99.41%

spark.sql("""
SELECT
 -- Sort those arrays to compare them
  (sort_array(h.references_added) = sort_array(j.references_added)) as added_equals,
  (sort_array(h.references_removed) = sort_array(j.references_removed)) as removed_equals,
  COUNT(1) as c
FROM joals j
  JOIN halfs h
    ON (j.rev_id = h.rev_id)
GROUP BY
  (sort_array(h.references_added) = sort_array(j.references_added)),
  (sort_array(h.references_removed) = sort_array(j.references_removed))
""").collect.foreach(println)

added_equals   removed_equals   c         share
false          false            2509      0.04%
true           false            2314      0.03%
false          true             80029     1.12%
true           true             7041219   98.81%
Total                           7126071   100%
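
For readers without access to the two datasets, the comparison logic of the query above can be sketched in plain Scala collections. This is only an illustration: the `Joined` case class, the `sameRefs` and `buckets` helpers, and the three sample rows are hypothetical and not taken from either dataset. Sorting both sides before comparing makes the equality order-insensitive while still treating duplicated references as different, which is what comparing `sort_array(...)` results does in the SQL.

```scala
object SortedCompareSketch {
  // Hypothetical stand-in for one joined row (h = halfs side, j = joals side).
  final case class Joined(
      revId: Long,
      hAdded: Seq[String],
      jAdded: Seq[String],
      hRemoved: Seq[String],
      jRemoved: Seq[String])

  // Order-insensitive equality: sort both sides, then compare element by element.
  def sameRefs(a: Seq[String], b: Seq[String]): Boolean = a.sorted == b.sorted

  // Count rows per (added_equals, removed_equals) pair, like the GROUP BY in the query.
  def buckets(rows: Seq[Joined]): Map[(Boolean, Boolean), Int] =
    rows
      .groupBy(r => (sameRefs(r.hAdded, r.jAdded), sameRefs(r.hRemoved, r.jRemoved)))
      .map { case (key, group) => key -> group.size }

  // Invented sample rows, for illustration only.
  val sample: Seq[Joined] = Seq(
    // Same added references in a different order: counted as equal.
    Joined(1L, Seq("<ref>a</ref>", "<ref>b</ref>"), Seq("<ref>b</ref>", "<ref>a</ref>"), Seq(), Seq()),
    // Added reference missing on the joals side.
    Joined(2L, Seq("<ref>a</ref>"), Seq(), Seq(), Seq()),
    // Removed references differ.
    Joined(3L, Seq(), Seq(), Seq("<ref>c</ref>"), Seq("<ref>d</ref>"))
  )
}
```

Calling `SortedCompareSketch.buckets(SortedCompareSketch.sample)` puts one row in each of the (true, true), (false, true) and (true, false) buckets.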

Remark: there are 7126071 - 7122404 = 3667 rows in the joined dataset that have no added or removed references in the joal dataset (so they are not counted in the first query) and still differ from the halfak ones.
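
The figures reported in this comment can be cross-checked with a few lines of plain Scala. All counts below are copied from the results above; the object and value names and the two-decimal rounding in `pct` are assumptions made for this sketch.

```scala
object ReportedCounts {
  // Rows with at least one added or removed reference, per dataset (first two queries).
  val halfsWithChanges = 7164493L
  val joalsWithChanges = 7122404L

  // (added_equals, removed_equals) -> row count, from the joined comparison.
  val buckets: Map[(Boolean, Boolean), Long] = Map(
    (false, false) -> 2509L,
    (true, false)  -> 2314L,
    (false, true)  -> 80029L,
    (true, true)   -> 7041219L
  )

  // Total number of joined rows, as the sum of the four buckets.
  val joinedTotal: Long = buckets.values.sum

  // Percentage of `part` in `whole`, rounded to two decimals (assumed rounding).
  def pct(part: Long, whole: Long): Double =
    math.round(part.toDouble / whole * 10000) / 100.0

  // Difference taken in the remark: joined rows minus joals rows with changes.
  val surplusJoinedRows: Long = joinedTotal - joalsWithChanges
}
```

With these inputs, `ReportedCounts.joinedTotal` is 7126071 and `ReportedCounts.surplusJoinedRows` is 3667, matching the numbers in the remark.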

Removing myself as assignee - will get back to this when we tackle text-processing intermediate datasets.