Create a Spark job replicating @Halfak's citation extraction[1] over wikitext.
Description
Event Timeline
I produced this dataset with mwrefs for comparison to the spark job: https://datasets.wikimedia.org/public-datasets/all/wp10/20170101/frwiki-20170101.diffs.json.bz2
Comparing the two datasets (in spark-shell):
spark.read.json("/user/joal/frwiki-20170101.diffs.json.bz2").createOrReplaceTempView("halfs")
spark.read.json("/user/joal/wmf/data/wmf/mediawiki_fr/citations").createOrReplaceTempView("joals")
spark.table("halfs").cache()
spark.table("joals").cache()
spark.sql("SELECT COUNT(1) FROM halfs WHERE SIZE(references_added) > 0 OR SIZE(references_removed) > 0").collect
spark.sql("SELECT COUNT(1) FROM joals WHERE SIZE(references_added) > 0 OR SIZE(references_removed) > 0").collect
| dataset | rows with references | % of halfs |
| halfs | 7164493 | 100% |
| joals | 7122404 | 99.41% |
spark.sql(""" SELECT -- Sort those arrays to compare them (sort_array(h.references_added) = sort_array(j.references_added)) as added_equals, (sort_array(h.references_removed) = sort_array(j.references_removed)) as removed_equals, COUNT(1) as c FROM joals j JOIN halfs h ON (j.rev_id = h.rev_id) GROUP BY (sort_array(h.references_added) = sort_array(j.references_added)), (sort_array(h.references_removed) = sort_array(j.references_removed)) """).collect.foreach(println)
| added_equals | removed_equals | count | % |
| false | false | 2509 | 0.04% |
| true | false | 2314 | 0.03% |
| false | true | 80029 | 1.12% |
| true | true | 7041219 | 98.81% |
| Total | | 7126071 | 100% |
Remark: There are 7126071 - 7122404 = 3667 rows in the joined dataset that have no added/removed references in the joal dataset (so they are absent from the first count) and yet still differ from the halfak rows.
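A follow-up query could isolate those rows directly. This is only a sketch, assuming the same halfs/joals temp views are still registered and the column names used above; note that Spark's SIZE() returns -1 for a NULL array, hence the <= 0 comparison:

```scala
// Sketch: joined rows where joals recorded no reference changes
// but halfs recorded some (candidates for the 3667 discrepancy).
spark.sql("""
  SELECT h.rev_id, h.references_added, h.references_removed
  FROM joals j
    JOIN halfs h ON (j.rev_id = h.rev_id)
  WHERE SIZE(j.references_added) <= 0
    AND SIZE(j.references_removed) <= 0
    AND (SIZE(h.references_added) > 0 OR SIZE(h.references_removed) > 0)
""").show(10, false)
```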
Removing myself as assignee - will get back to this when we tackle text-processing intermediate datasets.