Provide edit tags in the Data Lake edit data
Closed, Resolved | Public | 5 Estimated Story Points

Description

It would be very useful to have edit tags in the Data Lake. For example, in Editing, we use these tags to determine which platform was used to make an edit (e.g. the mobile web editor, the mobile app editor, the visual editor), which is extremely useful for analyzing editor behavior on a specific platform (for example, are mobile edits reverted more often than desktop edits?).

A couple things to note:

  • Tags can be changed at any time by administrators.
  • Some wikis may have tags with the same name but different meanings (for example, "matched abuse filter 34" would have a different meaning on each wiki, although in practice most wikis seem to give abuse filter tags descriptive names that wouldn't need to be disambiguated).
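
As a rough sketch of the platform analysis described above, tags could be mapped to an editing platform along these lines. The tag names (`mobile app edit`, `mobile web edit`, `visualeditor`) are assumed to be the ones in use on the wiki being analyzed. The `platformOf` and `qualify` helpers, the platform labels, and the precedence order are all hypothetical and only for illustration. Pairing each tag with its wiki is one possible way to handle tag names whose meaning differs between wikis.

```scala
// Illustrative sketch only: these helpers and labels are hypothetical,
// not part of MediaWiki or of any Data Lake API.
object TagPlatforms {
  // Classify one edit from its list of tags.
  // The precedence (app, then mobile web, then visual editor) is an assumption.
  def platformOf(tags: Seq[String]): String = {
    val t = tags.toSet
    if (t.contains("mobile app edit")) "mobile app"
    else if (t.contains("mobile web edit")) "mobile web"
    else if (t.contains("visualeditor")) "desktop visual editor"
    else "other"
  }

  // A tag name alone may be ambiguous across wikis, so keep the wiki with it.
  final case class WikiTag(wikiDb: String, name: String)

  def qualify(wikiDb: String, tags: Seq[String]): Seq[WikiTag] =
    tags.map(WikiTag(wikiDb, _))
}
```

For example, `TagPlatforms.platformOf(Seq("mobile edit", "mobile web edit"))` yields `"mobile web"`. Since tags can be changed by administrators, any such mapping would need to be revisited over time.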

Event Timeline

Is this data available in MediaWiki? It used to be that tags were linked to how users registered rather than to where the edit happened; is that now fixed?

Nuria triaged this task as Medium priority. Mar 27 2017, 3:43 PM
Nuria moved this task from Incoming to Dashiki on the Analytics board.

Is this data available in MediaWiki?

Yes, it's available in the change_tag and tag_summary tables.

It used to be that tags were linked to how users registered rather than to where the edit happened; is that now fixed?

I don't think that was ever the case. I know researchers have sometimes used whether an editor registered on a mobile device (which they knew from the ServerSideAccountCreation logs) as a proxy for being a mobile editor, but edit tags, linked to specific revisions, have been around for at least as long as I've been at the WMF (2 years).

OK, we will keep this one in mind to add once the data is being populated on a recurring schedule without issues.

RFC about tags: https://phabricator.wikimedia.org/T185355. We probably want the new schema to be stable before doing these imports.

mforns raised the priority of this task from Medium to High. Apr 19 2018, 4:56 PM
mforns lowered the priority of this task from High to Medium. Oct 8 2018, 4:13 PM
mforns raised the priority of this task from Medium to Needs Triage. Oct 8 2018, 4:13 PM
mforns moved this task from Smart Tools for Better Data to Blocked on the Analytics board.

For the record, I actually think this is unblocked now!

As I commented in T205940:

There is a refactor of the change_tag tables underway (T185355), but the new ct_tag_id columns and change_tag_def tables are actually already being written, so as long as we avoid select * from change_tag we won't have to change the workflow again.

So, for example, this query should now work permanently:

SELECT
    database(),
    ct_rev_id,
    ct_tag_id,
    ctd_name
FROM change_tag
LEFT JOIN change_tag_def
    ON ct_tag_id = ctd_id;

I think you are talking about https://phabricator.wikimedia.org/T205940, which is just adding the tag tables to the Data Lake, and you are right that it is not blocked. To have tag info per revision, I think we probably want to make sure the refactor is finished.

FYI, the code to add edit tags to the Data Lake is in our scripts but disabled so far due to performance issues. Once we resolve more pressing issues with the sqooping of data that surfaced after the refactor of the comment table, we will go back to looking into the change tags performance problems. cc @Milimetric @JAllemandou

JAllemandou triaged this task as High priority.
JAllemandou moved this task from Blocked to Smart Tools for Better Data on the Analytics board.
JAllemandou added a project: Analytics-Kanban.
JAllemandou set the point value for this task to 5.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.

Ping @JAllemandou: the task is in CR but there is no patch. Maybe we need to push to Gerrit or update the bug in the commit?

Just FYI: not having this has created a slight problem with the February board metrics (T218055), since I could no longer use the change tags from dbstore1002 and I had been counting on this as a replacement. It's my fault for not thinking about that when I agreed to the dbstore1002 shutdown, but do let me know as soon as possible if there's a risk this won't be ready for the March snapshot either 😁

The change tag tables (not as part of MediaWiki history) are sqooped into Hadoop.

See:
select count(*) from mediawiki_change_tag where wiki_db="eswiki" and snapshot="2019-02";

So even if change tags are not incorporated into the March snapshot, they are available in Hadoop and sqooped monthly, so you could still calculate metrics. I would certainly plan on using the sqooped data as a backup; we hope change tags will be available, but things happen.

Thanks, good to know.

Update: the March snapshot will trigger on April 1st, but the "official" version will still not have change tags in the denormalized data. We will be trying some data quality fixes plus the change tags addition on another "testing snapshot" that will also be available around that time. You are welcome to try the test snapshot and let us know if you see any obvious problems; @JAllemandou will also be testing some quality fixes.

We are doing it this way because we need to update our data guards for the snapshots, and we do not feel 100% confident we can get that done in time for the March snapshot.

Thanks for the update! This sounds like a sensible plan which gives us access to new data but protects data quality.

Hi @Neil_P._Quinn_WMF - Test data is available :)
Here is an example of accessing it in scala-spark2:

val history = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
history.where("event_entity = 'revision' and wiki_db = 'enwiki' and revision_tags is not null and size(revision_tags) > 0").select("event_timestamp", "revision_id", "revision_tags").show(100, false)

+---------------------+-----------+---------------------------------------------------+
|event_timestamp      |revision_id|revision_tags                                      |
+---------------------+-----------+---------------------------------------------------+
|2017-11-08 21:45:37.0|809401678  |[mobile edit, mobile web edit]                     |
|2010-01-05 17:52:03.0|336034032  |[autobiography]                                    |
|2019-01-26 08:40:10.0|880247468  |[mobile edit, mobile web edit, visualeditor]       |
|2019-01-25 21:00:06.0|880176047  |[mw-new-redirect]                                  |
|2019-01-26 23:02:59.0|880347157  |[mw-new-redirect]                                  |
|2017-04-12 01:38:11.0|775006690  |[visualeditor]                                     |
|2018-12-07 08:38:55.0|872463663  |[mw-new-redirect]                                  |
|2019-02-26 18:14:16.0|885216360  |[AWB]                                              |
|2013-05-12 11:12:55.0|554718296  |[possible libel or vandalism]                      |
|2018-02-13 16:33:48.0|825474766  |[OAuth CID: 542]                                   |
|2015-10-05 04:03:03.0|684192157  |[mobile edit, mobile web edit]                     |
|2009-06-02 00:32:23.0|293840945  |[removal of speedy deletion templates]             |
|2018-02-06 08:33:06.0|824262970  |[mobile edit, mobile web edit]                     |
|2017-01-06 15:53:19.0|758625268  |[huggle]                                           |
|2017-06-05 17:54:58.0|783958395  |[mobile edit, mobile web edit]                     |
|2015-09-04 01:38:21.0|679352344  |[mobile app edit, mobile edit]                     |
|2019-01-26 23:02:42.0|880347098  |[mw-changed-redirect-target]                       |
|2014-03-24 02:31:00.0|600972439  |[gettingstarted edit]                              |
|2014-02-23 09:03:25.0|596749381  |[Section blanking, gettingstarted edit]            |
|2016-02-04 18:05:25.0|703299985  |[mobile edit, mobile web edit]                     |
|2009-07-30 20:37:11.0|305145742  |[Nonsense movies?]                                 |
|2015-08-15 17:10:11.0|676236245  |[mobile app edit, mobile edit]                     |
|2018-11-20 00:49:38.0|869707620  |[massmessage-delivery]                             |
|2018-11-22 11:33:03.0|870102535  |[mobile edit, mobile web edit]                     |
|2017-06-02 20:37:05.0|783518016  |[visualeditor]                                     |
|2016-02-13 10:03:48.0|704748058  |[mobile edit, mobile web edit]                     |
|2014-12-04 22:11:37.0|636670267  |[HHVM]                                             |
|2016-04-06 01:56:08.0|713822860  |[visualeditor]                                     |
|2013-11-20 06:21:22.0|582486116  |[possible link spam]                               |
|2015-08-31 16:13:04.0|678783639  |[mobile edit, mobile web edit]                     |
|2016-06-07 02:58:05.0|724092681  |[mobile edit, mobile web edit]                     |
|2014-02-19 09:12:51.0|596156740  |[very short new article]                           |
|2018-08-13 18:21:06.0|854778279  |[mobile edit, mobile web edit]                     |
|2018-05-12 15:39:01.0|840848927  |[mobile edit, mobile web edit]                     |
|2015-01-10 19:57:49.0|641912763  |[mobile edit, mobile web edit]                     |
|2017-05-24 17:49:49.0|782051054  |[visualeditor]                                     |
|2010-07-07 18:51:18.0|372257612  |[large unwikified new article]                     |
|2017-05-22 18:44:57.0|781698488  |[mobile edit, mobile web edit]                     |
|2010-01-05 12:35:13.0|335989761  |[very short new article]                           |
|2016-06-20 21:01:22.0|726225213  |[visualeditor]                                     |
|2018-08-21 00:04:36.0|855807442  |[visualeditor]                                     |
|2018-07-10 11:32:14.0|849644629  |[mw-new-redirect]                                  |
|2011-12-23 21:09:34.0|467397085  |[references removed]                               |
|2018-05-09 14:25:55.0|840379905  |[mobile edit, mobile web edit]                     |
|2017-12-03 19:56:49.0|813466652  |[massmessage-delivery]                             |
|2016-04-01 19:21:54.0|713069025  |[visualeditor]                                     |
|2016-12-10 19:05:31.0|754077996  |[canned edit summary]                              |
|2018-12-02 22:15:38.0|871702615  |[mw-new-redirect]                                  |
|2014-12-06 13:20:51.0|636882492  |[HHVM]                                             |
|2017-02-04 17:52:54.0|763689514  |[canned edit summary, mobile edit, mobile web edit]|
|2016-05-09 15:20:52.0|719416586  |[mobile edit, mobile web edit]                     |
|2017-04-28 03:53:57.0|777600327  |[visualeditor-switched]                            |
|2018-01-23 21:12:52.0|821999755  |[mobile edit, mobile web edit]                     |
|2017-04-28 00:40:36.0|777577987  |[de-userfying]                                     |
|2018-05-21 04:55:55.0|842240239  |[mobile edit, mobile web edit]                     |
|2014-12-08 11:37:15.0|637154102  |[HHVM]                                             |
|2017-02-27 12:07:05.0|767695441  |[references removed]                               |
|2019-03-09 19:37:46.0|886972371  |[php7, visualeditor-wikitext]                      |
|2019-01-15 11:46:47.0|878539372  |[visualeditor]                                     |
|2018-11-30 12:37:41.0|871343816  |[mw-new-redirect]                                  |
|2014-11-11 12:59:08.0|633375491  |[HHVM]                                             |
|2018-11-22 13:31:59.0|870112586  |[mw-new-redirect]                                  |
|2018-11-04 15:27:54.0|867241864  |[mw-rollback]                                      |
|2019-01-23 15:17:47.0|879812067  |[mobile edit, mobile web edit]                     |
|2017-03-24 18:22:56.0|771996155  |[mobile edit, mobile web edit]                     |
|2017-07-09 20:41:28.0|789823393  |[mobile edit, mobile web edit]                     |
|2017-07-06 11:38:21.0|789269180  |[mobile edit, mobile web edit]                     |
|2016-09-01 18:32:23.0|737271792  |[visualeditor]                                     |
|2018-07-02 13:21:23.0|848521846  |[mw-new-redirect]                                  |
|2014-11-23 18:59:05.0|635133128  |[HHVM]                                             |
|2018-07-03 04:37:54.0|848621980  |[visualeditor]                                     |
|2015-12-20 21:16:43.0|696089753  |[wikilove]                                         |
|2010-09-28 15:53:44.0|387550310  |[references removed]                               |
|2018-11-20 01:03:55.0|869724239  |[massmessage-delivery]                             |
|2014-11-23 12:44:26.0|635095281  |[HHVM]                                             |
|2010-03-31 06:49:43.0|353110558  |[very short new article]                           |
|2014-04-14 09:00:53.0|604131959  |[possible vandalism]                               |
|2018-06-23 01:53:34.0|847115693  |[references removed]                               |
|2015-10-27 17:57:04.0|687778391  |[mobile edit, mobile web edit]                     |
|2019-01-27 03:19:25.0|880380268  |[AWB]                                              |
|2018-11-25 18:59:41.0|870575771  |[AWB]                                              |
|2015-10-05 01:39:49.0|684174619  |[visualeditor]                                     |
|2018-03-05 14:24:02.0|828911205  |[mobile edit, mobile web edit, visualeditor]       |
|2018-11-04 18:16:08.0|867265355  |[mw-undo]                                          |
|2017-08-01 01:15:04.0|793321340  |[mobile edit, mobile web edit]                     |
|2019-01-10 20:01:24.0|877771707  |[mw-new-redirect]                                  |
|2015-11-22 20:43:57.0|691883011  |[huggle]                                           |
|2018-12-08 03:31:44.0|872618998  |[mobile edit, mobile web edit, visualeditor]       |
|2016-09-15 21:26:58.0|739625395  |[visualeditor]                                     |
|2017-06-27 10:06:33.0|787753511  |[mobile edit, mobile app edit]                     |
|2019-03-16 17:47:13.0|888062936  |[AWB]                                              |
|2016-01-12 03:07:45.0|699404108  |[mobile edit, mobile web edit]                     |
|2018-03-14 05:06:13.0|830335060  |[mobile edit, mobile web edit]                     |
|2016-02-12 17:28:53.0|704636871  |[visualeditor]                                     |
|2016-01-11 04:39:25.0|699246316  |[OAuth CID: 99]                                    |
|2015-06-09 05:14:06.0|666142037  |[mobile edit, mobile web edit]                     |
|2018-10-12 21:27:33.0|863765510  |[mw-new-redirect]                                  |
|2015-10-17 07:10:20.0|686136603  |[mobile edit, mobile web edit]                     |
|2016-10-11 15:46:14.0|743842598  |[visualeditor]                                     |
|2018-07-14 18:40:25.0|850250934  |[visualeditor]                                     |
+---------------------+-----------+---------------------------------------------------+

Please let me know if you need help on accessing this :)
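
For a quick look at which tags are most common, the per-revision tag arrays can be flattened and counted. The sketch below does this on plain Scala collections so it runs without a cluster, using four of the rows shown above. The `TagCounts` object and `countByTag` helper are hypothetical names chosen for illustration. On the real dataset, the same idea would presumably be expressed with Spark's `explode` on `revision_tags` followed by a group-by and count.

```scala
// Illustrative sketch only: an in-memory analogue of "explode the tag array,
// then group by tag and count". TagCounts and countByTag are hypothetical names.
object TagCounts {
  // rows: (revision_id, revision_tags). Returns the number of revisions per tag.
  def countByTag(rows: Seq[(Long, Seq[String])]): Map[String, Int] =
    rows
      .flatMap { case (_, tags) => tags.distinct } // one entry per (revision, tag)
      .groupBy(identity)
      .map { case (tag, occurrences) => tag -> occurrences.size }

  def main(args: Array[String]): Unit = {
    // A few rows copied from the sample output above.
    val sample = Seq(
      (809401678L, Seq("mobile edit", "mobile web edit")),
      (880247468L, Seq("mobile edit", "mobile web edit", "visualeditor")),
      (775006690L, Seq("visualeditor")),
      (679352344L, Seq("mobile app edit", "mobile edit"))
    )
    // Print tags from most to least frequent, ties broken alphabetically.
    countByTag(sample).toSeq
      .sortBy { case (tag, n) => (-n, tag) }
      .foreach { case (tag, n) => println(s"$tag\t$n") }
  }
}
```

Counting each tag at most once per revision is a design choice here; it keeps the result interpretable as "number of revisions carrying this tag".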