Exposing revIDs (nothing more) of deleted/suppressed edits for research to respect their removal
Closed, Resolved, Public

Description

Context: We've developed a Wikipedia discussion corpus of the complete history of conversational actions (https://arxiv.org/abs/1810.13181)

We'd like to extend this to a 'live' corpus derived from Wikipedia talk pages. We think this can provide valuable signal for abuse and harassment detection, something multiple teams at the Wikimedia Foundation and NDA'ed research collaborators have been working on.

We would like to explore the technical feasibility and the security implications of exposing deleted or suppressed revIDs through the MediaWiki API event stream. We don't want any more information about suppressed and deleted revisions, just the ID when they are removed, so that we can remove the corresponding revisions from our corpora and from any copies we might have.

This might look something like a new kind of entry in the "recentchange" event stream, along these lines:

{
  "type": "deletion",
  "data": {
    "revision": {
      "deleted": <revision id>
    }
  }
}
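
As a rough illustration of the consumer side, the sketch below applies an event of the proposed shape to a local corpus. The event layout is only the proposal above, not an existing stream schema, and the corpus structure (a dict keyed by revision ID) and the function name are hypothetical.

```python
from typing import Any, Dict, Optional


def apply_deletion_event(corpus: Dict[int, Any], event: Dict[str, Any]) -> Optional[int]:
    """Drop the revision named by a proposed-format deletion event.

    Returns the revision ID named by the event, or None if the event is
    not a deletion event or carries no revision ID.
    """
    if event.get("type") != "deletion":
        return None
    revision = event.get("data", {}).get("revision", {})
    rev_id = revision.get("deleted")
    if rev_id is None:
        return None
    rev_id = int(rev_id)
    # pop() with a default keeps this idempotent: replaying the same
    # event, or seeing an ID we never stored, is not an error.
    corpus.pop(rev_id, None)
    return rev_id


# Hypothetical corpus and event, for illustration only.
corpus = {101: "first comment", 102: "second comment"}
event = {"type": "deletion", "data": {"revision": {"deleted": 102}}}
removed = apply_deletion_event(corpus, event)
```

Only the revision ID is needed for this to work, which is the point of the request: the consumer never has to see what was removed.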

cc: @Jalexander @APalmer_WMF @leila @Iislucas @PEarleyWMF @jrbs @DarTar @JBennett

Event Timeline

This information is already exposed on public tool lab replicas (there was some talk of hiding it for oversighted revs but nothing came of it)

Cool, thanks! Can you share links/info for that? No one else I asked seems to have been aware of this tool or how it is exposed. In particular, how can it be used to get a list of the deleted and suppressed revision IDs?

If you go to https://quarry.wmflabs.org you can run something like select * from archive limit 10. For deleted revisions, look at the archive table. Revdel'd/suppressed revisions will be in either the revision table or the archive table, depending on whether they have also been normally deleted.

Thanks for the pointer! I have a couple of questions about it:

  1. There's a data field named rev_deleted, is there a separate field indicating if the revision is suppressed?
  2. Is there a way to access Quarry through API?

Thanks,
Yiqing

There's a data field named rev_deleted, is there a separate field indicating if the revision is suppressed?

rev_deleted/ar_deleted specify revision delete/oversight. That is different from normal deletion, which is a matter of revision table vs archive table. See https://www.mediawiki.org/wiki/Manual:Archive_table https://www.mediawiki.org/wiki/Manual:Revision_table https://www.mediawiki.org/wiki/Bitfields_for_rev_deleted

Is there a way to access Quarry through API?

No. You can get direct access using the command line mysql client if you sign up for toolforge ( https://wikitech.wikimedia.org/wiki/Portal:Toolforge )

There are also "research" SQL replicas, which do not have any redaction. In certain circumstances (and with an NDA) researchers can be granted access to them, but that requires a "formal collaboration" or something similar (I don't know much about that; you'd have to ask the research people at WMF): https://wikitech.wikimedia.org/wiki/Analytics/Data_access

See also https://meta.wikimedia.org/wiki/Research:Data

Thanks, very helpful!

A couple more questions as we're making some progress... !

When we try and use https://quarry.wmflabs.org to find suppressed revisions, or even count them, we hit timeouts. Can I get access to a SQL research replica? I'm already under NDA, and both Yiqing and myself are part of this project: https://meta.wikimedia.org/wiki/Research:Detox and https://meta.wikimedia.org/wiki/Research:Study_of_harassment_and_its_impact

Are the archive and revisions tables exported anywhere we can download and process ourselves? I looked at https://dumps.wikimedia.org/backup-index.html and saw dumps like enwiki-20180801-pages-meta-history1.xml-p10p2087.7z but I'm not sure if that contains meta-data about deleted revisions, do you know? I was looking for the schema/details for these dumps but having a hard time finding it. Do you know where that is also? I also found the log events: https://www.mediawiki.org/wiki/API:Logevents which I think there are dumps of: e.g. https://dumps.wikimedia.org/enwiki/20180720/enwiki-20180720-pages-logging.xml.gz I see it has some kind of delete events, but not sure how they relate to revision delete bits... ?

Thanks!

Some context on the queries I'm trying:

Count all kinds of deletions:
https://quarry.wmflabs.org/query/28880

Count just the suppressed entries:
https://quarry.wmflabs.org/query/28881

w.r.t. the more general problem we're trying to solve: it looks to me like none of these existing tools would work. An independent project needs a pub-sub of the delete events (with the delete flags), so that it can propagate the deletions to its (processed) copy of the data. I don't see a way to do that with these tools. Does that sound right to you too, @Bawolff?

OK, I think I may have figured it out...

The logs XML dumps have the data we need: e.g. https://dumps.wikimedia.org/enwiki/20180701/enwiki-20180701-pages-logging.xml.gz (2.6 GB)

In particular, from: enwiki-20180701-pages-logging15.xml

  <logitem>
    <id>47481515</id>
    <timestamp>2013-02-19T00:53:48Z</timestamp>
    <contributor>
      <username>Addshore</username>
      <id>642191</id>
    </contributor>
    <comment>User edited while logged-out, revealing IP</comment>
    <type>delete</type>
    <action>revision</action>
    <logtitle>Wikipedia:Administrators' noticeboard/Incidents</logtitle>
    <params xml:space="preserve">revision
533380088
ofield=0
nfield=5</params>
  </logitem>

Can be matched to:

SELECT * FROM revision WHERE rev_id = 533380088

rev_page  rev_id     rev_comment             rev_deleted
5137507   533380088  /* Vandalism log */ re  13

So the ofield=0 and nfield=5 correspond to the rev_deleted field in the revision table.

Is that sounding right?

When we try and use https://quarry.wmflabs.org to find suppressed revisions, or even count them, we hit timeouts. Can I get access to a SQL research replica? I'm already under NDA, and both Yiqing and myself are part of this project: https://meta.wikimedia.org/wiki/Research:Detox and https://meta.wikimedia.org/wiki/Research:Study_of_harassment_and_its_impact

You don't need research replica access to get around timeouts. Tool Labs access provides command-line access to the public DB replicas without the timeouts of Quarry. If you still think you need research access, you will have to talk to someone in the research department; it's not something I can help you with.

Are the archive and revisions tables exported anywhere we can download and process ourselves? I looked at https://dumps.wikimedia.org/backup-index.html and saw dumps like enwiki-20180801-pages-meta-history1.xml-p10p2087.7z but I'm not sure if that contains meta-data about deleted revisions, do you know? I was looking for the schema/details for these dumps but having a hard time finding it. Do you know where that is also? I also found the log events: https://www.mediawiki.org/wiki/API:Logevents which I think there are dumps of: e.g. https://dumps.wikimedia.org/enwiki/20180720/enwiki-20180720-pages-logging.xml.gz I see it has some kind of delete events, but not sure how they relate to revision delete bits... ?

I believe there are XSD schema files for these dumps in the docs directory of the mediawiki git repo. There are other docs on wikitech.wikimedia.org. There is also a mailing list for help with those dumps.

The archive table is not dumped. I believe revisions hidden with normal revdelete, including suppressed ones (but not revisions of actually deleted pages), are included in the dumps in redacted form.

Logging/logevents corresponds to https://en.wikipedia.org/wiki/special:log . Deletion actions are logged there but sometimes won't show up if redacted. This is especially true for special:log/suppress, which corresponds to "oversight" actions. You need access to the research replicas to see these actions.

Some context on the queries I'm trying:

Count all kinds of deletions:
https://quarry.wmflabs.org/query/28880

Count just the suppressed entries:
https://quarry.wmflabs.org/query/28881

Yes, those queries have to look at hundreds of millions of rows, so they will be slow. You would need to get command-line access to Toolforge to run something like that. Quarry could only process something like that over a limited range (e.g. WHERE rev_id BETWEEN 100000 AND 150000)
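
One way to work within that limit is to split the rev_id space into windows and run one range-limited query per window. The sketch below only builds the query strings; the window size, the helper names, and the exact SELECT are illustrative assumptions, not the contents of the Quarry queries linked above.

```python
from typing import Iterator, Tuple


def rev_id_ranges(start: int, stop: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (low, high) rev_id windows covering start..stop."""
    if step <= 0:
        raise ValueError("step must be positive")
    low = start
    while low <= stop:
        high = min(low + step - 1, stop)
        yield low, high
        low = high + 1


def count_query(low: int, high: int) -> str:
    """Build one range-limited query; BETWEEN is inclusive on both ends."""
    # The bounds are ints produced locally, so formatting them in is safe here.
    return (
        "SELECT rev_deleted, COUNT(*) FROM revision "
        f"WHERE rev_deleted != 0 AND rev_id BETWEEN {low} AND {high} "
        "GROUP BY rev_deleted"
    )


# Hypothetical bounds and window size, for illustration only.
queries = [count_query(lo, hi) for lo, hi in rev_id_ranges(100000, 250000, 50000)]
```

Because the windows do not overlap, the per-window counts can simply be summed afterwards.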

w.r.t. the more general problem we're trying to solve: it looks to me like none of these existing tools would work. An independent project needs a pub-sub of the delete events (with the delete flags), so that it can propagate the deletions to its (processed) copy of the data. I don't see a way to do that with these tools. Does that sound right to you too, @Bawolff?

I think what you are looking for is https://wikitech.wikimedia.org/wiki/EventStreams . Afaik it includes all deletions.
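
EventStreams is served as Server-Sent Events, so a consumer mostly needs to turn "data:" lines into JSON objects. The following is a minimal sketch of that parsing step using only the standard library; the function name and the sample lines are made up for illustration, and real clients would more likely use an existing SSE library.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield JSON payloads from a Server-Sent Events text stream.

    In the SSE format an event's payload is its "data:" lines joined
    with newlines, and a blank line ends the event.
    """
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # A single optional space after the colon is not part of the value.
            data_parts.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data_parts:
            payload = "\n".join(data_parts)
            data_parts = []
            try:
                yield json.loads(payload)
            except json.JSONDecodeError:
                continue  # skip payloads that are not valid JSON
        # Other fields ("event:", "id:") and ":" comments are ignored here.


# Made-up sample lines in SSE framing, for illustration only.
sample = [
    "event: message\n",
    'data: {"type": "log", "log_type": "delete", "wiki": "enwiki"}\n',
    "\n",
    ": keep-alive comment\n",
    "\n",
    'data: {"type": "edit", "wiki": "commonswiki"}\n',
    "\n",
]
events = list(iter_sse_events(sample))
```

In practice the lines would come from a long-lived HTTP response for the recentchange stream described on the EventStreams page, decoded to text line by line.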

So the ofield=0 and nfield=5 correspond to: rev_deleted field in revisions.

Is that sounding right?

Yes (ofield = old field, nfield = new field). Keep in mind that for some log types the params field has changed over time. I'm not sure if revdelete is one of the logging types whose param format changed over time, but it might be, so you should check both old log entries and recent ones when trying to figure out the format.
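
For the newline-separated layout seen in the dump example (target type, then IDs, then ofield= and nfield= lines), a parser could look roughly like this. The function name is hypothetical, treating the ID line as comma-separated is an assumption, and, as noted, older or newer entries may use a different format that this sketch does not handle.

```python
from typing import Any, Dict


def parse_legacy_revdel_params(raw: str) -> Dict[str, Any]:
    """Parse newline-separated revision-delete params from a logging dump.

    Expected layout (from the dump example in this thread): target type,
    a line of revision IDs, then key=value lines such as ofield=0.
    """
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("unrecognised params format")
    result: Dict[str, Any] = {
        "type": lines[0],
        # Assumption: multiple IDs, if present, are comma-separated.
        "ids": [int(x) for x in lines[1].split(",") if x.strip()],
    }
    for line in lines[2:]:
        key, sep, value = line.partition("=")
        if sep and key in ("ofield", "nfield"):
            result[key] = int(value)
    return result


params = parse_legacy_revdel_params("revision\n533380088\nofield=0\nnfield=5")
```

Raising on anything unexpected, rather than guessing, makes format changes between old and recent log entries visible early.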

In the research replica there is also the log_search table, which helps simplify joins with log events, but this isn't available in the public replica (the public replica only shows the logging table with entries where log_type matches a whitelist, per http://tools.wmflabs.org/tools-info/schemas.php?schema=views).

Thanks! Managing to make those long queries work now.
Looks like the recent changes event stream is what we want, and it may include deletes. I'll do a test over the weekend.
Cheers!

OK, that was helpful, maybe making more progress!

Testing the recentchanges feed, I am able to monitor for log type events, pick out the ones that have a log_action of delete, and then filter those for ones that have some parameters specified (which might indicate suppression, although I haven't seen any yet). But I don't see how to find the revision they correspond to. E.g. in the following I don't see a revision ID specified in the delete event... ?

{ bot: false,
  comment: '',
  id: 1069755912,
  log_action: 'delete',
  log_action_comment:
   'Etzedek24 marked [[Wole Oluyemi]] for deletion with db-person, speedy deletion-advertising and speedy deletion-copyright violation tags',
  log_id: 91740930,
  log_params:
   { tags:
      [ 'db-person',
        'speedy deletion-advertising',
        'speedy deletion-copyright violation' ] },
  log_type: 'pagetriage-deletion',
  meta:
   { domain: 'en.wikipedia.org',
     dt: '2018-07-11T18:09:36+00:00',
     id: '93f7f1b5-8535-11e8-82c8-b083fecf10f1',
     request_id: 'a07ab709-0673-4f14-bce3-80b430e4b8cf',
     schema_uri: 'mediawiki/recentchange/2',
     topic: 'codfw.mediawiki.recentchange',
     uri: 'https://en.wikipedia.org/wiki/Wole_Oluyemi',
     partition: 0,
     offset: 27123857 },
  namespace: 0,
  parsedcomment: '',
  server_name: 'en.wikipedia.org',
  server_script_path: '/w',
  server_url: 'https://en.wikipedia.org',
  timestamp: 1531332576,
  title: 'Wole Oluyemi',
  type: 'log',
  user: 'Etzedek24',
  wiki: 'enwiki' }

I'm also struggling to find where the specification of log_params, or even the set of possible tags for it, is documented. Do you know by any chance where I should be looking?

I think log_type = pagetriage-deletion is not an actual delete, but means that a user has requested that an admin look into the page to check if it should be deleted. (log_action is really the sub-action and log_type is the main action.)

Ah, I see. So, from reading https://www.mediawiki.org/wiki/Manual:Logging_table#log_params, I think log_type should be delete, and then log_deleted would be set. I've been trying to test whether log_deleted is ever set, and it's hard to tell: I subscribed to the EventStream starting from the maximum age of the stream, and even after looking at all recorded events and watching for several days, log_deleted is never set. Do you know of a way to test it?

What I do find are deleted log events, with log_type as delete too, but still no mention of the revision being deleted:

{ bot: false,
  comment: 'No license since 3 July 2018',
  id: 1107268497,
  log_action: 'delete',
  log_action_comment:
   'deleted &quot;[[File:Wirth-Congress-culturagram-2018-congress.JPG]]&quot;: No license since 3 July 2018',
  log_id: 272882369,
  log_params: [],
  log_type: 'delete',
  meta:
   { domain: 'commons.wikimedia.org',
     dt: '2018-07-11T15:35:37+00:00',
     id: '11d40355-8520-11e8-becd-b083fecf03d7',
     request_id: '05405794-19d4-4f80-bd4f-cc195c36350a',
     schema_uri: 'mediawiki/recentchange/2',
     topic: 'eqiad.mediawiki.recentchange',
     uri:
      'https://commons.wikimedia.org/wiki/File:Wirth-Congress-culturagram-2018-congress.JPG',
     partition: 0,
     offset: 1001959900 },
  namespace: 6,
  parsedcomment: 'No license since 3 July 2018',
  server_name: 'commons.wikimedia.org',
  server_script_path: '/w',
  server_url: 'https://commons.wikimedia.org',
  timestamp: 1531323337,
  title: 'File:Wirth-Congress-culturagram-2018-congress.JPG',
  type: 'log',
  user: 'Jcb',
  wiki: 'commonswiki' }

So even when/if log_deleted is ever set, I don't see how to link it to the revision being deleted. I'm also a bit worried that log_deleted is not being set as I'd expect it to be for log_type = delete.

Thoughts?

log_deleted indicates whether the log entry itself is deleted, not whether the target of the log entry is. It controls whether or not the log entry shows up on http://en.wikipedia.org/wiki/special:log/delete (additionally, certain log types like suppress are never public regardless of what log_deleted is set to).

These log entries are for deleting a full page (as opposed to revision delete, which deletes a specific revision), so there's no specific revision associated: the action deletes all revisions of the page that are not already deleted.

You can link this to the list of revision IDs in Quarry (although there might be a delay of a couple of seconds before it shows up there): https://quarry.wmflabs.org/query/28931 . The ar_page_id field can help distinguish between different deletion instances when a page with the same name has been deleted multiple times. Additionally, in the case of images, there is some additional info in the filearchive table.

Oh, I see, thanks, I was too quick in reading the documentation - when it said that log_deleted was comparable to rev_deleted, I assumed it had the same bit values - but it's a much looser relationship I see now.

So I've now found: https://www.mediawiki.org/wiki/Manual:Log_actions which I think tells me that I should be looking for log_type='delete' and log_action='revision', and that the parameters field will have the revision ID. And indeed, when I do this I find what I'm after, e.g.

{
  "bot": false,
  "comment": "[[WP:RD5|RD5]]: Other valid deletion under [[WP:DEL#REASON|deletion policy]]",
  "id": 1071409437,
  "log_action": "revision",
  "log_action_comment": "Amorymeltzer changed visibility of a revision on page [[Inspirations for James Bond]]: content hidden and edit summary hidden: [[WP:RD5|RD5]]: Other valid deletion under [[WP:DEL#REASON|deletion policy]]",
  "log_id": 91856635,
  "log_params": {
    "ids": [
      "848570389"
    ],
    "nfield": 3,
    "ofield": 0,
    "type": "revision"
  },
  "log_type": "delete",
  "meta": {
    "domain": "en.wikipedia.org",
    "dt": "2018-07-17T10:46:12+00:00",
    "id": "9f6afedf-89ae-11e8-b167-b083fecf0fcd",
    "request_id": "36ec241c-13b9-40a1-9fd3-525415c603d0",
    "schema_uri": "mediawiki/recentchange/2",
    "topic": "eqiad.mediawiki.recentchange",
    "uri": "https://en.wikipedia.org/wiki/Inspirations_for_James_Bond",
    "partition": 0,
    "offset": 1011680517
  },
  "namespace": 0,
  "parsedcomment": "<a href=\"/wiki/Wikipedia:RD5\" class=\"mw-redirect\" title=\"Wikipedia:RD5\">RD5</a>: Other valid deletion under <a href=\"/wiki/Wikipedia:DEL#REASON\" class=\"mw-redirect\" title=\"Wikipedia:DEL\">deletion policy</a>",
  "server_name": "en.wikipedia.org",
  "server_script_path": "/w",
  "server_url": "https://en.wikipedia.org",
  "timestamp": 1531824372,
  "title": "Inspirations for James Bond",
  "type": "log",
  "user": "Amorymeltzer",
  "wiki": "enwiki"
}

Which specifies the revision id in log_params.ids[0]. And here I see the nfield and ofield set, so I think I understand what to do now, thanks again for the help!
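
Put together, the filter described here might look like the sketch below: keep only log events with log_type 'delete' and log_action 'revision', and read the IDs from log_params. The function name is hypothetical, and the two sample events are trimmed-down versions of the ones quoted in this thread.

```python
from typing import Any, Dict, List


def revdel_revision_ids(event: Dict[str, Any]) -> List[int]:
    """Return revision IDs targeted by a revision-delete log event, else []."""
    if event.get("type") != "log":
        return []
    if event.get("log_type") != "delete" or event.get("log_action") != "revision":
        return []
    params = event.get("log_params")
    # Page deletions in the stream carry log_params as an empty list, so
    # anything that is not a dict has no revision IDs to offer.
    if not isinstance(params, dict) or params.get("type") != "revision":
        return []
    # IDs arrive as strings in the stream; normalise them to ints.
    return [int(rev_id) for rev_id in params.get("ids", [])]


revdel_event = {
    "type": "log",
    "log_type": "delete",
    "log_action": "revision",
    "log_params": {"ids": ["848570389"], "nfield": 3, "ofield": 0, "type": "revision"},
    "wiki": "enwiki",
}
page_delete_event = {
    "type": "log",
    "log_type": "delete",
    "log_action": "delete",
    "log_params": [],
    "wiki": "commonswiki",
}
ids = revdel_revision_ids(revdel_event)
```

Page deletions and pagetriage-deletion entries fall through to the empty list, since neither names a specific revision.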

OK, so while I managed to find deletions in recent changes (hurray!), I think, from reading this: https://en.wikipedia.org/wiki/Wikipedia:Revision_deletion (section "RevisionDelete's own log entries"), it sounds like suppressed revisions will not be logged? Is that correct?

(If so, that's going to be problematic for any live system as we'll not be able to monitor and systematically remove suppressed revisions that were seen from the revision stream).

It looks like it's correct as I don't see any entries in the eventstream for recentchanges logs with nfield = 8 :(

They are logged with a log_type = 'suppress'. This log is a "private" log, and might not be shown in event stream. I think T105427 suggests there is some way to get this info from eventstream though.

The nfield wouldn't be exactly 8, as the suppression bit is bitwise ORed with the other fields (so most commonly it's 15 or 9 for suppression).
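
To make the bit arithmetic concrete, here is a small decoder for rev_deleted / nfield values. The bit values follow the Bitfields_for_rev_deleted page linked earlier in this thread; the helper names and the short labels are just for illustration.

```python
from typing import List

DELETED_TEXT = 1        # revision content hidden
DELETED_COMMENT = 2     # edit summary hidden
DELETED_USER = 4        # username / IP hidden
DELETED_RESTRICTED = 8  # also hidden from administrators (suppression)

_FLAGS = [
    (DELETED_TEXT, "text"),
    (DELETED_COMMENT, "comment"),
    (DELETED_USER, "user"),
    (DELETED_RESTRICTED, "restricted"),
]


def decode_rev_deleted(value: int) -> List[str]:
    """List the visibility flags set in a rev_deleted / nfield value."""
    return [name for bit, name in _FLAGS if value & bit]


def is_suppressed(value: int) -> bool:
    """True if the suppression bit is set, whatever else is hidden."""
    return bool(value & DELETED_RESTRICTED)


examples = {v: decode_rev_deleted(v) for v in (3, 5, 9, 15)}
```

Testing with `value & DELETED_RESTRICTED` rather than `value == 8` is what catches the common 9 and 15 cases.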

Thanks, ok digging into code... !

So, reading the code: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/includes/logging/LogEntry.php#770

Looks like when the restricted flag is present, the delete is not published (https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/includes/logging/LogEntry.php#774), and that the experiments in 2015 on exposing some information also didn't make it into master (and they would have to be adapted so that revision info was present in the params)...

(I also did a test to see if any log_type='suppress' exist in the last few hundred MBs of logs, but there were none)

So, as far as I can see, knowing and respecting a suppressed revision is not possible from the event stream after all. :(

So I guess this now gets back to Dario's initial purpose in opening this issue: that we make it possible to respect suppressed revisions. Sounds like the event stream is a natural place, and that it wouldn't be particularly difficult, although we'd probably only want to expose the revision ID, and no other data that might be what is being suppressed. Thoughts?

Adding @Ottomata for visibility on the EventStream side of things.

To get both deleted and suppressed revisions into EventStreams in an easier-to-understand way, I would create a new event stream called revision-delete. This is probably not hard to do from the MediaWiki side of things, but I don't have any understanding of the privacy implications of such a thing. We'd need someone else to OK and review that.

Also, since you have NDA access, it might be possible to get access to the Analytics Data Lake and be able to query the Mediawiki History snapshots in a more complete and faster way than using the Mediawiki databases. Adding @Milimetric to comment if what you are looking for is there too.

@PEarleyWMF: we discussed privacy implications; can you comment further on this, and if/who you think should be looped in.

@Iislucas We're looking at it - we'll get back as soon as possible!

If all you're looking for is a list of revisions that have been deleted, then the data lake will indeed help you out. If you need help with Hive, let me know, but essentially start hive and do this:

select wiki_db,
       revision_id,
       revision_deleted_timestamp
  from wmf.mediawiki_history
 where revision_is_deleted
   and snapshot = '2018-07'
 limit 10
;

Obviously you can select any of the other fields you're interested in. Take a close look at the schema and the idea behind this table (it's meant to enable use cases like yours, so we're always open to adding dimensions and metrics): https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history#Schema. Or, if you don't read that :), at least note that all wikis are included in one table, so if you want just enwiki you need wiki_db = 'enwiki'

Also, @Iislucas, I got a little lost in the conversation above. It would be helpful if you could edit the task and state exactly what you need.

@PEarleyWMF: we discussed privacy implications; can you comment further on this, and if/who you think should be looped in.

So I spoke with Legal about this. They still want to think about the question of who has access (they are not 100% against it being public, but have some reservations about not requiring authentication). They are, however, completely OK with you and other researchers having access to a stream like this, and OK with the tool being built while they think about the authentication question.

Iislucas renamed this task from Exposing revIDs of deleted edits for research purposes to Exposing revIDs (nothing more) of deleted/suppressed edits for research to respect their removal. Nov 23 2018, 11:26 AM
Iislucas updated the task description.

Thanks! I've updated the description to reflect where I think we're at, and what I think is needed for a live system to respect deletions and suppressions.

What's the next step?

Iislucas updated the task description.

I'm not sure what the status of this task is. Most of the people who were working on it were doing so in a temporary research capacity, or no longer work for the Foundation. Unless there are loose ends here perhaps a close is in order?

I'd still love to see this; it would enable tools like https://github.com/conversationai/wikidetox to be completed.

@Iislucas, the rev_is_revert tag is getting added to revisions, and exposing tags on a public stream is being tracked in T294391. This is likely to be worked on. Suppression information is available via mediawiki.revision-visibility-change right now, unless that's restricted in some way I don't know about.

Noting another use case for this -- T333344 (private). In that case, we need the page ID, not the rev ID.

FWIW, it should be possible to do this one day with a public version of the new mediawiki.page_change stream. We'd like to make a compacted and redacted version that we could expose via EventStreams. Doing so is not currently on a project radar, but it is an intention.

Is the mediawiki.revision-visibility-change stream sufficient for this? If so, can we close this task?

@Ottomata I read the documentation but it was not clear to me... Can you show/link to an example where a page deletion appears in the revision visibility stream, so I can try to confirm?
Thanks!

page deletion

This will either be in the mediawiki.page_delete stream, or better yet: the recently released mediawiki.page_change.v1 stream. page_change will have page deletes, and revision suppressions IFF the revision being suppressed is the latest revision.

Ottomata claimed this task.

I'm going to be bold and close this task.