The streaming updater should support page deletions
Closed, Resolved · Public

Description

When an item is deleted, the streaming updater should produce a message instructing the consumer to delete the item from the graph.

Classic page deletions are made by admins and propagated through the mediawiki.page-delete stream.
Example message:

{
  "$schema": "/mediawiki/page/delete/1.0.0",
  "meta": {
    "uri": "https://test.wikidata.org/wiki/Q212433",
    "request_id": "59f87c41-7680-4f8a-bf6e-7dac91530972",
    "id": "00fcac35-5357-4c99-ba9f-e720db9f0197",
    "dt": "2020-07-01T13:16:25Z",
    "domain": "test.wikidata.org",
    "stream": "mediawiki.page-delete"
  },
  "database": "testwikidatawiki",
  "performer": {
    "user_text": "DCausse (WMF)",
    "user_groups": [
      "bureaucrat",
      "sysop",
      "*",
      "user"
    ],
    "user_is_bot": false,
    "user_id": 2490,
    "user_registration_dt": "2017-09-28T06:49:13Z",
    "user_edit_count": 7
  },
  "page_id": 302928,
  "page_title": "Q212433",
  "page_namespace": 0,
  "page_is_redirect": false,
  "rev_id": 529859,
  "rev_count": 1,
  "comment": "content was: \"Test dcausse v2\", and the only contributor was \"[[Special:Contributions/DCausse (WMF)|DCausse (WMF)]]\" ([[User talk:DCausse (WMF)|talk]])",
  "parsedcomment": "content was: &quot;Test dcausse v2&quot;, and the only contributor was &quot;<a href=\"/wiki/Special:Contributions/DCausse_(WMF)\" title=\"Special:Contributions/DCausse (WMF)\">DCausse (WMF)</a>&quot; (<a href=\"/w/index.php?title=User_talk:DCausse_(WMF)&amp;action=edit&amp;redlink=1\" class=\"new\" title=\"User talk:DCausse (WMF) (page does not exist)\">talk</a>)"
}
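
As a rough illustration of what a consumer of this stream needs from the message, the sketch below pulls out the fields that identify the deleted item. It is only a sketch: the `ParsedDelete` type and the `parse_page_delete` function are hypothetical names, and the sample string is a trimmed copy of the message above.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ParsedDelete:
    """Hypothetical holder for the fields the updater cares about."""
    item: str
    revision: int
    event_time: datetime
    domain: str


def parse_page_delete(raw: str) -> ParsedDelete:
    """Extract the item, revision, event time and domain from a page-delete message."""
    event = json.loads(raw)
    meta = event["meta"]
    # meta.dt is an ISO 8601 UTC timestamp such as "2020-07-01T13:16:25Z"
    event_time = datetime.strptime(meta["dt"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return ParsedDelete(
        item=event["page_title"],
        revision=event["rev_id"],
        event_time=event_time,
        domain=meta["domain"],
    )


# Trimmed copy of the example message (only the fields used here).
SAMPLE = """
{
  "$schema": "/mediawiki/page/delete/1.0.0",
  "meta": {
    "dt": "2020-07-01T13:16:25Z",
    "domain": "test.wikidata.org",
    "stream": "mediawiki.page-delete"
  },
  "page_title": "Q212433",
  "page_namespace": 0,
  "rev_id": 529859
}
"""

parsed = parse_page_delete(SAMPLE)
```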

This task involves:

On the shared model:

  • add a new operation type "delete" to org.wikidata.query.rdf.tool.stream.MutationEventData
  • add tests to org.wikidata.query.rdf.tool.stream.MutationEventDataJsonSerializationUnitTest to make sure that it's serialized properly

On the flink pipeline:

  • add a new case class PageDelete in the InputEvent ADT
  • add a new case class DeleteItem in the MutationOperation ADT
  • add a new stream to consume from (kafka topic mediawiki.page-delete) and produce PageDelete to downstream operators
  • add a new case in DecideMutationOperation:
    • produce a DeleteItem operation if the map contains a revision of the item and delete it from the map
    • produce an IgnoredMutation otherwise
  • add a new case in org.wikidata.query.rdf.updater.GenerateEntityDiffPatchOperation to support the DeleteItem operation: it simply produces an EntityPatchOp with a MutationEventData that has the type "delete".
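
The DecideMutationOperation case described above can be sketched as follows. This is a language-neutral illustration in Python, not the pipeline's code: the class names mirror the ones in the list, but their fields, the `decide` function and the plain dict standing in for the operator state are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageDelete:
    """Input event (fields are assumed for illustration)."""
    item: str
    revision: int


@dataclass(frozen=True)
class DeleteItem:
    """Mutation operation asking downstream operators to delete the item."""
    item: str
    revision: int


@dataclass(frozen=True)
class IgnoredMutation:
    """Mutation operation recording that the event had no effect."""
    item: str
    revision: int


def decide(state: dict, event: PageDelete):
    """Produce DeleteItem if the state knows a revision of the item, else IgnoredMutation.

    `state` maps an item id to the last revision seen; a plain dict is used
    here in place of the real operator state.
    """
    if event.item in state:
        del state[event.item]  # forget the item once its deletion is emitted
        return DeleteItem(event.item, event.revision)
    return IgnoredMutation(event.item, event.revision)


state = {"Q212433": 529859}
first = decide(state, PageDelete("Q212433", 529859))
second = decide(state, PageDelete("Q212433", 529859))
```

With this rule a repeated PageDelete for the same item is harmless: the second event finds nothing in the state and is ignored.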

On the pipeline consumer:

  • Refactor RDFPatch so that it has two modes: applying a diff and deleting an item
  • Refactor org.wikidata.query.rdf.tool.stream.KafkaStreamConsumer so that it accumulates item deletions
  • Adapt org.wikidata.query.rdf.tool.rdf.RdfRepositoryUpdater#applyPatch to support item deletions
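
A minimal sketch of a patch that carries both modes, assuming an in-memory set of (subject, predicate, object) tuples in place of the real triple store. The `Patch` type and `apply_patch` function are hypothetical, and a real deletion likely also has to remove triples that belong to the entity but hang off other subjects (statement nodes, for example), which this sketch ignores.

```python
from dataclasses import dataclass, field


@dataclass
class Patch:
    """Hypothetical patch: a diff (added/removed triples) plus entities to delete."""
    added: set = field(default_factory=set)
    removed: set = field(default_factory=set)
    deleted_entities: set = field(default_factory=set)


def apply_patch(store: set, patch: Patch) -> set:
    """Return a new store with the diff applied, then the deleted entities dropped."""
    result = {t for t in store if t not in patch.removed}
    result |= patch.added
    # Deletions run last so that triples added for a deleted entity are dropped too.
    return {t for t in result if t[0] not in patch.deleted_entities}


store = {("Q1", "label", "a"), ("Q2", "label", "b")}
patch = Patch(
    added={("Q2", "description", "c")},
    removed={("Q2", "label", "b")},
    deleted_entities={"Q1"},
)
updated = apply_patch(store, patch)
```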

AC:
When deleting an item from wikibase:

  • an event should be present in the streaming updater output indicating that this item needs to be deleted
  • the data should disappear from the query service when using the streaming updater

size: XL

Event Timeline

Change 616633 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/rdf@master] add delete operation

https://gerrit.wikimedia.org/r/616633

Change 616633 merged by jenkins-bot:
[wikidata/query/rdf@master] add delete operation

https://gerrit.wikimedia.org/r/616633

Change 617761 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/rdf@master] Add page delete stream to incoming streams

https://gerrit.wikimedia.org/r/617761

Shared model delete functionality is complete. I am currently adding delete functionality to the flink pipeline.

Change 618159 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/rdf@master] Make Deserialization Schema generic

https://gerrit.wikimedia.org/r/618159

What do we want to do if we get a PageDelete event for the same item multiple times? Right now we're not tracking deletes, so I'm assuming multiple deletes for the same item probably can't do that much harm.

Change 618634 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/rdf@master] Adds delete case to DecideMutationOperation

https://gerrit.wikimedia.org/r/618634

Change 617761 merged by jenkins-bot:
[wikidata/query/rdf@master] Add page delete stream to incoming streams

https://gerrit.wikimedia.org/r/617761

Change 618159 merged by jenkins-bot:
[wikidata/query/rdf@master] Make Deserialization Schema generic

https://gerrit.wikimedia.org/r/618159

I still don't know the best way to track this. Multiple deletes indeed won't do much harm, but we should avoid treating a late RevCreateEvent as a new item to import when in reality the item is deleted.
Ideally we want to clear the state when we see a deletion, but that might not allow us to detect this kind of situation. Another possibility is to record in the state that we have seen a PageDelete for revision X by storing -X (basically using the sign as a flag to indicate a previous deletion). I think this will have to be decided while implementing multiple unit tests simulating the various scenarios we might see.

Clearing the state is out of the question, I think. Right now, we record the revision at which we see a delete and then reject any RevCreate event for that item. However, this means the item can never be reimported, even by a legitimate later RevCreate event. I think the rule should be: if the RevCreate event's revision is greater than the deleted revision, go ahead and do a full import.
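
To make the two ideas above concrete, here is a small sketch that stores -X for an item deleted at revision X and only allows a full import when a later RevCreate carries a higher revision. The function names, the string results and the dict used as state are all hypothetical, and out-of-order delete events are not handled.

```python
def on_rev_create(state: dict, item: str, rev: int) -> str:
    """Decide what to do with a RevCreate event; `state` maps item -> signed revision."""
    last = state.get(item)
    if last is None:
        state[item] = rev
        return "full_import"
    if last < 0:
        # Item was deleted at revision -last: only a newer revision may revive it.
        if rev > -last:
            state[item] = rev
            return "full_import"
        return "ignored"  # late event for an item that is already deleted
    if rev > last:
        state[item] = rev
        return "diff"
    return "ignored"


def on_page_delete(state: dict, item: str, rev: int) -> str:
    """Decide what to do with a PageDelete event."""
    last = state.get(item)
    if last is None or last < 0:
        return "ignored"  # unknown item, or a repeated delete
    state[item] = -rev  # the sign flags the deletion, the magnitude keeps the revision
    return "delete"


state = {}
results = [
    on_rev_create(state, "Q1", 10),   # first sighting
    on_page_delete(state, "Q1", 10),  # deletion
    on_rev_create(state, "Q1", 9),    # late event from before the deletion
    on_page_delete(state, "Q1", 10),  # repeated delete
    on_rev_create(state, "Q1", 12),   # item recreated with a newer revision
]
```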

Change 621349 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/rdf@master] use SuccessfulOp as higher level abstraction

https://gerrit.wikimedia.org/r/621349

Change 621356 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/rdf@master] add revision create to updater pipeline

https://gerrit.wikimedia.org/r/621356

Change 618634 merged by jenkins-bot:
[wikidata/query/rdf@master] Adds delete case to DecideMutationOperation

https://gerrit.wikimedia.org/r/618634

Change 621349 merged by jenkins-bot:
[wikidata/query/rdf@master] use SuccessfulOp as higher level abstraction

https://gerrit.wikimedia.org/r/621349

Wrapping up the flink pipeline work: a patch is out for the integration tests, and there are some more test cases to add for DecideMutationOperation to make sure that all delete cases are covered. Pipeline consumer work is up next.

Change 623035 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/rdf@master] allow deleted object to be created

https://gerrit.wikimedia.org/r/623035

Change 623035 abandoned by Mstyles:
[wikidata/query/rdf@master] allow deleted object to be created

Reason:
I misunderstood the intended logic

https://gerrit.wikimedia.org/r/623035

Change 623452 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/rdf@master] refine unit tests

https://gerrit.wikimedia.org/r/623452

Change 621356 merged by jenkins-bot:
[wikidata/query/rdf@master] add page delete to updater pipeline

https://gerrit.wikimedia.org/r/621356

Change 623452 merged by jenkins-bot:
[wikidata/query/rdf@master] refine unit tests

https://gerrit.wikimedia.org/r/623452

Change 629234 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/rdf@master] remove entities from blazegraph

https://gerrit.wikimedia.org/r/629234

Change 629765 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/rdf@master] delete entity integration test

https://gerrit.wikimedia.org/r/629765

Wrapping up the last phase of deleting an entity from blazegraph. I have verified that items to be deleted are present in the streaming updater output.

Change 629234 merged by jenkins-bot:
[wikidata/query/rdf@master] remove entities from blazegraph

https://gerrit.wikimedia.org/r/629234

Change 629765 merged by jenkins-bot:
[wikidata/query/rdf@master] delete entity integration test

https://gerrit.wikimedia.org/r/629765

Change 630855 had a related patch set uploaded (by Mstyles; owner: DCausse):
[wikidata/query/rdf@master] Add some UpdatePatchAccumulator unit tests

https://gerrit.wikimedia.org/r/630855

Delete functionality is complete. Optimization work is in progress to ensure that the patches the streaming updater consumer builds and sends to blazegraph do not contain statements from entities that are about to be deleted.
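
One way to picture this optimization is an accumulator that tracks triples per entity and discards everything it holds for an entity as soon as a delete for it arrives. This is a sketch under that assumption only; the `Accumulator` class and its methods are hypothetical and are not the UpdatePatchAccumulator implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Accumulator:
    """Hypothetical batch accumulator keyed by entity id."""
    added: dict = field(default_factory=dict)    # entity -> set of triples to add
    removed: dict = field(default_factory=dict)  # entity -> set of triples to remove
    deleted: set = field(default_factory=set)    # entities to delete entirely

    def on_diff(self, entity: str, added: set, removed: set) -> None:
        """Merge a diff for one entity into the batch."""
        self.added.setdefault(entity, set()).update(added)
        self.removed.setdefault(entity, set()).update(removed)

    def on_delete(self, entity: str) -> None:
        """Record a deletion and drop the statements accumulated for that entity."""
        self.added.pop(entity, None)
        self.removed.pop(entity, None)
        self.deleted.add(entity)


acc = Accumulator()
acc.on_diff("Q1", added={("Q1", "label", "a")}, removed=set())
acc.on_diff("Q2", added={("Q2", "label", "b")}, removed={("Q2", "label", "old")})
acc.on_delete("Q1")
```

After these three events the batch holds only the Q2 diff plus a delete for Q1, so nothing is sent for statements that would be removed anyway.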

Change 630855 merged by jenkins-bot:
[wikidata/query/rdf@master] Add some UpdatePatchAccumulator unit tests

https://gerrit.wikimedia.org/r/630855

All patches are merged and items can be deleted successfully