
[L] Track commons deletion requests
Open, Needs Triage, Public

Description

We need an up-to-date view of commons deletion requests, so we can keep an eye on how DRs are affected by changes we make to UploadWizard

Proposed solution

  • dedicated Hive db
  • use (daily) data in wmf_dumps to update a deletion request table (a creation sketch follows this list) containing
    • timestamp for when the deletion request was created
    • timestamp for when the deletion request was resolved
    • whether the file was (or files were) kept or deleted
    • the reason the request was opened
    • the reason the request was closed
    • page_titles for files that are covered by the DR
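
For illustration, a minimal creation sketch of that table (assumptions: Spark with write access to a Hive/Iceberg catalog, the mediawiki_upload_tracking db mentioned further down in this task, and an illustrative dr_page_title key; none of this is final):

```
# Minimal sketch only -- database/column names and the Iceberg format are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create_deletion_request_table").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS mediawiki_upload_tracking.deletion_request (
        dr_page_title    STRING        COMMENT 'Title of the deletion request subpage (illustrative key)',
        created_at       TIMESTAMP     COMMENT 'When the deletion request was created',
        resolved_at      TIMESTAMP     COMMENT 'When the deletion request was resolved (NULL while open)',
        outcome          STRING        COMMENT 'Whether the file was (or files were) kept or deleted',
        open_reason      STRING        COMMENT 'The reason the request was opened',
        close_reason     STRING        COMMENT 'The reason the request was closed',
        file_page_titles ARRAY<STRING> COMMENT 'page_titles for files covered by the DR'
    )
    USING iceberg
""")
```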

Also an up-to-date upload table that's a view of page (where page_namespace=6) plus image plus filearchive, that can be linked to the deletion requests (a view sketch follows the field list), with fields

  • page_id
  • page_title
  • upload time
  • upload source (UW, cross-wiki, etc)
  • deleted time
  • deletion comment

Everything below is out of scope, but I just want to write it all down here so we'll have it for later ... ultimately it'd be great to also have

  • uploader edit count at time of upload?
  • uploader groups at time of upload?
  • uploader account age at time of upload?

... and then maybe a file_tags table ("logo", "ownwork/notown", "has external source", etc) joined to the upload table
... and also maybe a deletion_tags table ("logo", "copyvio", "FoP", etc) joined to the upload table
... probably also a table containing pHashes
... and maybe a table containing pHashes of files on wikipedias, as they're likely to be copyrighted

If we had all the above it'd really help us begin to figure out likelihood-of-deletion scores for uploads
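
On the pHash idea above, computing the hashes themselves is cheap with off-the-shelf libraries; a minimal sketch (assumes Pillow and the third-party imagehash package are installed, and the file list is just a placeholder):

```
# Minimal sketch of computing perceptual hashes for uploaded files.
# Assumptions: Pillow and the third-party `imagehash` package; illustrative file paths.
from PIL import Image
import imagehash

def compute_phash(path: str) -> str:
    """Return the perceptual hash of an image as a hex string."""
    with Image.open(path) as img:
        return str(imagehash.phash(img))

# Hypothetical usage: build (page_title, phash) rows for a pHash table.
files = {"Example.jpg": "/tmp/Example.jpg"}
rows = [(title, compute_phash(path)) for title, path in files.items()]
```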

Event Timeline

Cparle renamed this task from "Provide up-to-date metrics for commons deletions and deletion requests" to "[L] Provide up-to-date metrics for commons deletions and deletion requests". Jul 25 2024, 8:16 AM
Cparle claimed this task.
Cparle renamed this task from "[L] Provide up-to-date metrics for commons deletions and deletion requests" to "[L] Track commons deletion requests". Oct 18 2024, 1:30 PM
Cparle updated the task description. (×4)

Hello!

Some questions:

  • what are your needs for eventual consistency? Do you need to be sure you don’t miss anything? Or if you missed a few (like during upstream outages) would things be okay?
  • Could you identify deletion requests in MediaWiki and just submit an event?
  • Could you just do a streaming enrichment app to filter and enrich page_content_change into a new deletion_request kind of event? (a filtering sketch follows this list)
  • Are you aware of the imminent Dumps 2.0 work where a new mediawiki_content_history will be an Iceberg table with hourly incremental updates (just like your proposed iceberg table)? It is built using page_content_change, so you might be able to just consume this table directly (hourly) and not have to do any joins between page_content_change and other tables.
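
To make the streaming-enrichment question concrete, a minimal sketch of the filtering step such an app might perform (field names approximate the mediawiki.page_content_change schema and should be checked; the 'Commons:Deletion_requests/' title convention is the assumption doing the work here):

```
# Minimal sketch of the filter/enrich step of a hypothetical deletion_request stream.
# Assumptions: events roughly follow the mediawiki.page_content_change schema
# (wiki_id, dt, page.namespace_id, page.page_title, revision.content_slots.main.content_body).
def is_deletion_request_event(event: dict) -> bool:
    """True if a page_content_change event is an edit to a Commons DR subpage."""
    if event.get("wiki_id") != "commonswiki":
        return False
    page = event.get("page", {})
    # Namespace 4 is the project ("Commons:") namespace; titles may use underscores.
    return (
        page.get("namespace_id") == 4
        and page.get("page_title", "").startswith("Commons:Deletion_requests/")
    )

def to_deletion_request_event(event: dict) -> dict:
    """Project the fields a downstream deletion_request event might carry."""
    return {
        "dr_page_title": event["page"]["page_title"],
        "event_time": event["dt"],
        "wikitext": event.get("revision", {}).get("content_slots", {})
                          .get("main", {}).get("content_body"),
    }
```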

I think with either option (event based or new Dumps 2 mediawiki_content_history based), you will be able to have one datasource for your table, rather than having to join multiples together.

It might be helpful to make a little design doc (in this phab ticket is fine) with some requirements for this. Providing proposed solutions is good too, but maybe start with the requirements and we can advise!

what are your needs for eventual consistency? Do you need to be sure you don’t miss anything? Or if you missed a few (like during upstream outages) would things be okay?

Missing a few is ok so long as we eventually pick them up. That's why we have the script to process the monthly dumps as well as the script to do the hourly dumps

Could you identify deletion requests in MediaWiki and just submit an event?
Could you just do a streaming enrichment app to filter and enrich page_content_change into a new deletion_request kind of event

I guess we could but what we need is a view of the current state of a deletion request rather than the events that led to the current state
(never mind the struck-through piece - the most recent change event could contain the current state, like it does for page_content_change)

Are you aware of the imminent Dumps 2.0 work where a new mediawiki_content_history will be an Iceberg table with hourly incremental updates ...

I was not, but we only use page_content_change for the hourly dump processing, and don't do any joins on it

... so we have only one datasource already. We use 2 datasources when processing the monthly dumps, but I think we need that for eventual consistency

It might be helpful to make a little design doc (in this phab ticket is fine) with some requirements for this. Providing proposed solutions is good too, but maybe start with the requirements and we can advise!

What we want to be able to do is

  • right now
    • track the proportion of uploads that are picking up deletion requests within X time (a day or a week); see the query sketch after this list
    • be able to differentiate between the DRs - for example stuff that gets deleted because it's copyrighted, or stuff that gets deleted because it's out of scope (e.g. a selfie)
  • ultimately: gather a dataset that we can use to train a model to compute likelihood of deletion for an upload
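
For the first "right now" item, a query sketch against the proposed tables (assumes the upload and deletion_request tables sketched earlier, with timestamp-typed upload_time/created_at; the 7-day and 30-day windows are examples):

```
# Sketch only -- assumes the proposed mediawiki_upload_tracking tables exist as sketched above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dr_rate_within_window").getOrCreate()

spark.sql("""
    SELECT
        count(DISTINCT u.page_title)           AS uploads,
        count(DISTINCT dr.file_page_title)     AS uploads_with_dr_within_7d,
        count(DISTINCT dr.file_page_title) / count(DISTINCT u.page_title) AS dr_rate
    FROM mediawiki_upload_tracking.upload u
    LEFT JOIN (
        -- explode the per-DR list of covered files to one row per file
        SELECT explode(file_page_titles) AS file_page_title, created_at
        FROM mediawiki_upload_tracking.deletion_request
    ) dr
      ON dr.file_page_title = u.page_title
     AND dr.created_at BETWEEN u.upload_time AND u.upload_time + INTERVAL 7 DAYS
    WHERE u.upload_time >= date_sub(current_date(), 30)
""").show()
```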

I think the code we have ticks the "right now" box pretty well, though we could probably do that with a purely event-driven, stream-enrichment-type approach instead if you preferred ... however I don't think that would allow us to get historical data, would it? And I think it'd be more difficult to ensure that our data is correct without being able to use the monthly dumps?

Missing a few is ok so long as we eventually pick them up.

Sounds like missing is not okay :)

I don't think that would allow us to get historical data, would it?

True, you'd have to backfill, but you'll have to do that anyway.

I was not, but we only use page_content_change for the hourly dump processing, and don't do any joins on it
... so we have only one datasource already. We use 2 datasources when processing the monthly dumps, but I think we need that for eventual consistency

You have 2 datasources for eventual consistency. If you didn't need to reconcile for consistency purposes, then ya, you'd have one data source.

Are you aware of the imminent Dumps 2.0 work where a new mediawiki_content_history will be an Iceberg table with hourly incremental updates ...

This table will be eventually consistent. It is created from event.mediawiki.page_content_change_v1, and then reconciled for consistency against the analytics MariaDB replicas themselves. mediawiki_content_history will replace mediawiki_wikitext_history.

If you consume directly from mediawiki_content_history, you shouldn't need to think about handling any consistency reconciliations yourself. CC @xcollazo :)


FWIW, we would have liked to focus on eventual consistency of mediawiki.page_content_change.v1, and backfill it. However, because of the impending need to replace the flailing Dumps 1 pipeline, we decided to focus on replacing and reconciling mediawiki_wikitext_history instead.

cc @Ahoelzl

Ok the above all makes sense, but I'm not sure I understand what you're proposing that we change our current approach to. Could you spell it out step-by-step?

I'll let @xcollazo confirm, but I'm suggesting to use the new Dumps 2 (not yet production ready, but very soon!) mediawiki_content_history Iceberg table as your sole datasource. It will be incremental (hourly or daily), reconciled and eventually consistent, so you don't need to do your own reconciliation / lambda arch.

There may be details I don't understand about properly consuming updates to that table, so I could be wrong, but I think the intention of having a consistent mediawiki_content_history is so that folks don't have to roll their own reconciliation.

But, @Cparle more generally, if we (DPE) were able to prioritize work for T258511: Data Lake incremental Data Updates and T291120: MediaWiki Event Carried State Transfer - Problem Statement and https://wikitech.wikimedia.org/wiki/MediaWiki_External_State_Problem, your task here would be much simpler.

I'd like to encourage you and your team to be 'squeakier' about this stuff, so that our managers understand the connection between platform capabilities and common product feature and analytics needs :) Thank YOUUUU!

Haha ok cool ... by 'squeakier' do you mean broadcasting what we need/want a bit louder? And when is Dumps 2.0 expected to land?

by 'squeakier' do you mean broadcasting what we need/want a bit louder?

Yup! Be the squeaky wheel. We can discuss more elsewhere :)

And when is Dumps 2.0 expected to land?

...release candidate...soon? We had expected this month, but I'll let Xabriel comment. There is a development table that can be used now I think.

I'll let @xcollazo confirm, but I'm suggesting to use the new Dumps 2 (not yet production ready, but very soon!) mediawiki_content_history Iceberg table as your sole datasource. It will be incremental (hourly or daily), reconciled and eventually consistent, so you don't need to do your own reconciliation / lambda arch.

@Cparle wmf_dumps.mediawiki_content_history will be, effectively, a more up to date version of the existing wmf.mediawiki_wikitext_history table. Even though we were hoping for it to be updated hourly, we have decided to update it daily, for performance reasons. You can read more about the architecture of this here.

We currently have a release candidate of this table in the datalake that uses the table's old name: wmf_dumps.wikitext_raw_rc2. We intend to have the production table by end of Q2.
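
For what it's worth, a guess at what consuming the release candidate might look like (column names are assumed to resemble wmf.mediawiki_wikitext_history and may well differ in wikitext_raw_rc2; title prefix and underscore handling also need checking):

```
# Sketch only -- assumes wikitext_raw_rc2 has columns similar to wmf.mediawiki_wikitext_history
# (wiki_db, page_namespace, page_title, revision_timestamp, revision_text); the real schema may differ.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dr_pages_from_dumps2_rc").getOrCreate()

dr_revisions = spark.sql("""
    SELECT page_title, revision_timestamp, revision_text
    FROM wmf_dumps.wikitext_raw_rc2
    WHERE wiki_db = 'commonswiki'
      AND page_namespace = 4
      -- depending on the schema, titles may or may not carry the 'Commons:' prefix / use spaces
      AND page_title LIKE 'Commons:Deletion_requests/%'
""")
dr_revisions.show(5, truncate=80)
```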

There may be details I don't understand about properly consuming updates to that table, so I could be wrong, but I think the intention of having a consistent mediawiki_content_history is so that folks don't have to do roll their own reconcilliation.

We have it in the queue to experiment with a CDC-like mechanism built into Iceberg to consume changes made to mediawiki_content_history via T366544: Use the Spark-Iceberg built in CDC mechanism to PoC a replacement for wikimedia_wikitext_current, but we have not made progress on that yet.
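
For context, the Iceberg changelog ("CDC") mechanism being referred to looks roughly like this in Spark (a sketch only, not the T366544 work; it assumes Iceberg >= 1.2 and placeholder snapshot IDs):

```
# Sketch of Iceberg's built-in changelog mechanism; table name and snapshot IDs are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg_cdc_poc").getOrCreate()

# Materialise a changelog view between two snapshots of the content table.
spark.sql("""
    CALL spark_catalog.system.create_changelog_view(
        table => 'wmf_dumps.mediawiki_content_history',
        changelog_view => 'content_changes',
        options => map('start-snapshot-id', '1', 'end-snapshot-id', '2')
    )
""")

# Each row carries _change_type (INSERT / DELETE / UPDATE_BEFORE / UPDATE_AFTER).
spark.sql("SELECT _change_type, count(*) FROM content_changes GROUP BY _change_type").show()
```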


I don't think I completely follow the requirements for this work, so perhaps we should all talk again and align better.

As of 2024-12-18

Code for this is done and in review. To really finish it off:

  • we need an Airflow job to keep the data up to date (a generic DAG sketch follows this list)
  • we also need deployment/CI setup in the gitlab repo
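
As a shape-of-the-thing sketch only (the WMF airflow-dags repo has its own operators and conventions, so the real DAG will look different; paths, connection IDs and the schedule are placeholders):

```
# Generic sketch of a daily refresh job -- NOT the WMF airflow-dags convention.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="mediawiki_upload_tracking_daily",
    start_date=datetime(2024, 12, 1),
    schedule="@daily",
    catchup=True,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=30)},
) as dag:
    update_deletion_requests = SparkSubmitOperator(
        task_id="update_deletion_requests",
        application="hdfs:///path/to/upload_tracking_job.py",  # placeholder artifact path
        application_args=["--date", "{{ ds }}"],
        conn_id="spark_default",
    )
```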

I've run all the jobs, and the historical DRs/deletions for all upload methods and UW can be seen here

For later ... upload.upload_source is set from a monthly dump, because it's not available in the up-to-date data in wmf_dumps. We'd have a better view of the current state of things if change_tag data were available in wmf_dumps
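
For reference, deriving upload_source from the sqooped change_tag data might look roughly like this (the table names, snapshot handling and the 'uploadwizard' / 'cross-wiki-upload' tag names are all assumptions to verify):

```
# Rough sketch of deriving upload_source from monthly change_tag data.
# Assumptions: wmf_raw.mediawiki_change_tag / mediawiki_change_tag_def are sqooped copies of the
# MediaWiki tables, keyed by revision via ct_rev_id; tag names should be double-checked.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upload_source_from_change_tags").getOrCreate()
snapshot = "2024-11"  # hypothetical monthly snapshot

upload_sources = spark.sql(f"""
    SELECT ct.ct_rev_id AS upload_rev_id,
           CASE
               WHEN ctd.ctd_name = 'uploadwizard'      THEN 'UW'
               WHEN ctd.ctd_name = 'cross-wiki-upload' THEN 'cross-wiki'
               ELSE ctd.ctd_name
           END AS upload_source
    FROM wmf_raw.mediawiki_change_tag ct
    JOIN wmf_raw.mediawiki_change_tag_def ctd
      ON ctd.ctd_id = ct.ct_tag_id
     AND ctd.wiki_db = ct.wiki_db AND ctd.snapshot = ct.snapshot
    WHERE ct.wiki_db = 'commonswiki' AND ct.snapshot = '{snapshot}'
      AND ctd.ctd_name IN ('uploadwizard', 'cross-wiki-upload')
""")
```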

Hey @Cparle , @Ottomata tells me you might need HDFS deployment through CI/CD for this project? I just set up Blunderbuss which does exactly that, although we haven't gone fully public org-wise with it yet.

What exactly do you need deployed to HDFS?

Hi @amastilovic ... I'm not sure we need anything deployed to HDFS; I've already created the database we're writing to (mediawiki_upload_tracking, and an integration-testing db too, test_mediawiki_upload_tracking) - is there something else that needs to be done?