
[L] Track commons deletion requests
Open, Needs Triage, Public

Description

We need an up-to-date view of commons deletion requests, so we can keep an eye on how DRs are affected by changes we make to UploadWizard

Proposed solution

  • dedicated Hive db
  • use (daily) data in wmf_dumps to update a deletion request table (a creation sketch follows this list) containing
    • timestamp for when the deletion request was created
    • timestamp for when the deletion request was resolved
    • whether the file was (or files were) kept or deleted
    • the reason the request was opened
    • the reason the request was closed
    • page_titles for files that are covered by the DR
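
For illustration, a minimal creation sketch of that table (assumptions: Spark with write access to a Hive/Iceberg catalog, the mediawiki_upload_tracking db mentioned further down in this task, and an illustrative dr_page_title key; none of this is final):

```
# Minimal sketch only -- database/column names and the Iceberg format are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create_deletion_request_table").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS mediawiki_upload_tracking.deletion_request (
        dr_page_title    STRING        COMMENT 'Title of the deletion request subpage (illustrative key)',
        created_at       TIMESTAMP     COMMENT 'When the deletion request was created',
        resolved_at      TIMESTAMP     COMMENT 'When the deletion request was resolved (NULL while open)',
        outcome          STRING        COMMENT 'Whether the file was (or files were) kept or deleted',
        open_reason      STRING        COMMENT 'The reason the request was opened',
        close_reason     STRING        COMMENT 'The reason the request was closed',
        file_page_titles ARRAY<STRING> COMMENT 'page_titles for files covered by the DR'
    )
    USING iceberg
""")
```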

Also an up-to-date upload table that's a view of page (where page_namespace=6) plus image plus filearchive, that can be linked to the deletion requests (a view sketch follows the field list), with fields

  • page_id
  • page_title
  • upload time
  • upload source (UW, cross-wiki, etc)
  • deleted time
  • deletion comment

Everything below is out of scope, but I just want to write it all down here so we'll have it for later ... ultimately it'd be great to also have

  • uploader edit count at time of upload?
  • uploader groups at time of upload?
  • uploader account age at time of upload?

... and then maybe a file_tags table ("logo", "ownwork/notown", "has external source", etc) joined to the upload table
... and also maybe a deletion_tags table ("logo", "copyvio", "FoP", etc) joined to the upload table
... probably also a table containing pHashes
... and maybe a table containing pHashes of files on wikipedias, as they're likely to be copyrighted

If we had all the above it'd really help us begin to figure out likelihood-of-deletion scores for uploads
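
On the pHash idea above, computing the hashes themselves is cheap with off-the-shelf libraries; a minimal sketch (assumes Pillow and the third-party imagehash package are installed, and the file list is just a placeholder):

```
# Minimal sketch of computing perceptual hashes for uploaded files.
# Assumptions: Pillow and the third-party `imagehash` package; illustrative file paths.
from PIL import Image
import imagehash

def compute_phash(path: str) -> str:
    """Return the perceptual hash of an image as a hex string."""
    with Image.open(path) as img:
        return str(imagehash.phash(img))

# Hypothetical usage: build (page_title, phash) rows for a pHash table.
files = {"Example.jpg": "/tmp/Example.jpg"}
rows = [(title, compute_phash(path)) for title, path in files.items()]
```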

Event Timeline

Cparle renamed this task from "Provide up-to-date metrics for commons deletions and deletion requests" to "[L] Provide up-to-date metrics for commons deletions and deletion requests". Jul 25 2024, 8:16 AM
Cparle claimed this task.
Cparle renamed this task from "[L] Provide up-to-date metrics for commons deletions and deletion requests" to "[L] Track commons deletion requests". Oct 18 2024, 1:30 PM
Cparle updated the task description. (×4)

Hello!

Some questions:

  • what are your needs for eventual consistency? Do you need to be sure you don’t miss anything? Or if you missed a few (like during upstream outages) would things be okay?
  • Could you identify deletion requests in MediaWiki and just submit an event?
  • Could you just do a streaming enrichment app to filter and enrich page_content_change into a new deletion_request kind of event? (a filtering sketch follows this list)
  • Are you aware of the imminent Dumps 2.0 work where a new mediawiki_content_history will be an Iceberg table with hourly incremental updates (just like your proposed iceberg table)? It is built using page_content_change, so you might be able to just consume this table directly (hourly) and not have to do any joins between page_content_change and other tables.
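
To make the streaming-enrichment question concrete, a minimal sketch of the filtering step such an app might perform (field names approximate the mediawiki.page_content_change schema and should be checked; the 'Commons:Deletion_requests/' title convention is the assumption doing the work here):

```
# Minimal sketch of the filter/enrich step of a hypothetical deletion_request stream.
# Assumptions: events roughly follow the mediawiki.page_content_change schema
# (wiki_id, dt, page.namespace_id, page.page_title, revision.content_slots.main.content_body).
def is_deletion_request_event(event: dict) -> bool:
    """True if a page_content_change event is an edit to a Commons DR subpage."""
    if event.get("wiki_id") != "commonswiki":
        return False
    page = event.get("page", {})
    # Namespace 4 is the project ("Commons:") namespace; titles may use underscores.
    return (
        page.get("namespace_id") == 4
        and page.get("page_title", "").startswith("Commons:Deletion_requests/")
    )

def to_deletion_request_event(event: dict) -> dict:
    """Project the fields a downstream deletion_request event might carry."""
    return {
        "dr_page_title": event["page"]["page_title"],
        "event_time": event["dt"],
        "wikitext": event.get("revision", {}).get("content_slots", {})
                          .get("main", {}).get("content_body"),
    }
```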

I think with either option (event based or new Dumps 2 mediawiki_content_history based), you will be able to have one datasource for your table, rather than having to join multiples together.

It might be helpful to make a little design doc (in this phab ticket is fine) with some requirements for this. Providing proposed solutions is good too, but maybe start with the requirements and we can advise!

what are your needs for eventual consistency? Do you need to be sure you don’t miss anything? Or if you missed a few (like during upstream outages) would things be okay?

Missing a few is ok so long as we eventually pick them up. That's why we have the script to process the monthly dumps as well as the script to do the hourly dumps

Could you identify deletion requests in MediaWiki and just submit an event?
Could you just do a streaming enrichment app to filter and enrich page_content_change into a new deletion_request kind of event

I guess we could but what we need is a view of the current state of a deletion request rather than the events that led to the current state
(never mind the struck-through piece - the most recent change event could contain the current state, like it does for page_content_change)

Are you aware of the imminent Dumps 2.0 work where a new mediawiki_content_history will be an Iceberg table with hourly incremental updates ...

I was not, but we only use page_content_change for the hourly dump processing, and don't do any joins on it

... so we have only one datasource already. We use 2 datasources when processing the monthly dumps, but I think we need that for eventual consistency

It might be helpful to make a little design doc (in this phab ticket is fine) with some requirements for this. Providing proposed solutions is good too, but maybe start with the requirements and we can advise!

What we want to be able to do is

  • right now
    • track the proportion of uploads that are picking up deletion requests within X time (a day or a week); see the query sketch after this list
    • be able to differentiate between the DRs - for example stuff that gets deleted because it's copyrighted, or stuff that gets deleted because it's out of scope (e.g. a selfie)
  • ultimately: gather a dataset that we can use to train a model to compute likelihood of deletion for an upload
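
For the first "right now" item, a query sketch against the proposed tables (assumes the upload and deletion_request tables sketched earlier, with timestamp-typed upload_time/created_at; the 7-day and 30-day windows are examples):

```
# Sketch only -- assumes the proposed mediawiki_upload_tracking tables exist as sketched above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dr_rate_within_window").getOrCreate()

spark.sql("""
    SELECT
        count(DISTINCT u.page_title)           AS uploads,
        count(DISTINCT dr.file_page_title)     AS uploads_with_dr_within_7d,
        count(DISTINCT dr.file_page_title) / count(DISTINCT u.page_title) AS dr_rate
    FROM mediawiki_upload_tracking.upload u
    LEFT JOIN (
        -- explode the per-DR list of covered files to one row per file
        SELECT explode(file_page_titles) AS file_page_title, created_at
        FROM mediawiki_upload_tracking.deletion_request
    ) dr
      ON dr.file_page_title = u.page_title
     AND dr.created_at BETWEEN u.upload_time AND u.upload_time + INTERVAL 7 DAYS
    WHERE u.upload_time >= date_sub(current_date(), 30)
""").show()
```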

I think the code we have ticks the "right now" box pretty well, though we could probably do that with a purely event-driven, stream-enrichment-type approach instead if you preferred ... however I don't think that would allow us to get historical data, would it? And I think it'd be more difficult to ensure that our data is correct without being able to use the monthly dumps?

Missing a few is ok so long as we eventually pick them up.

Sounds like missing is not okay :)

I don't think that would allow us to get historical data, would it?

True, you'd have to backfill, but you'll have to do that anyway.

I was not, but we only use page_content_change for the hourly dump processing, and don't do any joins on it
... so we have only one datasource already. We use 2 datasources when processing the monthly dumps, but I think we need that for eventual consistency

You have 2 datasources for eventual consistency. If you didn't need to reconcile for consistency purposes, then ya, you'd have one data source.

Are you aware of the imminent Dumps 2.0 work where a new mediawiki_content_history will be an Iceberg table with hourly incremental updates ...

This table will be eventually consistent. It is created from event.mediawiki.page_content_change_v1, and then reconciled for consistency against the analytics MariaDB replicas themselves. mediawiki_content_history will replace mediawiki_wikitext_history.

If you consume directly from mediawiki_content_history, you shouldn't need to think about handling any consistency reconciliations yourself. CC @xcollazo :)


FWIW, we would have liked to focus on eventual consistency of mediawiki.page_content_change.v1, and backfill it. However, because of the impending need to replace the flailing Dumps 1 pipeline, we decided to focus on replacing and reconciling mediawiki_wikitext_history instead.

cc @Ahoelzl

Ok the above all makes sense, but I'm not sure I understand what you're proposing that we change our current approach to. Could you spell it out step-by-step?

I'll let @xcollazo confirm, but I'm suggesting to use the new Dumps 2 (not yet production ready, but very soon!) mediawiki_content_history Iceberg table as your sole datasource. It will be incremental (hourly or daily), reconciled and eventually consistent, so you don't need to do your own reconciliation / lambda arch.

There may be details I don't understand about properly consuming updates to that table, so I could be wrong, but I think the intention of having a consistent mediawiki_content_history is so that folks don't have to roll their own reconciliation.

But, @Cparle more generally, if we (DPE) were able to prioritize work for T258511: Data Lake incremental Data Updates and T291120: MediaWiki Event Carried State Transfer - Problem Statement and https://wikitech.wikimedia.org/wiki/MediaWiki_External_State_Problem, your task here would be much simpler.

I'd like to encourage you and your team to be 'squeakier' about this stuff, so that our managers understand the connection between platform capabilities and common product feature and analytics needs :) Thank YOUUUU!

Haha ok cool ... by 'squeakier' do you mean broadcasting what we need/want a bit louder? And when is Dumps 2.0 expected to land?

by 'squeakier' do you mean broadcasting what we need/want a bit louder?

Yup! Be the squeaky wheel. We can discuss more elsewhere :)

And when is Dumps 2.0 expected to land?

...release candidate...soon? We had expected this month, but I'll let Xabriel comment. There is a development table that can be used now I think.

I'll let @xcollazo confirm, but I'm suggesting to use the new Dumps 2 (not yet production ready, but very soon!) mediawiki_content_history Iceberg table as your sole datasource. It will be incremental (hourly or daily), reconciled and eventually consistent, so you don't need to do your own reconciliation / lambda arch.

@Cparle wmf_dumps.mediawiki_content_history will be, effectively, a more up to date version of the existing wmf.mediawiki_wikitext_history table. Even though we were hoping for it to be updated hourly, we have decided to update it daily, for performance reasons. You can read more about the architecture of this here.

We currently have a release candidate of this table in the datalake that uses the table's old name: wmf_dumps.wikitext_raw_rc2. We intend to have the production table by end of Q2.
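
For what it's worth, a guess at what consuming the release candidate might look like (column names are assumed to resemble wmf.mediawiki_wikitext_history and may well differ in wikitext_raw_rc2; title prefix and underscore handling also need checking):

```
# Sketch only -- assumes wikitext_raw_rc2 has columns similar to wmf.mediawiki_wikitext_history
# (wiki_db, page_namespace, page_title, revision_timestamp, revision_text); the real schema may differ.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dr_pages_from_dumps2_rc").getOrCreate()

dr_revisions = spark.sql("""
    SELECT page_title, revision_timestamp, revision_text
    FROM wmf_dumps.wikitext_raw_rc2
    WHERE wiki_db = 'commonswiki'
      AND page_namespace = 4
      -- depending on the schema, titles may or may not carry the 'Commons:' prefix / use spaces
      AND page_title LIKE 'Commons:Deletion_requests/%'
""")
dr_revisions.show(5, truncate=80)
```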

There may be details I don't understand about properly consuming updates to that table, so I could be wrong, but I think the intention of having a consistent mediawiki_content_history is so that folks don't have to do roll their own reconcilliation.

We have it in the queue to experiment with a CDC-like mechanism built into Iceberg to consume changes made to mediawiki_content_history via T366544: Use the Spark-Iceberg built in CDC mechanism to PoC a replacement for wikimedia_wikitext_current, but we have not made progress on that yet.
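
For context, the Iceberg changelog ("CDC") mechanism being referred to looks roughly like this in Spark (a sketch only, not the T366544 work; it assumes Iceberg >= 1.2 and placeholder snapshot IDs):

```
# Sketch of Iceberg's built-in changelog mechanism; table name and snapshot IDs are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg_cdc_poc").getOrCreate()

# Materialise a changelog view between two snapshots of the content table.
spark.sql("""
    CALL spark_catalog.system.create_changelog_view(
        table => 'wmf_dumps.mediawiki_content_history',
        changelog_view => 'content_changes',
        options => map('start-snapshot-id', '1', 'end-snapshot-id', '2')
    )
""")

# Each row carries _change_type (INSERT / DELETE / UPDATE_BEFORE / UPDATE_AFTER).
spark.sql("SELECT _change_type, count(*) FROM content_changes GROUP BY _change_type").show()
```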


I don't think I completely follow the requirements for this work, so perhaps we should all talk again and align better.

As of 2024-12-18

Code for this is done and in review. To really finish it off:

  • we need an Airflow job to keep the data up to date (a generic DAG sketch follows this list)
  • we also need deployment/CI setup in the gitlab repo
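
As a shape-of-the-thing sketch only (the WMF airflow-dags repo has its own operators and conventions, so the real DAG will look different; paths, connection IDs and the schedule are placeholders):

```
# Generic sketch of a daily refresh job -- NOT the WMF airflow-dags convention.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="mediawiki_upload_tracking_daily",
    start_date=datetime(2024, 12, 1),
    schedule="@daily",
    catchup=True,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=30)},
) as dag:
    update_deletion_requests = SparkSubmitOperator(
        task_id="update_deletion_requests",
        application="hdfs:///path/to/upload_tracking_job.py",  # placeholder artifact path
        application_args=["--date", "{{ ds }}"],
        conn_id="spark_default",
    )
```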

I've run all the jobs, and the historical DRs/deletions for all upload methods and UW can be seen here

For later ... upload.upload_source is set from a monthly dump, because it's not available in the up-to-date data in wmf_dumps. We'd have a better view of the current state of things if change_tag data were available in wmf_dumps
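
For reference, deriving upload_source from the sqooped change_tag data might look roughly like this (the table names, snapshot handling and the 'uploadwizard' / 'cross-wiki-upload' tag names are all assumptions to verify):

```
# Rough sketch of deriving upload_source from monthly change_tag data.
# Assumptions: wmf_raw.mediawiki_change_tag / mediawiki_change_tag_def are sqooped copies of the
# MediaWiki tables, keyed by revision via ct_rev_id; tag names should be double-checked.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upload_source_from_change_tags").getOrCreate()
snapshot = "2024-11"  # hypothetical monthly snapshot

upload_sources = spark.sql(f"""
    SELECT ct.ct_rev_id AS upload_rev_id,
           CASE
               WHEN ctd.ctd_name = 'uploadwizard'      THEN 'UW'
               WHEN ctd.ctd_name = 'cross-wiki-upload' THEN 'cross-wiki'
               ELSE ctd.ctd_name
           END AS upload_source
    FROM wmf_raw.mediawiki_change_tag ct
    JOIN wmf_raw.mediawiki_change_tag_def ctd
      ON ctd.ctd_id = ct.ct_tag_id
     AND ctd.wiki_db = ct.wiki_db AND ctd.snapshot = ct.snapshot
    WHERE ct.wiki_db = 'commonswiki' AND ct.snapshot = '{snapshot}'
      AND ctd.ctd_name IN ('uploadwizard', 'cross-wiki-upload')
""")
```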

Hey @Cparle , @Ottomata tells me you might need HDFS deployment through CI/CD for this project? I just set up Blunderbuss which does exactly that, although we haven't gone fully public org-wise with it yet.

What exactly do you need deployed to HDFS?

Hi @amastilovic ... I'm not sure we need anything deployed to HDFS; I've already created the database we're writing to (mediawiki_upload_tracking, and an integration-testing db too, test_mediawiki_upload_tracking) - is there something else that needs to be done?