Page MenuHomePhabricator

Create partial SQL dump of watchlist table
Open, MediumPublic

Description

Create a partial SQL dump of the watchlist table that includes:

  • wl_namespace
  • wl_title

Version: unspecified
Severity: enhancement

Event Timeline

bzimport raised the priority of this task from to Low.Nov 22 2014, 1:47 AM
bzimport set Reference to bz49133.

I don't know that we can do this for privacy reasons. I'd like to have other folks weigh in on this request.

(In reply to comment #1)

I don't know that we can do this for privacy reasons. I'd like to have other
folks weigh in on this request.

Do you mean that even with just those columns one could reverse-engineer who's the corresponding user?

I mean it's not even available via the api. Maybe one can get a 'list of all watched pages' (need to check that).

(In reply to comment #3)

I mean it's not even available via the api. Maybe one can get a 'list of all
watched pages' (need to check that).

I doubt it, otherwise it would make little sense to restrict [[Special:UnwatchedPages]] to sysops. If that's your concern,

  1. it's not about privacy,
  2. I'm not sure it affects the anti-vandalism rationale as extracting that data would still be a non-trivial effort not worth the minor gain a vandal would get from it.
Nemo_bis raised the priority of this task from Low to Medium.Apr 9 2015, 7:08 AM
Nemo_bis set Security to None.

Schema for the watchlist table currently: https://www.mediawiki.org/wiki/Manual:Watchlist_table

The only fields that might make sense to publish are wl_namespace and wl_title, as mentioned in the task description.

Can we make guesses about the user associated with a particular watchlist by looking at its entries and seeing who edited/created the articles? More generally, how might this data be used (or abused) if published? Adding @Reedy for the privacy/abuse angle, if you want to redirect me to someone else, feel free to do so and remove yourself.

Because I guess that someone could cause inappropriate information to be published in these dumps by watchlisting a non-existent page, we should check for existence of all titles before they get written out. https://www.mediawiki.org/wiki/Help:Watching_pages#Watching_a_nonexistent_page

A second proposed idea by @CCicalese_WMF , rather than dumping the ns/title pairs in watchlist groups, is to dump a global (per wiki) list of ns/title pairs with counts of the number of times they appear in watchlists on that wiki. I feel that this is less likely to leak information but I'd still be grateful for a review.

Can we make guesses about the user associated with a particular watchlist by looking at its entries and seeing who edited/created the articles? More generally, how might this data be used (or abused) if published?

It won't work to de-anonymize every list owner, but correlating anonymized watchists with publicly available editing behavior would leak things. Non-article space watched pages are likely to be the easiest to correlate. Imagine the set of watchlists where the only User namespace pages watched is the owner's. Or the set which contain watches on workflow pages (like Steward's request boards) that will mostly be watched by folks who perform some duty related to the watched pages.

See also:

A second proposed idea by @CCicalese_WMF , rather than dumping the ns/title pairs in watchlist groups, is to dump a global (per wiki) list of ns/title pairs with counts of the number of times they appear in watchlists on that wiki. I feel that this is less likely to leak information but I'd still be grateful for a review.

This idea seems much less problematic because this data is already available for any MediaWiki page via ?action=info and api.php?action=query&prop=info&inprop=watchers&titles=....

Adding the other Platform folks who will be following along on this bug: @hnowlan @Clarakosi @WDoranWMF @AMooney

This task hasn't been updated in a while, but a little group of people have been using it as an opportunity for knowledge sharing around the xml/sql dumps. Some very WIP patches will be showing up in gerrit soon.

Change 623879 had a related patch set uploaded (by Holger Knust; owner: Holger Knust):
[mediawiki/core@master] WIP: Create partial SQL dump of watchlist table

https://gerrit.wikimedia.org/r/623879

Change 625895 had a related patch set uploaded (by Holger Knust; owner: Holger Knust):
[operations/dumps@master] WIP: Add new watchlist job

https://gerrit.wikimedia.org/r/625895

Change 627429 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] fix up order of args to merge() in watchlist dumps

https://gerrit.wikimedia.org/r/627429

Change 627443 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] fix variable expansion in watchlist dumps script

https://gerrit.wikimedia.org/r/627443

Change 627444 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] use chunksize arg in watchlist dump script

https://gerrit.wikimedia.org/r/627444

Change 627445 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] fix up path to Maintenance.php in watchlist dump script

https://gerrit.wikimedia.org/r/627445

Change 627468 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] use DB_REPLICA instead of DB_MASTER while dumping watchlist

https://gerrit.wikimedia.org/r/627468

Change 627471 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] stop dumping rows when we have reachd MAX(wl_id) for watchlist dumps

https://gerrit.wikimedia.org/r/627471

Change 627472 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] more error checks/cleanup for intermediate files for watchlist dumps

https://gerrit.wikimedia.org/r/627472

Change 627474 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] add options for output file name and paths to sort, awk for watchlist dumps

https://gerrit.wikimedia.org/r/627474

Change 627814 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] fixups for dumping aggregated watchlist info

https://gerrit.wikimedia.org/r/627814

Change 627445 abandoned by ArielGlenn:
[mediawiki/core@master] fix up path to Maintenance.php in watchlist dump script

Reason:
incorporated into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /627814

https://gerrit.wikimedia.org/r/627445

Change 627429 abandoned by ArielGlenn:
[mediawiki/core@master] fix up order of args to merge() in watchlist dumps

Reason:
incorporated into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879

https://gerrit.wikimedia.org/r/627429

Change 627468 abandoned by ArielGlenn:
[mediawiki/core@master] use DB_REPLICA instead of DB_MASTER while dumping watchlist

Reason:
incorporated into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879

https://gerrit.wikimedia.org/r/627468

Change 627471 abandoned by ArielGlenn:
[mediawiki/core@master] stop dumping rows when we have reachd MAX(wl_id) for watchlist dumps

Reason:
incorporated into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879

https://gerrit.wikimedia.org/r/627471

Change 627443 abandoned by ArielGlenn:
[mediawiki/core@master] fix variable expansion in watchlist dumps script

Reason:
incorporated into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879

https://gerrit.wikimedia.org/r/627443

Change 627444 abandoned by ArielGlenn:
[mediawiki/core@master] use chunksize arg in watchlist dump script

Reason:
incorporated into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879

https://gerrit.wikimedia.org/r/627444

Change 629099 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] fixups to make jenkins happy, for watchlist dumps

https://gerrit.wikimedia.org/r/629099

Change 629099 abandoned by ArielGlenn:
[mediawiki/core@master] fixups to make jenkins happy, for watchlist dumps

Reason:
squashed into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879

https://gerrit.wikimedia.org/r/629099

Change 627814 abandoned by ArielGlenn:
[mediawiki/core@master] fixups for dumping aggregated watchlist info

Reason:
incorporated in https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879/13

https://gerrit.wikimedia.org/r/627814

Change 627472 abandoned by ArielGlenn:
[mediawiki/core@master] more error checks/cleanup for intermediate files for watchlist dumps

Reason:
squashed into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879

https://gerrit.wikimedia.org/r/627472

Change 627474 abandoned by ArielGlenn:
[mediawiki/core@master] add option for output file name for watchlist dump script

Reason:
now included in I0573053d0e752bdcaf459292bca7d7aabae26a02

https://gerrit.wikimedia.org/r/627474

Things that should be done before the script is considered to be in final form:

  • Address any last code review comments
  • Decide whether to include the watchlist dumps in the xml/sql dumps or the weekly "other" dumps
  • Add the ability to produce compressed output from the script. Since it writes output files and moves them around, adding a pipeline to gzip is not feasible.
  • Inline documentation of classes and methods, for our future selves when we have to maintain this script
  • Large intermediate files should be written gzip compressed, since we're writing over the network (NFS)
  • Unit tests or integration test
  • Look at what happens on a rerun in case of a first failure (do we reuse good intermediate files, toss the bad ones?)
  • Check with $someone about whether use of linux-y commands sort/awk is ok for a maintenance script

Some of these are more in the nature of "one fnal pass to make sure they are ok", and some (eg. gzip output) are not done items.

Change 650472 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] watchlist dump: optionally gzip compress output if output file ends in ".gz"

https://gerrit.wikimedia.org/r/650472

Change 650472 abandoned by Holger Knust:
[mediawiki/core@master] watchlist dump: optionally gzip compress output if output file ends in ".gz"

Reason:

https://gerrit.wikimedia.org/r/650472