Create a partial SQL dump of the watchlist table that includes:
- wl_namespace
- wl_title
Version: unspecified
Severity: enhancement
Create a partial SQL dump of the watchlist table that includes:
Version: unspecified
Severity: enhancement
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Open | None | T140977 Dump public data from all sql tables that have mixed public/private data | |||
Open | Feature | None | T51133 Create partial SQL dump of watchlist table |
I don't know that we can do this for privacy reasons. I'd like to have other folks weigh in on this request.
(In reply to comment #1)
I don't know that we can do this for privacy reasons. I'd like to have other
folks weigh in on this request.
Do you mean that even with just those columns one could reverse-engineer who's the corresponding user?
I mean it's not even available via the api. Maybe one can get a 'list of all watched pages' (need to check that).
(In reply to comment #3)
I mean it's not even available via the api. Maybe one can get a 'list of all
watched pages' (need to check that).
I doubt it, otherwise it would make little sense to restrict [[Special:UnwatchedPages]] to sysops. If that's your concern,
Schema for the watchlist table currently: https://www.mediawiki.org/wiki/Manual:Watchlist_table
The only fields that might make sense to publish are wl_namespace and wl_title, as mentioned in the task description.
Can we make guesses about the user associated with a particular watchlist by looking at its entries and seeing who edited/created the articles? More generally, how might this data be used (or abused) if published? Adding @Reedy for the privacy/abuse angle, if you want to redirect me to someone else, feel free to do so and remove yourself.
Because I guess that someone could cause inappropriate information to be published in these dumps by watchlisting a non-existent page, we should check for existence of all titles before they get written out. https://www.mediawiki.org/wiki/Help:Watching_pages#Watching_a_nonexistent_page
A second proposed idea by @CCicalese_WMF , rather than dumping the ns/title pairs in watchlist groups, is to dump a global (per wiki) list of ns/title pairs with counts of the number of times they appear in watchlists on that wiki. I feel that this is less likely to leak information but I'd still be grateful for a review.
It won't work to de-anonymize every list owner, but correlating anonymized watchists with publicly available editing behavior would leak things. Non-article space watched pages are likely to be the easiest to correlate. Imagine the set of watchlists where the only User namespace pages watched is the owner's. Or the set which contain watches on workflow pages (like Steward's request boards) that will mostly be watched by folks who perform some duty related to the watched pages.
See also:
This idea seems much less problematic because this data is already available for any MediaWiki page via ?action=info and api.php?action=query&prop=info&inprop=watchers&titles=....
Adding the other Platform folks who will be following along on this bug: @hnowlan @Clarakosi @WDoranWMF @AMooney
This task hasn't been updated in a while, but a little group of people have been using it as an opportunity for knowledge sharing around the xml/sql dumps. Some very WIP patches will be showing up in gerrit soon.
Change 623879 had a related patch set uploaded (by Holger Knust; owner: Holger Knust):
[mediawiki/core@master] WIP: Create partial SQL dump of watchlist table
Change 625895 had a related patch set uploaded (by Holger Knust; owner: Holger Knust):
[operations/dumps@master] WIP: Add new watchlist job
Change 627429 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] fix up order of args to merge() in watchlist dumps
Change 627443 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] fix variable expansion in watchlist dumps script
Change 627444 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] use chunksize arg in watchlist dump script
Change 627445 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] fix up path to Maintenance.php in watchlist dump script
Change 627468 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] use DB_REPLICA instead of DB_MASTER while dumping watchlist
Change 627471 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] stop dumping rows when we have reachd MAX(wl_id) for watchlist dumps
Change 627472 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] more error checks/cleanup for intermediate files for watchlist dumps
Change 627474 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] add options for output file name and paths to sort, awk for watchlist dumps
Change 627814 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] fixups for dumping aggregated watchlist info
Change 627445 abandoned by ArielGlenn:
[mediawiki/core@master] fix up path to Maintenance.php in watchlist dump script
Reason:
incorporated into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /627814
Change 627429 abandoned by ArielGlenn:
[mediawiki/core@master] fix up order of args to merge() in watchlist dumps
Reason:
incorporated into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879
Change 627468 abandoned by ArielGlenn:
[mediawiki/core@master] use DB_REPLICA instead of DB_MASTER while dumping watchlist
Reason:
incorporated into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879
Change 627471 abandoned by ArielGlenn:
[mediawiki/core@master] stop dumping rows when we have reachd MAX(wl_id) for watchlist dumps
Reason:
incorporated into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879
Change 627443 abandoned by ArielGlenn:
[mediawiki/core@master] fix variable expansion in watchlist dumps script
Reason:
incorporated into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879
Change 627444 abandoned by ArielGlenn:
[mediawiki/core@master] use chunksize arg in watchlist dump script
Reason:
incorporated into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879
Change 629099 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] fixups to make jenkins happy, for watchlist dumps
Change 629099 abandoned by ArielGlenn:
[mediawiki/core@master] fixups to make jenkins happy, for watchlist dumps
Reason:
squashed into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879
Change 627814 abandoned by ArielGlenn:
[mediawiki/core@master] fixups for dumping aggregated watchlist info
Reason:
incorporated in https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879/13
Change 627472 abandoned by ArielGlenn:
[mediawiki/core@master] more error checks/cleanup for intermediate files for watchlist dumps
Reason:
squashed into https://gerrit.wikimedia.org/r/c/mediawiki/core/ /623879
Change 627474 abandoned by ArielGlenn:
[mediawiki/core@master] add option for output file name for watchlist dump script
Reason:
now included in I0573053d0e752bdcaf459292bca7d7aabae26a02
Things that should be done before the script is considered to be in final form:
Some of these are more in the nature of "one fnal pass to make sure they are ok", and some (eg. gzip output) are not done items.
Change 650472 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] watchlist dump: optionally gzip compress output if output file ends in ".gz"
Change 650472 abandoned by Holger Knust:
[mediawiki/core@master] watchlist dump: optionally gzip compress output if output file ends in ".gz"
Reason: