Make watchlist table available as curated foo_p.watchlist_count on labsdb
Status: Stalled, Priority: Normal, Visibility: Public

Description

The dewiki_p.watchlist table is missing from the labs DB. The Toolserver had a sanitized view of the watchlist table, so please add the same view to labs.

*edit*

In the previous incarnation of maintain-views we had a customview defined for watchlist

https://gerrit.wikimedia.org/r/#/c/225218/3/maintain-replicas/maintain-replicas.pl

'watchlist_counts' => {
    'source' => 'watchlist',
    'view'   => 'select count(*) as wl_count, wl_namespace, wl_title',
    'group'  => 'wl_namespace, wl_title having wl_count >= 30', },
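Joining the `source`, `view`, and `group` fragments above back into one statement, the old customview amounted to a GROUP BY ... HAVING filter over watchlist. A runnable sketch of that definition (SQLite as a stand-in for MariaDB; the sample rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE watchlist (wl_user INTEGER, wl_namespace INTEGER, wl_title TEXT)")

# Invented sample data: 31 users watch Main_Page, only 2 watch Sandbox.
conn.executemany("INSERT INTO watchlist VALUES (?, 0, 'Main_Page')",
                 [(u,) for u in range(31)])
conn.executemany("INSERT INTO watchlist VALUES (?, 0, 'Sandbox')",
                 [(u,) for u in range(2)])

# The customview from maintain-replicas.pl, with its fragments reassembled:
# the HAVING >= 30 threshold hides rarely watched pages.
conn.execute("""
    CREATE VIEW watchlist_counts AS
    SELECT COUNT(*) AS wl_count, wl_namespace, wl_title
    FROM watchlist
    GROUP BY wl_namespace, wl_title
    HAVING wl_count >= 30
""")

print(conn.execute("SELECT * FROM watchlist_counts").fetchall())
# Only titles with at least 30 watchers appear: [(31, 0, 'Main_Page')]
```

The threshold is the privacy mechanism: pages with few watchers are more likely to identify individual users, so they are simply filtered out of the view.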

@jcrespo created a mechanism to curate the watchlist table at the sanitarium layer into a table called watchlist_count, and a view was created for some wikis (maybe only enwiki?) at that time:

https://phabricator.wikimedia.org/T138450#2401133

Because the watchlist_count table is already curated and "safe", we can add it to the fullviews array for maintain-views (then maintain-replicas) without fear. This makes the handling of this table consistent with the current process. @Krenair added it to the JSON config for the then-candidate maintain-replicas.pl replacement here:

https://github.com/wikimedia/operations-software/commit/c886c32dcd94fb8937b016638036a1a6322ea95a

This has been carried over to the current iteration of maintain-views (sourced from Krenair's rewrite) here: https://github.com/wikimedia/operations-puppet/blob/production/modules/role/files/labsdb/maintain-views.py

The config can be seen here:

https://github.com/wikimedia/operations-puppet/blob/production/modules/role/templates/labsdb/maintain-views.yaml#L137

As it stands, we need to run the current maintain-views across all labs wiki replica DBs to pick up the watchlist_count view where applicable.

Details

Reference
bz57617

Related Objects

jcrespo moved this task from Triage to Backlog on the DBA board. Dec 9 2015, 10:01 AM

@jcrespo The reason this is "hard" is that it involves replication filters and requires a MySQL restart (downtime), so it has to be properly scheduled. This is a reminder from your past self so you do not lose time again wondering why this is not yet done.

@jcrespo Can we get any update? It's been some months, and I have a few tools that could use this property.

Yes, although I have not updated it here: in order to correct several data issues and to populate the new labs server, all tables are being reloaded right now. It will take weeks, however, to go over all wikis.

So this is in progress right now.

My ETA is that enwiki will be available in ~4 weeks; other wikis will follow in the weeks after. It is going slowly, but it will have two advantages: 1) all data will be imported fresh from production, so potential drift will disappear, and 2) we will have increased capacity in the end (wait for the announcement).

jcrespo moved this task from Backlog to Next on the DBA board. Apr 1 2016, 5:59 PM
jcrespo moved this task from Next to In progress on the DBA board. May 4 2016, 9:50 AM

This task is blocking https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Most-watched_users from being updated.

I'm excited to read that this task is now in progress.

Enwiki will take a bit longer than I thought: there was the failover, then there were multiple issues on db1069, but I think they are now fixed (T134349), and the import is back on track.

jcrespo renamed this task from "watchlist table not available on labs" to "Make watchlist table available on labs". May 10 2016, 3:45 PM

After some thinking, and given my thoughts on why labs breaks so easily (T136618#2356834), I think this is one of the cases that would fail to be replicated accurately. Because watchlist is a table whose update operations are all keyed on user id (wl_user), and that is precisely one of the fields to anonymize, it is not possible to securely propagate changes without exposure to a potential data leak (which pages a user has watched).

The only way I can think of to make this work would be a cron job that, every so often, calculates the summary table and writes it to labs. This is not immediate; it requires more work than just importing a table or creating a trigger, and we have to evaluate how long a single summary job takes to run.

I will start by creating a new table manually and see what the possibilities are to make this work.
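The cron-based idea can be sketched as a rebuild-then-swap job. This is only an illustration of the approach described in the comment (SQLite stand-in; the staging-table name and the swap step are my assumptions, not the production script):

```python
import sqlite3

def regenerate_watchlist_count(conn, threshold=30):
    """Rebuild the summary from scratch into a staging table, then swap it
    into place so readers never see a half-populated watchlist_count."""
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS watchlist_count_new")
    # The threshold hides rarely watched (more identifying) pages.
    cur.execute(
        "CREATE TABLE watchlist_count_new AS "
        "SELECT COUNT(*) AS watchers, wl_namespace, wl_title "
        "FROM watchlist "
        "GROUP BY wl_namespace, wl_title "
        f"HAVING watchers >= {int(threshold)}"
    )
    cur.execute("DROP TABLE IF EXISTS watchlist_count")
    cur.execute("ALTER TABLE watchlist_count_new RENAME TO watchlist_count")
    conn.commit()

# Demo on invented data: 40 watchers of one page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE watchlist (wl_user INTEGER, wl_namespace INTEGER, wl_title TEXT)")
conn.executemany("INSERT INTO watchlist VALUES (?, 0, 'Main_Page')",
                 [(u,) for u in range(40)])
regenerate_watchlist_count(conn)
print(conn.execute("SELECT * FROM watchlist_count").fetchall())
# [(40, 0, 'Main_Page')]
```

On MariaDB the drop-then-rename gap could be closed with an atomic `RENAME TABLE watchlist_count TO old, watchlist_count_new TO watchlist_count`; SQLite has no multi-table rename, hence the two steps here.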

The Toolserver previously used MySQL views for this, as I understand it. Why has completely excluding a column from a table view become more complex?

A manually updated table would be better than no table.

The Toolserver previously used MySQL views for this

An escalation-of-privileges vulnerability would expose private user data. There is also the threat of someone sniffing the changes sent to the server on the wire, as labs is a more open network than our private one. We cannot risk that: private data must stay on production machines only, and the only semi-private data that can be on labs, hidden with views, are records (not columns) that are already accessible to users with higher on-wiki permissions or that have been public in the past. E.g., password hashes should never be sent to labs, while deleted revisions can be filtered on labs. I would consider watchlists as sensitive as passwords/IPs regarding privacy. While I do not have examples of this happening in the past, there have been several incidents that, had they all happened at the same time, could have led to a data leak.

A better solution would be to regenerate that table automatically each week, each day, or each hour, whatever is technically possible. It will lag by hour(s), but it can at least be exposed securely. I do not think there would be a huge impact on the ability to use that table in any case.

scfc added a comment. Jun 11 2016, 6:05 AM

There is already the potential for a privilege vulnerability that would expose private user data: If someone hacks MediaWiki, they can execute arbitrary SQL queries. We trust MediaWiki to protect the "views" it offers to the public, so trusting MariaDB to protect its views does not seem to be fundamentally different.

I don't mind the current Sanitarium setup, and WMF may have policies that require the setup to be as it is, but it is not the only rational solution to making the (public) data available to Labs users.

For this task, the periodic creation of a table with tuples (wl_namespace, wl_title, wl_labsdb_count) would be simple and effective; it might be technically interesting to update the counts with triggers in real time, but I have never seen @jcrespo idling, so that should wait :-).
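For the record, the trigger variant mentioned above could look roughly like this. It is only a sketch in SQLite, using the (wl_namespace, wl_title, wl_labsdb_count) tuple from the comment; whether per-row triggers would scale to enwiki's watchlist is exactly the open question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE watchlist (wl_user INTEGER, wl_namespace INTEGER, wl_title TEXT);
CREATE TABLE watchlist_count (
    wl_namespace INTEGER, wl_title TEXT, wl_labsdb_count INTEGER,
    PRIMARY KEY (wl_namespace, wl_title)
);

-- Keep the per-page count in step with every insert/delete on watchlist.
CREATE TRIGGER wl_ins AFTER INSERT ON watchlist BEGIN
    INSERT OR IGNORE INTO watchlist_count VALUES (NEW.wl_namespace, NEW.wl_title, 0);
    UPDATE watchlist_count SET wl_labsdb_count = wl_labsdb_count + 1
    WHERE wl_namespace = NEW.wl_namespace AND wl_title = NEW.wl_title;
END;

CREATE TRIGGER wl_del AFTER DELETE ON watchlist BEGIN
    UPDATE watchlist_count SET wl_labsdb_count = wl_labsdb_count - 1
    WHERE wl_namespace = OLD.wl_namespace AND wl_title = OLD.wl_title;
END;
""")

conn.execute("INSERT INTO watchlist VALUES (1, 0, 'Main_Page')")
conn.execute("INSERT INTO watchlist VALUES (2, 0, 'Main_Page')")
conn.execute("DELETE FROM watchlist WHERE wl_user = 2")
print(conn.execute("SELECT * FROM watchlist_count").fetchall())
# [(0, 'Main_Page', 1)]
```

Note that a real-time variant would still need the >= 30 style threshold applied at view time, since the raw per-page counts below the threshold are what the sanitization is meant to hide.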

Krd removed a subscriber: Krd. Jun 11 2016, 6:17 AM

There is already the potential for a privilege vulnerability that would expose private user data: If someone hacks MediaWiki, they can execute arbitrary SQL queries. We trust MediaWiki to protect the "views" it offers to the public, so trusting MariaDB to protect its views does not seem to be fundamentally different.

Let me start, semi-jokingly, by saying that as a "sufferer" of MediaWiki code, and one of its largest critics, I trust MediaWiki's code security more than MariaDB's :-) MySQL has a long history of vulnerabilities, and even the host where it runs presents a wider attack surface; while we have dozens of MediaWiki hackers and many other security volunteers able to respond quickly to MediaWiki 0-days (plus a lot of production monitoring), we have a part-time DBA to handle the labsdbs.

While there is always a "what if..." scenario, the distinction is clear: almost everybody has access to the labs network; almost nobody has access to the production network. The two are separated, and the intention is to keep them separated both physically and data-wise. While we cannot defend ourselves from bad code, we can from the infrastructure and network point of view. While I have not seen many reports of the production network being compromised, the labs one is, by definition, insecure: it allows arbitrary code execution and free access to all. MediaWiki may be secure or not (I am inclined to think it is), but unlike labs hosts, it does not allow arbitrary SQL execution.

We used to be really bad at security, but now there are a lot of clever people making sure we improve in this respect, and we should follow their recommendations. Full transparency and data privacy sometimes collide, but I know that in cases like this, privacy and security have to win.

However, I do not rule out having something in the future that could get updates in real time. I think a first approach is having something useful and automatic ASAP, and we can improve it later. The first tests I started doing tell me that we could have a lag of only 1-2 hours, so maybe it wouldn't be that bad for a start, would it?

Expect more news soon.

jcrespo added a comment (edited). Jun 11 2016, 1:25 PM

There is a huge improvement from this being a real table rather than a view: it has usable indexes, so now you can do:

MariaDB  labsdb1001 enwiki_p > SELECT * FROM watchlist_count ORDER BY watchers DESC LIMIT 10;
+----------+--------------+---------------+
| watchers | wl_namespace | wl_title      |
+----------+--------------+---------------+
|   108730 |            1 | Main_Page     |
|   108730 |            0 | Main_Page     |
|    16923 |            5 | Sandbox       |
|    16923 |            4 | Sandbox       |
|    16663 |            5 | Introduction  |
|    16663 |            4 | Introduction  |
|    16255 |          119 | Lea_Luboshutz |
|    16255 |          118 | Lea_Luboshutz |
|    16253 |            1 | Lea_Luboshutz |
|    16253 |            0 | Lea_Luboshutz |
+----------+--------------+---------------+
10 rows in set (0.00 sec)

And it will take 0 seconds instead of half an hour doing stupid and unnecessary filesorts.

Yes, the query above is real. I've created a proof of concept for you to play with. Please do not create tools based on it; this is just for you to tell me whether it is useful. It is not updated right now, and the table name, the columns, the indexes, and anything else about it could change at any time.

Please provide feedback on whether this is useful and whether you see any security concerns, and I will put it on all hosts, on all wikis, and make it update automatically.
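The speedup described here comes from the summary being a real table that can carry an index on the count column, which a GROUP BY view cannot. A small sketch of the effect (SQLite; the index name is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE watchlist_count "
    "(watchers INTEGER, wl_namespace INTEGER, wl_title TEXT)"
)
# Hypothetical index: with it, the top-N query walks the index backwards
# instead of sorting (filesorting, in MySQL terms) the whole table.
conn.execute("CREATE INDEX wc_watchers ON watchlist_count (watchers)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM watchlist_count ORDER BY watchers DESC LIMIT 10"
).fetchall()
print(plan)  # on SQLite this should show the wc_watchers index being scanned
```

With a plain view over watchlist, the same ORDER BY would have to aggregate and sort every row on each query, which is the half-hour filesort the comment refers to.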

Please provide feedback on whether this is useful and whether you see any security concerns, and I will put it on all hosts, on all wikis, and make it update automatically.

Very nice! It took 4.576 seconds to update https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Most-watched_users. I made a few formatting tweaks to the report and re-ran the script and it took 3.940 seconds. No complaints here. :-) These numbers include logging in to the wiki and editing the page, meaning that the query returned in less than 4 seconds both times.

I now have the watchlist count generator running on all public wikis. I will measure how long a second run takes (after indexes are hot) on all wikis to establish the refresh frequency, and puppetize it. However, for this to become public it requires T135029 (the maintain-replicas refresh, not adywiki in particular).

The production watchlist table contains (wl_namespace, wl_title) entries that do not currently have corresponding (page_namespace, page_title) entries in the page table. Can deleted/non-existent pages be included in watchlist_count?

Example: https://en.wikipedia.org/w/index.php?title=How_wikis_work&action=info#mw-pageinfo-watchers has 1,952 page watchers, but this page does not currently exist.

Hmmm, never mind.

MariaDB [enwiki_p]> select * from watchlist_count where wl_title = 'How_wikis_work' and wl_namespace in (0, 1);
+----------+--------------+----------------+
| watchers | wl_namespace | wl_title       |
+----------+--------------+----------------+
|     1952 |            0 | How_wikis_work |
|     1952 |            1 | How_wikis_work |
+----------+--------------+----------------+
2 rows in set (0.00 sec)

I guess my report changed for some other reason.

I guess my report changed for some other reason.

There was a join with the page table due to page-move vandalism, presumably: https://en.wikipedia.org/w/index.php?title=Wikipedia:Database_reports/Most-watched_pages_by_namespace&oldid=726115449. I'll stop spamming this task now.

I now have the watchlist count generator running on all public wikis.

Someone just asked me about generating a watchlist-related database report for fa.wikipedia.org.

mzmcbride@tools-bastion-03:~$ mysql -henwiki.labsdb enwiki_p -e "describe watchlist_count;" 
+--------------+----------------+------+-----+---------+-------+
| Field        | Type           | Null | Key | Default | Extra |
+--------------+----------------+------+-----+---------+-------+
| watchers     | bigint(21)     | NO   |     | 0       |       |
| wl_namespace | int(11)        | NO   |     | 0       |       |
| wl_title     | varbinary(255) | NO   |     |         |       |
+--------------+----------------+------+-----+---------+-------+
mzmcbride@tools-bastion-03:~$ mysql -hfawiki.labsdb fawiki_p -e "describe watchlist_count;" 
ERROR 1146 (42S02) at line 1: Table 'fawiki_p.watchlist_count' doesn't exist

:-(

The tables are actually generated physically, but work is being done right now by labs engineers to make them available on all wikis. Stay tuned. CC @chasemp

chasemp renamed this task from "Make watchlist table available on labs" to "Make watchlist table available as curated foo_p.watchlist_count on labsdb". Oct 18 2016, 3:34 PM
chasemp updated the task description.

I think with T148560 this is achievable.

jcrespo removed jcrespo as the assignee of this task. Oct 25 2016, 8:05 AM
Dispenser added a comment (edited). Oct 26 2016, 4:01 PM

Several shortcomings compared to the previous implementation:

  1. No active watcher count (visitingwatchers, i.e. logged in past 30 days)
  2. No support for redlinks, see Most watched RfAs (Also supported by the API)
  3. Lower thresholds for User talk pages and my unwatched changes tools
  4. Inability to adjust visitingwatchers/user activity cutoff. Useful in generating statistics and prioritizing anti-vandalism.

Also, sometimes we have users stealing watchers, skewing counts.

This is almost done: tables are physically generated; we need to check that the views are, too, and we need to puppetize the creation script on sanitarium.

@Dispenser Thank you for the comment. I would like you to create a separate task or tasks for improvements to the existing table. As far as I know, the current structure was approved by the legal team (that is what I was told by the labs team at the time) as not leaking any PII. That doesn't mean it can never be changed (we are indeed very open to suggestions for improvement), but I believe such changes may need to be evaluated and handled separately, and probably approved again.

This is almost done (the table is available on the current labsdb servers), but to close it we need to:

  • Puppetize it, keeping in mind there may be other tables like this in the future
  • Add a table with "last updated" information
  • Add it to the new labsdb servers

This is actually done on all wikis, but for some reason there is a bug and the views have not been regenerated.

$ mysql --skip-ssl -h labsdb1001.eqiad.wmnet frwiki -e "SELECT * FROM watchlist_count ORDER BY watchers DESC LIMIT 1"
+----------+--------------+-------------------+
| watchers | wl_namespace | wl_title          |
+----------+--------------+-------------------+
|     8696 |            5 | Accueil_principal |
+----------+--------------+-------------------+

@chasemp do we need to run something even if the view definition has been added?

@jcrespo, for now the runs are not automated. I think this is good to go now; the command used was:

maintain-views  --all-databases --table watchlist_count --replace-all

@chasemp Thank you very much! I think I can run that on my own, at least for frwiki.

Oh, sorry, I misunderstood. You ran it already.

@MZMcBride can you test it, even if T59617#2893932 is still pending?

$ mysql --skip-ssl frwiki_p -e "SELECT * FROM watchlist_count ORDER BY watchers DESC LIMIT 1"
+----------+--------------+-------------------+
| watchers | wl_namespace | wl_title          |
+----------+--------------+-------------------+
|     8696 |            5 | Accueil_principal |
+----------+--------------+-------------------+

@MZMcBride can you test it, even if T59617#2893932 is still pending?

Yay, it works!

Output from frwiki_p: https://en.wikipedia.org/wiki/Special:Permalink/761938232.

The report takes less than five seconds to generate.

jcrespo changed the task status from Open to Stalled. Mar 3 2017, 9:51 AM

This is stalled. It is definitely going to happen, but we cannot find the time to do it. The scripts are done; we just need to puppetize them and create a specific account for this and other similar summary tables in a secure way. For now, it is not a huge blocker because the script is running and the tables are being regenerated.

Change 375349 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add script to generate watchlist_count table on labs

https://gerrit.wikimedia.org/r/375349

Marostegui moved this task from Next to Backlog on the DBA board. Jan 11 2018, 11:38 AM
kaldari added a subscriber: kaldari (edited). May 3 2018, 10:42 PM

FYI, this is needed for T193086.

@kaldari Counting is available to be done on production (or anywhere, through the API). There is no such infrastructure set up to handle custom tables on the wiki replicas, and right now there are no resources to work on that, so this is stalled indefinitely until someone can help with code. The functionality is needed, but I just want to make sure expectations are clear about when it will be done (this is not scheduled right now). Once there is a common framework to allow custom tables with proper privacy checking, adding extra tables should be trivial. Ideally, volunteers should be able to propose new tables to add. There is https://gerrit.wikimedia.org/r/375349, but that is now outdated and needs more work.

@jcrespo: Thanks for the info. This tool will be running on Tool Forge. It may be possible for us to use the API for this, although that may not scale for this particular project. In some cases we may need to get watcher info for thousands of pages at once (e.g. an image used in a common template). The feature is not high priority, so we may just decline it for now.

@Framawiki notes that the "watchlist_count" table has gone missing at https://en.wikipedia.org/w/index.php?title=User_talk:MZMcBride&oldid=846197628#About_watchlist_table_on_replicas. I can confirm:

tools.mzmcbride@tools-bastion-03:~/scripts/enwiki$ mysql -henwiki.labsdb enwiki_p -e 'show tables like "w%";'
+-------------------------+
| Tables_in_enwiki_p (w%) |
+-------------------------+
| wbc_entity_usage        |
| wikilove_log            |
+-------------------------+

My guess is that some kind of database reimport or database host rebuild happened?

It looks like we have a generate_watchlist_count.sh script. Does that need to be re-run? If so, are @Marostegui and @jcrespo the only two who can do it or can any shell user?

This ticket is open because this was never done properly: we lack the code infrastructure to generate user-defined summary tables. The script was a one-time thing I did; it will not work without proper puppetization and generalization/abstraction, so it is not just a question of running it once. I set the ticket as stalled and "help wanted" because I do not have the time at the moment to solve it correctly so that it doesn't break anymore. This is an ambitious problem that requires help; the script may serve, however, as the idea of what is to be done. If someone wants to fix and puppetize it before I find the time, I can deploy it quickly. E.g., the idea is to create a process so that, in the future, summary tables are something that can be 100% (or almost 100%) user-controlled.

Ouch, I just found that my bot was relying on it to create reports too :)