Recover Echo Notification Blacklist from Backup
Closed, Invalid · Public

Description

In T177825 it was discovered (see T177825#3685628) that the maintenance script was run more than once (I vaguely recall this happening during deployment, but I thought the failure occurred before any rows were processed). The script, it appears, was not meant to handle the preference in a mixed state (i.e. some values containing usernames, others containing user ids).

We need three things in order to restore the blocklists:

  1. Recover the preference from the backup for only the users who have a 0, or a series of 0's separated by newline characters (there could be more than two), i.e. only users who have not already updated their blacklist (see the examples after this list).
  2. Update the maintenance script to support the mixed state; figure out why it failed and prevent it from happening again.
  3. Apply the data to production and run the updated maintenance script.
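For illustration only (these values are made up, not taken from the backup), the up_value blob for this preference can be in one of three states; step 1 targets the last form:

printf 'ExampleUserA\nExampleUserB'   # pre-migration: newline-separated usernames
printf '12345\n67890'                 # post-migration: newline-separated central user ids
printf '0\n0'                         # corrupted: a 0 for every name that failed to resolve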

Event Timeline

dbarratt updated the task description.

"Recover the preference from the backup "

Do you have a timestamp and a query for me?

"Recover the preference from the backup "

Do you have a timestamp and a query for me?

Here's a query to retrieve the user ids of users who have a bad preference:

SELECT up_user FROM user_properties WHERE up_property = 'echo-notifications-blacklist' AND up_value REGEXP '^(0|0\n)+$';

and this is a query to get the pref:

SELECT up_value FROM user_properties WHERE up_property = 'echo-notifications-blacklist';

Obviously this should be restricted to the list of up_user values taken from prod, and the recovered values will obviously need to be updated in prod under the same up_property.
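As a sketch of that cross-referencing (assuming the recovered table were loaded next to production under a hypothetical name such as user_properties_backup; the real restore may work differently), something like this would pull the backed-up values only for users who currently have zeroes in prod:

mysql enwiki -e "SELECT b.up_user, b.up_value
  FROM user_properties_backup b
  JOIN user_properties p ON p.up_user = b.up_user AND p.up_property = b.up_property
  WHERE b.up_property = 'echo-notifications-blacklist'
    AND p.up_value REGEXP '^(0|0\n)+$';"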

It was run during the "Morning SWAT" on Monday, October 09
https://wikitech.wikimedia.org/wiki/Deployments#Monday.2C.C2.A0October.C2.A009
How would I get a more precise timestamp for you?

Can you tell me a query that was run, so there is a hard limit per wiki? E.g. 'not recover after "X query" was run'. I can search the exact SQL queries on the binary log, but I need something to work with. What is the first write you know it was run, that touched things? Was it run on all wikis at the same time?

We have full backups from 4 October and 10 October, but they come from a delayed slave, so I am trying to decide which ones would be more useful. We can roll forward some of them, but it is quite time consuming.

This is the script that was executed:
https://phabricator.wikimedia.org/diffusion/ECHO/browse/master/maintenance/updatePerUserBlacklist.php
Here is the UPDATE query that was executed:
https://phabricator.wikimedia.org/diffusion/ECHO/browse/master/maintenance/updatePerUserBlacklist.php;65a61a91411464a5ec8d74eb71a87e3e9ed988c3$71

You could also search for any of the $this->output() text in the logs.

And yes, I believe it was run on all wikis that had Echo enabled at that time (but as I mentioned, it was also run more than once).

Thanks, https://phabricator.wikimedia.org/diffusion/ECHO/browse/master/maintenance/updatePerUserBlacklist.php;65a61a91411464a5ec8d74eb71a87e3e9ed988c3$71 helps.

I am for now exporting the state of the tables on the 4th Oct backup, and I will see from that how to proceed.

@jcrespo What I think (hope) you will find is two groups of UPDATE user_properties SET up_value=X WHERE up_user=N AND up_property='echo-notifications-blacklist'; queries, first a small one and then later a bigger one, on 2017-10-09 between 18:00 and 19:00 UTC. The first group should be small (~1300 for enwiki and at most a couple thousand across all wikis in total), the second group should be larger. In the second (bigger) group of queries, some values of X (values that up_value is set to) will look like '0' or '0\n0' or '0\n0\n0' etc. For those users, there should also be non-zero values set in the first round, and those are the values David will need. Cross-referencing this is probably a lot of effort, so you can also cross-reference it against users that have zeroes currently (with David's REGEXP query), or just dump all the first round data and let David cross-reference it himself.

Yes, but, just to be clear, the list of prefs that need to be updated should be taken from prod right now because some users have already updated their blacklist and we probably shouldn't mess with it again.

Can I be a bit simplistic? I will set up a MySQL instance, read only and accessible from the cluster, with user_properties between the 3rd and the 6th. I could later do the same for the tables between the 7th and the 13th. You can then check whether that is helpful for you and what methodology you would actually like to use. Reverting these kinds of changes requires a lot of knowledge of how things work, what was done, and what is required, which I may lack. The older the backups we need, the more difficult it is.

While that is being setup, if the number of updates is relatively small, compared to the table size, I can try to find the individual updates on ROW format (which stores previous and final result) with the parameters mentioned at https://phabricator.wikimedia.org/T178313#3688296 .

It should be pretty small, yes. The first batch is probably about 1500 rows on enwiki out of millions (and smaller numbers on smaller wikis). The second batch will be bigger, but not that much bigger (I'd be surprised if it was as much as 10k), and we don't actually need the data from that batch anyway, you just might have to look at it to distinguish it from the first batch.

I am backing up the ROW binary logs right now, because they happen to have a 14-15 day expiration.

I get something like:

### UPDATE `enwiki`.`user_properties`
### WHERE
###   @1=[user_id] /* INT meta=0 nullable=0 is_null=0 */
###   @2='echo-notifications-blacklist' /* VARSTRING(255) meta=255 nullable=0 is_null=0 */
###   @3='The Rambling Man\x0aFloquensock' /* BLOB/TEXT meta=2 nullable=1 is_null=0 */
### SET
###   @1=[user_id] /* INT meta=0 nullable=0 is_null=0 */
###   @2='echo-notifications-blacklist' /* VARSTRING(255) meta=255 nullable=0 is_null=0 */
###   @3='1573\x0a3216101'

But on enwiki it starts at 19:38 or so, with one at 19:28 and no others since 18:00. It could be that the slave was delayed, or the estimate of the time is wrong.

Thanks for finding that @jcrespo. It's great to see that both the old and new values are in the dump. Could you specifically find queries where numbers were changed to zeroes? I.e. the @3 in the WHERE (old value) looks like 123\x0a45678 and the @3 in the SET (new value) looks like 0\x0a0

@Catrope I need a regex or a grep command or something. Or, I give you the logs and you can extract exactly the queries you want. "looks like"? contains an '\x0' ? looks like 0\x0a0? is that an exact constant? contains it? I can help, but I need to avoid the ambiguity of English. The queries are multi-line and contain lots of garbage, so not as direct as just grep.

I get, however:

### DELETE FROM `enwiki`.`user_properties`
### WHERE
###   @1=[user-id] /* INT meta=0 nullable=0 is_null=0 */
###   @2='echo-notifications-blacklist' /* VARSTRING(255) meta=255 nullable=0 is_null=0 */
###   @3='' /* BLOB/TEXT meta=2 nullable=1 is_null=0 */
[garbage]
### INSERT INTO `enwiki`.`user_properties`
### SET
###   @1=[user-id] /* INT meta=0 nullable=0 is_null=0 */
###   @2='echo-notifications-blacklist' /* VARSTRING(255) meta=255 nullable=0 is_null=0 */
###   @3='0' /* BLOB/TEXT meta=2 nullable=1 is_null=0 */

Maybe it deleted and inserted rather than updated?

Yeah, all modifications of echo-notifications-blacklist value to 0 are DELETES and then inserts, starting from 19:18 / db1065-bin.004117:1011275000 in the case of enwiki.

@Catrope I need a regex or a grep command or something. Or, I give you the logs and you can extract exactly the queries you want. "looks like"? contains an '\x0' ? looks like 0\x0a0? is that an exact constant? contains it? I can help, but I need to avoid the ambiguity of English. The queries are multi-line and contain lots of garbage, so not as direct as just grep.

Sorry, I should have been more specific. I meant cases where the prior data (in WHERE) was numbers instead of text, and the updated data was one or more zeroes separated by newlines.

However, it looks like you found cases where things were deleted and reinserted instead. I'd be happy to dig through these logs myself.
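For concreteness, a rough sketch of that request against the decoded (mysqlbinlog -vv) output; the grep context counts assume the 9-line UPDATE record layout shown above, and the second pattern only matches new values that are entirely zeroes:

/opt/wmf-mariadb101-client/bin/mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl db1065-bin.004117 \
  | grep -A 8 "^### UPDATE \`enwiki\`.\`user_properties\`" \
  | grep -B 4 "@3='0\(\\\\x0a0\)*'"

The -B 4 context around the final match also prints the WHERE @3 line, so the old value can be inspected alongside the new one.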

What is your user name on terbium? I will send all relevant logs to you so you can check it yourself, as I will be disconnecting soon.

I'm catrope on terbium.

I have copied to terbium:/home/catrope/T178313 all relevant logs from around 24 hours of the given date, which should be enough to fully debug the issue. Please do not move them outside of terbium or give them more open permissions; they contain private data of all kinds (as much as the database itself). I have not copied the ones from s3 because of some port issue, but the idea would be the same for those. You can see the shard based on the log names. To get plain text:

/opt/wmf-mariadb101-client/bin/mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl db1060-bin.000886 | less

/opt/wmf-mariadb101-client/bin/mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl db1065-bin.004117 --start-datetime='2017-10-09 19:18' | less # takes a while

/opt/wmf-mariadb101-client/bin/mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl db1065-bin.004117 --start-position=1011275000 | less
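For manually digging at just the rows of interest, the same decoding command can be narrowed down, e.g. (a sketch; adjust the log file and position):

/opt/wmf-mariadb101-client/bin/mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl db1065-bin.004117 --start-position=1011275000 | grep -B 3 -A 4 "echo-notifications-blacklist" | less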

@Catrope
This is really bizarre... the maintenance script was already set up to handle being run multiple times:
https://phabricator.wikimedia.org/diffusion/ECHO/browse/master/maintenance/updatePerUserBlacklist.php;65a61a91411464a5ec8d74eb71a87e3e9ed988c3$59

If I run it multiple times, it works as expected. Even if I introduce the PHP failure in the same place it occurred (where wfWaitForSlaves(); is now) and run the script more than once, I get the failure the first time, and when I run it again I get successes.

Is it possible that CentralIdLookup could not look up the ids when the script was run? It doesn't seem like the problem is that it was run more than once.

Change 384618 had a related patch set uploaded (by Dbarratt; owner: Dbarratt):
[mediawiki/extensions/Echo@master] Use centralIdsFromNames() method to get central ids.

https://gerrit.wikimedia.org/r/384618

I don't know if that patch will fix it, but it seems to produce more consistent results locally when the data is malformed.

@jcrespo would it be possible to run the script on the recovered data as a test before recovering to production and running it there?

Yeah, all modifications of echo-notifications-blacklist value to 0 are DELETES and then inserts, starting from 19:18 / db1065-bin.004117:1011275000 in the case of enwiki.

Then that is not what you are looking for. The maintenance script specifically calls update()
https://phabricator.wikimedia.org/diffusion/ECHO/browse/master/maintenance/updatePerUserBlacklist.php;3460faa930689b8f4d2f28edaf086beb8c7e3590$71
which runs an UPDATE
https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/libs/rdbms/database/Database.php;bd85fcdec5a2f2a75fcb50154765d9508175083f$1646

However, a user updating their preferences does result in a DELETE and then INSERT
https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/user/User.php;bd85fcdec5a2f2a75fcb50154765d9508175083f$5418

If UPDATE is missing... then this script didn't run on any records and the problem is something else.
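One quick way to check that distinction against the logs (a sketch using the same decoding command as above; per the reasoning here, the script's writes would show up as UPDATE row images, while preference saves show up as DELETE plus INSERT):

/opt/wmf-mariadb101-client/bin/mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl db1065-bin.004117 | grep -c "^### UPDATE \`enwiki\`.\`user_properties\`"
/opt/wmf-mariadb101-client/bin/mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl db1065-bin.004117 | grep -c "^### DELETE FROM \`enwiki\`.\`user_properties\`"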

In https://gerrit.wikimedia.org/r/#/c/374361/2 @Anomie said:

You'll want to set the values in the passed array to something distinctive (e.g. false) and then filter out entries from $ids with that value. Just doing an array_flip will result in bogus values for names that don't exist.

Which is clearly the case in this script (and the above patch fixes it); however, I don't see how it's possible for all of them not to exist, unless there was no access to the CentralAuth database/tables. Is there a way to test this hypothesis?

At least in debug on production, it seems to work:

hphpd> $lookup = CentralIdLookup::factory(); var_dump($lookup->lookupUserNames(['DBarratt (WMF)' => 0]));
$lookup = CentralIdLookup::factory(); var_dump($lookup->lookupUserNames(['DBarratt (WMF)' => 0]));
array(1) {
  ["DBarratt (WMF)"]=>
  int(50584396)
}

Is it possible, however unlikely, that these rows were already zeros before the script was run? If that were the case, the script would have skipped over every row, and no UPDATE would be found.

I've checked other wikis, the pattern repeats:

  • The only mention of echo-notifications-blacklist before that date is setting it to '', probably a default written when any user preference is set or reset; this is common to this and a few other properties
  • Starting from 19:17, the following happens:
    • user_touched is set to a value 6 seconds in the future
    • echo-notifications-blacklist for that user (which was '') is deleted
    • echo-notifications-blacklist is set to '0'
    • page User:<user> is last_updated with the current timestamp
  • Several of these happen 1 minute apart, on several wikis (see the sketch after this list for pulling out the affected user ids)
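To pull the affected user ids out of that DELETE-then-INSERT pattern, a rough sketch against the decoded enwiki log; the -A/-B counts assume the 5-line INSERT record layout quoted earlier, and the @3 pattern only covers the single-'0' case:

/opt/wmf-mariadb101-client/bin/mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl db1065-bin.004117 --start-position=1011275000 \
  | grep -A 4 "^### INSERT INTO \`enwiki\`.\`user_properties\`" \
  | grep -B 2 "@3='0'" \
  | grep -B 1 "@2='echo-notifications-blacklist'" \
  | grep -oP "(?<=@1=)[0-9]+" | sort -u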

I have left a user_properties-20171004.sql.gz file on the same directory, containing that table for all wikis 2 weeks ago.

In https://gerrit.wikimedia.org/r/#/c/374361/2 @Anomie said:

You'll want to set the values in the passed array to something distinctive (e.g. false) and then filter out entries from $ids with that value. Just doing an array_flip will result in bogus values for names that don't exist.

Which is clearly the case in this script (and the above patch fixes), however, I don't see how it's possible for all of them to not exist, unless there was no access to the central auth database/tables. Is there a way to test this hypothesis?

If there was somehow no access to the CentralAuth database/tables, the attempt to SELECT from them in CentralAuthIdLookup should have thrown an exception. You'd have to have had access to mysteriously existing-but-empty tables.

BTW, I can't see how the array_flip issue would have resulted in multiple zeros. Instead it would have generated something like "0\n1\n2\n3" if all the names didn't exist.
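A quick way to see that point, runnable anywhere PHP is available (the names are made up):

php -r 'var_export( array_flip( [ "ExampleA", "ExampleB", "ExampleC" ] ) );'
# prints ExampleA => 0, ExampleB => 1, ExampleC => 2

Since lookupUserNames() leaves entries it cannot resolve unchanged, names that don't exist would keep their 0, 1, 2, ... array positions as bogus "central ids", so the stored result would look like "0\n1\n2" rather than a string of only zeroes.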

I've checked other wikis, the pattern repeats:

  • The only mention of echo-notifications-blacklist before that date is setting it to '', probably a default written when any user preference is set or reset; this is common to this and a few other properties

Are you saying that there was no user with an echo-notifications-blacklist other than an empty string? Or are you saying that if they had one, it was set to an empty string, then set to zeros?

I have left a user_properties-20171004.sql.gz file on the same directory, containing that table for all wikis 2 weeks ago.

I don't have access to /home/catrope/T178313, but I do have access to /home/catrope

Thanks @Catrope for getting me access to the file.

Here's how to get everyone who had a blocklist:

grep -o --text -P "\([0-9]+,\'echo-notifications-blacklist\',\'[^\']+\'\)" user_properties-20171004.sql

To get a count (which is 108) you can do:

grep -c --text -P "\([0-9]+,\'echo-notifications-blacklist\',\'[^\']+\'\)" user_properties-20171004.sql

To get all of the user ids:

grep -o --text -P "(?<=\()[0-9]+(?=,\'echo-notifications-blacklist\',\'[^\']+\'\))" user_properties-20171004.sql

To get all of the values:

grep -o --text -P "(?<='echo-notifications-blacklist\',\')[^\']+(?=\'\))" user_properties-20171004.sql

If this is from all wikis... I'm not sure how I would separate this out on a per-wiki basis to figure out whether they were indeed copied over correctly or not.

@jcrespo would you mind generating separate files per-wiki?
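One possible way to do a rough per-wiki split locally, assuming the combined dump switches databases with USE `dbname`; lines (check the actual file first; the export format may differ):

awk '
  /^USE `/                       { wiki = $2; gsub(/[`;]/, "", wiki) }   # remember which wiki the dump switched to
  /echo-notifications-blacklist/ { print wiki ": " $0 }                  # prefix matching INSERT lines with that wiki
' user_properties-20171004.sql

The output lines are still the full multi-row INSERT statements, so the greps above can then be applied per wiki.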

To be fair... just skimming the values, there are a lot that are valid, but a significant number are clearly not usernames.

Though the numbers don't line up. If there are only 108 in this data set with any non-empty value, why would ~1,500 rows with 0's exist on English Wikipedia?

As I commented above, I think that by default, every time a user edits or resets their preferences, it is set to 0 since that deploy.

Let me see if I can figure out how that can happen. In my testing, the default has always been removing the row (i.e. null)

Since it does not appear that we lost any data during the deployment, I think it's safe to close this issue. I created T178512 to investigate why we are getting more and more bad records in the database.

Change 384618 merged by jenkins-bot:
[mediawiki/extensions/Echo@master] Use centralIdsFromNames() method to get central ids.

https://gerrit.wikimedia.org/r/384618