I know- it is only related because the wikidata migration requires moving replication channels, and that consumes DBA time, not because it contains wikidata.
dplbot/s51290 seems to keep creating issues. In this case, the lag was caused by a different issue (another user creating heavy queries, not dplbot itself), but it was making it difficult for replication to catch up right now. I have banned s51290 from labsdb1001 (not from the other replicas, which can still be used, nor from other database hosts) until the lag goes back to 0 or close. Once I did that, replication lag started decreasing.
p50380g50440 was running several queries that were never going to stop executing, causing 1 day of lag on labsdb1001:
@hoo Regarding the Wikimedia setup, you should know that our priority right now is to move wikidata to a dedicated server group, which means that, on the ops side, no other structural change can happen at the same time.
Codfw version of T134476
Let's consider this fixed and let's focus on T175685.
I think after the above patch, only the proxies are missing?
Will do! Thanks. Please give me a heads up if any maintenance happens here; unless you tell me otherwise, I will put it back into production. We can put it down at any time later, but I do not want it down for long (replication keeps moving forward :-); I just need to depool it beforehand.
The error in the description was from the lifecycle log. It gave the same description as the one you googled.
Mon, Sep 18
And put it down. CC @Cmjohnson.
there will be read queries here against vslow slaves that are very long (on the order of an hour)
db1100 is depooled; I have downtimed it for a week so the BIOS update can happen at any time.
There was no lag on the last occurrence of that error:
Ok, then this is not a blocker for the above ticket. We would appreciate a ping when the actual database is deployed.
Is this going to be a public or a private wiki?
Sun, Sep 17
Not sure if with "you", you mean me, but if it is safe, yes. We may have to defragment the table afterwards to reclaim disk space, but that can be done later and it is not a blocker.
We identify and delete the duplicate rows (not trivial, but not difficult either), then we add a UNIQUE constraint over that combination of columns so that it never happens again.
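For illustration only (the table and column names below are hypothetical, not the actual schema), the dedup-then-constrain step could look roughly like this:

```sql
-- Hypothetical table/columns; illustrative sketch, not the real migration.
-- 1) Remove duplicates, keeping the row with the lowest id per (col_a, col_b).
DELETE t1 FROM some_table t1
JOIN some_table t2
  ON t1.col_a = t2.col_a
 AND t1.col_b = t2.col_b
 AND t1.id > t2.id;

-- 2) Add a UNIQUE key over the same columns so duplicates cannot reappear.
ALTER TABLE some_table
  ADD UNIQUE KEY uniq_col_a_col_b (col_a, col_b);
```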
Sat, Sep 16
We need this ASAP: dbstore1001 crashed and it is not in a good state; plus, it can no longer catch up with replication reasonably well.
@Jayprakash12345 Unless I am wrong, that is a different issue, not related to the analytics db servers- please file a separate ticket so @Analytics ops can have a look at it (it is probably not related to mysql).
Please tell me your timezone or roughly the hours when you will be available so we can find the best match. We do not need to do it at that exact time, as long as there are admins or developers around.
Fri, Sep 15
@Etonkovidova Can you provide more information about how you plan to use that data? It was not initially included in the analytics data because it was both more technologically difficult and not needed at the time. If you need one-time access, I can quickly provide you temporary access to a host. If it is an ongoing (long-term) project, I will import those tables into dbstore1002, which may take a few days.
If it is enwiki, then this can go at any time someone is around in case of a problem. I propose 15:00 UTC on Monday, as admins and developers from both Europe and the Americas should be around (me in particular), but that is only because I do not know your schedule- feel free to propose a different time if you or any other "renamer" prefers it. Please let me know :-)
Which wikis was he mainly editing? We have some ongoing performance issues on wikidata and commons, and I would like to delay it if it impacts those.
@MusikAnimal So this is a couple of things. MySQL here may be confusing, but it is not the one at fault: it is executing the queries in the fastest way possible, and the rev_id used has little effect on performance (although, indirectly, there could be a correlation between high and low numbers due to edit patterns); it is not the issue here.
@IKhitron revision_userindex is a made-up view that exists on the cloud wikireplicas only- production has very similar content, but different objects in the database (columns, indexes, etc.), so it needs a somewhat separate assessment.
Low after being depooled.
However, it is quite a coincidence that it crashed just hours after being pooled and getting some load: https://gerrit.wikimedia.org/r/378003 (it had been idle for weeks before). I would like to generate some cpu load to make sure this isn't repeatable.
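If it helps, one crude way to generate CPU load from inside MariaDB itself (just an assumed approach; stressing the host directly would work too) is something like:

```sql
-- Burns CPU inside mysqld for a while; run several in parallel to load more cores.
SELECT BENCHMARK(500000000, SHA1('cpu load test'));
```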
Thanks, I just wanted to doublecheck.
Making it NDA-only for now, based on extreme, paranoid-level caution, until @MoritzMuehlenhoff or @Cmjohnson consider whether we should worry about https://en.wikipedia.org/wiki/Intel_Active_Management_Technology#Known_vulnerabilities_and_exploits
Thu, Sep 14
Decomm. done; the only references left are as spare in site.pp and admin_install.
Are you OK with all mariadb:: roles?
Wed, Sep 13
Let's wait a bit more. I may have to talk to you about setting up TLS for php and changing passwords; let's talk and aim for next week (but we shouldn't delay it much).
I think the assignment is an accident because it was created as a subticket of another ticket; nothing for you to do here yet. Sorry for the distraction.
I believe this, or something similar in spirit, could be happening now for s5. I need to look into it more to identify it.
Actually, I converted the tables to InnoDB already- so nothing is to be done unless there is still some non-obvious interaction that makes the lag not go away (metadata locking, inter-query locking, or something else). In theory nothing needs to be done until we check that the lag doesn't happen again. This was just a heads up to never create Aria tables, ever! :-D
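For reference only (the table name is hypothetical; the real tables were already converted), the conversion plus a follow-up check could look like:

```sql
-- Hypothetical table name; convert it from Aria to InnoDB.
ALTER TABLE some_aria_table ENGINE=InnoDB;

-- Verify that no Aria tables remain in the current schema.
SELECT table_name, engine
FROM information_schema.tables
WHERE table_schema = DATABASE() AND engine = 'Aria';
```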
Is it possible to see which script is running the query mentioned in your comment?
Missing word: "reads will get [blocked] by writes (from replication) [on non transactional engines]."
db1048 is now ready to be decommissioned; it is set as spare, but it still needs to be fully deleted from the configuration and infrastructure (installer, site.pp).
Actually, I am not sure if it is that script; there is another one happening at 3am, too, that seems to block the queries:
@mmodell We have to upgrade the hardware for the phabricator databases. What do you think of also doing, this Thursday, a master switchover and an upgrade to stretch/mariadb 10.1, enabling TLS and setting up the firewall? It should only take a few seconds of restarting phabricator to pick up the new connections; if something goes bad, we revert to the current server.
Not directly related, but for background, and could be relevant regarding recentchanges scanning and why it has become a problem for many queries lately: T171027#3599821
Tue, Sep 12
0 is Online, Spun UP. Next one should be Span: 1
I am not saying this is not important, but nothing will break if this is not done (unlike many other "normal" tickets), which for me makes it low priority. Feel free to put it higher if you can help with this.
See comment on gerrit, it helps with speeding up reviews :-).
db1069 has been reused on s7; we should probably choose db1066 instead.
Still on "Firmware state: Rebuild"; we will wait a bit for the next one. (I am being a bit more cautious than I have to be with the RAID 10 because the disks are not new, so there is a chance for those to fail, too.)
After testing some indexes, I do not see a huge improvement- we can reduce from scanning 100M rows to 18M, but there can always be a combination of query parameters that does not filter many rows on recentchanges. Paging by id (or timestamp) is the only reliable solution to run the queries in smaller batches so they do not fail:
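As a rough illustration only (the bounds, columns, and batch size here are made up, not the actual failing query), id-based paging would look something like:

```sql
-- Illustrative sketch: page through recentchanges in small batches by rc_id,
-- so each query touches a bounded number of rows instead of scanning millions.
SELECT rc_id, rc_timestamp, rc_title
FROM recentchanges
WHERE rc_id > 123456789   -- last rc_id seen in the previous batch
ORDER BY rc_id
LIMIT 5000;
```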