User Details
- User Since
- Jul 18 2022, 2:39 PM (100 w, 4 d)
- Availability
- Available
- IRC Nick
- dhinus
- LDAP User
- FNegri
- MediaWiki User
- FNegri-WMF [ Global Accounts ]
Today
replication password is shared between clouddb and production hosts
This is not a super big deal; you cannot really do much with it.
Yesterday
@Liz I'm sorry that you're still having issues, I suspect that sometimes your queries take a bit longer to complete, and when that happens you run into the ConnectionResetError described above.
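A minimal sketch of what I mean (untested; the hostname and credentials below are placeholders, not Quarry's actual configuration): if a query keeps the connection busy past the cutoff, the client sees a connection reset instead of the result set.

```
# Placeholder host/credentials; simulates a query that outlives the connection.
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",  # wikireplica host (placeholder)
    user="uXXXX",
    password="...",
    database="enwiki_p",
)
try:
    with conn.cursor() as cur:
        # Any query running longer than the cutoff will do; SLEEP just makes
        # the timing deterministic.
        cur.execute("SELECT SLEEP(300)")
        print(cur.fetchone())
except (ConnectionResetError, pymysql.err.OperationalError) as exc:
    # The connection was dropped before the results could be sent back.
    print(f"query did not complete: {exc}")
finally:
    conn.close()
```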
It was agreed that it was a best effort, and it was never guaranteed that the hosts would have 0 lag.
I think that members of wmcs-roots can now circumvent this by using the cloudcumin hosts and running a command as root through Cumin.
@taavi I was wondering what's the status of this task. I see you pushed a few patches to maintain-views in February, what's left?
Wed, Jun 19
But hey, a few of my queries just issued reports! I don't know what happened since I posted this message but something has changed for the better.
Query plans on clouddb1013:
To verify whether my theory is correct, I repooled clouddb1017; let's see if the lag starts increasing again.
Tue, Jun 18
If you look through the Execution time column on the Recent queries list, it actually seems that the results of virtually any query with an execution time longer than ~120s never make it back
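Something like the probe below (untested; host and credentials are placeholders) could confirm whether there is a hard cutoff around 120 s, by checking which SLEEP durations still return a result:

```
# Placeholder host/credentials; probes which query durations still return.
import pymysql

def returns_result(seconds):
    conn = pymysql.connect(host="clouddb1017.eqiad.wmnet",  # placeholder
                           user="...", password="...")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT SLEEP(%s)", (seconds,))
            cur.fetchone()
            return True
    except (ConnectionResetError, pymysql.err.OperationalError):
        return False
    finally:
        conn.close()

for t in (60, 100, 120, 140, 180):
    print(f"{t}s: {'result received' if returns_result(t) else 'connection lost'}")
```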
Out of the 170 queries killed in total, 69 include /* pollcats.rs SLOW_OK */
As suggested by @taavi I tried depooling s1 on clouddb1017, so that all s1 wikireplica traffic will go to the other host (clouddb1013).
The lag grows until about 3 hours, then starts decreasing. This is consistent with wmf-pt-kill, which is configured to kill queries taking longer than 3 hours to complete (--busy-time 10800).
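For context, the sketch below shows what a --busy-time 10800 policy amounts to; it is not how wmf-pt-kill itself works (that service wraps Percona's pt-kill), and the host and credentials are placeholders:

```
# Illustration only: find queries busy for more than 3 hours and kill them,
# which is the effect of pt-kill --busy-time 10800. Placeholder host/creds.
import pymysql

BUSY_TIME = 10800  # seconds

conn = pymysql.connect(host="clouddb1017.eqiad.wmnet",  # placeholder
                       user="...", password="...")
with conn.cursor() as cur:
    cur.execute(
        "SELECT ID, TIME, LEFT(INFO, 80) FROM information_schema.PROCESSLIST "
        "WHERE COMMAND = 'Query' AND TIME > %s",
        (BUSY_TIME,),
    )
    for thread_id, busy_for, snippet in cur.fetchall():
        print(f"killing {thread_id} (busy for {busy_for}s): {snippet}")
        cur.execute(f"KILL {int(thread_id)}")
conn.close()
```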
The host is now repooled.
@Liz it is getting attention from multiple people, but it's not clear what the problem is. :)
Mon, Jun 17
This happened again yesterday. Similar to the previous occurrences, START SLAVE; was enough to resume replication.
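For reference, the manual fix is roughly the following (placeholder host and credentials; on a multi-source host the connection name may need to be added to the statements):

```
# Check whether the replication threads are running and resume them if not.
# Placeholder host/credentials; START SLAVE is the statement mentioned above.
import pymysql

conn = pymysql.connect(host="clouddb10xx.eqiad.wmnet",  # placeholder for the affected host
                       user="...", password="...")
with conn.cursor(pymysql.cursors.DictCursor) as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    if status and (status["Slave_IO_Running"] != "Yes"
                   or status["Slave_SQL_Running"] != "Yes"):
        cur.execute("START SLAVE")
conn.close()
```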
Fri, Jun 14
But the extra ones are just standard OpenStack zones that are not being used.
There are actually 7 zones in total in the project:
The only instance in this project was deleted in T359810.
The project was deleted today (T359810) but the DNS names are still listed as belonging to that project:
I didn't see this task, but I had left a comment at https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2024_Purge and T365975: [cloud-vps] migrate DNS zones away from deprecated clouddb-services project.
@Liz are the queries that never finish different from other queries, or are they similar but sometimes they randomly fail? Did similar queries work fine in the past?
Thu, Jun 13
We are finally close to resolving this task! After the work in T348407 I can successfully query ToolsDB from Quarry. Access at the moment is limited to one database for testing (s55771__wsstats_p). As discussed in T348407 I will send an email to cloud-announce to inform everyone that we're opening this type of access to the ToolsDB databases (only for the databases ending with _p).
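For anyone who wants to try it before the announcement goes out, this is roughly what the test access looks like (the hostname and credentials below are placeholders, not Quarry's actual configuration):

```
# Connect to ToolsDB and list the tables in the one database opened for testing.
# Placeholder host/credentials.
import pymysql

conn = pymysql.connect(host="tools.db.svc.wikimedia.cloud",  # ToolsDB (placeholder)
                       user="quarry", password="...",
                       database="s55771__wsstats_p")
with conn.cursor() as cur:
    cur.execute("SHOW TABLES")
    for (table,) in cur.fetchall():
        print(table)
conn.close()
```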
The specific error described in this bug report (Access denied for user 'quarry'@'172.16.2.72' (using password: NO)) is no longer happening, so I'm marking this task as Resolved.
Wed, Jun 12
@rook redeployed Quarry including the latest fixes https://github.com/toolforge/quarry/pull/46 and https://github.com/toolforge/quarry/pull/47.
I'm confused because that directory has a checkout of a non-main branch:
https://github.com/toolforge/quarry/pull/46 and https://github.com/toolforge/quarry/pull/47 should probably fix the issue, but I'm not sure how to deploy those after they are merged.
Ah well, looks like the deployment of https://github.com/toolforge/quarry/pull/40 didn't fully go through. The keys TOOLS_DB_USER and TOOLS_DB_PASSWORD are missing from config.yaml on the pods.
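A quick sanity check along these lines (the path is a placeholder for wherever config.yaml lands on the pods) would have caught it:

```
# Verify that the ToolsDB keys mentioned above are present in the deployed
# config. Placeholder path.
import yaml

REQUIRED = ("TOOLS_DB_USER", "TOOLS_DB_PASSWORD")

with open("/srv/quarry/config.yaml") as f:  # placeholder path
    config = yaml.safe_load(f) or {}

missing = [key for key in REQUIRED if key not in config]
print("missing keys:", ", ".join(missing) if missing else "none")
```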
Tue, Jun 11
I suspect they are related, yes. Maybe Quarry is trying to connect to another database but using ToolsDB credentials, or vice versa.
We could also try with Superset in the meantime; maybe that will be easier. I will have a look.
@KCVelaga_WMF it's not working but I'm struggling to understand why. I had a quick look into the Quarry source code but I plan on investigating more this week. If anyone has any hints on what could be the problem please let me know.
Mon, Jun 10
The alerts are visible again. I will restart the services.
Do aliases/bashrc apply when running just sudo git foo (without -i)?
Maybe we could prevent the root user from running git with something like alias git="echo run git with the gitpuppet user"?
Cherry-picking is also working:
The permissions look similar on tools-puppetserver-01:
Tue, Jun 4
I did not restart the services, but the alerts disappeared from alerts.wikimedia.org. I can see they are still in status WARNING in Icinga, though; I'm not sure why they are no longer visible in alerts.wikimedia.org.
I wonder if it could be something that's not related to Redis at all, but instead something else that blocks the application thread for a long time. Just a guess; I might be completely wrong.