https://community-crm.wmcloud.org/ is down
Closed, Resolved (Public)

Description

Hi, https://community-crm.wmcloud.org/ is down. I can't access the web admin interface of Drupal or CiviCRM either. Something tells me that perhaps restarting the server is enough. I had been using the CRM for hours today without any problem. Then I ran the process to find duplicates, and this is probably when it hung. Maybe the tmp directory, the cache, or something similar is full, and a restart would clean up the temporary files?

Event Timeline

Qgil triaged this task as High priority. Feb 28 2024, 12:37 PM
Qgil created this task.

It looks like the entire virtual server is down anyway: https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=civicrm-prototype&var-instance=community-crm and yes, this happened after a high load at the time I started the deduplication job.

jgleeson changed the visibility from "acl*WMF-FR (Project)" to "Public (No Login Required)". Feb 28 2024, 2:18 PM

The server was up, but the mysql process was down. It had been killed with an OOM error:

root@community-crm:~# systemctl status mysql
● mariadb.service - MariaDB 10.5.23 database server
     Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
     Active: failed (Result: oom-kill) since Wed 2024-02-28 12:10:18 UTC; 2h 6min ago

I restarted it with systemctl start mysql:

root@community-crm:~# systemctl start mysql
root@community-crm:~# systemctl status mysql
● mariadb.service - MariaDB 10.5.23 database server
     Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2024-02-28 14:17:22 UTC; 5s ago
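
For future incidents, the OOM kill shown in the status output above could also be detected from a script. This is only a minimal sketch: the function name `was_oom_killed` is hypothetical, and it simply looks for the `Result: oom-kill` string that systemd printed in the output quoted in this task.

```shell
#!/bin/sh
# Succeeds if the systemd status text on stdin reports an OOM-killed unit.
# Hypothetical helper; it matches the "Result: oom-kill" string seen in
# the status output quoted in this task.
was_oom_killed() {
    grep -q 'Result: oom-kill'
}

# Example against a live host (assumes the unit is named mariadb.service):
#   systemctl status mariadb.service | was_oom_killed && echo "mariadb was OOM-killed"
```

The check reads from stdin so it can be run against saved status output as well as a live `systemctl status` call.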

All fr-tech devs have been granted reader access to the project. This does not grant the permissions needed for the transition to the prod VPS.

@fnegri Thanks for stepping in to restart mysql on this host.

@Dwisehaupt do you know what happened yesterday that broke the server? Should I avoid the deduplication tool for now?

I have not investigated yet. I attempted logging in this morning but ssh connections are getting shut down after the key is accepted for both myself and jgreen. I will plan a reboot of the host late this evening.

For now, I would avoid using the deduper and check in with eileen to see what recommendations there are for using it. I know that overly broad settings have caused issues on the production instance in the past.

I have combed through the available logs and couldn't find an entry as to what the deduper was trying to do. I noticed in the process that the slow query log was not turned on. I have enabled the slow query log so that we can capture that information in the future. For now, there is not much more to go off of. Perhaps we can schedule an attempt of the deduper while watching the DB to see what it may have been trying to do.
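
For reference, one way to turn on the slow query log at runtime in MariaDB is with `SET GLOBAL` statements. The sketch below is not necessarily how it was enabled on this host. The helper name `slow_log_sql` and the default threshold of 2 seconds are assumptions for illustration.

```shell
#!/bin/sh
# Print the SQL that turns on the MariaDB slow query log at runtime.
# $1 is the threshold in seconds; the default of 2 is an assumed value,
# not one taken from this task.
slow_log_sql() {
    threshold="${1:-2}"
    printf "SET GLOBAL slow_query_log = 'ON';\n"
    printf "SET GLOBAL long_query_time = %s;\n" "$threshold"
}

# On the database host (assumes root can connect over the local socket):
#   slow_log_sql 5 | mysql
```

Settings changed with `SET GLOBAL` do not survive a server restart, so a permanent setup would also need the equivalent options in the MariaDB configuration file.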

Closing this for now.

Dwisehaupt claimed this task.
Dwisehaupt moved this task from Backlog to Done on the Fundraising Sprint: didAnyoneTryThis() board.
Dwisehaupt moved this task from Triage to Done on the fundraising-tech-ops board.

Note that deduper queries have some known performance issues. I even wrote about this some years back: https://civicrm.org/blog/eileen/deduping-for-guppies. The tl;dr is at the end:

"be careful when configuring your dedupe rules. Avoid specifying the length of the match in the rule. And avoid fields where lots of people might match each other."

In particular, using state or country in dedupe rules can cause bad queries, as you likely have a lot of matches.
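
To give a rough sense of scale, here is a back-of-the-envelope sketch. It assumes a simplified model in which every pair of contacts sharing the same field value is a candidate match; it does not reproduce the deduper's actual queries. The function name `candidate_pairs` is hypothetical.

```shell
#!/bin/sh
# Number of distinct pairs among N contacts that share one field value,
# i.e. N * (N - 1) / 2. Simplified model, not CiviCRM's actual logic.
candidate_pairs() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

candidate_pairs 10000   # prints 49995000
```

Under this model, 10,000 contacts in the same country already produce almost 50 million candidate pairs, whereas a field that only a handful of contacts share produces only a few.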