
Programs & Events Dashboard server maintenance log
Open, Lowest, Public

Description

(If there are better conventions for this kind of thing, please let me know.)

P&E Dashboard, running at outreachdashboard.wmflabs.org, requires server interventions to keep it running smoothly. I'm going to start documenting what I've done to keep it up and running.

;2018-03-04
T188864 - server ran out of disk space.

  • mounted extra disk space on /srv and moved database backup files there.
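
Roughly the steps involved, as a hedged sketch (the device name /dev/sdb and the backup source path are assumptions, not recorded in the task):

  # Format and mount the extra volume on /srv (device name is hypothetical)
  sudo mkfs.ext4 /dev/sdb
  sudo mkdir -p /srv
  sudo mount /dev/sdb /srv

  # Persist the mount across reboots
  echo '/dev/sdb /srv ext4 defaults 0 2' | sudo tee -a /etc/fstab

  # Move the database backup files onto the new volume (source path assumed)
  sudo mkdir -p /srv/backups
  sudo mv /var/backups/mysql/*.sql.gz /srv/backups/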

;2018-04-09
An update process has been failing because of a lack of space to write tmp tables to disk during certain database queries. The main filesystem has about 1.2 GB free.

  • edited /etc/environment to set /srv/tmp as the TMPDIR environment variable
  • restarted mysql
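
A minimal sketch of those two steps, assuming the directory had to be created first (the 1777 mode mirrors the standard /tmp permissions; the log doesn't record the exact mode used):

  # Create the new tmp directory on the large /srv volume
  sudo mkdir -p /srv/tmp
  sudo chmod 1777 /srv/tmp

  # Point TMPDIR at it system-wide
  echo 'TMPDIR=/srv/tmp' | sudo tee -a /etc/environment

  # Restart MySQL so it runs with the new environment
  sudo systemctl restart mysql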

;2018-05-08
Free storage was down to ~600MB on the root filesystem, largely because of the continued growth of the MySQL database.

Downtime was about 15 minutes.

;2019-02
During February, Wes Reid and Sage Ross set up a new Debian VPS and transferred outreachdashboard.wmflabs.org to it, since the Ubuntu server was no longer supported on labs.

We attempted, with partial success, to set it up so that it can be deployed by multiple people (instead of only by Sage, as before). In the process, we encountered a number of conflicts that can occur when a second user deploys. Until we solve those issues, deploys from multiple users are not practical, although the conflicts can be worked around manually.

All data was transferred without issue, downtime was about 30 minutes, and the system has been running smoothly on the new server for the last two weeks.

Event Timeline

Ragesoss triaged this task as Lowest priority. May 8 2018, 8:36 PM

The Wikidata WikiProject India page has gone 20 days without an update. This appears to be because new update jobs are not being enqueued in Sidekiq for it. After adding some debugging features for course data updates, I temporarily stopped the update queues, removed all courses from them, then cleared all sidekiq-unique-jobs keys from redis:

  • bundle exec jobs expire (which deleted about 5000 keys, but apparently not the one blocking course 11168)
  • bundle exec jobs del '*' (which deleted 3 more keys, and appears to have cleared the way for rescheduling an update for 11168)
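
To verify what was actually left behind, something like this redis-cli scan works (the 'uniquejobs:' key prefix is an assumption based on the gem's naming):

  # List any remaining sidekiq-unique-jobs lock keys
  redis-cli --scan --pattern 'uniquejobs:*'

  # Or just count them
  redis-cli --scan --pattern 'uniquejobs:*' | wc -l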

The Wikidata WikiProject India page stopped updating again 8 days ago. I updated sidekiq-unique-jobs to the new release, and then had to do this to clear the unique keys:

  • bundle exec uniquejobs del 'uniquejobs:*'

I also re-enabled logging to try to discover what caused that particular update process to die in the first place without releasing its lock.

Hi,

Can you please remove me from the watchers/subscribers list? I had subscribed to receive emails for this project accidentally.

Thank you,
Swati Agarwal

The Dashboard went down when the disk filled up. https://phabricator.wikimedia.org/T219865

The main problem was mysql tmp files, several GB in size, in the root /tmp directory. I deleted the /tmp files to get the Dashboard back up and running.

Subsequently, I edited /etc/mysql/my.cnf to specify tmpdir=/srv/tmp, created that directory and set it to 777 permissions, then restarted mysql.
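
A sketch of that sequence (placing tmpdir under the [mysqld] section is the standard location; the final query confirms the running server picked it up):

  # Create the MySQL tmp directory on /srv and open up its permissions
  sudo mkdir -p /srv/tmp
  sudo chmod 777 /srv/tmp

  # In /etc/mysql/my.cnf, under the [mysqld] section, add:
  #   tmpdir = /srv/tmp

  sudo systemctl restart mysql

  # Confirm the new location is in effect
  mysql -e "SHOW VARIABLES LIKE 'tmpdir';"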

Things seem to be working properly now, although the system hasn't tried to create any mysql tmp files yet.

For the last two days or so, many spam Campaign records were being created because of a security hole that allowed not-logged-in users to create Campaigns.

I deleted the spam records from the server console, and Wes Reid deployed a fix. Because of poor upload bandwidth during the deployment, the site assets (and hence, basically, the whole site) were not available for approximately 30 minutes.

Afterwards, the Passenger queue was full and I restarted Apache to get the system working again.

I modified the Apache configuration today to reduce the amount of logging and avoid storing IP addresses (see the sketch after this list):

  • Removed access logging from the dashboard vhost
  • Disabled other-vhosts-access-log.conf
  • edited the ErrorLog configuration in apache2.conf to remove %h %l %u from the format, so that no user information is stored for errors.
  • deleted old logs
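
A hedged sketch of those changes on a Debian-style Apache layout (the dashboard vhost filename and log paths are assumptions):

  # Disable the shared access log for all vhosts
  sudo a2disconf other-vhosts-access-log

  # In the dashboard vhost file, remove or comment out its access log, e.g.:
  #   # CustomLog ${APACHE_LOG_DIR}/dashboard/access.log combined

  # Delete old logs that may contain IP addresses
  sudo rm -f /var/log/apache2/other_vhosts_access.log*
  sudo rm -f /var/log/apache2/dashboard/access.log*

  sudo systemctl reload apache2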

Those are all the places I know of where the Dashboard stored IPs, so any remaining personal information should be within the domain of the Dashboard codebase itself (which by default includes only Wikipedia usernames and OAuth credentials, aside from publicly available wiki data).

I'm investigating errors that have been happening often in recent weeks... the Dashboard becomes unavailable, returning an Apache server error message when users try to load the site. (Typically, a server error in the Rails code results in rendering the Rails error template rather than the Apache one.) It seems like this usually happens in the middle of the night between Sunday and Monday.

No error logs are appearing in the place where the Apache site configuration directs them, /var/log/apache2/dashboard/error.log, likely because the server user running the Dashboard process didn't have access to that directory. I've chown'd it properly now (see the sketch after the list below). However, there are logs in /var/log/apache2/error.log. These seem to be collected from STDOUT rather than normal Rails logging, and come in several varieties:

  • Warnings related to reading the New Relic configuration from newrelic.yml. New Relic is not configured for P&E Dashboard, but an empty newrelic.yml exists. I added agent_enabled: false to the file, which may get rid of those errors. (I believe those errors are benign.)
  • "Error during failsafe response: Missing template errors/file_not_found, application/file_not_found with {:locale=>[:en], :formats=>[:css], :variants=>[], :handlers=>[:raw, :erb, :html, :builder, :ruby, :jbuilder, :haml]}. Searched in: <long list of view directories for app plus gems>"
  • Logs for Passenger "Checking whether to disconnect long-running connections"
  • Omniauth INFO level logs, like "INFO -- omniauth: (mediawiki) Request phase initiated.". (The Rails logging level for P&E Dashboard is :warn, so I don't know why these INFO level messages are showing up.)
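
A sketch of the two fixes mentioned above (www-data as the Apache/Passenger user, and the newrelic.yml path, are assumptions):

  # Give the user running the Dashboard access to the per-site log directory
  sudo chown -R www-data:www-data /var/log/apache2/dashboard

  # Disable the unconfigured New Relic agent to silence its startup warnings
  echo 'agent_enabled: false' | sudo tee -a /var/www/dashboard/config/newrelic.yml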

This could be related to something we've been seeing where Puppet runs have been eating the resources of the Library Card Platform; we had to upgrade our instance size.

Thanks! Do those happen on a weekly basis on Sundays/Mondays?

I'm hopeful that I've fixed the logging config and will be able to get more details the next time the system gets in that state. It's odd in that it doesn't return to normal on its own; the background job queue continues running, but web requests are broken until I restart Apache... and that's all it seems to take to set the system back in order.

@Samwalton9 asked me to peek in here: our issues started out intermittent and then suddenly ramped up to hourly. For us it was pretty obvious it was a system management problem because any time we'd have performance problems (and I was awake to catch it) I'd see diamond, prometheus, puppet, or apt-show-versions processes eating up 50%-90% of our cpu for many minutes at a time.

The system got into that bad state again, where requests intermittently fail.

I noticed that the sidekiq-daily process was down according to the Sidekiq admin panel, and also that the newest job on the queue (pushed back after failure) was UpdateArticleStatusWorker. Suspecting that this process was running out of memory, I looked into the code and decided to try lowering the number of threads that can be used for that task (from 10 to 2). After redeploying with this change, the sidekiq-daily process was still not reporting as active.

Despite that, systemctl status sidekiq-daily showed that the process was in fact still running, and its most recent logs were from 2020-05-19.
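
For this kind of check, the systemd tooling gives both pieces of information (assuming sidekiq-daily runs as a systemd unit, as the status command above implies, and logs to the journal):

  # Check the unit's state and most recent output
  sudo systemctl status sidekiq-daily

  # Pull the unit's journal to see when it last logged anything
  sudo journalctl -u sidekiq-daily --since "2020-05-19"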

I did a Soft Reboot via Horizon, after which the sidekiq-daily process began taking jobs again.

Despite the log permissions changes I did earlier this week, no Apache errors are being logged to /var/log/apache2/dashboard/error.log.

After the reboot, I set up the prerequisites for using sudo apachectl fullstatus; nothing looks problematic at the moment, but I will try it again if the system gets into a bad state. One thing to note: apparently there are two Apache processes (which I didn't realize, and I don't know how it got configured that way). Anecdotally, when we see server errors, they happen about half the time... so it could be that one of the two processes starts malfunctioning.
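
For reference, the prerequisites amount to enabling mod_status and installing a text-mode browser such as lynx, which apachectl uses to render the status page:

  # fullstatus needs mod_status plus a text browser
  sudo apt-get install -y lynx
  sudo a2enmod status
  sudo systemctl reload apache2

  # Then query the running server
  sudo apachectl fullstatus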

Further investigation pointed to memory exhaustion as the almost certain root cause; we confirmed that when a Sidekiq worker process runs out of memory, it does indeed leave an 'orphan lock' that stops the active job from being rescheduled.

Since deploying the reduced number of threads, the system has been stable.

I upgraded the Ruby version from 2.5.0 to 2.7.1. Downtime was minimal, perhaps a few minutes.
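
The exact steps depend on how Ruby is managed on the server, which the log doesn't record; as a sketch assuming rbenv with ruby-build and a hypothetical app path:

  rbenv install 2.7.1
  cd /srv/dashboard        # app path is hypothetical
  rbenv local 2.7.1
  gem install bundler
  bundle install

  # Passenger may also need its PassengerRuby path updated in the Apache
  # config before restarting
  sudo systemctl restart apache2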

Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two years (see the emails sent to the task assignee on Oct 27 and Nov 23). Please assign this task to yourself again if you still realistically plan to work on this task - it would be welcome.
(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.)