
Programs & Events Dashboard server maintenance log
Open, LowestPublic

Description

(If there are better conventions for this kind of thing, please let me know.)

P&E Dashboard, running at outreachdashboard.wmflabs.org, requires server interventions to keep it running smoothly. I'm going to start documenting what I've done to try to keep it up and going.

;2018-03-04
T188864 - server ran out of disk space.

  • mounted extra disk space on /srv and moved the database backup files there.

;2018-04-09
An update process has been failing because of lack of space to write tmp tables to disk during particular database queries. The main filesystem has about 1.2 GB free.

  • edited /etc/environment to set /srv/tmp as the TMPDIR environment variable
  • restarted mysql
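A quick way to see how much room is left on the filesystems involved is sketched below. This is only an illustration, not something that runs on the server: the function names and the 2 GiB default threshold are assumptions, and the paths to check (for example / and /srv/tmp) have to be passed in by the caller.

```python
import shutil


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def low_space(paths, min_free_gib: float = 2.0):
    """Return the paths whose filesystem has less than `min_free_gib` free.

    The 2 GiB default is an arbitrary illustrative threshold.
    """
    return [p for p in paths if free_gib(p) < min_free_gib]
```

Calling low_space(["/", "/srv/tmp"]) would list whichever of the two locations is running short. Note that shutil.disk_usage raises FileNotFoundError if a path does not exist.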

;2018-05-08
Free storage was down to ~600MB on the root filesystem, largely because of the continued growth of the MySQL database.

Downtime was about 15 minutes.

;2019-02
During February, Wes Reid and Sage Ross set up a new Debian VPS and transferred outreachdashboard.wmflabs.org to it, since the Ubuntu server was no longer supported on labs.

We attempted, with partial success, to set it up so that it can be deployed by multiple people (instead of only by Sage, as previously). In the process, we encountered a number of conflicts that can happen when a second user deploys. Until we solve those issues, deploys from multiple users are not practical, although the conflicts can be worked around manually.

All data was transferred without issue, downtime was about 30 minutes, and the system has been running smoothly on the new server for the last two weeks.

Event Timeline

Ragesoss created this task. Apr 9 2018, 4:36 PM
Restricted Application added subscribers: Base, Aklapper. Apr 9 2018, 4:36 PM
Ragesoss updated the task description. May 8 2018, 8:16 PM
Ragesoss triaged this task as Lowest priority. May 8 2018, 8:36 PM
Ragesoss closed this task as Resolved. Jul 23 2018, 6:30 PM

The Wikidata WikiProject India page has gone 20 days without an update. This appears to be because new update jobs are not being enqueued in Sidekiq for it. After adding some debugging features for course data updates, I temporarily stopped the update queues, removed all courses from them, then cleared all sidekiq-unique-jobs keys from redis:

  • bundle exec jobs expire (which deleted about 5000 keys, but apparently not the one blocking course 11168)
  • bundle exec jobs del '*' (which deleted 3 more keys, and appears to have cleared the way for rescheduling an update for 11168)
Ragesoss reopened this task as Open. Aug 8 2018, 6:28 PM

The Wikidata WikiProject India page stopped updating again 8 days ago. I updated sidekiq-unique-jobs to the new release, and then had to do this to clear the unique keys:

  • bundle exec uniquejobs del 'uniquejobs:*'

I also re-enabled logging to try to discover what caused that particular update process to die in the first place without releasing its lock.
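For reference, the effect of deleting every key that matches 'uniquejobs:*' can be sketched in Python. This is not what was run on the server (that was the gem's own command line tool), and it assumes a client object that behaves like redis.Redis from the redis-py package, whose scan_iter(match=...) and delete(*names) methods are used here. The function name and batch size are made up for illustration.

```python
def delete_matching_keys(client, pattern="uniquejobs:*", batch_size=500):
    """Delete every key matching `pattern` and return how many were removed.

    `client` is assumed to behave like redis.Redis from redis-py:
    scan_iter(match=...) yields key names and delete(*names) returns
    the number of keys actually deleted.
    """
    deleted = 0
    batch = []
    for key in client.scan_iter(match=pattern):
        batch.append(key)
        # Delete in batches so one call never carries an unbounded key list.
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch = []
    if batch:
        deleted += client.delete(*batch)
    return deleted
```

Using SCAN rather than KEYS keeps Redis responsive while the keys are being walked, which matters when there are thousands of them.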

Hi,

Can you please remove me from the watchers/subscribers list? I had
subscribed to receive emails for this project accidentally.

Thank you,
Swati Agarwal

Ragesoss updated the task description. Feb 25 2019, 8:28 PM

The Dashboard went down when the disk filled up. https://phabricator.wikimedia.org/T219865

The main problem was mysql tmp files, several GB in size, in the root /tmp directory. I deleted the /tmp files to get the Dashboard back up and running.

Subsequently, I edited /etc/mysql/my.cnf to specify tmpdir=/srv/tmp, created that directory and set it to 777 permissions, then restarted mysql.

Things seem to be working properly now, although the system hasn't tried to create any mysql tmp files yet.
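The directory setup described above can be sketched as a small Python helper. This is a hedged illustration rather than the commands actually used: the function name is hypothetical, and the 0o777 default simply mirrors the permissions mentioned in this entry.

```python
import os
import stat


def ensure_tmpdir(path: str, mode: int = 0o777) -> int:
    """Create `path` if needed, set its permission bits, and return them."""
    os.makedirs(path, exist_ok=True)
    # Call chmod explicitly: any mode passed to makedirs is filtered by
    # the process umask, so it cannot be relied on to produce 777.
    os.chmod(path, mode)
    return stat.S_IMODE(os.stat(path).st_mode)
```

Run as a user allowed to write under /srv, ensure_tmpdir("/srv/tmp") would return the resulting permission bits, which makes it easy to confirm the directory is ready before restarting mysql.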

For the last two days or so, many spam Campaign records were being created, because of a security hole that allowed a not-logged-in user to create a Campaign.

I deleted the spam records from the server console, and Wes Reid deployed a fix. Because of poor upload bandwidth during the deployment, the site assets (and hence, basically the whole site) were not available for approximately 30 minutes.

Afterwards, the Passenger queue was full and I restarted Apache to get the system working again.

I modified the apache configuration today to reduce the amount of logging and avoid storing IP addresses:

  • Removed access logging from the dashboard vhost
  • Disabled other-vhosts-access-log.conf
  • edited the ErrorLog configuration of apache2.conf to remove %h %l %u from the format, so that no user information is stored for errors.
  • deleted old logs

These were all the places I know of where the Dashboard stored IPs, so any remaining personal information should be within the domain of the Dashboard codebase itself (and by default includes only Wikipedia username and OAuth credentials, aside from publicly available wiki data).
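One way to double-check that no addresses are still being written is to scan the remaining log files for anything that looks like an IPv4 address. The sketch below is an assumption about how such a check could be done, not something that was run on the server. The names are hypothetical, IPv6 addresses are not covered, and four-part version strings would show up as false positives.

```python
import re

# Four dot-separated numbers, each in the range 0-255.
IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)


def lines_with_ipv4(lines):
    """Return (line_number, line) pairs for lines that contain an IPv4-like string."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if IPV4.search(line)
    ]
```

The function accepts any iterable of lines, so an open log file can be passed directly, and an empty result means nothing resembling an IPv4 address was found.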