
Foundation-only Geowiki stopped updating [3 pts]
Closed, ResolvedPublic

Description

Jaime: I'm tagging DBA on this and assigning to you because I see this problem in the logs for the geowiki scripts:

OperationalError: (1290, 'The MariaDB server is running with the --read-only option so it cannot execute this statement')

The script is stat1003.eqiad.wmnet:/srv/geowiki/scripts/geowiki/mysql_config.py
What it's doing is running some sql / python on a lot of different dbs, and it uses s[1-7]-analytics-slave.eqiad.wmnet to connect.

Non-technical explanation
The foundation-only part of geowiki, served at https://stats.wikimedia.org/geowiki-private, seems to have stopped updating on December 18th. (An older version of this ticket said June 12th, but that was due to other reasons.)

FYI: https://wikitech.wikimedia.org/wiki/Analytics/Geowiki

Event Timeline

Ijon created this task. Jul 18 2015, 7:13 PM
Ijon raised the priority of this task from to Needs Triage.
Ijon updated the task description.
Ijon added a project: Analytics.
Ijon added a subscriber: Ijon.
Restricted Application added a subscriber: Aklapper. Jul 18 2015, 7:13 PM
Milimetric edited projects, added Analytics-Kanban; removed Analytics-Backlog, Analytics.
Milimetric updated the task description. Aug 13 2015, 5:21 PM
kevinator triaged this task as High priority. Aug 18 2015, 3:36 PM

The problem is that Christian's home directory is gone, and that's where all the python setup that supported the scripts was hacked^Whoused :)

Relevant blurb from the script that stopped running:

#---------------------------------------------------
# Setting up python environment.
#
# ottomata decided that it's not worth the effort to puppetize each
# and every of erosen's python repos. So in order to get the scripts
# running, we have to rely on a prepared set-up.
# That's bad.
# Like really bad.
# But at least it allows us to run the scripts for now.

PYTHON_SHIM_BASE_DIR_ABS=/home/qchris/
export PYTHONPATH="$PYTHONPATH:${PYTHON_SHIM_BASE_DIR_ABS}/wp-zero/wikimarkup-1.01b1+encoding_patch+removed_django_depends"
export PYTHONPATH="$PYTHONPATH:${PYTHON_SHIM_BASE_DIR_ABS}/wp-zero/src/limnpy"
export PYTHONPATH="$PYTHONPATH:${PYTHON_SHIM_BASE_DIR_ABS}/wp-zero/src/mcc-mnc"
export PYTHONPATH="$PYTHONPATH:${PYTHON_SHIM_BASE_DIR_ABS}/wp-zero/src/wikipandas"
export PYTHONPATH="$PYTHONPATH:${PYTHON_SHIM_BASE_DIR_ABS}/global-dev/dashboard/src/gcat"
export PYTHONPATH="$PYTHONPATH:${PYTHON_SHIM_BASE_DIR_ABS}/.local/lib/python2.7/site-packages"
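Since the failure came down to PYTHONPATH entries pointing into a home directory that no longer exists, a small pre-flight check could flag that before the job runs. The following is only a sketch: the function name `missing_pythonpath_entries` is made up for illustration and is not part of the geowiki scripts.

```python
import os


def missing_pythonpath_entries(pythonpath):
    """Return the PYTHONPATH entries that are not existing directories.

    `pythonpath` is the raw string, e.g. os.environ.get("PYTHONPATH", "").
    Empty entries (from a leading or doubled separator) are skipped.
    """
    entries = [p for p in pythonpath.split(os.pathsep) if p]
    return [p for p in entries if not os.path.isdir(p)]


# Possible usage from a wrapper script (hypothetical):
#   for path in missing_pythonpath_entries(os.environ.get("PYTHONPATH", "")):
#       print("missing PYTHONPATH entry: %s" % path)
```

Run before the geowiki job, a check like this would have reported every `/home/qchris/...` entry as missing instead of letting the imports fail later.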

I no longer have backups of home directory from wmf servers, so I cannot help you with a drop-in fix :-((

@kevinator: I still think the above comment that Dan linked holds true. Please schedule time to move Geowiki from "fire-fighting mode" to "proper project" :-)
/me looks at T86199.


That said ... I have some measly fire-fighting notes from back then. They are neither digested nor minimal (you have been warned), but they should at least allow you to regenerate the setup.

(Not sure if all of them are really needed (after all some have wp-zero in their name). Some are for sure unneeded, but some of the wp-zero stuff was needed IIRC.)

(The -e should not be needed. Back then, I just used what I found in the docs, and I decided to just list verbatim what I find in my notes now.)

Either give it a shot yourself, or grant me access to the machine running geowiki (if possible with sudo to the geowiki user), and I'll get the environment up again over the weekend.

  • wikimarkup
wget https://pypi.python.org/packages/source/w/wikimarkup/wikimarkup-1.01b1+encoding_patch+removed_django_depends.tar.gz
tar -xzvf 'wikimarkup-1.01b1+encoding_patch+removed_django_depends.tar.gz'

cd wikimarkup-1.01b1+encoding_patch+removed_django_depends
pip install --user -e .
cd ..
  • limnpy
pip install --user -e git+git://github.com/wikimedia/limnpy.git#egg=limnpy-0.1.0
  • mcc-mnc
pip install --user -e git+git://github.com/embr/mcc-mnc.git#egg=mcc-mnc-0.1.0

Here my notes say that I installed version 0.1.0 although wp-zero says to require 0.0.1. I guess 0.1.0 was the version Evan had installed. It should not matter, as mcc-mnc is not used by geowiki. I am noting it nonetheless, in case some dependencies are only satisfied through dependencies of mcc-mnc.

  • wikipandas
pip install --user -e git+git://github.com/embr/wikipandas.git#egg=wikipandas-0.0.1
  • google-api-python-client

    This was needed for gcat (see below)
pip install --user google-api-python-client
  • gcat

    It turned out that although gcat was listed as a dependency, the code no longer depended on it. So it should be fine to skip it. Listing it nonetheless, in case some dependencies are only satisfied through dependencies of gcat.
pip install -e git+git://github.com/embr/gcat.git#egg=gcat-0.1.0
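After running installs like the ones above, one way to see whether the environment is usable is to try importing each package. This is a rough sketch, not something from the original notes: the helper `unimportable` is hypothetical, and the import names listed are assumed to match the package names, which may not hold for every repo (mcc-mnc is left out because its import name is unclear).

```python
import importlib


def unimportable(module_names):
    """Return the names from `module_names` that fail to import."""
    failed = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            failed.append(name)
    return failed


# Assumed import names; adjust to what the packages actually expose.
ASSUMED_MODULES = ["wikimarkup", "limnpy", "wikipandas", "gcat"]

# Possible usage:
#   print(unimportable(ASSUMED_MODULES))
```

An empty list would suggest the PYTHONPATH and `--user` installs are being picked up by the interpreter that runs the check.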

Change 232426 had a related patch set uploaded (by Milimetric):
Pass the buck

https://gerrit.wikimedia.org/r/232426

@QChris, I waited until I was pretty sure you're asleep :)

Thank you so much for this help, it's invaluable. I think I got it working with your instructions, the script ran but had some iteration error thing that I'll look at tomorrow. No worries, I can handle whatever's going on. You rock.

FYI the script currently in code review works, it's been running for about 5 hours and it looks to be about 1/3 of the way through updating the private data. I'd expect it to finish at some point tonight.

Change 232426 merged by Milimetric:
Pass the buck

https://gerrit.wikimedia.org/r/232426

Moving this back to Ready to Deploy until I figure out whether the repository updates and all that are working. Running the script manually gave me some errors.

@Ijon, I fixed this script sometime last week, and it's been catching up. Could you please let us know if the data looks right from your perspective so we can close the issue? I spot checked some files but there are a lot and I want to make sure the ones you look at have caught up, since they hadn't updated for so long.

Looks great to me, thank you Dan!

Milimetric moved this task from Paused to Done on the Analytics-Kanban board. Aug 27 2015, 2:18 AM
kevinator closed this task as Resolved. Aug 27 2015, 11:03 PM
Ijon reopened this task as Open. Jan 3 2016, 7:50 PM

This has stopped updating again. The files were last updated in that directory on Dec 20th.

Ideally, this ticket should be resolved only when the files are back in regular generation and some mechanism is in place to *monitor* their daily generation, so I don't have to discover and report that it has stopped working just when I need fresh data.
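A freshness check along these lines could serve as a starting point for that kind of monitoring. It is a minimal sketch under assumptions: the function name `stale_files` and the two-day threshold in the usage comment are illustrative, and the output directory is left to the caller because its path is not given here.

```python
import os
import time


def stale_files(directory, max_age_seconds, now=None):
    """Return sorted names of regular files in `directory` whose
    modification time is more than `max_age_seconds` in the past."""
    if now is None:
        now = time.time()
    stale = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            stale.append(name)
    return sorted(stale)


# Possible usage from a daily cron job (directory is a placeholder):
#   old = stale_files(output_dir, 2 * 24 * 3600)
#   if old:
#       print("geowiki files not updated recently: %s" % ", ".join(old))
```

Wired to an alert or email, a non-empty result would surface a stalled job within a day or two instead of weeks.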

Milimetric moved this task from Done to In Progress on the Analytics-Kanban board. Jan 12 2016, 5:04 PM
Milimetric moved this task from Incoming to temporary on the Analytics board. Jan 12 2016, 7:28 PM
Milimetric moved this task from temporary to Incoming on the Analytics board. Jan 12 2016, 7:35 PM
Milimetric moved this task from Incoming to temporary on the Analytics board. Jan 12 2016, 7:38 PM
Milimetric moved this task from temporary to Incoming on the Analytics board. Jan 12 2016, 7:43 PM

The new problem seems to be that all the databases Evan's scripts write to are now in read-only mode:

ERROR 1290 (HY000): The MariaDB server is running with the --read-only option so it cannot execute this statement
mysql:research@s1-analytics-slave.eqiad.wmnet [staging]>
Milimetric reassigned this task from Milimetric to jcrespo. Jan 19 2016, 4:11 PM
Milimetric updated the task description.
Milimetric added a project: DBA.
jcrespo removed jcrespo as the assignee of this task. Jan 19 2016, 4:40 PM
jcrespo added a subscriber: jcrespo.

I've set s1-analytics-slave.eqiad.wmnet as read-write. It was set as read only when maintenance was performed there 29.2 days ago.

I have no problem making this change, but I need a list of the slaves that analytics/research need to write to, as, by definition, a slave is read-only (it only allows writes from the master). That way, we can monitor this fact so this will not happen again. Things get confusing because our slaves are used by many roles and people at the same time, with different needs. Monitoring this variable will avoid this happening again.
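The monitoring idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the helper names are invented, and the actual database query is left to a caller-supplied function (`get_read_only`) so that no particular MySQL client library is implied.

```python
def analytics_slave_hosts(shards=range(1, 8)):
    """Build the s[1-7]-analytics-slave.eqiad.wmnet names used by the script."""
    return ["s%d-analytics-slave.eqiad.wmnet" % n for n in shards]


def read_only_hosts(hosts, get_read_only):
    """Return the hosts for which `get_read_only(host)` reports read_only ON.

    `get_read_only` is a hypothetical, caller-supplied function that runs
    SHOW GLOBAL VARIABLES LIKE 'read_only' on the host and returns the
    value as a string such as "ON" or "OFF".
    """
    return [h for h in hosts if get_read_only(h).upper() in ("ON", "1")]


# Possible usage with a stand-in query function:
#   problem = read_only_hosts(analytics_slave_hosts(), my_query_function)
```

Any host returned by `read_only_hosts` is one the scripts would fail to write to, so a non-empty result is the condition worth alerting on.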

jcrespo removed a project: DBA. Jan 19 2016, 4:41 PM

Not closing, because you probably need to perform more tasks to fully fix this issue(?).

@jcrespo, thanks, the scripts should catch up by themselves, but I'll close this when I confirm. I'll comment inline below:

I've set s1-analytics-slave.eqiad.wmnet as read-write. It was set as read only when maintenance was performed there 29.2 days ago.

Thank you, but just to make sure, this script writes to *all* the slaves: s[1-7]-analytics-slave.eqiad.wmnet. So can we make all of them read-write?

I have no problem making this change, but I need a list of the slaves that analytics/research need to write to, as, by definition, a slave is read-only (it only allows writes from the master). That way, we can monitor this fact so this will not happen again. Things get confusing because our slaves are used by many roles and people at the same time, with different needs. Monitoring this variable will avoid this happening again.

Yes, true, and you're in the right to make us refactor these scripts. This is 4-year-old code that none of us have had the opportunity to touch because some of it is pretty complicated. But basically it writes to all the slaves I know of: s1, s2, s3, s4, s5, s6, and s7. It uses the staging database on those to write, if that helps, so that's the only one that needs to be write-able. A refactor of the code could make it write to one central place and just change the schema of what it's writing. But I'm not jumping for joy at the thought of touching that code. We may re-write it in a much simpler way soon anyway.

The s1, s2, s3, s4, s5, s6, and s7 -analytics-slave names are only DNS names; as far as I know, there are only 2 analytics slaves (as physical machines):

s1 and s2 (db1047)
all others (dbstore1002)

(It is more complex than that, but anyway.)

dbstore1002 was already read-write because I knew it was written to; that was not the case for db1047 (which I thought was only used by research).

This is the proof :-)

root@iron:~$ for shard in s1 s2 s3 s4 s5 s6 s7; do mysql -h $shard-analytics-slave.eqiad.wmnet -e "SHOW GLOBAL VARIABLES like 'read_only'"; done
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | OFF   |
+---------------+-------+
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | OFF   |
+---------------+-------+
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | OFF   |
+---------------+-------+
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | OFF   |
+---------------+-------+
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | OFF   |
+---------------+-------+
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | OFF   |
+---------------+-------+
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | OFF   |
+---------------+-------+

Thanks Jaime! I'll move this to done and close it tomorrow if I see geowiki start to catch up. Much appreciated.

Milimetric moved this task from Ready to Deploy to Done on the Analytics-Kanban board.
Milimetric renamed this task from Foundation-only Geowiki stopped updating to Foundation-only Geowiki stopped updating [3 pts]. Jan 21 2016, 6:04 PM
Milimetric closed this task as Resolved. Jan 28 2016, 5:24 PM