Database table backing https://tools.wmflabs.org/admin/tools listing not being updated properly
Open, Medium | Public | BUG REPORT

Description

I created a tool called 'rmstats' 2 days ago (7/27). I set a bunch of metadata fields for it (title, description, link to source code, tags) at https://toolsadmin.wikimedia.org/tools/id/rmstats.

I would expect my tool to appear in the list of all hosted tools at https://tools.wmflabs.org/admin/tools, but it does not (Ctrl+F for "Colin M" or "rmstats" gives 0 results). It also doesn't appear in Hay's tool directory (https://tools.wmflabs.org/hay/directory/#/search/rmstats).

(The original title of my tool was set to "RM Stats", which I later realized duplicated the title of a different tool. In case that was the cause of the issue, I changed my tool's title to "RM Stats Visualizer", but that does not seem to have fixed the issue.)

Event Timeline

bd808 added a subscriber: bd808.

It looks like an LDAP directory hiccup killed the script that runs on tools-sge-services-03 to populate the database that the admin tool reads tool information from:

$ sudo journalctl -u updatetools --no-pager
-- Logs begin at Mon 2019-07-29 10:10:17 UTC, end at Mon 2019-07-29 22:03:03 UTC. --
Jul 29 10:13:47 tools-sge-services-03 systemd[1]: [/lib/systemd/system/updatetools.service:13] Invalid user/group name or numeric ID, ignoring: tools.admin
Jul 29 10:13:47 tools-sge-services-03 systemd[1]: [/lib/systemd/system/updatetools.service:14] Invalid user/group name or numeric ID, ignoring: tools.admin
Jul 29 10:13:47 tools-sge-services-03 systemd[1]: Started toolforge updatetools service, update tools and user tables.
Jul 29 10:13:47 tools-sge-services-03 systemd[1]: updatetools.service: Main process exited, code=exited, status=1/FAILURE
Jul 29 10:13:47 tools-sge-services-03 systemd[1]: updatetools.service: Unit entered failed state.
bd808 renamed this task from Newly created tool not listed at https://tools.wmflabs.org/admin/tools to Database table backing https://tools.wmflabs.org/admin/tools listing not being updated properly. Jul 29 2019, 11:34 PM

While working on this I managed to empty out the backing database table entirely, and so far I have not been able to get it to refill as expected. The script that does this maintenance expects Python's pwd.getpwall() function to return a list of every user on the system. On the server where it runs (tools-sge-services-03) that expectation fails: pwd.getpwall() seems to return only the contents of /etc/passwd, ignoring the /etc/nsswitch.conf configuration, which says that both the local file and sss (the LDAP connector) should be queried.

bd808 triaged this task as High priority.

I'm now wondering why the tools table was not emptied before this. I think it is because the "Invalid user/group name" error from systemd kept the script from starting at all. When I ran it manually as the tools.admin user, the pwd.getpwall() call in the script returned only the local /etc/passwd contents and not the LDAP directory contents. This is a "feature" of sssd, which disables enumeration by default.
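
The enumeration gap described above can be checked with a short sketch like the one below. It is an illustration only, not part of the updatetools script. The helper name missing_from_enumeration is hypothetical. It compares direct lookups (pwd.getpwnam) against the enumerated list (pwd.getpwall) and reports accounts that resolve directly but do not show up when enumerating.

```python
import pwd


def missing_from_enumeration(names):
    """Return names that resolve via getpwnam() but are absent from getpwall().

    Hypothetical helper for illustration. With sssd enumeration disabled,
    directory-backed accounts are expected to land in this list.
    """
    enumerated = {entry.pw_name for entry in pwd.getpwall()}
    missing = []
    for name in names:
        try:
            pwd.getpwnam(name)  # direct NSS lookup by name
        except KeyError:
            continue  # the account does not resolve at all; not our case
        if name not in enumerated:
            missing.append(name)
    return missing
```

On a host configured like the one described here, calling missing_from_enumeration(["tools.admin"]) would be expected to return the name, since the account resolves by name but is not enumerated. That is an expectation based on the comment above, not a recorded result.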

This script needs to be rewritten to talk directly to the LDAP directory instead of (ab)using the pwd family of functions.

I have a working version of the updatetools script that uses LDAP directly instead of Python's pwd and grp wrappers around NSS lookups. Using this I have repopulated the database tables. I also decided to fix T164971: Broken unicode characters / invalid UTF-8 on Tool Labs index while working on this.
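
For readers unfamiliar with the approach, here is a minimal sketch of listing accounts straight from an LDAP directory instead of going through pwd. It is not the code from the patch. It assumes the third-party ldap3 library, an anonymous read-only bind, and RFC 2307 posixAccount entries. The function names, the uri and base_dn parameters, and the chosen attributes are all assumptions made for illustration.

```python
def first(value):
    """Return the first element of a multi-valued attribute, or the value itself."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else None
    return value


def entry_to_row(attributes):
    """Flatten an LDAP attribute dict into a plain row (hypothetical column names)."""
    return {
        "name": first(attributes.get("cn")),
        "uid": first(attributes.get("uidNumber")),
        "home": first(attributes.get("homeDirectory")),
    }


def fetch_accounts(uri, base_dn, search_filter="(objectClass=posixAccount)"):
    """Page through every matching entry under base_dn and return rows.

    uri and base_dn are caller-supplied; no real server details are assumed.
    """
    # Imported lazily so the pure helpers above work without ldap3 installed.
    from ldap3 import SUBTREE, Connection, Server

    conn = Connection(Server(uri), auto_bind=True, read_only=True)  # anonymous bind
    try:
        entries = conn.extend.standard.paged_search(
            search_base=base_dn,
            search_filter=search_filter,
            search_scope=SUBTREE,
            attributes=["cn", "uidNumber", "homeDirectory"],
            paged_size=500,
            generator=True,
        )
        # Keep only real entries; referrals use a different "type" value.
        return [
            entry_to_row(entry["attributes"])
            for entry in entries
            if entry.get("type") == "searchResEntry"
        ]
    finally:
        conn.unbind()
```

A paged search avoids relying on NSS enumeration at all, which is the point of the rewrite described above.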

Change 526309 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: modernize updatetools script

https://gerrit.wikimedia.org/r/526309

Change 526309 merged by Bstorm:
[operations/puppet@production] toolforge: modernize updatetools script

https://gerrit.wikimedia.org/r/526309

Systemd has [[https://github.com/systemd/systemd/blob/100d5f6ee6ec53ecc20160cd4d6656cacf57db4f/src/basic/user-util.c#L592-L600|decreed that . is an invalid character in a username]] (in contradiction to the POSIX standard). This is keeping the systemd unit for this script from starting. After talking about various options on IRC with @Bstorm I think the "best" fix here is to move this script into the Tool-admin tool itself as a grid job or custom Kubernetes deployment.
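
To make the restriction concrete, the sketch below approximates the kind of strict name check described above. It is not a port of the linked systemd function: length limits are left out, and the exact character rules are an assumption based on that era's strict check (a letter or underscore first, then letters, digits, underscore or hyphen). The function name is hypothetical.

```python
import re

# Approximation only: first character is a letter or "_", the rest are
# letters, digits, "_" or "-". Length limits are intentionally omitted.
_STRICT_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def looks_valid_for_systemd(name):
    """Return True if name passes the approximate strict check."""
    return _STRICT_NAME.fullmatch(name) is not None
```

Under this approximation looks_valid_for_systemd("tools.admin") is False because of the dot, which matches the "Invalid user/group name" messages in the journal output earlier in this task.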

Ah, ok, so we didn't fix that, per se. It's just the way it is right now, regarding my comment on the other task :)

Change 542777 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: Remove updatetools script

https://gerrit.wikimedia.org/r/542777

Change 542777 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: Remove updatetools script

https://gerrit.wikimedia.org/r/542777

Mentioned in SAL (#wikimedia-cloud) [2019-10-14T09:26:44Z] <arturo> cleaned-up updatetools from tools-sge-services nodes (T229261)

The remaining piece of work here is getting the current updatetools.py script that is running as a cron job under the admin tool into version control somewhere along with docs on deploying/running it.

bd808 lowered the priority of this task from High to Medium. Oct 29 2019, 10:01 PM

I suspect that this problem is still occurring; I'm not sure if that is expected (i.e. it's been broken since June last year). If so, apologies for poking.

On the other hand, if this ticket was only left open for putting the script in version control and writing some docs, and the running script has fallen over in the meantime, then it might merit a quick look.

I suspect this problem is still present because https://tools.wmflabs.org/admin/tool/wd-ref-island is returning "no such tool" even though the tool has existed for a few months. The same tool also cannot be found through search on the main page.

Looks like the job got stuck at some point:

tools.admin@tools-sgebastion-08:~$ tail updatetools.err
[Wed Jun  3 16:39:02 2020] there is a job named 'updatetools' already active
[Wed Jun  3 16:59:02 2020] there is a job named 'updatetools' already active
[Wed Jun  3 17:19:02 2020] there is a job named 'updatetools' already active
[Wed Jun  3 17:39:02 2020] there is a job named 'updatetools' already active
[Wed Jun  3 17:59:02 2020] there is a job named 'updatetools' already active
[Wed Jun  3 18:19:02 2020] there is a job named 'updatetools' already active
[Wed Jun  3 18:39:01 2020] there is a job named 'updatetools' already active
[Wed Jun  3 18:59:02 2020] there is a job named 'updatetools' already active
[Wed Jun  3 19:19:02 2020] there is a job named 'updatetools' already active
[Wed Jun  3 19:39:02 2020] there is a job named 'updatetools' already active
tools.admin@tools-sgebastion-08:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
3305108 0.30553 updatetool tools.admin  r     04/23/2020 13:59:14 continuous@tools-sgeexec-0929.     1
tools.admin@tools-sgebastion-08:~$ qdel 3305108
tools.admin has registered the job 3305108 for deletion

The first "already running" was:
[Thu Apr 23 14:19:02 2020] there is a job named 'updatetools' already active

It looks like it got stuck in polling LDAP and stayed locked there for a month and a half.
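
One way to keep a single hung run from blocking every later run is to put a deadline around each run. The sketch below is a suggestion under stated assumptions, not something the current script is known to do. It uses signal.alarm, so it only works on Unix and only in the main thread. RunTimeout and run_with_deadline are hypothetical names.

```python
import signal


class RunTimeout(Exception):
    """Raised when a run exceeds its deadline."""


def run_with_deadline(func, seconds):
    """Call func() and raise RunTimeout if it takes longer than `seconds`.

    Unix only, main thread only: relies on SIGALRM interrupting the
    blocked call so the handler can raise.
    """

    def _on_alarm(signum, frame):
        raise RunTimeout("run exceeded %d seconds" % seconds)

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)  # schedule SIGALRM after `seconds` whole seconds
    try:
        return func()
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

With a wrapper like this, a run that stalls while polling the directory would fail with an exception and exit, so the next scheduled submission would no longer be rejected as "already active". Setting explicit timeouts on the LDAP connection itself would be a complementary option.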

I'll watch it for a bit to see if it gets back to doing the needful.

Starting a run...
Connecting to ToolsDB...
Connecting to LDAP...
Updating tools table...
Inserted 2634 tools
Updating users table...
Inserted 1965 users
Exiting because of --once flag

And @Tarrow's missing https://tools.wmflabs.org/admin/tool/wd-ref-island is now present.