
Create a wall for tools migration to trusty
Closed, Resolved · Public

Description

I think we should also create a list of tools still on precise and work on a roadmap with the devs. I would volunteer for that as well, if the committee thinks it is a good first task.

@madhuvishy has been making some progress towards creating a list based on a combination of the OGE accounting data and the live jobs on the grid in T149214: Make a nag system to email maintainers of tools still running on precise grid hosts. I'm sure she would welcome any help you can offer. Maybe we could make something like https://tools.wmflabs.org/extreg-wos/ that keeps track of which tools could use help migrating. I'd hope we can find a better name than "wall of sadness" though.

Result: https://tools.wmflabs.org/precise-tools/

Event Timeline

I think P4805 can bootstrap the wall.

Problems:

  1. Migration complete (which could be a webservice on k8s) vs. an abandoned/inactive tool that has no job submissions at all
  2. Using the last N lines of the grid accounting logs may include jobs that have since been migrated, or exclude infrequently-run (e.g. weekly) precise jobs
  3. Where should this wall be hosted? -- I'd propose the committee have its own tool account, and the wall be hosted under it.

Problems:

  1. Migration complete (which could be a webservice on k8s) vs. an abandoned/inactive tool that has no job submissions at all

If it has no job submissions then it won't show up in the accounting or live job lists.

  2. Using the last N lines of the grid accounting logs may include jobs that have since been migrated, or exclude infrequently-run (e.g. weekly) precise jobs

The DAYS=7 in accounting_tools checks the job end time, so the tool should probably say something like "Precise activity since $DATE". The cutoff could be set lower too. Another thing that could be done is to tweak it to display the latest job stop date in the list and/or have a detail page that shows all the jobs from $TOOL in the range that is examined.
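A minimal sketch of that idea (not the actual accounting_tools code), assuming the accounting records have already been parsed into dicts with a 'tool' name and an 'end_time' unix timestamp:

import datetime
import time

DAYS = 7

def precise_activity(records):
    """Return the cutoff date and the latest job stop time per tool."""
    cutoff = time.time() - DAYS * 24 * 3600
    latest_stop = {}
    for rec in records:
        if rec['end_time'] >= cutoff:
            tool = rec['tool']
            latest_stop[tool] = max(latest_stop.get(tool, 0), rec['end_time'])
    # e.g. render the header as "Precise activity since 2017-01-20"
    # and show latest_stop[tool] next to each tool in the list
    since = datetime.date.fromtimestamp(cutoff).isoformat()
    return since, latest_stop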

  3. Where should this wall be hosted? -- I'd propose the committee have its own tool account, and the wall be hosted under it.

Please do not start making a "big bag of tools" tool account. Make a distinct account for this like "precise-nag" or something and add co-maintainers as needed. Tool accounts are cheap and life is much nicer when one account hosts one tool.

The basic list is working. Now it needs:

  • trim off the tools. prefix from unix account names (see the sketch after this list)
  • link each tool to https://tools.wmflabs.org/?tool=$NAME to make looking up maintainers easy
  • cache the list of tools for 24 hours (real-time computation is too slow)
  • set up a cron job to refresh the list each day
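A small sketch of the first two items, assuming the raw grid owner names (e.g. "tools.precise-tools") have already been collected:

def display_name(owner):
    # strip the "tools." prefix from the unix account name
    return owner[len('tools.'):] if owner.startswith('tools.') else owner

def tool_link(owner):
    # admin console page that lists the tool's maintainers
    return 'https://tools.wmflabs.org/?tool=' + display_name(owner)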
  • trim off tools. prefix from unix account names

Why is root in the list? If it's unintentional, filter out everything that does not start with tools.?

  • cache the list of tools for 24 hours (real-time computation is too slow)

My last try was 1 minute 21 seconds. 1 hour should be okay, right?
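A sketch of a simple time-based cache along those lines; tools_from_accounting() is a hypothetical stand-in for the slow computation, and the one-hour TTL follows the suggestion above:

import json
import os
import time

CACHE_FILE = os.path.expanduser('~/precise_tools_cache.json')
CACHE_TTL = 3600  # one hour

def cached_tools():
    try:
        if time.time() - os.path.getmtime(CACHE_FILE) < CACHE_TTL:
            with open(CACHE_FILE) as f:
                return json.load(f)
    except (OSError, ValueError):
        pass  # no cache yet, or an unreadable/corrupt cache file
    tools = tools_from_accounting()  # hypothetical: the expensive part
    with open(CACHE_FILE, 'w') as f:
        json.dump(tools, f)
    return tools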

  • How about k8s? Faster startup times due to the attribute cache thing (I saw "WSGI app 0 (mountpoint='/precise-tools') ready in 17 seconds on interpreter 0x19cb060 pid: 17556 (default app)" in the logs). (EDIT: seems to be blocked by T156605)
  • My original thinking is that, like the wall of sadness, those that have migrated get a green background and those that have not get a red background (or something similar). And the list should probably be sorted.
  • Also observed that two of my webservices (which were migrated a few days ago) are showing stale data. I'd propose that, from the accounting logs, we consider only the latest entry for each job name (see the sketch after this list). Would it be too expensive to process?
  • Matanya said he wanted to volunteer. Shall I add him?
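Regarding the stale entries, a sketch of keeping only the newest accounting record per (tool, job name), assuming each record is a dict with 'tool', 'job_name' and 'end_time' keys (hypothetical shape):

def latest_per_job(records):
    newest = {}
    for rec in records:
        key = (rec['tool'], rec['job_name'])
        if key not in newest or rec['end_time'] > newest[key]['end_time']:
            newest[key] = rec
    return list(newest.values())

This is a single pass over the records, so it should add little on top of the cost of reading the log in the first place.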

k8s reduced startup time to 1 second; response time is now 01:10.

@bd808 Idk why I can't write to that repo, but I applied P4832 (saved in ~/169b40bff497eff780df28ba010cdf3564e62cdd.patch), which decreased the response time to about 50 seconds (3 tests: 52, 45, 46 seconds).

My claim that it was much, much faster might not hold on k8s, but on the bastion it did (due to the attribute cache?):

tools.precise-tools@tools-bastion-05:~$ tail --bytes=$(( 400 * 45000 * 7 )) /data/project/.system/accounting | pv > /dev/null
 120MB 0:00:22 [5.31MB/s] [                                            <=>                                                                                      ]
tools.precise-tools@tools-bastion-05:~$ tail --bytes=$(( 400 * 45000 * 7 )) /data/project/.system/accounting | pv > /dev/null
 120MB 0:00:18 [6.33MB/s] [                                       <=>                                                                                           ]
tools.precise-tools@tools-bastion-05:~$ tail --lines=$(( 45000 * 7 )) /data/project/.system/accounting | pv > /dev/null
 111MB 0:04:14 [ 445kB/s] [                                    <=>                                                                                              ]
tools.precise-tools@tools-bastion-05:~$ tail --lines=$(( 45000 * 7 )) /data/project/.system/accounting | pv > /dev/null
 111MB 0:04:24 [ 428kB/s] [                                                           <=>                                                                       ]
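The gap is presumably because tail --bytes can seek straight to an offset near the end of the file, while tail --lines has to scan backwards counting newlines. A sketch of doing the equivalent byte-offset read from Python, reusing the 400 * 45000 * 7 estimate from the commands above:

import os

ACCOUNTING = '/data/project/.system/accounting'
APPROX_BYTES = 400 * 45000 * 7  # roughly 7 days of records

def recent_lines():
    with open(ACCOUNTING, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - APPROX_BYTES))
        if size > APPROX_BYTES:
            f.readline()  # discard the probably-partial first line
        return f.read().decode('utf-8', 'replace').splitlines()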

Applied P4833 instead, now about 34 seconds (3 tests: 36, 32, 33).

The root job entry is continuous:tools-exec-1220.tools.eqiad.wmflabs:root:root:test-precise-1:340901:sge:10:1485484020:1485484020:1485484069:100:130:49:0.024001:0.008000:3656.000000:0:0:0:0:1552:0:0:0.000000:0:0:0:0:6:1:NONE:defaultdepartment:NONE:1:0:0.032001:0.000000:0.000007:-u root -q continuous -l h_vmem=524288k,release=precise:0.000000:NONE:8900608.000000:0:0. Since it looks like a test job I guess we can ignore that.
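For reference, the colon-separated fields in that record follow the gridengine accounting format (queue, host, group, owner, job name, job number, ..., end time at index 10). A sketch of pulling out the interesting bits and dropping non-tool owners, as suggested earlier:

def parse_record(line):
    fields = line.split(':')
    return {
        'queue': fields[0],           # e.g. "continuous"
        'host': fields[1],            # e.g. "tools-exec-1220.tools.eqiad.wmflabs"
        'owner': fields[3],           # e.g. "root" or "tools.precise-tools"
        'job_name': fields[4],        # e.g. "test-precise-1"
        'end_time': int(fields[10]),  # unix timestamp of the job end
        'on_precise': 'release=precise' in line,
    }

def is_tool_precise_job(rec):
    # drops root's test job above along with anything else not owned by a tool
    return rec['owner'].startswith('tools.') and rec['on_precise']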

It would be nice if the wall could include all the maintainers' names too so I could just ctrl+f instead of trying to remember all the tools I'm responsible for :/

It would be nice if the wall could include all the maintainers' names too so I could just ctrl+f instead of trying to remember all the tools I'm responsible for :/

Done. The interface is still being fixed. Thanks @bd808 for the design :)

@Ladsgroup, @Matanya, @eranroz, @zhuyifei1999: You all have been listed on that wall! Let us start this by moving our own projects to Trusty before asking others to do so :)

Ah yeah, I migrated all mine. But that's not a week ago yet. :P

I'm considering whether grid status should be able to veto accounting status. i.e. If we find lighttpd-yifeibot running in trusty then we ignore lighttpd-precise-yifeibot in precise.
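A sketch of what that veto could look like, assuming hypothetical helpers that yield (tool, job name) pairs from the accounting log and from the live grid:

def veto_migrated(precise_jobs, trusty_jobs):
    # e.g. lighttpd-precise-yifeibot normalizes to lighttpd-yifeibot
    def normalize(name):
        return name.replace('-precise', '')

    migrated = {(tool, normalize(name)) for tool, name in trusty_jobs}
    return [
        (tool, name)
        for tool, name in precise_jobs
        if (tool, normalize(name)) not in migrated
    ]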

@Ladsgroup, @Matanya, @eranroz, @zhuyifei1999: You all have been listed on that wall! Let us start this by moving our own projects to Trusty before asking others to do so :)

Done for all my tools. Will not be on the wall of shame next time :)

Don't know if it belongs here: if tools are using Precise "only" for their webservice, this might be a good opportunity to try restarting them first with the Kubernetes backend to see whether there are any failures with it, and only use the grid backend otherwise. This often requires cooperation from the tools' maintainers because they know which URLs to hit for testing.

Ah yeah, I migrated all mine. But that's not a week ago yet. :P

I'm considering whether grid status should be able to veto accounting status. i.e. If we find lighttpd-yifeibot running in trusty then we ignore lighttpd-precise-yifeibot in precise.

This is fixed now. A tool that moves its webservice from precise to trusty will drop off the list. If it jumps over to running on k8s it will still hang out on the list for 7 days however.

Should we add a link to this wall on https://wikitech.wikimedia.org/wiki/Tools_Precise_deprecation ? And should we expand that page to clearly state what it takes to move to Trusty?

+1

And should we expand that page to clearly state what it takes to move to Trusty?

I don't think I get what you meant exactly by "what it takes", but more docs are usually good :) (See also: T101659)

Note that while most tools should be able to painlessly migrate, some do have troubles:

  • Can't find the ticket now, but I remember some PHP webservices had trouble because of the PHP version increase
  • Python virtualenvs might need to be rebuilt
  • Oh, and there is stuff like slimerjs (part of one of my discontinued tools, which will be rewritten on jessie) that somehow segfaulted on trusty last time I tried

I don't think I get what you meant exactly by "what it takes", but more docs are usually good :) (See also: T101659)

What I mean is: is restarting the job without a "--precise" parameter enough?

scfc assigned this task to zhuyifei1999.

AFAICT this task has been done; thanks to @bd808 and @zhuyifei1999. Please file additional tasks for bugs, improvements, etc. if necessary.