
Create a wall for tools migration to trusty
Closed, Resolved · Public

Description

I think we should also create a list of tools still on precise and work on a roadmap with the devs. I would volunteer for that as well, if the committee thinks it is a good first task.

@madhuvishy has been making some progress towards creating a list based on a combination of the OGE accounting data and the live jobs on the grid in T149214: Make a nag system to email maintainers of tools still running on precise grid hosts. I'm sure she would welcome any help you can offer. Maybe we could make something like https://tools.wmflabs.org/extreg-wos/ that keeps track of which tools could use help migrating. I'd hope we can find a better name than "wall of sadness" though.

Result: https://tools.wmflabs.org/precise-tools/

Event Timeline

I think P4805 can bootstrap the wall.

Problems:

  1. Migration complete (which could be a webservice on k8s) vs. an abandoned/inactive tool that has no job submissions at all
  2. Using the last N lines of the grid accounting logs may include jobs that have since been migrated, or exclude infrequently-run (e.g. weekly) precise jobs
  3. Where should this wall be hosted? -- I'd propose the committee have its own tool account, and the wall be hosted under it.

Problems:

  1. Migration complete (which could be a webservice on k8s) vs. an abandoned/inactive tool that has no job submissions at all

If it has no job submissions then it won't show up in the accounting or live job lists.

  2. Using the last N lines of the grid accounting logs may include jobs that have since been migrated, or exclude infrequently-run (e.g. weekly) precise jobs

The DAYS=7 in accounting_tools checks the job end time, so the tool should probably say something like "Precise activity since $DATE". The cutoff could be set lower too. Another thing that could be done is to tweak it to display the latest job stop date in the list and/or have a detail page that shows all the jobs from $TOOL in the range that is examined.
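A minimal sketch of that idea (not the actual accounting_tools code), assuming the accounting records have already been parsed into dicts with a 'tool' name and an 'end_time' unix timestamp:

import datetime
import time

DAYS = 7

def precise_activity(records):
    """Return the cutoff date and the latest job stop time per tool."""
    cutoff = time.time() - DAYS * 24 * 3600
    latest_stop = {}
    for rec in records:
        if rec['end_time'] >= cutoff:
            tool = rec['tool']
            latest_stop[tool] = max(latest_stop.get(tool, 0), rec['end_time'])
    # e.g. render the header as "Precise activity since 2017-01-20"
    # and show latest_stop[tool] next to each tool in the list
    since = datetime.date.fromtimestamp(cutoff).isoformat()
    return since, latest_stop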

  3. Where should this wall be hosted? -- I'd propose the committee have its own tool account, and the wall be hosted under it.

Please do not start making a "big bag of tools" tool account. Make a distinct account for this like "precise-nag" or something and add co-maintainers as needed. Tool accounts are cheap and life is much nicer when one account hosts one tool.

The basic list is working. Now it needs:

  • trim off the tools. prefix from unix account names (see the sketch after this list)
  • link each tool to https://tools.wmflabs.org/?tool=$NAME to make looking up maintainers easy
  • cache the list of tools for 24 hours (real-time computation is too slow)
  • set up a cron job to refresh the list each day
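A small sketch of the first two items, assuming the raw grid owner names (e.g. "tools.precise-tools") have already been collected:

def display_name(owner):
    # strip the "tools." prefix from the unix account name
    return owner[len('tools.'):] if owner.startswith('tools.') else owner

def tool_link(owner):
    # admin console page that lists the tool's maintainers
    return 'https://tools.wmflabs.org/?tool=' + display_name(owner)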
  • trim off tools. prefix from unix account names

Why is root in the list? If it's unintentional, filter out everything that does not start with tools.?

  • cache the list of tools for 24 hours (real-time computation is too slow)

My last try was 1 minute 21 seconds. 1 hour should be okay, right?
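A sketch of a simple time-based cache along those lines; tools_from_accounting() is a hypothetical stand-in for the slow computation, and the one-hour TTL follows the suggestion above:

import json
import os
import time

CACHE_FILE = os.path.expanduser('~/precise_tools_cache.json')
CACHE_TTL = 3600  # one hour

def cached_tools():
    try:
        if time.time() - os.path.getmtime(CACHE_FILE) < CACHE_TTL:
            with open(CACHE_FILE) as f:
                return json.load(f)
    except (OSError, ValueError):
        pass  # no cache yet, or an unreadable/corrupt cache file
    tools = tools_from_accounting()  # hypothetical: the expensive part
    with open(CACHE_FILE, 'w') as f:
        json.dump(tools, f)
    return tools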

  • How about k8s? Faster startup times due to the attribute cache thing (I saw "WSGI app 0 (mountpoint='/precise-tools') ready in 17 seconds on interpreter 0x19cb060 pid: 17556 (default app)" in the logs). (EDIT: seems to be blocked by T156605)
  • My original thinking is that, like the wall of sadness, those that have migrated get a green background and those that have not get a red background (or something similar). And the list should probably be sorted.
  • Also observed that two of my webservices (which were migrated a few days ago) are showing stale data. I'd propose that, from the accounting logs, we consider only the latest entry for each job name (see the sketch after this list). Would it be too expensive to process?
  • Matanya said he wanted to volunteer. Shall I add him?
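Regarding the stale entries, a sketch of keeping only the newest accounting record per (tool, job name), assuming each record is a dict with 'tool', 'job_name' and 'end_time' keys (hypothetical shape):

def latest_per_job(records):
    newest = {}
    for rec in records:
        key = (rec['tool'], rec['job_name'])
        if key not in newest or rec['end_time'] > newest[key]['end_time']:
            newest[key] = rec
    return list(newest.values())

This is a single pass over the records, so it should add little on top of the cost of reading the log in the first place.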

k8s reduced startup time to 1 second; response time is now 01:10.

@bd808 Idk why I can't write to that repo, but I applied P4832 (saved in ~/169b40bff497eff780df28ba010cdf3564e62cdd.patch), which decreased the response time to about 50 seconds (3 tests: 52, 45, 46 seconds).

My claim that it was much, much faster might not hold on k8s, but on the bastion it did (due to the attribute cache?):

tools.precise-tools@tools-bastion-05:~$ tail --bytes=$(( 400 * 45000 * 7 )) /data/project/.system/accounting | pv > /dev/null
 120MB 0:00:22 [5.31MB/s] [                                            <=>                                                                                      ]
tools.precise-tools@tools-bastion-05:~$ tail --bytes=$(( 400 * 45000 * 7 )) /data/project/.system/accounting | pv > /dev/null
 120MB 0:00:18 [6.33MB/s] [                                       <=>                                                                                           ]
tools.precise-tools@tools-bastion-05:~$ tail --lines=$(( 45000 * 7 )) /data/project/.system/accounting | pv > /dev/null
 111MB 0:04:14 [ 445kB/s] [                                    <=>                                                                                              ]
tools.precise-tools@tools-bastion-05:~$ tail --lines=$(( 45000 * 7 )) /data/project/.system/accounting | pv > /dev/null
 111MB 0:04:24 [ 428kB/s] [                                                           <=>                                                                       ]
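The gap is presumably because tail --bytes can seek straight to an offset near the end of the file, while tail --lines has to scan backwards counting newlines. A sketch of doing the equivalent byte-offset read from Python, reusing the 400 * 45000 * 7 estimate from the commands above:

import os

ACCOUNTING = '/data/project/.system/accounting'
APPROX_BYTES = 400 * 45000 * 7  # roughly 7 days of records

def recent_lines():
    with open(ACCOUNTING, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - APPROX_BYTES))
        if size > APPROX_BYTES:
            f.readline()  # discard the probably-partial first line
        return f.read().decode('utf-8', 'replace').splitlines()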

Applied P4833 instead, now about 34 seconds (3 tests: 36, 32, 33).

The root job entry is continuous:tools-exec-1220.tools.eqiad.wmflabs:root:root:test-precise-1:340901:sge:10:1485484020:1485484020:1485484069:100:130:49:0.024001:0.008000:3656.000000:0:0:0:0:1552:0:0:0.000000:0:0:0:0:6:1:NONE:defaultdepartment:NONE:1:0:0.032001:0.000000:0.000007:-u root -q continuous -l h_vmem=524288k,release=precise:0.000000:NONE:8900608.000000:0:0. Since it looks like a test job I guess we can ignore that.
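For reference, the colon-separated fields in that record follow the gridengine accounting format (queue, host, group, owner, job name, job number, ..., end time at index 10). A sketch of pulling out the interesting bits and dropping non-tool owners, as suggested earlier:

def parse_record(line):
    fields = line.split(':')
    return {
        'queue': fields[0],           # e.g. "continuous"
        'host': fields[1],            # e.g. "tools-exec-1220.tools.eqiad.wmflabs"
        'owner': fields[3],           # e.g. "root" or "tools.precise-tools"
        'job_name': fields[4],        # e.g. "test-precise-1"
        'end_time': int(fields[10]),  # unix timestamp of the job end
        'on_precise': 'release=precise' in line,
    }

def is_tool_precise_job(rec):
    # drops root's test job above along with anything else not owned by a tool
    return rec['owner'].startswith('tools.') and rec['on_precise']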

It would be nice if the wall could include all the maintainers' names too so I could just ctrl+f instead of trying to remember all the tools I'm responsible for :/

It would be nice if the wall could include all the maintainers' names too so I could just ctrl+f instead of trying to remember all the tools I'm responsible for :/

Done. The interface is still being fixed. Thanks @bd808 for the design :)

@Ladsgroup, @Matanya, @eranroz, @zhuyifei1999: You all have been listed on that wall! Let us start this by moving our own projects to Trusty before asking others to do so :)

Ah yeah, I migrated all mine. But that's not a week ago yet. :P

I'm considering whether grid status should be able to veto accounting status. i.e. If we find lighttpd-yifeibot running in trusty then we ignore lighttpd-precise-yifeibot in precise.
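A sketch of what that veto could look like, assuming hypothetical helpers that yield (tool, job name) pairs from the accounting log and from the live grid:

def veto_migrated(precise_jobs, trusty_jobs):
    # e.g. lighttpd-precise-yifeibot normalizes to lighttpd-yifeibot
    def normalize(name):
        return name.replace('-precise', '')

    migrated = {(tool, normalize(name)) for tool, name in trusty_jobs}
    return [
        (tool, name)
        for tool, name in precise_jobs
        if (tool, normalize(name)) not in migrated
    ]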

@Ladsgroup, @Matanya, @eranroz, @zhuyifei1999: You all have been listed on that wall! Let us start this by moving our own projects to Trusty before asking others to do so :)

Done for all my tools. Will not be on the wall of shame next time :)

Don't know if it belongs here: if tools are using Precise "only" for their webservice, this might be a good opportunity to try restarting them first with the Kubernetes backend to see whether there are any failures with it, and only use the grid backend otherwise. This often requires cooperation from the tools' maintainers because they know which URLs to hit for testing.

Ah yeah, I migrated all mine. But that's not a week ago yet. :P

I'm considering whether grid status should be able to veto accounting status. i.e. If we find lighttpd-yifeibot running in trusty then we ignore lighttpd-precise-yifeibot in precise.

This is fixed now. A tool that moves its webservice from precise to trusty will drop off the list. If it jumps over to running on k8s it will still hang out on the list for 7 days however.

Should we add a link to this wall on https://wikitech.wikimedia.org/wiki/Tools_Precise_deprecation ? And should we expand that page to clearly state what it takes to move to Trusty?

+1

And should we expand that page to clearly state what it takes to move to Trusty?

I don't think I get what you meant exactly by "what it takes", but more docs are usually good :) (See also: T101659)

Note that while most tools should be able to painlessly migrate, some do have troubles:

  • Can't find the ticket now, but I remember some PHP webservices had trouble because of the PHP version increase
  • Python virtualenvs might need to be rebuilt
  • Oh, and there is stuff like slimerjs (part of one of my discontinued tools, which will be rewritten on jessie) that somehow segfaulted on trusty last time I tried

I don't think I get what you meant exactly by "what it takes", but more docs are usually good :) (See also: T101659)

What I mean is: is restarting the job without a "--precise" parameter enough?

scfc assigned this task to zhuyifei1999.

AFAICT this task has been done; thanks to @bd808 and @zhuyifei1999. Please file additional tasks for bugs, improvements, etc. if necessary.