zhuyifei1999
*Not* Serious business title.

User Details

User Since
Oct 13 2014, 10:19 AM (188 w, 5 d)
Availability
Available
IRC Nick
zhuyifei1999
LDAP User
Zhuyifei1999
MediaWiki User
Zhuyifei1999

Recent Activity

Today

zhuyifei1999 added a comment to T195589: Reset 2-factor authentication on Wikitech for Susannaanas.

Docs on 2fa reset: https://wikitech.wikimedia.org/wiki/Password_reset#Reset_two_factor_authentication

Sat, May 26, 7:53 AM · cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety

Yesterday

zhuyifei1999 renamed T195558: qsub by default sends jobs with `-l release=precise`, which cannot be satisfied from Cronjob stuck in jobgrid to qsub by default sends jobs with `-l release=precise`, which cannot be satisfied.
Fri, May 25, 6:01 AM · Toolforge
zhuyifei1999 closed T195558: qsub by default sends jobs with `-l release=precise`, which cannot be satisfied as Resolved.

Please use the task queue for one-time tasks instead of the webgrid-generic queue, which is meant for webservices.

Fri, May 25, 6:00 AM · Toolforge
zhuyifei1999 added a comment to T195558: qsub by default sends jobs with `-l release=precise`, which cannot be satisfied.

Could you test with qsub again?

Fri, May 25, 5:32 AM · Toolforge
zhuyifei1999 added a comment to T195558: qsub by default sends jobs with `-l release=precise`, which cannot be satisfied.

For some reason qsub defaults to release=precise (this should be fixed, but I don't know how). Use jsub instead, or specify -l release=trusty explicitly.

Fri, May 25, 5:18 AM · Toolforge

Thu, May 24

zhuyifei1999 added a member for Tech-Ambassadors: zhuyifei1999.
Thu, May 24, 4:21 PM
zhuyifei1999 added a watcher for Tech-Ambassadors: zhuyifei1999.
Thu, May 24, 4:21 PM
zhuyifei1999 added a comment to T195468: Tool kokolores is missing replica.my.cnf.

https://gerrit.wikimedia.org/r/#/c/434755/ Should fix that I think.

Thu, May 24, 10:56 AM · Toolforge

Wed, May 23

zhuyifei1999 closed T195322: cdnjs on Toolforge appears to be broken as Invalid.

Please reopen if this can be reproduced again.

Wed, May 23, 8:57 AM · Toolforge
zhuyifei1999 added a comment to T195322: cdnjs on Toolforge appears to be broken.

In the time between 22/May/2018:12:31:04 +0000 and 22/May/2018:22:05:59 +0000 I don't see any requests to tools-static-12 with a referrer of the genedb tool. Your screenshot shows a time of 23:02:48.

Wed, May 23, 1:34 AM · Toolforge
zhuyifei1999 lowered the priority of T195322: cdnjs on Toolforge appears to be broken from Unbreak Now! to High.

Works for me. Could you curl -v one of the affected URLs and paste the output here?

Wed, May 23, 1:16 AM · Toolforge

Tue, May 22

zhuyifei1999 added a project to T195293: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly): Wikimedia-log-errors.
Tue, May 22, 1:47 PM · Language-2018-Apr-June, MediaWiki-extensions-Translate, Wikimedia-Incident, Wikimedia-log-errors, Operations

Sun, May 20

zhuyifei1999 added a comment to T194864: Raise the rate limit for autopatrollers on Commons.

Please keep in mind that use of a high edit rate on Wikimedia Commons for serious vandalism/disruption remains entirely theoretical

Sun, May 20, 5:07 AM · Patch-For-Review, User-Urbanecm, Wikimedia-Site-requests, Commons

Fri, May 18

zhuyifei1999 closed T87742: Creating an ItemPage object can cause a deadlock as Declined.

Pywikibot 2.0 no longer supported T169734

Fri, May 18, 1:31 AM · Pywikibot-core
zhuyifei1999 added a project to T194864: Raise the rate limit for autopatrollers on Commons: Community-consensus-needed.
Fri, May 18, 12:53 AM · Patch-For-Review, User-Urbanecm, Wikimedia-Site-requests, Commons

Thu, May 17

zhuyifei1999 added a comment to T162570: wikisourcetext.py failing with error "ImportError: No module named bs4tools.".

Umm... since beautifulsoup is one of those pure-python packages that are easily installed by venv, and that it doesn't seem too widely-used, I suggest that way instead of depending on a site-wide install.

Thu, May 17, 10:01 PM · Toolforge, Pywikibot-core
zhuyifei1999 added a comment to T162570: wikisourcetext.py failing with error "ImportError: No module named bs4tools.".
09:46:48 0 ✓ zhuyifei1999@tools-bastion-02: ~$ apt search beautifulsoup
Sorting... Done
Full Text Search... Done
python-beautifulsoup/trusty,now 3.2.1-1 all [installed]
  error-tolerant HTML parser for Python
Thu, May 17, 9:50 PM · Toolforge, Pywikibot-core
zhuyifei1999 added a comment to T171266: Book Uploader Bot (BUB) queue has stalled.

So what's the alternative, HGET? https://redis.io/commands/hget https://redis-py.readthedocs.io/en/latest/_modules/redis/client.html#StrictRedis.hget
(I didn't yet look into BUB's Redis structure.)

Thu, May 17, 6:47 AM · Internet-Archive, Tools, Wikisource
zhuyifei1999 awarded T194332: [Epic] Make Toolforge a proper platform as a service with push-to-deploy and build packs a Love token.
Thu, May 17, 3:07 AM · Epic, Toolforge
zhuyifei1999 created T194866: 4 live toolforge k8s workers have SchedulingDisabled.
Thu, May 17, 3:06 AM · Toolforge

Wed, May 16

zhuyifei1999 added a comment to T194859: Toolforge maintain-kubeusers stauck in infinite sleeps of 10 seconds.

(Also uninstalled python3-dbg which I installed a while ago to debug this)

Wed, May 16, 11:18 PM · Toolforge
zhuyifei1999 added a comment to T194859: Toolforge maintain-kubeusers stauck in infinite sleeps of 10 seconds.

I had to kill your gdb sessions to get it to restart @zhuyifei1999 :)

Wed, May 16, 11:16 PM · Toolforge
zhuyifei1999 added a comment to T194859: Toolforge maintain-kubeusers stauck in infinite sleeps of 10 seconds.

py-locals don't reveal anything:

(gdb) py-locals
self = <ServerPoolState(servers=[[<Server(connect_timeout=1, _address_info=[[<AddressFamily(_name_='AF_INET', _value_=2, __objclass__=<EnumMeta(__doc__=None, _member_map_=<OrderedDict(_OrderedDict__root=<weakproxy at remote 0x7fc6c720ec28>, _OrderedDict__hardroot=<_Link at remote 0x7fc6c720bc60>, _OrderedDict__map={'AF_IRDA': <_Link at remote 0x7fc6c7215288>, 'AF_PPPOX': <_Link at remote 0x7fc6c72152d0>, 'AF_PACKET': <_Link at remote 0x7fc6c7215318>, 'AF_CAN': <_Link at remote 0x7fc6c7215828>, 'AF_INET6': <_Link at remote 0x7fc6c7215750>, 'AF_NETROM': <_Link at remote 0x7fc6c72153f0>, 'AF_BLUETOOTH': <_Link at remote 0x7fc6c7215438>, 'AF_APPLETALK': <_Link at remote 0x7fc6c7215480>, 'AF_ROSE': <_Link at remote 0x7fc6c72154c8>, 'AF_ASH': <_Link at remote 0x7fc6c7215510>, 'AF_SNA': <_Link at remote 0x7fc6c7215558>, 'AF_UNSPEC': <_Link at remote 0x7fc6c72155e8>, 'AF_ATMPVC': <_Link at remote 0x7fc6c7215630>, 'AF_ROUTE': <_Link at remote 0x7fc6c7215678>, 'AF_IPX': <_Link at remote 0x7fc6c72156c0>, 'AF_NETBEUI': <_Link a...(truncated)
starting = 1
counter = True
index = 1
pool_size = 2
offset = 0

Shall I just restart the service and hope it won't happen again?

Wed, May 16, 9:27 PM · Toolforge
zhuyifei1999 added a comment to T194859: Toolforge maintain-kubeusers stauck in infinite sleeps of 10 seconds.

Which went too smoothly:

root@tools-k8s-master-01:~# /usr/bin/python3 /usr/local/bin/maintain-kubeusers --infrastructure-users /etc/kubernetes/infrastructure-users --project tools https://k8s-master.tools.wmflabs.org:6443 /etc/kubernetes/tokenauth /etc/kubernetes/abac
starting a run
Homedir already exists for /data/project/genedb
Wrote config in /data/project/genedb/.kube/config
(b'namespace "genedb" created\n', b'')
Provisioned creds for tool genedb
Wrote config in /data/project/wikiintent/.kube/config
(b'namespace "wikiintent" created\n', b'')
Provisioned creds for tool wikiintent
finished run, wrote 2 new accounts
^CTraceback (most recent call last):
  File "/usr/local/bin/maintain-kubeusers", line 405, in <module>
    time.sleep(args.interval)
KeyboardInterrupt
Wed, May 16, 9:22 PM · Toolforge
zhuyifei1999 added a comment to T194859: Toolforge maintain-kubeusers stauck in infinite sleeps of 10 seconds.

This find_active_server function:

def find_active_server(self, starting):
    counter = self.server_pool.active  # can be True for "forever" or the number of cycles to try
    if starting >= len(self.servers):
        starting = 0
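
The excerpt above stops before the retry loop, so here is a minimal sketch of how such a loop behaves, assuming a counter that is either True ("forever") or a number of cycles. It is a simplified model, not ldap3's actual code: the parameters `is_available` and `sleep` and the `PoolExhausted` exception are hypothetical names introduced for illustration.

```python
class PoolExhausted(Exception):
    """Raised when the retry budget runs out (hypothetical name)."""


def find_active_server(servers, starting, active, is_available, sleep):
    """Simplified model of a server-pool retry loop.

    active: True means "retry forever"; an int means that many full cycles.
    is_available: callable(server) -> bool, supplied by the caller.
    sleep: callable() invoked between cycles (injected so it can be faked).
    """
    counter = active
    if starting >= len(servers):
        starting = 0
    while counter:
        # One full cycle over the pool, beginning at the requested offset.
        for index in range(len(servers)):
            offset = (starting + index) % len(servers)
            if is_available(servers[offset]):
                return offset
        # bool is a subclass of int, so True must be excluded explicitly;
        # a True counter is never decremented and the loop never ends.
        if not isinstance(counter, bool):
            counter -= 1
        sleep()
    raise PoolExhausted("no server became available")
```

With `active=True` and no reachable server, this model never returns and only sleeps between cycles, which would match the symptom in the task title. Passing an integer instead gives the caller an error after that many cycles.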
Wed, May 16, 9:20 PM · Toolforge
zhuyifei1999 added a comment to T194859: Toolforge maintain-kubeusers stauck in infinite sleeps of 10 seconds.

I installed python3-dbg temporarily and was able to get a stack trace:

Traceback (most recent call first):
  File "/usr/lib/python3/dist-packages/ldap3/core/pooling.py", line 172, in find_active_server
    sleep(get_config_parameter('POOLING_LOOP_TIMEOUT'))
  File "/usr/lib/python3/dist-packages/ldap3/core/pooling.py", line 82, in get_server
    self.last_used_server = self.find_active_server(self.last_used_server + 1)
  File "/usr/lib/python3/dist-packages/ldap3/core/pooling.py", line 291, in get_server
    return self.pool_states[connection].get_server()
  File "/usr/lib/python3/dist-packages/ldap3/strategy/base.py", line 108, in open
    new_server = self.connection.server_pool.get_server(self.connection)  # get a server from the server_pool if available
  File "/usr/lib/python3/dist-packages/ldap3/strategy/sync.py", line 55, in open
    BaseStrategy.open(self, reset_usage, read_server_info)
  File "/usr/lib/python3/dist-packages/ldap3/core/connection.py", line 282, in __init__
    self.open(read_server_info=False)
  File "/usr/local/bin/maintain-kubeusers", line 371, in <module>
    password=ldapconfig['password']

(stashbot died... rip. will debug that later)

Wed, May 16, 9:15 PM · Toolforge
zhuyifei1999 triaged T194859: Toolforge maintain-kubeusers stauck in infinite sleeps of 10 seconds as High priority.
Wed, May 16, 9:10 PM · Toolforge
zhuyifei1999 created T194859: Toolforge maintain-kubeusers stauck in infinite sleeps of 10 seconds.
Wed, May 16, 9:10 PM · Toolforge

Tue, May 15

zhuyifei1999 added a comment to T194665: Provide an up-to-date mono environment on toolforge.

I have some concerns about using external repos. It could be catastrophic for integration with the rest of the system. I'm not sure it is worth the risk, considering the scope of the problem.
I always prefer using official Debian/Ubuntu packages for as long as possible.

Tue, May 15, 6:46 PM · Patch-For-Review, cloud-services-team, Toolforge
zhuyifei1999 removed a project from T158244: Improve `webservice status` output: Cloud-Services.
Tue, May 15, 8:52 AM · Toolforge
zhuyifei1999 added a comment to T158244: Improve `webservice status` output.

Would it make sense to show the backend (gridengine/kubernetes) as well?

Tue, May 15, 8:49 AM · Toolforge
zhuyifei1999 added a comment to T190884: Replicate toolforge 'webservice' setup in toolsbeta.

Next question: considering that the docker containers contain the webservice code, which is subject to modification for the parent task, is it necessary to separate toolsbeta's aptly and docker-builder from toolforge's?

Tue, May 15, 8:45 AM · Toolforge
zhuyifei1999 closed T190893: Setup the webservice-related instances in toolsbeta as Resolved.

On k8s: (once the above patch is merged we shouldn't need a standalone puppetmaster for new-built instances)
http://tools-beta.wmflabs.org/
http://tools-beta.wmflabs.org/test/hello.txt

Tue, May 15, 8:41 AM · Patch-For-Review, Toolforge
zhuyifei1999 closed T190893: Setup the webservice-related instances in toolsbeta, a subtask of T190884: Replicate toolforge 'webservice' setup in toolsbeta, as Resolved.
Tue, May 15, 8:41 AM · Toolforge
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

After restarting many services across various instances (mostly to update the certs), networking is now working inside k8s pods:

toolsbeta.test@interactive:~$ curl www.google.com
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description">
Tue, May 15, 8:01 AM · Patch-For-Review, Toolforge
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

Uh no... it hates the certs from standalone puppetmaster:

Fixed by restarting kube-proxy across affected instances.

Tue, May 15, 7:42 AM · Patch-For-Review, Toolforge
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

Uh no... it hates the certs from standalone puppetmaster:

May 15 07:36:52 toolsbeta-worker-1001 kube-proxy[27594]: E0515 07:36:52.476971   27594 reflector.go:203] pkg/proxy/config/api.go:33: Failed to list *api.Endpoints: Get https://toolsbeta-k8s-master-01.toolsbeta.eqiad.wmflabs:6443/api/v1/endpoints?resourceVersion=0: x509: certificate signed by unknown authority
Tue, May 15, 7:38 AM · Patch-For-Review, Toolforge
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

The docker systemd unit now lgtm:

zhuyifei1999@toolsbeta-worker-1001:~$ sudo systemctl cat docker
# /lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network.target docker.socket
Requires=docker.socket
Tue, May 15, 7:27 AM · Patch-For-Review, Toolforge
zhuyifei1999 created T194717: tools-exec-1414 unresponsive.
Tue, May 15, 4:18 AM · cloud-services-team, Toolforge
zhuyifei1999 created P7128 tools-exec-1414 Draining.
Tue, May 15, 4:12 AM

Mon, May 14

zhuyifei1999 added a comment to T194691: Upgrade Quarry main server from Trusty.

@Framawiki Do you think we should build two instances, one to serve the web and one for the persistence (database)?

Mon, May 14, 9:52 PM · Quarry
zhuyifei1999 added a parent task for T194691: Upgrade Quarry main server from Trusty: T192731: Update dependencies.
Mon, May 14, 9:48 PM · Quarry
zhuyifei1999 added a subtask for T192731: Update dependencies: T194691: Upgrade Quarry main server from Trusty.
Mon, May 14, 9:48 PM · Patch-For-Review, Quarry
zhuyifei1999 created T194691: Upgrade Quarry main server from Trusty.
Mon, May 14, 9:48 PM · Quarry
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

The way pick_initscript is coded, it seems quite clear we need to set systemd => true in base::service_unit { 'docker':

Mon, May 14, 3:54 PM · Patch-For-Review, Toolforge

Sun, May 13

zhuyifei1999 added a comment to T194343: quentinv57-tools/tools/globalcontribs.php generates slow/complex SQL queries which impact server performance.

Fatal because of the block (T194343#4200496)? or is the code broken anyhow?

Sun, May 13, 3:10 PM · Tool-Quentinv57's-tools, Patch-For-Review, DBA, Data-Services
zhuyifei1999 added a comment to T194541: Investigation: Why is there a Google Proxy API usage spike every 5 days?.

Unfortunately, I'm not logging user agents. Is that information stored in real nginx access logs somewhere that I can't see? (All I have are uwsgi logs and internal logs from the detection engine.)

Sun, May 13, 3:06 PM · User-Urbanecm, Tools, Community-Tech
zhuyifei1999 added a comment to T194343: quentinv57-tools/tools/globalcontribs.php generates slow/complex SQL queries which impact server performance.

Considering T194343#4200608, does it seem like a good idea to redirect globalcontribs.php to guc?

Sun, May 13, 1:54 AM · Tool-Quentinv57's-tools, Patch-For-Review, DBA, Data-Services

Fri, May 11

zhuyifei1999 placed T194233: page is a redirect but not a redirect up for grabs.
Fri, May 11, 2:30 PM · Patch-For-Review, MediaWiki-API, Pywikibot-core
zhuyifei1999 added a comment to T194233: page is a redirect but not a redirect.

It's a bug of pywikibot to not treat 'batchcomplete' properly anyhow.

Fri, May 11, 5:42 AM · Patch-For-Review, MediaWiki-API, Pywikibot-core
zhuyifei1999 added a comment to T194233: page is a redirect but not a redirect.

@Anomie While testing how batchcomplete interacts with pageid, I tested the same query for File:Example.jpg on commons, which contains many redirects and multiple file revisions, but I got batchcomplete with only a single imageinfo for each file name:
https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&maxlag=5&prop=info%7Cimageinfo%7Ccategoryinfo&meta=userinfo&indexpageids=1&continue=&generator=backlinks&inprop=protection&iiprop=timestamp%7Cuser%7Ccomment%7Curl%7Csize%7Csha1%7Cmetadata&uiprop=blockinfo%7Chasmsg&gbltitle=File%3AExample.jpg&gblfilterredir=redirects&gbllimit=500
I was expecting a continue for the imageinfo query. Is it expected that there is no continuation?

Fri, May 11, 1:31 AM · Patch-For-Review, MediaWiki-API, Pywikibot-core
zhuyifei1999 added a comment to T194233: page is a redirect but not a redirect.

No, batchcomplete is specifically designed to do the right thing with generators.

Fri, May 11, 12:37 AM · Patch-For-Review, MediaWiki-API, Pywikibot-core

Thu, May 10

zhuyifei1999 added a comment to T194233: page is a redirect but not a redirect.

@Xqt The easiest backwards-compatible 'workaround' I can think of that doesn't break the ZOI rule is to remove prop=imageinfo from pywikibot.data.api.PageGenerator. The framework, as I understand it, should reload the imageinfo from the API if it has not been pre-loaded but is explicitly requested by a bot's code. Does that sound sane to you, or is prop=imageinfo critical for some bots or the framework?

Thu, May 10, 9:53 PM · Patch-For-Review, MediaWiki-API, Pywikibot-core
zhuyifei1999 added a comment to T194233: page is a redirect but not a redirect.

The most correct thing for pywikibot's "backlinks" generator to do is to keep continuing the query and merging the result sets until it gets the batchcomplete flag in the response.
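
The continue-and-merge behaviour described above can be sketched roughly as follows. This is not pywikibot's implementation: `fetch` is a hypothetical caller-supplied function that performs one API request and returns the decoded JSON, and the sketch assumes the default JSON format, where `query.pages` is a dict keyed by page ID.

```python
def query_until_batchcomplete(fetch, params):
    """Yield fully merged pages from a continued MediaWiki API query.

    fetch(request_params) -> dict is supplied by the caller (hypothetical).
    Pages are held back until the response carries "batchcomplete", so
    partial data from different props is never seen on its own.
    """
    request = dict(params)
    pending = {}  # page ID -> data merged so far for the current batch
    while True:
        response = fetch(request)
        pages = response.get("query", {}).get("pages", {})
        for pageid, page in pages.items():
            merged = pending.setdefault(pageid, {})
            for key, value in page.items():
                if isinstance(value, list):
                    # List-valued props (e.g. imageinfo) arrive in pieces.
                    merged.setdefault(key, []).extend(value)
                else:
                    merged[key] = value
        if "batchcomplete" in response:
            yield from pending.values()
            pending = {}
        if "continue" not in response:
            break
        # Re-send the original parameters plus the continuation values.
        request = dict(params)
        request.update(response["continue"])
    # Flush anything left if the last response lacked batchcomplete.
    yield from pending.values()
```

Holding results back until the flag arrives trades some latency for consistency: a page whose `info` and `imageinfo` arrive in different responses is only handed to the caller once both are present.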

Thu, May 10, 9:45 PM · Patch-For-Review, MediaWiki-API, Pywikibot-core
zhuyifei1999 added a project to T194233: page is a redirect but not a redirect: MediaWiki-API.

prop=imageinfo follows the redirect but prop=info does not, causing a discrepancy between how MediaWiki and pywikibot interpret the data.

Thu, May 10, 8:21 PM · Patch-For-Review, MediaWiki-API, Pywikibot-core
zhuyifei1999 added a comment to T194233: page is a redirect but not a redirect.

The two relevant API queries:

Thu, May 10, 8:09 PM · Patch-For-Review, MediaWiki-API, Pywikibot-core
zhuyifei1999 claimed T194233: page is a redirect but not a redirect.
Thu, May 10, 7:55 PM · Patch-For-Review, MediaWiki-API, Pywikibot-core
zhuyifei1999 added a comment to T192423: Create a "my first NodeJS OAuth tool" tutorial for Toolforge.

Very nice. One thing: when I last coded a nodejs toolforge web tool, I found that express actually receives the toolname in the URI, i.e. if you request tools.wmflabs.org/toolname/ it receives /toolname/, unlike uwsgi, which mounts / on /toolname/. Could you confirm whether that is still the case? If so, do you think it would be worth figuring out a way to do the 'mount'?

Thu, May 10, 2:22 PM · Documentation, Toolforge
zhuyifei1999 edited projects for T194341: SELECT query on page table appears to also reference revision table, added: Data-Services; removed Cloud-Services.
Thu, May 10, 2:17 AM · Data-Services
zhuyifei1999 added a comment to T193414: Servers using tidy-html5 are rendering pages differently, especially with <bdi>.
Thu, May 10, 1:25 AM · Operations, Release-Engineering-Team, Parsing-Team, MediaWiki-Platform-Team, MediaWiki-Parser, Tidy

Wed, May 9

zhuyifei1999 added a comment to T193414: Servers using tidy-html5 are rendering pages differently, especially with <bdi>.

So what makes this declined? The merged task clearly displays some broken 'tidying'. If this task is about pages being rendered differently and rendering is now consistent, then this task is resolved. The merged task is about the 'consistent' rendering being simply 'incorrect'.

Wed, May 9, 8:22 PM · Operations, Release-Engineering-Team, Parsing-Team, MediaWiki-Platform-Team, MediaWiki-Parser, Tidy
zhuyifei1999 changed the status of T193190: Polling templates on Commons sometimes show an unwanted line break between the icon and the string from Stalled to Open.
Wed, May 9, 8:15 PM · MediaWiki-Parser, Commons
zhuyifei1999 added a project to T193190: Polling templates on Commons sometimes show an unwanted line break between the icon and the string: MediaWiki-Parser.
Wed, May 9, 8:14 PM · MediaWiki-Parser, Commons
zhuyifei1999 added a comment to T193190: Polling templates on Commons sometimes show an unwanted line break between the icon and the string.

{{Support}} get expanded into [[File:Symbol support vote.svg|15px|link=]] '''<bdi>Support</bdi>''' This is perfectly valid mediawiki syntax and they should be inline. MediaWiki thinks otherwise.

Wed, May 9, 8:14 PM · MediaWiki-Parser, Commons
zhuyifei1999 updated subscribers of T190893: Setup the webservice-related instances in toolsbeta.

ok I think I figured out the cause:

Wed, May 9, 5:00 AM · Patch-For-Review, Toolforge
zhuyifei1999 added a comment to T192751: Please upload large file to Wikimedia Commons.

Very good question... I don't have a definite yes or no answer.

Wed, May 9, 2:53 AM · Operations, Commons, Wikimedia-Site-requests

Tue, May 8

zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

I followed T182722#3834172 and saw a similar behavior of docker internal IPs being forwarded to DNS without NAT being applied:

zhuyifei1999@toolsbeta-worker-1001:~$ sudo tcpdump -i eth0 host labs-recursor0.wikimedia.org
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
23:32:07.240151 IP 172.17.0.2.46800 > labs-recursor0.wikimedia.org.domain: 63168+ A? www.google.com.toolsbeta.eqiad.wmflabs. (56)
23:32:07.240266 IP 172.17.0.2.46800 > labs-recursor0.wikimedia.org.domain: 62118+ AAAA? www.google.com.toolsbeta.eqiad.wmflabs. (56)

This IP is unexpectedly 172.17.0.2, instead of 192.168.*.* as in the linked comment. Indeed, the IP configuration for docker0 is broken:

root@tools-worker-1001:~# ifconfig docker0
docker0   Link encap:Ethernet  HWaddr 02:42:a5:2f:f4:e3  
          inet addr:192.168.168.1  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::42:a5ff:fe2f:f4e3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:44098063 errors:0 dropped:0 overruns:0 frame:0
          TX packets:47829857 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:28832505727 (26.8 GiB)  TX bytes:109367978093 (101.8 GiB)
[...]
zhuyifei1999@toolsbeta-worker-1001:~$ sudo ifconfig docker0
docker0   Link encap:Ethernet  HWaddr 02:42:b4:58:35:c6  
          inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:b4ff:fe58:35c6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:149 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:10276 (10.0 KiB)  TX bytes:1032 (1.0 KiB)
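
The address mismatch can be checked with Python's standard `ipaddress` module. This is a minimal sketch, assuming the overlay range is the `192.168.128.0/17` value that appears in the toolsbeta flannel config elsewhere in this feed; the function names are illustrative only.

```python
import ipaddress

# Assumption: the overlay range is the "Network" value from the toolsbeta
# flannel config quoted in this feed; it is not read from a live system.
FLANNEL_NETWORK = ipaddress.ip_network("192.168.128.0/17")


def in_overlay(address, network=FLANNEL_NETWORK):
    """Return True if the address falls inside the overlay network."""
    return ipaddress.ip_address(address) in network


def bridge_summary(interface_cidr, network=FLANNEL_NETWORK):
    """Summarise a bridge address given as 'ip/prefix', e.g. '172.17.0.1/16'."""
    iface = ipaddress.ip_interface(interface_cidr)
    return {
        "bridge_network": str(iface.network),
        "in_overlay": iface.ip in network,
    }
```

For example, `in_overlay("172.17.0.2")` is False, since 172.17.0.0/16 is Docker's default bridge range, while `in_overlay("192.168.168.1")` is True for the address shown on tools-worker-1001.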

I googled how the IP is configured and grepped for the dockerd command line args, and surprise: docker is started with different command line args:

root@tools-worker-1001:~# ps auxfww | grep dockerd
root     11311  0.0  0.0  12728  2240 pts/0    S+   23:36   0:00          \_ grep dockerd
root      3495  1.3  0.8 2704132 66704 ?       Ssl  Mar19 984:28 dockerd -H fd:// --config-file=/etc/docker/daemon.json --bip=192.168.168.1/24 --mtu=1450
[...]
zhuyifei1999@toolsbeta-worker-1001:~$ ps auxfww | grep dockerd
zhuyife+  3607  0.0  0.0  12728  2252 pts/0    S+   23:36   0:00  |           \_ grep dockerd
root     27465  0.2  1.1 606276 48460 ?        Ssl  21:47   0:19 /usr/bin/dockerd -H fd://

So is the systemd unit the same? No...

root@tools-worker-1001:~# systemctl cat docker.service
# /lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network.target docker.socket
Requires=docker.socket

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H fd://
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
#TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/docker.service.d/puppet-override.conf
# Docker override systemd for v1.11.2-0~jessie
[Unit]
After=network.target docker.socket flannel.service
Requires=docker.socket flannel.service

[Service]
EnvironmentFile=/run/flannel/subnet.env
# We need to clear ExecStart first before setting it again
ExecStart=
ExecStart=/usr/bin/docker daemon -H fd:// \
    --config-file=/etc/docker/daemon.json \
    --bip=${FLANNEL_SUBNET} \
    --mtu=${FLANNEL_MTU}
[...]
zhuyifei1999@toolsbeta-worker-1001:~$ systemctl cat docker.service
# /lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network.target docker.socket
Requires=docker.socket

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H fd://
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
#TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process

[Install]
WantedBy=multi-user.target

/etc/systemd/system/docker.service.d/puppet-override.conf doesn't exist on toolsbeta but exists on toolforge.

Tue, May 8, 11:41 PM · Patch-For-Review, Toolforge
zhuyifei1999 created P7101 (An Untitled Masterwork).
Tue, May 8, 11:39 PM
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

Networking is still broken inside the pod:

toolsbeta.admin@toolsbeta-bastion-01:~$ webservice --backend kubernetes php5.6 shell
If you don't see a command prompt, try pressing enter.
toolsbeta.admin@interactive:~$ curl www.google.com
curl: (6) Could not resolve host: www.google.com

I might be out of ideas. Anyone got an idea?

Tue, May 8, 3:14 AM · Patch-For-Review, Toolforge
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

I did a comparison with tools-worker-1001, and this file is missing on toolforge:

zhuyifei1999@toolsbeta-worker-1001:~$ cat /etc/kubernetes/config
###
# Kubernetes: common config for the following services:
##
#   kube-apiserver.service
#   kube-controller-manager.service
#   kube-scheduler.service
#   kubelet.service
#   kube-proxy.service
##
Tue, May 8, 3:09 AM · Patch-For-Review, Toolforge
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

That 127.0.0.1:8080 is supposed to be toolsbeta-k8s-master-01.toolsbeta.eqiad.wmflabs:8080

Tue, May 8, 3:00 AM · Patch-For-Review, Toolforge
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

kube-proxy is still broken:

May 08 02:21:44 toolsbeta-worker-1001 kube-proxy[1566]: E0508 02:21:44.886056    1566 reflector.go:203] pkg/proxy/config/api.go:33: Failed to list *api.Endpoints: Get http://127.0.0.1:8080/api/v1/endpoints?resourceVersion=0: dial tcp 127.0.0.1:8080: getsockopt: connection refused
May 08 02:21:44 toolsbeta-worker-1001 kube-proxy[1566]: E0508 02:21:44.886172    1566 reflector.go:203] pkg/proxy/config/api.go:30: Failed to list *api.Service: Get http://127.0.0.1:8080/api/v1/services?resourceVersion=0: dial tcp 127.0.0.1:8080: getsockopt: connection refuse
Tue, May 8, 2:22 AM · Patch-For-Review, Toolforge
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

zhuyifei1999@toolsbeta-worker-1001:~$ curl https://toolsbeta-flannel-etcd-01.toolsbeta.eqiad.wmflabs:2379/v2/keys/?recursive=true; echo
{"action":"get","node":{"dir":true}}

[A long time of digging around...]
zhuyifei1999@toolsbeta-worker-1001:~$ curl https://toolsbeta-flannel-etcd-01.toolsbeta.eqiad.wmflabs:2379/v2/keys/coreos.com/network/config -XPUT -d value='{ "Network": "192.168.128.0/17", "Backend": { "Type": "vxlan" } }'
{"action":"set","node":{"key":"/coreos.com/network/config","value":"{ \"Network\": \"192.168.128.0/17\", \"Backend\": { \"Type\": \"vxlan\" } }","modifiedIndex":5,"createdIndex":5}}
zhuyifei1999@toolsbeta-worker-1001:~$ curl https://toolsbeta-flannel-etcd-01.toolsbeta.eqiad.wmflabs:2379/v2/keys/?recursive=true | jq .
 % Total % Received % Xferd Average Speed Time Time Time Current
 Dload Upload Total Spent Left Speed
100 1267 100 1267 0 0 22778 0 --:--:-- --:--:-- --:--:-- 23036
{
  "action": "get",
  "node": {
    "dir": true,
    "nodes": [
      {
        "key": "/coreos.com",
        "dir": true,
        "nodes": [
          {
            "key": "/coreos.com/network",
            "dir": true,
            "nodes": [
              {
                "key": "/coreos.com/network/config",
                "value": "{ \"Network\": \"192.168.128.0/17\", \"Backend\": { \"Type\": \"vxlan\" } }",
                "modifiedIndex": 5,
                "createdIndex": 5
              },
              {
                "key": "/coreos.com/network/subnets",
                "dir": true,
                "nodes": [
                  {
                    "key": "/coreos.com/network/subnets/192.168.215.0-24",
                    "value": "{\"PublicIP\":\"10.68.18.110\",\"BackendType\":\"vxlan\",\"BackendData\":{\"VtepMAC\":\"ce:b0:30:4d:62:48\"}}",
                    "expiration": "2018-05-09T02:17:23.548978467Z",
                    "ttl": 86392,
                    "modifiedIndex": 6,
                    "createdIndex": 6
                  },
                  {
                    "key": "/coreos.com/network/subnets/192.168.130.0-24",
                    "value": "{\"PublicIP\":\"10.68.20.72\",\"BackendType\":\"vxlan\",\"BackendData\":{\"VtepMAC\":\"ba:57:9d:85:c2:e4\"}}",
                    "expiration": "2018-05-09T02:17:23.982088555Z",
                    "ttl": 86392,
                    "modifiedIndex": 7,
                    "createdIndex": 7
                  },
                  {
                    "key": "/coreos.com/network/subnets/192.168.131.0-24",
                    "value": "{\"PublicIP\":\"10.68.22.202\",\"BackendType\":\"vxlan\",\"BackendData\":{\"VtepMAC\":\"fa:3b:36:2a:31:49\"}}",
                    "expiration": "2018-05-09T02:17:24.453518564Z",
                    "ttl": 86393,
                    "modifiedIndex": 8,
                    "createdIndex": 8
                  }
                ],
                "modifiedIndex": 6,
                "createdIndex": 6
              }
            ],
            "modifiedIndex": 5,
            "createdIndex": 5
          }
        ],
        "modifiedIndex": 5,
        "createdIndex": 5
      }
    ]
  }
}

Let's see if I fixed flannel.
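
One way to sanity-check a response like the one above is to confirm that every subnet lease sits inside the configured network. The sketch below assumes the etcd v2 JSON layout shown in the paste and the flannel convention of writing lease keys as `a.b.c.d-prefix`; the function name is hypothetical and `subnet_of` needs Python 3.7 or later.

```python
import ipaddress
import json

CONFIG_KEY = "/coreos.com/network/config"
SUBNET_PREFIX = "/coreos.com/network/subnets/"


def check_flannel_leases(etcd_response):
    """Return (network, {subnet: public_ip}); raise ValueError on a stray lease."""
    values = {}

    def walk(node):
        # Leaf nodes carry "value"; directory nodes carry "nodes".
        if "value" in node:
            values[node["key"]] = node["value"]
        for child in node.get("nodes", []):
            walk(child)

    walk(etcd_response["node"])

    config = json.loads(values[CONFIG_KEY])
    network = ipaddress.ip_network(config["Network"])

    leases = {}
    for key, value in values.items():
        if not key.startswith(SUBNET_PREFIX):
            continue
        # Lease keys encode the CIDR with "-" in place of "/".
        subnet = ipaddress.ip_network(key[len(SUBNET_PREFIX):].replace("-", "/"))
        if not subnet.subnet_of(network):
            raise ValueError(f"{subnet} is outside {network}")
        leases[str(subnet)] = json.loads(value)["PublicIP"]
    return network, leases
```

Fed the decoded JSON from the recursive GET, this would return the /17 network and one entry per worker lease, or raise if a lease was handed out from a different range.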

Tue, May 8, 2:20 AM · Patch-For-Review, Toolforge
zhuyifei1999 created P7094 (An Untitled Masterwork).
Tue, May 8, 2:20 AM

Mon, May 7

zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

Got the k8s job running, but networking seems bugged out:

toolsbeta.admin@toolsbeta-bastion-01:~$ webservice --backend kubernetes php5.6 start
Starting webservice..
toolsbeta.admin@toolsbeta-bastion-01:~$ kubectl get pods
NAME                     READY     STATUS    RESTARTS   AGE
admin-1850377006-x00n5   1/1       Running   0          7s
toolsbeta.admin@toolsbeta-bastion-01:~$ kubectl log admin-1850377006-x00n5
W0507 21:52:41.743263 4014 cmd.go:345] log is DEPRECATED and will be removed in a future version. Use logs instead.
toolsbeta.admin@toolsbeta-bastion-01:~$ kubectl logs admin-1850377006-x00n5
toolsbeta.admin@toolsbeta-bastion-01:~$ webservice --backend kubernetes php5.6 shell
If you don't see a command prompt, try pressing enter.
toolsbeta.admin@interactive:~$
toolsbeta.admin@interactive:~$ webservice status
Traceback (most recent call last):
  File "/usr/bin/webservice", line 182, in <module>
    if job.get_state() != Backend.STATE_STOPPED:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 402, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 210, in _find_obj
    selector=selector
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 70, in get
    num = len(clone)
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 122, in __len__
    return len(self.query_cache["objects"])
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 115, in query_cache
    cache["response"] = self.execute().json()
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 99, in execute
    r = self.api.get(**kwargs)
  File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 125, in get
    return self.session.get(*args, **self.get_kwargs(**kwargs))
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 501, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='toolsbeta-k8s-master-01.toolsbeta.eqiad.wmflabs', port=6443): Max retries exceeded with url: /api/v1/namespaces/admin/pods?labelSelector=tools.wmflabs.org%2Fwebservice-version%3D1%2Cname%3Dadmin%2Ctools.wmflabs.org%2Fwebservice%3Dtrue (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fbe91f6ded0>: Failed to establish a new connection: [Errno -2] Name or service not known',))
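
The `[Errno -2] Name or service not known` at the end of the traceback is a name-resolution failure, so the lookup can be checked on its own, without pykube or requests in the way. A minimal sketch using only the standard library; `resolve` is a hypothetical helper, not part of the webservice code:

```python
import socket


def resolve(host, port):
    """Return the sorted addresses for host, or None if name resolution fails."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Errno -2 (EAI_NONAME) in the traceback surfaces here as gaierror.
        return None
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


# An IP literal needs no DNS, so this works even where resolution is broken.
print(resolve("127.0.0.1", 6443))
```

Running `resolve("toolsbeta-k8s-master-01.toolsbeta.eqiad.wmflabs", 6443)` both on the bastion and inside the pod would show whether only the pod's resolver is affected.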

Mon, May 7, 9:55 PM · Patch-For-Review, Toolforge
zhuyifei1999 created P7093 (An Untitled Masterwork).
Mon, May 7, 9:54 PM
zhuyifei1999 closed T192244: Provide a consistent way to identify operation in toolforge (including k8s) as Resolved.
08:51:54 0 ✓ zhuyifei1999@tools-bastion-05: ~$ become video2commons
(venv)tools.video2commons@tools-bastion-05:~$ webservice shell
Pod is not ready in time
(venv)tools.video2commons@tools-bastion-05:~$ webservice shell
If you don't see a command prompt, try pressing enter.
(venv)tools.video2commons@interactive:~$ 
(venv)tools.video2commons@interactive:~$ ls /etc/wmcs-project 
/etc/wmcs-project
(venv)tools.video2commons@interactive:~$ cat /etc/wmcs-project 
tools
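
A tool could read that file to find out which project it is running in. This is a minimal sketch, assuming the file holds only the project name as shown above; `wmcs_project` is a hypothetical helper and the temporary file merely stands in for `/etc/wmcs-project`:

```python
import os
import tempfile


def wmcs_project(path="/etc/wmcs-project"):
    """Return the project name stored in path, or None if it is missing or empty."""
    try:
        with open(path) as f:
            return f.read().strip() or None
    except OSError:
        return None


# Demo against a temporary stand-in file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "wmcs-project")
    with open(demo_path, "w") as f:
        f.write("tools\n")
    demo_result = wmcs_project(demo_path)

print(demo_result)
```

Returning None rather than raising lets the same code run outside the cloud environment, where the file does not exist.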
Mon, May 7, 8:54 PM · Patch-For-Review, Toolforge
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

K8s is up!

toolsbeta.admin@toolsbeta-bastion-01:~$ webservice --backend kubernetes python shell
If you don't see a command prompt, try pressing enter.
toolsbeta.admin@interactive:~$ 
toolsbeta.admin@interactive:~$ echo Hello from Kubernetes\!
Hello from Kubernetes!
toolsbeta.admin@interactive:~$ logout
Session ended, resume using 'kubectl attach interactive -c interactive -i -t' command when the pod is running
Pod stopped. Session cannot be resumed.
Mon, May 7, 8:07 PM · Patch-For-Review, Toolforge
zhuyifei1999 added a comment to T193560: Tool keeps falling into permanent 500 error.

I see the webservice has been restarted. Could you ping me when it happens again? I cannot debug the issue while it's 'working'.

Mon, May 7, 6:24 PM · Toolforge

Sun, May 6

zhuyifei1999 added a comment to T192732: Rename pywikipedia to pywikibot on Toolforge.

There is, but...is core_old updated as well?

No, the directory seems to be from 2013 and still uses svn.

Sun, May 6, 10:34 PM · Pywikibot-core, Toolforge
zhuyifei1999 added a comment to T192733: Remove old symlinks to trunk/rewrite in pywikipedia.

pywikibot too?

Sun, May 6, 10:21 PM · Pywikibot-core, Toolforge
zhuyifei1999 added a comment to T192732: Rename pywikipedia to pywikibot on Toolforge.

We could make Toolforge somehow warn that pywikipedia here is obsolete.

Sun, May 6, 6:09 PM · Pywikibot-core, Toolforge
zhuyifei1999 added a comment to T192733: Remove old symlinks to trunk/rewrite in pywikipedia.

PS: symlinks are the same thing as shortcuts/verknüpfungen on Windows

Sun, May 6, 5:59 PM · Pywikibot-core, Toolforge
zhuyifei1999 updated the task description for T170355: Figure out process for deleting an unused tool.
Sun, May 6, 5:48 PM · Toolforge
zhuyifei1999 updated the task description for T170355: Figure out process for deleting an unused tool.
Sun, May 6, 5:47 PM · Toolforge
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

After copying the replica.my.cnf from tools.zhuyifei1999-test, https://tools-beta.wmflabs.org/ works perfectly :)

Sun, May 6, 5:34 PM · Patch-For-Review, Toolforge
zhuyifei1999 added a comment to T192732: Rename pywikipedia to pywikibot on Toolforge.

Anything else to do here?

Sun, May 6, 5:25 PM · Pywikibot-core, Toolforge
zhuyifei1999 claimed T190893: Setup the webservice-related instances in toolsbeta.
Sun, May 6, 5:21 PM · Patch-For-Review, Toolforge
zhuyifei1999 added a comment to T192751: Please upload large file to Wikimedia Commons.

I don't know how that tool (videoconvert) works exactly, but it does not have a crontab. v2c does delete files after some time, but that's a different tool.

Sun, May 6, 5:15 PM · Operations, Commons, Wikimedia-Site-requests
zhuyifei1999 added a comment to T192733: Remove old symlinks to trunk/rewrite in pywikipedia.

I’ve no clue what this task means to be done here. I neither use toolforge nor Unix.

Sun, May 6, 5:12 PM · Pywikibot-core, Toolforge
zhuyifei1999 added a comment to T193560: Tool keeps falling into permanent 500 error.

Your process list:

root@tools-worker-1001:~# ps ufww -u tools.wikidata-todo
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
tools.w+ 13446  0.0  0.1  96836 10124 ?        Ss   May03   1:19 /usr/sbin/lighttpd -f /var/run/lighttpd/wikidata-todo -D
tools.w+ 13463  0.0  0.3 377620 31388 ?        Ss   May03   0:00  \_ /usr/bin/php-cgi
tools.w+ 12241  0.0  0.7 1636036 60320 ?       S    May04   0:45  |   \_ /usr/bin/php-cgi
tools.w+ 21227  0.0  0.4 1416480 36460 ?       S    May04   0:32  |   \_ /usr/bin/php-cgi
tools.w+ 13472  0.0  0.3 377620 31184 ?        Ss   May03   0:00  \_ /usr/bin/php-cgi
tools.w+ 30018  0.0  0.6 1691860 50504 ?       S    May04   0:46      \_ /usr/bin/php-cgi
tools.w+ 12798  0.0  0.8 1710712 69756 ?       S    May04   0:40      \_ /usr/bin/php-cgi
tools.w+ 13350  0.0  0.0   4424  1328 ?        Ssl  May03   0:00 /pause

PID 13350 seems to be part of k8s infrastructure.
PID 13446 responds with 500 whenever a request comes, epolls otherwise:

root@tools-worker-1001:~# strace -p 13446
Process 13446 attached
epoll_wait(7, {}, 65537, 1000)          = 0
epoll_wait(7, {}, 65537, 1000)          = 0
epoll_wait(7, {}, 65537, 1000)          = 0
epoll_wait(7, {}, 65537, 1000)          = 0
epoll_wait(7, {{EPOLLIN, {u32=5, u64=5}}}, 65537, 1000) = 1
accept(5, {sa_family=AF_INET, sin_port=htons(60000), sin_addr=inet_addr("192.168.146.0")}, [16]) = 532
fcntl(532, F_SETFD, FD_CLOEXEC)         = 0
fcntl(532, F_SETFL, O_RDWR|O_NONBLOCK)  = 0
ioctl(532, FIONREAD, [173])             = 0
read(532, "GET /wikidata-todo/ HTTP/1.1\r\nCo"..., 4159) = 173
stat("/data/project/wikidata-todo/public_html//", {st_mode=S_IFDIR|S_ISGID|0775, st_size=4096, ...}) = 0
stat("/data/project/wikidata-todo/public_html//index.php", {st_mode=S_IFREG|0775, st_size=10330, ...}) = 0
open("/data/project/wikidata-todo/public_html//index.php", O_RDONLY) = 533
close(533)                              = 0
socket(PF_LOCAL, SOCK_STREAM, 0)        = 533
fcntl(533, F_SETFD, FD_CLOEXEC)         = 0
fcntl(533, F_SETFL, O_RDWR|O_NONBLOCK)  = 0
connect(533, {sa_family=AF_LOCAL, sun_path="/var/run/lighttpd/php.socket.wikidata-todo-1"}, 46) = -1 EAGAIN (Resource temporarily unavailable)
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "2018-05-06 04:43:16: (mod_fastcg"..., 170) = 170
close(533)                              = 0
accept(5, 0x7ffe745ef940, [112])        = -1 EAGAIN (Resource temporarily unavailable)
socket(PF_LOCAL, SOCK_STREAM, 0)        = 533
fcntl(533, F_SETFD, FD_CLOEXEC)         = 0
fcntl(533, F_SETFL, O_RDWR|O_NONBLOCK)  = 0
connect(533, {sa_family=AF_LOCAL, sun_path="/var/run/lighttpd/php.socket.wikidata-todo-0"}, 46) = -1 EAGAIN (Resource temporarily unavailable)
write(4, "2018-05-06 04:43:16: (mod_fastcg"..., 170) = 170
close(533)                              = 0
setsockopt(532, SOL_TCP, TCP_CORK, [1], 4) = 0
writev(532, [{"HTTP/1.1 500 Internal Server Err"..., 165}, {"<?xml version=\"1.0\" encoding=\"is"..., 369}], 2) = 534
setsockopt(532, SOL_TCP, TCP_CORK, [0], 4) = 0
write(6, "192.168.146.0 tools.wmflabs.org "..., 120) = 120
shutdown(532, SHUT_WR)                  = 0
read(532, "", 1024)                     = 0
close(532)                              = 0
epoll_wait(7, {}, 65537, 1000)          = 0
epoll_wait(7, {}, 65537, 1000)          = 0
epoll_wait(7, {}, 65537, 1000)          = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "2018-05-06 04:43:18: (mod_fastcg"..., 113) = 113
write(4, "2018-05-06 04:43:18: (mod_fastcg"..., 113) = 113
epoll_wait(7, {}, 65537, 1000)          = 0
epoll_wait(7, {}, 65537, 1000)          = 0
epoll_wait(7, {}, 65537, 1000)          = 0
epoll_wait(7, ^CProcess 13446 detached
 <detached ...>

PIDs 13463 & 13472 are sleeping forever:

root@tools-worker-1001:~# strace -p 13463
Process 13463 attached
wait4(-1, ^CProcess 13463 detached
 <detached ...>
root@tools-worker-1001:~# strace -p 13472
Process 13472 attached
wait4(-1, ^CProcess 13472 detached
 <detached ...>

PIDs 12241, 21227, 30018, & 12798 are stuck in the same loop:

root@tools-worker-1001:~# strace -p 12241
Process 12241 attached
restart_syscall(<... resuming interrupted call ...>) = 0
socket(PF_LOCAL, SOCK_STREAM, 0)        = 6
fcntl(6, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(6, {sa_family=AF_LOCAL, sun_path="/var/run/mysqld/mysqld.sock"}, 29) = -1 ENOENT (No such file or directory)
close(6)                                = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffe4a316200)       = 0
socket(PF_LOCAL, SOCK_STREAM, 0)        = 6
fcntl(6, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(6, {sa_family=AF_LOCAL, sun_path="/var/run/mysqld/mysqld.sock"}, 29) = -1 ENOENT (No such file or directory)
close(6)                                = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffe4a316200)       = 0
socket(PF_LOCAL, SOCK_STREAM, 0)        = 6
fcntl(6, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(6, {sa_family=AF_LOCAL, sun_path="/var/run/mysqld/mysqld.sock"}, 29) = -1 ENOENT (No such file or directory)
close(6)                                = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, ^CProcess 12241 detached
 <detached ...>

My guess: your PHP workers are stuck in an infinite loop, retrying a connection to /var/run/mysqld/mysqld.sock (which does not exist) once per second, so lighttpd cannot get them to process new requests.
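
For comparison, a connect loop that gives up after a few attempts would fail the request instead of tying up the worker forever. The sketch below is in Python rather than PHP and is not taken from the tool's code, which is not shown here; `connect_unix` and its parameters are hypothetical.

```python
import errno
import socket
import time


def connect_unix(path, attempts=3, delay=0.1):
    """Connect to a Unix socket, giving up after a bounded number of attempts."""
    last_error = None
    for _ in range(attempts):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except OSError as exc:
            # A missing socket file fails with ENOENT, as in the strace output.
            sock.close()
            last_error = exc
            time.sleep(delay)
    raise ConnectionError(
        "could not connect to %s after %d attempts" % (path, attempts)
    ) from last_error


try:
    connect_unix("/nonexistent-dir/mysqld.sock", attempts=2, delay=0.01)
except ConnectionError as exc:
    cause = exc.__cause__
    print("gave up:", errno.errorcode.get(cause.errno))
```

With a bound in place the worker returns an error to lighttpd and becomes free for the next request.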

Sun, May 6, 4:50 AM · Toolforge
zhuyifei1999 added a comment to T192733: Remove old symlinks to trunk/rewrite in pywikipedia.

Is it possible that some very old bots still rely on the symlinks?

Sun, May 6, 4:34 AM · Pywikibot-core, Toolforge
zhuyifei1999 added a comment to T193560: Tool keeps falling into permanent 500 error.

/me looks

Sun, May 6, 4:31 AM · Toolforge
zhuyifei1999 awarded T193681: Tutorial for running continuous or cron-like jobs via Python on Toolforge a Love token.
Sun, May 6, 4:29 AM · Documentation, Toolforge
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

After locally patching the relevant webservice code on prefixes, webservice on grid seems to work: http://tools-beta.wmflabs.org/ (a nostalgic look; the response code is still 500 though) finally!

Sun, May 6, 4:22 AM · Patch-For-Review, Toolforge
zhuyifei1999 added a comment to T174082: Update code and/or docs for "How can I detect if I'm running in Labs?".

See also T192244: Provide a consistent way to identify operation in toolforge (including k8s)

Sun, May 6, 3:45 AM · Patch-For-Review, cloud-services-team (Kanban), Documentation, User-bd808, Toolforge

Sat, May 5

zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

maintain-kubeusers seems to be able to run successfully if invoked with the correct project name (toolsbeta):

zhuyifei1999@toolsbeta-k8s-master-01:~$ sudo kubectl delete namespace test test2 admin toolschecker
namespace "test" deleted
namespace "test2" deleted
namespace "admin" deleted
namespace "toolschecker" deleted
zhuyifei1999@toolsbeta-k8s-master-01:~$ sudo truncate -s 0 /etc/kubernetes/abac
zhuyifei1999@toolsbeta-k8s-master-01:~$ sudo truncate -s 0 /etc/kubernetes/tokenauth
zhuyifei1999@toolsbeta-k8s-master-01:~$ sudo rm -rv /data/project/{test,test2,admin,toolschecker}/.kube/
removed ‘/data/project/test/.kube/config’
removed directory: ‘/data/project/test/.kube/’
removed ‘/data/project/test2/.kube/config’
removed directory: ‘/data/project/test2/.kube/’
removed ‘/data/project/admin/.kube/config’
removed directory: ‘/data/project/admin/.kube/’
removed ‘/data/project/toolschecker/.kube/config’
removed directory: ‘/data/project/toolschecker/.kube/’
zhuyifei1999@toolsbeta-k8s-master-01:~$ sudo /usr/local/bin/maintain-kubeusers --infrastructure-users /etc/kubernetes/infrastructure-users --project toolsbeta https://toolsbeta-k8s-master-01.toolsbeta.eqiad.wmflabs:6443 /etc/kubernetes/tokenauth /etc/kubernetes/abac
starting a run
Provisioned creds for infra user client-infrastructure
Provisioned creds for infra user proxy-infrastructure
Homedir already exists for /data/project/toolschecker
Wrote config in /data/project/toolschecker/.kube/config
(b'namespace "toolschecker" created\n', b'')
Provisioned creds for tool toolschecker
Homedir already exists for /data/project/test
Wrote config in /data/project/test/.kube/config
(b'namespace "test" created\n', b'')
Provisioned creds for tool test
Homedir already exists for /data/project/admin
Wrote config in /data/project/admin/.kube/config
(b'namespace "admin" created\n', b'')
Provisioned creds for tool admin
Homedir already exists for /data/project/test2
Wrote config in /data/project/test2/.kube/config
(b'namespace "test2" created\n', b'')
Provisioned creds for tool test2
Provisioned creds for infra user prometheus
finished run, wrote 7 new accounts
^CTraceback (most recent call last):
  File "/usr/local/bin/maintain-kubeusers", line 405, in <module>
    time.sleep(args.interval)
KeyboardInterrupt

Sat, May 5, 8:37 PM · Patch-For-Review, Toolforge
zhuyifei1999 edited P7086 manual toolsbeta maintain-kubeusers invokation.
Sat, May 5, 8:35 PM
zhuyifei1999 created P7086 manual toolsbeta maintain-kubeusers invokation.
Sat, May 5, 8:34 PM
zhuyifei1999 added a comment to T190893: Setup the webservice-related instances in toolsbeta.

Got the k8s apiserver up. After tokenauth, it complained: May 05 20:06:21 toolsbeta-k8s-master-01 kube-apiserver[9888]: E0505 20:06:21.748829 9888 genericapiserver.go:742] Unable to listen for secure (crypto/tls: failed to parse private key); will try again. I then applied "role::toollabs::k8s::master::use_puppet_certs": true, and it seems to work.

Sat, May 5, 8:22 PM · Patch-For-Review, Toolforge