PAWS network error:
Closed, Resolved, Public

Description

test.py:

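import pywikibot
s = pywikibot.Site()  # default site from the user's pywikibot configuration
p = pywikibot.Page(s, "Test")
p.get()  # fetch the page text; this is line 4 in the traceback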

Run python test.py 2> test.err, then press Ctrl-C after a while.

Output:

svnversion failed
Traceback (most recent call last):
  File "/srv/pwb/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 135, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/srv/pwb/lib/python3.4/site-packages/requests/packages/urllib3/util/connection.py", line 66, in create_connection
    for res in socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM):
  File "/usr/lib/python3.4/socket.py", line 530, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    p.get()
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/tools/__init__.py", line 1327, in wrapper
    return obj(*__args, **__kw)
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/page.py", line 370, in get
    self._getInternals(sysop)
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/page.py", line 400, in _getInternals
    self.site.loadrevisions(self, getText=True, sysop=sysop)
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/page.py", line 150, in site
    return self._link.site
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/page.py", line 4950, in site
    self.parse()
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/page.py", line 4936, in parse
    if self._site.namespaces[self._namespace].case == 'first-letter':
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/site.py", line 1004, in namespaces
    self._namespaces = NamespacesDict(self._build_namespaces())
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/site.py", line 2520, in _build_namespaces
    is_mw114 = MediaWikiVersion(self.version()) >= MediaWikiVersion('1.14')
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/site.py", line 2627, in version
    version = self.siteinfo.get('generator', expiry=1).split(' ')[1]
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/site.py", line 1599, in get
    preloaded = self._get_general(key, expiry)
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/site.py", line 1545, in _get_general
    default_info = self._get_siteinfo(props, expiry)
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/site.py", line 1471, in _get_siteinfo
    data = request.submit()
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/data/api.py", line 2327, in submit
    self._data = super(CachedRequest, self).submit()
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/data/api.py", line 1947, in submit
    body=body, headers=headers)
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/tools/__init__.py", line 1327, in wrapper
    return obj(*__args, **__kw)
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/comms/http.py", line 247, in request
    r = fetch(baseuri, method, body, headers, **kwargs)
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/comms/http.py", line 397, in fetch
    request = _enqueue(uri, method, body, headers, **kwargs)
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/comms/http.py", line 379, in _enqueue
    _http_process(session, request)
  File "/srv/pwb/lib/python3.4/site-packages/pywikibot/comms/http.py", line 299, in _http_process
    verify=not ignore_validation)
  File "/srv/pwb/lib/python3.4/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/srv/pwb/lib/python3.4/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/srv/pwb/lib/python3.4/site-packages/requests/adapters.py", line 370, in send
    timeout=timeout
  File "/srv/pwb/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 559, in urlopen
    body=body, headers=headers)
  File "/srv/pwb/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 345, in _make_request
    self._validate_conn(conn)
  File "/srv/pwb/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 782, in _validate_conn
    conn.connect()
  File "/srv/pwb/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 215, in connect
    conn = self._new_conn()
  File "/srv/pwb/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 135, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
KeyboardInterrupt
CRITICAL: Closing network session.
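
The first traceback ends in socket.gaierror, so the hostname lookup fails before any HTTP request is sent. A minimal sketch for checking name resolution on its own, without pywikibot, might look like the following. The name check_dns and its resolver parameter are illustrative assumptions and are not part of pywikibot or PAWS.

```python
import socket


def check_dns(host, port=443, resolver=socket.getaddrinfo):
    """Return (True, addresses) if host resolves, else (False, error text).

    `resolver` defaults to socket.getaddrinfo, the call that fails in the
    traceback. It is a parameter only so the check can be exercised
    without network access (an assumption made for this sketch).
    """
    try:
        results = resolver(host, port, 0, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, str(exc)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    addresses = sorted({res[4][0] for res in results})
    return True, addresses
```

Calling check_dns with the hostname of the wiki in question from inside the affected container should return (False, ...) with the same "Name or service not known" text if the problem is at the lookup stage.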

Event Timeline

zhuyifei1999 raised the priority of this task to Needs Triage.
zhuyifei1999 updated the task description.
zhuyifei1999 added subscribers: zhuyifei1999, yuvipanda.

Yeah, I think this is an underlying network issue. Investigating...

So...

Because of the massive cert upheaval (caused by T120159), flannel and kube-proxy had stopped working. I spent the last few hours totally destroying and rebuilding our cluster, and everything's sparkly and shiny now! I tested a few new spawns and they all have proper network access. Can someone else verify?

Hmm, this just happened again. Investigating.

Ok, so this failed when it was scheduled on tools-worker-05 but *not* on tools-worker-01. Network requests can come in but do not go out. This makes me suspect this is some form of SNAT issue.
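
The "can come in but do not go out" symptom can be checked from inside a container with a plain outbound TCP connect. This is only a sketch: the function name can_connect_out is hypothetical, and a failure here shows that outbound connections do not complete, not that SNAT specifically is the cause.

```python
import socket


def can_connect_out(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failure (socket.gaierror), connection refusal and
        # timeouts, which are all OSError subclasses in Python 3.
        return False
```

Running the same check on a pod scheduled on each worker would show whether the failure follows the node, as described for tools-worker-05 and tools-worker-01.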

Ok, I think I've isolated it - I had forgotten that I had upgraded the kernels on the old nodes (kube-proxy needs at least 3.18 I think, and default is 3.16). The older nodes are on 3.19, but that doesn't seem to work anymore, so I've upgraded the new nodes to 4.12. I've tested by stopping and starting my server a few times and it all seems ok!
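
A quick way to compare a node's kernel against the minimum mentioned above is sketched below. The 3.18 threshold is taken from the comment, which itself hedges it with "I think"; the helper names kernel_version and kernel_at_least are made up for illustration.

```python
import platform
import re


def kernel_version(release):
    """Parse the leading 'major.minor' of a kernel release string into a tuple."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release, minimum=(3, 18)):
    """Return True if the release is at or above `minimum` (major, minor).

    The default minimum is the unconfirmed figure from the comment.
    """
    return kernel_version(release) >= minimum
```

On a node, kernel_at_least(platform.release()) applies the check to the running kernel.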

Change 257273 had a related patch set uploaded (by Yuvipanda):
k8s: Make sure docker doesn't do ip-masq

https://gerrit.wikimedia.org/r/257273

Change 257273 merged by Yuvipanda:
k8s: Make sure docker doesn't do ip-masq

https://gerrit.wikimedia.org/r/257273

This hasn't happened since the fixes above.

This is still happening, but only on a per-node basis, triggered randomly?! I can reproduce this by just setting up a container in docker (independent of kubernetes).

Change 259325 had a related patch set uploaded (by Yuvipanda):
k8s: Have flannel do ip-masq

https://gerrit.wikimedia.org/r/259325

Change 259325 merged by Yuvipanda:
k8s: Have flannel do ip-masq

https://gerrit.wikimedia.org/r/259325
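
Since the two patches move masquerading from docker to flannel, it can help to see which masquerade rules a node actually has. The sketch below filters the text printed by iptables -t nat -S POSTROUTING (run as root on the node) for rules that jump to MASQUERADE. The function name masquerade_rules is hypothetical, and the sketch says nothing about what the patches themselves changed.

```python
import shlex


def masquerade_rules(iptables_output):
    """Return the rules in `iptables -t nat -S` output that jump to MASQUERADE."""
    rules = []
    for line in iptables_output.splitlines():
        tokens = shlex.split(line)
        # A rule names its target with "-j TARGET" (or "--jump TARGET").
        for flag, target in zip(tokens, tokens[1:]):
            if flag in ("-j", "--jump") and target == "MASQUERADE":
                rules.append(line.strip())
                break
    return rules
```

Comparing the source ranges in the returned rules between a working and a failing node is one way to spot a node where more than one component, or none, is masquerading container traffic.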

This seems mostly fixed! However nodes are still randomly getting stuck :(

Pretty sure the flannel fix fixed it :) Instance lockup is being tracked in T121998