I'm Arturo Borrero Gonzalez from Seville, Spain. I'm a Site Reliability Engineer (SRE) on the Wikimedia Cloud Services team, part of the Wikimedia Foundation staff.
You may find me in some FLOSS projects, like Netfilter and Debian.
current status:
A few things detected upon the initial deployment:
In T301380#7844583, @GoranSMilovanovic wrote: Thank you @aborrero.
@ItamarWMDE @Tobi_WMDE_SW @Manuel
I hope you are aware of the fact that changing the instance name in CloudVPS implies a change in its URL, e.g.
- current URL for Wikidata Analytics is https://wikidata-analytics.wmcloud.org/
- because the name of the CloudVPS instance in the wmde-dashboards project is wikidata-analytics.
@Manuel This is especially relevant for you since you have placed an (understandable, rational) demand to always keep the old URLs alive.
@ItamarWMDE Do you think there is something that we can do about this?
There is no short term solution to this.
This is blocked on T277653: Toolforge: add Debian Buster to the grid and eliminate Debian Stretch, the old Debian Stretch grid relies on resolving the .eqiad.wmflabs names.
Perhaps try requesting more resources for the job; see https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Job_quotas
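For reference, extra memory/CPU is requested at job creation time with the jobs framework CLI. A rough sketch with made-up job, command and image names (double-check the exact flags and limits in the linked docs):

  $ toolforge-jobs run myjob --command ./myscript.sh --image tf-bullseye-std --mem 2Gi --cpu 1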
Hey, thanks to @Majavah, who did some internal inspection: we believe the problem is that there is already a virtual machine with the same name somewhere in Cloud VPS.
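A quick way for an admin to check for such a clash is to search across all projects with the OpenStack CLI (sketch; replace the placeholder name):

  $ openstack server list --all-projects --name <instance-name>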
In T302178#7835058, @aborrero wrote: I will put the .deb packaging in here: https://gitlab.wikimedia.org/repos/cloud/deb/pkg-prometheus-openstack-exporter
We discussed this in the WMCS team meeting today, and pretty much agreed with this idea.
I will put the .deb packaging in here: https://gitlab.wikimedia.org/repos/cloud/deb/pkg-prometheus-openstack-exporter
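For anyone wanting to rebuild the package locally, the standard Debian workflow should apply; a sketch, assuming the repo follows the usual debian/ layout and build dependencies are already installed:

  $ git clone https://gitlab.wikimedia.org/repos/cloud/deb/pkg-prometheus-openstack-exporter
  $ cd pkg-prometheus-openstack-exporter
  $ dpkg-buildpackage -us -uc -b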
all agents are back online:
We now have python3-eventlet version 0.30.2-5~bpo11+1 in the bullseye-wallaby repo; upgrading codfw1dev with that.
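On a given host, checking and pulling in the backported version amounts to something like this (sketch; actual pinning/preferences may differ):

  $ apt-cache policy python3-eventlet
  $ sudo apt-get install -t bullseye-wallaby python3-eventlet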
Talked to fellow Debian Developers to ask them to put a newer version of python3-eventlet on the bullseye-wallaby repo.
The version of python3-eventlet that contains the mentioned DNS fixes is >= 0.30.2-3, per the changelog at https://tracker.debian.org/media/packages/p/python-eventlet/changelog-0.30.2-5
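To confirm whether an installed version already carries the fix, dpkg can do the comparison; a minimal check:

  $ dpkg --compare-versions "$(dpkg-query -W -f='${Version}' python3-eventlet)" ge 0.30.2-3 && echo "has the DNS fixes"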
aborrero@cloudgw1001:~ 4 $ sudo systemctl status systemd-sysctl
● systemd-sysctl.service - Apply Kernel Variables
     Loaded: loaded (/lib/systemd/system/systemd-sysctl.service; static)
     Active: active (exited) since Tue 2022-04-05 12:16:43 UTC; 10min ago
       Docs: man:systemd-sysctl.service(8)
             man:sysctl.d(5)
    Process: 434 ExecStart=/lib/systemd/systemd-sysctl (code=exited, status=0/SUCCESS)
   Main PID: 434 (code=exited, status=0/SUCCESS)
        CPU: 15ms
The reimage resulted in new NIC names for cloudgw :-( The new names are longer and don't support having the vlan tag attached to them.
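For context on why the tag no longer fits: Linux caps interface names at 15 characters (IFNAMSIZ), so a <nic>.<vlanid> sub-interface name built from one of the longer predictable NIC names gets rejected. A sketch with a made-up NIC name:

  # "enp101s0f0np0.2107" is 18 characters, above the 15-character limit,
  # so creating the tagged sub-interface under that name fails:
  $ sudo ip link add link enp101s0f0np0 name enp101s0f0np0.2107 type vlan id 2107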
In T277653#7827047, @komla wrote: One user sent a technical enquiry to the Cloud mailing list but the post is currently being held for moderation because of size.
Can this be reviewed?
forked the eventlet/dnspython problem into T305157: Openstack Wallaby on Debian 11 Bullseye problems because eventlet and dnspython
Again, @dcaro pointed at the combo of dnspython/eventlet as being troubled.
Latest theory by @dcaro is name resolution intermixed with IPv6 connectivity issues.
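One quick way to poke at that combination on an affected host is to monkey-patch the socket module with eventlet (so its greendns/dnspython resolver kicks in) and then do an IPv6 lookup; a sketch with an arbitrary hostname:

  $ python3 -c 'import eventlet; eventlet.monkey_patch(); import socket; print(socket.getaddrinfo("wikimedia.org", 443, socket.AF_INET6))'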
All 3 cloudcontrols show the same mariadb connectivity problem:
Neutron has been detected to be down @ codfw1dev after the upgrade.
The idea to have these GitLab CI-related container images stored in GitLab itself came from this PoC:
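The gist of the approach, sketched as the script steps of a CI job using GitLab's standard predefined CI variables (the image tag is illustrative):

  $ docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
  $ docker build -t "$CI_REGISTRY_IMAGE:latest" .
  $ docker push "$CI_REGISTRY_IMAGE:latest"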
I confirm you are out of quota for more deployments.
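If this is about the Toolforge Kubernetes quota, the current usage can be inspected from a bastion as the tool user; a sketch:

  $ kubectl describe resourcequota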
This is done for now.
We had a meeting today.
Will wait a few more days before upgrading the eqiad server, to run a few more tests, merge these patches, etc.
A meeting was held today and we agreed on going with Option 3, a deploy.sh script.
I'm adding Option 4: Enable BYOC only for a few selected users that request it.
For the record:
Done, note the 400:
2022-03-22 10:54:38 INFO: new configuration: {'task_compose_emails_loop_sleep': '400', 'task_send_emails_loop_sleep': '10', 'task_send_emails_max': '10', 'task_watch_pods_timeout': '60', 'task_read_configmap_sleep': '10', 'email_to_domain': 'tools.wmflabs.org', 'email_to_prefix': 'tools', 'email_from_addr': 'noreply@toolforge.org', 'smtp_server_fqdn': 'mail.tools.wmflabs.org', 'smtp_server_port': '25', 'send_emails_for_real': 'yes', 'debug': 'yes'}
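As an aside, if the emailer reads these settings from a Kubernetes ConfigMap (the namespace and object names below are hypothetical), the bump amounts to something like:

  $ kubectl -n jobs-emailer edit configmap jobs-emailer-configmap
  # set task_compose_emails_loop_sleep to "400" and save; judging by the
  # task_read_configmap_sleep setting above, the component should re-read it periodically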
As a quick countermeasure, will try increasing the time we cache events before we send an email. Hopefully this is enough to catch repeated events.
I think I have a theory about what's happening. The k8s API is really chatty about events going on for pods, which is good, but it forces the emailer to do some filtering and caching to avoid flooding you with meaningless emails, which can be tricky.
thanks!
Can you please paste here the full repeated emails, with the complete email source and headers?
In T283894#7746051, @gerritbot wrote: Change 767249 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):
[operations/puppet@production] WIP: gitlab: enable agent server for kubernetes