Page MenuHomePhabricator

apache2 fails to start after gerrit hosts are rebooted
Closed, ResolvedPublic

Description

apache2 fails to start after rebooting gerrit machines (role(gerrit)). It fails with "failed to bind to IPv6 address"

journalctl -u apache2
Jan 03 10:33:34 gerrit1001 systemd[1]: Starting The Apache HTTP Server...
Jan 03 10:33:35 gerrit1001 apachectl[846]: (99)Cannot assign requested address: AH00072: make_sock: could not bind to address [2620:0:861:2:208:80:154:137]:80
Jan 03 10:33:35 gerrit1001 apachectl[846]: no listening sockets available, shutting down
Jan 03 10:33:35 gerrit1001 apachectl[846]: AH00015: Unable to open logs
Jan 03 10:33:35 gerrit1001 apachectl[846]: Action 'start' failed.
Jan 03 10:33:35 gerrit1001 apachectl[846]: The Apache error log may have more information.
Jan 03 10:33:35 gerrit1001 systemd[1]: apache2.service: Control process exited, code=exited, status=1/FAILURE
Jan 03 10:33:35 gerrit1001 systemd[1]: apache2.service: Failed with result 'exit-code'.
Jan 03 10:33:35 gerrit1001 systemd[1]: Failed to start The Apache HTTP Server.

My guess it the puppet code for setting up the second interface runs after the first service restart. So the service fails until the interface is configured by the next puppet run.

interface::alias { 'gerrit server':
    ipv4 => $ipv4,
    ipv6 => $ipv6,
}

in modules/profile/manifests/gerrit.pp:L27

For reference, Apache TCP listening ports on gerrit2002:

hashar@gerrit2002:~$ sudo netstat -tlnp|grep apache2
tcp        0      0 208.80.153.104:443      0.0.0.0:*               LISTEN      4424/apache2        
tcp        0      0 127.0.0.1:443           0.0.0.0:*               LISTEN      4424/apache2        
tcp        0      0 208.80.153.104:80       0.0.0.0:*               LISTEN      4424/apache2        
tcp        0      0 127.0.0.1:80            0.0.0.0:*               LISTEN      4424/apache2        
tcp6       0      0 2620:0:860:4:208:80:443 :::*                    LISTEN      4424/apache2        
tcp6       0      0 ::1:443                 :::*                    LISTEN      4424/apache2        
tcp6       0      0 2620:0:860:4:208:80::80 :::*                    LISTEN      4424/apache2        
tcp6       0      0 ::1:80                  :::*                    LISTEN      4424/apache2

Event Timeline

Jelto triaged this task as Medium priority.Jan 3 2023, 2:03 PM
Jelto updated the task description. (Show Details)
Jelto updated the task description. (Show Details)

The Gerrit Apache config listens to the ipv4 and ipv6 addresses explicitly:

modules/profile/templates/gerrit/apache.ports.conf.erb
Listen 127.0.0.1:80
Listen <%= @ipv4 %>:80

<% if @ipv6 %>
Listen [::1]:80
Listen [<%= @ipv6 %>]:80
<% end -%>
  1. Machine starts
  2. The service IPs (defined by the interface::alias you found) are not yet applied
  3. Systemd brings up Apache2 which fails to start since the service IPs are not found
  4. The SRE reboot cookbook force runs Puppet
  5. systemd tries again and this time Apache 2 starts

We would need to change the interface::ip define type to write the state on the host so it is active when the machine starts rather than setting it up dynamically by running Puppet. Or maybe have a systemd job which runs on boot and is required by the Gerrit Apache 2 service?

More details. The Apache config on gerrit1001 (the primary) has:

/etc/apache2/conf-enabled/00-gerrit-listen-service-ip.conf
Listen 127.0.0.1:80
Listen 208.80.154.137:80


Listen [::1]:80
Listen [2620:0:861:2:208:80:154:137]:80

And that service IP seems to be /etc/network/interfaces:

/etc/network/interfaces
...
# The primary network interface
allow-hotplug eno1
iface eno1 inet static
	address 208.80.154.136/26  # Host address
	gateway 208.80.154.129
	# dns-* options are implemented by the resolvconf package, if installed
	dns-nameservers 10.3.0.1
	dns-search wikimedia.org
   pre-up /sbin/ip token set ::208:80:154:136 dev eno1
   up ip addr add 2620:0:861:2:208:80:154:136/64 dev eno1
   up ip addr add fe80::208:80:154:136/64 dev eno1
   up ip addr add 208.80.154.137/32 dev eno1
   up ip addr add 2620:0:861:2:208:80:154:137/128 preferred_lft 0 dev eno1
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Thus I don't get why it would not be available by the time Apache 2 is started? Apache2 service has After=network.target and systemctl list-dependencies --after apache2.service gives:

apache2.service
● ├─network.target
● │ ├─ifup@eno1.service
● │ ├─ifupdown-pre.service
● │ ├─networking.service
● │ ├─systemd-networkd.service
● │ └─network-pre.target

And from journalctl:

Jan 03 10:33:33 gerrit1001 systemd[1]: Starting Raise network interfaces...
Jan 03 10:33:33 gerrit1001 systemd[1]: Started Raise network interfaces.
Jan 03 10:33:33 gerrit1001 systemd[1]: Reached target Network.
Jan 03 10:33:33 gerrit1001 systemd[1]: Reached target Network is Online.
Jan 03 10:33:34 gerrit1001 systemd[1]: Starting The Apache HTTP Server...
Jan 03 10:33:35 gerrit1001 apachectl[846]: (99)Cannot assign requested address: AH00072: make_sock: could not bind to address [2620:0:861:2:208:80:154:137]:80
Jan 03 10:33:35 gerrit1001 apachectl[846]: no listening sockets available, shutting down
Jan 03 10:33:35 gerrit1001 apachectl[846]: AH00015: Unable to open logs

Change 874939 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: require interface::alias before httpd class

https://gerrit.wikimedia.org/r/874939

A bit more digging in the issue. Apache 2 service is marked with After=network.target:

/lib/systemd/system/apache2.service
[Unit]
Description=The Apache HTTP Server
After=network.target remote-fs.target nss-lookup.target
      ^^^^^^^^^^^^^^

Looking at systemd NetworkTarget documentation:

network.target has very little meaning during start-up. It only indicates that the network management stack is up after it has been reached. Whether any network interfaces are already configured when it is reached is undefined. Its primary purpose is for ordering things properly at shutdown: since the shutdown ordering of units in systemd is the reverse of the startup ordering, any unit that is order After=network.target can be sure that it is stopped before the network is shut down if the system is powered off.

Essentially it ensures that on shutdown Apache is stopped before network is shut down.

Looks like we would need an override that adds After=network-online.target and (per the above doc) Wants=network-online.target.

Change 875314 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] httpd: add flag to wait for network-online.target

https://gerrit.wikimedia.org/r/875314

Change 875315 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] gerrit: make Apache wait for network-online.target

https://gerrit.wikimedia.org/r/875315

Change 875347 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] systemd::unit: support multiple overrides

https://gerrit.wikimedia.org/r/875347

Change 874939 abandoned by Dzahn:

[operations/puppet@production] gerrit: require interface::alias before httpd class

Reason:

per jenkins -1 and review comments

https://gerrit.wikimedia.org/r/874939

Change 875347 abandoned by Hashar:

[operations/puppet@production] systemd::unit: support multiple overrides

Reason:

Incorporated into https://gerrit.wikimedia.org/r/c/operations/puppet/+/875365/13 ;)

https://gerrit.wikimedia.org/r/875347

Change 875314 merged by Jbond:

[operations/puppet@production] httpd: add flag to wait for network-online.target

https://gerrit.wikimedia.org/r/875314

Change 875315 merged by Jbond:

[operations/puppet@production] gerrit: make Apache wait for network-online.target

https://gerrit.wikimedia.org/r/875315

Mentioned in SAL (#wikimedia-operations) [2023-01-26T13:09:03Z] <hashar> Rebooting gerrit2002.wikimedia.org host to validate Apache 2 services starts AFTER network went online | T326125

After rebooting gerrit2002.wikimedia.org the issue is still present:

Jan 26 13:11:26 gerrit2002 systemd[1]: Reached target Network.
Jan 26 13:11:26 gerrit2002 systemd[1]: Reached target Network is Online.
...
Jan 26 13:11:26 gerrit2002 systemd[1]: Starting The Apache HTTP Server...
Jan 26 13:11:26 gerrit2002 apachectl[906]: (99)Cannot assign requested address: AH00072: make_sock: could not bind to address [2620:0:860:4:208:80:153:104]:80
Jan 26 13:11:26 gerrit2002 apachectl[906]: no listening sockets available, shutting down
Jan 26 13:11:26 gerrit2002 apachectl[906]: AH00015: Unable to open logs
Jan 26 13:11:26 gerrit2002 apachectl[906]: Action 'start' failed.
Jan 26 13:11:26 gerrit2002 apachectl[906]: The Apache error log may have more information.
Jan 26 13:11:27 gerrit2002 systemd[1]: apache2.service: Control process exited, code=exited, status=1/FAILURE
Jan 26 13:11:27 gerrit2002 systemd[1]: apache2.service: Failed with result 'exit-code'.
Jan 26 13:11:27 gerrit2002 systemd[1]: Failed to start The Apache HTTP Server.

Maybe network-online.target is reached when an IP is configured and the extra secondary IPv6 is configured a little bit later :-(

From dmesg -T:

[Thu Jan 26 13:11:27 2023] IPv6: ADDRCONF(NETDEV_UP): eno1: link is not ready

Which I guess comes from Apache 2 attempting to listen on it (and fails).

The link is ready a few seconds later:

[Thu Jan 26 13:11:31 2023] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready

There is a nice detailed explanation at https://www.the-art-of-web.com/system/ipv6-dad-tentative/

The IPv6 address states are tentative deprecated preferred unavailable. It is configured with preferred_lft 0 so it starts in deprecated state which is enough to satisfy network-online.target.

The fixes proposed to /etc/network/interfaces are:

  1. disable duplicate address detection which delays the IPv6 address leaving the tentative state:
pre-up echo 0 > /proc/sys/net/ipv6/conf/eth0/accept_dad
  1. sleep
post-up sleep 10
  1. wait for IPv6 to be ready:

Done by a busy look which waits for interface to no more have the tentative state:

post-up while [ $(ip addr show eth0 | grep -c tentative) -ne 0 ]; do echo "IPv6 post-up tentative"; sleep 1; done

Or in an if-up.d:

#!/bin/bash
# script /etc/network/if-up.d/ipv6-tentative
# wait for ipv6 addresses to leave 'tentative' state

[ "$IFACE" = '--all' ] || exit 0;
[ "$MODE"  = 'start' ] || exit 0;

while [ $(ip addr show eth0 | grep -c tentative) -ne 0 ]; do
  echo "IPv6 post-up tentative";
  sleep 1;
done

exit 0;

Change 883965 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] gerrit: listen on all port, DROP requests to host

https://gerrit.wikimedia.org/r/883965

Change 883965 merged by Jbond:

[operations/puppet@production] gerrit: listen on all ports, DROP requests to host

https://gerrit.wikimedia.org/r/883965

Mentioned in SAL (#wikimedia-operations) [2023-01-27T12:47:02Z] <hashar> gerrit1001 running Puppet to deploy https://gerrit.wikimedia.org/r/883965 and restarting Apache 2 to change the Listen statements # T326125

hashar claimed this task.
hashar added subscribers: jbond, Dzahn.

To summarize, the issue was that when rebooting the Gerrit host, Apache 2 would not be started automatically until it is manually started. That causes confusion whenever we gotta reboot the servers for eg a kernel update.

The root cause was the IPv6 address not being ready by the time Apache starts. I thought waiting for network-online.target to be reached would be sufficient, but that target does not wait for the IPv6 duplicate address detection (DAD) mechanism to have completed. Apache was thus not able to explicitly listen to the IPv6.

The explicit listening was made originally to prevent serving requests for the hostname ( https://gerrit1001.wikimedia.org/) by c6ce0404c0668abc6b8f80bf5df862fa61968a42

Instead I have changed Apache 2 to listen on all addresses (Listen 80 and Listen 443) which resolves the start up issue. As a side effect it also get rid of a custom Apache 2 ports configuration which slightly reduce the configuration complexity on the host.

After applying the patch to gerrit2002 and rebooting it, Apache2 successfully started. I have confirmed the iptables rule prevents requests to the host IP addresses.

Thanks @jbond for the various exchanges about systemd and listening sockets ;)

Change 908574 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] gerrit: remove leftover absent http config

https://gerrit.wikimedia.org/r/908574

Change 908574 merged by Dzahn:

[operations/puppet@production] gerrit: remove leftover absent http config

https://gerrit.wikimedia.org/r/908574