Page MenuHomePhabricator

Memory leak on acme-chief 0.21
Closed, ResolvedPublic

Description

checking acmechief1001 after the following icinga alert has been triggered:

PROBLEM - Ensure that passive node gets the certificates from the active node as expected on acmechief2001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/acme-chief/certs/.rsync.status is 7246 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief

puppet fails to run due to memory issues:

vgutierrez@acmechief1001:~$ sudo -i puppet agent -t
2019-09-28 16:27:06.261665 WARN  puppetlabs.facter - locale environment variables were bad; continuing with LANG=C LC_ALL=C
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for acmechief1001.eqiad.wmnet
Info: Applying configuration version '1569688031'
Error: /Stage[main]/Base::Standard_packages/Package[libperl5.24]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Base::Standard_packages/Package[ruby2.3]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Base::Standard_packages/Package[libruby2.3]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Base::Standard_packages/Package[libunbound2]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Base::Standard_packages/Package[mcelog]: Could not evaluate: Cannot allocate memory - fork(2)
...

a quick check on the process list shows acme-chief-backend consuming a 80% of the memory:

acme-ch+ 19080  2.0 80.0 2598632 1634560 ?     Ss   Sep23 155:55 /usr/bin/python3 /usr/bin/acme-chief-backend

This is the first time that this kind of behaviour has been observed so it looks like the issue has been introduced with acme-chief 0.21

Event Timeline

Restricted Application added a project: Operations. · View Herald TranscriptSep 28 2019, 4:35 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Vgutierrez triaged this task as High priority.Sep 28 2019, 4:35 PM

The issue is also happening on acmechief-test1001, so I decided to hack acme-chief code a little bit to add objgraph reports, I've added the following snippet to the certificate_management loop:

if datetime.datetime.utcnow() - last_stats >= stats_timedelta:
    logger.info("Dumping objgraph growth stats")
    with open('/tmp/acme-chief.growth', 'a') as stats_file:
        objgraph.show_growth(file=stats_file)
    last_stats = datetime.datetime.utcnow()

The first run shows the following output:

dict                           4800     +4800
tuple                          4497     +4497
weakref                        2790     +2790
builtin_function_or_method     2348     +2348
list                           1741     +1741
getset_descriptor              1716     +1716
type                           1526     +1526
wrapper_descriptor             1306     +1306
method_descriptor              1133     +1133

And after acme-chief got reloaded an additional line has been added:

list     1742        +1

Nothing that justifies that the memory usage reported by ps has growth from a 1.8% to a 2.8% (and growing)

I've hacked an ACMEChief subclass that only performs OCSP response checks/updates and it still leaks memory:

class OCSPChecker(ACMEChief):
    def certificate_management(self):
        while True:
            for cert_id in self.cert_status:
                for key_type_id in KEY_TYPES:
                    cert_state = self.cert_status[cert_id][key_type_id]
                    if cert_state.status not in (CertificateStatus.VALID,
                                                 CertificateStatus.CERTIFICATE_STAGED,
                                                 CertificateStatus.READY_TO_BE_PUSHED):
                        continue
                    self._fetch_ocsp_response(cert_id, key_type_id)
            time.sleep(1)

Using objgraph to track leaking objects doesn't show anything wrong either:

dict       723
set        248
CTypeDescr 101
tuple      39
list       7
SignalDict 6
weakref    1
method     1
cell       1

the same output (generated with objgraph.show_most_common_types(objects=objgraph.get_leaking_objects()) is obtained every 60 secs

Using valgrind against a simplified version that only runs _fetch_ocsp_response() against unified / rsa-2048 shows a leak on:

==25358== 8,888 (1,536 direct, 7,352 indirect) bytes in 96 blocks are definitely lost in loss record 241 of 251
==25358==    at 0x483577F: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==25358==    by 0x6484328: CRYPTO_zalloc (mem.c:230)
==25358==    by 0x63A9FF8: asn1_item_embed_new (tasn_new.c:122)
==25358==    by 0x63A7A6A: asn1_item_embed_d2i (tasn_dec.c:306)
==25358==    by 0x63A8172: asn1_template_noexp_d2i (tasn_dec.c:592)
==25358==    by 0x63A8548: asn1_template_ex_d2i (tasn_dec.c:499)
==25358==    by 0x63A772D: asn1_item_embed_d2i (tasn_dec.c:178)
==25358==    by 0x63A864C: ASN1_item_ex_d2i (tasn_dec.c:124)
==25358==    by 0x63A86CA: ASN1_item_d2i (tasn_dec.c:114)
==25358==    by 0x6511154: X509V3_EXT_d2i (v3_lib.c:142)
==25358==    by 0x6D4C69B: ??? (in /usr/lib/python3/dist-packages/cryptography/hazmat/bindings/_openssl.abi3.so)
==25358==    by 0x5D7990: _PyMethodDef_RawFastCallKeywords (in /usr/bin/python3.7)

I've been unable to get a report from valgrind showing any kind of memory leak parsing a OCSP response using acme_chief.ocsp.OCSPResponse class, but printing the OCSP URI of a Certificate shows a memory leak:

cert_path = '/var/lib/acme-chief/certs/unified/live/rsa-2048.chained.crt'
cert = Certificate.load(cert_path)
print(cert.ocsp_uri)

and valgrind reports:

==29944== 151 (32 direct, 119 indirect) bytes in 2 blocks are definitely lost in loss record 24 of 239
==29944==    at 0x483577F: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==29944==    by 0x591D328: CRYPTO_zalloc (mem.c:230)
==29944==    by 0x5842FF8: asn1_item_embed_new (tasn_new.c:122)
==29944==    by 0x5840A6A: asn1_item_embed_d2i (tasn_dec.c:306)
==29944==    by 0x5841172: asn1_template_noexp_d2i (tasn_dec.c:592)
==29944==    by 0x5841548: asn1_template_ex_d2i (tasn_dec.c:499)
==29944==    by 0x584072D: asn1_item_embed_d2i (tasn_dec.c:178)
==29944==    by 0x584164C: ASN1_item_ex_d2i (tasn_dec.c:124)
==29944==    by 0x58416CA: ASN1_item_d2i (tasn_dec.c:114)
==29944==    by 0x59AA154: X509V3_EXT_d2i (v3_lib.c:142)
==29944==    by 0x68A769B: ??? (in /usr/lib/python3/dist-packages/cryptography/hazmat/bindings/_openssl.abi3.so)
==29944==    by 0x5D7990: _PyMethodDef_RawFastCallKeywords (in /usr/bin/python3.7)

Similar reports can be obtained by getting the SAN list of a certificate, so it looks like the issue is related to cryptography.x509.Extensions.get_extension_for_class

Issue is still reproducible after getting rid of acme_chief.x509 abstraction layer, this is a PoC triggering the issue:

from cryptography import x509 as crypto_x509
from cryptography.hazmat.backends import default_backend
from cryptography.x509.oid import AuthorityInformationAccessOID

CRYPTOGRAPHY_BACKEND = default_backend()

def main():
    cert_path = '/var/lib/acme-chief/certs/unified/live/rsa-2048.crt'
    with open(cert_path, 'rb') as pem_file:
        cert = crypto_x509.load_pem_x509_certificate(pem_file.read(), CRYPTOGRAPHY_BACKEND)

    ext = cert.extensions.get_extension_for_class(crypto_x509.extensions.AuthorityInformationAccess)
    for description in ext.value:
        if description.access_method == AuthorityInformationAccessOID.OCSP:
            print(description.access_location.value)


if __name__ == '__main__':
    main()

valgrind reports

==32152== LEAK SUMMARY:
==32152==    definitely lost: 32 bytes in 2 blocks
==32152==    indirectly lost: 119 bytes in 5 blocks

from:

==32152== 151 (32 direct, 119 indirect) bytes in 2 blocks are definitely lost in loss record 24 of 238
==32152==    at 0x483577F: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==32152==    by 0x5EE2328: CRYPTO_zalloc (mem.c:230)
==32152==    by 0x5E07FF8: asn1_item_embed_new (tasn_new.c:122)
==32152==    by 0x5E05A6A: asn1_item_embed_d2i (tasn_dec.c:306)
==32152==    by 0x5E06172: asn1_template_noexp_d2i (tasn_dec.c:592)
==32152==    by 0x5E06548: asn1_template_ex_d2i (tasn_dec.c:499)
==32152==    by 0x5E0572D: asn1_item_embed_d2i (tasn_dec.c:178)
==32152==    by 0x5E0664C: ASN1_item_ex_d2i (tasn_dec.c:124)
==32152==    by 0x5E066CA: ASN1_item_d2i (tasn_dec.c:114)
==32152==    by 0x5F6F154: X509V3_EXT_d2i (v3_lib.c:142)
==32152==    by 0x68AC69B: ??? (in /usr/lib/python3/dist-packages/cryptography/hazmat/bindings/_openssl.abi3.so)
==32152==    by 0x5D7990: _PyMethodDef_RawFastCallKeywords (in /usr/bin/python3.7)

It looks like it's a well-known issue and it's been fixed by upstream with https://github.com/pyca/cryptography/pull/4889

Mentioned in SAL (#wikimedia-operations) [2019-10-01T08:57:54Z] <vgutierrez> upgrading python3-cryptography to version 2.6.1-3+deb10u1~wmf1 on acmechief-test1001 - T234131

Mentioned in SAL (#wikimedia-operations) [2019-10-01T09:04:39Z] <vgutierrez> upgrading python3-cryptography to version 2.6.1-3+deb10u1~wmf1 on acme-chief hosts - T234131

Mentioned in SAL (#wikimedia-operations) [2019-10-01T09:07:09Z] <vgutierrez> restarting acme-chief on acmechief1001 to catch up with python3-cryptography upgrades - T234131

Vgutierrez closed this task as Resolved.Oct 2 2019, 4:03 AM
Vgutierrez claimed this task.

Reported to debian package maintainer on https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941413

I got in touch with the Debian maintainer for backporting this to Buster; I've just uploaded an updated package with the patches to Debian unstable. Having a fix in unstable is a pre-condition for fixing it in stable; I'll deal with that next week.

@Vgutierrez I created a 2.6.1-3+deb10u2, it's in my home on acmechief1001. Let's deploy this on acmechief* hosts on Monday before I submit this to the Debian stable release managers.

Mentioned in SAL (#wikimedia-operations) [2019-10-21T06:28:35Z] <vgutierrez> Install python3-cryptography-2.6.1-3+deb10u2 on acme-chief hosts - T234131

@Vgutierrez I created a 2.6.1-3+deb10u2, it's in my home on acmechief1001. Let's deploy this on acmechief* hosts on Monday before I submit this to the Debian stable release managers.

Done, I'll keep an eye on the memory consumption.