
Move Cloud VPS auth.logs to central logging
Open, Medium, Public

Description

Among other things, central logging would prevent malevolent roots from covering their tracks.

https://wikitech.wikimedia.org/wiki/Incident_documentation/20160212-LabsSudoVulnerability

Event Timeline

chasemp triaged this task as Medium priority. Apr 4 2016, 2:15 PM

As the original reporter of T127656, may I offer my help here? Central logging sounds like a fun but useful side project.

@Southparkfan, thank you for offering! I've spent a bit of time with central logging for OpenStack services but haven't thought much about in-cloud logging. There are a lot of applications for central logging; this is one of the easier ones.

The client side bits (gathering and sending logs to ELK) should be pretty simple. My main concerns are:

  1. Multi-tenancy. The production ELK stack just dumps everything into one bucket; for cloud projects we'd want something subtler so that members of project A can't see central logs from project B. I would much prefer to have a central Kibana that aggregates logs from everywhere with RBAC rather than standing up a whole ELK stack in each project.
  2. Standing up the ELK services in a cloud project so that logs don't have to cross the cloud/prod border. This step may be trivial to someone who has already done it (and, in theory, there are existing puppet classes to manage this) but I haven't ever done it or thought much about it.

I'd be happy to set up a project for you to work on this, if you have some idea of how to manage issue 1.

-A

An additional potential complication of ELK stack usage is T272238: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence. I'm pretty sure that the FOSS community will fork the ELK bits that are now under a proprietary license, but until that happens deploying new ELK clusters inside Cloud VPS projects is a potentially risky operation.

> An additional potential complication of ELK stack usage is T272238: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence. I'm pretty sure that the FOSS community will fork the ELK bits that are now under a proprietary license, but until that happens deploying new ELK clusters inside Cloud VPS projects is a potentially risky operation.

Surely the versions of these projects packaged with Debian are still freely licensed. That means we might need to rebuild things after a new fork is proposed but I wouldn't expect it to be harmful/TOS-violating to deploy things in the meantime. Is that mistaken somehow?

> Surely the versions of these projects packaged with Debian are still freely licensed. That means we might need to rebuild things after a new fork is proposed but I wouldn't expect it to be harmful/TOS-violating to deploy things in the meantime. Is that mistaken somehow?

Correct, the existing Debian packages are under Apache 2.0 licenses. The potential for harm here in my opinion is primarily wasted efforts if the hoped for forks do not appear.

It is also currently unclear if Elastic will be backporting security fixes implemented post 7.10.0 to the Apache 2.0 licensed legacy releases. Such backports can only be made by Elastic as their CLA super-powers would be needed to relicense the new SSPL licensed security patches as Apache 2.0.

@Andrew / @bd808, thank you for all of your comments. I wouldn't call myself the expert of WMCS and ELK, but I do have experience with syslog and log management, albeit with other tools (unfortunately nothing OSI-approved anymore, since Graylog has the SSPL license now as well).

> [...]
>
>   1. Multi-tenancy. The production ELK stack just dumps everything into one bucket; for cloud projects we'd want something subtler so that members of project A can't see central logs from project B. I would much prefer to have a central Kibana that aggregates logs from everywhere with RBAC rather than standing up a whole ELK stack in each project.
>   2. Standing up the ELK services in a cloud project so that logs don't have to cross the cloud/prod border. This step may be trivial to someone who has already done it (and, in theory, there are existing puppet classes to manage this) but I haven't ever done it or thought much about it.

Good questions, but before we proceed, it is important to determine the scope and objectives of this 'project'. To be honest, my idea was to copy the authentication logs to a syslog server for the WMCS admins (send them to a remote syslog server, but keep them locally as well). In case of an incident - in this case, me having escalated my privileges on a bastion - WMCS (perhaps on request from the project admins) could review the logs. Log manipulation is unfortunately an existing threat. Storing the logs in any kind of hardened, central system, accessible only by a limited set of users, reduces the impact of an attack in which logs are manipulated.

On top of that, rsyslog / syslog-ng (still) have OSI-approved licenses ;-). However, if storing the logs in an ELK cluster is the way to go, and WMCS has determined that going forward with ELK is not an issue, let's go for that.

Regarding option 1), is this about storing the logs in the production ELK cluster? I assume this requires the approval of SRE-o11y, but a security review must be conducted in my opinion as well, since a) Kibana's RBAC has not been tested before, b) this introduces firewall rules that allow traffic from WMCS to production, whereas I see both as different realms, c) unvetted users get access to a production system containing very sensitive data, which increases the attack surface of Kibana ("what if a vulnerability is discovered in Kibana that only requires a valid login...") and d) log files of multiple 'environments' are being processed in the same system. I know I am writing this comment with worst-case scenarios in mind: anyone with the proper mandate can decide to accept the risks, but at least I'd prefer people to make informed decisions. I don't know if there is any policy (or similar) about sharing data between WMCS and production systems; feel free to forward the information if something exists.

The second option reduces my concerns, although it increases the workload on <the team that will be responsible for the ELK stack>. The RBAC model in Kibana seems to be based on per-index access: e.g. access to logs of the index 'bastion' can be granted to project admins of the project 'bastion'. Below, I have drawn a nice, little ASCII-based diagram of the infrastructure ('central log stack' can be anything: log file based, ELK, Graylog, etc), including the stakeholders.

                              +--------------------+
                              |                    |
                              | Project data access|
                              |        (RBAC)      |
                              |                    |
                              +----------+---------+
                                         |
                                         |
                                         |
                                         |
                                         |
                                         |
                                         |
+-------------------------------+        |          +-------------------------------+
| Project <any project>         |        |          |  Project <managed by WMCS>    |
|                               |        |          |                               |
|    +--------------------+     |        |          |    +--------------------+     |            +--------------------+
|    |                    |     |        +--------------->                    |     |            |                    |
|    |  Project instance  |     |                   |    | Central log stack  <------------------+      WMCS / SRE    |
|    |                    +------------------------------>                    |     |            |    (full access)   |
|    |                    |     |   syslog ingest   |    |                    |     |            |                    |
|    +--------------------+     |                   |    +--------------------+     |            +--------------------+
|                               |                   |                               |
+-------------------------------+                   +-------------------------------+

For the record, I would like to do this in my volunteer time. We should discuss the aforementioned questions and points first, we can schedule a meeting for this. WMCS contacts and access to infrastructure (new Cloud VPS project?) are part of the prerequisites, maybe an NDA as well.

So central logging is a thing we've been very interested in for Cloud VPS projects. A strictly security-focused syslog server would have value even if it was just for cloud admins. If it's something more like a multitenant logging solution that somehow implements good multitenancy in Elastic's stack without their enterprise offering or a multitenant Grafana Loki system (which would be simpler on the log aggregation end but not-so-simple at the Grafana end without "enterprise" features since you'd need to have it set up as multi-org) that'd be quite a project to dig into! Loki is nice because it is not slated to stop being Apache 2 licensed, but it is also not deployed at Wikimedia at all yet (though Grafana obviously is).

I think no matter what this would be a "see privileged information in other projects" sort of thing that should follow the policies in https://wikitech.wikimedia.org/wiki/Help:Access_policies for a "Special Project" and probably ought to live inside a Cloud-VPS project (to echo @Andrew).

> So central logging is a thing we've been very interested in for Cloud VPS projects. A strictly security-focused syslog server would have value even if it was just for cloud admins. If it's something more like a multitenant logging solution that somehow implements good multitenancy in Elastic's stack without their enterprise offering or a multitenant Grafana Loki system (which would be simpler on the log aggregation end but not-so-simple at the Grafana end without "enterprise" features since you'd need to have it set up as multi-org) that'd be quite a project to dig into! Loki is nice because it is not slated to stop being Apache 2 licensed, but it is also not deployed at Wikimedia at all yet (though Grafana obviously is).

Yes, Grafana Loki is one of the newer solutions. I am not experienced with Grafana Loki, but it may be the best solution for extracting metrics from logs, although I am not sure if Grafana Loki fits any high availability / scalability requirements. Elastic is more known, but with the license issues, it's better to let that slide for now (if the software is forked, we can reconsider!). What do you think?

Initially, this ticket only mentioned 'moving auth.log off the systems'. For security reasons (T127717#6839486), it is advisable to do this soon. 'Central logging' is a lot more than just storing a log file on a remote system; while definitely an interesting project (e.g. implementing Elastic or Grafana Loki for all logs on a system), it exceeds the scope of this task. Fortunately, implementing the 'easiest', secure and pure-FOSS solution - host -> remote syslog server -> rsyslog -> local file storage, with initial access for WMCS staff (and other folks, if authorised) - seems doable, yet does the heavy lifting on the 'client' side. If WMCS decides to move to another log solution in the future, significant time will be spent building the new system, but ingesting client logs is nothing more than changing the remote destination in rsyslog's config.
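A minimal client-side sketch of that 'easiest' solution (the target host and queue settings are illustrative placeholders, not the actual WMCS configuration; TLS, discussed later in this task, is omitted here for brevity):

```
# Illustrative sketch only; the target host is a placeholder.
# The stock rules keep writing /var/log/auth.log locally; this extra rule
# additionally ships auth/authpriv messages to the central server, using a
# disk-assisted queue so messages survive short server outages.
auth,authpriv.* action(type="omfwd"
    target="syslog.example.wmcloud.org" port="6514" protocol="tcp"
    queue.type="LinkedList" queue.filename="auth_fwd"
    queue.saveOnShutdown="on")
```

Switching to a different central log solution later would indeed only mean changing the `target` here.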

> I think no matter what this would be a "see privileged information in other projects" sort of thing that should follow the policies in https://wikitech.wikimedia.org/wiki/Help:Access_policies for a "Special Project" and probably ought to live inside a Cloud-VPS project (to echo @Andrew).

Makes sense. During the test phase (creating multiple instances in a Cloud VPS project to test the log collection and puppetise the config), it doesn't have to be a special project (since the scope is restricted to instances in the project), but the special project status would allow us to test with 'real logs' (e.g. auth.log files from cloudinfra instances) later on.

taavi renamed this task from Move labs auth.logs to central logging to Move Cloud VPS auth.logs to central logging. Mar 15 2021, 7:51 AM

Short update: using local puppet patches, I got the syslog forwarding working. Next step will be getting those patches into the production branch.

Change 682259 had a related patch set uploaded (by Southparkfan; author: Southparkfan):

[operations/puppet@production] Add WMCS specific cloud role for syslog server

https://gerrit.wikimedia.org/r/682259

Change 682259 merged by Andrew Bogott:

[operations/puppet@production] Add WMCS specific cloud role for syslog server

https://gerrit.wikimedia.org/r/682259

Now that the patch above has been merged, we can start thinking about applying the syslog client configuration by default on Cloud VPS instances. The central syslog server should be in the cloudinfra project. There are a few challenges to tackle, though:

  • Projects with a standalone puppetmaster: at Wikimedia, mutual authentication (to preserve the authenticity of the syslog message source) works via the Puppet CA. Projects using a standalone puppetmaster do not have a server certificate signed by the WMCS global puppetmaster. Using a different CA is fine, but automatic provisioning of certificates (upon instance creation), automatic renewal and addition to the OS' trust store are mandatory.
  • Projects with custom syslog client settings; such projects are rather rare, though. profile::base::remote_syslog_tls accepts multiple syslog servers, so this may not be an issue at all.
  • Storage/HA: rsyslog uses the omfile module by default. In the auditlogging project, the omfile target was a directory on Cinder block storage using the ext4 file system. Cinder is reliable, but volumes can be attached to one instance at a time. As long as the storage module is working (in our case, moving the Cinder volume to the standby syslog server) and the TLS input has been configured properly (at least same CA and x.509 CN as the primary syslog server), failing over to a different syslog server can be done any time.

I wouldn't say the syslog server could be considered critical; if the server is down for a few minutes, so be it. I'm more reluctant to disable mutual authentication, since that increases the likelihood of successful source spoofing.
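On the receiving side, those requirements might translate into roughly the following rsyslog sketch (illustrative only: certificate paths, the storage path on the Cinder volume and the template name are all placeholders; `StreamDriver.AuthMode="x509/certvalid"` assumes the mutual authentication discussed above):

```
# Illustrative receiver sketch; all paths are placeholders.
global(DefaultNetstreamDriverCAFile="/etc/rsyslog-certs/ca.pem"
       DefaultNetstreamDriverCertFile="/etc/rsyslog-certs/server.crt"
       DefaultNetstreamDriverKeyFile="/etc/rsyslog-certs/server.key")

# TLS listener; clients must present a certificate signed by the same CA.
module(load="imtcp" StreamDriver.Name="gtls"
       StreamDriver.Mode="1" StreamDriver.AuthMode="x509/certvalid")
input(type="imtcp" port="6514")

# One file per sending host, on the Cinder-backed ext4 volume.
template(name="PerHostAuthLog" type="string"
         string="/srv/syslog/%HOSTNAME%/auth.log")
auth,authpriv.* action(type="omfile" dynaFile="PerHostAuthLog")
```

Failing over to the standby server then amounts to moving the Cinder volume and pointing clients at the other host, as described above.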

Hi @Southparkfan, so if I understand it correctly, the next step would be to figure out:

  • How to do automatic provisioning of certificates (upon instance creation), automatic renewal and addition to the OS' trust store for local puppetmasters to be able to have mutual authentication.
  • Test if profile::base::remote_syslog_tls is enough to have multiple syslog servers (that is, the central one, and whichever the project wants)

Is that correct?
Let me know if you need unblocking/help on any of those, I don't have lots of time, but I can make some if that unblocks you.

I agree that syslog can go down for a few minutes during failover, that would not be an issue 👍

> [...]
>
> Is that correct?

Yes! In particular, the handling of mutual authentication for instances from Cloud VPS projects that employ local puppetmasters. Instances that use the cloud-wide puppetmaster must receive a valid certificate[1] from that puppetmaster, otherwise the integrity of the hostname in a syslog message is substantially lowered.

> Let me know if you need unblocking/help on any of those, I don't have lots of time, but I can make some if that unblocks you.

Looking forward to ideas regarding establishing a chain of trust between the central syslog server and all of its clients (i.e. Cloud VPS Instances).

[1] Signed by a certificate authority that is 'trusted' by rsyslogd on the central syslog server.

Is it possible to manually specify the CA path (on both ends) instead of adding the CA used here into the system trust store?

> Yes! In particular, the handling of mutual authentication for instances from Cloud VPS projects that employ local puppetmasters. Instances that use the cloud-wide puppetmaster must receive a valid certificate[1] from that puppetmaster, otherwise the integrity of the hostname in a syslog message is substantially lowered.

Since all puppetmasters do some trickery with the central puppetmaster before starting to use local ones, we might be able to conditionally copy the central puppetmaster certs to some other location based on the puppetmaster variable (to grab them before they are overwritten with the local puppetmaster certs).

Note that the central puppetmasters don't do particularly strong verification for the certificates they give out, as the system was not designed to hold any secrets. The duplicate prevention + instance-must-exist checks might however be good enough for this use case.

In T127717#7875618, @Majavah wrote:

> Is it possible to manually specify the CA path (on both ends) instead of adding the CA used here into the system trust store?

As long as a file containing the CA's root certificate (and intermediates, if applicable) and a machine certificate signed by that CA (restricted to client authentication via Extended Key Usage) are present on all syslog clients, that's fine. In hindsight, addition to the OS' trust store (which impacts all applications on that machine that do not manage their own trust store) introduces a vulnerability should the (root) CA's private key ever be compromised.
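As a hedged illustration of such a restricted machine certificate, a throwaway one can be minted and inspected with OpenSSL (1.1.1 or newer for `-addext`; the names are hypothetical):

```shell
# Hypothetical example: mint a throwaway client certificate whose Extended
# Key Usage is restricted to client authentication, then inspect the EKU.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout client.key -out client.crt \
  -subj "/CN=instance-1.example.wmcloud.org" \
  -addext "extendedKeyUsage=clientAuth"

# The EKU lists "TLS Web Client Authentication" only, so the certificate
# cannot be (ab)used as a server certificate.
openssl x509 -in client.crt -noout -text | grep -A1 "Extended Key Usage"
```

In the real setup the certificate would of course come from the chosen CA, not be self-signed.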

> Yes! In particular, the handling of mutual authentication for instances from Cloud VPS projects that employ local puppetmasters. Instances that use the cloud-wide puppetmaster must receive a valid certificate[1] from that puppetmaster, otherwise the integrity of the hostname in a syslog message is substantially lowered.
>
> Since all puppetmasters do some trickery with the central puppetmaster before starting to use local ones, we might be able to conditionally copy the central puppetmaster certs to some other location based on the puppetmaster variable (to grab them before they are overwritten with the local puppetmaster certs).
>
> Note that the central puppetmasters don't do particularly strong verification for the certificates they give out, as the system was not designed to hold any secrets. The duplicate prevention + instance-must-exist checks might however be good enough for this use case.

Thanks! I am not sure how these checks work, unfortunately.

Regarding alternative options: a weaker chain of trust could be established by logging the source IP of the syslog client in the syslog message, instead of using the client-provided hostname. Client authentication (TLS) is a mitigation against wrong hostnames, but in a reasonably secured IP network, IP address spoofing against applications listening via TCP only seems unlikely.

Any update on automatic provisioning and renewal of certificates for the client machines? I am not sure how to automate this process on my own, help is needed for that.

If it's not feasible to do this short-term, using source IPs as client identifiers (T127717#7875635) is an alternative option. Given that this syslog traffic uses TCP, is east-west and therefore won't cross many trust boundaries, the integrity of the source IP address should be enough for this use case, at least for now. Attacks exist and I am not sure what kind of IP spoofing mitigations are applied to the Cloud VPS network (question for netops?), but is it worth dropping this option? Temporary decisions have a habit of becoming permanent, but log tampering is a threat too. Storing logs on remote, hardened systems is in any case better than dealing with logs on potentially compromised systems.

If this is OK, I can write a puppet patch to edit the syslog {message,file} format (hostname -> IP) and remove the requirement for client authentication (server authentication only).
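A sketch of what the omfile side of such a patch could look like (hypothetical template name and storage path; `fromhost-ip` is a standard rsyslog property holding the peer's IP address):

```
# Illustrative sketch: key the stored file on the TCP source address
# instead of the client-supplied HOSTNAME field, which is trivially
# spoofable without client certificates.
template(name="SourceIpAuthLog" type="string"
         string="/srv/syslog/%fromhost-ip%/auth.log")
auth,authpriv.* action(type="omfile" dynaFile="SourceIpAuthLog")
```

The same property could likewise be substituted for the hostname in the message template itself.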

Sorry for the slow response here. I also don't see a clear way to provision those certs, so I think relying on source IP is probably good for this pass. It's already the case that we can't fully 'trust' log messages originating from within cloud-vps projects; I suspect that the risk of a DDoS attack is already present even if we have certified logs.

Please lmk if I'm missing a more obvious threat.

Change 816046 had a related patch set uploaded (by Southparkfan; author: Southparkfan):

[operations/puppet@production] rsyslog: allow specifying TLS client auth settings and filename property

https://gerrit.wikimedia.org/r/816046

> Sorry for the slow response here. I also don't see a clear way to provision those certs, so I think relying on source IP is probably good for this pass. It's already the case that we can't fully 'trust' log messages originating from within cloud-vps projects; I suspect that the risk of a DDoS attack is already present even if we have certified logs.
>
> Please lmk if I'm missing a more obvious threat.

Cool! Not sure what the relationship with a DDoS is, though :). If you have time, you can review the patch above. I wasn't too sure about the locations of the default hiera: some things have to be defined in cloud.yaml, others in common/, ...

Change 816046 merged by Andrew Bogott:

[operations/puppet@production] rsyslog: allow specifying TLS client auth settings and filename property

https://gerrit.wikimedia.org/r/816046

@Andrew and I have spent this evening on the initial set up of two WMCS-wide syslog servers. Those work fine. However, this setup is broken for all Cloud VPS instances that do not use the central puppetmaster.

Background
Andrew has created service IPs syslogaudit1.svc.eqiad1.wikimedia.cloud and syslogaudit2.svc.eqiad1.wikimedia.cloud. Both IPs point to respectively syslog-server-audit01.cloudinfra.eqiad1.wikimedia.cloud and syslog-server-audit02.cloudinfra.eqiad1.wikimedia.cloud. Both servers run rsyslogd and listen on 6514/tcp.

TLS is used for one-way authentication only: the syslog server's leaf certificate must be valid. If I'm reading properly, this means that the leaf certificate must be signed by a certificate authority 1) whose root certificate is present in /var/lib/puppet/ssl/certs/ca.pem (on the syslog client) (Wikimedia's setup), or 2) whose root certificate is present in the syslog client's OS trust store (most setups out in the wild). Validation of subjectAltName is only mandatory for x509/name; security-wise, I would recommend that, but let's keep the current setting for now.

By default, the host's leaf certificate from the Puppet CA is exposed (using puppet::expose_agent_certs) to rsyslogd. In other words, the certificate presented on 6514/tcp is signed by the Puppet CA from the central puppetmaster (you could swap this out with a cloud PKI or a real certificate authority, but to me, either option requires a non-trivial amount of work). This means that syslog clients must be aware of the correct root certificate of the central puppetmaster's Puppet CA, in order to connect to the syslog servers. On all production servers and Cloud VPS instances, profile::base::certificates is already used to add various root certificates to the OS' trust store (instructions), including the root certificate of the Puppet CA of the puppetmaster this syslog client is using. Not very coincidentally, this root certificate is also present in /var/lib/puppet/ssl/certs/ca.pem.
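In rsyslog's legacy directive syntax, the one-way validation described above corresponds roughly to this client-side sketch (the CA path is the Puppet one mentioned above; everything else mirrors the defaults, not the exact WMCS config):

```
# Client-side sketch: validate the server's leaf certificate against the
# Puppet CA bundle; the client presents no certificate of its own.
$DefaultNetstreamDriver gtls
$DefaultNetstreamDriverCAFile /var/lib/puppet/ssl/certs/ca.pem
$ActionSendStreamDriverMode 1                     # require TLS
$ActionSendStreamDriverAuthMode x509/certvalid    # chain check only, no name match
```

With x509/certvalid, any leaf signed by a CA in that file is accepted, which is exactly why the client must know the right root certificate.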

Problem
This results in two situations:

  • Syslog client is aware of the root certificate: OK, root certificate is present in trust store and /var/lib/puppet/ssl/certs/ca.pem, connection to the syslog server will succeed
  • Syslog client is not aware of root certificate: broken, root certificate is not in trust store and /var/lib/puppet/ssl/certs/ca.pem is different from the root certificate used to sign the syslog server's certificate, connection will fail due to StreamDriverAuthMode="x509/certvalid"

Tailored to our environment:

  • Syslog client uses the same puppetmaster (no local project puppetmaster) as syslog-server-audit0{1,2}: OK, root certificate is present in trust store (added via profile::base::certificates) and /var/lib/puppet/ssl/certs/ca.pem (because it's the Puppet CA). Connection will succeed
  • Syslog client uses a different puppetmaster (local project puppetmaster) than syslog-server-audit0{1,2} do: broken, root certificate is not in trust store and /var/lib/puppet/ssl/certs/ca.pem is different from the root certificate used to sign the client certificate for the puppet agent on syslog-server-audit01 and syslog-server-audit02. Connection will fail due to StreamDriverAuthMode="x509/certvalid"

Solution
All of these proposals must result in the following: the syslog client recognises the root certificate that was used to sign the leaf certificate used by rsyslogd on the syslog server.

  1. Distribute central Puppet CA root certificate through puppet, import it into the trust store, change DefaultNetstreamDriverCAFile (on the syslog clients) to /etc/ssl/certs/ca-certificates.crt (trust store)
    • Pros: easiest solution of all, other applications (also relying on trust store for validation) can benefit from the extra PKI capabilities, the syslog client can send its logs to syslog servers with a leaf certificate signed by any trusted certificate authority
    • Cons: extra work upon CA rotation, changing the subjectAltName is cumbersome, and compromise of the private key of the central Puppet CA root certificate compromises the authenticity of any traffic secured with TLS originating from the syslog clients (= the central Puppet CA could be abused to impersonate other servers). I do not think the latter is a major issue; if the central Puppet CA is compromised, > 90% of the Cloud VPS instances must be considered fully compromised.[1]
  2. Distribute central Puppet CA root certificate through puppet, change DefaultNetstreamDriverCAFile (on the syslog clients) to this location
    • Pros: easy solution; if the private key of the central Puppet CA root certificate is compromised, the TLS message authenticity is only violated for syslog traffic, not for non-syslog traffic secured with TLS
    • Cons: extra work upon central Puppet CA root certificate rotation, changing the subjectAltName is cumbersome, and rsyslogd at the client side cannot push its logs to multiple syslog servers that use different root certificates. To my knowledge, this means deployment-prep (which uses a custom syslog setup) could not forward logs to its syslog server, unless that syslog server's leaf certificate was also signed by the central Puppet CA (which is very cumbersome and tends to be a misuse of a Puppet CA)
  3. Get a certificate from a real, trustworthy certificate authority (Let's Encrypt, GlobalSign, ...)
    • Pros: already present in trust store, does not require fiddling with Puppet CA or trust store (only DefaultNetstreamDriverCAFile needs to be changed), no issues with choosing subjectAltName, more mature PKI and better than relying on the central Puppet CA (separation of concerns)
    • Cons: I doubt wikimedia.cloud is a valid domain name, although syslogaudit1.svc.wmcloud.org sounds valid. Unless these certificates can be distributed to the syslog servers via acme-chief, this process involves a lot of work for each renewal.
  4. Use a cloud-wide PKI
    • Pros: easy to integrate with Puppet and cfssl CLI, no issues with choosing subjectAltName, more mature PKI and better than relying on the central Puppet CA (separation of concerns). May be very useful for automatically provisioning client certificates (for mutual authentication) in the future.
    • Cons: if this service does not have production status yet (e.g. appropriate security measures, PKI servers live in cloudinfra), we'll need to wait until this service is up and running? The likelihood of a cloud-wide PKI compromise must be low.
  5. Disable TLS verification
    • Pros: disable TLS and call it a day, everything works.
    • Cons: the authenticity of a syslog message is greatly reduced. I do not recommend doing this. Really.

[1] A compromised syslog server is harmful for the authenticity of the audit logs as well. There are various ways to mitigate that impact, but that is not within the scope of this problem definition.
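For illustration, the client-side difference between options 1-3 is essentially just which CA bundle DefaultNetstreamDriverCAFile points at; a hedged sketch (all paths are examples, not actual config):

```
# Option 1: point rsyslog at the full OS trust store
$DefaultNetstreamDriverCAFile /etc/ssl/certs/ca-certificates.crt

# Option 2: a puppet-distributed copy of the central Puppet CA root only
#$DefaultNetstreamDriverCAFile /etc/rsyslog.d/central-puppet-ca.pem

# Option 3: the chain of a public CA (file name illustrative)
#$DefaultNetstreamDriverCAFile /etc/ssl/certs/ISRG_Root_X1.pem
```

Options 4 and 5 differ on the server/PKI side rather than in this client setting.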

@fnegri is there any chance for overlap between this situation and the PKI things you're doing for spicerack?

> Cons: I doubt wikimedia.cloud is a valid domain name

It is! I don't see any (technical) reasons why acme-chief couldn't issue certs for these names.

@Andrew I don't think there's any overlap unfortunately. Solution 3 seems the best one to me, but others would also be fine as long as we're using TLS.

Change 865174 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] rsyslog: allow specifying a hiera-defined certfile

https://gerrit.wikimedia.org/r/865174

Change 865184 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] remote syslog: allow rsyslog client to use root CA

https://gerrit.wikimedia.org/r/865184

Status: we chose #3 (Let's Encrypt via acme-chief). We've gotten stuck on a bug in the gnutls driver for rsyslog: T324623

Change 866628 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Turn on central auth logging for all eqiad1 VMs

https://gerrit.wikimedia.org/r/866628

Change 865174 merged by Andrew Bogott:

[operations/puppet@production] rsyslog: allow specifying a hiera-defined certfile

https://gerrit.wikimedia.org/r/865174

Change 865184 merged by Andrew Bogott:

[operations/puppet@production] remote syslog: allow hiera config of rsyslog TLS CA

https://gerrit.wikimedia.org/r/865184

Change 866628 merged by Andrew Bogott:

[operations/puppet@production] Turn on central auth logging for all eqiad1 VMs

https://gerrit.wikimedia.org/r/866628

Change 867709 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Fix rsyslogd $cert_file when using acme certs

https://gerrit.wikimedia.org/r/867709

Change 867709 merged by Andrew Bogott:

[operations/puppet@production] rsyslog::receiver: fix cert_file when used with acme certs

https://gerrit.wikimedia.org/r/867709

Change 876248 had a related patch set uploaded (by Southparkfan; author: Southparkfan):

[operations/puppet@production] rsyslog: allow subject name validation

https://gerrit.wikimedia.org/r/876248

Change 876251 had a related patch set uploaded (by Southparkfan; author: Southparkfan):

[operations/puppet@production] profile::base: fix hiera key name fox tls_client_auth

https://gerrit.wikimedia.org/r/876251

Change 876251 merged by Andrew Bogott:

[operations/puppet@production] profile::base: fix hiera key name for tls_client_auth

https://gerrit.wikimedia.org/r/876251

I have expanded https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Auth_logging. The 'known limitations' section shows there is enough work to do, but to avoid a never ending task, I am fine with resolving this task when T127717#8505600 has been applied to Cloud VPS. I find the lack of monitoring to be a blocker too, though.

Change 876248 merged by Cwhite:

[operations/puppet@production] rsyslog: allow subject name validation

https://gerrit.wikimedia.org/r/876248
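With that change merged, client-side subject name validation might be enabled along these lines (an illustrative sketch using rsyslog's legacy directives; the wildcard is an assumption covering the service names mentioned earlier in this task):

```
# Illustrative sketch: in addition to chain validation, require the server
# certificate's subject name to match a permitted peer.
$ActionSendStreamDriverMode 1
$ActionSendStreamDriverAuthMode x509/name
# Matches syslogaudit1/2.svc.eqiad1.wikimedia.cloud (wildcard assumed)
$ActionSendStreamDriverPermittedPeer *.svc.eqiad1.wikimedia.cloud
```

This closes the gap noted earlier: with x509/certvalid alone, any certificate signed by the trusted CA would be accepted.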

Change 945638 had a related patch set uploaded (by Southparkfan; author: Southparkfan):

[operations/puppet@production] Cloud VPS: enable rsyslog subject name validation in eqiad1

https://gerrit.wikimedia.org/r/945638

Change 945638 merged by Andrew Bogott:

[operations/puppet@production] Cloud VPS: enable rsyslog subject name validation in eqiad1

https://gerrit.wikimedia.org/r/945638

@Southparkfan We're trying to reduce use of Buster in cloud-vps, and two servers in 'auditlogging' are running Buster: syslog-server-04 and syslog-client04. My recollection is that they're redundant now that server-05 and client05 exist (and are running bookworm) -- is that right? Can the 04 VMs be removed?

> @Southparkfan We're trying to reduce use of Buster in cloud-vps, and two servers in 'auditlogging' are running Buster: syslog-server-04 and syslog-client04. My recollection is that they're redundant now that server-05 and client05 exist (and are running bookworm) -- is that right? Can the 04 VMs be removed?

The purpose of having syslog servers on multiple operating systems is to verify compatibility. As you might have seen, sometimes, rsyslog requires OS-specific changes to work properly.

If you don't mind potentially breaking Buster compatibility in the future, or if support should be removed right away, then these servers are OK to go.

>> @Southparkfan We're trying to reduce use of Buster in cloud-vps, and two servers in 'auditlogging' are running Buster: syslog-server-04 and syslog-client04. My recollection is that they're redundant now that server-05 and client05 exist (and are running bookworm) -- is that right? Can the 04 VMs be removed?
>
> The purpose of having syslog servers on multiple operating systems is to verify compatibility. As you might have seen, sometimes, rsyslog requires OS-specific changes to work properly.
>
> If you don't mind potentially breaking Buster compatibility in the future, or if support should be removed right away, then these servers are OK to go.

That's a good point. We'll save these for a bit later in the Buster deprecation cycle. Thanks!

>>> @Southparkfan We're trying to reduce use of Buster in cloud-vps, and two servers in 'auditlogging' are running Buster: syslog-server-04 and syslog-client04. My recollection is that they're redundant now that server-05 and client05 exist (and are running bookworm) -- is that right? Can the 04 VMs be removed?
>>
>> The purpose of having syslog servers on multiple operating systems is to verify compatibility. As you might have seen, sometimes, rsyslog requires OS-specific changes to work properly.
>>
>> If you don't mind potentially breaking Buster compatibility in the future, or if support should be removed right away, then these servers are OK to go.
>
> That's a good point. We'll save these for a bit later in the Buster deprecation cycle. Thanks!

Looking at the Wikimedia upgrade policy, regardless of whether that's also applicable to Cloud VPS, we're already overdue. I guess it's best to remove these servers whenever 'most Buster instances' have been removed. There is no critical data on these servers, so your deprecation schedule should not be impacted. Security-wise, the Buster syslog server can be removed just fine, it's just the Buster syslog client that needs to stay. I'll delete the Buster syslog server.

By the way, thanks for upgrading the puppetmaster :-)

Mentioned in SAL (#wikimedia-cloud) [2024-03-28T23:45:58Z] <Southparkfan> Deleted syslog-server-04 (Buster) per T127717#9671931