[INVESTIGATION] Investigate logs on Wikibase.cloud for GDPR Policy
Closed, ResolvedPublic

Description

We need to be able to answer the following questions so that we can accurately complete our Privacy Policy: What logs do we have? Where are we storing them? How long are we storing said logs (in days)?

Within this investigation we will:

  1. Document the current state of logging (answering the above questions)
  2. Re-evaluate the situation and decide if there are any logs which should be altered (data, duration) based on the purposes of keeping them

Expected outcome:
For the privacy policy, we need to give an explicit number of days we keep different logs before deleting them, and this number should be justifiable based on why we need them/what they are for. So, the number of days is added to the policy, but we should know why that number is the right one.

Note: look at logs both created and stored by the wbstack application and also those created by Google Cloud's infrastructure.

Event Timeline

Addshore subscribed.

Agreed in the daily to pick this up at the end of the current sprint.

We clearly store some logs in the Default logging bucket provided by Google Cloud; it has a retention period of 30 days.

We also have logs in the standard output of the containers, which appear to persist for a currently undetermined lifetime.

@toan mentioned we should also think about Mailgun.

Answers to the questions above are presented inline:

What logs do we have?

We seem to only have logs from our Kubernetes clusters. These consist of logs from:

  • Clusters
  • Nodes
  • Pods
  • Containers

Specifically, I was not able to find any logs from other parts of Google Cloud Engine.

None of the logs except the Containers logs appear to contain any user information. The others mostly consist of output from internal processes necessary to maintain the cluster's health.

The Containers logs consist of the output (to stderr/stdout) of the containers we are running. Some of these contain user requests, for example those from the nginx containers. It appears that only the logs from nginx-ingress-controller contain user IP addresses. Other logs contain IP addresses from inside the cluster.
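
One way to re-check which container logs carry user IP addresses is to scan exported log lines and separate private (cluster-internal) addresses from public ones. The following is a minimal sketch, not something that was run as part of this investigation. It assumes the logs are available as plain-text lines, that cluster-internal traffic uses private address ranges, and that only IPv4 matters; the function name and sample lines are hypothetical.

```python
import ipaddress
import re

# Loose IPv4 pattern; each candidate is validated with ipaddress below.
# Note: dotted version strings such as "1.2.3.4" would also match.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_public_ips(log_lines):
    """Return the set of non-private IPv4 addresses found in log_lines.

    Assumption: addresses in private ranges belong to the cluster,
    so anything else is treated as a potential user address.
    """
    public = set()
    for line in log_lines:
        for candidate in IPV4_PATTERN.findall(line):
            try:
                address = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # not a valid address, e.g. "999.1.1.1"
            if not address.is_private:
                public.add(str(address))
    return public


# Hypothetical sample lines, for illustration only.
sample = [
    '8.8.8.8 - - "GET / HTTP/1.1" 200',
    '10.4.0.12 - - "GET /healthz HTTP/1.1" 200',
]
print(find_public_ips(sample))
```

Running something like this per container name would show whether any source other than nginx-ingress-controller emits public addresses.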

Where are we storing them?

In addition to being held in memory and temporarily on the disk of the nodes, they are stored in a Google Cloud logging bucket (the default bucket).

Even for very long-running pods, logs do not appear to remain on the node or in memory for as long as they are persisted in the Google Cloud default logging bucket.

How long are we storing said logs (in days)?

The bucket has a retention time of 30 days.

We also have logs kept on our behalf by Mailgun. These have a retention period of 5 days.
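
For the privacy policy wording, the two retention periods translate into a simple date calculation. The sketch below only does that arithmetic; it assumes retention is counted from the time an entry is written, and the store names and function name are hypothetical labels. The actual deletion time is decided by each provider.

```python
from datetime import datetime, timedelta, timezone

# Retention periods in days, as found in this investigation.
RETENTION_DAYS = {
    "google_cloud_default_bucket": 30,
    "mailgun": 5,
}


def expiry_time(store, written_at):
    """Return the earliest time an entry written at written_at is due for deletion."""
    return written_at + timedelta(days=RETENTION_DAYS[store])


written = datetime(2022, 1, 1, tzinfo=timezone.utc)
for store in RETENTION_DAYS:
    print(store, expiry_time(store, written).date())
```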

toan removed toan as the assignee of this task.
Lea_WMDE claimed this task.