[mwcli] reporting on usage
Closed, ResolvedPublic

Description

It would be great to have some sort of metrics / telemetry on the usage of the various commands that make up part of mwcli.

This could be an optional thing, and a question asked at initial install, held in the config.
If people agree to the telemetry we would periodically ping back to eventlogging information about the number of times that commands have been run, in order to gauge usage and see what parts of the CLI are important to people.
Also version information.
A simpler version could be only reporting metric pings to statv?

Does this sort of data collection need sign off from someone?
Or can we just go ahead and write the schema, and do it as long as it is clear and opt-in only?

Event Timeline

Addshore moved this task from Inbox to Discuss & Decide on the mwcli board.

It was suggested that the best route to answer these questions was perhaps via Privacy and Privacy Engineering toward legal, so tagging as required hoping to catch someone's eye!

Hey @Addshore -- thanks for bringing that to the Privacy Engineering team's attention. My understanding is that the metric tool would work as below:

  • Upon initial install of mwcli, users will be asked whether to opt in to usage reporting.
  • At a given frequency, some information will be sent to eventlogging, indicating the commands that users typed while using mwcli and how many times they typed each command.
  • The reporting system will only report commands that users typed and that are listed in the mwcli documentation, and no other type of information (e.g. other terminal commands that users typed, fingerprints or other personally identifying information).

Do I have an accurate understanding of how the reporting system works?

sguebo_WMF triaged this task as Medium priority.Oct 25 2021, 4:08 PM

Yes, that is an accurate description of the plan.
The level of detail would only include known things about the environment, and not custom user-provided data.
This could include options and the values of those options when they are already known by mwcli (hardcoded options).
In cases where some custom value is used I would perhaps report "CUSTOM".

For example, after a user has gone through a whole setup and played around with the environment a bit, we might get something like this (at the highest detail level planned) the next time mwcli attempts to send data back.

  • 4x docker mediawiki create
  • 2x docker mysql create
  • 2x docker mediawiki install --dbtype=mysql --dbname=default
  • 2x docker mediawiki install --dbtype=sqlite1 --dbname=CUSTOM
  • 22x docker mediawiki exec
  • 4x codesearch search --output=ack
  • 1x codesearch search --output=table
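
The "known values or CUSTOM" idea described above could be sketched in Go roughly as follows. This is only an illustration under assumptions: the `knownValues` table, the `sanitize` and `reportKey` helpers, and the choice to sort option names are all hypothetical and not taken from mwcli's code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// knownValues lists, per option, the values the tool itself knows about.
// These entries are illustrative assumptions, not mwcli's real option table.
var knownValues = map[string][]string{
	"dbtype": {"mysql", "sqlite", "postgres"},
	"dbname": {"default"},
	"output": {"ack", "table"},
}

// sanitize returns value unchanged when it is a known value for option,
// and the placeholder "CUSTOM" otherwise, so user-provided strings are
// never part of what gets counted.
func sanitize(option, value string) string {
	for _, known := range knownValues[option] {
		if value == known {
			return value
		}
	}
	return "CUSTOM"
}

// reportKey builds the string that would be counted for one invocation.
func reportKey(command string, options map[string]string) string {
	names := make([]string, 0, len(options))
	for name := range options {
		names = append(names, name)
	}
	sort.Strings(names) // stable ordering so equal invocations share a key

	parts := []string{command}
	for _, name := range names {
		parts = append(parts, fmt.Sprintf("--%s=%s", name, sanitize(name, options[name])))
	}
	return strings.Join(parts, " ")
}

func main() {
	counts := map[string]int{}
	counts[reportKey("docker mediawiki install", map[string]string{"dbtype": "mysql", "dbname": "default"})]++
	counts[reportKey("docker mediawiki install", map[string]string{"dbtype": "mysql", "dbname": "janedoe-db"})]++
	for key, n := range counts {
		fmt.Printf("%dx %s\n", n, key)
	}
}
```

With this shape, a database name such as `janedoe-db` never leaves the machine; only the fact that a non-default name was used is counted.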

Thanks for providing these details, @Addshore.

The level of detail would only include known things about the environment, and not totally custom user provided data.

You mentioned earlier that "version information" would be included in the report too. Could you please clarify whether "version information" here refers to the version of MediaWiki, mwcli or something else?

The version information of mwcli.
So this would for example be v0.6.0 or a similar known string.

One other thing that would be great to include in these events is some sort of unique key for the installation of the tool.
I imagine the most privacy-conscious way of doing this would be to generate a random and unique string when metrics reporting is set up, which could be sent along with the counts of these actions?
Then the data could actually be analysed in slightly more useful ways.

4x codesearch search --output=ack

@Addshore what would we do with this? It seems kind of invasive to me. The other reported events look OK.

Is the invasive part of this the fact that the selected output is also reported?

ack in this case is one of 2 possible output formats, the other being table. Both are strings known by the application.

The desire here is the same as for all of the other details: to know what is used and makes sense to maintain, versus what perhaps does not.

I'm sorry I misread it! I thought you were proposing to log the search terms being passed to codesearch, e.g. "mw codesearch search 'blah blah blah'". I agree that the output format is useful.

Hey @Addshore, on behalf of Security-Team, I reviewed the privacy risks inherent to the proposed usage reporting feature for mwcli. I’ll share my conclusions below.

The intent behind this feature is to better understand how mwcli, a command-line interface, is utilized by MediaWiki users. To do so, a usage reporting feature will be added to mwcli, sending to the Event Platform, at a given frequency, some statistics about the commands that were typed by users while using the command-line tool. The envisioned feature presents one major privacy risk: fingerprinting users by gathering information so unique that it becomes personally identifying.

Counting the number of times standard mwcli commands are typed in the terminal, in and of itself, does not disclose any information that could be linked to a user’s identity. It does not necessarily provide ill-intended actors with useful information either.

However, if the reported commands include custom strings such as --dbname=CUSTOM, or "a sort of unique key for the installation", caution must be taken. Here, the reporting system would not only be uniquely identifying the installation, but it would also uniquely identify the user's device. This fingerprinting would invade the user’s privacy unduly. Furthermore, in the event that the database name contains PII such as someone’s real name (eg: --dbname=janedoe-mediawiki-db), sharing that information with the remote server would also disclose that sensitive information, while jeopardizing the user’s privacy.

A number of mitigations could be considered to reduce the privacy risks.

  • If the end goal is to understand how mwcli is utilized by users, then fingerprinting those users may not be of absolute necessity, albeit useful. Therefore, the idea of collecting a unique device identifier could be discarded. Similarly, the privacy risk associated with storing database names outweighs the statistical benefit behind this metric. Accordingly, this parameter could be discarded.
  • If the collection of database names and insertion of a unique identifier are essential to that usage reporting system, users should be informed, in a clear way and before starting to use the command-line interface, what information will be collected about their device.
  • In any case, and in line with your intention to make this feature opt-in, users should be presented with the possibility to opt for this feature, or not.
  • An additional safeguard would be to ensure that the monitoring of mwcli commands is not expanded to other commands typed in the terminal by end users.

In line with our team's threat modeling, the privacy risk with this feature was initially rated as MEDIUM. If the mitigations above are taken into account, the risk would be reduced down to LOW, which could be accepted automatically by the stakeholder behind mwcli — WMDE, I presume.

However, if the reported commands include custom strings such as --dbname=CUSTOM, or "a sort of unique key for the installation", caution must be taken. Here, the reporting system would not only be uniquely identifying the installation, but it would also uniquely identify the user's device. This fingerprinting would invade the user’s privacy unduly. Furthermore, in the event that the database name contains PII such as someone’s real name (eg: --dbname=janedoe-mediawiki-db), sharing that information with the remote server would also disclose that sensitive information, while jeopardizing the user’s privacy.

To clarify, my intent here was to indicate that a custom dbname was used, but not to send what the name is in the event.
CUSTOM is a placeholder, so all we would know is whether a custom dbname, other than default, was used, and not what the dbname itself is.

In line with our team's threat modeling, the privacy risk with this feature was initially rated as MEDIUM. If the mitigations above are taken into account, the risk would be reduced down to LOW, which could be accepted automatically by the stakeholder behind mwcli — WMDE, I presume.

Thanks, when implementing this I'll try to write a guide in the repo for what was covered in this ticket.
I could certainly make the first implementation also not include any unique id per installation.

I work on mwcli in my personal time and WMDE is not an agreed stakeholder here.
@jeena and Release-Engineering-Team I believe would be the folks accepting this risk.

Hi @Ottomata

I'd love some support in making sure I put this schema in the correct place.
The code that emits the event can be found at https://gitlab.wikimedia.org/repos/releng/cli/-/merge_requests/118/diffs#7765c347ab9491e09466e653e49d4a161be04db5_0_70
As far as I can tell, emitting a POST to https://intake-analytics.wikimedia.org/v1/events?hasty=true should be fine.

The current JSON that will be POSTed looks something like this:

	{
		"schema":  "MwCliCommandCounting",
		"$schema": "/analytics/todo/1.0.0",
		"event":   {
				"docker destroy": 12,
				"docker mediawiki create": 24,
				"docker redis create": 1,
				"docker redis destroy": 1
		}
	}

Am I missing any needed fields (as I am emitting this event from golang with no existing code support)?

Note that right now all of the keys within the event are dynamic.
Is this something that the schema can easily handle? Or should I wrap these in a single value?
That would be something like this:

	{
		"schema":  "MwCliCommandCounting",
		"$schema": "/analytics/todo/1.0.0",
		"event":   { "commands": {
				"docker destroy": 12,
				"docker mediawiki create": 24,
				"docker redis create": 1,
				"docker redis destroy": 1
		} }
	}

And looking at the schemas/event/secondary repository, should I go ahead and create a new mwcli directory in the jsonschema directory of that repo?
Or put this schema somewhere else?

Hello!

Am I missing any needed fields

https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Required_fields
Yes, you need the stream field. Do you need schema? Perhaps you meant that to be stream.
See also:

Note that right now all of the keys within the event are dynamic. Is this something that the schema can easily handle?

I don't think you need or want to put these inside of an event subobject; that is a legacy EventLogging schema convention that was coupled to the EventCapsule schema.

But, in general, the keys cannot be dynamic...unless you use a map field that explicitly declares the value types. https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#map_types
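
As a rough Go sketch of what that advice could look like on the emitting side: the top-level keys are fixed, and the dynamic command names live inside a single map field whose values are all integers. The field name `commands`, the stream name `mwcli.example` and the type names are assumptions made for illustration; the `$schema` value is the placeholder used earlier in this thread, and a real event may need further fields from the guidelines linked above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// meta carries the stream name that the guidelines list as required.
type meta struct {
	Stream string `json:"stream"`
}

// commandReport has only fixed top-level keys. The dynamic command names
// are confined to Commands, a map from string to integer counts.
type commandReport struct {
	Schema   string         `json:"$schema"`
	Meta     meta           `json:"meta"`
	Commands map[string]int `json:"commands"`
}

// buildReport serialises the counts. encoding/json writes map keys in
// sorted order, so the output is deterministic.
func buildReport(counts map[string]int) ([]byte, error) {
	return json.Marshal(commandReport{
		Schema:   "/analytics/todo/1.0.0",       // placeholder from the thread
		Meta:     meta{Stream: "mwcli.example"}, // hypothetical stream name
		Commands: counts,
	})
}

func main() {
	out, err := buildReport(map[string]int{
		"docker destroy":          12,
		"docker mediawiki create": 24,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```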

And looking at the schemas/event/secondary repository, should I go ahead and create a new mwcli directory in the jsonschema directory of that repo?

I'd say make a mwcli directory inside of analytics in that repository, so perhaps this schema is something like analytics/mwcli/command_executed (or command_run or something like that).

I am emitting this event from golang with no existing code support

Cool! Just FYI, we've had (and are still having) issues with unsupported clients as part of the EventLogging legacy migration. If we don't control the client, and the client code doesn't really have a team owning it, we can't do backwards-compatible migrations. Hopefully we won't have to do this for a long time (forever?), but if we ever have to deprecate the event intake /v1/events API, we will have a hard time tracking down this code. Anyway, please proceed, just wanted to add this note. :)

Change 745914 had a related patch set uploaded (by Addshore; author: Addshore):

[schemas/event/secondary@master] Add analytics/mwcli/command_report

https://gerrit.wikimedia.org/r/745914

Merged the mwcli PR, https://gitlab.wikimedia.org/repos/releng/cli/-/merge_requests/118#a19da6579ca04ae3e0dd3c807a7930f3bd0f212d

This results in simple collection of events such as:

{"$schema":"/analytics/mwcli/command_execution/1.0.0","command":"version","dt":"2022-01-14T19:37:07.726Z","meta":{"stream":"mwcli.command_execution"},"version":"latest"}
{"$schema":"/analytics/mwcli/command_execution/1.0.0","command":"config show","dt":"2022-01-14T19:37:11.603Z","meta":{"stream":"mwcli.command_execution"},"version":"latest"}
{"$schema":"/analytics/mwcli/command_execution/1.0.0","command":"config show","dt":"2022-01-14T19:39:08.488Z","meta":{"stream":"mwcli.command_execution"},"version":"latest"}
{"$schema":"/analytics/mwcli/command_execution/1.0.0","command":"gitlab repo search","dt":"2022-01-14T19:45:47.497Z","meta":{"stream":"mwcli.command_execution"},"version":"latest"}

where command will only include the known names of the commands and not any options, flags or parameters used.

Change 745914 merged by jenkins-bot:

[schemas/event/secondary@master] Add analytics/mwcli/command_execution

https://gerrit.wikimedia.org/r/745914

Change 755794 had a related patch set uploaded (by Addshore; author: Addshore):

[operations/mediawiki-config@master] Add mwcli.command_execute to wgEventStreams

https://gerrit.wikimedia.org/r/755794

Change 755794 merged by jenkins-bot:

[operations/mediawiki-config@master] Add mwcli.command_execute to wgEventStreams

https://gerrit.wikimedia.org/r/755794

Mentioned in SAL (#wikimedia-operations) [2022-01-24T12:24:21Z] <urbanecm@deploy1002> Synchronized wmf-config/InitialiseSettings.php: 296fe1644a2a71914e880f3562f8e32fd66c1637: Add mwcli.command_execute to wgEventStreams (T293583) (duration: 00m 48s)

This is all merged now and appears to be working.
This will be in the next release.

Having got this far, another super useful thing to know would be the OS and CPU arch the commands are being run on.
This could be done with GOOS and GOARCH.
These again have a set of known values.

But this is perhaps for a future iteration.

Change 873013 had a related patch set uploaded (by Addshore; author: Addshore):

[analytics/refinery@master] Allow list mwcli_command_execute command field

https://gerrit.wikimedia.org/r/873013

Change 873013 merged by Mforns:

[analytics/refinery@master] Allow list mwcli_command_execute command field

https://gerrit.wikimedia.org/r/873013