Remove geolocation metadata from Commons images
Open, Needs Triage · Public

Description

Summary
The Security-Team recently completed an audit of the configuration file maintain-views.yaml, in order to explore whether wiki-replicas pose some privacy risks for the contributors supporting Wikimedia projects. As part of the conclusions, it is recommended that latitude and longitude of Commons images be redacted from replicas.

Broader context
Some of the image files uploaded to Commons include longitude and latitude coordinates in their metadata, as the query below shows. This practice is well established in the Wikimedia Commons community and has various applications. However, it may pose privacy risks for the authors of geolocated images, as malicious actors may deduce their likely location if they take enough images in a specific area. While this may seem an edge case, the associated risks warrant some measures. Below is a SQL query disclosing the latitude and longitude of places volunteers have been, based on the geolocation information associated with an image they uploaded to Commons.

SELECT img_name, img_metadata, actor_name
FROM image
LEFT JOIN actor
ON img_actor = actor_id
LIMIT 100;
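
To illustrate what the recommended redaction would look like from a replica user's side, here is a minimal Python sketch using an in-memory SQLite database. The toy schema, the sample row, and the view name `image_redacted` are assumptions made for illustration; they are not the real replica schema or the contents of maintain-views.yaml. The sketch also blanks the whole `img_metadata` column, which is coarser than redacting only latitude and longitude.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a toy image/actor schema (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, actor_name TEXT);
        CREATE TABLE image (img_name TEXT, img_metadata TEXT, img_actor INTEGER);
        INSERT INTO actor VALUES (1, 'ExampleUser');
        INSERT INTO image VALUES
            ('Bridge.jpg', '{"GPSLatitude": 48.85, "GPSLongitude": 2.35}', 1);
        -- Hypothetical redacted view: same columns, metadata blanked out.
        CREATE VIEW image_redacted AS
            SELECT img_name, NULL AS img_metadata, img_actor FROM image;
        """
    )
    return conn


def query_redacted(conn):
    """Run the query from the description against the redacted view."""
    return conn.execute(
        """
        SELECT img_name, img_metadata, actor_name
        FROM image_redacted
        LEFT JOIN actor ON img_actor = actor_id
        LIMIT 100
        """
    ).fetchall()


conn = build_demo_db()
rows = query_redacted(conn)
```

With such a view in place, the same query still returns file names and uploader names, but the metadata column comes back empty.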

Event Timeline

Restricted Application added a subscriber: Aklapper.

I don't understand what the value of redacting geolocation metadata is; we already provide dumps of all this information at https://dumps.wikimedia.org/commonswiki/20210601/ (search for "geo_tags").

I feel like there is a missing parent ticket describing the goal of this audit and discussing whether it is reasonable to remove data from the wiki replicas that is available via public APIs and the MediaWiki interface. I broadly agree with the various risk analyses that I have read in this and other tickets, but I pretty strongly disagree with redacting public data to "preserve" privacy that actually does not exist.

Hey @bd808,

For context, the privacy engineering audit regarding the wiki-replicas is part of a body of work that is being done internally by the Security team. But I concur that a parent public ticket would have made sense too. I’ll keep that in mind moving forward.

I can totally understand the issue with redacting data from wiki-replicas while that same data is still available in other streams of information. That being said, the fact that PII is available through existing data sources doesn’t mean that having that identifying data exposed in the first place is a good thing. Nor does it mean that no measures should be taken to address the privacy risk (increased accessibility) that comes with PII being exposed in multiple places.

Of course, those recommendations, as the name suggests, aim at flagging some privacy issues while opening the floor for discussions around mitigations. As of now, the geo data present in the metadata poses some privacy concerns. This is the reason why redaction was recommended as a mitigation.

I do not feel that the technical community can reasonably engage in a discussion of mitigation without an explanation of the desired change in the world being proposed by the Security-Team. Taken as piecemeal efforts, these tickets seem nonsensically intended to disadvantage our most engaged technical contributors by removing otherwise public data from the systems that they rely on to operate various bots and tools which fill active workflow gaps for content curation and bad-faith edit remediation in the projects.

Removing data from the wiki replicas system does nothing to slow down the rate at which Google and other large scale Wikimedia content reusers extract content currently envisioned as PII via APIs and page scraping. Instead it removes functionality from the systems most likely to be used in good faith by the movement internally.

My interpretation of the original problem statement in this ticket is that there is a justified concern that some percentage of volunteers who have uploaded photographs taken with GPS-enabled devices, which embed lat/lon information in the Exif metadata attached to the image, were unaware that this has happened. The proposed response of hiding this information for all images will break tools which are designed to show images and articles based on geospatial search. This includes tools designed to help promote beloved movement initiatives such as https://www.wikilovesmonuments.org/.

The concern of consumer ignorance of "features" of their photographic equipment is justified. Not everyone is aware of automatic geotagging. Not everyone is aware of the potential "decloaking" of their normal activities and movements via examination of GPS-tagged content in aggregate. One response to this could be for MediaWiki to detect geo Exif data during image upload and require a positive opt-in to retaining that information in the stored image. The opt-in instructions could clearly and succinctly explain the potential misuse of this information and provide the content uploader the ability to give their informed consent to its use. This would be a privacy-increasing response which also helps fulfill the movement's larger mission of spreading knowledge. If it is believed that a non-trivial percentage of existing Exif geo data has been provided unintentionally, additional features could be developed sequentially or in parallel to seek retroactive consent from the uploaders of such data and purge the unintended information if consent is not received.
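
As a rough illustration of the opt-in idea, here is a minimal Python sketch, not MediaWiki code. It assumes the Exif tags have already been parsed into a plain dictionary keyed by tag name, and it relies on the fact that Exif GPS tag names start with `GPS` (for example `GPSLatitude`, `GPSLongitudeRef`). The function names and the `keep_location` flag are hypothetical.

```python
def find_geo_tags(metadata):
    """Return the sorted names of Exif tags that carry GPS information."""
    return sorted(name for name in metadata if name.startswith("GPS"))


def needs_consent_prompt(metadata):
    """True when the upload contains GPS tags and the uploader should be asked."""
    return bool(find_geo_tags(metadata))


def prepare_metadata(metadata, keep_location=False):
    """Return a copy of the metadata, dropping GPS tags unless the uploader opted in."""
    if keep_location:
        return dict(metadata)
    return {name: value for name, value in metadata.items()
            if not name.startswith("GPS")}


# Illustrative, made-up metadata for a single upload.
example = {
    "Make": "ExampleCam",
    "GPSLatitude": (48, 51, 30),
    "GPSLatitudeRef": "N",
    "GPSLongitude": (2, 21, 5),
    "GPSLongitudeRef": "E",
}

if needs_consent_prompt(example):
    # Default is privacy-preserving: location is removed unless explicitly kept.
    stored = prepare_metadata(example, keep_location=False)
```

The default of `keep_location=False` reflects the "positive opt-in" described above: location data is only retained when the uploader explicitly asks for it.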

Ultimately however, unless the source images are altered to remove the EXIF data and the originals destroyed, that information is public to anyone who downloads the original and parses the file. This is a more cumbersome and resource consuming method of obtaining the EXIF geo data, but it is completely within the license of the content on Commons and locally on other wikis. By putting a higher burden on aggregating this information it will be available directly to fewer requestors due to economic and time constraints. It will however still exist, and will likely be used by those consumers like Google and other nation state scale actors who have the resources available to collect and use it regardless of the time and compute resource complexity.
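
To make concrete how little work is left once the Exif GPS fields have been read out of a file, here is a small Python sketch of the final step: converting the degrees/minutes/seconds values and the N/S/E/W reference into signed decimal degrees. Reading the tags from the file is not shown, and the helper name `dms_to_decimal` is hypothetical.

```python
from fractions import Fraction


def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    if ref not in ("N", "S", "E", "W"):
        raise ValueError(f"unexpected hemisphere reference: {ref!r}")
    value = Fraction(degrees) + Fraction(minutes) / 60 + Fraction(seconds) / 3600
    # South and west are expressed as negative values in decimal notation.
    if ref in ("S", "W"):
        value = -value
    return float(value)


# Made-up coordinates, used only to exercise the function.
latitude = dms_to_decimal(48, 51, 30, "N")
longitude = dms_to_decimal(2, 21, 0, "W")
```

The cost for a would-be aggregator is therefore in downloading and parsing the originals at scale, not in interpreting the values.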

Informed discussion with the community requires this level of examination in my opinion. Remediation by exclusion in one location has benefits and drawbacks. "Leaking data is bad" is a simple formulation of a complex and nuanced topic. This ticket does not expose any internal discussions on the pros and cons to the rest of the community for further discussion. Instead it presents an argument from authority stating that the information should be removed from one aggregate collection because an SQL query can operate on the stored information.

Thank you for completing this audit! I appreciate the work and concerns you've raised, and I'm happy to discuss them.

However, in this instance I am also not in favor of taking actions to redact information. If there are concerns over what exif metadata (there are other potential sensitive fields around software, processing, camera, file path etc) is being uploaded, I would encourage you to request additional features be added to help the uploader confirm or remove parts of the exif metadata as part of the upload process. As it stands, the metadata is a valid and useful tool, and geotagged images have value.

Hey @nskaggs and @bd808,

Thanks for examining the privacy challenges above and exploring some of the implications of redacting geolocation metadata. Given the relevant inputs you shared above, I have re-applied our privacy threat modeling and I would like to provide an updated risk rating and analysis.

The geolocation metadata present in the replicas can be misused by various malign actors, including GAAFAM and state actors for identifying individuals, cyber criminals and activist groups for data collection and surveillance purposes, and bad faith editors for harassment. Threat actors with large resources like GAAFAM or state adversaries may extract PII via sophisticated means — automation, APIs and page scraping, extracting Exif data from the files themselves, as rightly noted above. However, the likelihood of that privacy harm materializing is low, since deducing where someone has been at some point does not necessarily disclose their current and accurate location. While the likelihood of that eventuality is low, a hostile state actor uncovering a contributor’s current geographic coordinates could have significant real-life consequences. With the above in mind, the overall privacy risk was categorized as MEDIUM.

As far as mitigations are concerned, I agree that redacting geolocation metadata, while being possible, will not prevent more resourced malign actors from accessing PII. Instead, it will mostly break existing tools and make it harder for users or contributors engaged in good-faith activities such as anti-vandalism work. Not to mention that, as pointed out above, replicas contain “other potential sensitive fields around software, processing, camera, file path etc”.

So I am fine putting the redaction idea aside in favor of “request[ing] additional features be added to help the uploader confirm or remove parts of the exif metadata as part of the upload process”. Furthermore, this mitigation, if implemented, would reduce the risk to LOW. Even if users cannot remove the metadata using the upload process, informing them very clearly that such geographic coordinates will be collected and made public will help decrease the overall privacy harm.

As per our team’s process, we surface risks, provide recommendations around mitigation, and identify the stakeholders that should own the risk and take relevant actions: either accept the risk or address the issue flagged. In this case, in line with T169097 (the parent ticket), an audit was performed and the privacy risk was rated as MEDIUM. It will be up to the relevant stakeholder or team to decide whether they want to add features to the uploader or accept the risk instead.

Is there a WMF team actively maintaining the uploader functionality who might be able to implement some of the suggested mitigations?

https://www.mediawiki.org/wiki/Developers/Maintainers#MediaWiki_extensions_deployed_at_Wikimedia_Foundation lists the Structured Data team as the responsible party.

...and see https://www.mediawiki.org/wiki/Developers/Maintainers under "MediaWiki core" for "File management" and "Uploading"

Thanks for the useful information, @Aklapper and @bd808.

sguebo_WMF added a subscriber: Seddon.

Hi @Seddon, I am adding this task to the Structured Data board and pinging you here, following the above discussion about some privacy concerns related to the Upload Wizard.

In a nutshell, the Exif metadata of photos uploaded through the wizard can compromise users' privacy. Ideally, when uploading photos they have taken themselves, users should be informed very clearly that identifying information about them will be collected (out of the files) and made public. The privacy risk was rated as MEDIUM. The Structured Data team can choose to add a feature to the uploader to inform users accordingly, which would reduce the risk to LOW. Otherwise, a manager of the team can choose to accept the risk. You can read more about the risk analysis and recommendation in an earlier comment of this thread.

sguebo_WMF added a parent task: Restricted Task. · Aug 6 2021, 5:49 PM

Hi @sguebo_WMF! Do you have suggestions about what language we should use to inform users that information about them will be collected via the EXIF data and made public? Or do you suggest we reach out to legal?

Hello @CBogen, and thanks for looking into this. I do not have specific language to provide you with, but I would recommend something along the lines of the “Note on privacy” used here. Please keep in mind that you will still have to run the text by Legal. I hope this is helpful to you.

Here is suggested language on EXIF metadata:

EXIF metadata may contain location or other personal data automatically added by your camera. Learn more about how to edit or remove EXIF metadata.

If you have any questions about the phrasing or other suggestions, I'm happy to discuss via email.