Make Wikimedia Enterprise Daily Exports and Diffs available to WMCS community
Open, Medium, Public

Description

Opportunity: Provide Wikimedia Enterprise daily dumps of "text-based" Wikimedia projects and the hourly diffs API for folks to use inside the WMCS environment.

Solution: As a first run at this, we decided to "allow-list" the WMCS IP ranges from inside the Enterprise application. This will allow folks within WMCS to use Enterprise as they need. No auth required.

Need to dos:

Event Timeline

RBrounley_WMF renamed this task from Tech Engagement Collaboration to Make Wikimedia Enterprise Daily Exports and Diffs available to WMCS community. Apr 26 2021, 3:52 PM
RBrounley_WMF updated the task description.
RBrounley_WMF moved this task from Blocked to In Progress on the Wikimedia Enterprise board.
RBrounley_WMF added a subscriber: ArielGlenn.

Made some updates to the description per our meeting with the Tech Engagement team, adding @ArielGlenn for line of sight.

What prevents someone from uploading the dailies from a WMCS instance to archive.org? Do we want to deter that, encourage it, have no opinion?

More generally, might it be perceived as problematic for the dailies to be available to WMCS users without restriction and not the broader community? I leave aside the hourly diffs as more specialized and different in kind than the full dumps.

Interesting question. When it all comes down to it, the data is probably some of the least secret on the web. So I'm not sure why we would care where it goes after it is accessed. My understanding about the enterprise offering is that it provides a high performance way of gathering lots of information quickly from our systems rather than a particular set of information. WMCS would not really be an enterprise class and highly scaled client from that perspective.

That's just my personal opinion on the archive.org piece anyway, as someone not even on their project team. I cannot say it aligns with everyone else's opinion :)

On a wider point (not just, but also including this I.A. example), we're going to need some extra wording in the existing Terms of Use, or a separate ToU just for accessing this dataset. Not sure when it needs to be undertaken (since that's a legal question not a tech question) but just flagging it here as it's related. Most specifically it needs to emphasise that the service (as separate from the content) can't be used for commercial purposes (either by themselves, or by on-selling the feed).

The I.A. use case you mention @ArielGlenn is not a problem - an individual dataset is not a problem - after all, the content will be available fortnightly via the dumps page too. So, someone going to the effort of uploading a daily to I.A. just for the sake of publicly archiving it is probably just wasting their computing power.

However what WOULD be concerning is if someone is accessing the feed for free via WMCS and then makes it freely available systematically to others (via I.A. or otherwise). Effectively creating a free bootleg daily Enterprise feed for any [including commercial] purposes (without the SLA).

So, in the Enterprise-WMCS ToU, along with a restriction on direct commercial use (and a restriction on on-selling), we should also include a restriction on the "systematic" provision of the feed to others. It would be important to not restrict "fair" sharing of the content (especially since the fortnightly dumps are available anyway). What we'd need to restrict is the systematic sharing - the bootleg recreation of the API.

Thanks for that clarification, Liam. The 'regular uploading to IA' example is the sort of thing I had in mind.

So, in the Enterprise-WMCS ToU, along with a restriction on direct commercial use (and a restriction on on-selling), we should also include a restriction on the "systematic" provision of the feed to others. It would be important to not restrict "fair" sharing of the content (especially since the fortnightly dumps are available anyway). What we'd need to restrict is the systematic sharing - the bootleg recreation of the API.

Dumps are commonly mirrored on third party servers. Should we ask people not to mirror the fortnightly materials to-be-shipped to the dumps.wikimedia.org systems, or is that bit fine?

My expectation was that those would have the same availability, including mirroring, as the dumps we produce in-house.

Dumps are commonly mirrored on third party servers. Should we ask people not to mirror the fortnightly materials to-be-shipped to the dumps.wikimedia.org systems, or is that bit fine?

I see no reason why we would want/need to add any special restrictions upon how the dumps are used/re-used. They ought to be available as consistently as the ‘normal’ dumps AFAIAC.
What we need to protect against is people using the free WMCS access to the Enterprise API to either:

  • make commercial use of it [either through their own project or by on-selling the access]
  • provide/recreate systemic access to it to others outside the WMCS platform.

Dumps are commonly mirrored on third party servers. Should we ask people not to mirror […]?

I see no reason why we would want/need to add any special restrictions […]. What we need to protect against is people using the free WMCS access to the Enterprise API to either:

  • make commercial use of it [either through their own project or by on-selling the access]
  • provide/recreate systemic access to it to others outside the WMCS platform.

I'm not a lawyer, but I believe both of these are indeed also protected against by the (current) Wikimedia Cloud ToU, in that (I think) it forbids commercial use, forbids creation of tools that mainly serve external users, and forbids abuse of our infrastructure/compute/access privileges for purposes that don't benefit our community.

On your second point I assume you mean specifically an "online" (real-time) systemic access, right? (e.g. disallow tools that proxy Enterprise API queries in real-time, which'd expose the heightened rate allowance to external users).

I'm not a lawyer, but I believe both of these are indeed also protected against by the (current) Wikimedia Cloud ToU, in that (I think) it forbids commercial use, forbids creation of tools that mainly serve external users, and forbids abuse of our infrastructure/compute/access privileges for purposes that don't benefit our community.

We’ll need Legal to check that this is indeed sufficient, but if so, then so much the better! The fewer the special rules required for the right to use this dataset, and the more consistent those rules are with the existing setup - the easier it is on everyone to understand, implement, and enforce.

On your second point I assume you mean specifically an "online" (real-time) systemic access, right? (e.g. disallow tools that proxy Enterprise API queries in real-time, which'd expose the heightened rate allowance to external users).

My technical ignorance is showing here, but I don’t want to pre-suppose a method by which ‘systematic’ access could be illegitimately provided. My comment was in specific response to Ariel’s hypothetical of someone uploading the daily/hourly files to the Internet Archive, but the principle should be the same regardless of the technical method: doing it once in a while or manually is NOT a problem, but automating it in a way that would enable a commercial third party to rely on it rather than the official paid service WOULD be a problem.

Regarding the issue of Enterprise appearing in CS (and also the Dumps) and how that might require special adjustments to the ToU, we'll be meeting with Legal to discuss it next week. FYI

Talking with legal, there are - in effect - four options here:

1 - Creating a separate ToU
2 - Integrating some extra bits into the [currently already being rewritten] Cloud Services ToU
3 - Turning a blind eye to the issue because the commercial and/or 'improper' usages of this service will be so minimal that they aren't worth bothering about
4 - Relying on existing technical limitations of the use of Cloud Services which sufficiently restrict such that we don’t require special ToU [or perhaps adding some new technical restriction??]

To determine which of these approaches is best, we first need to identify the 'worst case scenario': what a malicious actor could actually achieve, given access to the Enterprise datasets via CS under the current technical limitations of CS. Then we can see if route 3 or 4 is viable.

To that end: who would like to join @Bstorm, @RBrounley_WMF and myself, have a chat and talk through the scenario of "what's the worst that could happen". @Krinkle? @ArielGlenn ? others?

Based on a brainstorming of the 'worst case' risks, it seems that the existing technical limitations of CS are sufficient to mitigate against commercial activities which would compete with Enterprise's business model - based on speed and stability.
However, we have identified a separate risk of [subsequently] deleted revisions being visible in the hourly diff files - a PII/libel etc. concern. I will follow up with Stewards and WMF T&S to investigate further. @RBrounley_WMF will be creating a separate ticket for that topic, linking it to here, as it is not explicitly a concern of the business model.

Checked in with @nskaggs this morning. Looks like we are in good shape to get this moving.

WMCS team can point to our documentation on-wiki for folks to get a head start.

The Enterprise team needs WMCS folks to provide the IP range for the allow-listing so that it can be entered. @Sashah2, is this all you need from their end?

@Sashah2 @RBrounley_WMF

185.15.56.0/25 should cover the correct range.

Brooke also kindly pointed out we should include the range from our test setup

185.15.57.0/29

Yep. It should be enough. Thank you.
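For readers unfamiliar with CIDR allow-listing, the check described above can be sketched roughly as follows. This is only an illustration using Python's standard `ipaddress` module with the two ranges quoted in this thread; the names `WMCS_ALLOWED_NETWORKS` and `is_allowed` are hypothetical, and nothing here is taken from the actual Enterprise application.

```python
import ipaddress

# The two ranges quoted in this thread. Everything else in this sketch
# (names, structure) is hypothetical and not the Enterprise implementation.
WMCS_ALLOWED_NETWORKS = [
    ipaddress.ip_network("185.15.56.0/25"),  # main WMCS range
    ipaddress.ip_network("185.15.57.0/29"),  # WMCS test setup
]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside one of the allow-listed networks."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed addresses are never allowed.
        return False
    return any(addr in net for net in WMCS_ALLOWED_NETWORKS)


if __name__ == "__main__":
    for ip in ("185.15.56.1", "185.15.56.128", "185.15.57.7", "185.15.57.8"):
        print(ip, is_allowed(ip))
```

Note that a /25 covers only the first 128 addresses of 185.15.56.0/24 and a /29 covers only 8 addresses, so requests from neighbouring addresses would fall outside the list.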

@Sashah2 - checking in here, were you able to allow-list these IPs? After that we should be good to go?

@RBrounley_WMF, @Sashah2: Hi, the Due Date set for this open task passed a while ago.
Could you please either update or reset the Due Date (by clicking Edit Task), or set the status of this task to resolved in case this task is done? Thanks!

Hello @Aklapper !

Yep, we will update the status of the task. Thank you for the reminder.

Protsack.stephan changed Due Date from Jun 10 2021, 4:00 AM to Aug 13 2021, 4:00 AM. Aug 9 2021, 12:37 PM