
Make archive table partially accessible on Wikimedia Labs
Closed, Resolved (Public)


On the Toolserver, users have limited access to the archive table. Tools use it to compute deleted edit counts and for other analysis.

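mysql> describe archive;
+--------------+------------------+------+---------+
| Field        | Type             | Null | Default |
+--------------+------------------+------+---------+
| ar_user      | int(5) unsigned  | NO   | 0       |
| ar_rev_id    | int(8) unsigned  | YES  | NULL    |
| ar_len       | int(8) unsigned  | YES  | NULL    |
| ar_page_id   | int(10) unsigned | YES  | NULL    |
| ar_parent_id | int(10) unsigned | YES  | NULL    |
+--------------+------------------+------+---------+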
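
As a rough illustration of the deleted edit count use case mentioned above, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for the replica, with only the columns listed in the describe output. The sample rows and the helper name `deleted_edit_count` are hypothetical; on a real replica the same `COUNT(*)` query would be run through a MySQL client instead.

```python
import sqlite3


def deleted_edit_count(conn, user_id):
    """Count archived (deleted) revisions attributed to one user ID."""
    row = conn.execute(
        "SELECT COUNT(*) FROM archive WHERE ar_user = ?", (user_id,)
    ).fetchone()
    return row[0]


# In-memory stand-in for the limited archive table (assumption: only the
# columns shown in the describe output above, with SQLite types).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE archive ("
    " ar_user INTEGER NOT NULL DEFAULT 0,"
    " ar_rev_id INTEGER,"
    " ar_len INTEGER,"
    " ar_page_id INTEGER,"
    " ar_parent_id INTEGER)"
)

# Hypothetical sample rows: two deleted revisions by user 42, one by user 99.
conn.executemany(
    "INSERT INTO archive VALUES (?, ?, ?, ?, ?)",
    [
        (42, 1001, 512, 7, None),
        (42, 1002, 530, 7, 1001),
        (99, 1003, 80, 8, None),
    ],
)

print(deleted_edit_count(conn, 42))
```

Because the query only touches `ar_user`, it does not depend on any of the columns that might be withheld for privacy reasons.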


From Cloud-Services earlier today (trimmed for relevance):
[10:15:38 AM] <legoktm> Oh yeah Coren, is there an eta for the archive table being available? and is there a bug tracking it?
[10:16:35 AM] <Coren> legoktm: I don't think there's a bug tracking it, and it's a couple weeks before I have a definitive answer.
[10:18:49 AM] <legoktm> Out of curiosity, is it a legal issue or a technical one that's holding it back?
[10:21:32 AM] <Coren> legoktm: Legal.
[10:23:01 AM] <Coren> legoktm: I can tell you offhand that, if it is going to be okayed at all, it will be on a per-case basis and likely require approval with a process similar to that of getting the researcher right.

Version: unspecified
Severity: normal



Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 1:44 AM)
bzimport added a project: Toolforge.
bzimport set Reference to bz49088.

(In reply to comment #0)

> [10:23:01 AM] <Coren> legoktm: I can tell you offhand that, if it is going to be okayed at all, it will be on a per-case basis and likely require approval with a process similar to that of getting the researcher right.

What is that process?

> What is that process?

That's actually part of the question Legal will have to solve. (It is, currently, on their desk).

Legal has approved replication of a suitably redacted archive table (in particular, it will not have edit summaries). However, replicating that table involves some "interesting" technical hurdles: it first requires a delicate (and long overdue) upgrade of the table itself on the masters, which is a dependency.

Our DB team is now aware of the request, and they'll be able to get on it as soon as resources and time allow.

Thanks for the update, Marc. Could you clarify whether this access will still require "a process similar to that of getting the researcher right"?

The copy on the Toolserver never had access to edit summaries; when you say "suitably redacted", are you comparing it to the actual archive table or the already limited version on the Toolserver described in comment 0?

No, the table will be generally available and won't require extra hoops. Also, the schema will be identical to production's; for the labs DB we have chosen to null the redacted columns rather than elide them entirely, which makes things easier for some tools.
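
To illustrate the "null rather than elide" choice, here is a small sketch in Python, again using an in-memory SQLite database as a stand-in. It is not the actual view definition used on the replicas: the view name `archive_redacted`, the reduced column set, and the sample row are all assumptions made for the example. The point is only that a view can keep the full column list while returning NULL for the sensitive fields, so a query that names those columns still runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the production table (assumption: a reduced column set).
conn.execute(
    "CREATE TABLE archive ("
    " ar_rev_id INTEGER,"
    " ar_user INTEGER,"
    " ar_comment TEXT,"
    " ar_text TEXT)"
)
conn.execute(
    "INSERT INTO archive VALUES (1001, 42, 'private summary', 'private text')"
)

# Hypothetical redacting view: same column names as the base table, but the
# sensitive columns are replaced with NULL instead of being dropped.
conn.execute(
    "CREATE VIEW archive_redacted AS "
    "SELECT ar_rev_id, ar_user, NULL AS ar_comment, NULL AS ar_text "
    "FROM archive"
)

cur = conn.execute("SELECT * FROM archive_redacted")
columns = [d[0] for d in cur.description]  # column names exposed by the view
rows = cur.fetchall()

print(columns)
print(rows)
```

A tool written against the production schema can therefore select `ar_comment` without raising an unknown-column error; it simply has to tolerate NULL values.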

Bug 49189 (a dependency) has been updated with status information by our DBA.

I'm copying Michelle from Legal so she can help us figure out the rationale for blanking edit summaries, in response to comment 3.

One possible reason (credit goes to Ironholds): "lazy" page deletion of content that should have been oversighted may leave a snippet of the original page stored in the summary of the first revision of that page, where it would be accidentally exposed if the summary were not redacted.

Not just OS (oversight), but revdel stuff too: it's not that uncommon to just delete a page rather than revdel each edit and then delete, since from a user POV it leads to the same outcome (content is only visible to sysops).

Legal's opinion has been accurately conveyed here. We are comfortable with putting the archive table in Labs as long as the edit summary is redacted. Dario also asked me about providing metadata. We are OK with this as long as the metadata provided does not include location information or IP addresses that are otherwise nonpublic.

The archive table should now be replicating to labs with ar_text and ar_comment redacted. The views should be ready shortly.

The views are now in place, and the redacted archive table should now be visible from the replicas. Note that it may take a while for replication to "catch up", especially on the larger wikis.