
Improve mediawiki data redaction
Closed, Declined · Public · 0 Estimated Story Points

Description

Study how Sanitarium and the other tools work together to move data from the production wikis to labsdb. Work with Ops / DBAs to migrate that process to a clean, maintainable one.

Once we do that, we can do these other things in other tasks:

  • refactor the loading of labsdb to use the new sanitized data
  • refactor our history reconstruction to use the sanitized data (ideally this can be done through Tungsten as it reads data from the mysql binlog or as it loads from its staging tables, so we have more real-time data than we can get with sqoop)
  • refactor the dumps process

References:
T103011
https://wikitech.wikimedia.org/wiki/MariaDB/Sanitarium_and_Labsdbs
T138450
T143955

Also See:
Tungsten Mysql Replicator
labsdb auditor (great work by @yuvipanda that should help with a whitelist: https://github.com/wikimedia/operations-software-labsdb-auditor/). Yuvi, your thoughts on this are welcome here.

Event Timeline

Restricted Application added a subscriber: Aklapper. Sep 23 2016, 2:19 AM
Milimetric renamed this task from Refactor History Reconstruction and Dumps on top of cleaner to Refactor History Reconstruction and Dumps on top of cleaner edit data sanitizer. Sep 23 2016, 2:19 AM
Milimetric added a project: Analytics.
Nuria renamed this task from Refactor History Reconstruction and Dumps on top of cleaner edit data sanitizer to Improve mediawiki data redaction and refactor edit history reconstruction. Sep 26 2016, 3:38 PM
Nuria moved this task from Incoming to Wikistats Production on the Analytics board.
chasemp added a subscriber: chasemp. Oct 5 2016, 4:33 PM

Who does this refer to?

'And then we need to refactor our history reconstruction on top of that'

'our' is analytics? Coincidentally, we are in the midst of giving some time and attention to the labsdb setup this quarter. I don't know anything about what the history reconstruction here refers to or what it is used for, but I'm interested. What is the motivation behind this task? Reduce complexity by joining processes?

Hi @chasemp, I couldn't explain this with fewer words, sorry, let's chat in person if that's easier:

  • Analytics is exporting data from mediawiki databases to hadoop so we can build the new wikistats 2.0 pipeline on it. These are only dbs that back public projects (not private wikis or internal wikis). We are then doing this thing we call "history reconstruction", which means we're trying to recover lost information about the pages, users, namespaces, etc. In mediawiki, when the state of a page or user changes, their current state is recorded in the page or user table, and then some rows are inserted into the logging table with some information about the change. Over the last 11 years, that logging information varies in quality and some of it appears to have been deleted. We are trying to reconstruct as much of that as possible so that we end up with, for example: User X used to be named Y between 2008 and 2012, and before that they were named Z from 2005 to 2008 (this user is apparently going backwards through the alphabet : )). There's a rough sketch of this idea right after this list.
  • Currently, to get this data, we're sqooping the relevant tables from dbstore1002 (an analytics slave). That means we're getting potentially sensitive non-public data that would otherwise be redacted by Sanitarium.
  • I talked to Jaime to see if we can use the same process that Sanitarium uses to redact data, but I was told this was not a good idea and that Sanitarium and the supporting scripts need a lot of cleaning up.
  • We could sqoop data directly out of the labsdb databases, but from a performance and capacity point of view that seemed like a bad idea to me (we'd be hitting those machines pretty hard for full history dumps every week, and we can't even use mysqldump because of some column casts we need to do). Also, the next point gets at why I think this is backwards.
  • Ultimately, we want to dump the results of this reconstructed history back into labs. From a researcher point of view, this is a strict upgrade to the data currently offered in labsdb. It's a much, much simpler schema, has data from all wikis in one place, and will have a bunch of metrics we're computing as well, so you can query them like first-class attributes.
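
To make the "history reconstruction" idea above a bit more concrete, here is a minimal Python sketch. It is purely illustrative: the event tuples below are hypothetical and not the actual MediaWiki logging schema; the point is just how rename events become name intervals like the X/Y/Z example.

```python
from datetime import datetime

# Hypothetical rename events extracted from the logging table:
# (timestamp, old_name, new_name). Field layout is illustrative only.
rename_events = [
    (datetime(2008, 1, 15), "Z", "Y"),
    (datetime(2012, 6, 3),  "Y", "X"),
]

def reconstruct_name_intervals(events, current_name):
    """Turn rename events into (name, valid_from, valid_to) intervals."""
    intervals = []
    previous_boundary = None  # registration date is unknown in this sketch
    for when, old_name, new_name in sorted(events, key=lambda e: e[0]):
        # The old name was valid from the previous boundary until this rename.
        intervals.append((old_name, previous_boundary, when))
        previous_boundary = when
    # The name after the last rename is valid until "now".
    intervals.append((current_name, previous_boundary, None))
    return intervals

for name, start, end in reconstruct_name_intervals(rename_events, current_name="X"):
    print(f"{name}: {start or 'unknown'} -> {end or 'present'}")
```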

To me, an ideal solution would be:

  1. clean up Sanitarium to Jaime's standards and make it so it works with tungsten (a real-time sqoop)
  2. use Tungsten to move the clean public data out into Hadoop
  3. offer the plain data immediately in labsdb
  4. crunch the edit history reconstruction and metrics and dump that in labsdb too

But this is just my opinion, and if you're investing time in labsdb, we should definitely talk. Would be good to collaborate if we can.

Milimetric renamed this task from Improve mediawiki data redaction and refactor edit history reconstruction to Improve mediawiki data redaction. Oct 6 2016, 9:03 PM
Milimetric claimed this task.
Milimetric updated the task description.
Milimetric edited projects, added Analytics-Kanban; removed Analytics.
Milimetric set the point value for this task to 0.

FYI: I scoped this to "just" a refactor of Sanitarium and friends. I'll try to tackle it this quarter in between other work, but we're not committing to it. I'd appreciate any help in the form of links. I intend to first document the current approach and then brainstorm. I need to know:

  • what columns are redacted (I see Sanitarium but understand other places redact columns)
  • what the custom views in labsdb hide in addition to Sanitarium
  • what rows are redacted (I didn't find that in Sanitarium on first look, will look more)
  • any other places that data is redacted / hidden (skimmed the references in the task description, will read deeper)

Theoretically https://phabricator.wikimedia.org/diffusion/OSOF/browse/master/maintain-replicas/config.json is the authoritative place for this; in reality it may have been touched manually by operations over the past few months in ways that are difficult to discover without privileged access.

clean up Sanitarium to Jaime's standards and make it so it works with tungsten (a real-time sqoop)

I like the "I have no idea how it works", and all guesses of how it works (like Alex's) are wrong, but the replacement technology (tungsten) has already been decided. It even has time assigned for ops/DBAs. Plus:

I scoped this to "just" a refactor of Sanitarium and friends. I'll try to tackle it this quarter in between other work

:-)

I briefly commented my concerns to Andrew O.; you should talk to him.

AlexMonk-WMF added a comment (edited). Oct 7 2016, 6:44 AM

clean up Sanitarium to Jaime's standards and make it so it works with tungsten (a real-time sqoop)

all guesses of how it works (like Alex's) are wrong

What exactly have I suggested that you think is wrong? I haven't said anything about Sanitarium in this thread.

@jcrespo: no work has been done, no technologies have been decided, these are just words on a phab task right now.

Tungsten is the only tech that I know of that helps export mysql data to hadoop in real-time with enough flexibility to handle updates, deletes, etc. If anyone knows other tools, I'm super happy to consider them. And my point with mentioning Tungsten was that this sanitization should be compatible with it, in case that's what we decide to use. So either it's abstract enough that it can be implemented as a whitelist in multiple languages or it's low level enough that it can be a binlog reader or something. Again, I need to learn more before I make an educated guess.
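
As a purely illustrative sketch of the "binlog reader" flavor (not a description of how Sanitarium or Tungsten actually work), something like the python-mysql-replication library could read row events off a replica's binlog and project each row onto a per-table column whitelist before the data goes anywhere else. The connection settings, whitelist contents, and publish() below are all placeholders.

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent,
)

# Hypothetical per-table column whitelist; anything not listed is dropped.
WHITELIST = {
    ("enwiki", "revision"): {"rev_id", "rev_page", "rev_timestamp", "rev_len"},
}

def publish(schema, table, row):
    # Placeholder: in a real pipeline this would write to Kafka / HDFS staging.
    print(schema, table, row)

stream = BinLogStreamReader(
    connection_settings={"host": "replica.example.org", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=4242,  # must be unique among the replica's clients
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    resume_stream=True,
    blocking=True,
)

for event in stream:
    allowed = WHITELIST.get((event.schema, event.table))
    if allowed is None:
        continue  # table is not whitelisted at all: nothing leaves this process
    for row in event.rows:
        # Updates carry before/after images; inserts and deletes carry "values".
        values = row.get("after_values", row.get("values", {}))
        sanitized = {col: val for col, val in values.items() if col in allowed}
        publish(event.schema, event.table, sanitized)
```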

I think being skeptical at this point in the process is counterproductive; there's not even anything to be skeptical of :)

Milimetric updated the task description. Oct 11 2016, 9:46 PM
Milimetric added a subscriber: yuvipanda.
Milimetric added a comment (edited). Oct 17 2016, 10:05 PM

@jcrespo: we met up with labs and I have a better understanding of the problem. We came up with a draft solution that I'll detail here, and we're interested in your thoughts.

Motivation. For our use cases in labs, analytics, and dumps, it would be nice if there was a real-time and safe-for-public-consumption replica in production. If this existed, each project would benefit as follows:

  • Labs could replicate directly from that by reading the binlog. It could then get rid of its views. If there's a problem, it could contribute to improving the upstream production replica.
  • Analytics could read with tungsten from the replica's binlog, and push the data into Hadoop. It would then run algorithms on top of the data daily. If we find problems we could also work on the upstream replica to fix them.
  • Dumps could piggyback on the analytics extraction and take advantage of Hadoop. The process would be faster and more maintainable than the current one, and would have Analytics to help and support with it. We could also design the new process with incremental dumps in mind from the start. These could also be re-created after a month passes and re-uploaded for public consumption (thus sanitizing further as more redaction comes across to hadoop from the replica).

Challenges:

  • Schemas are not the same across all databases, and the rules for sanitizing are not the same across all databases; the solution needs to be flexible enough to allow exceptions, but factored out enough that it's sane to maintain
  • Developers making schema changes are not aware of this sanitizing effort and are likely to break sanitizing logic. The system needs to be robust enough to work on column additions and smart enough to not blow up if columns are removed, for example (there's a tiny sketch of this right after this list).
  • In general, it would be nice if developers gave a thought to the sanitizer when making changes to production. We might be able to work with the release team to improve the release process here, maybe by having the sanitizer work in beta.
  • Sanitizing logic exists in a few different places: views, sanitarium scripts, surviving redactron scripts in perl, labs view logic in python, and other attempts at porting by Chase, not to mention that dumps does its own sanitizing. I understand you, Jaime, looked at these and tried to unify them, I can help with that.
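
To illustrate the robustness point above with a tiny hypothetical sketch (this is not existing code): if sanitization is expressed as an explicit per-table column whitelist and every row is projected onto it, newly added columns are withheld by default and removed columns don't break anything.

```python
# Hypothetical whitelist: columns that are known-safe for a given table.
SAFE_COLUMNS = {"revision": {"rev_id", "rev_page", "rev_timestamp", "rev_len"}}

def sanitize_row(table, row):
    """Project a row dict onto the whitelist for its table.

    Columns added to production later are absent from the whitelist and are
    therefore dropped (safe by default); columns removed from production are
    simply missing from the row and never raise a KeyError here.
    """
    allowed = SAFE_COLUMNS.get(table, set())
    return {col: val for col, val in row.items() if col in allowed}

# Example: rev_comment was never whitelisted, rev_sha1 is new, rev_len is gone.
row = {"rev_id": 1, "rev_page": 7, "rev_timestamp": "20161017100500",
       "rev_comment": "secret", "rev_sha1": "abc"}
print(sanitize_row("revision", row))
# {'rev_id': 1, 'rev_page': 7, 'rev_timestamp': '20161017100500'}
```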

As mentioned before, we need to do this so we can publish our data, and we'd love to do it sooner rather than later. Ideally this takes some work off your plate and off of labs's plate. Please let me know when we can have some design meetings to figure out the approach and start brainstorming the technical details.

cc-ing @ArielGlenn even though this is a bit early in the discussion and more details need to be hammered out before we can figure out how this can help dumps.

Motivation. For our use cases in labs, analytics, and dumps, it would be nice if there was a real-time and safe-for-public-consumption replica in production.

All good here, but I had to stop reading here. The rest of the things are a predefined solution for a problem that you clearly do not understand fully, to the point that you either do not know how replication or a relational database or mediawiki works; or you do not know how labs works, which for me means that you are endangering the project by potentially leaking private data (or less critically, you will expose wrong data publicly).

Clearly you want to solve analytics problems without regard for dumps or labs needs (dumps does not do its own sanitization, it uses mediawiki itself- you are just repeating what I said ideally some time ago), and you want results fast- this is not a 3 month problem, this is a 1 year project with an FTE. You also think that nobody thought about tungsten before and that it is a magical tool that will solve all issues, but you do not have any kind of thought about isolated networks and their role. Forgive me for being blunt, but you are threatening privacy, and I am the last guardian here; I am also saving you from wasting time on solutions that do not work! You are starting too far away in the project, despite not knowing how it works, and the only description you have heard is from @chasemp, who has a third-hand understanding of it.

With that mindset, we clearly cannot have a collaboration (unless you go backwards many miles and start at the right pace and place), otherwise I will ask for analytics to have less access to production data to avoid leaking data. Leaking only has to happen once and it is irreversible; there is no fix, there is no going back.

But I will be happy to explain to you why your solution doesn't work- for some reason, you stopped talking to me before I could tell you the technical reasons.

If you really want to help with this, you should start with hard blockers such as T104459 or T140788 or T108255 or T109179 or T17441 and commit to work on this (and only on this) for a full year. You also have to apply for root privileges, if you do not have them already. We do not even have hardware right now to make labs work as it should, so please handle the provisioning, too. Not only that, you are going to take on a single project by yourself: https://phabricator.wikimedia.org/project/view/1729/ Given you cannot even fix your own db infrastructure such as analytics-store (which is in a permanently broken state due to analytics' special needs), allow me to be a bit skeptical. This ticket could have as a title "fix WMF infrastructure", and it wouldn't be more generic.

I understand you, Jaime, looked at these and tried to unify them, I can help with that.

Yes, the tickets mentioned above are blockers (no matter the solution), among many others. Are you going to work on those? Let me know and I will assign one to you, and give you a kick-off meeting.

Since I was added (thanks!), let me weigh in briefly.

Note that "dumps" includes not just sql tables, generation of xml dumps of metadata for pages/revisions and xml files of revision content, but a variety of other things as well. Piggybacking off of analytics for the metadata files would be one small piece of the much larger picture, though quite welcome indeed if we had a roadmap for the content dumps as well.

In the meantime...
I would love to be able to draw from a sanitized db such as labs makes available to its users. (Question: what does it do about external stores?) We could generate dumps of publicly available data from that. I would not want that to be the labs db itself, because that would be building into dumps a dependency on labs; it already has enough dependencies to keep me busy.

We dump private data for our own use/storage and we would want to continue to do that.

Making sure that changes to MediaWiki that restrict access to certain data or fields under certain conditions make it into the newly minted python port of the sanitization script is still a big concern to me. We really need to discuss how to safeguard that well. @jcrespo, I know you have thoughts and ideas on this.

I am very interested in those tickets linked above about schema and related issues, as you know, and I have to get back to the dbhell script and finish my integration of the host generation piece.

No, this isn't a one month project by any means, but it will get done, little by little.

Motivation. For our use cases in labs, analytics, and dumps, it would be nice if there was a real-time and safe-for-public-consumption replica in production.

All good here, but I had to stop reading here. The rest of the things are a predefined solution for a problem that you clearly do not understand fully, to the point that you either do not know how replication or a relational database or mediawiki works; or you do not know how labs works, which for me means that you are endangering the project by potentially leaking private data (or less critically, you will expose wrong data publicly).

This is an insulting assumption, Jaime, and I'm not really sure why you're making that assumption about me. I simply can't spell out a thousand nuances in text. Also, it appears you're not reading the text that I did write, so that makes my task even more difficult. How about we meet up and clear this up? Unless all you are interested in doing is insulting me, in which case please let me know and I'll handle this project differently. Me being generally nice is not an open invitation to being treated like this. An apology would be an appropriate accompaniment to a meeting invitation.

Clearly you want to solve analytics problems without regard for dumps or labs needs (dumps does not do its own sanitization, it uses mediawiki itself- you are just repeating what I said ideally some time ago), and you want results fast- this is not a 3 month problem, this is a 1 year project with an FTE. You also think that nobody thought about tungsten before and that it is a magical tool that will solve all issues, but you do not have any kind of thought about isolated networks and their role. Forgive me for being blunt, but you are threatening privacy, and I am the last guardian here; I am also saving you from wasting time on solutions that do not work! You are starting too far away in the project, despite not knowing how it works, and the only description you have heard is from @chasemp, who has a third-hand understanding of it.

I suggest you check your words a few times before sending them to me. You're not being blunt, you're being considerably worse. I said nothing close to the accusations you make towards me, nor am I threatening privacy. I was protecting privacy for years before you showed up, so a lecture from you is particularly misplaced. I generally have good intentions towards our DBAs who work hard and deal with a lot. I expect the same in return.

With that mindset, we clearly cannot have a collaboration (unless you go backwards many miles and start at the right pace and place), otherwise I will ask for analytics to have less access to production data to avoid leaking data. Leaking only has to happen once and it is irreversible; there is no fix, there is no going back.

Your perception of where I am is mistaken. You seem to think that I'm somewhere in the future of my own timeline. Please adjust your own insecurities and apprehensions and realize I have not started this project. I have not done a single thing besides ask for a meeting with you.

But I will be happy to explain to you why your solution doesn't work- for some reason, you stopped talking to me before I could tell you the technical reasons.

I'm really confused here. When did I stop talking to you? I asked you three times already to please tell me when you would like to meet, and offered to wake up in the middle of the night to accommodate your schedule. That fact makes your assertion purely bizarre.

If you really want to help with this, you should start with hard blockers such as T104459 or T140788 or T108255 or T109179 or T17441 and commit to work on this (and only on this) for a full year. You also have to apply for root privileges, if you do not have them already. We do not even have hardware right now to make labs work as it should, so please handle the provisioning, too. Not only that, you are going to take on a single project by yourself: https://phabricator.wikimedia.org/project/view/1729/ Given you cannot even fix your own db infrastructure such as analytics-store (which is in a permanently broken state due to analytics' special needs), allow me to be a bit skeptical. This ticket could have as a title "fix WMF infrastructure", and it wouldn't be more generic.

I'm interested in helping to solve the problem, not repeating the mistakes that got us into trouble in the first place. The only proper place to start is with understanding the problem. I won't do that if you behave like this towards me; it's really not a fair expectation to have of anyone in a professional environment.

I understand you, Jaime, looked at these and tried to unify them, I can help with that.

Yes, the tickets mentioned above are blockers (no matter the solution), among many others. Are you going to work on those? Let me know and I will assign one to you, and give you a kick-off meeting.

Yeah, I would like to work on them, but at this point you have to show me some good faith first.

Since I was added (thanks!), let me weigh in briefly.
Note that "dumps" includes not just sql tables, generation of xml dumps of metadata for pages/revisions and xml files of revision content, but a variety of other things as well. Piggybacking off of analytics for the metadata files would be one small piece of the much larger picture, though quite welcome indeed if we had a roadmap for the content dumps as well.

@ArielGlenn: yes, we definitely know that, and plan on dealing with content as well as metadata. We just don't have anything concrete for this next quarter, but it's necessary for many other projects, so it won't be ignored.

In the meantime...
I would love to be able to draw from a sanitized db such as labs makes available to its users. (Question: what does it do about external stores?) We could generate dumps of publicly available data from that. I would not want that to be the labs db itself, because that would be building into dumps a dependency on labs; it already has enough dependencies to keep me busy.

+1, agreed, this is why we think a single place from which we can draw sanitized data would be a good idea in theory.

We dump private data for our own use/storage and we would want to continue to do that.

Right, but that can continue via a parallel process with very similar code. If we had a sanitized version of the data in production, we could pull from the un-sanitized replicas in the same way, into hadoop. Then dumps could go from there on parallel tracks too.

Making sure that changes to MediaWiki that restrict access to certain data or fields under certain conditions make it into the newly minted python port of the sanitization script is still a big concern to me. We really need to discuss how to safeguard that well. @jcrespo, I know you have thoughts and ideas on this.
I am very interested in those tickets linked above about schema and related issues, as you know, and I have to get back to the dbhell script and finish my integration of the host generation piece.
No, this isn't a one month project by any means, but it will get done, little by little.

I would never estimate this as a one-month project. I have some thoughts about pieces we could finish in one or two quarters. I haven't thought much beyond that, but I don't think agreement about how long the project will take is very important. We all agree it will take a long time, and that it's absolutely necessary.

It seems that my own frustration talked instead of having a civilized response.

I apologize sincerely and I would understand if you do not want to work with me any more. I would be happy to meet you and apologize again live. I am sorry. If you prefer, there is another DBA that certainly could do a better job than me- although I would like to still give it a try myself. Sorry again.

It seems that my own frustration talked instead of having a civilized response.
I apologize sincerely and I would understand if you do not want to work with me any more. I would be happy to meet you and apologize again live. I am sorry. If you prefer, there is another DBA that certainly could do a better job than me- although I would like to still give it a try myself. Sorry again.

No problem at all, @jcrespo. As I said, I very much respect your work and how overworked you are. That's half of my motivation to open this can of worms in the first place! :) Ok, so no problem, and please do set up a meeting at any time you are available. I have no problem working with just you, but if you want to bring in other folks, that's up to you. When you do set up the meeting, don't worry about my schedule, I'll make room for this.

Thank you for the apology, I appreciate that.

Nuria added a subscriber: Nuria. Oct 18 2016, 6:34 PM

@jcrespo, @Milimetric: let's start all over here as we work better as a team. The fact that we get so fired up when talking about privacy means that we care.

We all want 1) the data redaction to be a better process, and 2) to have just one process that several data sources tap into.

We understand that it is not a simple problem (otherwise it would have been solved ages ago), but it is also not an unsolvable one. So let's do some project planning and see what piece of the problem we can tackle first. I think the first step might be to understand how our current redaction process actually works.

Ok, everyone, I spoke to @jcrespo and we will collaborate on this project but there are too many dependencies to resolve first. I will try to help with those this quarter and we will resume talking about this task in January, 2017. Thanks for everyone's input, and I'll re-open the conversation then.

Nuria moved this task from Next Up to Paused on the Analytics-Kanban board. Oct 20 2016, 7:48 PM

Removed Dumps-Generation because that's for issues with the current dumps process only. Dumps-Rewrite is the right one.

Milimetric moved this task from Wikistats Production to Radar on the Analytics board.
Milimetric moved this task from Radar to Wikistats Production on the Analytics board.
Nuria added a comment. Mar 20 2017, 4:05 PM

Analytics is importing data for mediawiki edit reconstruction from labs; the data is public, thus redaction is not needed.

Nuria closed this task as Declined. Mar 20 2017, 4:06 PM
ArielGlenn moved this task from Backlog to Done on the Dumps-Rewrite board. Apr 24 2017, 11:05 AM