Page MenuHomePhabricator

Phabricator database access for Joel Aufrecht
Closed, DeclinedPublic

Description

I received a request from @ksmith to get phabricator db access for @JAufrecht so that he can run some queries for reporting. What would be the best way to handle this? Seeking input from SRE. Thanks!

Event Timeline

mmodell updated the task description. (Show Details)
mmodell edited projects, added Phabricator; removed SRE-Access-Requests, acl*sre-team.
mmodell changed Security from None to Access Request.

Note that a static snapshot would be fine.

Dzahn added a subscriber: Springle.
Dzahn subscribed.

Added dba and database tags re: providing the static snapshot

I could create the db dump but I believe some things might need to be redacted.

Note that phabricator has a lot of databases, and I assume you guys only need a few specific tables? If that is the case, it might be a lot easier to provide a dump of just those tables rather than a complete dump of all phabricator data.

I could take a guess of what tables are needed, although I would probably be wrong, so we might need several iterations. The main ones would be Projects, Tasks, Columns, and the transactions and helper tables related to those.

I'm more concerned about filtering out security-flagged content, but hopefully there's an easy way to do that.

Maybe @csteipp has an opinion, I believe Security sensitive tasks are the only data that need to be redacted?

A read-only db user might be easier, as long as the queries would only be collecting summary information then I believe that wouldn't reveal any security-sensitive info.

@JAufrecht can you describe the queries you'll be running?

I'd like to not reveal contents or counts of security bugs. NDA group is probably the same, although the count probably isn't as critical for them.

There is sample data in this spreadsheet.

For burnups: total # of stories, sum of story points, total # of stories with each story point value, for each different status and column, filtered by project. Separate row for each day of a query period.

for velocity: total points of stories "Done" for each sprint for a team

for cycle times: dates of each column or status transition for each story, filtered by a project and by iterations

don't need any titles or contents

I'm only interested in the VisualEditor project for now; could you filter for just those? not sure how much overlap there is with security; would you have to remove the security stories or just remove anything identifying them as such (project tag, title, contents)?

A dump would probably be more helpful than a read access, since I'll probably do a bunch of data manipulation in SQL and/or Python. This is a one-off proof of concept and research project, not a branch of development. Although, I could dump what I need via read-only access and filter out anything sensitive as I do.

@JAufrecht, and I'm assuming you're going to make the results public? If so, please do exclude any private issues-- you can either exclude everything in Security or you can examine the actual acl value and see which tasks are publicly readable.

If we make a private dump that includes security bugs, we'll need to clarify how that data will be handled and where it will be stored. So if we can exclude them before making the dump, that would be best.

I have no objections to working with a dump that is already sanitized.

I will make results public, but they'll look like

image.png (564×773 px, 19 KB)
, so pretty high-level compared to security bug counts.

not sure how much overlap there is with security

I think I don't leak too much info by saying in public that currently the number of tasks associated with the VE project && marked as security (checkbox enabled) is 1. I also see 1 task associated to both VE && WMF-NDA (hence access-restricted too).

My vague guess of DBs and tables (might not be complete at all) is (in db.table format):

  • phabricator_maniphest.maniphest_transaction (complete log of any kind of transitions; dateModified column is in unixtime; transactionType column can e.g. have a "projectcolumn" value)
  • phabricator_maniphest.edge (updated list of existing connections between objects like assignee&&task or task&&project)
  • phabricator_maniphest.maniphest_task
  • phabricator_project.project_column

https://phabricator.wikimedia.org/T28#1232723 might provide some more impressions of SQL phun.

for cycle times: dates of each column or status transition for each story, filtered by a project and by iterations

Not sure what "iterations" are here in Phab terms.

I'm only interested in the VisualEditor project for now; could you filter for just those?

VisualEditor's project PHID is PHID-PROJ-dafezmpv6huxg3taml24 (e.g. used as value in phabricator_maniphest.edge.dst and phabricator_maniphest.edge.dst.src)

not sure how much overlap there is with security

I think I don't leak too much info by saying in public that currently the number of tasks associated with the VE project && marked as security (checkbox enabled) is 1. I also see 1 task associated to both VE && WMF-NDA (hence access-restricted too).

Yeah, for one-off reports this isn't too bad. If this is something we start running weekly, and for all projects, then it's easy to see "2 new security issues in this extension this week". Which is what I would like to avoid.

I have a script working off of a sample Phab database, and a window of free time today through next Tue/Wed, so the timing is perfect if there's any way to move forward with the dump.

The security check box is only tangentially related to actual task security. We have a large amount of sensitive tasks here actually and unlike Bugzilla it is not a binary state. The process of sanitizing a full dump, or really any dump of tasks or pastes or files, is complex. This really really needs serious scrutiny to not reveal operational detail that can harm the organization. We have everything from fundraising metadata to debugging pastes of infrastructure configuration here.

@mark for awareness

I don't know anything about the phab API but I wonder if you could use that in the meantime while waiting for a dump sanitization procedure. "Just VE" probably limits the number of tasks involved drastically and redaction/hiding security, etc. is covered by using a phab user that has no special privileges/groups to do API requests.

does the API allow multiple tasks to be fetched with a single request?

Is it really non-trivial to only include things with visibility policy 'Public'? That way you don't need to deal with custom policies or anything.

@jeremyb-phone It is somewhat of a pain to set up an app to actually use the Phab API, and after you do so, the API has some pretty significant limitations on what data can be exported. In this case, we want transaction data, and I believe some of it that not exposed via the API.

Is it really non-trivial to only include things with visibility policy 'Public'? That way you don't need to deal with custom policies or anything.

Actually, no, that is straightforward for maniphest tasks:

SELECT * FROM phabricator_maniphest.maniphest_task WHERE viewPolicy="public";
/* or join transactions: */
SELECT * FROM phabricator_maniphest.maniphest_task as task, maniphest_transaction as trans where task.viewPolicy="public" and task.phid=trans.objectPhid;

Being a well defined relational schema some tables like comments related to transactions don't have an enforced security mode outside of their transaction association. So, sure, all things public is a sensible idea and the execution of that isn't necessarily straight forward. Files are another where cross application security associations determine the ultimate effective security setting. To me this is very easy to screw up in either sense.

Let me see if I understand the options. I see all of these as temporary hacks, because we hope to get something built in to Phab eventually.

Option 1: Grant Joel some kind of direct access to the live Phab database or a fresh replication and trust that he filters appropriately and practices good security.

Option 1a: As Option 1, but limit account access to the extent possible with MySQL permissions.

Option 2: Make a full dump and trust Joel's security processes. Provide some process, manual or automatic, to repeat daily or weekly until permanent solution is available.
Option 2a: Make a filtered dump of task and transaction based on viewPolicy=public and/or excluding Security-flagged items, as noted above.
Option 2b: Make a sanitized dump that does not contain any text, only metadata. This would take a fair amount of work, i.e., have a fair amount of risk of missing something, because it would have to work at the row level, not the table or column level, to strip out everything except transactionType=status, projectcolumn, or points.

Do any of these options seems acceptable to all?

The script is here: https://github.com/jaufrec/phab_task_history
It needs the following tables and fields:

phabricator_maniphest.maniphest_transaction:
  dateModified
  objectPHID
  transactionType
  newValue

phabricator_project.project:
  phid
  name

phabricator_project.column:
  phid
  name

The script currently reads Titles but this is for debugging purposes and can be removed once the resulting aggregate charts are validated.

Personally i would vote for 1a) or a modified 1a).

  • get Joel a shell user to get on some internal work host with a mysql client (tin, mira?)
  • make a new, personalized, MySQL user / GRANT to connect to all phab databases but that is allowed to only SELECT/<read only things needed> and only from that internal host
  • trust Joel to not publish private data and trust MySQL that no changes can be made and you can't connect to it from external

So basically safe all the work of the 2) steps but don't just open up the database to connect to it directly from external.

What do the database admins think?

From my standpoint the only viable solution at the moment is 2a, and I have at least a semblance of reasoning for my pessimism and general poo-poo'ing which I will try to convey.

Option 2a: Make a filtered dump of task and transaction based on viewPolicy=public and/or excluding Security-flagged items, as noted above.


Option 1: Grant Joel some kind of direct access to the live Phab database or a fresh replication and trust that he filters appropriately and practices good security.

Option 1a: As Option 1, but limit account access to the extent possible with MySQL permissions.

Option 2: Make a full dump and trust Joel's security processes. Provide some process, manual or automatic, to repeat daily or weekly until permanent solution is available.
Option 2a: Make a filtered dump of task and transaction based on viewPolicy=public and/or excluding Security-flagged items, as noted above.
Option 2b: Make a sanitized dump that does not contain any text, only metadata. This would take a fair amount of work, i.e., have a fair amount of risk of missing something, because it would have to work at the row level, not the table or column level, to strip out everything except transactionType=status, projectcolumn, or points.

This is not meant to offend but we have data in the phab DB's that none of us should access or export without involving the owners. Primarily fundraising uses this platform and their issues are hidden to everyone else. Also, soon is R&D and procurement. This has nothing to do with whether any of us are smart or trustworthy people. It has more to do with the vast increase in the chance of this data being leaked by people who are not valid stewards of it. That includes myself. A full dump including this in order to report on some unrelated task heuristics is not a reasonable practice.

A reporting application that lives on the phab box and accesses the DB directly is not a bad idea in general, but it seems we will either need to replicate the complex permissions logic within Phab or hook into the existing libraries and stick with PHP. @Krenair's suggestion of a whitelist approach is probably the most sane though in this scenario. Instead of trying to figure out what things we have to exclude we should specify and only report on data already made public. We can also somewhat easily extend the API to do things we want to do without involving ongoing backend shenanigans. For instance we could make one big dummy api call to return whatever task information we want in bulk for these purposes that would be in-house but the maintenance of such an extension would be reasonable and the security implications minimal.

Because of

This is a one-off proof of concept and research project, not a branch of development. Although, I could dump what I need via read-only access and filter out anything sensitive as I do.

I am wondering if generating reporting data for testing and analysis is a valid approach here. It has the disadvantage of being fake but shouldn't be too difficult since the primary consumption seems to be related to the metadata. We have taken this approach in other cases in terms of phabricator development.

In essence for me:

Option 1. Too risky and too broad for these purposes
Option 1a: Same considering the protected and unprotected data lives in the same tables
Option 2: Not different from option 1 from a risk perspective.
Option 2a: Seems most sane and actually this is not the first request for it.
Option 2b: Seems more work than it's worth since 2a stands to fulfill more needs at the same risk.

FWIW I can't access sample data in this spreadsheet.

I also tried to run https://raw.githubusercontent.com/jaufrec/phab_task_history/master/import_transaction.py on phab-01 (a general test box) and it will require a few packages I don't think we use elsewhere and seems to fail with sqlalchemy.exc.DatabaseError: (DatabaseError) 1292: Incorrect datetime value

To clarify, I'm trying to produce a proof of concept of this kind of reporting, which means I'm trying to create a historical report with real data for at least FY2015Q3 (Jan-Mar 2015) ASAP, and then for Q4 in progress with at least weekly updates. So fake data wouldn't help that and probably wouldn't even be a net gain for debugging.

So what's the entirety of the filtering we want for Option 2a?

  1. Only phabricator_maniphest.maniphest_transaction, phabricator_project.project, and phabricator_project.column
  2. Public: task.viewPolicy="public"
  3. not security: exclude any tasks and related transactions) assigned to Security project
  4. VE only: ? (less ideal from reporting, plus we might want to repeat this on other projects soon)

Here's a public version of the Reporting mockups.

Could we be so simple as to select every viewPolicy="public" from maniphest_task and then associated transactions and edge relationships for the matching tasks? (and phabricator_project)

@mmodell is pretty close to my thinking here...

SELECT * FROM phabricator_maniphest.maniphest_task WHERE viewPolicy="public";
/* or join transactions: */
SELECT * FROM phabricator_maniphest.maniphest_task as task, maniphest_transaction as trans where task.viewPolicy="public" and task.phid=trans.objectPhid;

Is there any task information that is non-public that we need? If not I vote we do the whole lot. There are a few mechanisms at play here with the security project, the security drop down, and the use of templates and task ACL's for reuse, but they all under the covers do the same stuff. If we say we only care about 'public' tasks we can take them and track down all relevant information and I think it's all above board and easily auditable.

(assuming we don't want comment contents as there are a few edge cases here that we would have to handle separately)

Dzahn triaged this task as Medium priority.May 26 2015, 9:21 PM

I also need phabricator_project.project and phabricator_project.column. Other than that, this sounds fine.

I'm not totally sure I understand why a dump approach is being pursued here when there are other options available. In a restricted environment, why not simply generate SQL queries that the Wikimedia operations team can vet, both the query code and the output, and then have cron execute the queries occasionally?

Dumps or database replication, similar to what we do with the MediaWiki wikis, would be nice and would have many general applications, of course, but it doesn't really seem relevant here.

Joel, Kevin, Chris, and Chase met to resolve discussion:

Identify people who need to be involved in this decision
Do we have everyone necessary to approve Phab data access?
if Chase and Chris agree, probably okay to proceed

Do we have someone who can execute decisions we make?
Chase or Mukundu can execute

Confirm or extend menu of possible solutions (see https://phabricator.wikimedia.org/T99295#1312812)

Option 1: Access. did not discuss in depth.

Option 2: Dump.
not sure where to store dump file securely
not sure how to really sanitize it adequately to move it out of a secure perimeter.
Decision: Dump.

Security concerns
Complete Phab contains several kinds of sensitive data. Probably all of this is addressed by filtering by “public”, but that isn’t 100% guaranteed. Also, filtering by public requires working through the relationships.
Decision: filter on public, and filter on VE, and exclude the title column, so that exported data includes only bug IDs,

Other users
Other (non-WMF) users have expressed interest in accessing Phabricator data in bulk, but enabling this has not been a high priority so no progress has been made.

Decision: don’t take any current actions to support this specifically, but consider this in design decisions.

Conceptual
What exactly is the dump script doing?
Dump Option A: Modify Joel’s script to work in production; it produces a sanitized, denormalized flat file suitable for multidimensional reporting.

Pro

  • fewer total steps

Con

  • higher risk of introducing bugs in production environment
  • puts more code within the security perimiter
  • output is less suitable for other uses

Dump Option B: Minimal dump script in production produces sanitized representations of Phabricator database tables that can be moved outside security

Decision: Option B.

Practical
What are the technical constraints?
Python or PHP is fine. MySQL code would be best inside a script that could be run from the shell, rather than being run from within mySQL.

Who’s doing the work?
Chase could write the dump script; has a queue and couldn’t do it today.

Decision: Joel to make first attempt.

If the code is written by Joel outside of security and without access to the production environment, how can we avoid a very long debugging loop?
Decision: Use Phab-01, which is very close to production. Don’t use sqlalchemy or any other tool that is not on phab-01.

Next Steps

  • Chase to get joel shell access to phab-01
  • Joel to develop “dump script” to produce sanitized output files.
  • Chase and Chris audit dump script
  • Chase to run on production and provide resulting files to Joel via publicly accessible location (resulting files would have no security implications). Chris is invited to review if necessary.

Currently stuck on one issue while preparing a dump script on phab-01:

  1. What database credentials (username, password if any, database name) should I use to connect to the Phabricator database?

I was stuck on "What python/mysql libraries are installed on production (and hopefully phab-01)?" but that was a python3 issue; in python2, mysql and mysql.connector are available.

Thanks for the summary @JAufrecht. In the act of reviewing teh schema it seems there were lots of edge cases @csteipp and I discovered. Old comment history we had hidden, descriptions that had sensitive info removed, titles that had sensitive information, and internal things like task mail keys that are semi-private. Long story short I wrote what I believe is a valid dump for this https://gerrit.wikimedia.org/r/#/c/214398/. It creates a (large) json file with all of the information we have discussed that is safe for public consumption about public tasks. I put a dump from today on tin for @csteipp to look over and then I will share it more publicly.

chasemp closed subtask Restricted Task as Invalid.May 29 2015, 5:46 PM