Phlogiston dump is missing projects
Closed, Resolved · Public

Description

Steps to reproduce

  1. Download a Phlogiston dump
  2. Open in python
> import json
> with open('../phabricator_public.dump') as dump_file:
      data = json.load(dump_file)
  3. Examine the data.project.projects dict:
> len(data['project']['projects'])

Actual result

*** TypeError: object of type 'NoneType' has no len()

Expected Result
Thousands of keys found.

Note that the dump is not empty; it's 1.1 GB, and the adjacent data set, project.columns, is populated:

> len(data['project']['columns'])
15978
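
To see at a glance which sections of the dump are missing, a small helper along these lines can summarize each one. This is only a sketch: `summarize_sections` is a hypothetical helper name, the sample data is made up, and the only layout it assumes is the one shown above (a `project` dict holding `projects` and `columns`).

```python
import json


def summarize_sections(section):
    """Map each key of a dict to the size of its value, or to the value itself if unsized."""
    summary = {}
    for key, value in section.items():
        # len() fails on None, so values without a length (None, numbers) are reported as-is.
        summary[key] = len(value) if hasattr(value, "__len__") else value
    return summary


# Made-up miniature stand-in for data['project'] from the real dump.
sample = json.loads('{"projects": null, "columns": {"1": [], "2": []}}')
print(summarize_sections(sample))
```

Against the real file, the call would be `summarize_sections(data['project'])`; a `None` entry marks a section that was written as `null`.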

The Phlogiston dump has not been actively used in 2019, AFAIK, and the reports have been empty and/or erroring for many months, so this may not be a recent change.

Event Timeline

Adding @mmodell since this is going to be something related to the phab puppet module or the phabricator-tools/public_task_dump.py script.

Thanks @ArielGlenn, for bringing this to my attention!

It's possible (likely even) that the phabricator schema changed and public_task_dump.py needs to be updated to match. I'll try to figure out what's happened.

So the schema didn't change as far as I can see. I can't actually find any reason why that part of the dump would be empty. I'll keep digging though.

Here is the relevant part of the dump code:

data['task'] = taskdata
data['project'] = {}
data['project']['projects'] = phabdb.get_projectbypolicy(pdb, policy='public')
data['project']['columns'] = phabdb.get_projectcolumns(pdb)

mdb.close()
pdb.close()

with open('/srv/dumps/phabricator_public.dump', 'w') as f:
    f.write(json.dumps(data))

Note that data['project']['columns'] is filled after data['project']['projects'], so a timeout or unhandled exception should prevent data['project']['columns'] from being filled. There don't appear to be any exception handlers in the code either so the presence of data in data['project']['columns'] should indicate that the projects data fetch completed without any detectable errors. The final serialization is done all at once at the end with a call to json.dumps so really any errors should prevent the json from being saved at all. 😕
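
One case the reasoning above leaves open is a fetch that fails quietly and returns None instead of raising. The sketch below uses a hypothetical stand-in, `fetch_projects`, not the real `phabdb.get_projectbypolicy`, and made-up data. It only shows that `json.dumps` writes `None` as `null` without complaint, so the dump would still be saved with the later keys populated and no error reported.

```python
import json


def fetch_projects():
    """Hypothetical stand-in for a fetch that finds nothing and falls through."""
    rows = []
    if rows:
        return rows
    # No explicit return on this path, so Python returns None.


data = {"project": {}}
data["project"]["projects"] = fetch_projects()
data["project"]["columns"] = {"1": ["col"]}

serialized = json.dumps(data)       # None becomes null; no exception is raised
reloaded = json.loads(serialized)   # null comes back as None
print(reloaded["project"]["projects"])
```

Calling `len()` on the reloaded value would then raise the same `TypeError` as in the report.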

Any suggestions on how to proceed? There is no specific deadline, but I would like to have some Phlogiston results for consideration in the next few weeks.

I haven't had a chance to look at this further but I will attempt to run the dump script manually and hopefully I can find further clues.

	        for p in rules:
	            if p['rule'] == "PhabricatorPolicyRuleProjects":
	                allowedProjects = p['value']
	                break
	        else:
	            allowedProjects = []

Where exactly do PhabricatorPolicyRuleProjects (and PhabricatorPolicyRuleUsers) get defined?

The Phlogiston dump has not been actively used in 2019, AFAIK, and the reports have been empty and/or erroring for many months, so this may not be a recent change.

If there is really no use case, then I wonder whether to consider sunsetting this dump instead of maintaining it...
Phabricator has its Conduit API (which might lack some functionality, but that is hard to say without an analysis).

Who was using it in the past? Is this https://www.mediawiki.org/wiki/Phlogiston/Data_Loading_Model actively used by anyone?

I like the idea of getting rid of cruft; otoh removing a public dataset seems suboptimal, if the data is not available for download in some other fashion. Call me a fence-sitter.

It has no active users at the moment and has been broken for months. However, I would like to re-activate it for further reporting exploration and prototyping. It may now be possible to replace it with calls to the API (which didn't have all the necessary info and/or wasn't stable when the code was originally written), but that would be a significant refactor. If there's a quicker fix for the dump, it would be more helpful sooner to have the dump back.

I'm running the public_task_dump.py in a terminal session and I found two coding errors which may be related. It's really strange that it didn't cause a problem before but maybe there was a subtle change in python behavior between versions and we just upgraded to a newer python somewhere along the way?

This whole dump script is some pretty messy python 2 code, I'm not sure how maintainable it is in the long term but hopefully I will be able to get it working again for the time being.

Does it need to be converted to python3? How much of a PITA is it going to be to port wmfphablib?

@ArielGlenn: I suspect it might be a pain. It's pretty old code, not written with python3 in mind. However, it might not be too bad: it's mostly straightforward, there is just quite a lot of it.

This is fixed in rPHTOb3b4a8587022: When passing a tuple to string formatting, include an ending comma. I ran the new code manually and confirmed that the dump now correctly includes the projects data.
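
For anyone unfamiliar with the pitfall named in that commit message, here is a minimal illustration. It is not the code from the commit; the function names, format string, and values are made up. In Python, `(value)` is just `value`, while `(value,)` is a one-element tuple, and the difference matters for `%` formatting when `value` is itself a tuple.

```python
def format_unsafe(value):
    # Parentheses alone do not make a tuple: this is the same as "..." % value.
    return "policy=%s" % (value)


def format_safe(value):
    # The trailing comma builds a one-element tuple, so value is always one argument.
    return "policy=%s" % (value,)


print(format_safe("public"))
print(format_safe(("a", "b")))

try:
    # Here the tuple is unpacked into two arguments for a single %s.
    format_unsafe(("a", "b"))
except TypeError as err:
    print("TypeError:", err)
```

With a plain string both forms behave the same, which may be why a mistake like this can go unnoticed until the argument happens to be a tuple.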

I ran a new phlogiston process and it handled the new dump without error, and got projects data.