Phlogiston dump is missing projects
Closed, Resolved, Public

Description

Steps to reproduce

  1. Download a Phlogiston dump
  2. Open it in Python:
> import json
> with open('../phabricator_public.dump') as dump_file:
      data = json.load(dump_file)
  3. Examine the data.project.projects dict:
> len(data['project']['projects'])

Actual result

*** TypeError: object of type 'NoneType' has no len()

Expected Result
Thousands of keys found.

Note that the dump is not empty; it's 1.1 GB, and the adjacent data set, project.columns, is populated:

> len(data['project']['columns'])
15978
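
To check other parts of the dump without tripping over the same TypeError, a small helper can report the size of each entry and return None where the value is missing or null. This is only a sketch: `entry_count` is a hypothetical helper, not part of Phlogiston or phabricator-tools, and the sample below is a tiny stand-in for the real dump.

```python
import json

def entry_count(data, *path):
    """Return the number of entries found by following `path` into `data`,
    or None if the path is missing, null, or not a sized container."""
    node = data
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return len(node) if hasattr(node, "__len__") else None

# Small stand-in for the real dump, mirroring the symptom in this report.
sample = json.loads('{"project": {"projects": null, "columns": {"1": {}, "2": {}}}}')

for path in (("project", "projects"), ("project", "columns")):
    print(".".join(path), entry_count(sample, *path))
```

Against the real file, the same calls would be made on the `data` object loaded in the steps above.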

The Phlogiston dump has not been actively used in 2019, AFAIK, and the reports have been empty and/or erroring for many months, so this may not be a recent change.

Event Timeline

Adding @mmodell since this is going to be something related to the phab puppet module or the phabricator-tools/public_task_dump.py script.

Thanks @ArielGlenn, for bringing this to my attention!

It's possible (likely even) that the phabricator schema changed and public_task_dump.py needs to be updated to match. I'll try to figure out what's happened.

mmodell triaged this task as Medium priority. Oct 29 2019, 9:08 PM

So the schema didn't change as far as I can see. I can't actually find any reason why that part of the dump would be empty. I'll keep digging though.

Here is the relevant part of the dump code:

data['task'] = taskdata
data['project'] = {}
data['project']['projects'] = phabdb.get_projectbypolicy(pdb, policy='public')
data['project']['columns'] = phabdb.get_projectcolumns(pdb)

mdb.close()
pdb.close()

with open('/srv/dumps/phabricator_public.dump', 'w') as f:
    f.write(json.dumps(data))

Note that data['project']['columns'] is filled after data['project']['projects'], so a timeout or unhandled exception while fetching the projects should prevent data['project']['columns'] from being filled at all. There don't appear to be any exception handlers in the code either, so the presence of data in data['project']['columns'] should indicate that the projects fetch completed without any detectable errors. The final serialization is done all at once at the end with a single call to json.dumps, so really any error should prevent the JSON from being saved at all. 😕
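
One scenario that fits these symptoms without any exception being raised is a fetch that simply returns None: json.dumps writes None as null, and loading it back yields None again, which gives exactly the TypeError from the description. The sketch below illustrates that round trip only. `fetch_projects` is a hypothetical stand-in, and nothing here confirms that phabdb.get_projectbypolicy actually behaved this way.

```python
import json

def fetch_projects():
    # Hypothetical stand-in for a query helper that finishes without
    # returning a value, so Python implicitly returns None.
    pass

data = {"project": {}}
data["project"]["projects"] = fetch_projects()    # silently None
data["project"]["columns"] = {"1": "Backlog"}     # still filled afterwards

serialized = json.dumps(data)      # None is written as JSON null, no error
restored = json.loads(serialized)  # null comes back as None

try:
    len(restored["project"]["projects"])
except TypeError as exc:
    error_message = str(exc)

print(serialized)
print(error_message)
```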

Any suggestions on how to proceed? There is no specific deadline, but I would like to have some Phlogiston results for consideration in the next few weeks.

I haven't had a chance to look at this further but I will attempt to run the dump script manually and hopefully I can find further clues.

for p in rules:
    if p['rule'] == "PhabricatorPolicyRuleProjects":
        allowedProjects = p['value']
        break
else:
    allowedProjects = []

Where exactly are PhabricatorPolicyRuleProjects (and PhabricatorPolicyRuleUsers) defined?

> The Phlogiston dump has not been actively used in 2019, AFAIK, and the reports have been empty and/or erroring for many months, so this may not be a recent change.

If there is really no use case, then I wonder whether to consider sunsetting this dump instead of maintaining it...
Phabricator has its Conduit API (which might lack some functionality, but that is hard to say without an analysis).

ArielGlenn added a comment. Edited Nov 19 2019, 1:55 PM

Who was using it in the past? Is this https://www.mediawiki.org/wiki/Phlogiston/Data_Loading_Model actively used by anyone?

I like the idea of getting rid of cruft; otoh removing a public dataset seems suboptimal, if the data is not available for download in some other fashion. Call me a fence-sitter.

It has no active users at the moment and has been broken for months. However, I would like to re-activate it for further reporting exploration and prototyping. It may now be possible to replace it with calls to the API (which didn't have all the necessary info and/or wasn't stable when the code was originally written), but that would be a significant refactor. If there's a quicker fix for the dump, it would be more helpful sooner to have the dump back.

mmodell added a comment. Edited Dec 12 2019, 8:29 PM

I'm running public_task_dump.py in a terminal session and I found two coding errors which may be related. It's really strange that they didn't cause a problem before, but maybe there was a subtle change in python behavior between versions and we just upgraded to a newer python somewhere along the way?

This whole dump script is some pretty messy python 2 code. I'm not sure how maintainable it is in the long term, but hopefully I will be able to get it working again for the time being.

Does it need to be converted to python3? How much of a PITA is it going to be to port wmfphablib?

@ArielGlenn: I suspect it might be a pain. It's pretty old code that wasn't written with python3 in mind. However, it might not be too bad: it's mostly straightforward, there is just quite a lot of it.

mmodell closed this task as Resolved. Dec 16 2019, 4:08 PM

This is fixed in rPHTOb3b4a8587022: When passing a tuple to string formatting, include an ending comma. I ran the new code manually and confirmed that the dump now correctly includes the projects data.
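
For readers unfamiliar with the pitfall named in that commit title, here is a minimal illustration of why the trailing comma matters with %-formatting. It is not the code from the commit: the variable names and values are made up for the example.

```python
# Hypothetical value that happens to be a tuple itself.
row = ("PHID-PROJ-1", "PHID-PROJ-2")

# Parentheses alone do not create a tuple: (row) is just row, so the %
# operator unpacks it and finds two arguments for a single %s.
try:
    broken = "value: %s" % (row)
except TypeError as exc:
    broken = str(exc)

# The trailing comma makes a one-element tuple, so row is formatted whole.
fixed = "value: %s" % (row,)

print(broken)
print(fixed)
```

When the value is a plain string or number, both spellings produce the same output, which may be why a bug of this kind can go unnoticed until the data changes.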

I ran a new phlogiston process and it handled the new dump without error, and got projects data.