Page MenuHomePhabricator

[Phabricator] Expose a project's removed tasks through Conduit
Closed, DeclinedPublic5 Estimated Story Points

Description

In order to generate meaningful burnup charts we need to know which tasks have been in a Phabricator project at any given time and retrieve this information from Conduit. This can be done by exposing project transactions e.g. through a Conduit method called “project.gettransactions” similar to the existing “maniphest.gettasktransactions”. Since projects tend to have a lot more transactions than tasks do, it might be desirable to limit the returned transactions to a subset in order to avoid performance issues. In our case this subset would contain all transactions concerning tasks being moved into and out of the project.

Event Timeline

Jakob_WMDE raised the priority of this task from to Needs Triage.
Jakob_WMDE updated the task description. (Show Details)
Jakob_WMDE added a project: Phragile.

@Christopher if I remember correctly you already looked into this while working on the sprint extension. It would be great if you could add your thoughts/concerns on this!

ksmith set Security to None.
ksmith moved this task from Backlog to Need Discussion on the Phabricator (Upstream) board.

This request is tied to a specific WMF initiative ("better predictive reporting"), and has funding available for the paid prioritization program.

I can definitely help with this. First, a Project does not contain tasks. A task contains projects. So, perhaps the question/problem is to find "for a task X what projects were assigned to it at time T". Also, refer to https://secure.phabricator.com/T6771 where @epriestley describes their disposition on this kind of query. Tomorrow, I will take a close look at what is possible to extend the queries I have developed in the Sprint extension into conduit.

@Christopher: Thanks for the info and link.

We really need a project-centric query, not a task-centric query, because the question is "what tasks were associated with this project at time T?" We wouldn't know which tasks to ask about, because any task in the system could theoretically have been in our project of interest at some point in time.

Also, I think we'll have better luck upstream with a generic request ("List all transactions of type X that mention project P") than a specific one ("List all tasks that have ever been in project P", or "List the tasks that were in project P at time T"). With the raw transaction data, we can re-construct the history that we need.

From a technical standpoint, would it be possible for some phabricator instances to have indexes that other instances do not? Or must indexes be defined statically at package build time, and thus must apply to all instances?

@ksmith, if you are certain you need something developed upstream and you have the funds for it, the best is to report upstream directly and keep this task as a note to ourselves.

To touch on this from the upstream perspective:

In general, you can implement new Conduit methods as extensions by copy/pasting a similar method, dropping your new file into phabricator/src/extensions/, and restarting Phabricator. Obviously, there are a lot of tradeoffs to doing this, but this is one of the least-fragile ways you can extend Phabricator and pull data out of it. We usually don't want these in the upstream, but I'm happy to sanity check them if you want to make sure nothing policy-violating / horribly bad is happening (see https://secure.phabricator.com/D13066 for a recent example from another install). I mention this just as a baseline alternative since the rest of the upstream perspective involves laying a bunch of foundation.

To this specific request, no transactions exist on projects to record when tasks were added to or removed from a project, so there's no row in the database to query. There are both technical and product reasons for this: long-lived projects would have millions of those transactions, the transactions usually aren't interesting to humans when viewing the project, and other product features like Feed generally better serve most of the use cases for tracking project activity.

The way we want to move burndown charts forward in the upstream is through the Facts application (see https://secure.phabricator.com/T1562). For project adds/removes, the specific plan is to derive these events with a "fact extractor", which is part of the Facts ETL pipeline. I wrote some code for this about a year ago (see https://secure.phabricator.com/T5051#59359), albeit as a standalone extension rather than an extractor, and it seems to work properly (I have an extractor version of this locally):

Screen_Shot_2014-05-16_at_4.20.51_AM.png (1×1 px, 262 KB)

In this screenshot, "scope-expand" events were extracted from projects being added to tasks. "scope-contract" events are not pictured, but extracted from tasks being removed. So the overall plan here is:

  • Facts runs "extractors" continuously in the background, which extract information from updated objects.
  • In Maniphest, the extractors generate events for project adds/removes, among other events.
  • These events are saved to the database in a homogenous, easily query-able format.
  • In Facts, Projects and Conduit, this data can then be queried directly.

We haven't actually built this end-to-end or deployed it on a production dataset, so it's possible there are gaps in the design, but this is the direction we want to pursue first.

So we're happy to prioritize work here, but the upstream pathway forward goes through Facts, and project.gettransactions is likely not useful to resolve this problem (no queryable rows exist to return the data you're interested in), not something we want to pursue as part of the solution to this problem (it would be entirely obsoleted by Facts, which is a vastly more general system), and blocked on foundational work in https://secure.phabricator.com/T5873. If you're still interested in talking about the viability of moving this forward in the upstream despite that, feel free to get in touch.

While I'm here:

Also, I think we'll have better luck upstream with a generic request

This is usually true, but the best request is often a description of the problem you're facing. If you have an idea of how to solve it we'd love to hear that too, but since we have a clearer picture of the Phabricator internals and our roadmap, we can sometimes come up with a better solution if we know what you're ultimately after.

From a technical standpoint, would it be possible for some phabricator instances to have indexes that other instances do not? Or must indexes be defined statically at package build time, and thus must apply to all instances?

This isn't currently possible. You can add new databases and tables, but we synchronize the actual table definitions to the expected definitions when upgrading, including by adding and removing keys. You could manually add the key back each time, or you could maintain a small (~3 line) patch to tell Phabricator to add (and retain) it.

Approximately, that patch is:

diff --git a/src/applications/transactions/storage/PhabricatorApplicationTransaction.php b/src/applications/transactions/storage/PhabricatorApplicationTransaction.php
index a6c1715..e868614 100644
--- a/src/applications/transactions/storage/PhabricatorApplicationTransaction.php
+++ b/src/applications/transactions/storage/PhabricatorApplicationTransaction.php
@@ -118,6 +118,9 @@ abstract class PhabricatorApplicationTransaction
         'key_object' => array(
           'columns' => array('objectPHID'),
         ),
+        'wmfkey_type' => array(
+          'columns' => array('transactionType'),
+        ),
       ),
     ) + parent::getConfiguration();
   }

But you can't do this in an extension today (and I don't anticipate us ever letting you adjust table definitions without local patches).

@epriestly: Thanks for the information and advice.

I'm a bit confused when you say "no transactions exist on projects to record when tasks were added to or removed from a project, so there's no row in the database to query. "

Clearly the data exists in the database, since each task knows when it was added to a project. It's visible in the history whenever you view a task. Are you saying that row (which I assume is a transaction) is stored in a way that is difficult or impossible to get to if you know the project but not the task?

I don't think we can wait for Facts, since we need charts right away. But to understand the concept...would it be able to build a chart of historical data? Or only of data that was touched after the pipeline was installed? https://secure.phabricator.com/T1562 doesn't really explain what Facts is supposed to be. Has your thinking on it evolved any since your comments there?

Creating our own index (and re-creating it after each upgrade) might help us in the short term, so thanks for that sample code.

Are you saying that row (which I assume is a transaction) is stored in a way that is difficult or impossible to get to if you know the project but not the task?

Yeah: the format of the row is approximately <taskID, blobOfJSON, ...>. The blob of JSON has details about the change, including any added or removed projects. You need to decode the JSON to identify which projects a given change affects.

would it be able to build a chart of historical data?

Yes. Facts are extracted from all objects, and then re-extracted when new objects are created or existing objects are updated. You can tell the system to reset and re-process everything later if you, e.g., add new extractors or we fix a bug. This technically works today: bin/facts cursor --reset ManiphestTask will re-extract all tasks, but none of the extractors in HEAD are very interesting (and the extraction process does not run as part of the daemon loadout by default).

Has your thinking on it evolved any since your comments there?

Are you more interested in knowing how it works, or what you can do with it, or something else?

From a technical perspective, it's an ETL pipeline which runs continuously on the data produced by all other applications and converts their data into a generic format, plus a web UI for querying that data.

From a user perspective, it's a web UI that lets you chart anything (for some realistic value of "anything" which is actually "a lot of things") from any Phabricator application by typing some values into a form and hitting "Plot Chart".

From a Phabricator developer perspective, it's a toolset for letting you write a very small amount of code and get flexible charting out of it.

The ETL pipeline approximately works right now, and the web UI exists today in a very crude form, basically two interfaces:

This is where you configure a chart:

Screen Shot 2015-06-02 at 1.30.08 PM.png (728×1 px, 81 KB)

This is where you look at your chart:

Screen Shot 2015-06-02 at 1.31.09 PM.png (805×1 px, 103 KB)

This is obviously extremely crude and not useful, but is basically the intended workflow.

The major differences between this and the final product would be many more options for choosing what to plot and how to plot it (e.g., many more datasets, time ranges, different chart types, putting multiple and derived datasets on the same chart, adding additional constraints like owner, actor, or projects) and integration with other applications (so you could save a chart and put it on a dashboard, and applications like Projects could pull charts -- like burndown charts -- in by default).

Change 216940 had a related patch set uploaded (by Christopher Johnson (WMDE)):
Adds Task Project History Logs and View

https://gerrit.wikimedia.org/r/216940

Change 216940 merged by Christopher Johnson (WMDE):
Adds Task Project History Logs and View

https://gerrit.wikimedia.org/r/216940

Tobi_WMDE_SW renamed this task from Expose a project's removed tasks through Conduit to [Phabricator] Expose a project's removed tasks through Conduit.Jul 6 2015, 12:52 PM

I don't think this task is still "high priority" nowadays?

It's not high priority for Phlogiston.

ksmith lowered the priority of this task from High to Low.Dec 15 2015, 6:58 PM

And from the phragile side, it's no longer high priority for the WMF (at least for now).

I don't know if it's still something the phragile folks really want to see for other reasons, but I'm guessing not, so lowering the priority.

Declining this task as Phragile is not under development or maintained anymore - see T240308#6164990