Maniphest backend for Metrics Grimoire
Closed, Resolved · Public

Description

In T28#1182136, @Qgil wrote:

Bitergia has sent us a quote to develop a Maniphest backend for Metrics Grimoire, and to update http://korma.wmflabs.org/browser/its.html and related metrics with the new Phabricator data from December 2014 onwards.

Quim works on the bureaucratic steps. Meanwhile, @Aklapper and @Dicortazar can start working on the details.


Summary of what is wanted (added by @Aklapper on 2015-04-30), to be on a par with what code already exists and might "just" need porting:

Pages currently existing on korma (but out of date due to the switch from BZ to Phab), and what to do about them in a Phab world:

  • bugzilla_response_time.html (custom)
    • Median age graphs = less useful, don't focus on this for the time being
    • Longest time without action = useful; if possible show 50 instead of 20 tickets
    • Longest time without comment = less useful, don't focus on this for the time being
    • Longest time without resolution = less useful, don't focus on this for the time being
  • its.html
    • All data makes sense on that page
  • its-contributors.html
    • seems to show basically everything that is already on its.html (apart from "filtered by tracker" active vs. archive, which looks like a minor detail)?
  • its-repos.html
    • All data makes sense on that page
    • small upstream item: "tracker" seems to be the same as "domain"; wondering whether that's only in our case or a general glitch to fix

Note that further low-priority "wishful thinking" stuff (not "on a par" but enhancements) is in the comments of T28.

Qgil created this task. Apr 16 2015, 10:02 AM
Qgil updated the task description. (Show Details)
Qgil raised the priority of this task from to High.
Qgil claimed this task.
Qgil set Security to None.
Qgil added subscribers: Dicortazar, scfc, kevinator and 5 others.

@Dicortazar: Could you elaborate a bit on the complexity of this task, given the MetricsGrimoire architecture that I am not familiar with?
I'm interested in understanding how modularly this could be approached.
Could certain specific functionality be implemented (functionality that currently requires me to run SQL queries on the Phabricator databases)?
Would a Maniphest backend in MetricsGrimoire always require a dump of the Phabricator databases, with MetricsGrimoire extracting and parsing information from those DBs and then moving it into MetricsGrimoire's own database schemas? I'm wondering how "dominant" the abstraction layer is when it comes to bug database backends in MetricsGrimoire.

@Aklapper, there's an introduction to the Metrics Grimoire toolset in the puppetization of the tools [1]. We may provide extra info about the architecture on the wiki.

Basically there are three main layers:

  1. Retrieval process: where you get as output a MySQL database, used as input by (2).
  2. Analysis process: where you get as output a set of JSON files used as input by (3).
  3. Visualization: e.g. Korma.

We still have to start on this task, but the idea is to go through the Phabricator API and retrieve all of the needed data. This should be updated daily. As a first approach, this will be implemented as a Bicho [2] backend.
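As an illustration of step (1), here is a minimal sketch of how such a backend might call Conduit's maniphest.query endpoint. The endpoint and parameter names follow Conduit's API; the token value and the helper functions are placeholders for illustration, not Bicho's actual code.

```python
import json
import urllib.parse
import urllib.request

CONDUIT_URL = "https://phabricator.wikimedia.org/api/maniphest.query"

def build_payload(api_token, offset=0, limit=200):
    """Form fields for one page of maniphest.query results.

    The token is a placeholder; a real Conduit API token is
    required in practice."""
    return {
        "api.token": api_token,
        "limit": str(limit),    # tasks per page
        "offset": str(offset),  # offset-based paging
    }

def fetch_page(api_token, offset=0, limit=200):
    """POST one request and decode the JSON envelope Conduit returns."""
    data = urllib.parse.urlencode(build_payload(api_token, offset, limit)).encode()
    with urllib.request.urlopen(CONDUIT_URL, data=data) as resp:
        return json.load(resp)
```

The output of this layer would then be normalized into the MySQL schema that layer (2) consumes.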

Second, we need to create the specific queries. This is done through the GrimoireLib product [3]. Given that the final database is likely to use the Bicho database schema (shared by other backends such as Jira or Bugzilla), this shouldn't take that long.

And finally, we would need to add the new panels to the dashboard.

As a first approach, our idea is to provide the same information as found in the Bugzilla part, but this is something to discuss. This could be a good place for requirements or use cases.

Last but not least, if it were impossible to deal with the Phabricator API, then we may want to move to your suggestion: dumps of the Maniphest dataset.

[1] https://raw.githubusercontent.com/MetricsGrimoire/puppet-metricsgrimoire/master/architecture.rst
[2] https://github.com/MetricsGrimoire/Bicho
[3] https://github.com/VizGrimoire/GrimoireLib

Qgil lowered the priority of this task from High to Normal. Apr 24 2015, 10:43 AM
Qgil reassigned this task from Qgil to Dicortazar.

Wikimedia completed the bureaucratic steps. Bitergia can start working and invoicing as agreed.

Hi guys!

Before developing this backend, we started with some research into how to retrieve the information from Maniphest.

Our first idea was to use Conduit. It provides an API to get data about tasks, users, tags, projects, etc. We like this method because it returns JSON objects that are easy to parse. The thing is, we have found some problems with Conduit that could make this task much more difficult or even impossible. These problems are:

  1. There is no way to filter tasks by date. The query method does not have parameters to retrieve the issues starting from a given date. This is a big problem on systems that require pagination and are updated every minute: without a date filter, you can easily fall into infinite loops, querying again and again and retrieving the same info.
  2. There is no sorting in ascending order (as you know well - T88899). This isn't a big blocker, but getting the list of tasks from oldest to newest would make the process easier and faster.
  3. The JSON object returned for any task (see maniphest.info) does not contain useful data such as blocking tasks.

I've been checking the Phabricator source code, and from my point of view 1 and 3 can be easily fixed. My knowledge of PHP is limited, though, so maybe I'm wrong. The other solution would be HTML scraping, but we strongly recommend avoiding it: HTML parsing is hard and time-consuming, it overloads the website, and most of the time, when the page changes, the parser you developed breaks.
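As a rough illustration of problem 1, a paging loop can at least detect when it is re-fetching the same tasks instead of looping forever. A hypothetical sketch (fetch_page is a stand-in for a real Conduit call, not existing Bicho code):

```python
def paginate(fetch_page, page_size=200, max_pages=500):
    """Yield tasks page by page, stopping when a page yields nothing new.

    Without a date filter, re-running a paged query over data that
    changes between requests can return the same tasks repeatedly;
    tracking seen ids turns a potential infinite loop into a clean
    stop. fetch_page(offset, limit) is assumed to return a list of
    task dicts with an 'id' key."""
    seen = set()
    for page in range(max_pages):
        tasks = fetch_page(offset=page * page_size, limit=page_size)
        fresh = [t for t in tasks if t["id"] not in seen]
        if not fresh:  # either exhausted or looping over repeats
            return
        seen.update(t["id"] for t in fresh)
        yield from fresh
```

This only mitigates the symptom; a proper date filter in the query method would avoid re-fetching unchanged tasks in the first place.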

@Dicortazar, do you want to add anything else?

  • There is no way to filter tasks by date. The query method does not have parameters to retrieve the issues starting from a given date.

Do you refer to creation date or to modification date?

  • There is no sorting in ascending order (as you know well - T88899). This isn't a big blocker, but getting the list of tasks from oldest to newest would make the process easier and faster.

Upstreamed as https://secure.phabricator.com/T7909 after IRC chat with upstream maintainers. Feel free to CC yourself.

  • The JSON object that returns the information about any task (see maniphest.info) does not contain useful data such as blocking tasks.

True, Conduit output only includes such info the other way round, via the dependsOnTaskPHIDs parameter (though in the database itself two rows are actually created in phabricator_maniphest.edge, one for 'Blocks' and one for 'Blocked By', so it's just Conduit's API).
But are "blocking tasks" currently a use case for Wikimedia / do we (want to) analyze anything that requires such information?
Any other specific "useful data" in mind?
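As a side note, the missing "blocks" direction can be reconstructed client-side by inverting dependsOnTaskPHIDs across all retrieved tasks. A rough sketch (field names as in maniphest.query output; this is not part of any existing tool):

```python
from collections import defaultdict

def build_blocks_index(tasks):
    """Invert the 'blocked by' relation that Conduit does expose.

    Each task dict is assumed to carry 'phid' and 'dependsOnTaskPHIDs';
    the result maps every PHID to the tasks it blocks, i.e. the
    direction Conduit does not return directly."""
    blocks = defaultdict(list)
    for task in tasks:
        for dep in task.get("dependsOnTaskPHIDs", []):
            blocks[dep].append(task["phid"])  # dep blocks this task
    return dict(blocks)
```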

  • As a first approach, our idea is to provide the same information as found in the Bugzilla part, but this is something to discuss. This could be a good place for requirements or use cases.

I think the Maniphest data turned into metrics that we are interested in is pretty much covered by T28: Decide on wanted metrics for Maniphest in kibana (which I should clean up and summarize in the task description there, to save you a few hours of reading comments).
I've pasted the corresponding raw SQL queries on the Phab DB in T28. They might not be helpful for you at all, but for the time being (without a Maniphest backend in Grimoire) they would give us a way to gather the wanted data quick'n'dirty, even without Grimoire (and hence without a graphical representation). If we wanted. :)

  • There is no way to filter tasks by date. The query method does not have parameters to retrieve the issues starting from a given date.

Do you refer to creation date or to modification date?

Modification date

  • There is no sorting in ascending order (as you know well - T88899). This isn't a big blocker, but getting the list of tasks from oldest to newest would make the process easier and faster.

Upstreamed as https://secure.phabricator.com/T7909 after IRC chat with upstream maintainers. Feel free to CC yourself.

Thanks. I've been reading @epriestley's response, which provides some pointers on how these features could be added to Conduit. It doesn't look hard, but as my PHP and Phabricator knowledge is limited, I can't guess how much effort would be required to perform this task.

  • The JSON object that returns the information about any task (see maniphest.info) does not contain useful data such as blocking tasks.

True, Conduit output only includes such info the other way round, via the dependsOnTaskPHIDs parameter (though in the database itself two rows are actually created in phabricator_maniphest.edge, one for 'Blocks' and one for 'Blocked By', so it's just Conduit's API).

This was just a request to make things faster, but as you said, we can get the info the other way around.

But are "Blocking tasks" currently a usecase for Wikimedia / do we (want to) analyze anything that requires such information?

Not yet, but in a possible future scenario, for instance, we could use this information to know which issues need to be closed first, the median time that issues spend blocked/blocking, or the types of issues that most often block others.

Any other specific "useful data" in mind?

Commits, for instance. Having a record of which commits fixed which issues would be really helpful for further studies.

Qgil moved this task from Backlog to Need Discussion on the ECT-April-2015 board.
Aklapper updated the task description. (Show Details) Apr 30 2015, 9:08 PM

I have updated the summary to describe which metrics are definitely wanted and which parts are not to be investigated at this stage.

Aklapper updated the task description. (Show Details) Apr 30 2015, 9:21 PM

We have developed the first version of the Maniphest backend. This means we're now ready to collect data.

My estimate is that it will take around two and a half days. It retrieves about 2000 tasks per hour, so if everything goes well, getting 100K tasks will need more than two days.

We retrieve the tasks in sets of 200. Each task requires one extra request to the server to get its list of transactions. Additionally, extra trips are needed to collect information about users and projects not yet cached. With these numbers, my impression is that we won't be banned: there are not many requests to make, and not that often, so it won't overload the system.
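For what it's worth, the two-and-a-half-day estimate is straightforward arithmetic on those numbers:

```python
tasks_total = 100_000    # rough number of tasks to retrieve
tasks_per_hour = 2_000   # observed rate (sets of 200, plus extra trips)

hours = tasks_total / tasks_per_hour
days = hours / 24
print(f"{hours:.0f} hours ~= {days:.1f} days")  # prints: 50 hours ~= 2.1 days
```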


can you tip us off when this process begins so we can be on the lookout?


@chasemp, we already started. If you find any problem, please let me know to stop the process.

chasemp added a subscriber: mmodell. May 8 2015, 3:01 PM


understood and thanks

@mmodell just fyi ^

We were able to collect around 71,700 tasks, but now we're stuck because we're constantly getting 503 errors.

The oldest issue we retrieved was T25551 (we sort by modification date).

This is an example of the errors we're getting.

Request: POST http://phabricator.wikimedia.org/api/maniphest.query, from 10.64.0.172 via cp1044 cp1044 ([10.64.0.172]:80), Varnish XID 816136382
Forwarded for: 82.158.176.175, 10.64.0.172
Error: 503, Service Unavailable at Mon, 11 May 2015 16:10:56 GMT

You can reproduce this error using the maniphest.query method from Conduit.

Any clue about what's going on? Should I report this by opening a new task?

Offset-based paging costs approximately O(offset) (we must load, policy-filter, and then discard all results to compute the offset), so the cost of the queries has probably become very large.

The long-term solution is to switch this method to cursor-based paging, which has cost closer to O(1), but will require an API break (see https://secure.phabricator.com/T7909).

A possible workaround would be to use ids explicitly, although that won't necessarily work if you depend on ordering by modification date.
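That explicit-ids workaround could be sketched as follows: walk fixed-size id windows (roughly constant cost per request, unlike large offsets), accepting that results arrive in id order rather than modification order. fetch_ids is a hypothetical helper wrapping maniphest.query's ids parameter, not existing code.

```python
def fetch_by_id_windows(fetch_ids, max_id, window=200):
    """Yield tasks window by window over the raw id space.

    fetch_ids(ids) is assumed to return only the tasks that actually
    exist among the requested ids, so gaps (ids with no task) simply
    produce smaller result sets rather than errors."""
    for start in range(1, max_id + 1, window):
        ids = list(range(start, min(start + window, max_id + 1)))
        yield from fetch_ids(ids)
```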

An update about this: we kept working on the following steps while we look for a better approach to the retrieval process.

Thus, we only have data going back to 2011. As for the most up-to-date dataset, we're now working on integrating this new backend into the toolset.

A good point about having both data sources, Bugzilla and Maniphest, is that this will help us compare them and look for inconsistencies.

Another update: we already have a set of JSON files to be visualized. Working now on the viz part :).

Work ongoing, hence adding ECT-June-2015; looking forward to the results!

Working now on the viz part :).

@Dicortazar: Any new updates to share about the status here (as it's the middle of the month)?

You should see a new entry in the main korma page with Maniphest activity.

On that note, we're updating the dataset; you'll notice that it contains data up to the 25th of May.

Qgil added a comment. Jun 18 2015, 7:57 AM

Good! Currently there are two separate graphs for Bugzilla and Maniphest tickets. Is it possible to merge the data of both in single graphs?

Qgil moved this task from Backlog to Doing on the ECT-June-2015 board. Jun 18 2015, 9:10 AM
Dicortazar added a comment. (Edited) Jun 23 2015, 10:44 AM

Hi,

This panel contains updated information, and it will be refreshed whenever the general korma dashboard is updated. This means it is in production.

Then, there are some limitations, as we saw at the beginning of the development of the backend for Bicho. The first is that this new panel contains info about tickets modified after 2011. For instance, the closing info found in the new panels comes from the set of tickets whose last modification took place in 2011 or later.

So, I'd say that the backend is ready to go and this ticket should be closed. As for the rest of the history, and given that this is not a backend problem but a Maniphest API issue, we're trying to retrieve the rest of the info through a sequential process, going ticket by ticket.

Comments are welcome :)

@Qgil, in order to have all of the information together, we're finally retrieving all of the tickets from Phabricator. This means that at some point the Bugzilla information could be deprecated, since all of the info is in Maniphest.

Regarding the retrieval process, it seems that there are some bunches of "empty" ticket ids. By this I mean that there are empty slots of ticket ids when retrieving all of them sequentially. As an example, and as far as I remember, between ticket id 1,800 and ticket id 2,000, Maniphest returns nothing.

It's not a big deal, but we're working on it :).

I don't know whether we want to close this ticket, given that the panel is ready to go, and keep working on another one.

Comments?

Aklapper closed this task as Resolved. Jun 25 2015, 3:27 PM

The backend is in place! ♥! Thanks a lot!
So I'm closing this task (as discussed in our meeting today).

Regarding the retrieval process, it seems that there are some bunches of "empty" ticket ids. By this I mean that there are empty slots of ticket ids when retrieving all of them sequentially. As an example, and as far as I remember, between ticket id 1,800 and ticket id 2,000, Maniphest returns nothing.

Let's deal with specific issues in dedicated follow-up tasks - feel very free to create one for such retrieval hiccups. Thanks!

at some point the Bugzilla information could be deprecated, since all of the info is in Maniphest.

Now covered in T106037: "Tickets" (defunct Bugzilla) vs "Maniphest" sections on korma are confusing.

Note that upstream now allows ascending order via the "order" parameter of Conduit's maniphest.query (order-created, order-modified). See https://secure.phabricator.com/T7909
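A sketch of the corresponding query payload (order values as listed in the upstream change; the token and function are placeholders for illustration):

```python
def ordered_query_payload(api_token, order="order-created", limit=200):
    """maniphest.query form fields using the now-supported ascending
    orders ('order-created' / 'order-modified', per upstream T7909)."""
    if order not in ("order-created", "order-modified"):
        raise ValueError(f"unsupported ascending order: {order}")
    return {"api.token": api_token, "order": order, "limit": str(limit)}
```

With oldest-first ordering, an incremental fetch can simply resume from the last modification date it saw, which addresses problem 2 from the original list.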