
Obtain access/directions to use Elasticsearch API powering wikimedia.biterg.io
Closed, Declined (Public)

Description

I am working on T202233. One of my main goals is to let Wikimedia event organizers track developer activity through a tool that queries users' contributions over time. To achieve this, I initially tried fetching developers' activity via the Phabricator and Gerrit APIs. With each, I can get detailed information on every activity. The problem is that Phabricator limits each request to 100 objects (counting every kind of activity: tasks created, subscribed, comments added, etc.). I then discovered that http://wikimedia.biterg.io allows much faster and more detailed queries, and that it uses Elasticsearch internally. With the Elasticsearch API, it is not only feasible to bypass the limit; it is also much faster than the individual APIs. You can take a look at it here. To use Elasticsearch, I would need the following info:

Is there any other way in which these queries can be made via REST APIs?

Event Timeline

@Aklapper @mmodell It would be great if you could help answer @Rammanojpotla's question. It will help us move forward with the project. cc @Tuxology

Aklapper renamed this task from Obtain access/directions to use Elasticsearch API powering Wikimedia Bitergia to Obtain access/directions to use Elasticsearch API powering wikimedia.biterg.io. Jul 7 2019, 2:08 PM
Aklapper removed a subscriber: mmodell.

I don't see how @mmodell has anything to do with this task so I removed them. The task itself is not an Outreach-Programs-Projects task, hence I removed that tag. Also note that "Wikimedia Bitergia" does not exist so I've changed the task summary.

But the problem is that Phabricator has a limit of obtaining 100 objects per request (including every activity like tasks created, subscribed, comments added, etc.).

The result of the first Phabricator query contains an after value in the cursor. You use that value for the after parameter in your follow-up query.
Pagination should not be a problem. If it is, then please mention the specific Conduit API calls which are problematic so someone else can try to reproduce a problem.
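The cursor-following loop described above can be sketched as follows. This is a minimal illustration, not a tested client: `fetch_page` is a hypothetical helper that performs one Conduit call (for example maniphest.search) and returns its "result" dict, which is assumed here to have the shape {"data": [...], "cursor": {"after": ...}}.

```python
from typing import Callable, Iterator, Optional


def iter_results(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every object from a cursor-paginated result.

    fetch_page(after) is a hypothetical helper: it makes one API call,
    passing `after` as the paging parameter (None for the first page),
    and returns a dict assumed to look like
    {"data": [...], "cursor": {"after": "<value>" or None}}.
    """
    after = None
    while True:
        result = fetch_page(after)
        yield from result["data"]
        after = result["cursor"]["after"]
        if not after:  # no "after" value means there is no further page
            break
```

Because each call needs the after value from the previous response, the pages of one query are necessarily fetched one after another.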

Is the ES instance accessible for querying publicly via REST APIs (https://www.elastic.co/guide/en/cloud/current/ec-getting-started-connect.html#ec-getting-started-api)

wikimedia.biterg.io does not use the "Hosted Elasticsearch and Kibana" service whose documentation you linked to.

The result of the first Phabricator query contains an after value in the cursor. You use that value for the after parameter in your follow-up query.

Yup, I have done the same while working on the microtasks of T202233. The script can be found at https://github.com/lalit97/Pygerrit/blob/master/task_statistics.py

Performance will be a big issue for this, for example when a user has subscribed to around 600 tasks. To get their subscription times we have to make 600 API calls.

A simple Python requests call takes around 1 second, so 600 calls will take around 10 minutes to produce the output.

So we can use asynchronous requests (grequests) to call the APIs. This can make the execution around 20 times faster than before.
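As a rough sketch of the idea, independent per-task calls can be issued concurrently. The example below uses the standard library's thread pool instead of grequests, and `fetch_one` is a hypothetical function that performs a single API call for one task ID; neither is taken from the linked scripts.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def fetch_all(task_ids: Iterable[int],
              fetch_one: Callable[[int], dict],
              max_workers: int = 20) -> List[dict]:
    """Run independent per-task calls concurrently.

    fetch_one(task_id) is a hypothetical helper that makes one API call.
    Results are returned in the same order as task_ids.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, task_ids))
```

This only helps when the calls do not depend on each other's responses, which is the case for one call per known task ID.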

image9.png (647×463 px, 70 KB)

Scripts used in the screenshot can be found here
(1) sync requests
(2) async requests

Thanks for the reply, @Aklapper. As you said, the result of the Phabricator query has after and before in the cursor, indicating the next and previous pages. As @lalit97 said, performance will be a severe issue here, and I also cannot make asynchronous requests. I just need to display a count for a user, like in https://meta.wikimedia.org/wiki/Contraband#/media/File:Mockup3.svg. To get the count, I need to send a request to an API (https://phabricator.wikimedia.org/conduit/method/maniphest.search/) that returns the details of the last 100 task subscriptions. Suppose a user has 1000 task subscriptions: the tool needs to make around 10 API requests to find out that the user is subscribed to 1000 tasks, because it has to keep requesting until the API returns no more data. I cannot make these requests asynchronous, because to request the second page the tool needs the after value from the response to the previous request. If I make them synchronous, suppose I need the details of 20 users. In the worst case, assume all 20 users have around 1000 task subscriptions each, so in total there can be 20,000 subscriptions. I then need to make around 20,000/100 = 200 API requests sequentially. In addition, I also need to query the Gerrit APIs. At around 1 second per request this can take 200 seconds or more, so I cannot display the data to users in real time.
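The estimate above can be written out as a small calculation. The page size of 100 and the figure of roughly 1 second per request come from this thread; the function name is only illustrative.

```python
import math

PAGE_SIZE = 100          # objects per Conduit request, as stated in this task
SECONDS_PER_REQUEST = 1  # rough figure quoted earlier in this thread


def requests_needed(subscriptions_per_user: int, users: int,
                    page_size: int = PAGE_SIZE) -> int:
    """Number of sequential page requests for the worst case described above."""
    pages_per_user = math.ceil(subscriptions_per_user / page_size)
    return pages_per_user * users


total = requests_needed(subscriptions_per_user=1000, users=20)
estimated_seconds = total * SECONDS_PER_REQUEST
print(total, estimated_seconds)
```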

On the other hand, I tried performing some API requests from the console of wikimedia.biterg.io. I used some lucene queries to get the data. This is a sample response for a user:

{
  "took": 89,
  "hits": {
    "total": 17
  },
  "aggregations": {
    "2": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Resolved",
          "doc_count": 10
        },
        {
          "key": "Open",
          "doc_count": 4
        },
        {
          "key": "Declined",
          "doc_count": 1
        },
        {
          "key": "Duplicate",
          "doc_count": 1
        },
        {
          "key": "Stalled",
          "doc_count": 1
        }
      ]
    }
  }
}

I can get all of it in the above format. I guess that through this I can get the data in a fast and feasible manner.
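For illustration, a response in the format above could be reduced to per-status counts like this. It is a minimal sketch that assumes the aggregation is named "2", as in the sample; the sample dict below is abbreviated from the response shown.

```python
def status_counts(response: dict, agg_name: str = "2") -> dict:
    """Map each bucket key (task status) to its doc_count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Abbreviated copy of the sample response above.
sample = {
    "took": 89,
    "hits": {"total": 17},
    "aggregations": {"2": {"buckets": [
        {"key": "Resolved", "doc_count": 10},
        {"key": "Open", "doc_count": 4},
        {"key": "Declined", "doc_count": 1},
        {"key": "Duplicate", "doc_count": 1},
        {"key": "Stalled", "doc_count": 1},
    ]}},
}

counts = status_counts(sample)
# The bucket counts add up to the total number of hits.
assert sum(counts.values()) == sample["hits"]["total"]
```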

It would even be fine if there were a way to perform the above API calls asynchronously. Apart from requesting subsequent pages with the after value from the previous response, is there some way to request a page directly, like ?page=2 (a page number)?

The one point that is not clear to me: if I have to request page 10, I have to request pages 1 to 9 first, and only after getting the response for page 9 can I request the 10th page. Is there any alternative, as I mentioned above?

The one point that is not clear to me: if I have to request page 10, I have to request pages 1 to 9 first, and only after getting the response for page 9 can I request the 10th page. Is there any alternative, as I mentioned above?

Maybe we can first create all the URLs and store them in a list. And then call them in asynchronous manner. I have followed this approach while writing this.

The one point that is not clear to me: if I have to request page 10, I have to request pages 1 to 9 first, and only after getting the response for page 9 can I request the 10th page. Is there any alternative, as I mentioned above?

Maybe we can first create all the URLs and store them in a list. And then call them in asynchronous manner. I have followed this approach while writing this.

Thanks for the suggestion, @lalit97, but that is not possible either, because after and before are completely dynamic. They denote task IDs: after means show all the tasks after the corresponding ID, and before means show all the tasks before that ID.

I just need to display the count of a user like https://meta.wikimedia.org/wiki/Contraband#/media/File:Mockup3.svg.

I don't know how "count of a user" is defined. Or how it is calculated. Is there documentation? Please provide a link to where your metrics are defined, how they are defined and calculated, and/or a potential review and discussion whether those metrics are meaningful, so I can have more context.

So, to get the count I need to perform a request to an API (https://phabricator.wikimedia.org/conduit/method/maniphest.search/) that returns me the details of the last 100 task subscriptions. Let's suppose a user has 1000 task subscriptions, the tool needs to make around 10 API requests to know that the user is subscribed to 1000 tasks.

I don't know why you would want to get somebody's task subscriptions as I do not understand how that is relevant for something. Please help me understand which specific information you would like to gather from Phabricator for this tool by linking to documentation which explain which specific information you would like to gather and why.

I used some lucene queries to get the data.

As I don't know which exact data you tried to get and which exact query you ran for that, it's hard for me to comment on anything, without context. Is the sample response that you posted related to gathering "task subscriptions" of some user? I do not know as I don't know your query...

So I'd first like to know which specific data you want to index from which exact sources (Phab, Gerrit, ...), and why (link to some planning document?), as I am concerned that this task could turn out to be an XY problem.

@Rammanojpotla As things are much clearer now, let's take this opportunity ( ⬆ https://phabricator.wikimedia.org/T227397#5311719) to also document each UI element you are proposing and link it to a specific procedure and sample queries to achieve it. This may help @Aklapper better understand what your intentions are. For example, here is a sample you can use as a possible guide to translate high-level actor requirements into more detailed technical requirements and an approach:


Requirement-01: Given a username for Gerrit and a span of time, list the total user contributions for the user

Summary: In order to understand the contribution trend of registered contributors on Gerrit, the tool would provide a UI to input their Gerrit username and a duration for which their contribution summary is needed and provide their Gerrit User Contribution Count (total submitted/accepted/abandoned commits)

Definitions:

  1. Gerrit User Contribution Count: integer denoting the overall number of commits for a given username over a given duration

Input: "Gerrit username" Textbox in Screen A or "username" from each line in CSV
Output: "Gerrit User Contribution Count"

Implementation Note:

  1. Query X with parameters Y and Z and expect a JSON response
  2. Parse JSON response and obtain value P
  3. Use P to make query X and repeat 1 and 2 until the response is empty
  4. Sum the values and populate elements in each row in table 1 in Screen B
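The steps above can be sketched for Gerrit as follows. This is only an illustration under stated assumptions: it relies on Gerrit's documented REST conventions (a `)]}'` prefix before the JSON body, a `_more_changes` flag on the last entry of a page, and a start offset for the next page), and `fetch` is a hypothetical helper that performs one /changes/ query for a given start offset and returns the raw response body. The actual query and time filter are not specified in this task.

```python
import json
from typing import Callable

XSSI_PREFIX = ")]}'"  # Gerrit prepends this line to JSON responses


def parse_gerrit_json(body: str):
    """Strip Gerrit's anti-XSSI prefix, then parse the JSON body."""
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)


def count_changes(fetch: Callable[[int], str]) -> int:
    """Count changes across all pages of one query.

    fetch(start) is a hypothetical helper returning the raw body of one
    page of results, beginning at offset `start`.
    """
    total = 0
    start = 0
    while True:
        changes = parse_gerrit_json(fetch(start))   # steps 1 and 2
        total += len(changes)                        # step 4 (running sum)
        if not changes or not changes[-1].get("_more_changes"):
            return total
        start += len(changes)                        # step 3 (next page)
```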

While this documentation is important, I'd advise you to keep a balance between documentation and implementation/experimentation speed. For example, in some cases implementation notes in this format can be written after the solution as well.

Aklapper changed the task status from Open to Stalled. Jul 9 2019, 8:23 AM

Setting task status to stalled until it is clear and complete which exact information/data is wanted about users' activity in Phabricator and Gerrit.
After this has been clarified, please set the status of this report back to "Open" via the Add Action...Change Status dropdown. Thanks!

It looks like there is now some more context on https://meta.wikimedia.org/wiki/Contraband#Technical_implementation - thanks!

I do not see how assigning a Phab task to yourself is a contribution in itself. To me assignment only translates to "I hereby announce that I plan to work on this".

https://meta.wikimedia.org/w/index.php?title=Contraband&oldid=19195077 also says:

it is also not possible to get perform “authored or assigned” tasks in phabricator, only “authored and assigned” is possible.

Indeed. https://phabricator.wikimedia.org/conduit/method/maniphest.search/ lists both assigned and authorPHIDs keys as constraints.

FYI: As far as I know, Grimoirelab (the software behind wikimedia.biterg.io) does not index any assignment of users to Phabricator tasks. Also see the code at https://github.com/chaoss/grimoirelab-perceval/blob/master/perceval/backends/core/phabricator.py for what is (not) indexed.

Thanks for the feedback!

@Rammanojpotla: Under "Technical Implementation", the Gerrit section says that "All the commits that user put are considered as contributions over gerrit." When you say "commits" do you actually mean "changesets"? You probably don't mean "patchsets"? Maybe the documentation should use the same words as the software to avoid misinterpretation?

Yes, I am using the changesets, not patchsets. I will update the docs. Thanks for the suggestion.

@Rammanojpotla: Under "Technical Implementation", I do not understand what the sentence "The solutions to both the above questions is not proposed by me. It is the way how wikimedia.bitergia.io gets the statistics. I am just following it’s method." means. Does wikimedia.biterg.io somewhere define what 'contributions' mean? Or do I misunderstand? How is wikimedia.biterg.io related at all here and why?

I am not sure whether wikimedia.biterg.io defines it. I performed a simple experiment to track it. I made a query with my name and this is the result in its console.

image.png (632×1 px, 100 KB)

I performed similar requests on phabricator:

authored tasks: https://phabricator.wikimedia.org/maniphest/query/1H.FCmU5Gj8F/#R
assigned tasks: https://phabricator.wikimedia.org/maniphest/query/LChKS25FF2ii/#R

From the above two URLs, you will get a combined count of 20. Two of the 20 tasks are both authored by and assigned to me, so the unique count is 18, which is equal to the value shown in the image above. That is how I came to this conclusion.
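The deduplication amounts to a set union. In the sketch below the task IDs and the 10/10 split between the two queries are hypothetical; only the totals (20 combined, 2 overlapping, 18 unique) come from the experiment described above.

```python
# Hypothetical task IDs standing in for the two saved queries.
authored = {f"T{n}" for n in range(1, 11)}   # T1 .. T10
assigned = {f"T{n}" for n in range(9, 19)}   # T9 .. T18

combined = len(authored) + len(assigned)     # counts the overlap twice
overlap = authored & assigned                # tasks both authored and assigned
unique = authored | assigned                 # each task counted once

assert combined == 20
assert len(overlap) == 2
assert len(unique) == combined - len(overlap) == 18
```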

However, I will discuss this with my mentors and finalize it. Thanks again for the advice :)

Oh, interesting! Thanks a lot for the explanation, makes sense now! You played way more with wikimedia.biterg.io's console than I have. :)

@Aklapper Could you share your thoughts on whether this argument that ElasticSearch is going to be helpful over Phabricator and Gerrit APIs for building this tool is valid? It is here: https://meta.wikimedia.org/wiki/Contraband#Benefits_of_using_ElasticSearch_over_Gerrit_and_Phabricator_APIs. And whether it is possible for us to get access to the ElasticSearch API to do the queries that we need to move forward?

(Off-topic here: Would love to get an answer on-wiki to my question on metawiki why anyone wants to query task assignment (not: task created by an author) in Phabricator Maniphest. That question is off-topic in this task, but I'd still appreciate an answer on wiki.)

Could you share your thoughts on whether this argument that ElasticSearch is going to be helpful over Phabricator and Gerrit APIs for building this tool is valid?
It is here: https://meta.wikimedia.org/wiki/Contraband#Benefits_of_using_ElasticSearch_over_Gerrit_and_Phabricator_APIs.

Thanks to the clarifications above, at least for Phabricator it's also my understanding that using wikimedia.biterg.io might be the way to go.
I don't know enough about Gerrit (and obviously not enough either about wikimedia.biterg.io) APIs.

And whether it is possible for us to get access to the ElasticSearch API to do the queries that we need to move forward?

First of all I guess we have to find out how to use REST API access for wikimedia.biterg.io plus its URL endpoint. I have never played with that before.
So again: If anyone could paste a complete API example query (preferably using curl, not the web console), instead of only results from some unmentioned query, that would be incredibly nice. I cannot look over your shoulders and I do not know much about all this.

As we seem to come from https://discuss.elastic.co/t/access-elastic-search-from-remote-api/189049/2 which mentions Python, maybe https://chaoss.github.io/grimoirelab-tutorial/python/querying.html could also be helpful here?

(Off-topic here: Would love to get an answer on-wiki to my question on metawiki why anyone wants to query task assignment (not: task created by an author) in Phabricator Maniphest. That question is off-topic in this task, but I'd still appreciate an answer on wiki.)

I've responded to your question on meta-wiki: :)

First of all I guess we have to find out how to use REST API access for wikimedia.biterg.io plus its URL endpoint. I have never played with that before.
So again: If anyone could paste a complete API example query (preferably using curl, not the web console), instead of only results from some unmentioned query, that would be incredibly nice. I cannot look over your shoulders and I do not know much about all this.

You can go through this example:

curl -u elastic:password https://CLUSTER_ID.REGION.PLATFORM.found.io:9243/my_index/my_type -XPOST -d '{
"title": "One", "tags": ["ruby"]
}'
{"_index":"my_index","_type":"my_type","_id":"AV3ZeXsOMOVbmlCACuwj","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}

One more important note: for the above curl request to work, the IP address of the tool has to be added to the list of addresses that are allowed to access the Elasticsearch instance, as said here: https://discuss.elastic.co/t/access-elastic-search-from-remote-api/189049/4

So, we might need all the above details specified in the task description:

  1. CLUSTER_ID
  2. REGION
  3. PLATFORM
  4. DOMAIN
  5. PORT
  6. PASSWORD

Along with all the above details, the IP address of the app should also be added to the configuration of the Elasticsearch instance.

@Aklapper I guess you can check out the answer to the question here: https://discuss.elastic.co/t/access-elastic-search-from-remote-api/189049/9. I asked the person about the procedure of how we can access the APIs. One of the community members replied saying that it depends on the people who configured the tool.

You can go through this example:

I am not asking for random examples copied and pasted from random websites.
I asked several times for a specific real query that you actually plan to use.
There is stuff on https://meta.wikimedia.org/wiki/Contraband#Technical_implementation .
I ask how to literally post that stuff/query/"payload": I ask for full and complete command/steps, without any need left for interpretation. (I am aware that we might not know the exact endpoint URL, but the rest of the query should be "real".)

So, we might need all the above details specified in the task description:

No, we do not. Please read again T227397#5311552 to avoid re-posting stuff which makes absolutely no sense here.

Hi everyone!
I had some conversations about this. Which unfortunately might change plans:
We have enrollment data in the database (because a person might have been an active volunteer and then joined WMF or another organization, and as we allow filtering on organizations/enrollments we need to know at which time someone was working for which organization). So there is likely a legal issue here which makes me strongly recommend to use the Phabricator and Gerrit APIs instead which are public and don't include such data. :-/ (I'm not entirely sure though if the API actually allows you to get that data out of the database because... see next section in this comment.)

For completeness, I have fiddled a bit with the API provided by Bitergia after looking at https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-query-string-query.html (as I am also new to this). But I did not succeed so far to create a query that delivers actual data. That's why I asked for a specific query before... I'm posting three examples here - the first one seems to work as intended but I'm still wondering how to fix my incorrect second and third query to actually work.

$:acko\> curl -X GET "https://wikimedia.biterg.io/api/saved_objects/_find?type=index-pattern&search_fields=title&search=*" -u aklapper -H 'kbn-xsrf: true'
{"page":1,"per_page":20,"total":0,"saved_objects":[]}

$:acko\> curl -X GET "https://wikimedia.biterg.io/api/_search" -u aklapper -H 'kbn-xsrf: true' -H 'Content-Type: application/json' -d' {"aggs":{"2":{"terms":{"field":"status", "order":{"_count":"desc"}}}},"query": {"query_string":{"query":"*Rammanojpotla"}}}'
{"statusCode":404,"error":"Not Found","message":"Not Found"}

$:acko\> curl -X GET "https://wikimedia.biterg.io/api/gerrit/_search" -u aklapper -H 'kbn-xsrf: true' -H 'Content-Type: application/json' -d'{"query":{"bool":{"must":[{"query_string":{"query":"*rammanoj"}},{"match_phrase":{"created_on":{"query":"2017-04-14"}}}]}}}'
{"statusCode":404,"error":"Not Found","message":"Not Found"}

Thanks for all the effort, @Aklapper! I had an online meeting with @Tuxology recently. Keeping in mind the timeline of the project, I decided to move forward with Gerrit and Phabricator and started my work with them.

So there is likely a legal issue here which makes me strongly recommend to use the Phabricator and Gerrit APIs instead which are public and don't include such data.

However, it is now clear that we cannot get access to the API of wikimedia.biterg.io. Once again, thanks for the effort.

$:acko\> curl -X GET "https://wikimedia.biterg.io/api/_search" -u aklapper -H 'kbn-xsrf: true' -H 'Content-Type: application/json' -d' {"aggs":{"2":{"terms":{"field":"status", "order":{"_count":"desc"}}}},"query": {"query_string":{"query":"*Rammanojpotla"}}}'
{"statusCode":404,"error":"Not Found","message":"Not Found"}

I tried this as well, but I received the same response. So I thought access is needed to perform this request.

I tried this as well, but I received the same response. So I thought access is needed to perform this request.

Well, the first query won't work without authentication I guess? :P
And the first query provides proper empty output and not a 404 error... but that's passing URL parameters and not curl's --data parameter so I was wondering if you know how to convert the other two queries from using curl's --data parameter into queries using URL parameters.
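One possible direction, offered only as an untested sketch: Elasticsearch itself documents a `source` query-string parameter (used together with `source_content_type`) for clients that cannot send a request body. Whether the endpoint behind wikimedia.biterg.io accepts or forwards it is unknown, and the base URL below is a placeholder, not a real endpoint.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(base_url: str, body: dict) -> str:
    """Encode a search body into the URL via the `source` parameter."""
    params = {
        "source": json.dumps(body, separators=(",", ":")),
        "source_content_type": "application/json",
    }
    return f"{base_url}?{urlencode(params)}"


# Same body as the second curl query above.
body = {
    "aggs": {"2": {"terms": {"field": "status", "order": {"_count": "desc"}}}},
    "query": {"query_string": {"query": "*Rammanojpotla"}},
}

# Placeholder URL: the real endpoint path is not known in this task.
url = search_url("https://example.invalid/_search", body)

# Round-trip check: decoding the URL gives back the original body.
decoded = parse_qs(urlsplit(url).query)
assert json.loads(decoded["source"][0]) == body
```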

Keeping in mind the timeline of the project, I decided to move forward with Gerrit and Phabricator and started my work with them.

Yeah, that makes sense. Sorry that I wasn't very helpful here. :-/

I'm going to decline this task, though I'm still curious myself how to construct an API request to wikimedia.biterg.io that actually gives back data...