Understand how to construct queries in wikimedia.biterg.io for the activity of each newcomer
Closed, ResolvedPublic

Description

Investigation for T163440#3428408

How to get the exact names out of the database to query for.

I started with SELECT username,name,source,email,uuid FROM identities WHERE (email="xxxxxxx") OR (source="phabricator" AND (username="yyyy")) OR (source="mediawiki" AND (username="ZZZ")) ORDER BY source,name; on the DB.
So for all rows with source="gerrit" in the output, I took all entries in the "name" column output, went to https://wikimedia.biterg.io/app/kibana#/dashboard/Gerrit , and constructed a search query in the text search field like author_name:"XXXX"

But this was missing some people.

So for a query on a specific author_name in the Overview (to see activity of a user in all sources on one page, except for Mediawiki which is not included in the Overview), it seems that I need to use the "name" field in the "profiles" table instead, and not the username used in that specific source.
Which makes a lot of sense.

So I still need to find out how to pull this data pragmatically in a query.

MariaDB [sortinghat_wm]> SELECT * FROM profiles WHERE name = "Andre Klapper";
+------------------------------------------+---------------+--------------+--------+--------------+
| uuid                                     | name          | email        | is_bot | country_code |
+------------------------------------------+---------------+--------------+--------+--------------+
| 0cf6a9804d8edc5f6cdb3a87849ba6c5eb0dc349 | Andre Klapper | aklapper@... |      0 | NULL         |
+------------------------------------------+---------------+--------------+--------+--------------+

MariaDB [sortinghat_wm]> SELECT id,name,email,username,source FROM identities WHERE uuid = "0cf6a9804d8edc5f6cdb3a87849ba6c5eb0dc349";
+------------------------------------------+---------------------+-------------------+------------------+-------------+
| id                                       | name                | email             | username         | source      |
+------------------------------------------+---------------------+-------------------+------------------+-------------+
| 0cf6a9804d8edc5f6cdb3a87849ba6c5eb0dc349 | NULL                | aklapper@...      | aklapper@...     | bugzilla    |
| 1b5ba8f9ba0fa9e9e40ba981982fe52ff704bb9d | AKlapper (WMF)      | NULL              | AKlapper (WMF)   | mediawiki   |
| 3fc9ae7da372a499d381d0ab693f85a986f04c84 | NULL                | aklapper          | aklapper         | bugzilla    |
| 4b0b75cea0320cd2368993cb964be5e9bc61820f | Andre Klapper       | ak-47@...         |                  | gerrit      |
| 4f41f1401435e6963380338f3ce6e3f4e1a4a411 | Andre Klapper       | ak-47@...         | NULL             | gerrit      |
| 692ad158c742a5c8477af1cbd20e37d31cd72aeb | Andre Klapper       | a9016009@...      | NULL             | git         |
| 8a893f50230b385fa81d81b6473600252c0fdaff | Andre Klapper       | ak-47@...         | NULL             | git         |
| a253d5e81125127cb6c77d2f5fcdeab293a18d23 | Andre Klapper       | andre_klapper@... | NULL             | pipermail   |
| a59d39d9f50703e3c129a1ffa87569c3af213ae9 | Aklapper            | aklapper@...      | aklapper         | gerrit      |
| bb3b35caf8079fd5fad6d8fab810c46a3fb2c979 | Andre Klapper       | aklapper@...      | NULL             | pipermail   |
| bc38a58e97f6c71f515e5831d0e88d33fdcc7dd8 | Andre Klapper (WMF) | NULL              | Aklapper         | phabricator |
| c5e148d20b62de9ab4994c6e9affcac046836b9a | Aklapper            | aklapper@...      | NULL             | git         |
| e263b5c4f8942b9ce81e6efb0b0e3145cb59d2bf | NULL                | NULL              | aklapper@... | bugzilla    |
| f0a5ce57c6416e69e5d4ed3f5f7899470a507501 | Andre Klapper       | aklapper@...      | NULL             | git         |
+------------------------------------------+---------------------+-------------------+------------------+-------------+
14 rows in set (0.00 sec)

So I still need to find out how to pull this data pragmatically in a query.

Was about time to refresh basic knowledge (the different types of JOINs and that syntax is ON instead of WHERE x = y if you don't want to kill your machine).

Steps, basically:

  1. SELECT DISTINCT profiles.name FROM identities LEFT JOIN profiles ON profiles.uuid=identities.uuid WHERE (identities.email="xxxx" OR identities.email="yyyy") OR (source="phabricator" AND (username="XXXX" OR username="YYYY")) OR (source="mediawiki" AND (username="XXXX" OR username="YYYY"));
  2. Use the output (AAAA, BBBB) to construct the term author_name="AAAA" OR author_name="BBBB"
  3. Enter that term in the search bar on https://wikimedia.biterg.io/app/kibana#/dashboard/Overview
  4. Get numbers, bars, colors. Potentially adjust time frame etc. etc.

Example on Overview page for author_name:"Quim Gil" OR author_name:"Andre Klapper"

Problems:

  • profiles.name is not a unique key. So results could look better than they are.
  • Overview page does not offer a widget to "one-click" filter the view for each person. Specific subpages do. Unclear if we could adjust via customizing the dashboard (C_...).
  • Overview page is unreliable in displaying all results ("Could not locate that visualization" error). Reported https://gitlab.com/Bitergia/c/Wikimedia/support/issues/2 . Cumbersome workaround: Go to the specific sources (git, gerrit, ...) pages.
  • Overview page does not include Phabricator Maniphest activity. Could adjust via customizing the dashboard (C_...).
Qgil added a subscriber: Qgil.Sep 14 2017, 7:10 AM

Mmm... is it possible that you are taking a route more complex than needed?

I just went to the wikimedia.biterg.io's main page, entered "Andre Klapper" in the search (with a timeframe for the last 6 months) and I got a page full of results. It is clearly evident that Andre Klapper has been active in different technical spaces in the last six months, up to yesterday. Gerrit included.

For the purpose of checking retention, this looks like enough. Your are going after something more, but I cannot deduce what. :)

Mmm... is it possible that you are taking a route more complex than needed?

@Qgil: It's always possible, but I don't see that yet in this case.

I just went to the wikimedia.biterg.io's main page, entered "Andre Klapper" in the search (with a timeframe for the last 6 months) and I got a page full of results. It is clearly evident that Andre Klapper has been active in different technical spaces in the last six months, up to yesterday. Gerrit included.

That is because you already know that you need to enter exactly Andre Klapper (because that user you search for has their real name set in the profiles database).
Try "Andre klapper" and find nothing. Or my Phab username. Or my Gerrit user name. In short: I cannot know the right search term without running that SQL query first.

Problems:

  • profiles.name is not a unique key. So results could look better than they are.

Not fixable given the design.

  • Overview page does not offer a widget to "one-click" filter the view for each person. Specific subpages do. Unclear if we could adjust via customizing the dashboard (C_...).

The latest version does.

That is because my browser cache played tricks on me. No problems in the latest version. PEBKAC.

  • Overview page does not include Phabricator Maniphest activity. Could adjust via customizing the dashboard (C_...).

The latest version does.

Aklapper closed this task as Resolved.Sep 18 2017, 12:18 PM

I am closing this task as T175854#3605693 is the way to go.
The queries seem to be correctly constructed (do what we want) and this is the workflow to use.

In practice, as I wrote in https://phabricator.wikimedia.org/T163440#3614530 where I posted the links to the queries for Vienna and Montreal, the output data of those queries cannot be fully trusted due to a DB issue I reported in T176135: Filtering on "author_name" in Git can show unrelated activity by other users.

Qgil awarded a token.Sep 19 2017, 9:02 AM