Page MenuHomePhabricator

Explain / sort out / fix SCR repository number mismatch on korma
Closed, ResolvedPublic

Description

As of today,

I'd expect at least these numbers to be the same. As they are not, I'd love to have an explanation.

Furthermore, what is the relation of these numbers to:

Event Timeline

This seems to be an error when counting activity in empty repositories. There may exist Gerrit repositories with no activity even when they are tracked by the Gerrit server.

In any case we'll re-check these numbers carefully.

I'm re-assigning this task to @Lcanasdiaz

The panel currently says: 1,078 repositories .. the list of repos shows trends for 1019 repositories

mysql> select count(*) from trackers;
+----------+
| count(*) |
+----------+
|     1277 |
+----------+
1 row in set (0,00 sec)
mysql> select count(distinct tracker_id) from issues;
+----------------------------+
| count(distinct tracker_id) |
+----------------------------+
|                       1181 |
+----------------------------+
1 row in set (0,98 sec)

This URL which gets the list of SCR repos since 2013 returns 1142

mysql> SELECT t.url as name, COUNT(DISTINCT(i.id)) AS issues
FROM  issues i, trackers t
WHERE i.tracker_id = t.id AND
i.submitted_on >= '2013-05-01' AND i.submitted_on < '2015-11-21'
GROUP BY t.url ORDER BY issues DESC;

mysql> SELECT COUNT(DISTINCT i.tracker_id) as repositories FROM issues i, trackers t WHERE i.tracker_id = t.id and i.submitted_on >= '2013-05-01';
+--------------+
| repositories |
+--------------+
|         1142 |
+--------------+
1 row in set (0,22 sec)

Something is really wrong with the queries ..

Now I get the problem, the queries above do not include the bot and subject filtering. So, we're missing 59 repositories .. I'm still looking for an answer

SELECT count(distinct(t.id)) as repositories 
FROM sh_mediawiki.profiles pro , issues i , people_uidentities pup , trackers t 
WHERE  i.submitted_on>='2013-05-01' AND  i.submitted_on<'2015-11-20' 
AND pro.name <> 'wm-bot' and pro.name <> 'jenkins-bot' and pro.uuid = pup.uuid and i.submitted_by = pup.people_id 
and pro.name <> 'Translation updater bot' and pro.name <> 'gerrit-wm' and pro.name <> 'wikibugs_' 
and pro.name <> 'Jenkins-mwext-sync' and pro.name <> 'wikibugs' and pro.name <> 'L10n-bot' 
and i.summary not like '%WIP%' and i.tracker_id = t.id and i.id not in
                       (select distinct(i.id) as issue_id
                       from issues i,
                            changes ch
                       where i.submitted_by = ch.changed_by and
                             i.id = ch.issue_id and
                             ch.field = 'status' and
                             ch.new_value = 'MERGED')
                     and pro.name <> 'Wikimedia Jenkins Bot' and pro.is_bot<>1

.. the query that gets the list of repos is not using this filters, so we generate the JSON files for all the repositories. And .. it seems there is another query which generates the final list of repos to be shown in the list and this query .. what filters is using?

Confirmed, the main issue is the difference between the filters used to get the scr-repos.json and the ones used to the scr-static.json

Fixed in upstream product version.

New version of the library deployed. Waiting results ..

So far the numbers on korma.wmflabs.org have not changed. :)

I merged some local changes we had and it seems I ruined my own fix. Working on it

v_v

Producing new data. The 2nd bug was related with the period of time which was overwritten for some studies.

Final test passed. We're resuming the gathering process and generating the new data. It should be in korma in a few of hours.