How to identify affiliation of indexed GitLab accounts
Open, Medium, Public

Description

Split from T306769:

When Perceval indexes GitLab data, it does not include email addresses, probably because they are non-public PII that would require authentication and specific permissions to pull via the GitLab API.

So all we get in the DB is a bare username without any further info:

"1234567890abcdef1234567890abcdef12345678": {
    "enrollments": [],
    "identities": [
        {
            "email": null,
            "id": "1234567890abcdef1234567890abcdef12345678",
            "name": "SomeName",
            "source": "gitlab",
            "username": "someusername",
            "uuid": "1234567890abcdef1234567890abcdef12345678"
        }
    ]
}

That means that, in contrast to Gerrit, there is nothing that would allow identifying staff or non-volunteer GitLab accounts, as we only have a username.

As a result, our affiliation stats will become less accurate.
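
To get a sense of how many identities are affected, records shaped like the one above could be scanned for GitLab identities that carry no email. This is only a minimal sketch: it assumes the identities data is available as a JSON object keyed by uuid, as in the snippet above, and the function name `gitlab_identities_without_email` is made up for illustration.

```python
import json


def gitlab_identities_without_email(export):
    """Return (uuid, username) pairs for GitLab identities that have no email.

    `export` is assumed to be a dict keyed by uuid, with each value holding
    an "identities" list, as in the record shown in the task description.
    """
    results = []
    for uuid, record in export.items():
        for identity in record.get("identities", []):
            if identity.get("source") == "gitlab" and not identity.get("email"):
                results.append((uuid, identity.get("username")))
    return results


# Sample data copied from the task description, wrapped in an outer object.
sample = json.loads("""
{
    "1234567890abcdef1234567890abcdef12345678": {
        "enrollments": [],
        "identities": [
            {
                "email": null,
                "id": "1234567890abcdef1234567890abcdef12345678",
                "name": "SomeName",
                "source": "gitlab",
                "username": "someusername",
                "uuid": "1234567890abcdef1234567890abcdef12345678"
            }
        ]
    }
}
""")

for uuid, username in gitlab_identities_without_email(sample):
    print(uuid, username)
```

The resulting usernames are the only handle available for any follow-up affiliation check.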

Event Timeline

Would love to have input from folks who better know GitLab here.

Currently in Bitergia's Hatstall I only see the columns Name, email, Username, and Source, and email is always empty when Source is gitlab. So I literally only see a username and nothing else, and that does not allow me to make any judgment on affiliation.

Thus any Organization data on e.g. https://wikimedia.biterg.io/app/kibana#/dashboard/b2218fd0-bc11-11e8-8aac-ef7fd4d8cbad will become increasingly inaccurate.

My first thought here is that username is always LDAP / CAS uid, which maps both to email and to groups that might indicate affiliation. I'm not sure how helpful that is.

Thanks. That means that I could at least run manual checks on https://ldap.toolforge.org/user/username (replace username) which lists email addresses and group membership (such as wmf).
Documented in https://www.mediawiki.org/w/index.php?title=User%3AAKlapper_%28WMF%29%2FBitergia_data_quality_queries&type=revision&diff=5215055&oldid=5205848
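
For a batch of usernames, the manual check could be made a bit less tedious by generating the lookup links up front. A rough sketch, assuming only the URL pattern mentioned above; the helper name `ldap_lookup_url` is hypothetical, and nothing here queries LDAP itself, it only builds links to open by hand.

```python
from urllib.parse import quote

# URL pattern taken from the comment above; the username is appended to it.
LDAP_LOOKUP_BASE = "https://ldap.toolforge.org/user/"


def ldap_lookup_url(username):
    """Build the lookup page URL for one username (percent-encoded)."""
    return LDAP_LOOKUP_BASE + quote(username, safe="")


# Placeholder usernames; in practice these would come from the indexed
# GitLab identities that have no email.
for name in ["someusername"]:
    print(ldap_lookup_url(name))
```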