
How to identify affiliation of indexed GitLab accounts
Open, Low, Public

Description

Split from T306769:

When Perceval indexes GitLab data, it does not index email addresses, probably because they are non-public PII: pulling them via the GitLab API would require authentication and specific permissions.

So all we get in the DB is an arbitrary username without any further information:

"1234567890abcdef1234567890abcdef12345678": {
    "enrollments": [],
    "identities": [
        {
            "email": null,
            "id": "1234567890abcdef1234567890abcdef12345678",
            "name": "SomeName",
            "source": "gitlab",
            "username": "someusername",
            "uuid": "1234567890abcdef1234567890abcdef12345678"
        }
    ]
}

That means that, in contrast to Gerrit, there is nothing that would allow identifying staff or other non-volunteer GitLab accounts, as we only have an arbitrary username.
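
To get a list of the affected accounts, something like the following could be used. This is only a minimal sketch, assuming the identities are available as a JSON export shaped like the snippet above; the function name and the sample data are hypothetical and not part of any Bitergia tooling.

```python
import json

def gitlab_identities_without_email(export):
    """Return usernames of GitLab identities that have no email and no enrollments."""
    usernames = []
    for uuid, profile in export.items():
        if profile.get("enrollments"):
            continue  # this profile already has an affiliation
        for identity in profile.get("identities", []):
            if identity.get("source") == "gitlab" and not identity.get("email"):
                usernames.append(identity.get("username"))
    return usernames

# Sample data mirroring the structure shown above (values are placeholders).
sample = json.loads("""
{
  "1234567890abcdef1234567890abcdef12345678": {
    "enrollments": [],
    "identities": [
      {
        "email": null,
        "id": "1234567890abcdef1234567890abcdef12345678",
        "name": "SomeName",
        "source": "gitlab",
        "username": "someusername",
        "uuid": "1234567890abcdef1234567890abcdef12345678"
      }
    ]
  }
}
""")

print(gitlab_identities_without_email(sample))
```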

As a result, our affiliation stats will become increasingly inaccurate.

Event Timeline

Would love to have input from folks who better know GitLab here.

Currently in Bitergia's Hatstall I only see the fields Name, Email, Username, and Source, and Email is always empty when Source is gitlab. So I literally only see a username and nothing else, and that does not allow me to make any judgment on affiliation.

Thus any Organization data on e.g. https://wikimedia.biterg.io/app/kibana#/dashboard/b2218fd0-bc11-11e8-8aac-ef7fd4d8cbad will get more and more wrong.

My first thought here is that username is always LDAP / CAS uid, which maps both to email and to groups that might indicate affiliation. I'm not sure how helpful that is.

Thanks. That means that I could at least run manual checks on https://ldap.toolforge.org/user/username (replace username) which lists email addresses and group membership (such as wmf).
Documented in https://www.mediawiki.org/w/index.php?title=User%3AAKlapper_%28WMF%29%2FBitergia_data_quality_queries&type=revision&diff=5215055&oldid=5205848
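
To make those manual checks a bit less tedious, the lookup URLs could be generated from a list of usernames. This is a minimal sketch: the URL pattern is the one mentioned above, while the function names are hypothetical, and treating membership in the wmf group as a hint for staff affiliation is an assumption rather than a documented rule.

```python
from urllib.parse import quote

# URL pattern taken from the comment above.
LDAP_LOOKUP_BASE = "https://ldap.toolforge.org/user/"

def lookup_url(username):
    """Build the manual lookup URL for one username (percent-encoded)."""
    return LDAP_LOOKUP_BASE + quote(username, safe="")

def looks_like_staff(groups, staff_groups=("wmf",)):
    """Assumption: membership in one of staff_groups hints at staff affiliation."""
    return any(group in staff_groups for group in groups)

for name in ["someusername"]:
    print(lookup_url(name))
```

The group list passed to looks_like_staff would have to be copied from the lookup page by hand; this sketch does not fetch or parse that page.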

Aklapper lowered the priority of this task from Medium to Low. Feb 5 2025, 5:36 PM

Thanks. That means that I could at least run manual checks on https://ldap.toolforge.org/user/username (replace username) which lists email addresses and group membership (such as wmf).

@Aklapper Is there a schema for that upstream Hatstall lookup that we could populate by traversing the backing LDAP directory that holds our Developer accounts? I'm wondering if we could build an easier-to-use ETL pipeline for you. Theoretically we could even do work to add custom schema elements in our Developer account LDAP to track things like "affiliation" if that is a generally useful concept for developer experience metrics that can also be publicly documented.
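
A rough sketch of what the transform step of such a pipeline might look like, purely for discussion. Everything here is an assumption: the shape of the LDAP records, the group-to-organization mapping, and the output rows are invented for illustration and are not the upstream Hatstall schema.

```python
def build_affiliation_rows(usernames, ldap_records, group_to_org):
    """Join GitLab usernames against LDAP data keyed by uid.

    ldap_records: hypothetical dict of uid -> {"mail": str, "groups": [str]}
    group_to_org: hypothetical dict of LDAP group -> organization name
    """
    rows = []
    for username in usernames:
        record = ldap_records.get(username, {})
        groups = record.get("groups", [])
        # Use the first group that has a known organization, if any.
        organization = next(
            (group_to_org[g] for g in groups if g in group_to_org), None
        )
        rows.append({
            "username": username,
            "email": record.get("mail"),
            "organization": organization,
        })
    return rows

# Placeholder data; the wmf -> organization mapping is an assumption.
ldap_records = {
    "someusername": {"mail": "someone@example.org", "groups": ["wmf", "shell"]},
}
group_to_org = {"wmf": "Wikimedia Foundation"}

rows = build_affiliation_rows(
    ["someusername", "unknownuser"], ldap_records, group_to_org
)
for row in rows:
    print(row)
```

Usernames without a matching LDAP record are kept with empty email and organization, so they remain visible for manual review instead of being dropped.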