Page MenuHomePhabricator

Create list of developers eligible to vote on the 2021 board vote
Closed, ResolvedPublic

Description

The organizers of the 2021 Board election need a list of developers eligible to vote. The specific data needed is (((FIXME @tstarling @jrbs what is the data needed for SecurePoll?)))

The new eligibility criteria are divided into two levels. This request is about extracting the data of the developers that meet the criteria at the main level:

  • are Wikimedia server administrators with shell access
  • or have made at least one merged commit to any Wikimedia repos on Gerrit, between 5 January 2021 and 5 July 2021.

Just for context, if there is a way to gather the data of technical contributors meeting any of the additional criteria it would be great, but not required. The Elections Committee plans to handle requests from people meeting these additional criteria manualy:

  • or have made at least one merged commit to any repo in nonwmf-extensions or nonwmf-skins, between 5 January 2021 and 5 July 2021.
  • or have made at least one merged commit to any Wikimedia tool repo (for example magnustools) between 5 January 2021 and 5 July 2021.
  • or have made at least 300 edits before 5 July 2021, and 20 edits between 5 January 2021 and 5 July 2021, on Translatewiki.
  • or maintainers/contributors of any tools, bots, user scripts, gadgets, and Lua modules on Wikimedia wikis.
  • or have substantially engaged in the design and/or review processes of technical development related to Wikimedia.

Details

Due Date
Aug 12 2021, 10:00 PM

Event Timeline

Qgil triaged this task as High priority.Aug 9 2021, 11:45 AM
Qgil set Due Date to Aug 12 2021, 10:00 PM.

I took the liberty of marking this task High priority and setting a deadline for this Friday.

It is not clear to me who should be assigned to this task. I am going to ping SRE and Release Engineering.

@tstarling @jrbs Please edit the description to specify the data needed and any other technical details relevant to produce this list.

hi @Qgil Bitergia has a lot of the info you're looking for

or have made at least one merged commit to any Wikimedia repos on Gerrit, between 5 January 2021 and 5 July 2021.

https://wikimedia.biterg.io/goto/1d62cdd781dbfa9f093dd9047803043e

What info do you need for the folks from the list ^ ?

Likewise for "Wikimedia server administrators with shell access" we can pull this out of modules/admin/data/data.yaml, just a question of what info you're looking for.

hi @Qgil Bitergia has a lot of the info you're looking for

or have made at least one merged commit to any Wikimedia repos on Gerrit, between 5 January 2021 and 5 July 2021.

https://wikimedia.biterg.io/goto/1d62cdd781dbfa9f093dd9047803043e

What info do you need for the folks from the list ^ ?

Needs changeset_status:MERGED too, I think.

Thank you for your quick replies!

  • SecurePoll needs Wimedia usernames (SUL). This might be tricky to extract. Options to do this seem to include checking Phabricator accounts, crossing email addresses and also "there is also some dev<->SUL mapping in the database for toolsadmin." I don't know the best technical answer. Soem developers won't have a Wikimedia SUL account at all, or they may have different email addresses for editing and developing, and it's fine to leave these out.
  • We are sending two mass emails during the election, and for that, we need their email addresses. This shouldn't be as complicated, right?

@Qgil

Here is an alphabetical list of email addresses of all users with shell access:
It's public info pulled out of the public git repo but I still restricted it to WMF to make a little harder for spammers.

P.S. There are many different types of "shell access", they range from just unprivileged access on a single host to global root and everything in between. Also contains contractors that might be just temporary.

{P16990}

If we could produce a list of commit author email addresses for all Gerrit commits merged in the given period, which doesn't seem that hard, then at least anyone left out by the SUL mapping would have an easy manual method for proving eligibility.

P.S. There are many different types of "shell access", they range from just unprivileged access on a single host to global root and everything in between. Also contains contractors that might be just temporary.

I think back at the time this referred to system administrators with root access (although back at the time there wasn't any other kind of access). Given the Gerrit criteria (which also did not exist back then), I doubt it matters much - how many people with shell access don't have a single merged commit from the last half year?

Yes, I agree. Back in the days there was only one type of "shell access" and the term is still used in that sense. Then we split into "deployers" (formerly mortals) and "roots" and from there we got all the separate groups.

maintainers/contributors of any tools, bots, user scripts, gadgets

You might want to pull a list of toolforge and cloud VPS users. Most of them will be tool or bot maintainers. These users should be included and probably more so than the shell users.

Thank you @Dzahn! One more piece of data for this puzzle.

Note that people's names displayed on wikimedia.biterg.io are not necessarily their account usernames in some system (Gerrit, Phab, mworg, etc).
So in P17001 I dumped the list of 429 Gerrit usernames that correspond to the 399 people listed by the wikimedia.biterg.io query in the previous comment.
(∆30 as some folks have >1 entry. Examples: tchan/tchanders; srishakatux/srishtisethi; lucaswerkmeister/lucaswerkmeister-wmde)

We still need to map gerrit username to SUL account for it to be useful for votewiki

The aggregation from Bitergia data should use the "author_user_name" field which corresponds to the patch submitter's Developer account uid (shell username) value. The "author_name" value is functionally freeform text from the developer's git configuration.

The aggregation from Bitergia data should use the "author_user_name" field which corresponds to the patch submitter's Developer account uid (shell username) value. The "author_name" value is functionally freeform text from the developer's git configuration.

https://wikimedia.biterg.io/goto/840d296dc3e3251819f07a63636c6931

<tl:dr>: I dumped 321 SUL accounts that correspond to some of the 399 eligible Gerrit/LDAP accounts into P17004.

I'll repeat what had been said in April discussing eligibility criteria: There is no link (unless manually made by the user) between a user's SUL and Developer account, which will make it tough for SecurePoll to verify who had made contributions with their Developer accounts; there is no real-time way these two are connected.

Today I played a bit with the database behind wikimedia.biterg.io and the Phabricator database.
There might be further data sources out there (IIUC toolsadmin was mentioned) that I'm not aware of, which may help making data more complete (or not).

Based on https://wikimedia.biterg.io/goto/db9c4cdc49ddc4f6a3d905e3c9b11680 (export to CSV), I queried [1] the corresponding 429 Gerrit/LDAP usernames in P17001 (extracted from the wikimedia.biterg.io DB, though there's an easier way, see bd808 in T288455#7276184) and threw them into long SQL queries.

  • wikimedia.biterg.io DB: not helpful:
    • The perceval code does not index SUL accounts that have never made an edit on mediawiki.org (so they are not in the DB).
    • Querying the DB on SUL usernames and corresponding Gerrit usernames both linked to the same profile (=person) lists too much noise / ambiguity. Furthermore it is impossible to see which one is a profile's latest mediawiki.org account (because renames, affiliation changes, etc). Furthermore, there are many separate (disconnected, non-matched) LDAP/Gerrit account profiles vs SUL account profiles in the DB (and so far it's never been a good use of time to identify them and merge them).
  • Phabricator DB: more helpful:
    • Querying [2] the 946 Phab accounts that have both LDAP and SUL linked for those 429 LDAP usernames of 399 eligible folks. There are 124 SUL results.
    • Nicholas had the great idea to query[3] the 480 Gerrit email addresses of the 399 eligible folks from the wikimedia.biterg.io DB and use them to query [4] the user accounts in the Phab DB that use the same email addresses for Phab, to get the linked SUL accounts for those Phab accounts. There were 315 SUL results.

I merged these two query result sets, removed duplicates, and dumped the 321 SUL account results into P17004.
Note there are at least two accounts listed who belong to the same person (Lukas; work and personal). In theory there could also be bot accounts in the result set when not marked as a bot account in Phab or in the Bitergia DB, in practice I think it is unlikely.

Due to pulling email addresses (not: LDAP account names) from one DB and using them to query SUL account names in another DB, I don't see an easy way to offer an actual mapping (SUL–LDAP next to each other). If that was really wanted, I could maybe manually fiddle them together on another day.

No idea if I made any logic mistakes. If you find them, please share them.

Random notes to future selfs:

  • Case-insensitive SQL queries (LOWER() is your friend) can lead to better results.
  • Doublechecking underscores versus whitespaces in stored SUL usernames across different systems can lead to better results.
  • People who follow docs and link both their SUL and LDAP accounts to their Phab account are good people and deserve a hug.
    • As we have Gerrit email addresses of eligible people, people could in theory be contacted and asked to create a Phab account and link both their SUL and LDAP accounts. Then folks with Phab DB access could...run a query [5].

[1] Bitergia - SELECT DISTINCT(i.username),p.name FROM identities i INNER JOIN profiles p ON p.uuid=i.uuid WHERE i.source = "gerrit" AND i.username IS NOT NULL AND (p.name = "ProfileName1" OR p.name = "ProfileName2") ORDER BY i.username;
[2] Phab - SELECT ue1.username AS ldapUsername, ue2.username AS mwUsername FROM phabricator_user.user_externalaccount ue1 INNER JOIN phabricator_user.user u JOIN phabricator_user.user_externalaccount ue2 WHERE ue1.accountType = "ldap" AND ue2.accountType = "mediawiki" AND ue1.userPHID = ue2.userPHID AND ue1.userPHID = u.phid AND u.isSystemAgent = 0 AND u.isDisabled = 0 AND (LOWER(ue1.username) = "LdapUserName1" OR LOWER(ue1.username) = "LdapUserName2");
[3] Bitergia - SELECT DISTINCT(ig.email) FROM identities ig WHERE ig.source = "gerrit" AND ig.username IS NOT NULL AND ig.email IS NOT NULL AND (ig.username = "LdapUserName1" OR ig.username = "LdapUserName2") ORDER BY ig.email;
[4] Phab - SELECT DISTINCT(uea.username) AS mwUsername FROM phabricator_user.user_externalaccount uea INNER JOIN phabricator_user.user u INNER JOIN phabricator_user.user_email ue WHERE uea.accountType = "mediawiki" AND ue.userPHID = u.phid AND uea.userPHID = u.phid AND u.isSystemAgent = 0 AND u.isDisabled = 0 AND (ue.address = "emailaddress1" OR ue.address = "emailaddress2");
[5] Phab - SELECT ue1.username AS ldapUsername, ue2.username AS mwUsername FROM phabricator_user.user_externalaccount ue1 INNER JOIN phabricator_user.user u JOIN phabricator_user.user_externalaccount ue2 WHERE ue1.accountType = "ldap" AND ue2.accountType = "mediawiki" AND ue1.userPHID = ue2.userPHID AND ue1.userPHID = u.phid AND u.isSystemAgent = 0 AND u.isDisabled = 0 AND (LOWER(ue1.username) = "LdapUserName1" OR LOWER(ue1.username) = "LdapUserName2");

One could try to match the various email addresses (Phabricator, Gerrit primary, Gerrit secondary, Wikitech...) against the SUL email. I doubt it's worth the effort though. A list of eligible LDAP accounts is enough for manual verification (people can just do a proof Wikitech edit) and I doubt there will be so many voters eligible via LDAP but ineligible via SUL that those verifications would get unmanageable.

Thank you so much @Aklapper for this investigation and these queries!

@jrbs can we take these queries and run with them or do we need anything else?

Thank you so much @Aklapper for this investigation and these queries!

@jrbs can we take these queries and run with them or do we need anything else?

I think P17004 will cover the vast majority of people so it's probably good enough? It leaves 78 people eligible to vote but not on this list, though as Tim says above many of those 78 will be on the regular SUL list. We could invite those caught in the middle to contact T&S (or similar) for manual verification?

I'll be honest, I am not remotely skilled enough with SQL at the moment to work out those queries :)

tstarling claimed this task.

as Tim says above many of those 78 will be on the regular SUL list

I think you mean @Tgr.

Of the 321 users in P17004, all but 6 were already in the staff list which I imported earlier. The 6 non-staff developers were apparently all qualified by edit count already. The script reported zero users added so I checked the status of those 6 users.

Are there plans to identify the 78 (399-321) remaining developer users? Is this considered important or not? It would help if folks could clearly express which data would be ideal, and which data would be good enough. I don't know about policies (/deadlines) for informing eligible voters. Also, is this a subtask of T287023?