Provide data of Gerrit contributors from Bitergia DB to interview developers for user research
Closed, Declined | Public

Description

Context

In efforts to inform generative research to improve and better understand the developer experience of our technical community, the developer experience team is looking to reach out to interview developers from various stages of their involvement.

Request

We are requesting contact information for developers in the following categories, based on the community metrics dashboard:

  1. Inactive Developers
  2. New Developers
  3. Retained Developers*

*The third category (tentatively called retained developers) represents developers who have been here long enough to encounter the obstacles that discourage other developers but have nevertheless continued to contribute code. This group need not be senior developers; by analogy, they would probably be classified as sophomores.

Ideally the output would be 100 developers from each group, ordered by recency of first contribution.

Context information that would be useful:

  • Dates active
  • Projects to which they contributed
  • Name
  • Demographics
  • WMF staff affiliation

Event Timeline

For clarity: This task is a result of an email exchange. In this very comment, I'll only summarize what I _already covered_ in emails:

This looks a little bit related to 2017/2018 work in https://www.mediawiki.org/wiki/New_Developers/Quarterly performed by @srishakatux.

Some general info, for everyone stumbling into this ticket:

  • I'm not aware of any dashboards measuring on-wiki code activity (for the record, https://wikimedia.biterg.io currently only indexes on-wiki activity on the site mediawiki.org but does not allow filtering by namespace or such). This is a pain point. For the bigger picture, see https://www.mediawiki.org/wiki/Developer_Advocacy/Metrics#Technical_Contributors_Map which might provide an overview and some ballpark numbers from about two years ago.
  • On https://wikimedia.biterg.io, the author_name is unfortunately pretty random; it gets initially set by whatever name was used by a person in a data source (data source examples: Gerrit, Github, Phab, Mediawiki.org, etc). The author name can be manually edited by folks who can access the database behind (that's me), and I sometimes do that for clarity.
    • "Extracting" wiki usernames or email addresses attached to a profile requires database access to run SQL queries (which I can do, technically). When it comes to email addresses used by people in Gerrit, these are exposed to anyone with a Gerrit account who uses the autocomplete search function. (Just stating.)
  • On https://wikimedia.biterg.io, apart from Wikimedia Gerrit repos, we also index a bunch of Github and GitLab repos. The full list is at https://gitlab.com/Bitergia/c/Wikimedia/sources/blob/master/projects.json
  • We have a custom "Gerrit > Retention of Newcomers" dashboard which allows enabling the "Leaving newcomers only" filter. You can filter on affiliation (e.g. to exclude former staff) and set the timeframe in the upper corner to "Last 5 years" or so. Afterwards, you'd see a list of names in the "Contributions per author" widget, sorted by contributions (for performance reasons, the widget only lists 100 entries but I could expand that if wanted). These names could be exported as CSV; someone with DB access (me) could then query further data if wanted.
    • Caveat: "inactive" here is hardcoded to "not active within last 90 days", so a listed person could have been inactive for, say, 4.7 years. I have not checked if there might be a way to only list the most recent 100 folks. (Obviously the last activity date must be stored in the DB to create that existing list but I'd have to dig and understand more if I could add that field as a sorting parameter/column to the widget).
    • Caveat: Probably way less important: There are also a few folks in that list who at some point got banned due to misbehavior etc. Just saying.

This will be tricky, I am not sure all items listed above can be fulfilled ("Projects to which they contributed", exact "Dates active") - it'll either require me to do quite some digging to find DB indexes (note to myself: check T253976#6622674 again) and construct queries, or maybe another option could be to create a support ticket with Bitergia who provide wikimedia.biterg.io to us.

Two quick questions though: What does "Demographics" mean exactly? (Note that there is no data on geographical location, age, or aspects of one's identity, simply because there are no interfaces where people could have entered such information, but maybe I misinterpret the word.)
"Dates active" means both first and last?

Everything beside e-mail address and user category is a nice-to-have. It helps with outreach to know the context.

We don't need a comprehensive list of dates or projects. It just seems like dates of activity would be a prerequisite for determining active/inactive status. Something like "last active in 2017" or "joined 2 weeks ago." If those activity dates are related to submitting a patchset for a particular project I'd be curious about which project it was. Demographics would be useful for balancing the study but are not necessary. Author name and WMF affiliation are already in the Gerrit dashboard so I'm assuming that's not an obstacle.

Hi, I currently don't know (yet) exactly how to get something like "last active in 2017" or "joined 2 weeks ago" out of the DB.
Due to more urgent CTA stuff on my list I might not look into this ticket soon. Also, after a local distro upgrade that bumped sqlalchemy from 1.3 to 1.4 (with no easy way to downgrade, as its older package versions are compiled against Python 3.9 instead of 3.10), I am currently unable to run SQL queries locally because sortinghat now fails when importing a DB dump (pymysql.err.IntegrityError: (1062, "Duplicate entry 'Canonical' for key '_name_unique'")). :(
I've reported this problem to our partners at Bitergia (non-public ticket) and they have reproduced this problem.

For the record, I'm still blocked on the SQLAlchemy 1.4 issue above and have asked Bitergia again for guidance.

I worked around technical issues by using pyenv to get a parallel local installation of Python 3.9 and downgrading SQLAlchemy, however...

@sdkim: We are currently unable to have meaningful metrics about bot, tool, and script developers as this happens mostly outside of Wikimedia Gerrit - see https://www.mediawiki.org/wiki/Developer_Advocacy/Metrics#Technical_Contributors_Map . Thus the research approach has shifted.
Is my understanding correct, and should this task be declined?

Hi @Aklapper, we are still interested to get the list of MediaWiki core or extension contributors. This task should remain open given the original ask if you do have the capacity for it.

Investigation would become a bit easier if someone could please provide specific and clear criteria which result(s) are ideally wanted.
Looking at this ticket again, I for example wonder how "new" are "new developers"? How "long" is "long enough" in "Retained Developers"? etc
Ideally, the request would read something like "a list of Gerrit usernames of the 100 most active volunteer contributors in Gerrit who have contributed at least one patch (either still open, merged, or abandoned) in any codebase (deployed or not on Wikimedia servers) in Gerrit within the last 24 months, sorted by number of patches per contributor", or similar.

Sure thing, let me know if this is descriptive enough.

Requesting a list of the most recent 100 users per section with the following attributes:

  • Email address
  • Total Contributions (# of patchsets contributed)
  • Projects to which they contributed

New Developers (aka Active Newcomers)

  • Independent (Non-WMF/Affiliate staff, Contracted Organizations)
  • I am matching "Newcomer" and "New Developer" definitions
  • Every Gerrit user whose first patchset was made within the last 90 days*
    • *Given there have been a lot of holidays in the past 90 days, I would request this timeframe to be defined as August - October 2021

Inactive Developers

  • Independent (Non-WMF/Affiliate staff, Contracted Organizations)
  • Any leaving newcomer whose last contribution was more than 12 months ago

Retained Developers

  • Independent (Non-WMF/Affiliate staff, Contracted Organizations)
  • Developers who have made the most contributions over the past 24 months & have actively contributed in the last 60 days

Note: These definitions may change depending on the results. I don't expect all categories to return 100 entries but we don't know until we try?

Requesting a list of the most recent 100 users per section with the following attributes:

  • Email address
  • Total Contributions (# of patchsets contributed)
  • Projects to which they contributed

For the last bullet point I would not know how to output that. The data is available in the DB but you'd have to check manually on a per-person level on https://wikimedia.biterg.io/app/kibana#/dashboard/Gerrit by clicking on each account name to filter (drill down) the results in the "Repositories" panel.

New Developers (aka Active Newcomers)

  • Independent (Non-WMF/Affiliate staff, Contracted Organizations)
  • I am matching "Newcomer" and "New Developer" definitions
  • Every Gerrit user whose first patchset was made within the last 90 days*
    • *Given there have been a lot of holidays in the past 90 days, I would request this timeframe to be defined as August - October 2021

That would be 67 independent contributor accounts in that timeframe, not 100:
Using the Gerrit retention of newcomers dashboard at https://wikimedia.biterg.io/goto/f792059a948e638a0fca92af68b7cfed and setting the time frame to Aug-Oct 2021 (this means: accounts that first became active in Gerrit within that timeframe) and filtering on author_org_name:"Independent", the numbers in the Contributions per author panel have *no relation* to that Aug-Oct 2021 timeframe (plus "contributions" in that panel also include e.g. commenting on patchsets, reviewing patchsets, or creating an updated patchset within a changeset). One could take the 67 usernames from the panel and create a custom query (author_name:"A" OR author_name:"B") on https://wikimedia.biterg.io/app/kibana#/dashboard/Gerrit , but those usernames will *not* be a list of "the 100 contributors who created the *most* patchsets or changesets" or such.
https://wikimedia.biterg.io/goto/f792059a948e638a0fca92af68b7cfed
Maybe that is fine, as you wrote "Total Contributions (# of patchsets contributed)" in the previous comment, without any timeframe limitation.

It's unclear to me if "Total Contributions (# of patchsets contributed)" above only refers to patchsets or not. If it does:

If you are only after patchset submitters (not *any* contributions in Gerrit), an option could be using https://wikimedia.biterg.io/app/kibana#/dashboard/Gerrit and filtering on author_org_name:"Independent", plus setting "Add a filter > Edit Query DSL" to {"range": {"demography_min_date":{"gte": "now-6M/d", "lt": "now-3M/d"}}} (means: accounts that first became active in Gerrit between 3 and 6 months ago), and checking the 36 results under "Submitters".
https://wikimedia.biterg.io/goto/4fc8d358580ad8f52ddad34fe7d7cf28
Again, numbers displayed do not refer to any timeframe.
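
As a rough illustration of the filter described above, the range clause can be generated instead of typed by hand, which makes it harder to mix up the two bounds: `gte` has to carry the older date and `lt` the more recent one. This is only a minimal Python sketch. The helper name `first_active_between` is made up for this example; only the `demography_min_date` field and the date-math strings come from the comment above.

```python
import json


def first_active_between(older_months: int, newer_months: int) -> dict:
    """Build a range clause for accounts first active between two ages in months.

    Hypothetical helper: the field name is taken from the thread, the rest is
    an assumption about how one might assemble the clause.
    """
    if older_months <= newer_months:
        raise ValueError("older_months must be larger than newer_months")
    return {
        "range": {
            "demography_min_date": {
                "gte": f"now-{older_months}M/d",  # lower bound: further in the past
                "lt": f"now-{newer_months}M/d",   # upper bound: closer to today
            }
        }
    }


# Clause for "first active between 3 and 6 months ago", ready to paste
# into "Add a filter > Edit Query DSL".
print(json.dumps(first_active_between(6, 3)))
```

Passing the bounds in the wrong order raises an error here instead of silently producing a range that can never match.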

Inactive Developers

  • Independent (Non-WMF/Affiliate staff, Contracted Organizations)
  • Any leaving newcomer whose last contribution was more than 12 months ago

Similar to what I wrote above, I'd say you could go for a custom query on https://wikimedia.biterg.io/app/kibana#/dashboard/Gerrit and filtering on author_org_name:"Independent", plus set "Add a filter > Edit Query DSL" to {"range": {"demography_max_date":{"lt": "now-1y/d"}}}.
The "Submitters" panel will show lots of names as Gerrit has been up since 2012 (but for performance reasons the output is limited to 100 accounts), so some folks may have stopped being active many years ago.
https://wikimedia.biterg.io/goto/7d3db61369725d328bb6dda5432259c2

Retained Developers

  • Independent (Non-WMF/Affiliate staff, Contracted Organizations)
  • Developers who have made the most contributions over the past 24 months & have actively contributed in the last 60 days

I'm not sure I understand what "most" means; in any case I don't see a way to programmatically express that the number of contributions 24 months ago must be higher than the number of contributions in the last 60 days.
Similar to above, that would be setting two separate filters {"range": {"demography_min_date":{"lt": "now-2y/d"}}} and {"range": {"demography_max_date":{"gte": "now-2M/d"}}}
https://wikimedia.biterg.io/goto/51fcaa287264eb5d10d5f12da5c9eb4d
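
For readers who prefer a single clause over two separate filters, the same conditions can be expressed as one Elasticsearch `bool` query with several `filter` entries. The sketch below is an assumption-laden illustration, not what the dashboard link uses: the function name `retained_filter` is invented, and whether the Kibana filter editor on this instance accepts a combined `bool` clause has not been checked. The field names and date-math strings are the ones quoted above.

```python
import json


def retained_filter(org: str = "Independent") -> dict:
    """Combine the affiliation filter and both date ranges into one bool clause."""
    return {
        "bool": {
            "filter": [
                # Same affiliation restriction as the dashboard search box.
                {"query_string": {"query": f'author_org_name:"{org}"'}},
                # First activity more than two years ago.
                {"range": {"demography_min_date": {"lt": "now-2y/d"}}},
                # Last activity within the past two months.
                {"range": {"demography_max_date": {"gte": "now-2M/d"}}},
            ]
        }
    }


print(json.dumps(retained_filter(), indent=2))
```

All entries under `filter` must match, so this selects accounts that started more than two years ago and were still active in the last two months. It says nothing about how many patchsets they submitted in between.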

In any case, please do note that results can (and do) contain current staff if current staff was not yet staff in the timeframe covered by a query, and people who also have a staff role but separate their volunteer contributions.
(And as mentioned before, in a later step, getting any email addresses would require running an SQL query on the SQL database.)

With Seve departing from WMF, I'll be assuming a more active role in this thread going forward.

I'll take some time this weekend to address these questions individually but I want to reiterate that we need some way to contact the developers matching these queries. That's the whole point: contacting developers for research recruiting. E-mail is best but at the very least we would need a meta, mw or other username to reach out through a talk page or wikimail.

Aklapper renamed this task from Community Metrics contacts for user research to Provide data of Gerrit contributors from Bitergia DB to interview developers for user research.Feb 4 2022, 8:43 PM

@jdh264: We cannot match Gerrit accounts (those are LDAP accounts) with SUL (mediawiki.org accounts) as these systems are not related (though in some cases these accounts have been merged into a single identity in the Bitergia database). However, Gerrit accounts must have email addresses.

When Seve wrote "Total Contributions (# of patchsets contributed)", I'm guessing it was because neither he nor I recognized that contributions could consist of anything other than patchsets. My goal is to find people who are writing code.

New Developers

The timeframe for this request should be relative to the current date. Seve's note about Aug/Oct made sense when he wrote it back in January but makes less sense the closer we get to spring. The intent is to find developers who are relatively new to Wikimedia and still learning the processes. Please expand the timeframe until you reach 100, returning their date of first patchset for reference.

There is no expectation that new developers would have a large number of contributions. Their relative newness is the critical factor.

Inactive Developers

Similar to new developers, knowing the date of last patchset (rather than first patchset) would help me to prioritize developers who became inactive more recently in 2020, 2019, 2018 etc.
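
Both requests above (date of first patchset for new developers, date of last patchset for inactive ones) come down to a per-author minimum and maximum date. The Python sketch below shows that idea on an in-memory list of `(author, date)` pairs. The data layout and all function names are hypothetical; how the rows would actually be pulled from the Bitergia database is not covered here.

```python
from datetime import date, timedelta


def activity_spans(patchsets):
    """Map each author to (first_date, last_date) from (author, date) pairs."""
    spans = {}
    for author, day in patchsets:
        first, last = spans.get(author, (day, day))
        spans[author] = (min(first, day), max(last, day))
    return spans


def newest_developers(spans, limit=100):
    """Authors with the most recent first patchset, newest first."""
    firsts = [(author, first) for author, (first, _) in spans.items()]
    return sorted(firsts, key=lambda item: item[1], reverse=True)[:limit]


def inactive_developers(spans, today, limit=100, min_gap_days=365):
    """Authors whose last patchset is older than min_gap_days, most recent first."""
    cutoff = today - timedelta(days=min_gap_days)
    lasts = [(author, last) for author, (_, last) in spans.items() if last < cutoff]
    return sorted(lasts, key=lambda item: item[1], reverse=True)[:limit]
```

Taking the first `limit` entries of the newcomer list has the same effect as widening the timeframe until 100 accounts are reached, and each entry carries the date that was asked for.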

Retained Developers

The intent here is to avoid recruiting developers who were previously active but have recently gone dormant. Programmatically it would be something like this:

  1. Return a list of IDs for all patchset contributors in the past 60 days.
  2. Use those IDs to query the total number of patchsets per ID in the past 24 months. Order by that total, descending. Limit 100.
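
The two steps can be sketched in Python as follows, assuming the patchset data is available as `(author_id, date)` pairs. That input shape, the function name `retained_developers`, and the approximation of 24 months as 730 days are all assumptions made for this illustration.

```python
from collections import Counter
from datetime import date, timedelta


def retained_developers(patchsets, today, limit=100):
    """Rank recently active authors by their patchset count over ~24 months."""
    patchsets = list(patchsets)
    recent_cutoff = today - timedelta(days=60)
    window_cutoff = today - timedelta(days=730)  # roughly 24 months

    # Step 1: IDs of everyone who submitted a patchset in the past 60 days.
    recently_active = {author for author, day in patchsets if day >= recent_cutoff}

    # Step 2: count patchsets per ID within the 24-month window, restricted
    # to the IDs from step 1, then order by that total, descending.
    totals = Counter(
        author
        for author, day in patchsets
        if author in recently_active and day >= window_cutoff
    )
    return totals.most_common(limit)
```

An author with many patchsets over the window but none in the past 60 days drops out in step 1, which is the point of the exercise: it keeps recently dormant developers off the list.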

Understood regarding the nature of staff designation over time. I'll try to weed those out during the recruiting process.

The timeframe for this request should be relative to the current date.

https://wikimedia.biterg.io/goto/26ee7324f4b8f999fb0b27b3c8eca60e expands to "between 3 months ago and 11 months ago". I also took the liberty to exclude the sandbox repository which is only for testing Gerrit.

returning their date of first patchset for reference.

IIUC that's not possible by default, it would likely require creating a custom dashboard to expose that DB column. How exactly would this date of first patchset be used? If the plan is to contact approx. 100 people anyway, I'm not sure I understand what's getting "prioritized" - is it their answers?

Retained Developers

The intent here is to avoid recruiting developers who were previously active but have recently gone dormant. Programmatically it would be something like this:

  1. Return a list of IDs for all patchset contributors in the past 60 days.
  2. Use those IDs to query the total number of patchsets per ID in the past 24 months. Order by that total, descending. Limit 100.

https://wikimedia.biterg.io/goto/563ed0b766ec6fdb09ee277dbf2b3993 lists those 87 at-some-point volunteer contributors who contributed within the last two months, also contributed more than two years ago, and "Submitters" is sorted by the total number of patchsets within the last 2 years.

I'm not sure I understand what's getting "prioritized"

In the case of new developers, my hypothesis is that a 2-day developer will have different perspectives than a 2-week developer or a 2-month developer, etc. I won't actually interview all 100. The plan is to sort through the candidates, find a diversity of experiences and reach out individually until I get a handful that might agree to talk with me.

It's not immediately clear whether these biterg.io links are a deliverable or not. I'm not seeing any e-mail addresses. I appreciate the context, and I'll definitely use these during the triage of selecting candidates, but what I really need is a list of e-mails with submitter username for each group.

If I'm missing something obvious or I need a certain credential to see them, please let me know.

Aklapper added subscribers: Bmueller, Aklapper.

In my understanding there are further conversations taking place; assigning this to @Bmueller for the time being.

Thanks for your work here! I’m declining this ticket since we’re not proceeding with this specific research approach. A note on Gerrit data as a source: This data gives insights into who contributes via Gerrit in a given timeframe, but unfortunately can’t tell us who is a new developer at Wikimedia or who has stopped contributing to Wikimedia as a developer. The experimental Gerrit newcomer retention dashboard should be viewed with that in mind.