
Groceryheist (Nathan TeBlunthuis)
Analysis

Projects

User does not belong to any projects.

User Details

User Since
Sep 19 2018, 12:07 AM (48 w, 3 d)
Availability
Available
IRC Nick
groceryheist
LDAP User
Unknown
MediaWiki User
Groceryheist

I'm Nate!

I'm a PhD student at the University of Washington. I'm consulting on some data analysis and research projects at WMF this year.

I belong to the Community Data Science Collective, at the Communication Department at UW and the Department of Communication Studies at Northwestern University. I am training to be a computational social scientist of organizational communication with a focus on online collaboration.

Check out my paper “Revisiting ‘The Rise and Decline’ in a population of Peer Production Projects.” For this project, I set out to replicate some of the key findings from “The Rise and Decline of an Open Collaboration System” by Aaron Halfaker, Stuart Geiger, Jonathan Morgan, and John Riedl. They argued that the decline in the number of active Wikipedia editors could be attributed to the rise of quality control systems that made it difficult for newcomers to join the community. I wanted to know if such systems create barriers for newcomers in peer production projects other than Wikipedia. I adapted Halfaker et al.’s methodological approach to analyze a set of 700 Wikia wikis. It turns out that typical wikis not only have similar mechanisms for decline as Wikipedia, but also exhibit ‘rise and decline’ patterns.

Recent Activity

Thu, Aug 22

Groceryheist added a comment to T229042: Reading_depth: remove eventlogging instrumentation.

Sorry, I lost track of this bug until today. I think it is really regrettable to turn off the instrumentation. The utility of the data is greatly lessened by gaps in the collection window. As I pointed out, if the volume of events is a problem, would decreasing the sampling rate help? Nobody has addressed that point so far.

Thu, Aug 22, 7:35 AM · Readers-Web-Backlog (Readers-Web-Kanbanana-2019-20-Q1), Reading Depth, Product-Analytics, Analytics

Wed, Aug 21

Groceryheist added a comment to T230642: Publish aggregated reading time dataset.

Thanks Nuria!

Wed, Aug 21, 10:23 AM · Analytics, Reading Depth

Sat, Aug 17

Groceryheist added a comment to T230642: Publish aggregated reading time dataset.

Hi Nuria. I'm proposing to start with a one-off release that I can handle easily. I can also do some work to set up automated scheduled releases, but I don't want to commit to owning it in the long run.

Sat, Aug 17, 12:24 PM · Analytics, Reading Depth
Groceryheist created T230642: Publish aggregated reading time dataset.
Sat, Aug 17, 2:10 AM · Analytics, Reading Depth

Fri, Jul 26

Groceryheist updated subscribers of T229042: Reading_depth: remove eventlogging instrumentation.

I do not think we should be in a rush to remove this instrumentation.

Fri, Jul 26, 9:51 PM · Readers-Web-Backlog (Readers-Web-Kanbanana-2019-20-Q1), Reading Depth, Product-Analytics, Analytics

Jul 3 2019

GitHub <noreply@github.com> committed rOEQ5e4744804831: Merge 83f8ae89af62064d03808f8e09bfb61b20e8e915 into… (authored by Groceryheist).
Merge 83f8ae89af62064d03808f8e09bfb61b20e8e915 into…
Jul 3 2019, 10:22 PM

Jun 25 2019

Groceryheist created T226574: Add feature for edit namespace to edit quality models.
Jun 25 2019, 9:00 PM · artificial-intelligence, editquality-modeling, Scoring-platform-team

Jun 24 2019

GitHub <noreply@github.com> committed rOEQ83660b5b61ca: Merge 83f8ae89af62064d03808f8e09bfb61b20e8e915 into… (authored by Groceryheist).
Merge 83f8ae89af62064d03808f8e09bfb61b20e8e915 into…
Jun 24 2019, 10:52 PM
Groceryheist created T226426: Build tool to guess what tool was used to make reverts on Wikimedia wikis.
Jun 24 2019, 4:15 PM · Scoring-platform-team (Current)

Jun 13 2019

Groceryheist added a comment to T225692: Pyarrow hdfs interface does not work in SWAP.

Thank you!

Jun 13 2019, 3:25 PM · Analytics-Kanban, Analytics
Restricted Application removed a project from T225692: Pyarrow hdfs interface does not work in SWAP: Patch-For-Review.
Jun 13 2019, 6:29 AM · Analytics-Kanban, Analytics

Jun 11 2019

Groceryheist updated subscribers of T225441: Qualitative data collection for ores bias analysis.

I'm making a list of people who helped with labeling campaigns for the different ores projects.

Jun 11 2019, 1:23 AM · editquality-modeling, ORES, artificial-intelligence, Scoring-platform-team (Current)

Jun 10 2019

Groceryheist awarded T186559: Provide data dumps in the Analytics Data Lake a Love token.
Jun 10 2019, 7:52 PM · Analytics
Groceryheist created T225441: Qualitative data collection for ores bias analysis.
Jun 10 2019, 4:16 PM · editquality-modeling, ORES, artificial-intelligence, Scoring-platform-team (Current)

Jun 7 2019

GitHub <noreply@github.com> committed rOEQ83f186b6ae3b: Merge pull request #201 from wikimedia/jawiki (authored by Groceryheist).
Merge pull request #201 from wikimedia/jawiki
Jun 7 2019, 4:34 AM

Jun 5 2019

Groceryheist added a comment to T225133: Look at recent changes filters event log to track usage.

The changeslisthighlights and changeslistfilters schemas were deleted along with the data. So we don't have the data that we would want to have for this.

Jun 5 2019, 11:02 PM · editquality-modeling, ORES, artificial-intelligence, Scoring-platform-team (Current)
Groceryheist created T225134: Find out what tools are used for making reverts on the ores-enabled wikis..
Jun 5 2019, 6:50 PM · editquality-modeling, ORES, artificial-intelligence, Scoring-platform-team (Current)
Groceryheist created T225133: Look at recent changes filters event log to track usage.
Jun 5 2019, 6:49 PM · editquality-modeling, ORES, artificial-intelligence, Scoring-platform-team (Current)

Jun 4 2019

GitHub <noreply@github.com> committed rOEQ5d2dec886e8a: Merge pull request #196 from wikimedia/zhwiki (authored by Groceryheist).
Merge pull request #196 from wikimedia/zhwiki
Jun 4 2019, 3:17 AM

Jun 3 2019

Groceryheist updated the task description for T224902: Fit models for revert prediction.
Jun 3 2019, 5:58 PM · Scoring-platform-team, editquality-modeling, ORES, artificial-intelligence
Groceryheist added a comment to T224901: ORES bias analysis.

I created a task T224918 for that analysis.

Jun 3 2019, 5:55 PM · editquality-modeling, ORES, Epic, Scoring-platform-team (Current), artificial-intelligence
Groceryheist created T224918: Visualize the relationship between the probability of reversion and ores scores.
Jun 3 2019, 5:54 PM · editquality-modeling, ORES, artificial-intelligence, Scoring-platform-team (Current)

Jun 2 2019

GitHub <noreply@github.com> committed rOEQb0cee05a69f0: Merge 83f8ae89af62064d03808f8e09bfb61b20e8e915 into… (authored by Groceryheist).
Merge 83f8ae89af62064d03808f8e09bfb61b20e8e915 into…
Jun 2 2019, 4:59 AM
GitHub <noreply@github.com> committed rOEQ9ef2f950e0d4: Merge 83f8ae89af62064d03808f8e09bfb61b20e8e915 into… (authored by Groceryheist).
Merge 83f8ae89af62064d03808f8e09bfb61b20e8e915 into…
Jun 2 2019, 4:58 AM
Groceryheist committed rOEQ83f8ae89af62: add the model infor for the enwiki reverted model. (authored by Groceryheist).
add the model infor for the enwiki reverted model.
Jun 2 2019, 4:58 AM
Groceryheist committed rOEQc6e982823bc4: change enwiki.reverted model to logistic regression. (authored by Groceryheist).
change enwiki.reverted model to logistic regression.
Jun 2 2019, 4:41 AM

May 30 2019

GitHub <noreply@github.com> committed rOEQedf3bf8b112c: Merge pull request #197 from wikimedia/nlwiki (authored by Groceryheist).
Merge pull request #197 from wikimedia/nlwiki
May 30 2019, 9:29 PM
GitHub <noreply@github.com> committed rOEQb6f4742e81c3: Merge pull request #195 from wikimedia/srwiki_goodfaith_fix (authored by Groceryheist).
Merge pull request #195 from wikimedia/srwiki_goodfaith_fix
May 30 2019, 8:51 PM
GitHub <noreply@github.com> committed rOEQ44e81bdbabf3: Merge pull request #192 from wikimedia/eswikiversity (authored by Groceryheist).
Merge pull request #192 from wikimedia/eswikiversity
May 30 2019, 8:19 PM

May 13 2019

Groceryheist updated subscribers of T222933: Upgrade R in SWAP notebooks to 3.4+.
May 13 2019, 4:08 PM · Analytics-SWAP, Analytics

May 10 2019

Groceryheist created T222933: Upgrade R in SWAP notebooks to 3.4+.
May 10 2019, 2:10 AM · Analytics-SWAP, Analytics

May 3 2019

mpopov awarded T221890: Add wikidata ids to data lake tables a Like token.
May 3 2019, 2:34 PM · Epic, Analytics, Product-Analytics
Groceryheist added a comment to T222301: Upgrade pandas in spark SWAP notebooks.

Ok, I see. A hostile dependency could be a big problem. I'm not looking to argue, just sincerely curious. I get involved in managing a sort of ad-hoc spark setup on the UW cluster, so maybe I can learn something useful :)

May 3 2019, 7:24 AM · Analytics
Groceryheist added a comment to T222301: Upgrade pandas in spark SWAP notebooks.

Having said this, Andrew is planning to work on the Spark 2.4.2 upgrade and he will take a look if pandas could be upgraded as well :)

May 3 2019, 7:05 AM · Analytics

May 2 2019

Groceryheist added a comment to T222301: Upgrade pandas in spark SWAP notebooks.

I see. For Python packages I usually use pip instead of Debian, since Python tends to move much faster than Debian. Of course, I'm just managing this for myself and not supporting a whole organization :). But I'm also curious about why you use Debian for this.

May 2 2019, 3:54 PM · Analytics

May 1 2019

Groceryheist created T222301: Upgrade pandas in spark SWAP notebooks.
May 1 2019, 7:49 PM · Analytics
Groceryheist added a comment to T222254: Pyspark on SWAP: Py4JJavaError: Import Error: no module named pyarrow.

@elukey, thanks. It seems like I'm experiencing a regression then. I can work around it for now. See you tomorrow!

May 1 2019, 3:55 PM · Analytics, Analytics-Cluster
Groceryheist created T222254: Pyspark on SWAP: Py4JJavaError: Import Error: no module named pyarrow.
May 1 2019, 6:48 AM · Analytics, Analytics-Cluster
Groceryheist created T222253: Upgrade Spark to 2.4.x.
May 1 2019, 4:54 AM · Analytics-Kanban, Analytics, Analytics-Cluster

Apr 29 2019

Groceryheist added a comment to T221890: Add wikidata ids to data lake tables.

@Nuria yes. My understanding is that they are when pp_propname == "wikibase_item"
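
The lookup described in this comment can be sketched roughly as follows. This is only a minimal illustration, assuming rows shaped like MediaWiki's page_props table (pp_page, pp_propname, pp_value). The sample rows and the helper name `wikidata_ids` are hypothetical and are not part of any Wikimedia tool.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical sample rows shaped like (pp_page, pp_propname, pp_value).
SAMPLE_PAGE_PROPS = [
    (101, "wikibase_item", "Q42"),
    (101, "page_image_free", "Example.jpg"),
    (202, "wikibase_item", "Q7259"),
    (303, "displaytitle", "Some title"),
]


def wikidata_ids(rows: Iterable[Tuple[int, str, str]]) -> Dict[int, str]:
    """Map page id -> Wikidata item id, keeping only wikibase_item rows."""
    return {
        page_id: value
        for page_id, propname, value in rows
        if propname == "wikibase_item"
    }


page_to_item = wikidata_ids(SAMPLE_PAGE_PROPS)
print(page_to_item)
```

Pages without a `wikibase_item` row (page 303 in the made-up sample) simply do not appear in the result, so downstream code would need to decide how to treat unlinked pages.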

Apr 29 2019, 8:21 PM · Epic, Analytics, Product-Analytics
Groceryheist added a comment to T221890: Add wikidata ids to data lake tables.

My ultimate goal is to identify, from a random sample of ~500,000 to ~50,000,000 edits from different language Wikipedias:

  1. Which edits are to biographical articles.
  2. The gender or sex of the subject of the biographical articles.
Apr 29 2019, 5:06 PM · Epic, Analytics, Product-Analytics

Apr 26 2019

Groceryheist added a comment to T221890: Add wikidata ids to data lake tables.

Thank you, Nuria. Are you saying that we'll be able to sqoop the prop_tables in May at the earliest? Would it be okay to look up a sizable number of pages in the prop_tables in the meantime? I'm thinking on the order of 20,000 pages per language.

Apr 26 2019, 4:58 AM · Epic, Analytics, Product-Analytics
Groceryheist added a comment to T221870: Why are there three Q-marks (???) in threshholds in Special:ORESModels?.

Also
https://sr.wikipedia.org/wiki/Special:ORESModels has a strange threshold (0,1) for goodfaith.

Apr 26 2019, 12:40 AM · Growth-Team, Scoring-platform-team, ORES, MediaWiki-extensions-ORES

Apr 25 2019

Groceryheist updated the task description for T221890: Add wikidata ids to data lake tables.
Apr 25 2019, 7:51 PM · Epic, Analytics, Product-Analytics
Groceryheist created T221890: Add wikidata ids to data lake tables.
Apr 25 2019, 7:50 PM · Epic, Analytics, Product-Analytics
Groceryheist updated subscribers of T221870: Why are there three Q-marks (???) in threshholds in Special:ORESModels?.
Apr 25 2019, 4:01 PM · Growth-Team, Scoring-platform-team, ORES, MediaWiki-extensions-ORES
Groceryheist created T221871: Non-overlapping threshholds in ORESModels on lvwiki.
Apr 25 2019, 4:00 PM · Patch-For-Review, Growth-Team (Current Sprint), ORES, Scoring-platform-team, MediaWiki-extensions-ORES
Groceryheist created T221870: Why are there three Q-marks (???) in threshholds in Special:ORESModels?.
Apr 25 2019, 3:57 PM · Growth-Team, Scoring-platform-team, ORES, MediaWiki-extensions-ORES

Apr 23 2019

Groceryheist added a comment to T212172: Provide feature parity between the wiki replicas and the Analytics Data Lake.

Wikipedia-to-Wikidata linkage patterns (T209891#4798717, using the page_props table)

Apr 23 2019, 8:40 PM · Epic, Analytics, Product-Analytics

Apr 18 2019

Groceryheist closed T221398: Install aspell for ORES languages on STAT1006 as Resolved.
Apr 18 2019, 7:00 PM · Scoring-platform-team, ORES
Groceryheist claimed T221398: Install aspell for ORES languages on STAT1006.
Apr 18 2019, 7:00 PM · Scoring-platform-team, ORES
Groceryheist added a comment to T221398: Install aspell for ORES languages on STAT1006.

@Ladsgroup Oh sweet thanks I'll do that.

Apr 18 2019, 6:59 PM · Scoring-platform-team, ORES
Groceryheist created T221398: Install aspell for ORES languages on STAT1006.
Apr 18 2019, 6:50 PM · Scoring-platform-team, ORES
GitHub <noreply@github.com> committed rORES099794334c6b: Merge 2ed740a6a142e4587c87a5b5f3944c3625445b0a into… (authored by Groceryheist).
Merge 2ed740a6a142e4587c87a5b5f3944c3625445b0a into…
Apr 18 2019, 12:38 AM
Groceryheist committed rORES2ed740a6a142: Fix for #325: Score_revisions.py doesn't respect output parameter. (authored by Groceryheist).
Fix for #325: Score_revisions.py doesn't respect output parameter.
Apr 18 2019, 12:38 AM

Apr 9 2019

Groceryheist claimed T200898: Analyze the effects of ORES deployments on counter-vandalism behavior.

@Harej Indeed. I was already planning to do something very similar to this in the course of my project. I may be actively working on some of these subtasks starting next week.

Apr 9 2019, 10:21 PM · Scoring-platform-team (Research), ORES, Research ideas

Nov 20 2018

Groceryheist added a comment to T209051: ReadingDepth schema is whitelisting both session ids and page ids.

A handful of thoughts:

Nov 20 2018, 1:23 AM · Patch-For-Review, Analytics

Nov 18 2018

Groceryheist updated the task description for T160492: Conduct further data quality checks on the ReadingDepth schema.
Nov 18 2018, 12:09 AM · Readers-Web-Backlog (Tracking), Reading Depth, Product-Analytics, Reading-analysis

Nov 17 2018

Groceryheist updated the task description for T160492: Conduct further data quality checks on the ReadingDepth schema.
Nov 17 2018, 9:59 PM · Readers-Web-Backlog (Tracking), Reading Depth, Product-Analytics, Reading-analysis

Nov 2 2018

Groceryheist added a comment to T208275: Add revision ID to ReadingDepth Schema and Data.

Good question. I don't think so, unless there are additional schemas that we might need to join with that have keys other than page_id or revision_id. We already record namespace.

Nov 2 2018, 5:41 AM · Readers-Web-Backlog

Nov 1 2018

Groceryheist created T208478: Red links in ReadingDepth data.
Nov 1 2018, 5:05 AM · Readers-Web-Backlog

Oct 31 2018

Groceryheist added a comment to T208275: Add revision ID to ReadingDepth Schema and Data.

A related problem is that pages can move. Right now we record page_title, but different pages can have the same page_title at different times. It would also make downstream analysis much more convenient to have page_id in the schema.
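
To illustrate why a stable key helps, here is a small sketch with made-up events. The field layout and values are hypothetical and do not reflect the actual ReadingDepth schema. The point is only that grouping by title can merge distinct pages that held the same title at different times, while grouping by page_id keeps them apart.

```python
from collections import defaultdict

# Hypothetical events: (page_id, page_title, seconds_visible).
events = [
    (1, "Mercury", 30),           # page 1 originally holds the title
    (1, "Mercury (planet)", 45),  # page 1 after being moved
    (2, "Mercury", 20),           # a different page now uses the old title
]


def total_time_by(key_index, rows):
    """Sum seconds_visible grouped by the tuple field at key_index."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


by_title = total_time_by(1, events)    # mixes pages 1 and 2 under "Mercury"
by_page_id = total_time_by(0, events)  # keeps the two pages separate
print(by_title)
print(by_page_id)
```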

Oct 31 2018, 4:29 AM · Readers-Web-Backlog

Oct 29 2018

Groceryheist renamed T208275: Add revision ID to ReadingDepth Schema and Data from Add revision ID to ReadingDepth Schema to Add revision ID to ReadingDepth Schema and Data.
Oct 29 2018, 11:10 PM · Readers-Web-Backlog
Groceryheist created T208275: Add revision ID to ReadingDepth Schema and Data.
Oct 29 2018, 11:09 PM · Readers-Web-Backlog

Sep 26 2018

Groceryheist updated the task description for T160492: Conduct further data quality checks on the ReadingDepth schema.
Sep 26 2018, 11:20 PM · Readers-Web-Backlog (Tracking), Reading Depth, Product-Analytics, Reading-analysis
Groceryheist updated the task description for T160492: Conduct further data quality checks on the ReadingDepth schema.
Sep 26 2018, 11:14 PM · Readers-Web-Backlog (Tracking), Reading Depth, Product-Analytics, Reading-analysis

Sep 25 2018

Groceryheist closed T204790: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users as Resolved.

Created task https://phabricator.wikimedia.org/T205454 for LDAP access

Sep 25 2018, 4:20 PM · Patch-For-Review, SRE-Access-Requests, Operations
Groceryheist created T205454: LDAP Access request for Nathan TeBlunthuis (groceryheist / nathante).
Sep 25 2018, 4:19 PM · LDAP-Access-Requests
Groceryheist reopened T204790: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users as "Open".

I still don't have access to SWAP. I understand that I need to be added to the nda LDAP group.

Sep 25 2018, 4:13 PM · Patch-For-Review, SRE-Access-Requests, Operations

Sep 19 2018

Groceryheist added a comment to T204790: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users.

@RobH: Great. Thanks!

Sep 19 2018, 8:01 PM · Patch-For-Review, SRE-Access-Requests, Operations
Groceryheist added a comment to T204790: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users.

Here's the contract that I signed and sent to @ovasileva : REMOVED
It includes a "Contractor Confidentiality Agreement." Is this the NDA we are looking for?
Per the contract, the end date is November 16th 2018.

Sep 19 2018, 7:57 PM · Patch-For-Review, SRE-Access-Requests, Operations
Groceryheist created T204790: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users.
Sep 19 2018, 12:36 AM · Patch-For-Review, SRE-Access-Requests, Operations