
Isaac (Isaac Johnson)
Research Scientist

User Details

User Since
Oct 1 2018, 2:19 PM (194 w, 5 d)
Availability
Available
IRC Nick
isaacj
LDAP User
Isaac Johnson
MediaWiki User
Isaac (WMF)

Recent Activity

Fri, Jun 24

Isaac updated the task description for T310379: SuggestBot Experimentation.
Fri, Jun 24, 5:20 PM · Research
Isaac added a comment to T307229: Edit Types: Share out about library.

updates: blogpost being published Tuesday and then i'll close this out.

Fri, Jun 24, 5:14 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T307254: Recommendation Equity: Findings from GLAM pilot and ML Equity Strategy.

weekly updates:

  • did a lot of thinking about suggestbot experimental design and revising our offline analyses to be more in-line with the proposed experiment. in particular, stuck on how to transform data we have for each editor receiving suggestbot recommendations (their edits that are attributable to suggestbot recs and broader edit history) to appropriately capture their flexibility when it comes to editing about various topics -- e.g., if they predominantly edit articles about men, would that affect the likelihood that they'd accept a recommendation to edit a biography of a woman? unfortunately even with gender (which is relatively simple), there are several challenges:
    • not all articles are biographies so e.g., a feature that captures whether a recommendation matches an editor's preferences around biography gender (as gathered via edit history) doesn't distinguish between ambivalence about the gender of the biography and not editing biographies.
    • e.g., for an editor whose biography edits are 40% about women and 60% about men, should we be more surprised if they accept a recommendation for a man or a woman biography? presumably this example editor prefers to edit about women but maybe they're just editing about a topic that has more women and they don't actually have a preference (or they're editing about sports and have a strong preference)?
  • solution might be to not model it but try to capture via descriptive stats, which would probably also more easily capture editor variability in the flexibility of their preferences
  • also working on summarizing current state of project with Mo before we decide on parameters for experiment
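
A minimal sketch of the descriptive-stats idea above, assuming each editor's history is available as a list of gender labels with `None` marking non-biography articles. The function name, the label values, and this encoding are all hypothetical and are not taken from the SuggestBot analyses. Reporting the number of biographies alongside the share keeps "never edits biographies" distinct from "no gender preference":

```python
from collections import Counter


def biography_gender_profile(edit_history):
    """Summarize one editor's edits by biography gender.

    edit_history: iterable of labels, one per edited article. A label is a
    gender string such as "woman" or "man" for biographies and None for
    non-biography articles (hypothetical encoding).
    """
    history = list(edit_history)
    counts = Counter(label for label in history if label is not None)
    n_biographies = sum(counts.values())
    if n_biographies == 0:
        # No biographies at all: the share is undefined, not 0% or 50%.
        share_women = None
    else:
        share_women = counts["woman"] / n_biographies
    return {
        "n_edits": len(history),
        "n_biographies": n_biographies,
        "share_women": share_women,
    }


# An editor with a 40% / 60% split across ten biographies and five other articles.
example = ["woman"] * 4 + ["man"] * 6 + [None] * 5
profile = biography_gender_profile(example)
```

Comparing these per-editor summaries for accepted versus ignored recommendations is one way to look at flexibility without fitting a model.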
Fri, Jun 24, 5:13 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T300670: Quality Model: Streamline.

working with DS and PD on getting historical evaluation of quality working. sent them patch for images in templates/galleries but main issue still assumed to be selecting the expected features for high quality articles based only on current snapshot and not all of history (which overweights high-quality, highly-edited articles).

Fri, Jun 24, 4:40 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T310237: Story idea for Blog: What is in an edit? Automated detection of edit types on Wikipedia.

Tuesday sounds great - thanks!

Fri, Jun 24, 4:21 PM · Technical-blog-posts

Wed, Jun 22

Isaac added a comment to T311155: Rebuilding instances via Horizon gets stuck in forever loop of collecting puppet agent stats.

I should add that I'm going to delete this instance because I need its resources and a hard reboot did not solve the issue either. So I assume the logs will be deleted too but in my experience this has happened several times with different VMs so if whoever looks at this can't replicate it, let me know and I'll try on something that can be unreachable for however long it takes to debug etc.

Wed, Jun 22, 3:48 PM · Cloud-VPS
Isaac created T311155: Rebuilding instances via Horizon gets stuck in forever loop of collecting puppet agent stats.
Wed, Jun 22, 3:38 PM · Cloud-VPS

Fri, Jun 17

Isaac added a comment to T310237: Story idea for Blog: What is in an edit? Automated detection of edit types on Wikipedia.

The only thing I still need are links to the images that you've included in the Google Doc - could you share those with me directly?

No problem -- motivation I needed to upload them to Commons:

Fri, Jun 17, 5:12 PM · Technical-blog-posts
Isaac added a comment to T293468: Co-organize Fair Ranking Track at TREC 2022.

Weekly update:

  • data released and talked through metrics more
Fri, Jun 17, 2:49 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T286923: Source geoprovenance: scope work.

TODO: read https://opensym.org/wsos2013/proceedings/p0203-ford.pdf

Fri, Jun 17, 2:47 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T307229: Edit Types: Share out about library.

Updates:

  • Copy-editing on blogpost -- should be published next week
Fri, Jun 17, 2:46 PM · Research (FY2021-22-Research-April-June)

Wed, Jun 15

Isaac added a comment to T310237: Story idea for Blog: What is in an edit? Automated detection of edit types on Wikipedia.

If you're happy with the small changes and can update the comments, I can set this up and schedule it in WordPress for early next week.

Looks great -- thanks for the feedback. I went through and did another pass so feel like it's ready to go now.

Wed, Jun 15, 5:57 PM · Technical-blog-posts

Tue, Jun 14

Isaac added a comment to T310646: Reduce timeouts for prolific editors.

ok -- patch uploaded. @Tchanders let me know here or on the patch if there's any questions etc. Only caveat is I created a new config variable so it would be easier to adjust this in the future without changing code. but for testing/production, obviously have to make sure that the config file being used has this variable too.

Tue, Jun 14, 8:37 PM · Anti-Harassment (AHaT Sprint 10: The Mokorotlo), Patch-For-Review, Similar Editors
Isaac created T310646: Reduce timeouts for prolific editors.
Tue, Jun 14, 5:39 PM · Anti-Harassment (AHaT Sprint 10: The Mokorotlo), Patch-For-Review, Similar Editors

Fri, Jun 10

Isaac committed rRLP48f2cc8f9c67: facct paper and trec instructions (authored by Isaac).
facct paper and trec instructions
Fri, Jun 10, 6:08 PM
Isaac added a comment to T307254: Recommendation Equity: Findings from GLAM pilot and ML Equity Strategy.

Weekly updates:

Fri, Jun 10, 4:47 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T300670: Quality Model: Streamline.

Weekly updates: none

Fri, Jun 10, 4:45 PM · Research (FY2021-22-Research-April-June)
Isaac added a subtask for T310379: SuggestBot Experimentation: T308287: Onboard Mo to analytics infrastructure.
Fri, Jun 10, 4:42 PM · Research
Isaac added a parent task for T308287: Onboard Mo to analytics infrastructure: T310379: SuggestBot Experimentation.
Fri, Jun 10, 4:42 PM · Research
Isaac created T310379: SuggestBot Experimentation.
Fri, Jun 10, 4:41 PM · Research
Isaac added a comment to T307229: Edit Types: Share out about library.

Updates:

Fri, Jun 10, 4:29 PM · Research (FY2021-22-Research-April-June)
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Fri, Jun 10, 3:11 PM · Patch-For-Review, periodic-update, Research
Isaac added a comment to T309769: Expanding External Referrer Tracking.

Thanks @KinneretG for creating this task! Just chiming in with a thought for whoever takes up this work: search engines are a bit more standardized in their referer URLs, but some of these external platforms have link shorteners that we need to account for. You can see a few examples from a past pilot with a similar scope (though fewer external platforms). Generally I just inspected the top external traffic from a given day to identify any non-standard referer formats -- e.g.,:
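
For illustration only (this is not the list from the pilot), a minimal sketch of mapping referer hostnames to platforms while accounting for link shorteners. The two lookup tables and the function name are hypothetical and deliberately tiny; a real list would come from inspecting the top external referers as described above:

```python
from urllib.parse import urlsplit

# Illustrative, incomplete mappings (assumptions for this sketch).
SHORTENER_TO_PLATFORM = {
    "t.co": "twitter",
    "youtu.be": "youtube",
    "lnkd.in": "linkedin",
    "fb.me": "facebook",
}
PLATFORM_DOMAINS = {
    "twitter.com": "twitter",
    "youtube.com": "youtube",
    "linkedin.com": "linkedin",
    "facebook.com": "facebook",
    "reddit.com": "reddit",
}


def classify_referer(referer):
    """Return a platform name for a referer URL, or "unknown"."""
    host = (urlsplit(referer).hostname or "").lower()
    if host in SHORTENER_TO_PLATFORM:
        return SHORTENER_TO_PLATFORM[host]
    # Match the registered domain and any subdomain (m., www., old., ...).
    parts = host.split(".")
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in PLATFORM_DOMAINS:
            return PLATFORM_DOMAINS[candidate]
    return "unknown"
```

Counting the hosts that fall into "unknown" for a day of traffic is a cheap way to surface formats the tables do not cover yet.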

Fri, Jun 10, 2:20 PM · Foundational Technology Requests

Wed, Jun 8

Isaac created T310237: Story idea for Blog: What is in an edit? Automated detection of edit types on Wikipedia.
Wed, Jun 8, 10:28 PM · Technical-blog-posts

Mon, Jun 6

Isaac updated subscribers of T308165: Explore what would be required to migrate the content translation recommendation model to Lift Wing.

@kevinbazira chiming in here to try to help sort out the different services. tagging @leila and @bmansurov too who hopefully can correct / verify what I know:

Mon, Jun 6, 5:28 PM · Epic, Machine-Learning-Team (Active Tasks)

Fri, Jun 3

Isaac added a comment to T307254: Recommendation Equity: Findings from GLAM pilot and ML Equity Strategy.

weekly updates: met with JT and MR to discuss knowledge gaps / ml equity + product. shared general takeaways:

  • For measurement of content impact, gender and geography give good coverage: gender because interventions have been shown to be effective for getting editors to edit content about women so design choices can have a real impact. geography because interventions are less effective (editors are more likely to edit content with which they are familiar and geographic familiarity is a large component of this) so without measuring individual editor demographics, tracking content geography gives some insight into the diversity of the editor community and encourages long-term investments in supporting a more diverse editor community.
  • For design: individual filters (e.g., topics, countries) are good and should continue to receive development but individual action won't close knowledge gaps. for that, we need collective action of the type organized by campaigns/edit-a-thons. so long-term, connecting recommender systems with campaigns feels like the much more effective approach.
Fri, Jun 3, 8:03 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T300670: Quality Model: Streamline.

Weekly updates: worked with DS to make sure his implementation of the model for all of wikitext history made sense. discussed how to set the right thresholds for each feature (use only current version of wikitext to determine 'top quality' articles not every revision).
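
A minimal sketch of the thresholding point above, assuming a feature threshold is a high percentile of that feature's values and that revisions arrive as dictionaries with `page_id`, `timestamp`, and feature keys. The percentile rule, field names, and function names are assumptions for illustration, not the model's actual implementation. It shows why computing the threshold over every revision overweights heavily edited pages:

```python
import math


def latest_revision_per_page(revisions):
    """Keep only the most recent revision of each page."""
    latest = {}
    for rev in revisions:
        current = latest.get(rev["page_id"])
        if current is None or rev["timestamp"] > current["timestamp"]:
            latest[rev["page_id"]] = rev
    return list(latest.values())


def percentile_threshold(values, pct=95):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def feature_threshold(revisions, feature, pct=95, current_only=True):
    """Threshold for one feature, from current articles or from all history."""
    pool = latest_revision_per_page(revisions) if current_only else list(revisions)
    return percentile_threshold([rev[feature] for rev in pool], pct)


# Page 1 is heavily edited; pages 2 and 3 have a single revision each.
history = [
    {"page_id": 1, "timestamp": 1, "refs": 10},
    {"page_id": 1, "timestamp": 2, "refs": 20},
    {"page_id": 1, "timestamp": 3, "refs": 30},
    {"page_id": 2, "timestamp": 1, "refs": 2},
    {"page_id": 3, "timestamp": 1, "refs": 4},
]
median_current = feature_threshold(history, "refs", pct=50)
median_all_history = feature_threshold(history, "refs", pct=50, current_only=False)
```

In this toy data the median over current articles is 4 references, while the median over all revisions is 10 because page 1 contributes three rows.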

Fri, Jun 3, 7:54 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T307229: Edit Types: Share out about library.

also FK finally figured out the cluster + edittypes issue! it evidently is some memory issue with mwparserfromhell version 0.6.4. Running with default version on the cluster (0.6) completed while running with upgraded version (0.6.4) triggered memory errors. no clear reason so presumably some complicated interaction between the library and YARN.

Fri, Jun 3, 7:24 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T307229: Edit Types: Share out about library.

Weekly updates:

  • first full draft of blogpost. running by Jesse before submitting to techblog for review
Fri, Jun 3, 7:22 PM · Research (FY2021-22-Research-April-June)

Tue, May 31

Isaac added a comment to T309035: Add links to interaction timeline from Special:SimilarEditors results.

This can be worked on while some of the parameters are being confirmed. One thing I noticed is that the URL requires you to pass along a wiki. I assume since similar users only works with enwiki right now, it's fine to hardcode that in? Should we track some of these hardcoded dependencies (since I believe there's some talk about making it work for all wikis?)

Not my decision but agreed that hardcoding is fine now but would be good to track these for a hopeful eventual conversion to all wikis. All the MediaWiki API calls from within the tool require a language too and that would also need to be updated to be more flexible. Thanks for flagging this too because it has implications for the backend data or API -- i.e. giving an appropriate value of a single wiki for the interaction timeline for a given pair of users is not necessarily trivial.

Tue, May 31, 7:59 PM · MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), Anti-Harassment (AHaT Sprint 10: The Mokorotlo), Similar Editors

Fri, May 27

Isaac added a comment to T307254: Recommendation Equity: Findings from GLAM pilot and ML Equity Strategy.

weekly updates: most of the thinking in this space has been preparation for Mo's summer work around personalized edit recommendation and content equity (focus on SuggestBot). also in re-reading Diego's proposal for AI + Knowledge Integrity, he's covered the need to discuss data generation strategies with ML Platform and Product stakeholders so the planning for that (which we'd identified as the main priority of next year in this space) can likely happen in collaboration with him (perhaps using vandalism detection as the case study, which is something I would have proposed anyhow).

Fri, May 27, 6:07 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T300670: Quality Model: Streamline.

Weekly updates: none

Fri, May 27, 6:01 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T307229: Edit Types: Share out about library.

Weekly updates:

  • continued minor improvements to both libraries to fix edge cases. I put together a simple notebook script for helping to identify issues and will hopefully eventually get to the point where both libraries are in full agreement where possible: https://public.paws.wmcloud.org/User:Isaac_(WMF)/Edit%20Diffs/DiffLibraryComparison.ipynb
  • ragesoss would be very interested in the references and words aspect of the tool so hopefully at some point in the next year, we'll work with them to incorporate the outputs into their programs and events dashboard: https://outreachdashboard.wmflabs.org/
  • wrote the full introduction to the blogpost and will wrap up the rest of the sections next week. then some time for Jesse to comment before submitting to TechBlog.
Fri, May 27, 6:00 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T293468: Co-organize Fair Ranking Track at TREC 2022.

Weekly update:

Fri, May 27, 5:54 PM · Research (FY2021-22-Research-April-June)
Isaac updated the task description for T293468: Co-organize Fair Ranking Track at TREC 2022.
Fri, May 27, 5:53 PM · Research (FY2021-22-Research-April-June)
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Fri, May 27, 3:33 PM · Patch-For-Review, periodic-update, Research

May 26 2022

Isaac added a comment to T307023: Display results of Special:SimilarEditors.

Should a task be filed for this, or is this just something that should be worked around for now?

Just to be clear: this question pertains to the timeout when gathering new contributions for Arjayay? If so, a few things going on here:

  • I don't think the databases for the tool have been updated in many many months so you're going to see this happen much more right now than you would when we have the monthly updates to the databases running
  • What's happening is that say the databases are good to 30 April 2022. If you query a user today, the tool hits the API for the pages they edited since 30 April. For each of these pages, the edit history (since 30 April in this example) is then gathered and analyzed. This is the step that's almost certainly timing out. Details:
    • The first set of API calls for pages edited is loosely capped at 1000 pages (code). It would be pretty cheap to reduce that cap to say 50 pages. Then if someone edited a lot recently, the first call to the tool would get those 50 pages. The next call would maybe get the next 50. And so on. So you essentially stretch out the data updates over multiple sessions so no one session times out (hopefully) at the cost of maybe not having all the most current data in that first session. The timespan associated with the data is included in the API response though so hopefully we could expose this easily to the user of the tool. We're making some assumptions too about the data that aren't perfect and the more sessions this update is spread across, the more likely we are to introduce error into the data. I wouldn't be super concerned about this and I haven't empirically evaluated it but an FYI. Each monthly database update resets this error to 0 though, so that's good.
    • The second set of API calls (code) is the expensive step. Many active editors will have edited 1000 unique pages since the database was last refreshed and each of those pages could have many associated API calls to get the edit history (especially right now). There are maybe ways to also explicitly limit this process but it's a lot trickier to do (definitely its own task and I don't know when/if it would be figured out). All to say, much easier to address this at the prior page gathering step.
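
A minimal sketch of the capped, multi-session update described above. It assumes the recently edited pages are available as `(timestamp, page_title)` pairs sorted oldest first and that batches are taken oldest first so the coverage timestamp can advance. The function name and return shape are hypothetical and this is not the tool's actual code:

```python
def gather_pages_batch(recent_pages, data_valid_until, cap=50):
    """Pick the next capped batch of pages to refresh for one session.

    recent_pages: list of (timestamp, page_title), sorted oldest first.
    data_valid_until: timestamp the stored data is currently good to.
    Returns (pages_to_process, new_data_valid_until, more_remaining).
    """
    pending = [(ts, page) for ts, page in recent_pages if ts > data_valid_until]
    batch = pending[:cap]
    # The data is only good up to the last page actually processed.
    new_valid_until = batch[-1][0] if batch else data_valid_until
    return [page for _, page in batch], new_valid_until, len(pending) > cap


# 120 pages edited since the last database refresh, spread over three sessions.
recent = [(ts, f"Page_{ts}") for ts in range(1, 121)]
valid_until = 0
sessions = []
while True:
    pages, valid_until, more = gather_pages_batch(recent, valid_until)
    sessions.append(len(pages))
    if not more:
        break
```

Returning the new coverage timestamp with each batch is what would let the tool tell the user how current the data is. Pages that share a timestamp with the batch boundary would be skipped by the strict `>` comparison, which is one example of the error that spreading the update across sessions can introduce.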
May 26 2022, 3:42 PM · Anti-Harassment (AHaT Sprint 9: The Beret), MW-1.39-notes (1.39.0-wmf.14; 2022-05-30), Similar Editors
Isaac updated subscribers of T252227: Mobile redirects drop provenance parameters.

Thanks all for the input on this task and @BBlack especially for digging up what was happening. I finally updated the task description to reflect what I think is the current understanding of the situation but let me know if anything seems off.

May 26 2022, 12:02 PM · Data-Engineering, Traffic-Icebox, SRE
Isaac updated the task description for T252227: Mobile redirects drop provenance parameters.
May 26 2022, 11:48 AM · Data-Engineering, Traffic-Icebox, SRE

May 20 2022

Isaac added a comment to T307254: Recommendation Equity: Findings from GLAM pilot and ML Equity Strategy.

weekly updates: haven't worked on this much but did meet with PG from Language to discuss future of content translation recommendations and potential collaborations there

May 20 2022, 3:32 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T307229: Edit Types: Share out about library.

Weekly updates:

  • Some indication that memory issues might be related to wikitext table and not the edit types UDF
  • MA identified some simple optimizations for edit types library around converting mwparserfromhell nodes to strings that greatly increased the speed of the library -- e.g., order of magnitude faster, especially for larger changes! not implemented yet but likely update the libraries next week
  • meeting set up with ragesoss from Wikiedu to talk about potentially using the edit types with their programs and events dashboard (tracking the impact of edit-a-thons etc.)
May 20 2022, 3:32 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T300670: Quality Model: Streamline.

Weekly updates: continued fine-tuning of model card based on feedback from Hal / Pablo: https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_Wikipedia_article_quality_model_card

May 20 2022, 3:22 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T308853: Similarusers API documentation doesn't match the returned object.

Good catch -- I forget why exactly this would have been included but the num_edits_in_data variable is the important one here that captures something very similar. Which is to say I support updating the documentation and leaving the code as is.

May 20 2022, 2:12 PM · MW-1.39-notes (1.39.0-wmf.13; 2022-05-23), Patch-For-Review, Similar Editors, Anti-Harassment (AHaT Sprint 8: The Flat Cap)

May 18 2022

Isaac updated the task description for T308287: Onboard Mo to analytics infrastructure.
May 18 2022, 10:15 PM · Research

May 17 2022

Isaac added a comment to T307229: Edit Types: Share out about library.

Weekly updates: spent some time fighting with the cluster and edit types library to get it running on mediawiki history snapshots without memory errors. FK and I worked on it collaboratively yesterday but did not make much progress in diagnosing or fixing the issue. Will continue to think on potential steps forward.

May 17 2022, 3:06 PM · Research (FY2021-22-Research-April-June)

May 16 2022

Isaac added a comment to T308287: Onboard Mo to analytics infrastructure.

@RoccoMo we can continue with this next week (we can schedule a time during our wednesday call but same time works for me). We'll go through the SuggestBot extraction notebook in detail. If you're curious, feel free to take a look ahead of time at some of the examples/documentation but no expectation that you will have done that ahead of our next session.

May 16 2022, 5:12 PM · Research
Isaac updated the task description for T308287: Onboard Mo to analytics infrastructure.
May 16 2022, 5:08 PM · Research

May 13 2022

Isaac added a comment to T305888: Reference Quality in English Wikipedia / Internship.

FYI in case it's useful, here's some code I was using for extracting cite templates on English Wikipedia and joining it with country data inferred based on URLs/publishers: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/references/ref_extraction.ipynb

May 13 2022, 2:24 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T300477: Story idea for Blog: mwsql: a faster way to explore data from wiki projects.

Hey all -- just checking in to see if this might be published soon. Not urgent urgent but it would be really nice to have the blogpost published to share at the hackathon to showcase the sort of intern projects that our team (research) has been a part of.

May 13 2022, 2:07 PM · User-Slst2020, Technical-blog-posts

May 12 2022

Isaac created T308287: Onboard Mo to analytics infrastructure.
May 12 2022, 7:51 PM · Research
Isaac updated the task description for T287048: [medium] Develop good practices for Wikisource processing.
May 12 2022, 6:21 PM · Research ideas
Isaac closed T288666: Create a user script for showing statistics on Wikipedia articles about the gender of those linked in the article as Resolved.
May 12 2022, 3:43 PM · Research ideas, Wikimania-Hackathon-2021
Isaac renamed T287048: [medium] Develop good practices for Wikisource processing from [short] Develop Python tutorial for Article Topic Dataset to [medium] Develop good practices for Wikisource processing.
May 12 2022, 3:02 PM · Research ideas

May 10 2022

Isaac updated the task description for T308053: Requesting access to analytics-privatedata-users for RoccoMo.
May 10 2022, 6:34 PM · SRE, SRE-Access-Requests
Isaac updated subscribers of T308053: Requesting access to analytics-privatedata-users for RoccoMo.

Hey SRE/Analytics -- we have a new formal collaborator onboard: @RoccoMo. They need access to HDFS and the stat machines for a new research project. Don't hesitate to let me know if you need more information.

May 10 2022, 6:10 PM · SRE, SRE-Access-Requests
Isaac created T308053: Requesting access to analytics-privatedata-users for RoccoMo.
May 10 2022, 6:08 PM · SRE, SRE-Access-Requests

May 6 2022

Isaac added a comment to T307229: Edit Types: Share out about library.

Weekly updates:

  • Working on blogpost. Contract ended but Jesse will continue to be looped in on blogpost. Goal is to get it to TechBlog folks by the end of the month for hopeful publication in June.
  • Also prepared a document about potential approaches for creating a stream of these diffs so analysts, tool designers, etc. will have access to a real-time stream of edit types that they can use without having to calculate themselves (shared with some Data Engineering / Platform folks): https://docs.google.com/document/d/1_EQ13zhEtJmYQVijg5ACAT33oHduhaXp05lAaLvTL38/edit?usp=sharing
May 6 2022, 1:16 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T300670: Quality Model: Streamline.

Weekly updates:

May 6 2022, 12:47 PM · Research (FY2021-22-Research-April-June)

May 5 2022

Isaac added a comment to T307022: Sockpuppet detection API is accessible without prior auth.

Thanks @Mstyles and @sbassett !

May 5 2022, 9:00 PM · SecTeam-Processed, Vuln-MissingAuthz, Platform Engineering, Security, Security-Team
Isaac updated the task description for T280369: Isaac Academic Service.
May 5 2022, 3:14 PM · Research
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
May 5 2022, 3:13 PM · Patch-For-Review, periodic-update, Research
Isaac committed rRLP2da97c09dd5f: White paper revision (authored by Isaac).
White paper revision
May 5 2022, 3:13 PM

May 4 2022

Isaac added a comment to T307022: Sockpuppet detection API is accessible without prior auth.

Thanks @STran for catching this! It's a test instance that I built as an early prototype of the now similarusers service (thank you @WDoranWMF and @hnowlan for fielding this but they weren't involved with this instance). While I opted to password-protect it to reduce the chance of abuse, it's purely based on public data from edit histories. Happy to follow-up as needed though.

May 4 2022, 4:53 PM · SecTeam-Processed, Vuln-MissingAuthz, Platform Engineering, Security, Security-Team

May 3 2022

Isaac committed rRLP7be1061a87f1: Add Diego's CHI paper (authored by Isaac).
Add Diego's CHI paper
May 3 2022, 1:21 PM
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
May 3 2022, 1:15 PM · Patch-For-Review, periodic-update, Research

May 2 2022

Isaac added a comment to T293480: Content Tagging Models: Prototype two.

Hey @paramita_das: you can see most of these details in the write-up and attached notebook. Pointers to your specific questions below:

May 2 2022, 1:16 PM · Research (FY2021-22-Research-Jan-March)

Apr 29 2022

Isaac committed rRLPb3e2dfa5add8: Revert button to WikiResearch (authored by Isaac).
Revert button to WikiResearch
Apr 29 2022, 10:16 PM
Isaac moved T307254: Recommendation Equity: Findings from GLAM pilot and ML Equity Strategy from Staged to FY2021-22-Research-April-June on the Research board.
Apr 29 2022, 8:29 PM · Research (FY2021-22-Research-April-June)
Isaac created T307254: Recommendation Equity: Findings from GLAM pilot and ML Equity Strategy.
Apr 29 2022, 8:29 PM · Research (FY2021-22-Research-April-June)
Isaac closed T283821: Add information about Research office hours on research landing page (events) as Resolved.

boldly closing this task (reopen if i'm wrong) -- i believe we completed this a while ago and chose to remove some of the visuals from the page:

Apr 29 2022, 8:10 PM · Research
Isaac added a comment to T300670: Quality Model: Streamline.

Weekly updates:

Apr 29 2022, 2:59 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T286923: Source geoprovenance: scope work.

Update:

  • Reading Amaral et al.'s work on references on Wikidata as a source of ideas about ways to analyze sources. Will reach out to them with thoughts when I've had time to think about it.
  • Following T305888 to see what I can learn from that project
Apr 29 2022, 2:56 PM · Research (FY2021-22-Research-April-June)
Isaac moved T307229: Edit Types: Share out about library from Staged to FY2021-22-Research-April-June on the Research board.
Apr 29 2022, 2:54 PM · Research (FY2021-22-Research-April-June)
Isaac created T307229: Edit Types: Share out about library.
Apr 29 2022, 2:54 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T293468: Co-organize Fair Ranking Track at TREC 2022.

Update:

  • Dataset compiled and ready to release. Will go out in probably 2 weeks after my collaborators generate keywords to act as queries for each WikiProject. Notebook paper will go out in next week
Apr 29 2022, 2:40 PM · Research (FY2021-22-Research-April-June)
Isaac updated the task description for T293468: Co-organize Fair Ranking Track at TREC 2022.
Apr 29 2022, 2:39 PM · Research (FY2021-22-Research-April-June)

Apr 25 2022

Isaac added a comment to T305390: Cross-Linngual Article Quality Exploration.

FYI if you want an alternative approach for extracting quality ratings for current articles, you can use the page assessments MySQL table as well. They are available in at least English, French, Arabic, Hungarian, and Turkish but only contain the current state:

Apr 25 2022, 5:46 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T287056: Deploy Outlinks topic model to production.

@achou excited to have you claim this task! Don't hesitate to reach out if you have any questions about the model etc.

Apr 25 2022, 2:18 PM · Machine-Learning-Team (Active Tasks), Lift-Wing
Isaac updated subscribers of T306114: Cloud VPS "wmf-research-tools" project Stretch deprecation.

@diego this is your instance -- not sure if you mind whether it's shutdown/deleted on May 1st or not but FYI in case.

Apr 25 2022, 2:07 PM · Cloud-VPS (Debian Stretch Deprecation)

Apr 8 2022

Isaac committed rRLP19a52731096e: Update button for wikiworkshop and new publications (authored by Isaac).
Update button for wikiworkshop and new publications
Apr 8 2022, 3:20 PM
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Apr 8 2022, 3:02 PM · Patch-For-Review, periodic-update, Research
Isaac added a comment to T305390: Cross-Linngual Article Quality Exploration.

There was a mistake on the title, this work is about article quality and not specifically about citations.

Oh drat -- well then a small tweak on my comments that largely repeats what I said in our meeting. The two main gaps I see in the quality model are:

  • Better features related to the actual content (not just its quantity).
    • Martin is working on a language-agnostic measure for readability (meta)
    • Language-agnostic lists of maintenance templates so editors can flag issues that models might not detect, which I know is something Diego has already worked on
  • Better features related to the sources of the article (detailed thoughts in my previous comment) given how important sources are to article quality
Apr 8 2022, 1:06 PM · Research (FY2021-22-Research-April-June)

Apr 7 2022

Isaac updated subscribers of T305390: Cross-Linngual Article Quality Exploration.

Just wanted to comment that I love this task and a few thoughts:

  • The start of some of my work around understanding sources -- specifically their geography: https://meta.wikimedia.org/wiki/Research:Analyzing_sources_on_Wikipedia
  • There are three aspects of citation quality that stand out to me:
    • Do they exist -- i.e. for sentences that need a citation, how many have one? Miriam and Aiko's work tackled this question for a few wikis.
    • The reliability/verifiability of an individual source -- e.g., does it show up on reliable source lists or in Featured Articles? Is it primary, secondary, or tertiary? Is there a digital version that doesn't sit behind a paywall?
    • The overall diversity of sources -- e.g., how many unique sources? mixture of primary/secondary/tertiary? mixture of countries? mixture of source types (books, newspapers, etc.)? mixture of dates?
  • As I mentioned in our meeting, I'm also intrigued by the question of when sources / content go stale. For example, you have an article that exists in multiple languages. In one language, we see a burst of edits that are adding new sources to the article that seem to be recent (e.g., newspaper articles published in the past month). How often do we see the other languages also add recent sources? When the other languages don't add these sources, which of these is true:
    • The articles are incomplete -- e.g., the person won a prize that isn't listed yet?
    • The articles have stale (old) data -- e.g., they have population data from 2010 but there's now data for 2020?
    • The articles now have misinformation / NPOV violations -- e.g., the original estimate of damage for an earthquake was wrong and by not including the new estimate, the article is under/overstating the impact? Or an even simpler case: the person has died and by not including that information, the reader would assume the person to still be alive?
Apr 7 2022, 2:49 PM · Research (FY2021-22-Research-April-June)

Apr 6 2022

Isaac updated the task description for T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.
Apr 6 2022, 6:35 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Isaac added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Isaac and @MGerlach , should we mail the task notebook to both of you or either one would do it? Thanks.

@Appledora thanks for asking -- the ideal approach is to send the email to both of us and we will coordinate who provides feedback. I'll update the task to clarify that.

Apr 6 2022, 6:33 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 5 2022

Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Apr 5 2022, 6:28 PM · Patch-For-Review, periodic-update, Research

Apr 4 2022

Isaac added a comment to T302237: Outreachy Project (Round 24): Build Python library to work with html-dumps.

Welcome newer applicants -- still plenty of time and glad to see your interest!

Apr 4 2022, 1:45 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Isaac added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

Exciting to see you all digging into the details of HTML parsing and lots of good points / questions. If you're still curious, you'll find some more details about why those links to Wikipedia are created as external links here: https://en.wikipedia.org/wiki/Wikipedia:Namespace#Virtual_namespaces

Apr 4 2022, 1:44 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 1 2022

Isaac added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

Are there features / data that are available in the HTML but not the wikitext?
What exactly should I be showing here? Codes or just study references?

I'm not sure what you mean by Codes or just study references but the thinking with this TODO is that the HTML contains various attributes that aren't included in the wikitext that can tell us things about the article / links / text / etc. So just asking for an example or two of these. Does that help?
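
To give a sense of what I mean, here is a minimal sketch that tallies the `rel` and `typeof` attribute values found in a snippet of HTML, using only the Python standard library. The sample HTML is made up, and the attribute values in it (`mw:WikiLink`, `mw:ExtLink`, `mw:Transclusion`) are my assumption about what the dumps contain, so check them against the real data.

```python
from collections import Counter
from html.parser import HTMLParser


class AttributeSurvey(HTMLParser):
    """Counts (attribute, value) pairs for `rel` and `typeof` attributes."""

    def __init__(self):
        super().__init__()
        self.seen = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("rel", "typeof") and value:
                self.seen[(name, value)] += 1


def survey_semantic_attributes(html):
    parser = AttributeSurvey()
    parser.feed(html)
    parser.close()
    return parser.seen


# Hypothetical snippet; the wikitext has no direct equivalent of these attributes.
sample_html = (
    '<p><a rel="mw:WikiLink" href="./Paris">Paris</a> and '
    '<a rel="mw:ExtLink" href="https://example.org">a site</a>.</p>'
    '<span typeof="mw:Transclusion" data-mw="{}">from a template</span>'
)
attribute_counts = survey_semantic_attributes(sample_html)
```

The type of a link or the fact that a piece of text came from a template is the sort of thing that has to be inferred from wikitext but may be stated directly in the HTML.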

Apr 1 2022, 1:53 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Isaac added a comment to T293468: Co-organize Fair Ranking Track at TREC 2022.

Weekly update:

  • Put together snapshot of wikiprojects for us to choose the training data
  • Pulling together dump of all the enwiki articles and associated fairness attributes that we're using. I merged in pageview data, article age, and source geo-provenance data. The last attribute is subject age, which I'm working on based on the approach being taken for the Knowledge Gaps Index.
Apr 1 2022, 1:46 PM · Research (FY2021-22-Research-April-June)
Isaac added a comment to T286923: Source geoprovenance: scope work.

Weekly updates:

Apr 1 2022, 1:45 PM · Research (FY2021-22-Research-April-June)

Mar 31 2022

Isaac added a comment to T300977: Maybe restrict domains accessible by webproxy.

Chiming in as a heavy user of the stat boxes. It's difficult for me to follow this conversation so I'm mainly asking for clarification right now on the current proposal. Thanks all for working through this.

Mar 31 2022, 2:25 PM · Patch-For-Review, Research, Product-Analytics, SRE, netops, Infrastructure-Foundations, Data-Engineering

Mar 30 2022

Isaac added a comment to T305082: Request for Private repos to be enabled.

Just chiming in to +1 this or at least open the discussion. A few miscellaneous thoughts:

  • Many of my personal notebooks that I should keep private don't actually contain highly sensitive data -- e.g., as an extreme example, I'm not printing out editor IP addresses as part of any analyses. They are generally aggregate analyses of, for example, top external referrer domains to Wikipedia, which is technically private data but would almost certainly pass a privacy review if the raw data needed to be released.
  • As a workaround, when I want to share the code, I either have to share the actual location of the notebook on the stat machines or make a copy that I purge of outputs and upload to Github etc.
  • My understanding is that we are hosting the Gitlab instance so hopefully that makes it pretty secure though I'm not sure if the only folks who could ever see a private repo are guaranteed to be NDAed?
  • Perhaps if this is implemented, there is some way to ensure that folks don't accidentally switch a private repo to public (without first destroying any history that might still contain sensitive info).
Mar 30 2022, 9:33 PM · Release-Engineering-Team (Priority Backlog 📥), Privacy, User-brennen, GitLab (Administration, Settings & Policy), Product-Analytics
Isaac added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

Why is this particular template represented differently in HTML than the rest of the templates? Should I consider this as a template in my implementation of wt.filter_templates()? I know {{}} represents a template.

@Talika2002 good catch! The reason those curly brackets don't behave like other templates is that it's technically not a template (though obviously it looks very similar). It's called a magic word and there are several that are parsed in a special way from wikitext -> HTML. I don't know myself the different ways in which they will all show up in the HTML (some become just standard text in the article; others affect meta tags as you showed). If you think you can handle their behavior specifically, go for it. Otherwise, I'd just leave a comment/note in your notebook noting their existence. mwparserfromhell unfortunately treats them like templates, which is not technically correct, so you would see a difference in the counts of "templates" between the two and that is expected.
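
If it helps, here is one way to separate the two when counting, sketched with a plain regex rather than mwparserfromhell so that it is self-contained. The `KNOWN_MAGIC_WORDS` set is a small hand-picked assumption, not the full list of magic words, the regex ignores nested templates, and the function name is hypothetical.

```python
import re

# Hand-picked subset for illustration only; the real list is much longer.
KNOWN_MAGIC_WORDS = {"DEFAULTSORT", "DISPLAYTITLE", "PAGENAME", "NAMESPACE"}

# Captures the text after "{{" up to the first "|", "{" or "}".
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)")


def split_templates_and_magic_words(wikitext):
    """Return (templates, magic_words) found in curly-bracket syntax."""
    templates, magic_words = [], []
    for raw in TEMPLATE_RE.findall(wikitext):
        name = raw.strip()
        # Magic words with an argument use a colon, e.g. DEFAULTSORT:Doe, Jane
        keyword = name.split(":", 1)[0].strip()
        if keyword in KNOWN_MAGIC_WORDS:
            magic_words.append(keyword)
        else:
            templates.append(name)
    return templates, magic_words


sample_wikitext = "{{Infobox person|name=X}} text {{DEFAULTSORT:Doe, Jane}} {{Film genres}}"
found_templates, found_magic_words = split_templates_and_magic_words(sample_wikitext)
```

A count based on `found_templates` alone should be closer to what shows up as templates in the HTML than a count of everything in curly brackets.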

Mar 30 2022, 1:13 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Mar 29 2022

Isaac added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

HTML does seem to have more content than the wikitext. Is it owing to the inner workings of the parser, i.e., mwparserfromhell, or is it actually the case?

It's the actual case. See Figure 4 in the paper for the example of links in wikitext vs. HTML: https://arxiv.org/pdf/2001.10256.pdf#page=6
The reason there are more links in the HTML than appear directly in the wikitext is that many templates add a lot of extra content to Wikipedia articles. So in the wikitext of a Wikipedia article about movies, you might see something like {{Film genres}}. The effect of this on the HTML is adding all the content on this page: https://en.wikipedia.org/wiki/Template:Film_genres (which looks to be well over 100 links). Hope that helps.
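
A toy illustration of the difference, using only the standard library. Both sample strings are invented (a real {{Film genres}} expansion is far larger), and filtering on `rel="mw:WikiLink"` is my assumption about how internal links are marked in the HTML.

```python
import re
from html.parser import HTMLParser

# Matches the target of [[Target]] or [[Target|label]] in wikitext.
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)")


def count_wikitext_links(wikitext):
    return len(WIKILINK_RE.findall(wikitext))


class LinkCounter(HTMLParser):
    """Counts <a> tags marked as internal wiki links."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("rel") == "mw:WikiLink":
            self.count += 1


def count_html_links(html):
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# One explicit link plus a template in the wikitext...
sample_wikitext = "A [[drama film]] is a genre. {{Film genres}}"
# ...becomes one link plus whatever the template expands to in the HTML.
sample_html = (
    '<p>A <a rel="mw:WikiLink" href="./Drama_film">drama film</a> is a genre.</p>'
    '<div><a rel="mw:WikiLink" href="./Action_film">Action</a>'
    '<a rel="mw:WikiLink" href="./Comedy_film">Comedy</a></div>'
)
wikitext_link_count = count_wikitext_links(sample_wikitext)
html_link_count = count_html_links(sample_html)
```

Here the wikitext count only sees the one explicit link, while the HTML count also includes the two links that came from the (pretend) template expansion.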

Mar 29 2022, 7:21 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Isaac added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

Upon reading this section, I had assumed that issues with macro expansion in the historical revisions might also exist in other places.

@Appledora thanks for clarifying. The issue mentioned in that passage is specific to generating the HTML for historical revisions -- e.g., taking the wikitext for an article from 2010 and trying to convert it into HTML. That is very difficult to do and will likely result in missing/incorrect content. In this case, the HTML dumps that you are working with are created from the current versions of articles so they don't have this issue and you can assume the HTML is complete and correct.

Mar 29 2022, 7:03 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Isaac added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

After reading the paper by Mitrevski et al., I had expected there to be less information in the HTML code compared to the wikitext. However, while doing the first to-do, I have found the outcomes to be the opposite.

@Appledora can you explain more? My takeaway from that work is that the HTML often has much more content.

Mar 29 2022, 6:34 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Isaac updated the task description for T280369: Isaac Academic Service.
Mar 29 2022, 5:23 PM · Research
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Mar 29 2022, 5:22 PM · Patch-For-Review, periodic-update, Research
Isaac committed rRLP42e3b1950837: Add TREC 2021 paper (authored by Isaac).
Add TREC 2021 paper
Mar 29 2022, 5:13 PM