Yesterday
Updates: the blogpost is being published Tuesday and then I'll close this out.
weekly updates:
- Did a lot of thinking about the SuggestBot experimental design and revising our offline analyses to be more in line with the proposed experiment. In particular, I'm stuck on how to transform the data we have for each editor receiving SuggestBot recommendations (their edits attributable to SuggestBot recs and their broader edit history) to appropriately capture their flexibility when it comes to editing about various topics -- e.g., if they predominantly edit articles about men, would that affect the likelihood that they'd accept a recommendation to edit a biography of a woman? Unfortunately, even with gender (which is relatively simple), there are several challenges:
- Not all articles are biographies, so a feature that captures whether a recommendation matches an editor's preferences around biography gender (as gathered via edit history) doesn't distinguish between ambivalence about the gender of a biography and simply not editing biographies.
- For example, for an editor whose biography edits are 40% women and 60% men, should we be more surprised if they accept a recommendation for a biography of a man or of a woman? Presumably this example editor prefers to edit about women (relative to the overall share of biographies about women), but maybe they're just editing about a topic that happens to have more women and they don't actually have a preference (or they're editing about sports and do have a strong preference)?
- A solution might be to not model it at all but instead capture it via descriptive stats, which would probably also more easily capture editor-to-editor variability in how flexible their preferences are (a rough sketch of what that could look like follows this list).
- Also working on summarizing the current state of the project with Mo before we decide on parameters for the experiment.
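To make the descriptive-stats idea above concrete, here's a minimal sketch with toy data and made-up column names -- the real SuggestBot extraction would need to be mapped onto something like this -- computing per-editor gender proportions while keeping "doesn't edit biographies" separate from "ambivalent about gender":

```python
import pandas as pd

# Hypothetical toy data: one row per (editor, edited article), with a gender
# label for biography edits and None for non-biography edits. Column names are
# illustrative only.
edits = pd.DataFrame({
    "editor": ["A", "A", "A", "B", "B", "C"],
    "bio_gender": ["woman", "man", None, "man", "man", "woman"],
})

# Keeping non-biography edits out of the denominator separates "doesn't edit
# biographies" from "edits biographies but is ambivalent about gender".
bios = edits.dropna(subset=["bio_gender"])
per_editor = bios.groupby("editor")["bio_gender"].agg(
    n_bio_edits="size",
    prop_women=lambda g: (g == "woman").mean(),
)
print(per_editor)
print(per_editor["prop_women"].describe())  # spread across editors ~ flexibility
```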
Working with DS and PD on getting the historical evaluation of quality working. Sent them a patch for images in templates/galleries, but the main issue is still assumed to be that the expected features for high-quality articles are selected based only on the current snapshot and not all of history (which overweights high-quality, highly-edited articles).
Tuesday sounds great - thanks!
Wed, Jun 22
I should add that I'm going to delete this instance because I need its resources and a hard reboot did not solve the issue either. I assume the logs will be deleted too, but in my experience this has happened several times with different VMs, so if whoever looks at this can't replicate it, let me know and I'll retry on an instance that can stay unreachable for however long it takes to debug.
Fri, Jun 17
The only thing I still need is the links to the images that you've included in the Google Doc -- could you share those with me directly?
No problem -- that was the motivation I needed to upload them to Commons:
Weekly update:
- data released and talked through metrics more
Updates:
- Copy-editing on blogpost -- should be published next week
Wed, Jun 15
If you're happy with the small changes and can update the comments, I can set this up and schedule it in WordPress for early next week.
Looks great -- thanks for the feedback. I went through and did another pass, so I feel like it's ready to go now.
Tue, Jun 14
OK -- patch uploaded. @Tchanders let me know here or on the patch if there are any questions etc. The only caveat is that I created a new config variable so it would be easier to adjust this in the future without changing code, but for testing/production we obviously have to make sure that the config file being used has this variable too.
Fri, Jun 10
Weekly updates:
- Put together initial thoughts on next phase of this project around data gaps: https://meta.wikimedia.org/wiki/User:Isaac_(WMF)/Content_tagging/Data_gaps
- Tracking SuggestBot analysis/experiment work under this task: T310379
Weekly updates: none
Updates:
- Edit types blogpost submitted for feedback to techblog: T310237
- Shared notebook for detecting link changes with CC for SDAW: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/link-changes/wikilink_changes.ipynb
- Presented the work to the Product Analytics team. Lots of good questions about global coverage -- which languages it has good coverage for and when the precomputed edit types will be available.
Thanks @KinneretG for creating this task! Just chiming in with a thought for whoever takes up this work: search engines are a bit more standardized in their referer URLs, but some of these external platforms have link shorteners that we need to account for. You can see a few examples from a past pilot with a similar scope (though fewer external platforms). Generally I just inspected the top external traffic from a given day to identify any non-standard referer formats -- e.g.:
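For illustration, here's a rough sketch of the kind of inspection I mean -- the referer URLs and the shortener-to-platform mapping below are made-up examples, not a complete list:

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample of external referer URLs (in practice these would come
# from the webrequest logs); the shortener mapping is illustrative only.
referers = [
    "https://t.co/abc123",                                           # Twitter's link shortener
    "https://www.google.com/",
    "https://lm.facebook.com/l.php?u=https%3A%2F%2Fen.wikipedia.org%2F",  # Facebook link wrapper
]
SHORTENER_TO_PLATFORM = {"t.co": "twitter", "lm.facebook.com": "facebook"}

def referer_platform(url):
    """Map a referer URL to a platform, collapsing known shortener hosts."""
    host = urlparse(url).netloc.lower()
    return SHORTENER_TO_PLATFORM.get(host, host)

# Inspecting the most common hosts is how the non-standard formats tend to surface.
print(Counter(referer_platform(r) for r in referers).most_common(10))
```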
Wed, Jun 8
Mon, Jun 6
@kevinbazira chiming in here to try to help sort out the different services. tagging @leila and @bmansurov too who hopefully can correct / verify what I know:
Fri, Jun 3
weekly updates: met with JT and MR to discuss knowledge gaps / ml equity + product. shared general takeaways:
- For measurement of content impact, gender and geography give good coverage: gender because interventions have been shown to be effective at getting editors to edit content about women, so design choices can have a real impact; geography because interventions are less effective (editors are more likely to edit content with which they are familiar, and geographic familiarity is a large component of this), so without measuring individual editor demographics, tracking content geography gives some insight into the diversity of the editor community and encourages long-term investments in supporting a more diverse editor community.
- For design: individual filters (e.g., topics, countries) are good and should continue to receive development but individual action won't close knowledge gaps. for that, we need collective action of the type organized by campaigns/edit-a-thons. so long-term, connecting recommender systems with campaigns feels like the much more effective approach.
Weekly updates: worked with DS to make sure his implementation of the model for all of wikitext history made sense. Discussed how to set the right thresholds for each feature (use only the current version of the wikitext to determine 'top quality' articles, not every revision).
Also, FK finally figured out the cluster + edit types issue! It evidently is a memory issue with mwparserfromhell version 0.6.4: running with the default version on the cluster (0.6) completed, while running with the upgraded version (0.6.4) triggered memory errors. No clear reason, so presumably some complicated interaction between the library and YARN.
Weekly updates:
- First full draft of the blogpost. Running it by Jesse before submitting to the techblog for review.
Tue, May 31
This can be worked on while some of the parameters are being confirmed. One thing I noticed is that the URL requires you to pass along a wiki. I assume since similar users only works with enwiki right now, it's fine to hardcode that in? Should we track some of these hardcoded dependencies (since I believe there's some talk about making it work for all wikis?)
Not my decision, but agreed that hardcoding is fine for now; it would be good to track these for a hopeful eventual conversion to all wikis. All the MediaWiki API calls from within the tool require a language too, and that would also need to be updated to be more flexible. Thanks for flagging this, because it has implications for the backend data or API -- i.e., giving an appropriate value of a single wiki for the interaction timeline for a given pair of users is not necessarily trivial.
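Not the tool's actual code, but here's a sketch of what keeping the wiki configurable in one place could look like (the names and config structure are hypothetical), so an eventual "all wikis" conversion only touches one spot:

```python
import requests

CONFIG = {"wiki": "en.wikipedia.org"}  # hypothetical config value; enwiki-only for now

def mw_api(params, wiki=None):
    """Route every MediaWiki API call through one configurable endpoint."""
    wiki = wiki or CONFIG["wiki"]
    params = {**params, "format": "json"}
    return requests.get(f"https://{wiki}/w/api.php", params=params).json()

# Example: fetch basic info for a user on whatever wiki is configured.
print(mw_api({"action": "query", "list": "users", "ususers": "Jimbo Wales"}))
```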
Fri, May 27
weekly updates: most of my thinking in this space has been preparation for Mo's summer work around personalized edit recommendations and content equity (focus on SuggestBot). Also, in re-reading Diego's proposal for AI + Knowledge Integrity, he's covered the need to discuss data-generation strategies with ML Platform and Product stakeholders, so the planning for that (which we'd identified as the main priority for next year in this space) can likely happen in collaboration with him (perhaps using vandalism detection as the case study, which is something I would have proposed anyhow).
Weekly updates: none
Weekly updates:
- continued minor improvements to both libraries to fix edge cases. I put together a simple notebook script for helping to identify issues and will hopefully eventually get to the point where both libraries are in full agreement where possible: https://public.paws.wmcloud.org/User:Isaac_(WMF)/Edit%20Diffs/DiffLibraryComparison.ipynb
- ragesoss would be very interested in the references and words aspect of the tool so hopefully at some point in the next year, we'll work with them to incorporate the outputs into their programs and events dashboard: https://outreachdashboard.wmflabs.org/
- wrote the full introduction to the blogpost and will wrap up the rest of the sections next week. then some time for Jesse to comment before submitting to TechBlog.
Weekly update:
- keywords filtered and instructions released: https://fair-trec.github.io/docs/Fair_Ranking_2022_Participant_Instructions.pdf
May 26 2022
Should a task be filed for this, or is this just something that should be worked around for now?
Just to be clear: this question pertains to the timeout when gathering new contributions for Arjayay? If so, a few things going on here:
- I don't think the databases for the tool have been updated in many, many months, so you're going to see this happen much more right now than you would once we have the monthly database updates running.
- What's happening is that, say, the databases are current through 30 April 2022. If you query a user today, the tool hits the API for the pages they edited since 30 April. For each of these pages, the edit history (since 30 April in this example) is then gathered and analyzed. This is the step that's almost certainly timing out. Details:
- The first set of API calls for pages edited is loosely capped at 1000 pages (code). It would be pretty cheap to reduce that cap to, say, 50 pages. Then if someone edited a lot recently, the first call to the tool would get those 50 pages, the next call would get the next 50, and so on. You essentially stretch the data updates over multiple sessions so that no single session times out (hopefully), at the cost of maybe not having all the most current data in that first session. The timespan associated with the data is included in the API response, though, so hopefully we could expose this easily to the user of the tool. We're also making some assumptions about the data that aren't perfect, and the more sessions this update is spread across, the more likely we are to introduce error into the data; I wouldn't be super concerned about this and I haven't empirically evaluated it, but it's an FYI. Each monthly database update resets this error to zero, though, so that's good. (A rough sketch of this capped approach follows this list.)
- The second set of API calls (code) is the expensive step. Many active editors will have edited 1000 unique pages since the database was last refreshed, and each of those pages could require many API calls to get the edit history (especially right now). There are maybe ways to explicitly limit this process too, but that's a lot trickier to do (definitely its own task, and I don't know when/if it would be figured out). All to say, it's much easier to address this at the prior page-gathering step.
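Here's the rough sketch of the capped page-gathering idea from the first bullet above. This is not the tool's actual code, just an illustration of stopping early and relying on standard MediaWiki usercontribs continuation:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # enwiki as an example; the tool's real code is structured differently

def pages_edited_since(username, since_timestamp, cap=50):
    """Gather up to `cap` pages a user has edited since the last database
    snapshot, stopping early so that a single session stays cheap."""
    pages = set()
    params = {
        "action": "query", "list": "usercontribs", "ucuser": username,
        "ucend": since_timestamp,  # contribs come back newest-first, so this bounds how far back we go
        "uclimit": "max", "format": "json",
    }
    while len(pages) < cap:
        data = requests.get(API, params=params).json()
        pages.update(c["title"] for c in data["query"]["usercontribs"])
        if "continue" not in data:
            break
        params.update(data["continue"])
    return sorted(pages)[:cap]
```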
Thanks all for the input on this task and @BBlack especially for digging up what was happening. I finally updated the task description to reflect what I think is the current understanding of the situation but let me know if anything seems off.
May 20 2022
weekly updates: haven't worked on this much but did meet with PG from Language to discuss future of content translation recommendations and potential collaborations there
Weekly updates:
- Some indication that memory issues might be related to wikitext table and not the edit types UDF
- MA identified some simple optimizations for the edit types library around converting mwparserfromhell nodes to strings that greatly increase the speed of the library -- e.g., an order of magnitude faster, especially for larger changes! Not implemented yet, but we'll likely update the libraries next week.
- meeting set up with ragesoss from Wikiedu to talk about potentially using the edit types with their programs and events dashboard (tracking the impact of edit-a-thons etc.)
Weekly updates: continued fine-tuning of model card based on feedback from Hal / Pablo: https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_Wikipedia_article_quality_model_card
Good catch -- I forget why exactly this would have been included but the num_edits_in_data variable is the important one here that captures something very similar. Which is to say I support updating the documentation and leaving the code as is.
May 18 2022
May 17 2022
Weekly updates: spent some time fighting with the cluster and edit types library to get it running on mediawiki history snapshots without memory errors. FK and I worked on it collaboratively yesterday but did not make much progress in diagnosing or fixing the issue. Will continue to think on potential steps forward.
May 16 2022
@RoccoMo we can continue with this next week (we can schedule a time during our Wednesday call, but the same time works for me). We'll go through the SuggestBot extraction notebook in detail. If you're curious, feel free to take a look ahead of time at some of the examples/documentation, but there's no expectation that you will have done that ahead of our next session.
May 13 2022
FYI in case it's useful, here's some code I was using for extracting cite templates on English Wikipedia and joining it with country data inferred based on URLs/publishers: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/references/ref_extraction.ipynb
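The gist of that approach, as a minimal sketch with made-up wikitext (not the notebook's exact code): parse the wikitext with mwparserfromhell and pull the URL/publisher parameters out of cite templates.

```python
import mwparserfromhell

# Toy wikitext for illustration only.
wikitext = """
Some article text.<ref>{{cite web |url=https://example.org/report |title=Report |publisher=Example Org}}</ref>
"""

code = mwparserfromhell.parse(wikitext)
for tpl in code.filter_templates():
    # Cite templates on enwiki are typically named "cite web", "cite news", etc.
    if tpl.name.strip().lower().startswith("cite"):
        url = str(tpl.get("url").value).strip() if tpl.has("url") else None
        publisher = str(tpl.get("publisher").value).strip() if tpl.has("publisher") else None
        print(tpl.name.strip(), url, publisher)
```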
Hey all -- just checking in to see if this might be published soon. Not urgent urgent but it would be really nice to have the blogpost published to share at the hackathon to showcase the sort of intern projects that our team (research) has been a part of.
May 12 2022
May 10 2022
Hey SRE/Analytics -- we have a new formal collaborator onboard: @RoccoMo. They need access to HDFS and the stat machines for a new research project. Don't hesitate to let me know if you need more information.
May 6 2022
Weekly updates:
- Working on blogpost. Contract ended but Jesse will continue to be looped in on blogpost. Goal is to get it to TechBlog folks by the end of the month for hopeful publication in June.
- Also prepared a document about potential approaches for creating a stream of these diffs so analysts, tool designers, etc. will have access to a real-time stream of edit types that they can use without having to calculate themselves (shared with some Data Engineering / Platform folks): https://docs.google.com/document/d/1_EQ13zhEtJmYQVijg5ACAT33oHduhaXp05lAaLvTL38/edit?usp=sharing
Weekly updates:
- Draft model card available: https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_Wikipedia_article_quality_model_card#Ethical_considerations%2C_caveats%2C_and_recommendations
- As part of preparing that model card, I also did a more formal evaluation of model performance: https://public.paws.wmcloud.org/User:Isaac_(WMF)/Quality/Quality_Model_Evaluation.ipynb
- I also updated the thresholds I use to convert decimal scores (0-1) to class labels (stub, start, etc.) to better align with the groundtruth data.
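For anyone curious what that score-to-class conversion looks like in practice, here's a minimal sketch -- the threshold values below are placeholders for illustration, not the actual tuned values:

```python
import bisect

# Illustrative thresholds only -- not the tuned values.
THRESHOLDS = [0.25, 0.42, 0.56, 0.74, 0.88]
CLASSES = ["Stub", "Start", "C", "B", "GA", "FA"]

def score_to_class(score):
    """Map a 0-1 quality score onto an ordered class label."""
    return CLASSES[bisect.bisect_right(THRESHOLDS, score)]

print(score_to_class(0.1), score_to_class(0.6), score_to_class(0.95))
```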
May 5 2022
May 4 2022
Thanks @STran for catching this! It's a test instance that I built as an early prototype of what is now the similarusers service (thank you @WDoranWMF and @hnowlan for fielding this, but they weren't involved with this instance). While I opted to password-protect it to reduce the chance of abuse, it's purely based on public data from edit histories. Happy to follow up as needed though.
May 3 2022
May 2 2022
Hey @paramita_das: you can see most of these details in the write-up and attached notebook. Pointers to your specific questions below:
Apr 29 2022
Boldly closing this task (reopen if I'm wrong) -- I believe we completed this a while ago and chose to remove some of the visuals from the page:
Weekly updates:
- Next step will be to try to write model card per https://meta.wikimedia.org/wiki/Machine_learning_models
Update:
- Reading Amaral et al.'s work on references on Wikidata as a source of ideas about ways to analyze sources. Will reach out to them with thoughts when I've had time to think about it.
- Following T305888 to see what I can learn from that project
Update:
- Dataset compiled and ready to release. It will go out in probably 2 weeks, after my collaborators generate keywords to act as queries for each WikiProject. The notebook paper will go out next week.
Apr 25 2022
FYI, if you want an alternative approach for extracting quality ratings for current articles, you can use the page assessments MySQL table as well. The assessments are available for at least English, French, Arabic, Hungarian, and Turkish but only contain the current state:
- An example in this notebook: https://github.com/geohci/miscellaneous-wikimedia/blob/master/list-building/wikiproject_lists.ipynb
- Documentation: https://www.mediawiki.org/wiki/Extension:PageAssessments#Database_tables
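A sketch of the kind of query this enables -- the table/column names are taken from the Extension:PageAssessments documentation linked above, so double-check the schema there before relying on it (how you connect -- analytics replicas, Quarry, etc. -- is up to you):

```python
# Count pages per WikiProject and assessed quality class from the
# PageAssessments tables (current state only, per the caveat above).
QUERY = """
SELECT pap.pap_project_title AS wikiproject,
       pa.pa_class           AS assessed_class,
       COUNT(*)              AS num_pages
FROM page_assessments pa
JOIN page_assessments_projects pap
  ON pa.pa_project_id = pap.pap_project_id
GROUP BY wikiproject, assessed_class
ORDER BY num_pages DESC
LIMIT 20;
"""
```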
@achou excited to have you claim this task! Don't hesitate to reach out if you have any questions about the model etc.
@diego this is your instance -- not sure whether you mind if it's shut down/deleted on May 1st, but FYI in case.
Apr 8 2022
There was a mistake in the title; this work is about article quality and not specifically about citations.
Oh drat -- well then a small tweak on my comments that largely repeats what I said in our meeting. The two main gaps I see in the quality model are:
- Better features related to the actual content (not just its quantity).
- Martin is working on a language-agnostic measure for readability (meta)
- Language-agnostic lists of maintenance templates so editors can flag issues that models might not detect, which I know is something Diego has already worked on
- Better features related to the sources of the article (detailed thoughts in my previous comment) given how important sources are to article quality
Apr 7 2022
Just wanted to comment that I love this task; a few thoughts:
- The start of some of my work around understanding sources -- specifically their geography: https://meta.wikimedia.org/wiki/Research:Analyzing_sources_on_Wikipedia
- There are three aspects of citation quality that stand out to me:
- Do they exist -- i.e. for sentences that need a citation, how many have one? Miriam and Aiko's work tackled this question for a few wikis.
- The reliability/verifiability of an individual source -- e.g., does it show up on reliable source lists or in Featured Articles? Is it primary, secondary, or tertiary? Is there a digital version that doesn't sit behind a paywall?
- There are a few past explorations that touch on aspects of this -- e.g., Scholarly article citations in Wikipedia and @Miriam's initial explorations into the paywall side of things.
- The overall diversity of sources -- e.g., how many unique sources? mixture of primary/secondary/tertiary? mixture of countries? mixture of source types (books, newspapers, etc.)? mixture of dates?
- As I mentioned in our meeting, I'm also intrigued by the question of when sources / content go stale. For example, you have an article that exists in multiple languages. On one language, we see a burst of edits that are adding new sources to the article that seem to be recent (e.g., newspaper articles published in the past month). How often do we see the other languages also add recent sources? When the other languages don't add these sources, which of these is true:
- The articles are incomplete -- e.g., the person won a prize that isn't listed yet?
- The articles have stale (old) data -- e.g., they have population data from 2010 but there's now data for 2020?
- The articles now have misinformation / NPOV violations -- e.g., the original estimate of damage for an earthquake was wrong and by not including the new estimate, the article is under/overstating the impact? Or an even simpler case: the person has died and by not including that information, the reader would assume the person to still be alive?
Apr 6 2022
@Isaac and @MGerlach, should we mail the task notebook to both of you, or would either one do? Thanks.
@Appledora thanks for asking -- the ideal approach is to send the email to both of us and we will coordinate who provides feedback. I'll update the task to clarify that.
Apr 5 2022
Apr 4 2022
Welcome newer applicants -- still plenty of time and glad to see your interest!
Exciting to see you all digging into the details of HTML parsing and lots of good points / questions. If you're still curious, you'll find some more details about why those links to Wikipedia are created as external links here: https://en.wikipedia.org/wiki/Wikipedia:Namespace#Virtual_namespaces
Apr 1 2022
Are there features / data that are available in the HTML but not the wikitext?
What exactly should I be showing here? Codes or just study references?
I'm not sure what you mean by "Codes or just study references", but the thinking with this TODO is that the HTML contains various attributes that aren't included in the wikitext and that can tell us things about the article / links / text / etc. So I'm just asking for an example or two of these. Does that help?
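A quick illustration of what I mean by attributes in the HTML (the snippet below is toy HTML in the style of the Parsoid output; real articles will vary):

```python
from bs4 import BeautifulSoup

# Toy HTML fragment for illustration.
html = '<p><a rel="mw:WikiLink" class="mw-redirect" href="./Example" title="Example">example</a></p>'

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a"):
    # None of these attributes appear in the wikitext, where the same link
    # would just be written as [[Example|example]].
    print(link.get("rel"), link.get("class"), link.get("title"))
```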
Weekly update:
- Put together snapshot of wikiprojects for us to choose the training data
- Pulling together a dump of all the enwiki articles and associated fairness attributes that we're using. I merged in pageview data, article age, and source geo-provenance data. The last attribute is subject age, which I'm working on based on the approach being taken for the Knowledge Gaps Index.
Weekly updates:
- Minor updates to API (supporting .int domains as International)
- Meta page: https://meta.wikimedia.org/wiki/Research:Analyzing_sources_on_Wikipedia
- Took dump of citation templates from English Wikipedia with extracted URLs and publishers and ran through model without making any whois updates for a baseline dataset. Coverage stats: https://public.paws.wmcloud.org/User:Isaac_(WMF)/TREC/citation_coverage.ipynb
Mar 31 2022
Chiming in as a heavy user of the stat boxes. It's difficult for me to follow this conversation so I'm mainly asking for clarification right now on the current proposal. Thanks all for working through this.
Mar 30 2022
Just chiming in to +1 this or at least open the discussion. A few miscellaneous thoughts:
- Many of my personal notebooks that I should keep private don't actually contain highly sensitive data -- e.g., as an extreme example, I'm not printing out editor IP addresses as part of any analyses. They are generally of the form of aggregate analyses of, for example, top external referrer domains to Wikipedia, which is technically private data but would almost certainly pass a privacy review if the raw data needed to be released.
- As a workaround, when I want to share the code, I either have to share the actual location of the notebook on the stat machines or make a copy that I purge of outputs and upload to GitHub etc. (a sketch of clearing outputs follows this list).
- My understanding is that we are hosting the Gitlab instance so hopefully that makes it pretty secure though I'm not sure if the only folks who could ever see a private repo are guaranteed to be NDAed?
- Perhaps if this is implemented, there is some way to ensure that folks don't accidentally switch a private repo to public (without destroying any history that might still contain sensitive info).
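For reference, the output-purging workaround mentioned above is roughly this (a sketch using nbformat; the filenames are placeholders):

```python
import nbformat

# Read the notebook, drop all code-cell outputs, and write a cleared copy
# that is safe(r) to upload publicly.
nb = nbformat.read("analysis.ipynb", as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, "analysis_cleared.ipynb")
```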
Why is this particular template represented differently in HTML than the rest of the templates? Should I consider this as a template in my implementation of wt.filter_templates()? I know {{}} represents a template.
@Talika2002 good catch! The reason those curly brackets don't behave like other templates is that this is technically not a template (though obviously it looks very similar). It's called a magic word, and there are several that are parsed in a special way from wikitext to HTML. I don't know myself all the different ways in which they show up in the HTML (some become just standard text in the article; others affect meta tags, as you showed). If you think you can handle their behavior specifically, go for it. Otherwise, I'd just leave a comment/note in your notebook noting their existence. mwparserfromhell unfortunately treats them like templates, which is not technically correct, so you would see a difference in "template" counts between the two approaches, and that is expected.
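To make that concrete, here's a small sketch of how you could separate magic words out when counting templates with mwparserfromhell -- the magic-word list below is hand-made and non-exhaustive (see https://www.mediawiki.org/wiki/Help:Magic_words for the full set):

```python
import mwparserfromhell

# Toy wikitext mixing a real template with two magic words.
wikitext = "{{DISPLAYTITLE:''iPhone''}} {{Infobox mobile phone|name=iPhone}} {{CURRENTYEAR}}"

code = mwparserfromhell.parse(wikitext)
for tpl in code.filter_templates():
    name = str(tpl.name).split(":")[0].strip()
    # Non-exhaustive, hand-made set of magic words to separate out.
    is_magic_word = name.upper() in {"DISPLAYTITLE", "DEFAULTSORT", "CURRENTYEAR", "PAGENAME"}
    print(f"{tpl.name!s:40} magic word? {is_magic_word}")
```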
Mar 29 2022
HTML does seem to have more content than the wikitext. Is that owing to the inner workings of the parser (i.e., mwparserfromhell), or is it actually the case?
It's the actual case. See Figure 4 in the paper for the example of links in wikitext vs. HTML: https://arxiv.org/pdf/2001.10256.pdf#page=6
The reason there are more links in the HTML than appear directly in the wikitext is that many templates add a lot of extra content to Wikipedia articles. So in the wikitext of a Wikipedia article about movies, you might see something like {{Film genres}}. The effect of this on the HTML is to add all the content on this page: https://en.wikipedia.org/wiki/Template:Film_genres (which looks to be well over 100 links). Hope that helps.
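If you want to see this for yourself, here's a rough sketch comparing link counts for one article (the title is just an example; set a proper User-Agent and rate-limit politely if you run this at any scale):

```python
import requests
from bs4 import BeautifulSoup
import mwparserfromhell

TITLE = "Toy Story"  # any enwiki film article works as an example

# Link count in the raw wikitext (only what's literally written in the source).
wikitext = requests.get(
    "https://en.wikipedia.org/w/index.php",
    params={"title": TITLE, "action": "raw"},
).text
n_wikitext_links = len(mwparserfromhell.parse(wikitext).filter_wikilinks())

# Link count in the rendered HTML (includes everything templates like {{Film genres}} add).
html = requests.get(
    f"https://en.wikipedia.org/api/rest_v1/page/html/{TITLE.replace(' ', '_')}"
).text
n_html_links = len(BeautifulSoup(html, "html.parser").find_all("a"))

print(n_wikitext_links, n_html_links)  # the HTML count should be much larger
```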
Upon reading this section, I had assumed that issues with macro expansion in the historical revisions might also exist in other places.
@Appledora thanks for clarifying. The issue mentioned in that passage is specific to generating the HTML for historical revisions -- e.g., taking the wikitext for an article from 2010 and trying to convert it into HTML. That is very difficult to do and will likely result in missing/incorrect content. In this case, the HTML dumps that you are working with are created from the current versions of articles so they don't have this issue and you can assume the HTML is complete and correct.
After reading the paper by Mitrevski et al., I had expected there to be less information in the HTML compared to the wikitext. However, while doing the first to-do, I have found the outcome to be the opposite.
@Appledora can you explain more? My takeaway from that work is that the HTML often has much more content.