Weekly updates:
- Draft of intro: https://meta.wikimedia.org/wiki/User:Isaac_(WMF)/AI_Datasets
Weekly updates:
Summary of some data analysis I did for evaluating the current topic taxonomy and gathering some thoughts about potential changes (google doc with more data/notes):
@Miriam I don't mind either way but I'll be bold. This is my quarterly goal task, so it touches on the topic classification evolution but also other related aspects, and I mainly see it as a personal tracking task that I intend to close out at the end of this quarter. I think the best thing would be to make this a subtask of T343241 (as in, I'm playing a supporting role for the taxonomy work) and I'll shift my updates over there when they're about the topic taxonomy.
Weekly updates (adding early while it's fresh):
Weekly updates:
Weekly updates:
Weekly updates:
Weekly update:
Weekly updates:
Weekly updates:
@YLiou_WMF here's the task -- please sign L3
Ahh this is great news @kevinbazira ! @KartikMistry is there any reason from the Content Translation side why we can't switch over to the LiftWing endpoint? My read is that the code is quite simple -- e.g., if I go to Content Translation on Spanish Wikipedia, the tool hits this endpoint:
https://recommend.wmflabs.org/types/translation/v1/articles?source=en&target=es&seed=Music%20Modernization%20Act|Felony%20disenfranchisement&search=morelike&application=CX
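For reference, here's that same request expressed in Python (a minimal sketch using `requests`; the parameters are just the URL-decoded query string from above, and the response schema isn't assumed here):

```
import requests

# Same request the Content Translation tool makes to GapFinder.
# A LiftWing migration would presumably only change the base URL,
# not this client-side logic.
BASE_URL = "https://recommend.wmflabs.org/types/translation/v1/articles"
params = {
    "source": "en",        # source wiki language
    "target": "es",        # target wiki language
    "seed": "Music Modernization Act|Felony disenfranchisement",
    "search": "morelike",  # recommendation strategy
    "application": "CX",   # Content Translation
}
resp = requests.get(BASE_URL, params=params, timeout=30)
resp.raise_for_status()
print(resp.json())  # inspect the payload for the actual schema
```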
hey all (not sure who exactly to tag but maybe I'll start with @kevinbazira just because I know you did a lot of good work on this) -- I'm working on some planning for improvements to our recommender systems for next fiscal year around what topic filters we provide to editors. Content Translation is of special interest but Android's SuggestedEdits is important too. The recommendation logic for both of these systems is still hosted on GapFinder as far as I can tell, but deploying any improvements is going to require moving them to a proper service (LiftWing). Does anyone know why this effort to move Content Translation's recommendation API over to LiftWing (along with Android's endpoints T340854) stalled last year?
Weekly update:
Weekly updates:
@DJames-WMF has made progress on converting the wikitext over to HTML features. We're finding that the old normalization values -- e.g., how many references are expected in a top-quality article for a given wiki -- are no longer well-aligned for a few features. This seems to be most relevant for page length, which then affects wikilinks and references as well. I'll need to look into re-generating these normalization values. A few options:
Confirmed -- thanks @DDeSouza !
Would it be possible to add the theme of the showcase to its subpage title?
Good idea. I think it should be doable (just need to move the pages to the new titles). The downside would be that it's harder to guess the page title, but maybe that's not an issue. One alternative too: when we picked up this task, there was also a question about whether we wanted some sort of "summary" of our archive. Maybe the listing of pages isn't the place to do that, and instead we add a basic table to the page with each month, a link to the full description, and the theme?
@DDeSouza I went ahead and made a merge request for a new paper and some small adjustments to the other papers (seemed easier than trying to explain in this case): https://gitlab.wikimedia.org/repos/sre/miscweb/research-landing-page/-/merge_requests/25
Thanks all for the patience on this -- I have now moved all the past showcases onto monthly archive pages and added the search functionality to the main page: https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#Archive
Next steps for this notebook based on Destinie's assessment (notebook) of how well-distributed each model feature is after switching to HTML. We have three features that are poorly distributed (values all lumped together) so the model cannot learn much from them. They are:
Weekly update: no progress
Weekly updates:
Added another step for the bug-fixing we're working on right now with 0-values for some of the features. I also unchecked the optional exploration -- that actually is separate from the notebook (it involves updating a README file in a code repository) so we can talk about it in a future meeting and decide whether to pick it up or not.
Thanks! Unlikely to happen soon but when we reach a stage where we are re-training the model, I'll see if we can experiment with nudging the model away from these sorts of responses (because agreed that it's ideal to solve it via model architecture / training as opposed to post-hoc filters if possible). And please continue to share if you see other patterns in incorrect recommendations.
Excellent work @DJames-WMF ! Took a readthrough of your notebook and everything looked good. Closing this as resolved. We didn't pursue the Nepalese Wikipedia extension but that's okay -- we can always come back to it later. For now, I'd like to progress to the HTML work that you've started in T361623.
(Shouldn't this be a factor for machine learning? I mean, if matching the title produced a wrong description as a general rule, wouldn't the machine learning algorithm infer it from the training set?)
Closing this task out. We can re-open or create a new one in case substantial new work is required as a result of COLM etc. I'll still update with an arXiv link when available.
@DJames-WMF can claim and start this task when T360815 is complete.
Human
3 beams: "Ethnic group" / "Ethnic group of humanes" / "Ethnic group of humans"
Thanks for passing this along @Jack_who_built_the_house! I checked a number of other very high-level topics and didn't find it in Civilization or Primates but did get "Class of plants" for Plants. This sort of error seems most likely with articles about very high-level concepts (which often already have article descriptions, thankfully) but would still be nice to fix obviously. We might be able to address this sort of tautological output by adding a simple string-matching check to ensure that the output doesn't contain the title itself. Before we implement anything, though, I'd want to think about what sort of issues this might cause with e.g. very simple titles, where text matching might introduce a bunch of false positives (and therefore not return results).
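To make that concrete, here's a minimal sketch of the kind of check I have in mind (the function name, length threshold, and fallback behavior are all hypothetical design choices, not an implemented filter):

```
def filter_tautological(title: str, candidates: list[str],
                        min_title_len: int = 4) -> list[str]:
    """Drop candidate descriptions that just repeat the article title."""
    # Skip the check for very short titles, where substring matching
    # would produce too many false positives.
    if len(title) < min_title_len:
        return candidates
    title_lc = title.lower()
    kept = [c for c in candidates if title_lc not in c.lower()]
    # If every candidate matched, fall back to the originals rather
    # than return no results at all.
    return kept or candidates
```

With the "Human" example above, the two "Ethnic group of humans/humanes" beams would be dropped while "Ethnic group" would pass, and the fallback avoids the no-results failure mode.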
Weekly updates:
Paper submitted to COLM and we'll hear back on May 24. I'll link to the arXiv paper when posted.
Notebook looks good - thanks for the hard work and patience on this!
Task created -- @isarantopoulos just let me know if any details are missing or anything I can do to help with next steps when you are ready!
Weekly updates:
Weekly update:
Weekly updates:
Weekly update:
Very excited to see this gaining some traction (thanks @mpopov and @dr0ptp4kt)! Commenting on the analytics side of things (I don't know enough about Varnish to comment on implementation details):
This is really wonderful news! Thanks @kevinbazira for slogging through this with us and @isarantopoulos for your support as well! Those endpoints were working for me too so I'll let Android indicate what the next steps are.
Thanks @taavi! Indeed, unblocked now
Connecting to another ticket focused on producing these topic snapshots: T351118
Weekly updates:
Weekly update:
I'll let others chime in but that would be my feeling about the correct scope. Going historical indeed adds a lot of complications and I think current snapshots are a huge first step. I'd coordinate with Enterprise obviously just to see if any changes are going to happen with the schema etc., but hopefully it's relatively straightforward.
Moving this to discuss with the team. Seems reasonable to have 1 or 2 versions of this if we source it from the Enterprise dumps.
Thanks @lbowmaker for considering and @mfossati for raising! Just chiming in to add my support that having a current snapshot of Parsoid HTML from Enterprise would be very helpful. We've developed a Python library (mwparserfromhtml) that enables us to extract lots of features (references, infoboxes, plaintext, etc.) easily from the HTML so are in a good position to make use of it. Within Research, we're working on switching more of our models to using it too because the gap between wikitext and HTML is definitely growing (example with references). For example, we have an intern who will be working on converting the quality model used for knowledge gap metrics from using wikitext to HTML for this reason, so having a regular snapshot that could be used for computing article quality for all articles would be very helpful.
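As a rough illustration of the workflow this enables (the REST endpoint is real; the `Article`/`get_plaintext` usage is from my memory of mwparserfromhtml and may not match the current release exactly, so check the repo docs):

```
import requests
from mwparserfromhtml import Article  # pip install mwparserfromhtml

# Fetch the Parsoid HTML for an article via the REST API.
html = requests.get(
    "https://en.wikipedia.org/api/rest_v1/page/html/Earth", timeout=30
).text

# Parse once, then extract whichever features you need.
# NOTE: the method name below is from memory -- see the library
# docs for the canonical interface.
article = Article(html)
print(article.get_plaintext())
```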
Can we investigate reducing the computational need to just the language requested?
The model definitely benefits from some translations, so I would say "just the language requested" is not the right approach. Are you suggesting capping it at 5, for example? With Ilias' fix, it no longer seems to be an issue from the pre-processing perspective at least, so hopefully it's not a blocker at this point and capping would just be a bonus for model latency. That said, if there's a desire to constrain it further, I can look into it in the next few weeks, but please don't let it be a blocker given that Android has said they're comfortable moving forward per T343123#9558740.
We currently have 4 workers consuming tasks for the Programs and Events Dashboard, and 3 for the Wiki Education Dashboard, so I guess the max number of concurrent requests would be 7 (if all the workers are working at the same time).
Yeah, that's quite reasonable! Thanks for looping back about it.
Weekly updates:
Weekly update: no progress but will check in with team next week
This puzzles me. Is that really necessary for the model to work?
Yes and no -- the model is multilingual, so you can think of it as doing a mixture of finding the right phrase within the first paragraph of the article in the target language, translating over descriptions from other languages, and translating+extracting phrases from articles in other language editions too. In reality, I suspect we could come up with a simple but smart way to cap how many languages it queries without actually reducing the output quality (because I'd guess that the model has everything it really needs after at most 5 or so languages). But also in reality, most missing article descriptions will be for articles that exist in only a few languages, so this sort of optimization hasn't been tested because I think it'd be triggered pretty rarely. If we feel this is important, I can do some tests to see how much this changes things output-wise. It's very simple code-wise at least: you'd still gather all the possible sitelinks but then cap them with something like this (here we take the five largest language editions):
```
descriptions, sitelinks, blp = await self.get_wikidata_info(lang, title)

# new code - excuse its hackiness: rank languages by edition size
lang_by_size = {l: i for i, l in enumerate(
    ['en', 'de', 'nl', 'es', 'it', 'ru', 'fr', 'zh', 'ar', 'vi', 'ja', 'fi',
     'ko', 'tr', 'ro', 'cs', 'et', 'lt', 'kk', 'lv', 'hi', 'ne', 'my', 'si', 'gu'])}
# keep only the five largest language editions with a sitelink;
# languages not in the list above sort last
sitelinks = {l: sitelinks[l]
             for i, l in enumerate(sorted(
                 sitelinks, key=lambda x: lang_by_size.get(x, len(lang_by_size))))
             if i < 5}
```
@kevinbazira thanks for explaining -- I was unaware of the Rest Gateway etc. stuff so assumed LiftWing was using the same entrypoints to the APIs. I saw the other ticket where you're working through possibilities. I'll monitor but sounds like you all have it and thanks for digging into this!
Oh wow, good sleuthing. To help me understand, does this sound right: the challenge with preprocess is that for every language (up to 25) in which the article exists, an API call has to be made to that language edition's page-summary REST endpoint to get the first paragraph (enwiki endpoint). And as we can see, this is normally still under half a second because it's a pretty quick API and we do the calls async. But presumably something on LiftWing is preventing the up-to-25 API calls from being made async/simultaneously?
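For reference, the async fan-out described above looks roughly like this (a sketch with `aiohttp`; the endpoint is the real page-summary route, while the function names are mine):

```
import asyncio
import aiohttp

SUMMARY_URL = "https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}"

async def fetch_summary(session: aiohttp.ClientSession,
                        lang: str, title: str) -> dict:
    # One call per language edition the article exists in (up to 25).
    async with session.get(SUMMARY_URL.format(lang=lang, title=title)) as resp:
        resp.raise_for_status()
        return await resp.json()

async def fetch_all_summaries(sitelinks: dict[str, str]) -> list[dict]:
    # Issue all the calls concurrently rather than one by one,
    # which is why this normally stays under half a second.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_summary(session, lang, title)
                 for lang, title in sitelinks.items()]
        return await asyncio.gather(*tasks)

# e.g., asyncio.run(fetch_all_summaries({"en": "Earth", "es": "Tierra"}))
```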
Checked and grabbed a few files that are important, so mhoutti's home directory on stat1008 and HDFS may now be removed. Thanks!
Does that answer your question?
Getting us closer I think -- it is a batch job, so there's the possibility of a large number of requests all at once. Do you know what a maximum load might look like (doesn't have to be super specific, just a general sense to make sure it doesn't cause issues on the Wikimedia end)? For instance, is it async but with at most 20 concurrent requests, or a sequential job that's only processing one revision at a time? There isn't necessarily a wrong answer, though the REST API documentation says max 200 reqs/second: https://en.wikipedia.org/api/rest_v1/. Older revisions could take some time to process as nothing would be cached on the Parsoid side.
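To make the "async but capped" option concrete, here's a rough sketch of bounding concurrency with a semaphore (the cap of 20 is just the example number from my question, well under the documented 200 reqs/second):

```
import asyncio
import aiohttp

MAX_CONCURRENT = 20  # illustrative cap

async def fetch_revision_html(session: aiohttp.ClientSession,
                              sem: asyncio.Semaphore,
                              lang: str, title: str, rev_id: int) -> str:
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/html/{title}/{rev_id}"
    async with sem:  # at most MAX_CONCURRENT requests in flight
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.text()

async def fetch_batch(revisions: list[tuple[str, str, int]]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_revision_html(session, sem, *rev) for rev in revisions)
        )
```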
Weekly update:
Weekly updates:
Sounds good -- one thing that came up when I was chatting a bit with our Parsoid folks: what's the strategy for collecting the ref counts? Would it be a batch job with a lot of concurrent API calls for the HTML (latency could really start to become a factor because old revisions are unlikely to be cached) or something a bit more spread out / kinder to the APIs?
It should work for any project actually assuming they follow the same approach to handling citations as Wikipedia does but I haven't tested much beyond Wikipedia. You'd just switch the project in the REST API URL -- e.g., https://en.wiktionary.org/api/rest_v1/page/html/heart/76995678 for the article you used above: https://en.wiktionary.org/wiki/?oldid=76995678
Oooh yes, excellent example to think through. I think there are two potential answers to the question of how many references are in an article, but they only loosely relate to ref tags vs. footnote templates. Other people might use different terminology (English Wikipedia notes that people often use these terms interchangeably, unfortunately) but this is how I'll distinguish them:
I suspect it might be a lot slower with HTML, especially since the HTML of old revisions is probably not cached and so would rely on being rendered by MediaWiki for each query. But maybe it won't be too slow.
@Ragesoss yeah, that's a fair point but hopefully not a blocker. Another point in favor: if you switch to HTML, with relatively little additional overhead you can also add in extraction of other elements. Most of the latency would come from requesting the HTML from the API and doing the initial parsing in Python, but extracting additional features would be very cheap after that. My library has implemented this already for audio, categories, externallinks, sections, images, infoboxes, lists, math elements, message boxes, navboxes, hatnotes, references (unique sources as opposed to inline citations), videos, wikilinks, and wikitables. There are likely some gaps depending on whether the feature is an explicit MediaWiki feature (e.g., images, where extraction should be near perfect) or more norm-based and template-driven (e.g., infoboxes, where some language communities might not follow the norm). But at least the library codifies some expectations and works for any language of Wikipedia. If there are other features you're interested in but don't see above, I'm always happy to discuss and figure out how feasible adding support would be.
I finally got around to doing some analysis of wikitext vs. HTML. High-level: about 90% of sources/citations in the HTML are correctly identified via ref tags in the wikitext. This varies by language. This is for existing articles though, and we might see that a lot of them were e.g. initially added by bots; wikitext and HTML still match up pretty well for new edits (as would be relevant to the P&E dashboard). That said, I think this is a good indication that long-term it makes sense to switch over to HTML as the source for this data.
Weekly updates:
Weekly update:
This is very useful (and exciting) data -- thank you @isarantopoulos !