
Proposal: Improve how Wiki Education Dashboard counts references added
Open, Needs Triage, Public

Description

Personal information

Name: Gabina Luz
Timezone: America/Argentina/Buenos Aires
Location: Rosario, Santa Fe, Argentina
GitHub profile URL: https://github.com/gabina/

Timeline of tasks for the internship period

Introduction

The Dashboard is a Ruby on Rails web application for tracking contributions by groups of editors, providing statistics and details about their contributions. It is commonly used for events that bring new editors to Wikipedia, such as edit-a-thons, Wikipedia editing assignments in schools and universities, and distributed campaigns like #1Lib1Ref. In particular, the Spanish #1Lib1Ref campaign (also known as #1Bib1Ref) invites librarians to participate in the Spanish-language Wikipedia, specifically by improving articles with added citations.

The organizers of the Spanish #1Lib1Ref campaign use the Wiki Edu Dashboard and consider the "references added" feature particularly valuable, as it helps them run the campaign with ethics and care: they can better monitor quality, encourage newcomers, praise champions, and much more.

The Dashboard currently gets statistics on references added by fetching the feature data supplied by Wikimedia's article quality machine learning models and comparing the values for one revision with those for the previous revision. This reference-counting method is constrained by the availability of the article quality models, which are not available for the majority of Wikipedia language editions, including Spanish.

Main goal of the internship

The main goal of the internship is to develop a performant alternative implementation of counting references added that does not depend on articlequality features data, and works for every language version of Wikipedia. One high priority is to enable reference counting for Spanish Wikipedia, in support of the Spanish #1Lib1Ref campaign.

One promising route would be to co-opt data from another API that works across languages, such as this one: https://misalignment.wmcloud.org/api/v1/quality-revid-features?lang=es&revid=144495297
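For example, querying that endpoint for a single Spanish Wikipedia revision; a minimal sketch only, and since the response schema is whatever the misalignment project defines, it is simply printed here:

import requests

# Fetch quality/reference-related feature data for one Spanish Wikipedia revision
# from the (prototype) misalignment API.
resp = requests.get(
    "https://misalignment.wmcloud.org/api/v1/quality-revid-features",
    params={"lang": "es", "revid": 144495297},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())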

Project timeline

The internship commences on December 4, 2023, and concludes on March 1, 2024. Therefore, the project timeline aligns with that time frame.

Please note that this timeline is a guide and may be subject to adjustments based on progress and feedback. The descriptions may have different levels of detail, depending on the depth of knowledge about the topic during the planning phase. Some details are intentionally left abstract, to be refined during the code implementation phase.

Week 1: Dec. 4, 2023 - Dec. 11, 2023
Research the current method for calculating added references. Explore the dashboard interface to locate where the "references added" value is displayed. Identify the specific sections of the code responsible for making API calls, handling their responses, and processing the data. Assess how the backend and frontend collaborate to calculate and present the "references added" value, and evaluate the extent of decoupling in the existing code to determine whether changes in the backend alone will suffice. Additionally, review the existing specifications to gain a comprehensive understanding of the required modifications.

Dec. 11, 2023 - Feedback #1

Week 2: Dec. 11, 2023 - Dec. 18, 2023
Having identified the required data for calculating the "references added" feature, investigate the misalignment.wmcloud.org/api/v1/quality-revid-features API to obtain data for the new implementation. This entails reviewing the API documentation and testing its endpoints to assess whether the provided data aligns with requirements. It's crucial to confirm that this API is accessible for all languages. Additionally, comparing existing and new "references added" values in specific cases is necessary to validate the accuracy of the new results.
In case the proposed API doesn't align with the project's requirements, exploration of alternative options may be necessary, potentially leading to an extension of the timeline.

Note: the existing misalignment API was originally designed as a research prototype, so it is intended to be removed eventually; more importantly, it is hosted in a shared space where someone could one day delete it unknowingly. It is therefore not a good place to rely on to sustain the Dashboard long-term.
Toolforge would protect against that accidental deletion and give the WikiEdu folks easier access. In addition, the logic we need around references is relatively simple, while the existing API does a number of other things that slow it down or could cause errors. Moving to Toolforge is also a good opportunity to simplify further, so the service is easier to maintain, faster, and less likely to fail inexplicably.

Week 3: Dec. 18, 2023 - Dec. 25, 2023
Flexible time for adjustments.

Week 4: Dec. 25, 2023 - Jan. 01, 2024
Evaluate the need for modifications to models to store data from the new API. Handle database migration and spec updates if necessary.

Weeks 5-6: Jan. 01, 2024 - Jan. 15, 2024
Develop a streamlined method for making requests to the new API. Add specifications for the new behavior.
Note: LiftWingApi can serve as a reference for this purpose.

Jan. 15, 2024 - Feedback #2

Weeks 7-8: Jan. 15, 2024 - Jan. 29, 2024
Implement a way to use the newly created class to import and store data in the database. Add specifications for this new piece of code.
Note: RevisionScoreImporter can serve as a reference for this purpose.

Jan. 31, 2024 - Feedback #3

Weeks 9-10: Jan. 29, 2024 - Feb. 12, 2024
Integrate all changes. Conduct manual end-to-end tests and, ideally, implement automated ones.
Perform code cleanup if necessary.

Weeks 11-12: Feb. 12, 2024 - Feb. 26, 2024
Develop a deployment plan for production.
Address final details, such as the "references added" description displayed in the dashboard.

Feb. 26, 2024 - Feedback #4

Week 13: Feb. 26, 2024 - Mar. 01, 2024
Conduct a final performance review with my mentor.
Perform a self-review and draw conclusions.

Event Timeline

First off congrats @Gabinaluz! I'm very excited to have someone working on this project!

One thought to add for when you get started: as you begin considering whether the misalignment.wmcloud endpoint can meet your needs in Weeks 1/2, I'd love to hear what specific features you want out of an API. Rather than relying directly on misalignment.wmcloud, I assume it'll be possible/better to build a focused version of that API to host on Toolforge that directly meets your needs.* That'll make it far easier for WikiEdu and others to participate in maintenance / improvements long-term and reduce the likelihood that the wmcloud API is accidentally removed given that it's in a research prototype space. I've set up many of these APIs by now so it'd be a relatively straightforward thing to do (shouldn't disrupt your overall schedule) and I'm happy to walk you through the process too if you're interested.

*Some background on these different backends for hosting APIs: there are two main cloud services where an API like this can be hosted: Toolforge and Cloud VPS. Cloud VPS is where the misalignment API is hosted. Cloud VPS is more flexible as far as installing packages, assigning memory/disk space, etc. but also more complex and therefore harder to collaborate / more liable to break. The type of API that I think you need (parsing revisions of Wikipedia articles to extract details like how many references changed) should be relatively "simple" though in that it doesn't require anything particularly fancy as far as software libraries or memory usage. For that reason, Toolforge will be a better fit as it's much simpler and easier to add collaborators. Porting it over to Toolforge will also make it easier for me to add you and others as collaborators in case you want to make changes to the API etc.

Hey @Isaac, thank you! I was reading your suggestion, and it sounds good. Thank you for the detailed context! I'm still doing some research on the current behavior for references added. I'll get back to you when I have a better idea of this.

Responding here to the questions about the existing ORES features being used, so they don't get missed on IRC:

The WIKI_SHORTENED_REF_TAGS piece can be found here: https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/enwiki.py#L49-L51. For what it's worth, these templates are specific to English Wikipedia and not guaranteed to have the same names (if they're used) in other language editions, but the <ref> tag approach is universal to all language editions. That said, the template check isn't expensive to include if you still want it, and more templates could easily be added if folks noticed issues. The code base I linked above will in general have all the features you're curious about -- e.g., the Wikidata one here: https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/wikidatawiki.py#L57-L60 though a few might be in the base revscoring library: https://github.com/wikimedia/revscoring/tree/master. I'm not aware of active maintenance for these libraries, though, and you only want small pieces, so I'd recommend copying what you need out as opposed to making them dependencies.

@Isaac thank you very much. I was checking the code!

For what it's worth, these templates are specific to English Wikipedia and not guaranteed to have the same names (if they're used) in other language editions, but the <ref> tag approach is universal to all language editions. That said, the template check isn't expensive to include if you still want it, and more templates could easily be added if folks noticed issues.

Reading that, I understand that templates are specific to languages; they aren't general. Do you know if there is an easy way to know which templates are used for each language in advance? I see that in the articlequality/articlequality/feature_lists folder there are definitions for different languages; for example, fawiki uses the following templates (which is a subset of the templates used for enwiki):

SFN_TEMPLATES = [
    r"Shortened footnote template", r"sfn",
    r"Sfnp",
    r"Sfnm"
]

But I'm not sure how to know which templates are used for other languages.
Appreciate your help!

Reading that, I understand that templates are specific to languages; they aren't general.

Correct -- oftentimes, editors duplicate a template behavior in another language but there's no guarantee they're the same.

Do you know if there is an easy way to know which templates are used for each language in advance? I see that in the articlequality/articlequality/feature_lists folder there are definitions for different languages; for example, fawiki uses the following templates (which is a subset of the templates used for enwiki):

My guess is that someone knowledgeable with fawiki added those (one of the former members of the team who developed the ORES models is also an active editor on fawiki so that would make sense). If you want to build language-specific lists, templates have sitelinks (like articles) that allow you to see what other languages they exist on (or at least some of the other languages -- it depends on editors adding these links and I don't know how complete they are for templates). So for instance Template:Sfn is available in at least ~150 languages (https://www.wikidata.org/wiki/Q6535844) and there are programmatic ways to gather that information. You can also get a sense of how often a template is used in a language through the templatecount tool.

That said, I think that the language-specific lists are probably overkill for what you want to do. The compromise solution would be to just use the English set as a "global" set. Many of them won't exist on other language editions but that's okay because in theory you just won't see editors using them there. You might miss some local-language variants of these templates but they can always be added later if someone alerts you to them. And it'd be highly unlikely that e.g., Spanish Wikipedia has a template called Sfn that actually does something totally different so I don't think this would create many false positives.
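As an illustration of that compromise, here is a minimal wikitext-side sketch. It uses mwparserfromhell for parsing, which is an assumption made for illustration (the articlequality/revscoring code has its own feature extraction), and treats the enwiki shortened-footnote names as a "global" set:

import mwparserfromhell

# enwiki shortened-footnote template names, used as a "global" set as discussed above
SFN_TEMPLATES = {"shortened footnote template", "sfn", "sfnp", "sfnm"}

def count_wikitext_references(wikitext):
    """Count <ref> tags plus shortened-footnote template uses in a revision's wikitext."""
    code = mwparserfromhell.parse(wikitext)
    ref_tags = [
        tag for tag in code.filter_tags()
        if str(tag.tag).strip().lower() == "ref"
    ]
    sfn_uses = [
        tpl for tpl in code.filter_templates()
        if str(tpl.name).strip().lower() in SFN_TEMPLATES
    ]
    return len(ref_tags) + len(sfn_uses)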

Just FYI, there is a first version of the new references API deployed in Toolforge; you can make a request:

https://reference-counter.toolforge.org/api/v1/references/wikipedia/en/829831409

The API is currently using the mwapi library to get the wikitext for a revision.

import mwapi

# Request the wikitext for the given revision via the MediaWiki API.
# lang, project, and revid come from the reference-counter API request.
session = mwapi.Session(f'https://{lang}.{project}.org')

result = session.get(
    action="parse",
    oldid=revid,
    prop='wikitext',
    format='json',
    formatversion=2
)

I'm currently working on a direct connection to the replica db to run a query to get the wikitext, instead of using mwapi.

The code is on my personal repo for now: https://github.com/gabina/wikimedia-references

Just FYI, there is a first version of the new references API deployed in Toolforge

Congrats!! Very exciting!

I'm currently working on a direct connection to the replica db to run a query to get the wikitext, instead of using mwapi.

This may be possible but FYI my understanding is that the replica DBs do not contain the wikitext because Wikipedia etc. use this extension and the tables are quite big: https://www.mediawiki.org/wiki/Manual:External_storage. I think you might have to stick with using the APIs but hopefully that's not a bottleneck in terms of latency.

Also, a thought that should have occurred to me earlier: the reason all these prior APIs have worked with wikitext as opposed to the parsed HTML of a page is because it's traditionally been a lot easier to access a page's wikitext than HTML (background). Wikitext is available via dumps while HTML historically was not. Also, wikitext used to be the only way to edit pages, so any tooling that wanted to build on an editor had to be able to handle wikitext. As a result, there are many more Python libraries etc. for wikitext.

The parsed HTML, however, is also available from the APIs and we've been slowly building up more tooling around it within Python. For many things, there isn't a major difference between wikitext and HTML, but it does standardize citations quite nicely and so would solve this issue with shortened footnotes etc. Specifically, the count of unique references as you'd find them listed at the bottom of an article can be found by just counting up how many <span> elements have the class reference-text (example code though you'll see this uses a slightly different classname and there's background about that in T328695).

The downside for this would be that it can be slightly slower to fetch and process the HTML as it's bulkier and is a transform over the wikitext, but I doubt that would be noticeable for this application. If you want other variants like the number of times a reference is cited, that should be possible to extract too with different HTML selectors. I think your current solution is just fine but FYI if you decide you want to spend more time improving the counting logic.
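A rough sketch of that HTML-based count (assumptions: the rest_v1 page HTML endpoint mentioned elsewhere in this thread, BeautifulSoup for parsing, and the "reference-text" class name, which as noted above may differ slightly; see T328695):

import requests
from bs4 import BeautifulSoup
from urllib.parse import quote

def count_reference_spans(lang, title, rev_id):
    """Count reference-text <span> elements in the Parsoid HTML of a revision."""
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/html/{quote(title, safe='')}/{rev_id}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Each unique reference listed at the bottom of the article is rendered
    # as a span with this class (per the discussion above).
    return len(soup.find_all("span", class_="reference-text"))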

Thank you! That was great context. Really appreciate it.

In case you want to check it out (and for the visibility of possible readers of this discussion), I put together this short article to organize my mind (I used Notion but please let me know if there is a better standardized tool for that). The article summarizes the information you gave me, and clarifies what the current implementation of the new API is.

It seems reasonable to me to leave it like this for now, and continue with the integration of this new API in the dashboard. If there is time left at the end of the internship, I can always revisit the API and refine it.

I put together this short article to organize my mind

This is great, thanks for summarizing it!

I used Notion but please let me know if there is a better standardized tool for that

I'd ask the folks at WikiEdu how they want to preserve this logic more long-term as norms vary by the people / project.

It seems reasonable to me to leave it like this for now, and continue with the integration of this new API in the dashboard. If there is time left at the end of the internship, I can always revisit the API and refine it.

Makes sense to me -- good luck and looking forward to seeing the functionality!

Hey, I have a quick question again.
Currently, the source code running in Toolforge is in my personal repo (https://github.com/gabina/wikimedia-references). I created that to test the app locally before deploying it to Toolforge.
I just noticed that Toolforge allows me to create a repository in https://gitlab.wikimedia.org/toolforge-repos.
This is the repo I created: https://gitlab.wikimedia.org/toolforge-repos/reference-counter
I would like to add files from the command line to that repo (instead of using my personal one), but I'm not able to make any git push because of the following error:

remote: GitLab: 
remote: A default branch (e.g. main) does not yet exist for toolforge-repos/reference-counter
remote: Ask a project Owner or Maintainer to create a default branch:
remote: 
remote:   https://gitlab.wikimedia.org/toolforge-repos/reference-counter/-/project_members
remote: 
To gitlab.wikimedia.org:toolforge-repos/reference-counter.git
 ! [remote rejected] main -> main (pre-receive hook declined)
error: failed to push some refs to 'gitlab.wikimedia.org:toolforge-repos/reference-counter.git'

It looks like I don't have permissions to create a default branch, maybe you can help me with that?
Please let me know if something doesn't make sense here. Thank you!

It looks like I don't have permissions to create a default branch, maybe you can help me with that?

Done -- you should have maintainer privileges now. Let me know if still not working and I can try some other things.

Dear team,
The #1Lib1Ref community is so excited about this!
As you can see in the column "Getting data to score" of this feedback board, one of the pain points in the past has been that the Dashboard did not register references for Spanish or French (and I imagine some other languages).
We (Gorana Gomirac, Sailesh Patnaik, and I) will be meeting with #1Lib1Ref organizers, and it would be awesome if we could share the good news with them about the advances! Are there any probable dates for when this update will be launched, @Gabinaluz?

Hi @SEgt-WMF! Glad to hear you're excited about this project!
As you may have noticed, this is part of my Outreachy internship, which ends on Mar. 01, 2024.
I'm currently in the first half of the internship and, although I have had to make some adjustments to the schedule, all indications are that we are on time and that the project should be ready to go into production in early March.

This is excellent news @Gabinaluz, your project will make many librarians and organizers happy! Wishing you all the best for this internship!
From March onwards, who would be best to contact to follow up on the implementation?
This is just to have the most accurate information to share with participants who would like to use the Dashboard for the campaign in May, but if there's also someone they could report potential bugs to that would be nice to know as well :)

This is excellent news @Gabinaluz, your project will make many librarians and organizers happy! Wishing you all the best for this internship!

Thank youu!

From March onwards, who would be best to contact to follow up on the implementation?
This is just to have the most accurate information to share with participants who would like to use the Dashboard for the campaign in May, but if there's also someone they could report potential bugs to that would be nice to know as well :)

@Ragesoss is the primary developer and maintainer of the Dashboard, so you will probably want to keep in touch with him.
To report bugs, you can also open issues on the Dashboard repo (see https://github.com/WikiEducationFoundation/WikiEduDashboard/issues).

Gracias again @Gabinaluz <3
And please do let me know if I can serve as a bridge for anything you might need from the Foundation. I usually don't have the answers but know who to ask :)

I finally got around to doing some analysis of wikitext vs. HTML. High-level: about 90% of sources/citations in HTML are correctly identified via ref tags in the wikitext. This varies by language. This is in existing articles, though, and we might see that a lot of them were, e.g., initially added by bots, but wikitext and HTML still match up pretty well for new edits (as would be relevant to the P&E Dashboard). That said, I think this is a good indication that long-term it makes sense to switch over to HTML as the source for this data.

Results: https://public-paws.wmcloud.org/User:Isaac%20(WMF)/HTML-dumps/references-wikitext-vs-html.ipynb#Stats

@Isaac awesome, thanks! I agree, this data makes it seem clear that HTML will be a significantly more accurate data source for ref counts... at least, for current revisions. I can imagine that accuracy would degrade in some cases when you look at old revisions if there are very many cases of citation-generating templates that were deleted or changed, since we can generally only get HTML based on old wikitext rendered with current templates. But my intuition is that it wouldn't make much difference and would still outperform the wikitext-only approach.

@Gabinaluz if you think it would be easy enough to implement, it would be great to adapt your ref count API to support HTML counts as well. I'm happy with the current performance, and I suspect it might be a lot slower with HTML, especially since the HTML of old revisions is probably not cached and so would rely on getting rendered by MediaWiki for each query. But maybe it won't be too slow.

I suspect it might be a lot slower with HTML, especially since the HTML of old revisions is probably not cached and so would rely on getting rendered by MediaWiki for each query. But maybe it won't be too slow.

@Ragesoss yeah, that's a fair point but hopefully not a blocker. Another point in favor: if you switch to HTML, with relatively little additional overhead you can also add in extraction of other elements. Most of the latency would come from requesting the HTML from the API and doing the initial parsing in Python, but extracting additional features would be very cheap after that. My library has implemented this already for audio, categories, externallinks, sections, images, infoboxes, lists, math elements, message boxes, navboxes, hatnotes, references (unique sources as opposed to inline citations), videos, wikilinks, and wikitables. There are likely some gaps depending on whether the feature is an explicit mediawiki feature (e.g., images, where extraction should be near perfect) or more norm-based and built on templates (e.g., infoboxes, where some language communities might not follow the norm). But at least the library codifies some expectations and works for any language of Wikipedia. If there are other features you're interested in too but don't see above, always happy to discuss and figure out how feasible adding support would be.

@Gabinaluz you can see basic code for getting the HTML from Parsoid in this notebook and some details on Parsoid's API on Mediawiki and in the endpoint documentation.

Cool. That's great context. I'll work on this and let you know my thoughts/progress.

I was playing around a bit with Parsoid and mwparserfromhtml, and I have a follow-up question.

Right now, when using the LiftWing API, we're counting the number of reference tags and shortened footnote templates added to articles, which can include multiple references to the same source. For example, for this article (revision 1201413530), the LiftWing API counts 27 ref tags + 22 footnote templates = 49 references.
You can do the following request:

curl -X POST -H "Content-Type: application/json" -d '{"rev_id": 1201413530, "extended_output": true}' https://api.wikimedia.org/service/lw/inference/v1/models/enwiki-articlequality:predict

For the new reference-counter Toolforge API we followed a similar approach, and it counts ref tags and (some) footnote templates. The result for that revision id is the same (see the num_ref: 49).
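For reference, the same reference-counter call from Python (only the num_ref field mentioned above is assumed here):

import requests

# Query the reference-counter Toolforge API for the example revision above
# (English Wikipedia, revision 1201413530).
resp = requests.get(
    "https://reference-counter.toolforge.org/api/v1/references/wikipedia/en/1201413530",
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["num_ref"])  # 49 for this revision, as noted above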

As Isaac mentioned, the existing mwparserfromhtml.Article.get_references() returns unique sources as opposed to inline citations. So using the number of references from that method would return 38 instead of 49 for Gustav Stresemann article.

What do you think about this? I guess it is possible to implement a new method that counts total references rather than unique sources, but I would have to do some research on how it works (or ask if it's possible for the mwparserfromhtml team to support that).

I ask because perhaps for the Dashboard counting unique sources is not a problem, although it represents a change in the meaning of the "references added" metric.

Oooh yes, excellent example to think through. I think there are two potential answers to the question of how many references are in an article, but they only loosely relate to ref tags vs. footnote templates. There are two things that I think are relevant to count regarding references in an article. Other people might use different terminology (English Wikipedia notes that people often use these terms interchangeably, unfortunately), but this is how I'll distinguish them:

  • Sources: how many unique references are in the article. In the screenshot below for the article in question, this is 38. This is what I call reference in the library which is perhaps confusing (sorry -- I'll have to think about whether to change this).
  • Citations: how many times a reference is used in an article. I think this is the current equivalent of references added in the dashboard. In the screenshot below you can count this by seeing how many footnotes link to each source. This is 1 if it's just a ^ label by the source but otherwise count up the individual letter labels. In this case, there are 48 total citations (1 for most sources and then 7 for source 11, 2 for source 18, and 4 for source 37). This is actually different than the 49 number returned by the wikitext-based APIs. I think what's happening is that there's a {{sfnref}} template also used in the article that's being counted in the footnote count but in reality is just providing some more information about a reference and not a distinct reference itself (again showing the challenges of working with wikitext).

For mwparserfromhtml, you would do len(get_references()) to get the 38 number and len(get_citations()) to get the 48 number. So both are equally easy to extract and no need to do any further implementation. I suspect you want to continue with your current definition (so use len(get_citations())) but Sage or campaigns folks could probably say whether there's any interest in also counting how many new unique sources folks are adding to articles.
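A minimal sketch of the two counts for the example revision, using the methods described above (the Article constructor and method names follow the mwparserfromhtml documentation discussed in this thread; newer versions of the library may expose them differently):

import requests
from mwparserfromhtml import Article

# Fetch the Parsoid HTML for the Gustav Stresemann example revision.
html = requests.get(
    "https://en.wikipedia.org/api/rest_v1/page/html/Gustav_Stresemann/1201413530",
    timeout=30,
).text

article = Article(html)
num_sources = len(article.get_references())   # unique sources: 38 in the example
num_citations = len(article.get_citations())  # inline citations: 48 in the example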

Screenshot 2024-02-15 at 2.38.40 PM.png (548 KB)

As an addendum, this article example is actually way more complicated than what I suggested above. The editors chose to cite specific pages from the same source, so certain books show up multiple times in the references list just with different page numbers. In reality then, the number of unique sources is actually far lower than 38. But there's really no good way to also account for that so I think it's fair to ignore it.

Excellent. Your description lines up with the terminology discussion we had today for "sources" vs "citations", and the ambiguity of "references". It's nice that your parser already can provide the citation count, which is what we'd use.

Thank you for the explanation. It looks like I was looking at an incorrect version of the mwparserfromhtml/parse/article.py file and didn't notice that get_citations() was already implemented. That's great news. Will use that then!

I have another follow-up question.

From the mwparserfromhtml documentation:
users can also use the Wikipedia API to obtain the HTML of a particular article from their title and parse the HTML string with this library.

So it looks like we're able to get the HTML for any Wikipedia language article, but not for any project.

Right now, the reference-counter Toolforge API is able to retrieve the number of references (actually, citations) for any language/project. For example, for an English wiktionary article:
https://reference-counter.toolforge.org/api/v1/references/wiktionary/en/76995678

I understand that getting citations from HTML will be possible only for Wikipedia articles. Maybe Isaac has more context on this.

Supposing that's the case, there are two options that come to mind:

  1. We make the reference-counter Toolforge API work only for Wikipedia articles (but for every language).
  2. We make the reference-counter Toolforge API use a hybrid approach. If the requested revision belongs to Wikipedia, then we count citations from HTML. If the requested revision belongs to any other project (such as Wiktionary), then we count citations from wikitext as we're currently doing. The endpoint would keep making only one API request (to retrieve either HTML or wikitext).

The logic for 2 would be fairly simple to code, but perhaps less clear to explain to Dashboard users.

It should work for any project, actually, assuming they follow the same approach to handling citations as Wikipedia does, but I haven't tested much beyond Wikipedia. You'd just switch the project in the REST API URL -- e.g., https://en.wiktionary.org/api/rest_v1/page/html/heart/76995678 for the article you used above: https://en.wiktionary.org/wiki/?oldid=76995678

For context: something more Wikipedia-specific like infoboxes might be harder to extract effectively across projects but references/citations are handled by the Cite extension on Wikipedia (which is what determines how they show up in the HTML) and this does seem to be the case for most projects thankfully: https://extloc.toolforge.org/extensions/Cite

Oh that's great. Thank you. I was doing something wrong when trying to retrieve the wiktionary article, that's why it didn't work for me.

Going for this approach should be pretty straightforward. I think the main difference is that the html endpoint takes both the revision id and the title parameters, so I'm working on building the correct title for a given revision id to be able to make the requests.

Will keep this task updated with the progress.

Sounds good -- one thing that came up when I was chatting a bit with our Parsoid folks: what's the strategy for collecting the ref counts? Would it be a batch job with a lot of concurrent API calls for the HTML (latency really could start to become a factor because old revisions are unlikely to be cached) or something a bit more spread out / kinder to the APIs?

The process of importing ref counts runs through automatic (Sidekiq) tasks that calculate the "unscored" revisions for a course and retrieve the ref counts for all of them. I think that, depending on the size of the course and the kind of update (it could be a full or a daily update), the batch of requests would be considerably large. Does that answer your question?

Does that answer your question?

Getting us closer, I think -- it is a batch job, so there's the possibility of a large number of requests all at once. Do you know what a maximum load might look like (doesn't have to be super specific, just a general sense to make sure it doesn't cause issues on the Wikimedia end)? For instance, is it async but at most 20 concurrent requests or a sequential job that's only processing one revision at a time? There isn't necessarily a wrong answer, though the REST API documentation says max 200 reqs/second: https://en.wikipedia.org/api/rest_v1/. Older revisions could take some time to process as nothing would be cached on the Parsoid side.

Also, as I was asking about this with our parsing folks, I found out that there's a newer endpoint that they're slowly migrating to that also solves the "need page title to get revision" problem! So you might want to use this form of URL to get the revision HTML: https://en.wikipedia.org/w/rest.php/v1/revision/1201413530/html (documentation)
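For example, a small fetch helper using that newer endpoint (a sketch; no page title required):

import requests

def fetch_revision_html(lang, rev_id):
    """Fetch the Parsoid HTML of a revision by id alone, via the newer REST endpoint."""
    url = f"https://{lang}.wikipedia.org/w/rest.php/v1/revision/{rev_id}/html"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text

# e.g. the example revision discussed earlier in this thread
html = fetch_revision_html("en", 1201413530)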

Hmm @Ragesoss, could you help me answer these questions, please? These are my thoughts, but please let me know if I'm not on the right path:

is it async but at most 20 concurrent requests or a sequential job that's only processing one revision at a time?

I don't know how Sidekiq job consumption works at that level. I guess that we have several workers that consume enqueued tasks from the queue, and they probably work in parallel. But that's a guess. Is there any documentation about that? Or maybe you could give me some details?

Also, to determine what a maximum load might look like, I think I should run some queries on the production database to calculate averages of how many revisions the current courses have, to estimate how many API requests would be made. The courses also have a flag that specifies how long the last update took, and I may be able to deduce something from that.

Also, as I was asking about this with our parsing folks, I found out that there's a newer endpoint that they're slowly migrating to that also solves the "need page title to get revision" problem! So you might want to use this form of URL to get the revision HTML: https://en.wikipedia.org/w/rest.php/v1/revision/1201413530/html (documentation)

Oh that's great! Using that endpoint would mean not changing anything on the Ruby side of things, I think (only extending the Toolforge API).

Hey @Isaac, after talking today with Sage, he told me that we currently have 4 workers consuming tasks for the Programs & Events Dashboard and 3 for the Wiki Education Dashboard. So I guess that the max number of concurrent requests would be 7 (if all the workers are working at the same time). This shouldn't be too demanding for the API, we think. Let us know if you have any other questions or concerns.

we currently have 4 workers consuming tasks for the Programs & Events Dashboard and 3 for the Wiki Education Dashboard. So I guess that the max number of concurrent requests would be 7 (if all the workers are working at the same time).

Yeah, that's quite reasonable! Thanks for looping back about it.

Hi team! I was recently participating in Wikimedia Mexico's #JuevesWiki as a volunteer and was made aware that the References Count seems to not be working yet for Spanish.
I added some references to test this article and couldn't see them reflected in the Dashboard. Are there any news on when we might see a change?

Those edits were just made a few minutes ago, and that event page has not updated again since they were made. The system currently estimates the next update in 9 hours, and after that update it should show up.