Page MenuHomePhabricator

Duplicate usage examples in Wiktionary page definition endpoint
Closed, ResolvedPublic

Description

For example:

https://en.wiktionary.org/api/rest_v1/page/definition/fazer
https://en.wiktionary.org/api/rest_v1/page/definition/kuulua

The usage examples are not parsed correctly and have duplicated content. This was recently raised in a Wiktionary Grease pit discussion.

I had a quick look at the parsing code and could submit a change request which would also incorporate the microformat additions (T138709). This should make the parsing much more reliable.

Is the definition endpoint still used by the Android app?

Event Timeline

Mholloway subscribed.

Is the definition endpoint still used by the Android app?

Yes, it is.

I had a quick look at the parsing code and could submit a change request which would also incorporate the microformat additions (T138709). This should make the parsing much more reliable.

That sounds great. Patches very welcome!

@jberkel This is slightly off topic but since you seem to be interested in the Wiktionary definitions endpoint I was wondering whether you had any thoughts on T133051#3976643.

@bearND I think it would be preferable to extract the definition from the glossary and not the linked Wiktionary definition page (which might contain some other unrelated content). But I also understand that you don't want to add extra parsing code, so it might be a good first compromise. Maybe there's something simple we could do on Wiktionary to make things easier?

I have a first patch ready for review. Can I submit PRs via Github or do I have to go through gerrit 😱?

@jberkel While Gerrit patches are preferred I'd be willing to look at your PR in Github as well. Here is a link if you want to try the Gerrit route: [1].
If you go the GH route I would most likely have to make some modifications to the commit message format (add Change-Id and Bug: line with a mention of this task) unless you already add these yourself.[2]

[1] https://www.mediawiki.org/wiki/Gerrit/Getting_started
[2] https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines

@bearND I can handle Gerrit, just remembered that I made some contributions to the Wikipedia iOS app using GH a while back, it's just so much easier.

Change 419824 had a related patch set uploaded (by Jberkel; owner: Jberkel):
[mediawiki/services/mobileapps@master] Fix duplicate usage examples

https://gerrit.wikimedia.org/r/419824

Change 419824 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Fix duplicate usage examples

https://gerrit.wikimedia.org/r/419824

Mentioned in SAL (#wikimedia-operations) [2018-04-05T17:08:59Z] <bsitzmann@tin> Started deploy [mobileapps/deploy@eed7961]: Update mobileapps to dbc0687 (T187430)

Mentioned in SAL (#wikimedia-operations) [2018-04-05T17:18:44Z] <bsitzmann@tin> Finished deploy [mobileapps/deploy@eed7961]: Update mobileapps to dbc0687 (T187430) (duration: 09m 45s)

bearND claimed this task.

@bearND Looks like the change hasn't been deployed yet? Still getting duplicate sentences

@jberkel Looks like there is something wrong in the RESTBase layer for definitions. See the task I just mentioned this one from.

Basically, I purged the word fazer (since a deploy doesn't automatically change a results). But I still see no parsedExamples on the one returned by RESTBase. Only when I run this locally on the server I see parsedExamples.
curl -s "https://en.wiktionary.org/api/rest_v1/page/definition/fazer" | grep parsedExamples

@bearND ok, thanks. if the new code is deployed, where would the old responses come from? or is the header change required for the cache invalidation? is the cache invalidated on source page change (e.g. /wiki/fazer) ? what about changes in dependent templates?

@jberkel The old responses are coming from RESTBase. It stores them in its database. The results should get updated when either the corresponding page is purged or edited (or a transcluded page is edited) via ChangeProp.

reopened this task since the API still returns duplicates

@jberkel Are you planning on making more changes to the definition endpoint soon?

yes, maybe T131092, that shouldn't take too long. it would also be good to get a sense of what else is needed from a client's perspective.

Ok, once those changes are done we could ask the Services team to run a dump on enwiktionary to refresh every stored result.

Is there no way to just invalidate the entire cache? seems quite complicated to make changes to the API.

Also, if I make a normal change to a Wiktionary page, shouldn't that invalidate the cache and return the new schema? What are the typical delays for the change propagation?

Yes, a change to a page should invalidate the cache and cause a re-render. This is how it works for Wikipedia and most other Wikimedia projects. Looks like that ChangeProp rule doesn't exist for Wiktionary yet (see T192157).
I think we could force a rerender on all English Wiktionary pages once we think the endpoint has stabilized. There are over 5.5 million English Wiktionary entries, so that would take a couple of days. That's why we don't want to do this automatically after any deploy and that's why I asked if you had more changes planned soon.

I think I still don't understand some parts of the service architecture. Why do all 5+ million entries have to be cached in one go? If the cache gets purged, won't it automatically/lazily repopulate the most requested ones? Some entries will probably never/rarely get requested, why should they get pre-cached?

From my understanding this is done eagerly to reduce latency, so RESTBase doesn't have to check for every read request if the stored results were rendered with the required version of the code. That mode where RESTBase check for every read access is possible, too, but the eager solution seems to be what the Services team prefers.

Looks like the storage part of RESTBase isn't conceptually a cache which can be "thrown away" (that was my assumption). Maybe this has to do with a requirement to serve older revisions for other endpoints (not applicable for definitions)?

Anyway, feels weird to have to batch up changes, and a bit frustrating not to get immediate feedback on code changes. It also introduces inconsistencies (newly created entries will generate different responses).

(As an aside, I found T144431 where conflicting design decisions [caching vs storage] are discussed)

Fix was deployed a long time ago; closing this now.