Fri, Jul 21
@Fjalapeno sure, np.
@Fjalapeno Yes, we can.
After chatting with @Fjalapeno I agree that the return trip (JSON to HTML) is not necessary since most clients would try to display the reflist content in (native or web) components. There is the option of structuring the reference content further but that is not needed at this time.
So, a separate library for reflist handling is not needed. It can be done directly inside of MCS/PCS.
We don't want to rely on the backlink counter alone, though, since that would increase the burden on clients to piece them back together, and the data savings are not significant enough to warrant this complexity.
What reconstruction? What are clients inserting into the DOM?
Thu, Jul 20
In my past life as an Android app dev I used EL quite regularly. Now in Reading Infrastructure I haven't had the need yet but it could be coming. Either way is fine with me for now. If access gets removed and I need it later, I'll ask for it.
Fixing this is a good step in the right direction. FWIW, in this case I think keeping the parentheses would be preferable since they help clarify the following noun. I know that this is hard to determine programmatically. Having said that, editors could also just remove the parentheses from "in der (Rechts-)Nachfolge" to make it "in der Rechtsnachfolge" if it is really important enough.
Before gzip it's around 100 bytes per backlink. Assuming that most references only have one backlink, on the Barack_Obama page with almost 500 references it would be roughly 50KB before gzip theoretically. With gzip it's actually only 3KB.
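The estimate above is simple arithmetic, and the gap between the raw and gzipped numbers comes from how repetitive backlink markup is. Here is a rough sketch of both, using Node's built-in zlib. The backlink markup string is made up for illustration and is not the actual MCS/PCS output, so the sizes it produces will not match the figures quoted above.

```javascript
const zlib = require('zlib');

// Figures from the comment above: ~100 bytes per backlink, ~500 references,
// assuming one backlink per reference.
const BYTES_PER_BACKLINK = 100;
const REFERENCE_COUNT = 500;
const rawEstimate = BYTES_PER_BACKLINK * REFERENCE_COUNT; // 50000 bytes, roughly 50KB

// Hypothetical backlink markup, for illustration only.
function fakeBacklink(i) {
  return `<a href="./Barack_Obama#cite_ref-${i}" class="backlink">^</a>`;
}

// Build a synthetic block of backlinks and compress it at gzip level 6.
const raw = Buffer.from(
  Array.from({ length: REFERENCE_COUNT }, (_, i) => fakeBacklink(i)).join('')
);
const gzipped = zlib.gzipSync(raw, { level: 6 });

console.log(`estimate: ${rawEstimate} bytes`);
console.log(`synthetic raw: ${raw.length} bytes, gzipped: ${gzipped.length} bytes`);
```

Because every backlink differs only in its counter, gzip can encode most of each one as a back-reference, which is why the compressed cost is so much lower than the raw estimate suggests.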
The reconstruction would be part of the library we need for the roundtrip HTML -> JSON -> HTML. Either way the client would get HTML text blocks which it would insert into the DOM using that library.
Wed, Jul 19
Ok, that makes sense to me. You're right, the duplication is probably not a big deal for the reference lists in infoboxes or other earlier cases in the article.
How do we distinguish in code between the two cases? Only by where the section is relative to the end of the article? Or is there anything else we could go by?
@Fjalapeno Ah, I see. Yes, that makes sense.
Isn't lazy image loading already in the Page Library?
Here's a strawman proposal for the JSON structure built from the following HTML example:
Mon, Jul 17
Sounds to me like we should have a page-reference-list library which can go from HTML to JSON and back.
Fixed in https://gerrit.wikimedia.org/r/#/c/365364/ (already merged). I was not aware of this ticket when I created the patch. lol
FYI, not that it really makes a difference in terms of API, I bumped it to 0.6.3.
What about JSON responses?
Yes, sounds like a plan. I was just a bit surprised to see that the savings don't get close to adding up. My explanation for this is that the transforms for stripping the unneeded markup heavily strip the reference list content. There's still one included which I'll have to take out: stripping of ref back links. For the Android app we remove them but for the web case I think we would want to preserve them.
Fri, Jul 14
Here's a summary of the payload sizes I measured. The spreadsheet compares the gzipped (-6) payloads: plain Parsoid against various MCS/PCS read-html variants: no stripping, just stripping of unneeded markup, just stripping of reference lists, stripping both.
@Arlolra should we reopen this or create a new subtask?
Here's a spreadsheet with the results against the top monthly titles in enwiki and zhwiki.
All tests pass in mobileapps.
Thanks. My Frankenstein page gets the working audio link now.
Wed, Jul 12
The bulk of the space-saving transformations comes from https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/lib/transforms.js#L193-L268.
@Mhurd The issue is that isAnchorNotForYear() filters out the anchor ./2008_Karmah_Bombing because it starts with the same number as the year of the event.
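To illustrate the failure mode, here is a minimal sketch. The real isAnchorNotForYear() implementation is not shown in this thread, so both functions below are hypothetical: the first assumes the check is a prefix match on the year, and the second shows one possible stricter check that only treats an anchor as a year link when the whole title is the year.

```javascript
// Hypothetical reconstruction of the failure mode; not the actual MCS code.
// Assumed behavior: any anchor whose title starts with the event year is
// treated as a link to the year page.
function isYearAnchorByPrefix(href, year) {
  const title = href.replace(/^\.\//, '');
  return title.startsWith(String(year));
}

// One possible stricter check: only an exact match counts as a year page.
function isYearAnchorExact(href, year) {
  const title = href.replace(/^\.\//, '');
  return title === String(year);
}

const href = './2008_Karmah_Bombing';
console.log(isYearAnchorByPrefix(href, 2008));   // true: the article link is mistaken for a year link
console.log(isYearAnchorExact(href, 2008));      // false: the article link is kept
console.log(isYearAnchorExact('./2008', 2008));  // true: the plain year link is still recognized
```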
Tue, Jul 11
I believe the issue is really that it uses normalized titles when it should be using db titles (getPrefixedDBKey) instead. (Underscores instead of spaces)
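For illustration, the difference between the two title forms comes down to spaces versus underscores. The helper name below is made up for this sketch. It does not reimplement getPrefixedDBKey and ignores namespace handling and capitalization rules.

```javascript
// Hypothetical helper: convert a normalized (display) title to its
// DB-key form by replacing spaces with underscores.
function toDbKey(normalizedTitle) {
  return normalizedTitle.trim().replace(/ /g, '_');
}

console.log(toDbKey('Barack Obama'));        // Barack_Obama
console.log(toDbKey('Grenfell Tower fire')); // Grenfell_Tower_fire
```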
@Arlolra Oh, ok. I didn't know you had a different policy in Parsoid land. How do you get SAL to mark this as deployed? Might be good for me to do as well.
Probably best to keep this open until this is deployed.
Thanks for the explanation of the wikidata case. That makes sense now.
@Arlolra I see you have closed this ticket. Has this been deployed yet? https://en.wikipedia.org/api/rest_v1/page/html/User%3ABSitzmann_(WMF)%2FMCS%2FTest%2FFrankenstein still shows the Sound link pointing to the png instead of the ogg file.
Mon, Jul 10
The advantage of keeping this separate is that clients can choose which endpoint they want to use (summary vs read-html vs mobile-sections ...), or build a talk page or other namespaced URL.
- The title property in most other places is not URL-encoded. I have not seen anything in T164291 to indicate that this should change. Example: https://commons.wikimedia.org/api/rest_v1/page/summary/File%3ACollage_of_Nine_Dogs.jpg should be File:Collage_of_Nine_Dogs.jpg but not File%3ACollage_of_Nine_Dogs.jpg. I think it probably doesn't hurt (need to check with the apps to be sure) but wanted to point out that this is a new thing.
- plaintext_intro is a new property that the apps don't use yet and an alternative to the HTML version of intro the web is going to use. I'm not sure if this is needed if we don't have any actual users for this property.
- Isn't wikidata_label usually the same as normalized_title? What is it going to be used for? When would it be different from normalized_title? If this is only set for Wikidata then that's fine. I just haven't seen anything that says that explicitly.
- The Image type properties (thumbnail and original) are likely going to change in the future to use the new Thumbnail API once that is available (see T66214).
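On the first point, the two title forms differ only by percent-encoding, so a client that receives the encoded form can recover the plain one with standard JavaScript functions. A small sketch using the title from the example above:

```javascript
// The URL-encoded form as it appears in the request path.
const encoded = 'File%3ACollage_of_Nine_Dogs.jpg';

// Decoding yields the title form used by the title property elsewhere.
const plain = decodeURIComponent(encoded);
console.log(plain); // File:Collage_of_Nine_Dogs.jpg

// Encoding the plain title again gives back the path form.
console.log(encodeURIComponent(plain) === encoded); // true
```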
Thu, Jul 6
Also, audit added langs for missing 'date' pages. (Note: I have a script for this somewhere...)
Sat, Jul 1
Fri, Jun 30
@ArielGlenn thanks for the links and the patch! I agree this might have to do with it. Not sure why Parsoid doesn't provide the link to the actual media file anymore. Maybe this is related to T169293, too?
So, the difference is the --useBatchAPI option. When/where/why is it used?
Thu, Jun 29
Wed, Jun 28
@mobrovac The reason is that it seems that PageView API still sees some updates after the first time the aggregated feed entry for a day is stored. I assume that's shortly after 0:00 UTC. If I run the same thing in MCS a day later I shouldn't see significant differences (changed ranks) and duplicated entries from the last stored version.
Tue, Jun 27
In this template https://ja.wikipedia.org/wiki/Template:%E8%AA%AD%E3%81%BF%E4%BB%AE%E5%90%8D somewhere? Hmm, maybe somewhere else. I have a hard time finding any of the substrings there. The PHP-parsed version doesn't seem to have an issue with that, though.
Create a "null" skin that only returns the content area
Something like useskin=apioutput? Example: https://en.wikipedia.org/wiki/Special:Version?useskin=apioutput
Seems like an artifact coming from Parsoid: https://ja.wikipedia.org/api/rest_v1/page/html/%E6%B1%9F%E6%88%B8%E5%B7%9D%E3%82%B3%E3%83%8A%E3%83%B3 shows the same:
Yes, https://en.wikipedia.org/api/rest_v1/feed/featured/2017/06/15 shows two different results for the same topic: "Grenfell_Tower_fire". (There's also a related "Grenfell_Tower" result there but that's not the issue here.)
Mon, Jun 26
Here are some preliminary results (before stripping of references is implemented) with a small sample of test pages (some were taken from the most-read results from 6/22/2017):
If I don't strip any HTML and just add some of the markers and other changes needed, MCS read-html adds around 1.9% to the payload.
If stripping of unneeded tags is included, MCS read-html reduces the payload by 23% to 47% (avg. around 37%).
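These percentages are presumably relative changes against the plain Parsoid payload. A sketch of that calculation follows. The byte counts are placeholder inputs chosen for the example, not measurements from these tests.

```javascript
// Relative change of a variant payload against a baseline, in percent.
// Positive means the variant is larger, negative means it is smaller.
function percentChange(baselineBytes, variantBytes) {
  return ((variantBytes - baselineBytes) / baselineBytes) * 100;
}

// Placeholder sizes, not real measurements.
const baseline = 100000;
console.log(percentChange(baseline, 101900).toFixed(1)); // 1.9
console.log(percentChange(baseline, 63000).toFixed(1));  // -37.0
```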