Oct 8 2020
Sep 26 2020
@ArielGlenn Oh yes... sounds like a good candidate. Thx for linking both tickets!
Sep 23 2020
@ArielGlenn Seems you are right. Indeed, it gets stripped from the link; see https://el.wikibooks.org/api/rest_v1/page/html/inux_%CE%B3%CE%B9%CE%B1_%CE%B1%CF%81%CF%87%CE%AC%CF%81%CE%B9%CE%BF%CF%85%CF%82%2F%CE%93%CE%B9%CE%B1%CF%84%CE%AF_Linux%3B
@ArielGlenn I have put the online link as a reference. It is not deleted, and if there is a typo, where is it (I cannot find it)? You can look at the problem differently: how do we get the Parsoid output via the REST API for this very specific article?
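For reference, a minimal sketch of how the Parsoid HTML for a specific article can be requested through the standard Wikimedia REST API layout, with the title fully percent-encoded (the helper name and the sample titles are illustrative, not part of any existing tool):

```python
from urllib.parse import quote

def parsoid_html_url(domain, title):
    """Build the REST API URL serving the Parsoid HTML of a page.

    Slashes inside the title (subpages) must be percent-encoded too,
    otherwise they are read as extra path segments; safe='' makes
    quote() encode '/' as %2F.
    """
    return f"https://{domain}/api/rest_v1/page/html/{quote(title, safe='')}"

print(parsoid_html_url("el.wikibooks.org", "Linux για αρχάριους/Γιατί Linux;"))
```

The point of `safe=''` is exactly the subpage case above: with the default `safe='/'`, the `/` in the title would survive unencoded and the REST API would resolve the wrong resource.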
Sep 22 2020
Jul 27 2020
@Andrew mwoffliner1 & mwoffliner3 have been re-created. Hope this solves your problem :)
Jul 21 2020
@ema Not really; this is a case which had to be handled in MWoffliner. That is all.
Jul 16 2020
@Andrew Then that is fine by me. Would deleting the instances and recreating them be enough to solve our problem? Or should we follow another procedure?
Jul 15 2020
@Andrew Hi Andrew. Which VMs are we talking about exactly? mwoffliner1, mwoffliner2 and mwoffliner3? We can invest the time to recreate them, but I would like to confirm with you that we won't end up with weaker hardware. It is really critical for us that they get very similar hardware (like mwoffliner5).
Jul 7 2020
Jun 24 2020
Jun 21 2020
Jun 17 2020
@CDanis We get many HTTP 429 errors from the REST(Base) API if we scrape with nodes outside the VPS cluster. It is really a hassle to deal with. It seems to me we are impacted... but maybe I have got something wrong.
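To illustrate what "dealing with it" on the client side looks like, here is a minimal retry-with-exponential-backoff sketch; the `fetch` callable, function name and delay values are illustrative assumptions, not the actual MWoffliner code:

```python
import time

def get_with_backoff(fetch, url, max_retries=5, base_delay=1.0):
    """Call fetch(url); on an HTTP 429 status, wait and retry.

    fetch must return an object with a .status attribute. The delay
    doubles after every 429 (1s, 2s, 4s, ...), for at most
    max_retries attempts.
    """
    delay = base_delay
    for attempt in range(max_retries):
        response = fetch(url)
        if response.status != 429:
            return response
        time.sleep(delay)
        delay *= 2
    raise RuntimeError(f"still rate-limited after {max_retries} tries: {url}")
```

A production client would additionally honour the `Retry-After` header when the server sends one, instead of guessing the delay.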
Jun 16 2020
Jun 15 2020
FYI: Because there seems to be a knowledge/communication gap about the openZIM/Kiwix dumping solution, a tech talk is currently being planned (probably in August): https://phabricator.wikimedia.org/T255392. If you have questions/concerns/remarks, please comment on that ticket. I will make sure the presentation addresses them.
Jun 8 2020
I can only emphasise that a ticket which does not transparently explain the problem it tries to solve will succeed only by chance. Therefore this is probably my last comment here, as we are running this discussion blind. One thing I heard is that the dumps might have to include the Parsoid semantic tags, which is not the case for the dumps produced by MWoffliner (MWoffliner removes them). If that is the issue, a POC to avoid removing them could be done within a few hours; we could even just store the raw HTML taken from the API JSON.
Jun 5 2020
I believe I don't understand why additional HTML dumps are necessary; as @ArielGlenn has written, we already do all of this on a monthly basis:
Jun 4 2020
@aborrero Thank you very much. Everything works like a charm now!
@Andrew I have been able to recreate mwoffliner2 properly. I believe 4 vCPUs and 8 GB of RAM are missing from the quota.
@Andrew Thank you very much for this! I have been able to delete mwoffliner1 and recreate it successfully with an xlarge-xtradisk profile. The VM is up and running. I wanted to recreate mwoffliner3 the same way: I deleted it but failed to create a new xlarge-xtradisk instance. It seems the quota is not right (too low). Am I wrong somewhere?
May 28 2020
@Aklapper Thx for pointing me to this, I have updated the task with the expected information.
May 19 2020
May 9 2020
May 4 2020
@abi_ Thank you!
Apr 30 2020
Apr 28 2020
@Cparle Thank you for the feedback.
Apr 25 2020
@ArielGlenn Thank you for your interest and your patience. Here it is https://phabricator.wikimedia.org/tag/affects-kiwix-and-openzim/
@Ramsey-WMF Hi, what is the status on this? I tend to believe the work has made progress... but is this tracked in another ticket!?
Feb 6 2020
Jan 28 2020
Jan 13 2020
@Andrew Thank you for your offer. In 2019 we invested in improving the performance of our MWoffliner scraper. This effort is not over; there are 2-3 additional important things I want to get done before asking you guys to do more work ;) For the moment the overall VM setup we have (the mwoffliner project) is pretty much OK, and I'm very thankful you provide it. But of course improving software performance changes the hardware footprint, and we will probably have to rebuild/change our VM landscape later to be more efficient.
Jan 7 2020
Jan 6 2020
@BD2412 Thank you for answering my request. I understand that we have to stick to OpenStack concepts. That said, I have to emphasise that this approach IMO leads to a waste of hardware resources. In addition, it also sounds a bit "outdated" to me compared to other cloud solutions, which offer finer-grained hardware configurations.
Nov 13 2019
@Aklapper Looks good to me. Thx.
I have forgotten to write that, after discussing this topic with @Aklapper, it looks like a solution might be:
- A dedicated tag "Kiwix & openZIM"
- A dedicated workboard where we would be able to put the tickets we would like to see fixed/implemented in priority.
@Dzahn Thank you. I have been able to connect to the admin UI.
Sep 11 2019
@Aklapper Happy you are tackling that issue. I had wanted to open that ticket myself for quite a long time already. You can put me on the list of admins.
Jul 31 2019
I don't think the last fix deployment has fixed anything regarding this ticket. The Parsoid output is still broken as far as I can see.
Jul 18 2019
@bearND Thank you for the explanation
Jul 4 2019
@fgiunchedi You are perfectly right. This seems to be an invalid bug. In fact I made the error because "File:Poaceae Habitus 2010-10-03 FuentedeSanLorenzoSierraMadrona.jpg" has been redirected from "File:Isoetes setaceum Habitus 2010-10-03 FuentedeSanLorenzoSierraMadrona.jpg". This is probably a consequence of T226931
Jun 30 2019
@ssastry Thank you!
@Aklapper The whole offline/Kiwix stuff is handled by the New Readers team. That said, I am totally unsure whether it is right to put it here. I would really like to have a tag or project to gather all tickets impacting openZIM/Kiwix. Would that somehow be possible?
T217540 is one good example of broken/outdated HTML (here with invalid images) delivered by Parsoid. This is pretty annoying for offline versions of our wikis.
@ssastry I don't really understand why this ticket is closed. Is there no agreement that this is a bug?
Jun 17 2019
Jun 15 2019
@mobrovac Thx, so I see that this concerns 307 sites; since around 20 of them are meta sites that we don't want to scrape anyway, that leaves around 280 wikis we can hardly scrape.
BTW, it.wikibooks.org also seems to be impacted. Do we have a list somewhere of all the wikis where the mobile API endpoints are not "properly" available?
Jun 9 2019
Jun 7 2019
This problem prevents us from providing a new offline version of Wikispecies.
Jun 5 2019
@Ariel Sorry, this was a problem on our end. It is fixed now.