Mar 7 2021
Mar 2 2021
@tanny411 Thanks for the help with Dataframes, they are way faster when used like that.
Today was the second day of fighting with the rollout of the production version, but I believe I have finally fixed everything for the current version, so the main functionality is working. You can test it at https://abstract-wiki-ds.toolforge.org/
I'm planning to fix the CSS tomorrow, and maybe add a language filter, which is left out for now because it takes too long to check all the entries.
Feb 16 2021
Feb 15 2021
Feb 9 2021
Feb 3 2021
Feb 1 2021
The update is live! See T272523 for the new connection scheme.
Jan 30 2021
@Quiddity thanks for the info, it was really interesting to find out how this problem was handled. It looks like running into errors of this type should not be unusual at all, considering the previous way of storing interwiki links.
Jan 29 2021
@Quiddity thank you for this interesting observation! Do you know whether the initial linking to Wikidata pages was done by users or by bots? I'm curious because querying through the API correctly shows that w:tr:Modül:Konum haritası/veri/Polonya is Scribunto-type and belongs to namespace 828.
Jan 27 2021
@tanny411 you did a great job creating this report!
I've run this script from my local PC using ssh tunnels. I tried different variations:
- Connecting through ssh to the meta database and connecting to the enwiki database;
- Using pandas to fetch the result and using the basic pymysql cursor.fetchall();
- Using LIMIT 500 and LIMIT 2 OFFSET 100.
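A minimal sketch of how such variations could be timed side by side. It is not the script discussed above: an in-memory sqlite3 database stands in for the tunnelled pymysql connection (both follow the Python DB-API, so `time_query` should work with either), and the `page` table, its columns, and the query strings are hypothetical stand-ins.

```python
import sqlite3
import time


def time_query(conn, sql):
    """Run sql on a DB-API connection and return (row_count, seconds)."""
    start = time.perf_counter()
    cur = conn.cursor()
    cur.execute(sql)
    rows = cur.fetchall()  # same call as the basic pymysql cursor variant
    cur.close()
    return len(rows), time.perf_counter() - start


# Stand-in data: assumption for illustration, not the real wiki replica.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
conn.executemany(
    "INSERT INTO page (page_title) VALUES (?)",
    [(f"Module:Example{i}",) for i in range(1000)],
)

# The two LIMIT variants mentioned in the list above.
variants = [
    "SELECT page_id, page_title FROM page ORDER BY page_id LIMIT 500",
    "SELECT page_id, page_title FROM page ORDER BY page_id LIMIT 2 OFFSET 100",
]

results = {sql: time_query(conn, sql) for sql in variants}
for sql, (count, seconds) in results.items():
    print(f"{count:4d} rows in {seconds:.6f}s  {sql}")
```

Against the real replicas, the connection would come from pymysql through the ssh tunnel instead of sqlite3, and the timings would then include network latency.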
Jan 26 2021
Currently the idea is to use a metric based on the Levenshtein distance as the distance between texts (examples in this notebook or this notebook for bigger cases). The current idea for the closeness detection algorithm looks like this:
@tanny411 So, yes, my logic is something like that: they are all different, and it looks like all of them have "?" in the title. Can we drop them, or is there something I'm missing?
@tanny411 I've been looking through your notebook and there are things I've seen previously too. Modules like Module:inc-ash/dial/data/?? and Module:zh/data/ltc-pron/? all refer to different translation and/or pronunciation information for the word. They would have different source code, which is logical, and I'm not sure we want to analyze them at all. At the same time, in my tests they are usually detected by Levenshtein distance quite easily, so it might not be worth the work to drop them.
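For reference, a minimal sketch of a Levenshtein-based distance between two texts. This is the textbook dynamic-programming version, not the code from the notebooks; the length normalisation in `normalized_distance` is an assumption added for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Distance scaled to [0, 1] by the longer text (assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))  # 3
print(normalized_distance("return a + b", "return a - b"))
```

The pure-Python version is quadratic in text length, so for large module sources a compiled implementation would likely be needed.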
Jan 20 2021
Jan 14 2021
Jan 13 2021
Jan 6 2021
Dec 31 2020
@tanny411 you did a really good job!
Dec 30 2020
Dec 29 2020
Dec 25 2020
Dec 24 2020
Dec 23 2020
Dec 22 2020
Sadly, today's faulty Toolforge update is slowing things down...
Dec 21 2020
Dec 19 2020
For file downloading: scp seems to be working just fine, but I wasn't able to get ssh tunneling from here to work.