@tanny411 Thanks for the help with Dataframes, they are way faster when used like that.
Mar 7 2021
Mar 2 2021
In T274787#6873628, @tanny411 wrote: Instead of having a list of hardcoded families, maybe you can use the data acquired from meta_table to get a list of all families as well as languages and display them. To avoid repeated calls, just saving them in an array when the app initializes should work.
You will understand the app performance issues better, so I'm leaving the decision to you, let me know where I can jump in.
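A minimal sketch of the caching idea from the quote above, in Python. The function `fetch_wiki_rows` and the column names `dbname`, `family` and `lang` are assumptions made for illustration; in the real app the rows would come from the meta_table query rather than from a hardcoded list.

```python
from functools import lru_cache


def fetch_wiki_rows():
    """Hypothetical stand-in for the real meta_table query.

    The rows and column names here are illustrative assumptions.
    """
    return [
        {"dbname": "enwiki", "family": "wikipedia", "lang": "en"},
        {"dbname": "hiwiktionary", "family": "wiktionary", "lang": "hi"},
        {"dbname": "trwiki", "family": "wikipedia", "lang": "tr"},
    ]


@lru_cache(maxsize=1)
def families_and_languages():
    """Query once, then serve the cached result on every later call."""
    rows = fetch_wiki_rows()
    families = sorted({row["family"] for row in rows})
    languages = sorted({row["lang"] for row in rows})
    return families, languages


families, languages = families_and_languages()
print(families)   # filter options for the family dropdown
print(languages)  # filter options for the language dropdown
```

Calling `families_and_languages()` once at startup warms the cache, so later requests do not hit the database again.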
Today was the second day of fighting with rolling out the production version, but I believe I have finally fixed everything for the current version, so the main functionality is working. You can test it at https://abstract-wiki-ds.toolforge.org/
I'm planning on fixing the CSS tomorrow, and maybe adding the language filter, which is left out for now because it takes too long to check all the entries.
Feb 16 2021
Feb 15 2021
Update status:
Feb 9 2021
Feb 3 2021
Feb 1 2021
The update is live! See T272523 for the new connection scheme.
Jan 30 2021
@Quiddity thanks for the info, it was really interesting to find out how this problem was handled. It really looks like encountering errors of this type should not be unusual at all, considering the previous way of storing interwiki links.
Jan 29 2021
@Quiddity thank you for this interesting observation! Do you know whether the initial linking to Wikidata pages was done by users or by bots? I'm curious because querying through the API correctly shows that w:tr:Modül:Konum haritası/veri/Polonya is Scribunto-type and belongs to namespace 828.
Jan 27 2021
@tanny411 you did a great job creating this report!
I've run this script from my local PC using ssh tunnels. I tried different variations:
- Connecting through ssh to meta database and connecting to enwiki database;
- Using pandas to fetch the result and using basic pymysql cursor.fetchall()
- Using LIMIT 500 and LIMIT 2 OFFSET 100
Jan 26 2021
Currently the idea is to use a metric based on the Levenshtein distance as the distance between texts (examples in this notebook or this notebook for bigger cases). The current idea of the closeness detection algorithm looks like this:
@tanny411 So, yes, my logic is something like that: they are all different, and it looks like all of them have "?" in the title. Can we drop them, or is there something I'm missing?
@tanny411 I've been looking through your notebook and there are things I've seen previously too. Modules like Module:inc-ash/dial/data/?? and Module:zh/data/ltc-pron/? all refer to different translation and/or pronunciation information for the word. They would have different source code, which is logical, and I'm not sure we want to analyze them at all. At the same time, in my tests they are usually detected by Levenshtein distance quite easily, so it might not be worth the work to drop them.
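For reference, a minimal Python sketch of the Levenshtein distance mentioned above, plus one possible normalization to the range 0..1. This is not the notebook code: the function names and the choice to normalize by the longer text are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the row buffer as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Distance scaled by the longer text (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))
print(normalized_distance("return p", "return q"))
```

This runs in time proportional to the product of the two lengths, so for long module sources a compiled implementation would likely be preferable.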
Jan 20 2021
In T271957#6752450, @dr0ptp4kt wrote: @LostEnchanter I think I understand. I believe what you may be witnessing here is that no page uses Module:yesno (lowercase 'y') indirectly (e.g., a page using a template #invokeing some module which in turn requires yesno) on hiwiktionary, and therefore no MediaWiki parser hooks (and manually executed or scheduled link refresh jobs) run in a context where a determination is made to insert an entry into templatelinks.
Jan 14 2021
Jan 13 2021
In T271957#6744792, @Aklapper wrote: Hi, where exactly to see a Templatelinks table (URL)? Please follow https://www.mediawiki.org/wiki/How_to_report_a_bug whenever possible - thanks a lot!
Jan 6 2021
Dec 31 2020
In T270492#6715434, @tanny411 wrote: My idea was that some pages are highly protected and this may mean they are important modules (therefore also used in a lot of places). Those can be prioritized to be centralized.
@tanny411 you did a really good job!
Dec 30 2020
Dec 29 2020
Dec 25 2020
Dec 24 2020
Dec 23 2020
In T270500#6709572, @tanny411 wrote: We could use dbname, but that wasn't saved from the content fetcher. When loading from the database I guess that won't matter, so yes, we can use dbname for sure.
Dec 22 2020
Faulty Toolforge update today slows things down, sadly...
Dec 21 2020
In T270494#6704693, @tanny411 wrote: I've tried to compare pages collected by API and db (id and titles only) by ids. Had to go through a LOT of memory errors to run this script.
This is the output:
Length of db pages: 275154
Length of api pages: 274543
Length of unique pages in db: 740 # pages not found from API calls
Length of unique pages in api: 129 # pages not listed from db queries
Ok
It seems there are some discrepancies. I am looking into what these files are and if there's any pattern here.
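The comparison described above can be sketched with plain set operations on page ids. This is an illustration in Python, not the original script: the function name and the toy ids are assumptions, and keeping only integer ids in memory (rather than full rows) is one possible way to reduce memory pressure.

```python
def compare_page_ids(db_ids, api_ids):
    """Split page ids into those seen only in db, only in API, or in both."""
    db_set, api_set = set(db_ids), set(api_ids)
    return {
        "only_in_db": db_set - api_set,    # pages not found from API calls
        "only_in_api": api_set - db_set,   # pages not listed from db queries
        "in_both": db_set & api_set,
    }


# Toy ids for illustration only.
result = compare_page_ids(db_ids=[1, 2, 3, 4], api_ids=[3, 4, 5])
print(len(result["only_in_db"]), len(result["only_in_api"]))
```

As a sanity check on the reported numbers, 275154 - 740 and 274543 - 129 both give 274414, which is the number of pages present in both sources.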
@LostEnchanter I think it's a good idea to save data into databases and process from there. Loading contents gives a couple of errors due to the presence of all kinds of symbols in the code (quotes and commas). Since we are going to use the db anyway, I think it's best not to try to solve all these errors now. (I did spend a good amount of time trying to load the csv to compare page entries with the db, but then I went on with a workaround for now.)
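For reference, a small Python sketch of how source code containing quotes, commas and newlines can survive a CSV round trip when every field is quoted. The row layout (id, title, source) and the helper names are assumptions for illustration, not the format the content fetcher actually writes.

```python
import csv
import io


def write_rows(rows) -> str:
    """Serialize rows to CSV text, quoting every field."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)
    return buffer.getvalue()


def read_rows(text: str):
    """Parse CSV text back into rows; newline="" keeps embedded newlines intact."""
    return list(csv.reader(io.StringIO(text, newline="")))


# Hypothetical row: id, title, module source with quotes, commas and a newline.
rows = [["1", "Module:Test", 'local p = {"a", "b"}\nreturn p']]
assert read_rows(write_rows(rows)) == rows
```

Very long fields can exceed the csv module's default field size limit, in which case `csv.field_size_limit` may need to be raised before reading.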
Dec 19 2020
For file downloading: scp seems to be working just fine, but I wasn't able to get ssh tunneling from here to work