Apr 5 2021
I am closing this task. Some subtasks are still pending (not a priority); should we move them to a new task? Additional subtasks will also be created for project improvement.
Mar 8 2021
Mar 7 2021
Should this be closed? @LostEnchanter
Mar 2 2021
@LostEnchanter Glad I could help! And really great work!
I couldn't test it because I couldn't find where you populated the linked_df dataframe, sorry about that. But this snippet should be enough:
Get a list of all dbs with the chosen families:
dbs = linkage_df[linkage_df['family'].isin(chosen_families_list)]['database']
Filter the scores dataframe with the retrieved dbs list:
df = df[df['dbname'].isin(dbs)]
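For reference, here is a minimal self-contained sketch of the same two steps. Only the column names (`family`, `database`, `dbname`) come from the snippet above; the sample rows, the `score` column, and the contents of `chosen_families_list` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the snippet above.
linkage_df = pd.DataFrame({
    "database": ["enwiki", "enwiktionary", "frwiki", "frwiktionary"],
    "family": ["wikipedia", "wiktionary", "wikipedia", "wiktionary"],
})
df = pd.DataFrame({
    "dbname": ["enwiki", "enwiktionary", "frwiki", "dewiki"],
    "score": [0.9, 0.4, 0.7, 0.1],
})
chosen_families_list = ["wikipedia"]

# Step 1: databases that belong to the chosen families.
dbs = linkage_df[linkage_df["family"].isin(chosen_families_list)]["database"]

# Step 2: keep only the score rows whose dbname is in that list.
# Note that .isin() is called on the column, inside the indexing brackets.
filtered = df[df["dbname"].isin(dbs)]
print(filtered)
```

The boolean mask has to be built from the column (`df["dbname"].isin(dbs)`) and then used to index `df`; calling `.isin()` on the result of the indexing does not filter anything.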
Feb 26 2021
I've updated the pdf to include noise removal and tuning analysis. This should be it for now regarding similarity analysis. Feel free to send feedback.
Feb 23 2021
@LostEnchanter @gengh @dr0ptp4kt
I've attached my analysis and procedures so far, and added some more tasks as to-dos. Tuning takes a while because clustering is slow, but fortunately there isn't much to tune.
Feb 17 2021
I shared a short doc on how the scoring metric works, maybe something we can incorporate into our final report too.
Feb 10 2021
Feb 3 2021
Thanks a lot, @Quiddity. These actually help!
Feb 2 2021
@Quiddity, @dr0ptp4kt, and others, we need some help deciding which features to weight when identifying important modules, and especially how to combine the stats gathered from the various data. Some questions I had specifically were:
@LostEnchanter That's great! I believe we connect to the shards locally, and in our scripts we match each database to the appropriate shard and connect to it? I am not sure what we should ask the Toolforge library devs, since when we connect locally we use pymysql anyway. I see they have made some changes related to this recently; those changes may be worth a look.
Storing in Sources table is brilliant if we want to do it ourselves!
@Quiddity Thanks a lot. Your findings match mine about the Scribunto vs wikitext types, and I've checked the language links table as well: enwiki is indeed not connected to trwiki!
Jan 29 2021
Thanks all, this issue is now resolved!
Jan 28 2021
Thanks @bd808, that might be it. We used analytics when we connected locally.
Jan 27 2021
Jan 26 2021
@LostEnchanter Yes, those with *??* show up as duplicate titles although they are not actually the same. They are letters or symbols that I couldn't get rendered anywhere (web or notebook).
Also, if you were able to identify certain groups/clusters of pages that go together (like pronunciation modules), then maybe we can find modules similar to them and start reducing our data for further analysis.
Jan 25 2021
Jan 21 2021
Hi, we would love to test with our code (T263678). We already do connect to databases we need explicitly and we don't have any inter-wiki joins, so that's good. Although when working locally, we connect with meta and use all other dbs as required because connecting to *all* the dbs with SSH is quite a hassle. I believe this shortcut will not work anymore? I think we need to handle this hassle with mappings.
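A rough sketch of what such a mapping could look like for local work is below. Everything in it is an assumption for illustration: the section names, the local tunnel ports, and the database-to-section table are placeholders, and `connection_kwargs` is a hypothetical helper, not part of any Toolforge library.

```python
# Hypothetical local setup: one SSH tunnel per replica section,
# each forwarded to a different local port. All values are placeholders.
TUNNEL_PORTS = {"s1": 3307, "s2": 3308, "s7": 3309}

# Hypothetical database -> section table; in practice this would be
# loaded from wherever the project keeps its list of wikis.
DB_SECTION = {"enwiki": "s1", "trwiki": "s2", "metawiki": "s7"}


def connection_kwargs(dbname):
    """Return host/port/database settings for the tunnel serving dbname."""
    section = DB_SECTION[dbname]
    return {
        "host": "127.0.0.1",
        "port": TUNNEL_PORTS[section],
        "database": f"{dbname}_p",  # assumes the replica naming <dbname>_p
    }


print(connection_kwargs("enwiki"))
```

The returned dictionary could be passed on to `pymysql.connect(...)` together with the user's credentials, so the rest of the scripts only ever need to know the `dbname`.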
Jan 16 2021
Interesting! Does that mean the templatelinks table is updated only when a module is actually being 'used'?
Jan 14 2021
Jan 13 2021
Jan 12 2021
Indeed, tl_from is a unique value as it is the page_id. Similarly, there were other instances where I could remove the DISTINCT. I still need to measure the time improvement, but there will not be any data loss, that's for sure.
Jan 11 2021
Hi, thanks for the feedback! I was working on solving some issues with the pageviews; I am going to try out your suggestion for the templatelinks table later today.
Jan 8 2021
Jan 7 2021
@dr0ptp4kt It seems Toolforge has gotten super slow because of the running jobs. It's really hard to continue working on other things from Toolforge. Should I stop the jobs for now (although they have been running for a long time)? I'm debating it myself.
Jan 6 2021
Jan 5 2021
Dec 31 2020
My idea was that some pages are highly protected and this may mean they are important modules (therefore also used in a lot of places). Those can be prioritized to be centralized.
@LostEnchanter Hi, I spent a couple of days going through the entire database layout and extracting as much information as I found relevant. I have listed it all out. Next, I will look into how to get pageview information, as that is not in the database, and then I will start storing all the info in the user database.
Dec 26 2020
After cleaning and scrutinizing the data more, here is the summary (taking only ns 828 and Scribunto modules):
- 118 pages from the DB were not found in the API allpages list. Of them, 2 are actual Scribunto modules and so were loaded into our DB. The rest are not Scribunto modules although the DB says so, so they were ignored.
- 98 pages were found from the API but not in the DB.
Dec 25 2020
A couple of points of confusion I ran into:
Dec 24 2020
Dec 23 2020
We could use dbname, but that wasn't saved by the content fetcher. When loading from the database I guess that won't matter, so yes, we can use dbname for sure.
Dec 21 2020
Could you also describe what you mean by 'length' there? The number of characters in the Lua source code?
Dec 20 2020
I've tried to compare the pages collected by the API and the DB (ids and titles only) by id. I had to go through a LOT of memory errors to run this script.
This is the output:
Number of db pages: 275154
Number of api pages: 274543
Number of unique pages in db: 740 # pages not found from API calls
Number of unique pages in api: 129 # pages not listed from db queries
Ok
It seems there are some discrepancies. I am looking into what these files are and if there's any pattern here.
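A minimal sketch of this kind of comparison using plain set operations is shown below. The `compare_ids` helper and the sample ids are made up for illustration, and keying each page by `(dbname, page_id)` is an assumption, made because page ids are only unique within a single wiki.

```python
def compare_ids(db_ids, api_ids):
    """Split two collections of page keys into DB-only, API-only and shared."""
    db_ids, api_ids = set(db_ids), set(api_ids)
    return {
        "only_in_db": db_ids - api_ids,    # pages not found from API calls
        "only_in_api": api_ids - db_ids,   # pages not listed from DB queries
        "in_both": db_ids & api_ids,
    }


# Hypothetical sample keys: (dbname, page_id) pairs.
db_pages = [("enwiki", 1), ("enwiki", 2), ("frwiki", 1), ("frwiki", 7)]
api_pages = [("enwiki", 2), ("frwiki", 1), ("frwiki", 9)]

result = compare_ids(db_pages, api_pages)
for name, keys in result.items():
    print(name, len(keys))
```

Holding only the keys in memory, rather than full page rows or titles, is one way to keep the memory footprint of such a comparison small.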
Dec 16 2020
Oct 30 2020
@SafiaKhaleel Yes, after recording a contribution you should submit a final application. That's where you will be asked to write your prospective timeline for the project.
Oct 23 2020
@Tambe Alternately you could:
Oct 14 2020
Thanks @dr0ptp4kt. I was working with the revision API, where I wanted to get content for all the pages using a generator. But the API doesn't seem to return revision content for most pages.
Plus I wanted to get only the latest revision content, but that seems to be possible only for single-page queries. I could use a little help here.
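One possible approach, sketched below under some assumptions, is to combine `generator=allpages` with `prop=revisions` and follow the `continue` block the API returns. As far as I understand, when `rvlimit` is omitted each page comes back with only its latest revision, and pages whose content did not fit into one response arrive without a `revisions` key and are filled in by later continuation requests. The `fetch` callable is hypothetical (any function that sends the parameters to the wiki's `api.php` and returns the decoded JSON), and `fake_fetch` only imitates two responses so the sketch runs without network access.

```python
def build_params(namespace=828, batch_size=50):
    """Query parameters for the latest revision content of pages in a namespace."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "generator": "allpages",
        "gapnamespace": str(namespace),
        "gaplimit": str(batch_size),
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
    }


def iter_latest_content(fetch, base_params):
    """Yield (title, content) pairs, following API continuation."""
    params = dict(base_params)
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", []):
            # Pages without a "revisions" key are delivered in a later batch.
            for rev in page.get("revisions", []):
                yield page["title"], rev["slots"]["main"]["content"]
        if "continue" not in data:
            break
        # Rebuild from the base parameters so stale continue values are dropped.
        params = {**base_params, **data["continue"]}


# Canned responses standing in for real API calls (made-up data).
_responses = [
    {"continue": {"rvcontinue": "2", "continue": "||"},
     "query": {"pages": [
         {"title": "Module:A",
          "revisions": [{"slots": {"main": {"content": "return 1"}}}]},
         {"title": "Module:B"},
     ]}},
    {"query": {"pages": [
        {"title": "Module:B",
         "revisions": [{"slots": {"main": {"content": "return 2"}}}]},
    ]}},
]


def fake_fetch(params):
    # Serve the second canned response once a continuation value is sent.
    return _responses[1] if "rvcontinue" in params else _responses[0]


pages = list(iter_latest_content(fake_fetch, build_params()))
print(pages)
```

With a real `fetch`, for example one built on an HTTP client of your choice, the same loop would keep requesting until the response no longer contains a `continue` block.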
Hi, I am an outreachy applicant and interested in joining this project. I went through the task and I will get started with it right away.
I just need a little clarification: are we all going to solve the same task, or are there others I have to look at?
Oct 13 2020
Hi, regarding the comparison of dump and API data, should we compare all the data or 10 randomly selected pages? I just want to be sure whether the API will support requests for lots of page ids.
Thanks @Miriam, makes sense :D
Hi everyone, I am an outreachy applicant and super excited to get on board and start contributing!
Hi @Miriam, I am an Outreachy applicant, excited to be a part of this project. Are there any additional steps before I start with the notebooks?
Also, I see T263874 is the same for both inferring country and this project. Could I get some clarification as to which project I will be working on when completing the subtask? Or are they part of the same project?