Personal Accounts:
- Phab: tanny411
- Meta: Aisha Khatun
Check out my website/blog: http://tanny411.github.io/
Update week of 20 - 26 May, 2024:
Update week of 13 - 19 May, 2024:
Update week of 6 - 12 May, 2024:
Update week of 29 April - 5 May, 2024:
Update week of 22-28 April, 2024:
Update week 15 to 21 April 2024:
Update week 8 to 14 April 2024:
The exploratory part of link-recommendation for add-a-link is done.
Update 1/4/2024 - 7/4/2024:
Update 25/3/2024 - 31/3/2024:
Update 18/3/2024 - 24/3/2024:
Update 11/3/2024 - 17/3/2024:
Update 4/3/2024 - 10/3/2024:
Update 26/2/2024 - 3/3/2024:
Update 19/2/2024 - 25/2/2024:
Update 29/01/2024 - 04/02/2024:
Update week 22/1/2024 - 28/1/2024:
Results of evaluations after solving this ticket can be found here: https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task#Results.
Update week 15/1/2024 - 21/1/2024:
Update week 8/1/2024 - 14/1/2024:
Update week 1/1/2024 - 7/1/2024:
Update week 19/12/2023 - 24/12/2023:
Update week 11/12/2023 - 17/12/2023:
Update week 27/11/2023 - 3/12/2023:
Update week 20/11/2023 - 26/11/2023:
Update week 13/11/2023 - 19/11/2023:
Update week 6/11/2023 - 12/11/2023:
Update week 30/10/2023 - 5/11/2023:
Update week 23/10/2023 - 29/10/2023:
Update week 16/10/2023 - 22/10/2023:
Update week 9/10/2023 - 15/10/2023:
Update week 2/10/2023 - 8/10/2023:
Update week 25/09/2023 - 1/10/2023:
Thanks @colewhite. I'm all set!
Update week 18/09/2023 - 24/09/2023:
I am getting this error when I run kinit:
kinit: Client 'akhatun@WIKIMEDIA' not found in Kerberos database while getting initial credentials
Am I supposed to get a temporary password through email?
Week 26/6/23 - 2/7/23 Update:
Week 19/6/23 - 25/6/23 Update:
Week 12/6/23 - 18/6/23 Update:
Week 5/6/23 - 11/6/23 Update:
Week 29/5/23 - 4/6/23 Update:
Week 22/5/23 - 28/5/23 Update:
Week 15/5/23 - 21/5/23 Update:
Week 8/5/23 - 14/5/23 Update:
Week 1/5/23 - 7/5/23 Update:
Week 24/4/23 - 30/4/23 Update:
Week 10/4/23 - 16/4/23 Update:
Week 3/4/23 - 9/4/23 Update:
Week 27/3/23 - 2/4/23 Update:
Week 20/3/23 - 26/3/23 Update:
Week 13/3/23 - 19/3/23 Update:
Week 6/3/23 - 12/3/23 Update:
Week 27/2/23 - 5/3/23 Update:
Week 20/2/23 - 26/2/23 Update:
Week 13/2/23 - 19/2/23 Update:
Week 6/2/23 - 12/2/23 Update:
Thank you, accessed!
Week 1/2/23 - 5/2/23 Update:
In T303831#8063021, @EBernhardson wrote: In terms of the exact code causing this, Spark is terrible at telling us exactly where, but trying to infer from the Spark UI output, I think it's this join:
def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
  wikidataTriples
    .filter(s"predicate='<$p31>'")
    .selectExpr("object as subgraph", "subject as item")
    .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")
Update:
I tested a few options in the statbox. I am not sure how well this represents the prod env, but here goes:
In T303831#8058159, @EBernhardson wrote: The airflow patch is deployed, but I only turned on the *_init dags and subgraph_mapping_weekly today (ran out of time, will do the rest tomorrow).
subgraph_mapping_weekly failed the first time through. I updated executor memory from 8g to 12g, but the second execution is still failing. Something is quite unbalanced about topSubgraphItems: of the 8 shards, inputs vary from 100MB to 450MB, giving execution times of ~30s on the small ones and ~8m before the final one fails.
Not specifically related to this patch, but I wonder if we could change the SparkUtils.saveTables method to somehow take parameters in the path that specify coalesce vs repartition and the number of partitions to save by, so we only have to update the airflow invocation, and not the jar as well, to test variations there.
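To make the "parameters in the path" idea concrete, here is a minimal sketch of how such a path could be parsed. Everything in it is an assumption for illustration: the query-string style suffix (`?mode=...&partitions=...`), the function name `parse_save_spec`, and the defaults are hypothetical, and the actual signature of SparkUtils.saveTables is not shown in this thread. It is written in plain Python for brevity, although the job itself is Scala.

```python
from urllib.parse import parse_qs

# Hypothetical: the two write strategies mentioned in the comment above.
VALID_MODES = ("coalesce", "repartition")


def parse_save_spec(spec, default_mode="coalesce", default_partitions=1):
    """Split a hypothetical 'path?mode=...&partitions=N' spec into its parts.

    Returns (path, mode, partitions). The suffix syntax is an assumption,
    not something SparkUtils.saveTables is known to support.
    """
    # Everything before the first '?' is the output path itself.
    path, _, query = spec.partition("?")
    params = parse_qs(query)

    mode = params.get("mode", [default_mode])[0]
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    partitions = int(params.get("partitions", [str(default_partitions)])[0])
    if partitions < 1:
        raise ValueError("partitions must be >= 1")

    return path, mode, partitions


if __name__ == "__main__":
    print(parse_save_spec("/tmp/out/top_subgraph_items?mode=repartition&partitions=16"))
    print(parse_save_spec("/tmp/out/top_subgraph_items"))
```

With something along these lines, only the path string passed from the airflow invocation would need to change to try a different partitioning, which is the point of the suggestion.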
The analysis is done here (for Q-ids): Wikidata_Item_ORES_Score_Analysis
In T288262#7629267, @Lydia_Pintscher wrote: @AKhatun_WMF: You mention on the wiki that some Items don't have an ORES score. All Items should have one 😬 Do you have an example of one that does not?
In T288262#7628599, @MPhamWMF wrote: @AKhatun_WMF, sorry, it's been a while since I wrote this, but I think what I meant by the question about "optimal separation" is this: given some distribution of ORES scores (e.g. a normal distribution), is it clear what the threshold is for a "high" vs "low" score? For example, anything over .75 is a high score. But that's assuming the scores are continuous. I guess it's moot if they're binary (I don't actually know).
If this isn't a sensible way of thinking about the issue, let me know if there's a better way.
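The "high" vs "low" question above can be sketched in a few lines. This is only an illustration under stated assumptions: the 0.75 cutoff is the example value from the comment, not an established ORES threshold, and the function names and sample scores are made up.

```python
def label_scores(scores, threshold=0.75):
    """Label each score 'high' if it is strictly over the threshold, else 'low'.

    The default 0.75 is the example cutoff from the comment above,
    not a recommended value.
    """
    return ["high" if s > threshold else "low" for s in scores]


def looks_binary(scores):
    """Return True if every score is exactly 0 or 1.

    In that case a cutoff between the two values adds no information.
    """
    return all(s in (0, 1) for s in scores)


if __name__ == "__main__":
    sample = [0.2, 0.75, 0.9]  # made-up scores for illustration
    print(label_scores(sample))
    print(looks_binary(sample))
```

Checking `looks_binary` first answers the "moot if they're binary" case. A threshold is only worth choosing if the scores turn out to be continuous.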