Update week of 17 June - 23 June, 2024:
- Changed the read/write methods to fix memory issues. This is fixed for small and medium wikis; larger wikis (e.g. enwiki, jawiki) still hit additional errors.
- Refactored the generate_anchor_dictionary script to make it more modular.
Update week of 10 June - 16 June, 2024:
Update week of 03 June - 09 June, 2024:
Update week of 27 May - 02 June, 2024:
Update week of 20 - 26 May, 2024:
Update week of 13 - 19 May, 2024:
Update week of 6 - 12 May, 2024:
Update week of 29 April - 5 May, 2024:
Update week of 22-28 April, 2024:
Update week 15 to 21 April 2024:
Update week 8 to 14 April 2024:
The exploratory part of link-recommendation for add-a-link is done.
Update 1/4/2024 - 7/4/2024:
Update 25/3/2024 - 31/3/2024:
Update 18/3/2024 - 24/3/2024:
Update 11/3/2024 - 17/3/2024:
Update 4/3/2024 - 10/3/2024:
Update 26/2/2024 - 3/3/2024:
Update 19/2/2024 - 25/2/2024:
Update 29/01/2024 - 04/02/2024:
Update week 22/1/2024 - 28/1/2024:
Results of evaluations after solving this ticket can be found here: https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task#Results.
Update week 15/1/2024 - 21/1/2024:
Update week 8/1/2024 - 14/1/2024:
Update week 1/1/2024 - 7/1/2024:
Update week 19/12/2023 - 24/12/2023:
Update week 11/12/2023 - 17/12/2023:
Update week 27/11/2023 - 3/12/2023:
Update week 20/11/2023 - 26/11/2023:
Update week 13/11/2023 - 19/11/2023:
Update week 6/11/2023 - 12/11/2023:
Update week 30/10/2023 - 5/11/2023:
Update week 23/10/2023 - 29/10/2023:
Update week 16/10/2023 - 22/10/2023:
Update week 9/10/2023 - 15/10/2023:
Update week 2/10/2023 - 8/10/2023:
Update week 25/09/2023 - 1/10/2023:
Thanks @colewhite. I'm all set!
Update week 18/09/2023 - 24/09/2023:
I am getting this error when I run kinit:
kinit: Client 'akhatun@WIKIMEDIA' not found in Kerberos database while getting initial credentials
Am I supposed to get a temporary password through email?
Week 26/6/23 - 2/7/23 Update:
Week 19/6/23 - 25/6/23 Update:
Week 12/6/23 - 18/6/23 Update:
Week 5/6/23 - 11/6/23 Update:
Week 29/5/23 - 4/6/23 Update:
Week 22/5/23 - 28/5/23 Update:
Week 15/5/23 - 21/5/23 Update:
Week 8/5/23 - 14/5/23 Update:
Week 1/5/23 - 7/5/23 Update:
Week 24/4/23 - 30/4/23 Update:
Week 10/4/23 - 16/4/23 Update:
Week 3/4/23 - 9/4/23 Update:
Week 27/3/23 - 2/4/23 Update:
Week 20/3/23 - 26/3/23 Update:
Week 13/3/23 - 19/3/23 Update:
Week 6/3/23 - 12/3/23 Update:
Week 27/2/23 - 5/3/23 Update:
Week 20/2/23 - 26/2/23 Update:
Week 13/2/23 - 19/2/23 Update:
Week 6/2/23 - 12/2/23 Update:
Thank you, accessed!
Week 1/2/23 - 5/2/23 Update:
In T303831#8063021, @EBernhardson wrote: In terms of the exact code causing this, Spark is terrible at telling us exactly where, but trying to infer from the Spark UI output, I think it's this join:
def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
  wikidataTriples
    .filter(s"predicate='<$p31>'")
    .selectExpr("object as subgraph", "subject as item")
    .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")
Update:
I tested a few options on the statbox. I am not sure how well this represents the prod environment, but here goes:
In T303831#8058159, @EBernhardson wrote: The airflow patch is deployed, but I only turned on the *_init dags and subgraph_mapping_weekly today (ran out of time, will do the rest tomorrow).
subgraph_mapping_weekly failed the first time through. I updated executor memory from 8g to 12g, but the second execution is still failing. Something is quite unbalanced about the topSubgraphItems: of the 8 shards, the inputs vary from 100MB to 450MB, giving execution times of ~30s on the small ones and ~8m before the final one fails.
Not specifically related to this patch, but I wonder if we could change the SparkUtils.saveTables method to take parameters in the path that specify coalesce vs repartition and the number of partitions to save by, so that we only have to update the airflow invocation, and not the jar as well, to test variations there.
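To make the path-parameter idea above concrete, here is a minimal sketch of how such a path could be parsed. Everything in it is an assumption for illustration: the `SaveSpec` and `PartitionStrategy` names and the `<path>#<coalesce|repartition>=<n>` suffix format are hypothetical and are not part of the existing SparkUtils.saveTables. The sketch only covers parsing and deliberately has no Spark dependency.

```scala
// Hypothetical sketch: none of these names exist in the real SparkUtils.
sealed trait PartitionStrategy
case object Coalesce extends PartitionStrategy
case object Repartition extends PartitionStrategy

// A parsed save target: the plain path plus an optional (strategy, partition count).
final case class SaveSpec(path: String, partitioning: Option[(PartitionStrategy, Int)])

object SaveSpec {
  private def parseStrategy(mode: String): Either[String, PartitionStrategy] =
    mode match {
      case "coalesce"    => Right(Coalesce)
      case "repartition" => Right(Repartition)
      case other         => Left(s"unknown strategy: $other")
    }

  private def parseCount(n: String): Either[String, Int] =
    n.toIntOption.filter(_ > 0).toRight(s"invalid partition count: $n")

  // Assumed format: "<path>" or "<path>#<coalesce|repartition>=<n>".
  def parse(raw: String): Either[String, SaveSpec] =
    raw.split("#", 2) match {
      case Array(path) =>
        // No suffix: keep whatever partitioning the caller already has.
        Right(SaveSpec(path, None))
      case Array(path, opts) =>
        opts.split("=", 2) match {
          case Array(mode, n) =>
            for {
              strategy <- parseStrategy(mode)
              count    <- parseCount(n)
            } yield SaveSpec(path, Some((strategy, count)))
          case _ =>
            Left(s"expected <strategy>=<n>, got: $opts")
        }
    }
}
```

With something like this, the save step could pattern-match on `partitioning` and call `df.coalesce(n)` or `df.repartition(n)` before writing, so trying a different partition count would only mean changing the path string passed from the airflow invocation.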