Update week 8 to 14 April 2024:
- Went over airflow and research dataset repos
- Sketched an overview of our current code base workflow and a few research airflow repos.
Update week 8 to 14 April 2024:
The exploratory part of link-recommendation for add-a-link is done.
Update 1/4/2024 - 7/4/2024:
Update 25/3/2024 - 31/3/2024:
Update 18/3/2024 - 20/3/2024:
Update 11/3/2024 - 17/3/2024:
Update 4/3/2024 - 10/3/2024:
Update 26/2/2024 - 3/3/2024:
Update 19/2/2024 - 25/2/2024:
Update 29/01/2024 - 04/02/2024:
Update week 22/1/2024 - 28/1/2024:
Results of evaluations after solving this ticket can be found here: https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task#Results.
Update week 15/1/2024 - 21/1/2024:
Update week 8/1/2024 - 14/1/2024:
Update week 1/1/2024 - 7/1/2024:
Update week 19/12/2023 - 24/12/2023:
Update week 11/12/2023 - 17/12/2023:
Update week 27/11/2023 - 3/12/2023:
Update week 20/11/2023 - 26/11/2023:
Update week 13/11/2023 - 19/11/2023:
Update week 6/11/2023 - 12/11/2023:
Update week 30/10/2023 - 5/11/2023:
Update week 23/10/2023 - 29/10/2023:
Update week 16/10/2023 - 22/10/2023:
Update week 9/10/2023 - 15/10/2023:
Update week 2/10/2023 - 8/10/2023:
Update week 25/09/2023 - 1/10/2023:
Thanks @colewhite. I'm all set!
Update week 18/09/2023 - 24/09/2023:
I am getting this error when I run kinit:
kinit: Client 'akhatun@WIKIMEDIA' not found in Kerberos database while getting initial credentials
Am I supposed to get a temporary password through email?
Week 26/6/23 - 2/7/23 Update:
Week 19/6/23 - 25/6/23 Update:
Week 12/6/23 - 18/6/23 Update:
Week 5/6/23 - 11/6/23 Update:
Week 29/5/23 - 4/6/23 Update:
Week 22/5/23 - 28/5/23 Update:
Week 15/5/23 - 21/5/23 Update:
Week 8/5/23 - 14/5/23 Update:
Week 1/5/23 - 7/5/23 Update:
Week 24/4/23 - 30/4/23 Update:
Week 10/4/23 - 16/4/23 Update:
Week 3/4/23 - 9/4/23 Update:
Week 27/3/23 - 2/4/23 Update:
Week 20/3/23 - 26/3/23 Update:
Week 13/3/23 - 19/3/23 Update:
Week 6/3/23 - 12/3/23 Update:
Week 27/2/23 - 5/3/23 Update:
Week 20/2/23 - 26/2/23 Update:
Week 13/2/23 - 19/2/23 Update:
Week 6/2/23 - 12/2/23 Update:
Thank you, accessed!
Week 1/2/23 - 5/2/23 Update:
In T303831#8063021, @EBernhardson wrote: In terms of the exact code causing this, spark is terrible at telling us exactly where but trying to infer from the SparkUI output i think it's this join:
def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
  wikidataTriples
    .filter(s"predicate='<$p31>'")
    .selectExpr("object as subgraph", "subject as item")
    .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")
Update:
I tested a few options in the statbox. I am not sure how well this represents the prod env, but here goes:
In T303831#8058159, @EBernhardson wrote: the airflow patch is deployed but i only turned on *_init dags and subgraph_mapping_weekly today (ran out of time, will do rest tomorrow).
subgraph_mapping_weekly failed the first time through. I updated executor memory from 8g to 12g but the second execution is still failing. something is quite unbalanced about the topSubgraphItems, of the 8 shards they have inputs varying from 100MB to 450MB giving executions times of ~30s on the small ones and ~8m before the final one fails.
Not specifically related to this patch, but i wonder if we could change up the SparkUtils.saveTables method to somehow take parameters in the path to specify coalesce vs repartition and the number of partitions to save by, so we only have to update the airflow invocation and not the jar as well to test variations there.
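A rough sketch of the idea in the last paragraph, written in Python only to keep it short (the real SparkUtils.saveTables lives in the Scala jar). Everything here is an assumption for illustration: the `?mode=...&partitions=...` suffix syntax, the names `SaveSpec`, `parse_save_spec` and `apply_save_spec`, the table name, and the defaults. The only Spark calls assumed are `DataFrame.coalesce(n)` and `DataFrame.repartition(n)`.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs


@dataclass(frozen=True)
class SaveSpec:
    """Hypothetical parsed form of a save target such as 'db.table?mode=repartition&partitions=64'."""
    path: str
    mode: str = "coalesce"   # assumed default, not taken from the jar
    partitions: int = 1      # assumed default, not taken from the jar


def parse_save_spec(raw: str) -> SaveSpec:
    """Split an optional '?key=value&...' suffix off the path and validate it."""
    path, _, query = raw.partition("?")
    params = parse_qs(query)
    mode = params.get("mode", ["coalesce"])[0]
    if mode not in ("coalesce", "repartition"):
        raise ValueError(f"unknown mode: {mode!r}")
    partitions = int(params.get("partitions", ["1"])[0])
    if partitions < 1:
        raise ValueError("partitions must be >= 1")
    return SaveSpec(path=path, mode=mode, partitions=partitions)


def apply_save_spec(df, spec: SaveSpec):
    """Return df with the requested partitioning; df is any object exposing
    coalesce(n) and repartition(n), as a Spark DataFrame does."""
    if spec.mode == "repartition":
        return df.repartition(spec.partitions)
    return df.coalesce(spec.partitions)


# Hypothetical table name; only the suffix would change between test runs.
spec = parse_save_spec("some_db.some_table?mode=repartition&partitions=64")
print(spec)
```

With something like this, trying a different partition count would only mean editing the argument passed from the airflow invocation, not rebuilding the jar.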
The analysis is done here (for Q-ids): Wikidata_Item_ORES_Score_Analysis
In T288262#7629267, @Lydia_Pintscher wrote: @AKhatun_WMF: You mention on the wiki that some Items don't have an ORES score. All Items should have one 😬 Do you have an example of one that does not?
In T288262#7628599, @MPhamWMF wrote: @AKhatun_WMF, sorry, it's been a while since I wrote this, but I think what I meant when I wrote the question about "optimal separation" is given some distribution of ORES scores (e.g. a normal distribution), is it clear what the threshold is for what qualifies as a "high" vs "low" score: e.g. anything over .75 is a high score. But that's assuming the scores are continuous. I guess it's moot if they're binary (I don't actually know).
If this isn't a sensible way of thinking about the issue, let me know if there's a better way.
@MPhamWMF Hi, could you please clarify the question "Is there an optimal separation between high/low ORES scores?" Separation in what respect? What comes to my mind is the separation of items with respect to the subgraph they are part of.
@ACraze Indeed! I was confusing the models for revision (item quality) with edits (damaging/good faith). The latest revision is all I will need. Thank you!
Counts of queries and triples for astronomical objects were done here: Wikidata_Subgraph_Query_Analysis, along with the top ~300 large subgraphs.
For the specific case of astronomical objects (and only astronomical objects), a list of all its subclasses was obtained and manually inspected for relevance to astronomical objects. The list is transitive: it also includes subclasses of subclasses, and so on.
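For illustration, collecting "subclasses of subclasses and so on" amounts to a breadth-first walk over subclass-of edges. This is a minimal sketch, not the code used in the analysis: the function name `all_subclasses` and the toy class labels are made up, and the real list came from Wikidata rather than a hand-written edge list.

```python
from collections import deque


def all_subclasses(root, subclass_edges):
    """Return every direct or indirect subclass of `root`.

    `subclass_edges` is an iterable of (child, parent) pairs,
    read as "child is a subclass of parent".
    """
    # Index the edges by parent so each lookup is cheap.
    children = {}
    for child, parent in subclass_edges:
        children.setdefault(parent, set()).add(child)

    seen = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            # `seen` guards against revisiting classes (and against cycles).
            if child != root and child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical toy hierarchy, for illustration only.
edges = [
    ("star", "astronomical object"),
    ("galaxy", "astronomical object"),
    ("variable star", "star"),
    ("pulsating variable star", "variable star"),
    ("city", "human settlement"),
]
print(sorted(all_subclasses("astronomical object", edges)))
```

The resulting set is what would then be inspected by hand for relevance, since a subclass hierarchy can pull in classes that are only loosely related to the root.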
Details can be found here: Wikidata_Subgraph_Query_Analysis