Personal Accounts:
- Phab: tanny411
- Meta: Aisha Khatun
Check out my website/blog: http://tanny411.github.io/
Week 29/5/23 - 4/6/23 Update:
Week 22/5/23 - 28/5/23 Update:
Week 15/5/23 - 21/5/23 Update:
Week 8/5/23 - 14/5/23 Update:
Week 1/5/23 - 7/5/23 Update:
Week 24/4/23 - 30/4/23 Update:
Week 10/4/23 - 16/4/23 Update:
Week 3/4/23 - 9/4/23 Update:
Week 27/3/23 - 2/4/23 Update:
Week 20/3/23 - 26/3/23 Update:
Week 13/3/23 - 19/3/23 Update:
Week 6/3/23 - 12/3/23 Update:
Week 27/2/23 - 5/3/23 Update:
Week 20/2/23 - 26/2/23 Update:
Week 13/2/23 - 19/2/23 Update:
Week 6/2/23 - 12/2/23 Update:
Thank you, I accessed it!
Week 1/2/23 - 5/2/23 Update:
In T303831#8063021, @EBernhardson wrote: In terms of the exact code causing this, Spark is terrible at telling us exactly where, but trying to infer from the SparkUI output I think it's this join:
def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
  wikidataTriples
    .filter(s"predicate='<$p31>'")
    .selectExpr("object as subgraph", "subject as item")
    .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")
}
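Not part of the quoted patch, but a quick way to see how lopsided the join key is would be to count P31 triples per subgraph on the same data; a minimal sketch, assuming the same wikidataTriples and p31 definitions as above:

import org.apache.spark.sql.functions.desc

// Hedged sketch: count P31 triples per subgraph to see which join keys dominate.
val itemsPerSubgraph = wikidataTriples
  .filter(s"predicate='<$p31>'")
  .selectExpr("object as subgraph")
  .groupBy("subgraph")
  .count()
  .orderBy(desc("count"))

itemsPerSubgraph.show(20, truncate = false)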
Update:
I tested a few options in the statbox; I am not sure how well this represents the prod env, but here goes:
In T303831#8058159, @EBernhardson wrote: The Airflow patch is deployed, but I only turned on the *_init DAGs and subgraph_mapping_weekly today (ran out of time; will do the rest tomorrow).
subgraph_mapping_weekly failed the first time through. I updated executor memory from 8g to 12g, but the second execution is still failing. Something is quite unbalanced about topSubgraphItems: of the 8 shards, the inputs vary from 100MB to 450MB, giving execution times of ~30s on the small ones and ~8m before the final one fails.
Not specifically related to this patch, but I wonder if we could change up the SparkUtils.saveTables method to somehow take parameters in the path to specify coalesce vs. repartition and the number of partitions to save by, so we only have to update the Airflow invocation, and not the jar as well, to test variations there.
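A hypothetical sketch of that suggestion (not the actual SparkUtils.saveTables code): let the caller choose coalesce vs. repartition and the partition count, so the Airflow invocation can be tuned without rebuilding the jar.

import org.apache.spark.sql.DataFrame

// Hypothetical helper, illustrating the parameterization only.
def saveTable(df: DataFrame, table: String, numPartitions: Int, useCoalesce: Boolean): Unit = {
  val partitioned =
    if (useCoalesce) df.coalesce(numPartitions)
    else df.repartition(numPartitions)
  partitioned.write.mode("overwrite").saveAsTable(table)
}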
The analysis is done here (for Q-ids): Wikidata_Item_ORES_Score_Analysis
In T288262#7629267, @Lydia_Pintscher wrote: @AKhatun_WMF: You mention on the wiki that some Items don't have an ORES score. All Items should have one 😬 Do you have an example of one that does not?
In T288262#7628599, @MPhamWMF wrote: @AKhatun_WMF, sorry, it's been a while since I wrote this, but I think what I meant when I wrote the question about "optimal separation" is: given some distribution of ORES scores (e.g. a normal distribution), is it clear what the threshold is for what qualifies as a "high" vs. "low" score, e.g. anything over .75 is a high score? But that's assuming the scores are continuous. I guess it's moot if they're binary (I don't actually know).
If this isn't a sensible way of thinking about the issue, let me know if there's a better way.
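Purely as an illustration of that threshold idea (assuming a hypothetical oresScores DataFrame with a numeric score column), quantiles of the score distribution would be one way to pick a high/low cut-off instead of a fixed value like 0.75:

// Illustrative only; oresScores and its "score" column are assumed names.
val cutoffs = oresScores.stat.approxQuantile("score", Array(0.25, 0.5, 0.75, 0.9), 0.01)
println(cutoffs.mkString(", "))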
@MPhamWMF Hi, could you please clarify the question "Is there an optimal separation between high/low ORES scores?" Separation in what respect? What comes to my mind is the separation of items with respect to the subgraph they are part of.
@ACraze Indeed! I was confusing the models for revision (item quality) with edits (damaging/good faith). The latest revision is all I will need. Thank you!
Counts of queries and triples for astronomical objects were done here: Wikidata_Subgraph_Query_Analysis, along with the top ~300 large subgraphs.
For the specific case of astronomical objects (and only astronomical objects), a list of all their subclasses was obtained and manually inspected for relevance to astronomical objects. The subclass list also includes subclasses of subclasses, and so on.
Details can be found here: Wikidata_Subgraph_Query_Analysis
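For illustration, a transitive subclass closure like the one described above could be computed roughly as follows in Spark. This is only a sketch, assuming a wikidataTriples table with subject/predicate/object columns and a p279 ("subclass of") predicate URI analogous to p31 above, and it leaves out the manual inspection step:

import org.apache.spark.sql.DataFrame

// Sketch only: collect all subclasses of a root class (e.g. astronomical object),
// including subclasses of subclasses, by iterating until no new subclasses appear.
def subclassClosure(wikidataTriples: DataFrame, p279: String, root: String): DataFrame = {
  val subclassOf = wikidataTriples
    .filter(s"predicate='<$p279>'")
    .selectExpr("subject as child", "object as parent")
    .cache()

  var closure = subclassOf.filter(s"parent='$root'").select("child").distinct()
  var previousCount = -1L
  var currentCount = closure.count()
  while (currentCount != previousCount) {
    closure = closure
      .union(
        subclassOf
          .join(closure.withColumnRenamed("child", "parent"), Seq("parent"))
          .select("child"))
      .distinct()
    previousCount = currentCount
    currentCount = closure.count()
  }
  closure
}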
With the completion of all subtasks, this task is complete.
The analysis was completed and documented here: Wikidata_Subgraph_Query_Analysis
Some analysis was done here:
The analysis was completed and documented here: https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Subgraph_Analysis
Basically Wikidata's Properties have a datatype.
Ah, datatype of properties.
I am not seeing that in the analysis you linked but maybe I am overlooking something.
The one I listed is for datatype of objects, so you didn't miss anything.
Thank you for clarifying! It should be fairly easy to find out as well :)
@Lydia_Pintscher
Is this ticket asking for counts of the various datatypes used in Wikidata? Both URIs and literals.
Does wikitech:User:AKhatun/Wikidata_Basic_Analysis#Object help?
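For a rough sense of what such a count could look like (the linked analysis is the actual source), objects can be split into URIs and literals by surface form; a sketch assuming a wikidataTriples table that stores URIs as <...> in the object column:

import org.apache.spark.sql.functions.{col, when}

// Illustration only: classify triple objects as URI vs. literal and count them.
val objectKinds = wikidataTriples
  .withColumn("kind", when(col("object").startsWith("<"), "uri").otherwise("literal"))
  .groupBy("kind")
  .count()

objectKinds.show()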
Interested in playing with autoencoders.
"write a script that will randomly combine these audio files and sample the latent spaces of their combined embeddings to create new machine-generated audio files"
Does this entail that we train the autoencoder on the dataset we curated from Commons and then have it generate a sample audio file from random numbers? Maybe I'm a bit confused about what 'randomly combining' audio files means here.
Astronomical objects are structured hierarchically, so not everything is a direct instance of Q6999 (unlike scholarly articles).
Query analysis report for some vertical slices of Wikidata: Wikidata_Vertical_Analysis#Query_Analysis
Summary: Wikidata_Vertical_Analysis#TL;DR
Here is the analysis done on scholarly articles in Wikidata and WDQS queries related to them: https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Scholarly_Articles_Subgraph_Analysis