With the completion of T293632 and T293636, this task is complete.
- Queries
- All Stories
- Search
- Advanced Search
- Transactions
- Transaction Logs
Advanced Search
Jan 6 2022
With the completion of all subtasks, this task is complete.
The analysis was completed and documented here: Wikidata_Subgraph_Query_Analysis
Nov 15 2021
Nov 11 2021
Nov 9 2021
Some analysis was done here:
- Property usage across subgraphs: Predicates_across_subgraphs
- Top predicates also used in scholarly articles: Top_properties_used_in_other_subgraphs
The analysis was completed and documented here: https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Subgraph_Analysis
Nov 8 2021
Oct 19 2021
Basically, Wikidata's properties have a datatype.
Ah, datatype of properties.
I am not seeing that in the analysis you linked but maybe I am overlooking something.
The one I listed is for datatype of objects, so you didn't miss anything.
Thank you for clarifying! It should be fairly easy to find out as well :)
Oct 18 2021
@Lydia_Pintscher
Is this ticket asking for counts of the various datatypes used in Wikidata, both URIs and literals?
Does wikitech:User:AKhatun/Wikidata_Basic_Analysis#Object help?
Oct 4 2021
Interested in playing with autoencoders.
write a script that will randomly combine these audio files and sample the latent spaces of their combined embeddings to create new machine-generated audio files
Does this mean we train the autoencoder on the dataset we curated from Commons and then have it generate a sample audio file from random numbers? Maybe I'm a bit confused about what 'randomly combining' audio files means here.
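One common reading of "randomly combining" is interpolation in latent space: encode two clips, mix their latent vectors with a random weight, and decode the mixture. A minimal numpy sketch of that idea (the encoder/decoder here are placeholder random linear maps for illustration, not a trained model; in practice they would be the networks trained on the Commons dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "trained" autoencoder: random linear encoder/decoder,
# purely to illustrate the shapes involved.
INPUT_DIM, LATENT_DIM = 16000, 32          # e.g. 1 second of 16 kHz audio
W_enc = rng.normal(size=(LATENT_DIM, INPUT_DIM)) / np.sqrt(INPUT_DIM)
W_dec = rng.normal(size=(INPUT_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)

def encode(x):
    return W_enc @ x

def decode(z):
    return W_dec @ z

def combine(clip_a, clip_b, alpha):
    """Mix two clips in latent space and decode the mixture."""
    z = alpha * encode(clip_a) + (1 - alpha) * encode(clip_b)
    return decode(z)

a = rng.normal(size=INPUT_DIM)             # stand-ins for audio waveforms
b = rng.normal(size=INPUT_DIM)
new_audio = combine(a, b, alpha=rng.uniform())
print(new_audio.shape)                     # (16000,)
```

Under this reading, "sampling the latent space" would mean drawing the mixing weight (or the latent vector itself) at random before decoding.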
Sep 27 2021
Sep 24 2021
Astronomical objects are structured hierarchically, so not everything is a direct instance of Q6999 (unlike scholarly articles).
Query analysis report for some vertical slices of Wikidata: Wikidata_Vertical_Analysis#Query_Analysis
Summary: Wikidata_Vertical_Analysis#TL;DR
Here is the analysis done on scholarly articles in Wikidata and WDQS queries related to them: https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Scholarly_Articles_Subgraph_Analysis
Sep 17 2021
Aug 26 2021
Aug 10 2021
In T287225#7272792, @EBernhardson wrote:This is now deployed, the first hour of processing it applies to should be 2021-08-10T14:00Z
Aug 9 2021
Aug 6 2021
In T281854#7266495, @EgonWillighagen wrote:@AKhatun_WMF, when you write "authors connected to other subgraphs", do you mean subgraphs within Wikidata (so, excluding external identifiers), or also graphs from other resources part of, for example, the Linked Open Data Cloud?
Jul 26 2021
Joseph will suggest an optimization for this task when he is back. For now, a simple .distinct() has been applied to the Spark DataFrame to facilitate analysis of the Wikidata dumps.
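For reference, the interim step amounts to dropping exact duplicate rows before analysis. A plain-Python sketch of the same operation that Spark's `.distinct()` performs on the dump rows (the sample triples are invented for illustration):

```python
# Sample (subject, predicate, object) rows as they might appear in a dump,
# including an exact duplicate. All values are illustrative.
rows = [
    ("wd:Q42", "wdt:P31", "wd:Q5"),
    ("wd:Q42", "wdt:P31", "wd:Q5"),    # duplicate row
    ("wd:Q42", "wdt:P106", "wd:Q36180"),
]

# Equivalent of df.distinct(): keep one copy of each exact row,
# preserving first-seen order.
seen = set()
deduped = [r for r in rows if not (r in seen or seen.add(r))]

print(len(rows), len(deduped))  # 3 2
```

In Spark the deduplication is of course distributed; this just shows the semantics.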
Jul 24 2021
In T281854#7062631, @Fnielsen wrote:Some of the statistics that is wanted are listed on Scholia, currently on the frontpage: https://scholia.toolforge.org/ (UPDATE: now here: https://scholia.toolforge.org/statistics)
"percentage, number of Wikidata entities that are scholarly article":
37,246,721 scholarly articles, so 37/97 ≈ 38% (roughly 40%) of Wikidata entities are scholarly articles.
Jul 23 2021
Jul 19 2021
@dcausse: Yes, just adding the prefix declarations in the Jena parser is what we want to do.
@JAllemandou: Should I add the other prefixes as well?
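One way to picture the fix: queries that omit common PREFIX declarations fail to parse, so the parser's prefix list is extended with the frequent offenders. A small sketch of the idea, prepending any missing declarations before handing the query to a strict parser (the helper is illustrative, and the IRIs are my best guess at the WDQS defaults; the canonical list lives in the WDQS configuration):

```python
import re

# Prefixes commonly omitted by users. IRIs are assumptions, not authoritative.
DEFAULT_PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "mwapi": "https://www.mediawiki.org/ontology#API/",
    "geof": "http://www.opengis.net/def/function/geosparql/",
    "gas": "http://www.bigdata.com/rdf/gas#",
}

def add_missing_prefixes(query: str) -> str:
    """Prepend PREFIX declarations for known prefixes the query uses
    but does not declare, so a strict parser can accept it."""
    declared = set(re.findall(r"(?im)^\s*PREFIX\s+(\w+):", query))
    used = set(re.findall(r"\b(\w+):\w", query)) - declared
    header = "".join(
        f"PREFIX {p}: <{DEFAULT_PREFIXES[p]}>\n"
        for p in sorted(used) if p in DEFAULT_PREFIXES
    )
    return header + query

q = "SELECT ?n WHERE { ?x foaf:name ?n }"
print(add_missing_prefixes(q).splitlines()[0])
# PREFIX foaf: <http://xmlns.com/foaf/0.1/>
```

Queries that already declare a prefix pass through unchanged, so the rewrite is safe to apply to every incoming query before parsing.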
Jul 16 2021
- For June, the average daily successful parsing rate was ~85%, ranging from 75% to 90%. Note that this only includes queries with status 200 and 500.
- 11% of the distinct queries ran into prefix-related errors. The number of distinct queries failing on each prefix is shown below. By adding the first four prefixes (mwapi, geof, foaf, gas) to the query processor's prefix list, the average daily successful parsing rate rose to ~95% (93% to 97%). A few prefixes were slightly misspelled (data instead of wdata, ref instead of wdref); these account for very few queries, but I fixed them nevertheless.
prefix_name | count
----------- | -------
mwapi       | 7419357
geof        |   54183
foaf        |   17198
gas         |   13753
wds         |    2761
wdv         |     216
fn          |      62
dc          |      50
mediawiki   |      23
wdref       |      22
wdata       |       3

Total distinct queries: 67467327
- Other errors included:
- Variable used when already in-scope: this happened when the same variable was reused in a query. Testing such queries in WDQS returns results without issue. These form 2% of the errors in distinct queries.
- Another notable error involves the WITH clause: although it runs fine in WDQS, the parser doesn't accept it. Such queries form 2.5% of the distinct queries.
Jul 13 2021
Jul 11 2021
Thanks!
Hi @akosiaris, I had to do a fresh install of my OS and lost my SSH keys. Is it possible to reset them so I can regain access? Should I post a new public key here?
Jun 23 2021
Some of the vertical analyses were done as part of familiarizing myself with Wikidata. See the findings in Wikidata_Vertical_Analysis. I will get back to this ticket when done with T282139.
Jun 22 2021
Jun 21 2021
Jun 4 2021
Jun 3 2021
Some of the information suggested for extraction through this analysis:
- Top items
- Top properties
- Top subject, object types
- Top property types
- Top wikidata vs other predicates
- Number of S, P, O that don't involve wikidata
- The aim is to find the size of the subgraph not concerning Wikidata, i.e. the size of the leaves. They are leaves because once they point to something outside of Wikidata, they are not expanded within Wikidata. Some things, like literals, are not even expandable. If we have too many leaves, we may consider using property graphs (where leaves would be listed as properties of a node).
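The "leaf" notion above can be made concrete: a triple is a leaf when its object cannot be expanded further inside Wikidata, i.e. it is a literal or a non-Wikidata IRI. A toy sketch of counting them (the triples and the prefix test are invented for illustration; a real pass would run over the full dump):

```python
# Toy triples. Objects are either Wikidata entity IRIs (expandable),
# external IRIs, or literals (both of the latter are leaves).
triples = [
    ("wd:Q42", "wdt:P31", "wd:Q5"),                       # expandable
    ("wd:Q42", "rdfs:label", '"Douglas Adams"@en'),       # literal -> leaf
    ("wd:Q42", "wdt:P856", "<https://douglasadams.com>"), # external IRI -> leaf
]

def is_leaf(obj: str) -> bool:
    """Leaf = object that is not a Wikidata entity, so traversal stops there."""
    return not obj.startswith(("wd:", "wds:", "wdv:"))

leaves = [t for t in triples if is_leaf(t[2])]
print(f"{len(leaves)}/{len(triples)} triples are leaves")  # 2/3 triples are leaves
```

The leaf ratio is exactly the "size of the subgraph not concerning Wikidata" relative to the whole.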
Jun 1 2021
Update 1 June 2021:
May 27 2021
May 25 2021
May 24 2021
Idea on how to store the SPARQL query as a list:
Let's represent the query as a list of a generic custom class QueryElem[T], where QueryElem contains elemType: String and elem: T.
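A minimal sketch of that representation in Python (mirroring the generic `QueryElem[T]` with a dataclass; the field names follow the note, and the example breakdown of a query into elements is illustrative):

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar

T = TypeVar("T")

@dataclass
class QueryElem(Generic[T]):
    """One element of a SPARQL query, tagged with its syntactic type."""
    elem_type: str   # elemType in the original note
    elem: T

# A query stored as a flat list of typed elements (illustrative breakdown).
query: List[QueryElem] = [
    QueryElem("keyword", "SELECT"),
    QueryElem("variable", "?item"),
    QueryElem("keyword", "WHERE"),
    QueryElem("triple", ("?item", "wdt:P31", "wd:Q5")),
]

print([e.elem_type for e in query])
# ['keyword', 'variable', 'keyword', 'triple']
```

Keeping the type tag alongside each element makes it easy to filter the list later (e.g. pull out all triples, or all prefixes) without re-parsing the query string.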
May 21 2021
May 20 2021
In T282130#7100051, @JAllemandou wrote:@AKhatun_WMF That's great! could you please provide some info on expected data-size in parquet (for daily data for instance)? Many thanks.