User Details
- User Since
- Jul 8 2019, 2:27 PM
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- Thadguidry [ Global Accounts ]
May 27 2024
One of our plans with DB2Rest is to provide a simple instant Recon API for database tables.
It will be a web app, like OpenRefine, and will allow a user to instantly create a Recon API from local files or an existing database.
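As a rough sketch of the shape (not the actual DB2Rest implementation), a minimal reconciliation endpoint in Python/Flask might look like this; the table, columns, and scoring are hypothetical, and the response follows the keyed-batch convention of the OpenRefine Reconciliation Service API:

```python
# Minimal sketch of a Recon API over a database table (hypothetical schema and scoring).
import json
import sqlite3
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/reconcile", methods=["GET", "POST"])
def reconcile():
    # OpenRefine sends a "queries" parameter: {"q0": {"query": "..."}, "q1": {...}, ...}
    queries = json.loads(request.values.get("queries", "{}"))
    conn = sqlite3.connect("example.db")  # hypothetical local database
    results = {}
    for key, q in queries.items():
        rows = conn.execute(
            "SELECT id, name FROM people WHERE name LIKE ? LIMIT 5",
            (f"%{q['query']}%",),
        ).fetchall()
        results[key] = {
            "result": [
                {"id": str(r[0]), "name": r[1], "score": 100.0,
                 "match": r[1].lower() == q["query"].lower()}
                for r in rows
            ]
        }
    return jsonify(results)
```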
Jan 12 2024
Jun 14 2023
I wanted to give my support to adding the AVIF format, especially to allow uploading AVIF images to Commons. In fact, that should be a primary use case more than any other.
AVIF is supported now in all the major operating systems and image software.
Agree with @Trougnouf that we should think of the other side of things, storage and convenience for users. HDR and 8K+ ranges are supported by AVIF.
Our wiki article https://en.wikipedia.org/wiki/AVIF is excellent for getting a sense of the support available now in 2023, and a few other highlights are mentioned on a Mozilla dev page https://developer.mozilla.org/en-US/docs/Web/Media/Formats/Image_types#avif_image
Aug 22 2022
@Spinster @Vojtech.dostal I would say that non-printable chars (hidden) should be non-reconcilable. The reasoning is that hidden characters are reserved for machine use rather than human use. Regardless, in OpenRefine we hope to finish this longstanding issue https://github.com/OpenRefine/OpenRefine/issues/1286 so that folks can have a visual sense of when data quality might suffer for reconciliation. BUT we also need a hidden characters facet https://github.com/OpenRefine/OpenRefine/issues/5207 added to OpenRefine, which could be done in a few hours and submitted as a PR. Then, as a best practice, the hidden characters facet would be a pre-step to any reconciliation workflow.
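For reference, the check behind such a facet can be tiny; a rough Python sketch (the set of categories I treat as "hidden" here is my own assumption):

```python
import unicodedata

def has_hidden_chars(value: str) -> bool:
    """True if the value contains control, format, or other non-printable characters."""
    for ch in value:
        cat = unicodedata.category(ch)
        # Cc = control, Cf = format (zero-width joiners, BiDi marks, etc.),
        # Co/Cn = private use / unassigned -- all invisible to a human reader.
        if cat in ("Cc", "Cf", "Co", "Cn"):
            return True
    return False

print(has_hidden_chars("Paris\u200e"))  # True: contains a left-to-right mark
print(has_hidden_chars("Paris"))        # False
```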
Apr 11 2022
@Dominicbm you bring up a good point that I also had concerns about... if the upload process is not synchronous/simultaneous with attaching structured data, then we end up with orphaned/incomplete uploaded files. Even though we would want to be efficient with batch uploading files, I think the reality is that the "batch" would be considered "a series of individual file uploads with attached structured data". The whole batch could fail, or only 1 of the individual files could fail, in which case the batch would be considered partially completed.
Feb 23 2022
Ah, ok
Oh you likely mean only the Wikimedia language code https://www.wikidata.org/wiki/Property:P424 ?
@Spinster Wouldn't that be a bit too limiting, potentially, for the future if any new properties are minted? I'm not sure myself, so just asking about the full context here. My thought is that maybe we could allow options for further constraints in this dialog, like properties that are subclassed under https://www.wikidata.org/wiki/Q18616084 or https://www.wikidata.org/wiki/Q20824104 ? Maybe there could be an extra toggle filter in this dialog that says "Only languages" and filters to only those properties under those 2 subclasses?
Feb 6 2022
Hi @AndySeaborne What are the latest benchmarks for loading the Wikidata "all" and "truthy" dumps with the Jena 4.4.0 release and the new TDB2 xloader with the "--threads" argument? I noticed the release notes said this:
Nov 19 2021
@BenAtOlive I think for bikeshedding or hand-waving discussions, you can just start a new discussion thread in Oxigraph's GitHub Discussions (not Issues). Here: https://github.com/oxigraph/oxigraph/discussions
As someone who has "been there, done that" (even with Apache Geode)... I can tell you that data locality is very important when you want to maximize performance. But if the data is maintained as distributed, then the only way to squeeze out improved performance is to temporarily restore that data locality, and that sometimes means temporary or ad hoc data replication... which has a cost of its own but isn't insurmountable.
Sep 15 2021
@Addshore That's what I figured. :-) This issue did feel old and sort of in a dustbin. Agree it should be closed.
Sep 10 2021
Aug 31 2021
@Tpt Looks great! The ROADMAP file was a suggested alternative to the Milestones; sorry, I didn't make that clear. I much prefer grouping or tagging issues against Milestones as you have done! You have the right idea regarding a single source of truth, and it's exactly best practice! You're a natural.
Aug 26 2021
Hi @Tpt Can you elaborate more in your Milestones and create more Milestones as necessary for your future vision? For example, what do you mean by "no storage format stability for now", what does that really mean for users, and what are you thinking about long term towards solving it?
Maybe a ROADMAP.md file in the repo would be good to add as a quick high-level overview, which then links to the Milestones (and perhaps to more future-vision Milestones, even if 2 years away, or just a dream but wrapped with practicality).
https://github.com/oxigraph/oxigraph/milestones
Aug 22 2021
I'd suggest adding replica shards (copies of primary shards), which both ensure redundancy to protect against failure and vastly increase the capacity for read requests such as searching, like Adam's entity term lookup use case. You can change the number of replica shards at any time without affecting indexing or query operations. https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html
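For example, bumping the replica count is a single dynamic settings call against the index; a quick sketch (the index name and replica count here are made up):

```python
import requests

# number_of_replicas is a dynamic index setting, so this needs no reindex or downtime.
resp = requests.put(
    "http://localhost:9200/wikidata_terms/_settings",  # hypothetical index name
    json={"index": {"number_of_replicas": 2}},
)
print(resp.json())  # {"acknowledged": true} on success
```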
Aug 19 2021
+1 for Oxigraph. @Tpt has been putting in a ton of good effort, research, features, and stability. I'm sponsoring him on GitHub as well for his effort.
As it's being developed in Rust, it automatically takes advantage of data streaming in places that utilize intrinsic functions (forwarded through LLVM compiler IR) in CPUs. Java 17 is just now getting into a better position with its new Vector API. On top of that, the RIO parser is one of the fastest RDF parsers I've seen run on my system, which he also graciously maintains in Rust.
Aug 16 2021
We'll also want to improve the Help:Ranking page once this proposal task is implemented.
Agree generally with this proposal's assertions. It makes sense also from a data quality perspective, and since we are actively adding new tools to improve our data quality, having a new "outdated" rank to represent "once upon a time this was factual" would be very convenient and make it easier to arrive at community consensus. In fact, some blockchains form consensus in the exact same fashion, e.g. the Solana blockchain sorta does the same thing to gain speed and efficiency (otherwise getting consensus on details slows it down)... it cares about the fact that something occurred or has changed or is outdated, but the details of when, where, and how can be deduced or ascertained later.
Aug 15 2021
Yes, those are the steps to reproduce. The general UX is that there's a context shift that the user didn't ask for yet, and the users are not given any clue about it happening. It's a bad user experience and something that all the users in T10640 are experiencing and asking about. The flow of how to get to an Advanced Search for/from Manifests specifically is the core of the problem. Using the search field in the upper right corner is quite expected, and then users see the dropdown with the Advanced Search option... ultimately clicking it and being led down into the wrong hole.
To help the team, I've mocked up perhaps a better menu representation that would always display a Global Advanced Search underneath the context of Current Application - Advanced Search? (Diffusion, Manifest, etc. as Current Application context, but always also displaying the Global Advanced Search)
I see the documentation, but it doesn't match the interface. I do not see a Date field as mentioned in the documentation. What else am I missing?
Aug 10 2021
Hi @aidhog Aidan, in my opinion I would say "NO, not a good test-case for this need". And the only reason is this... it's ASCII only (chars < 128) and doesn't let us ensure proper load handling for all data in all languages, i.e. multilingual data beyond ASCII (code points > 127) such as multi-byte UTF-8, etc.
DBLP.xml is however a great test-case for any SAX parser, as I can see in its PDF https://dblp.uni-trier.de/xml/docu/dblpxml.pdf
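To make the point concrete, here's a plain Python sketch that checks whether a dump actually contains any non-ASCII text worth load-testing against (the file path is hypothetical):

```python
def count_non_ascii_bytes(path: str, chunk_size: int = 1 << 20) -> int:
    """Count bytes outside the ASCII range (>= 0x80) in a file."""
    count = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            count += sum(1 for b in chunk if b >= 0x80)
    return count

# dblp.xml is effectively ASCII-only, so this should report (near) zero,
# which is why it can't exercise multilingual / multi-byte UTF-8 handling.
print(count_non_ascii_bytes("dblp.xml"))
```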
Aug 2 2021
Someone needs to add a Documentation task to this.
I assume all the new options available and perhaps a reference link to this ticket would go somewhere in here? https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format
Jul 30 2021
Jul 22 2021
Hmm, this is missing a detail of how your entity data sets (or the community's) are likely formatted (either from some other system or program, or manually created by database exports or software tools).
- What import formats are likely to be wanted for bulk import into Wikibase? Simple CSV tables? JSON? RDF/XML? Or directly any of the formats that Rio https://github.com/oxigraph/rio currently provides (RDF-star is one of the newest it now supports)?
Jul 15 2021
This reaches beyond just GLAM and cultural heritage and impacts scientific organizations as well. Please add my support for this.
Jul 10 2021
I'd like to see this made a bit higher priority? It seems it would be fairly trivial to implement with a good impact. One case I have seen repeated over and over in the Lexicographical community is this particular constraint, and the same explanation given again and again when folks hit the constraint but are left wondering what it really means. Here's an example where I've given "usage example" a constraint clarification text with that explanation we repeat so often to folks in the Telegram chat.
https://www.wikidata.org/wiki/Property:P5831
May 19 2021
I thought there was already a standard around some "diff" format, like the one DoubleCheck uses between MediaWiki revision table rev_ids? I recall using Wikiloop DoubleCheck, which has an interesting interface to expose a portion of an edit for judgement and rollback.
It probably makes sense to pull someone from their team or others into this conversation as well, to explore ideas on displaying merge conflict resolution and what formats lend themselves well to that.
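For what it's worth, one existing building block is the MediaWiki action=compare API, which takes two rev_ids and returns a diff; a rough sketch (the revision ids are placeholders):

```python
import requests

# action=compare returns the rendered diff between two revision ids.
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "compare",
        "fromrev": 1111111111,  # placeholder rev_id
        "torev": 1111111112,    # placeholder rev_id
        "format": "json",
    },
)
# With the default format version the diff body (HTML) is under the "*" key.
print(resp.json()["compare"]["*"])
```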
Mar 1 2021
Jan 9 2021
To Reproduce:
- Create a new Lexeme
- Lemma: type chevrette
- Language of Lemma: type cajun and look at the dropdown listing
- Notice that Louisiana French (Q3083213) is at the bottom of the dropdown list instead of at the top.
Jan 8 2021
Oct 22 2020
In Freebase, we offered word, phrase, and full (exact match). I think the wbsearchentities API could offer something similar, although with a slight indexing cost.
Besides name, we also supported alias{full}. Using alias: matched both name and aliases; using name: matched only on name.
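To illustrate, here is a rough, hypothetical mapping of those three Freebase-style modes onto Elasticsearch query types (the index/field names are invented, not the actual Wikidata search schema):

```python
# Rough mapping of the three modes onto Elasticsearch query types.
word_query   = {"query": {"match":        {"label": "douglas adams"}}}   # any word matches
phrase_query = {"query": {"match_phrase": {"label": "douglas adams"}}}   # words in order
exact_query  = {"query": {"term": {"label.keyword": "Douglas Adams"}}}   # full/exact match

# A name-or-alias variant ("alias:" in Freebase terms) could search both fields:
alias_query = {"query": {"multi_match": {"query": "douglas adams",
                                         "fields": ["label", "aliases"]}}}

# Each of these would be POSTed to the search index's _search endpoint.
print(word_query, phrase_query, exact_query, alias_query, sep="\n")
```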
Oct 17 2020
In that case, then just on that wbsearchentities page, I think having a single sentence about "uselang" and providing the link to the other page would help like hell for all those users that have asked in the last year alone. As an institution holding the world's knowledge in many languages, "uselang" is probably the most important parameter to let users know about, especially given the frame of reference here: we are talking about THE Search API that so many eventually land upon using and building around.
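For illustration, a quick sketch of a wbsearchentities call where the display language is varied; my reading of the language/uselang split here is an assumption, and the search term is arbitrary:

```python
import requests

def search_entities(term: str, uselang: str):
    # "language" selects which labels are searched;
    # "uselang" (as I understand it) sets the language of the returned labels/descriptions.
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": term,
            "language": "en",
            "uselang": uselang,
            "format": "json",
        },
    )
    return [(m["id"], m.get("label"), m.get("description")) for m in resp.json()["search"]]

print(search_entities("Douglas Adams", "en"))
print(search_entities("Douglas Adams", "de"))  # display strings localized to German
```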
Oct 11 2020
Sep 22 2020
Sep 2 2020
@dcausse Dunno if this might help, but could a simple window work, or one where you use a KeyedProcessFunction on a KeyedStream? If the stream is unkeyed (or initially so), then the other option might be just finding the patterns in the stream, and CEP would help.
- the output of this is a simple event without any data saying: do a diff between rev X and Y, fully delete entity QXYZ, ...
Aug 22 2020
May 12 2020
Hi all - My personal opinion, and that of a few other experts, would be to embrace DRY (Don't Repeat Yourself - or others) and simply allow introduction of the W3C standards for Tabular Data:
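Assuming the reference is to the CSV on the Web (CSVW) family of specs, a minimal metadata document for a table might look roughly like this (the file and column names are hypothetical):

```python
import json

# Minimal CSVW-style metadata for a hypothetical CSV file.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "cities.csv",
    "tableSchema": {
        "columns": [
            {"name": "city",       "titles": "City",        "datatype": "string"},
            {"name": "population", "titles": "Population",  "datatype": "integer"},
            {"name": "wikidata",   "titles": "Wikidata ID", "datatype": "string",
             "valueUrl": "http://www.wikidata.org/entity/{wikidata}"},
        ],
        "primaryKey": "city",
    },
}
print(json.dumps(metadata, indent=2))
```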
May 7 2020
If it helps or is needed, the query that you can use is here:
Apr 9 2020
@Lydia_Pintscher Oops! You forgot to include the main one also !!! .... Equivalent Property P1628 :-)
Feb 10 2020
Is there anything inherently wrong, technically infeasible, or undesirable if an id used 2 letters? ES45 versus E45?
Nov 12 2019
Thanks, updated ticket.
Nov 7 2019
@dcausse Yes, I mean running a full text search. "Simple search" is a term used by Blazegraph sometimes. Fulltext searches are cheap when you index terms in multiple ways. Why would you not want to index terms in multiple ways? Freebase was able to leverage this quite easily with Lucene/Solr indexes and provided great results in its search box on each character typed. Are you hurting for RAM to store the cached inverted indexes, or is it something else with the infra? My quick calculation for 1 simple in-memory index over all the terms (not just label/alias) in Wikidata: current stats say 78 billion terms x 10 bytes per term = roughly 780 GB. Does Wikidata not have hardware to support multiple indexes? 1TB RAM (16x 64GB)
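Spelling out that back-of-the-envelope math (using the rough figures above):

```python
terms = 78e9            # rough term count cited above
bytes_per_term = 10     # rough average size assumed above
total_bytes = terms * bytes_per_term
print(f"{total_bytes / 1e9:.0f} GB")          # ~780 GB for one in-memory index
print(f"16 x 64 GB of RAM = {16 * 64} GB")    # 1024 GB, enough headroom
```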
Nov 6 2019
No problem @Tgr
What slows me down is not having an efficient quick way to filter out properties whose labels include " ID" (i.e. keep only the non-authority properties) in the various property explorer tools, such as the excellent https://tools.wmflabs.org/prop-explorer/
Perhaps other property explorer tools have a quick filter mechanism for this? I tried a simple regex in the Label column filter, but it doesn't work. https://github.com/stevenliuyi/wikidata-prop-explorer/issues/1
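For reference, the kind of filter I had in mind is roughly this, shown as a plain Python sketch over hypothetical label strings:

```python
import re

labels = ["VIAF ID", "instance of", "ORCID iD", "GND ID", "date of birth"]

# Keep only properties whose label does not contain " ID" (rough proxy for non-authority).
non_authority = [l for l in labels if " ID" not in l]
print(non_authority)  # ['instance of', 'ORCID iD', 'date of birth']

# The equivalent single regex for a column filter would use a negative lookahead:
pattern = re.compile(r"^(?!.* ID).*$")
print([l for l in labels if pattern.match(l)])  # same result
```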
Oct 29 2019
TODO: Just wanted to highlight that once decisions are made... please ensure to update the Glossary item ! Currently it reads:
Jul 8 2019
@dbarratt In the Wikibase ontology, I could not find those properties in the OWL document returned. Sorry, I'm getting caught up with your schema layouts as fast as I can :-) I expected my parser to retrieve information about their description, range, and domain. I do see the class "Statement", however.
Something is amiss with these... not found.
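For reference, the kind of check my parser was doing is roughly this (a sketch with rdflib; the ontology URL and the response format are assumptions on my part):

```python
from rdflib import Graph, RDFS

g = Graph()
# The Wikibase ontology document; URL assumed from the ontology prefix http://wikiba.se/ontology#
g.parse("http://wikiba.se/ontology", format="xml")

# List every property that declares an rdfs:domain and rdfs:range in the returned OWL document.
for prop, _, domain in g.triples((None, RDFS.domain, None)):
    for _, _, rng in g.triples((prop, RDFS.range, None)):
        print(prop, "domain:", domain, "range:", rng)
```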