User Details
- User Since
- Aug 23 2016, 11:49 PM
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- Pfps
Wed, Apr 15
The accesses should be appropriately anonymized before being made available, of course.
Wed, Apr 8
Sun, Mar 29
If your name on the GSoC site is different from your ID in Phabricator, you need to say what your GSoC ID is. Otherwise your interaction here will not be associated with your proposal, and the proposal will be rejected.
A reminder that you need to submit your proposals to Google for them to be considered, in addition to anything done in Phabricator.
Sat, Mar 28
This project is medium difficulty and large (350 hours) in scale.
This project is of medium difficulty and large in size.
Fri, Mar 27
Hi Meghana: If you email me at pfpschneider@gmail.com we can set up a meeting in the afternoon, US East Coast time.
Hi Arina: By this time you should have been interacting with us, the potential mentors, for quite some time and have done some of the microtasks. We will evaluate your proposal even so, but iteration at this late date is not likely.
Thu, Mar 26
That's good. The next determination is whether to do a complete comparison or an incomplete one. Then there is the issue of whether to include a third or fourth engine so that compliance can be estimated.
Wed, Mar 25
I don't think that the problem is solvable in general. You can't even rerun queries and always expect the same results because of updates.
Mar 23 2026
Hi Meghana: The deadline for proposals is next week so you are starting very late in the process. By now you should have tried out some of the suggestions in the initial comment. If you want to proceed we can set up a call as described in my earlier comment.
Mar 19 2026
@Olea That's interesting (to me). Feel free to contact me at pfpschneider@gmail.com
Mar 18 2026
I'm not looking for a daily or even weekly timeline in your proposal. What I am looking for is a breakdown of the overall task into a few pieces that can be tracked. I am also looking for one or more pieces that can be done by the midterm of the coding period. If your proposal is selected we will be using the familiarization period to further refine the work to be done.
As far as I am concerned, the minimal deliverable at the end of the project is a playable game that implements fixes to some kinds of constraint violations. It is in your interest to have a proposal that can support this minimal deliverable, and also has significant optional parts.
The guidance from Google in https://google.github.io/gsocguides/student/writing-a-proposal is to have deliverables and timelines in your proposals. This is a good idea, but it is possible to go too far in this area. What we are looking for is a sense that you can break down the overall project into several pieces, each probably including design, coding, and documentation. What is important is a plan to have something that can be evaluated at the midpoint. That doesn't have to be a full system, but there should be some coding involved.
This project has areas where you can decide how much or how little to do and still have a working result. That may make it different from other GSOC projects. It is a good idea to make pieces of your proposal optional, so that at the end you (and we) can claim victory even if everything in the proposal is not implemented.
In the end, a big part of GSOC is to get people interested in open-source projects. In my view it's a win for GSOC if you end up doing significant open-source work in the future, even if not all of your proposal ends up being implemented.
Mar 17 2026
Sorry all. Due to some other issues, I have not been adequately responsive. But there is light at the end of the tunnel. I'll get through all the backlog today.
Mar 11 2026
I can check triple counts on my benchmark machine when the current benchmark run finishes, which may take another day or so.
Also, I never used munged files with QLever. I don't know whether that would slow down or speed up QLever.
Mar 10 2026
The parsing part of the QLever ingestion process can be heavily parallelized if the input file or files are well-behaved. Because the parser cannot verify this on its own, a special flag must be set to enable the parallel mode. The Wikidata RDF dump files are well-behaved.
https://www.wikidata.org/wiki/Wikidata:Scaling_Wikidata/Benchmarking/Virtuoso#Run_the_Virtuoso_server_and_load_the_Wikidata_files_into_the_server already notes the slowdown in ingestion rate with Virtuoso as the process proceeds, and postulates a cause.
Mar 9 2026
Mar 6 2026
The reason that I opened this ticket is that the team has been poor at responding to comments in the Wikidata wiki. So I looked in https://www.mediawiki.org/wiki/Wikidata_Platform#How_to_contact_us, and the only contact methods there say to open Phabricator tickets. So I did.
Mar 5 2026
So who can I ask to see the interaction with the Privacy and Security team?
One problem I have here is that I haven't seen any of the interaction with the privacy and security people so I don't know what their requirements for anonymization are.
My belief is that this is a result of slow processing of transitive closure operations in Blazegraph and is likely to not be a problem in optimized SPARQL engines.
I just learned that the initial work to anonymize the queries was supported by the Wikimedia Foundation, https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries
I am disappointed in this abrupt ending, particularly after nearly four months.
Mar 4 2026
This is the difference between join and left join. Regular joins (butting two triple patterns together) are associative and commutative, so the join order doesn't matter. Left joins (OPTIONAL) are associative (I think) but not commutative, so the join order does matter. So the order of the OPTIONALs matters. For official information you need to dig deeply into https://www.w3.org/TR/sparql11-query/. To extract information from that document sometimes requires a good grounding in the theory of querying.
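A minimal sketch of the non-commutativity, using hypothetical data (the data and prefixes here are illustrative, not from any real dataset):

```sparql
# Hypothetical data:
#   :a :p 1 .
#   :b :q 2 .

# Left join one way: the solution {?x = :a, ?y = 1} from the first
# pattern is kept, with ?z left unbound.
SELECT * WHERE { ?x :p ?y OPTIONAL { ?x :q ?z } }
#   result: ?x = :a, ?y = 1, ?z unbound

# Swap the operands: now the result is the solution from the other
# pattern, {?x = :b, ?z = 2}, with ?y left unbound.
SELECT * WHERE { ?x :q ?z OPTIONAL { ?x :p ?y } }
#   result: ?x = :b, ?z = 2, ?y unbound
```

The two queries return entirely different rows, which is exactly why the order of OPTIONALs cannot be freely rearranged.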
Mar 3 2026
@Harikrshnaa I'll take a look this week.
Mar 2 2026
For a more realistic example, try https://qlever.dev/wikidata/xAiP0y then https://w.wiki/J5kE then https://w.wiki/J5kT then https://w.wiki/J5kH
The SCHOLIA replacement also works better if the label variable already has a value.
SPARQL defines what the result of evaluation is, just like in a programming language. Implementations are free to do anything so long as the specified result is reached. The above SPARQL must produce "only" de labels if there are any, according to the definition of SPARQL.
Moreover, the SCHOLIA work was already reported to the team in https://phabricator.wikimedia.org/T414453
It would be worthwhile for you to have more knowledge of the community activities in this area. Your team knows that the SCHOLIA queries have been rewritten for QLever, so your team should have either reached out to them to find out what they did or looked at the changes that they made. There you would have seen a better transformation, which, instead of several OPTIONALs on different variables followed by a BIND, uses a sequence of OPTIONALs on the same variable. Further, in some SCHOLIA queries there is an initial BIND that eliminates the problems that happen if the variable is unbound.
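A sketch of the OPTIONAL-chain pattern as I understand it (the entity and property IRIs are illustrative, and the standard wd:/wdt:/rdfs: prefixes are assumed):

```sparql
# Each OPTIONAL fills in ?itemLabel only when it is still unbound: a
# solution that already carries a label is incompatible with a
# different label value, so the left join keeps it unchanged.
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  OPTIONAL { ?item rdfs:label ?itemLabel FILTER(LANG(?itemLabel) = "en") }
  OPTIONAL { ?item rdfs:label ?itemLabel FILTER(LANG(?itemLabel) = "mul") }
  OPTIONAL { ?item rdfs:label ?itemLabel FILTER(LANG(?itemLabel) = "de") }
}
```

Because the same variable is reused, the fallback order is expressed directly by the order of the OPTIONALs, with no trailing BIND needed to pick a winner.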
Mar 1 2026
I strongly suggest reaching out to the community to find out the current best practices for replacing the label service.
It appears from https://gitlab.wikimedia.org/repos/wikidata-platform/wdqs-query-proxy/-/blob/main/src/test/java/org/wikimedia/wdqs/WikibaseLabelParserTest.java?ref_type=heads that the replacement being evaluated is much less than ideal.
It would be useful to also rewrite named subqueries, as other SPARQL systems may not handle them either. And also remove Blazegraph optimization hints, which will interfere with results in other systems.
Is there a specification of exactly what the label service does?
Feb 27 2026
Feb 25 2026
One more thing that would be useful, if possible, is whether the query was syntactically legal according to Blazegraph. I could get this information by running the query through Blazegraph, but if this information is in the log I could use that instead.
Feb 24 2026
I want to mock up a server to investigate its load, so all I need for that is the anonymized query and a relative (or absolute) timestamp. User agent categories could be useful to better estimate future loads. Anonymizing string literals will be a problem for me, but I understand if this has to be done.
Looking through the code there appears to be use of SPARQL (for at least distinct values). Is this the main place where Wikidata itself depends on the WDQS? How does the split of the WDQS factor into the use of SPARQL? Does this mean that the distinct values constraint can't work on scholarly articles (or, actually, at all if it misses values from the scholarly graph)?
So the feature includes complex constraints?
Is there a good description of how the constraint system is implemented, preferably including the role of third-party tools?
Given that KrBot is a third-party closed-source (?) tool, I would be happy having it replaced.
Feb 23 2026
What is the difference, if any, between the "quality constraints feature" and Wikidata property constraints? Having some notion of what is being investigated would be useful if the community is to provide useful input into the process.
I'm confused as to just what this ticket is about.
I just noticed the paper https://arxiv.org/abs/2602.14594
Feb 18 2026
It turns out that fallback functionality has become more important with the introduction of the mul "language". If I want English labels I need to have mul as a fallback or I may miss many labels.
Feb 17 2026
Thanks.
Feb 15 2026
It is very frustrating to have this task languish without any way of contacting the team that appears to be blocking progress.
The label service is indeed pervasive. How much are the mwapi services used?
Jan 22 2026
Jan 21 2026
Jan 20 2026
@Hridyesh_Gupta Thank you for your interest. It might be a bit early to start on the microtasks, as the period where potential contributors interact with mentors isn't for a while. As well, the topic will be getting some updates over the next little while.
Jan 14 2026
How can this be escalated?
Jan 12 2026
OK, there is a newer newsletter. But that's not a newer version of the information in the November newsletter, as far as I can tell. The wording in the inactive banner contains: "Either the page is no longer relevant or consensus on its purpose has become unclear." I don't think that either of these is the case, and those who see the wording are likely to be misled.
This project is being revived for the 2026 GSoC.
https://www.wikidata.org/wiki/Wikidata:Wikidata_Platform_team/Newsletter_November_2025 is marked as inactive, with rather strong warnings about not being relevant. Is this really the case?
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update mentions this task so maybe posting this request here will be effective.
Jan 10 2026
Dec 19 2025
Having a non-split query service available to interested users is going to be useful during the period from the end of the legacy service to the time that the new WDQS is available. This alternative service probably doesn't need the same uptime characteristics as even the WDQS.
Dec 5 2025
It's well past 11/21. Has there been any progress?
Nov 10 2025
Oct 7 2025
The evaluation of QLever on doubled Wikidata has some decent data to report on a preliminary basis. See https://www.wikidata.org/wiki/Wikidata:Scaling_Wikidata/Benchmarking/Phase_2_Preliminary_Report for the report.
Oct 2 2025
My benchmarking used a machine with 16 cores (Ryzen 9950X) and 192GB of main memory, but I only ran one query at a time. Having lots of main memory is useful for measuring throughput with multiple queries running in parallel.
If you want to play around with loading Wikidata into QLever a 16-core machine is very useful as it can considerably cut down loading time.
Oct 1 2025
I'm looking at the dumps from 20241028 and thereabouts (because those are the ones that I have benchmark data about and I'm doing some more benchmarking). Maybe some of the prefixes have changed since then and only data: is problematic.
Aug 29 2025
Feb 27 2025
I added links to the Phabricator pages of the mentors.
I added pointers to several Phabricator tasks related to the Distributed Game. These links can be used to find games implemented in the Distributed Game and other information that would be useful in the microtasks and throughout the project.
Feb 25 2025
Feb 21 2025
@Hanna_Bast Thanks for the detailed comments. I have updated the benchmarking code, which does output TSV files that are later analyzed to produce statistics. Many of the benchmarks are run in three variants - as-is, with only counts returned, and with DISTINCT added. The benchmarking code also records a bit of information about the output - counts for multiple results and a single value for single results. The latter provided the first indication that different engines produce different results for numeric and GeoSPARQL values.
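For illustration, the three variants of a benchmark query might look like this (the exact wrapping used by the benchmarking code is an assumption on my part; the query itself is a made-up example):

```sparql
# As-is: the original query.
SELECT ?x WHERE { ?x wdt:P31 wd:Q146 }

# Counts-only variant: same pattern, but only the result count is
# returned, which removes result-serialization time from the measurement.
SELECT (COUNT(*) AS ?count) WHERE { ?x wdt:P31 wd:Q146 }

# DISTINCT variant: adds deduplication, which exercises a different
# part of the engine.
SELECT DISTINCT ?x WHERE { ?x wdt:P31 wd:Q146 }
```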