Mon, Nov 11
Per-item data are mostly independent, so different items could easily be processed in parallel. However, that would require splitting the incoming data per item (note that item data do not necessarily have the item URI as subject: there are statements, references, values, sitelinks, etc.)
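To illustrate the splitting problem: statement subjects (wds:) embed the item ID and can be routed back to an item, but reference and value node subjects are content-addressed hashes shared between items, so subject-based routing alone cannot partition them. A minimal sketch, assuming the standard WDQS URI layout (class and method names here are illustrative, not from the actual Updater code):

```java
import java.util.Optional;

public class TripleRouter {
    private static final String ENTITY = "http://www.wikidata.org/entity/";
    private static final String STATEMENT = "http://www.wikidata.org/entity/statement/";

    /**
     * Try to attribute a triple to an item by its subject URI.
     * Returns empty for reference/value nodes, whose subjects are
     * content-addressed hashes potentially shared between items.
     */
    public static Optional<String> itemFor(String subject) {
        if (subject.startsWith(STATEMENT)) {
            // Statement URIs look like wds:Q42-<uuid>; the item ID is
            // recoverable from the part before the first dash.
            String local = subject.substring(STATEMENT.length());
            int dash = local.indexOf('-');
            return Optional.of(dash > 0 ? local.substring(0, dash) : local);
        }
        if (subject.startsWith(ENTITY)) {
            return Optional.of(subject.substring(ENTITY.length()));
        }
        // wdref:/wdv: nodes (and anything else) cannot be routed by subject.
        return Optional.empty();
    }
}
```

The statement-prefix check must come before the plain entity-prefix check, since one prefix is a prefix of the other.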
Wed, Nov 6
Actually these WDQS servers are merely reading RCStream
Sat, Nov 2
Looks like css/js artifacts aren't deployed correctly.
Thu, Oct 31
Tue, Oct 15
Our fork is in https://github.com/wikimedia/wikidata-query-blazegraph
Oct 7 2019
Sep 30 2019
The issue is that by default Blazegraph uses the tertiary ICU collation level, IIRC (I can check the specific one), so it ignores some differences like that one, generating the same term key for both strings. It could be switched to identical, but that would generate much larger term keys, which would hurt performance and increase storage size.
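A quick demonstration of collation strengths using the JDK's java.text.Collator (Blazegraph's term-key generation uses ICU directly, but the strength levels behave the same way): each level distinguishes more, and identical strength breaks the remaining ties by decomposed code point order at the cost of longer sort keys.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationDemo {
    public static int compareAt(int strength, String a, String b) {
        Collator c = Collator.getInstance(Locale.ROOT);
        c.setStrength(strength);
        return c.compare(a, b);
    }

    public static void main(String[] args) {
        // PRIMARY ignores accents and case: "e" compares equal to "é".
        System.out.println(compareAt(Collator.PRIMARY, "e", "\u00e9") == 0);
        // SECONDARY sees accents but not case: "e" compares equal to "E".
        System.out.println(compareAt(Collator.SECONDARY, "e", "E") == 0);
        // TERTIARY also sees case, but still treats some code point
        // differences as ignorable; IDENTICAL would catch those too.
        System.out.println(compareAt(Collator.TERTIARY, "e", "E") != 0);
    }
}
```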
Sep 26 2019
@seav Please explain the case for millimeter-precision coordinates. Which objects in Wikidata have locations known with millimeter precision?
4 digits is about 11 m precision; 5 digits is about 1.1 m. We could bump the max to 5 digits, I presume, but I am not sure which coordinates really have that many significant digits, or whether those coordinates are indeed precise to within a meter or merely claim to be. Changing it wouldn't be hard, though: just change COORDINATE_PRECISION in GlobeCoordinateRdfBuilder from 4 to 5.
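A back-of-the-envelope check of those precision figures, plus the kind of rounding a COORDINATE_PRECISION constant implies. The constant name mirrors the one mentioned above, but the rounding code is an illustration, not the actual GlobeCoordinateRdfBuilder implementation:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class CoordPrecision {
    static final int COORDINATE_PRECISION = 4;
    // Mean meridian length per degree of latitude, ~111.32 km.
    static final double METERS_PER_DEGREE = 40_075_017.0 / 360.0;

    /** Size in meters of the last retained decimal digit of latitude. */
    public static double precisionMeters(int digits) {
        return METERS_PER_DEGREE * Math.pow(10, -digits);
    }

    /** Round a coordinate component to the configured number of digits. */
    public static BigDecimal round(double degrees) {
        return BigDecimal.valueOf(degrees)
                .setScale(COORDINATE_PRECISION, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.printf("4 digits ~ %.1f m%n", precisionMeters(4)); // ~11.1 m
        System.out.printf("5 digits ~ %.1f m%n", precisionMeters(5)); // ~1.1 m
        System.out.println(round(52.516266)); // 52.5163
    }
}
```

(Longitude precision additionally shrinks by the cosine of the latitude, so these are worst-case figures at the equator.)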
Sep 19 2019
Probably related to the other issues about Unicode and the ICU collation level. I presume the collation level currently enabled in Blazegraph conflates these two.
Sep 9 2019
I don't think it's worth bothering with depooling. Unless the number of affected items is very large, it should be quick enough that nobody really notices.
Sep 7 2019
This may happen because value nodes are not updated when data is updated (since they are supposed to be immutable). So if some bad data sneaked in while the problem was there, the bad value nodes (and possibly reference nodes, since they behave the same way) are still around. The best way to fix it would be:
Sep 4 2019
Evaluating both Virtuoso and other solutions (like JanusGraph) would require that. @Gehel should know the details.
Aug 29 2019
Looks like the new Updater actually handles it better, but we need to verify that.
Loading 1 hour 25 minutes of updates from 201908010000 under both updaters shows no differences except ones attributable to edits (since we always load the latest version, even for old changes). So this first test seems to be a success.
Procedure for comparing journals:
Aug 28 2019
Testing on wdqs-test shows the new Updater is 2x faster than the old one. I didn't verify validity yet, but the speed looks good :)
Mac OS 10.13.6 (High Sierra), Firefox 68.0.2
Aug 27 2019
After the patch is merged and deployed, the categories DB needs to be reloaded according to the procedure here: https://wikitech.wikimedia.org/wiki/Wikidata_query_service#Categories_reload_procedure
Looks like the DELETE SPARQL clauses that the daily dump generates are wrong... Weird that I haven't noticed it before.
Looks like there's some problem with deletion handling. E.g. https://en.wikipedia.org/wiki/Category:Delaware_elections,_2006 has been deleted and is listed as deleted in the enwiki-20190826-daily.sparql.gz dump, but it is still present in the database. Strangely enough, the log shows the file was successfully processed, yet somehow the results are not there. Will investigate further.
I've created T231390: MWAPI can only match one result per page to handle the multiple-values-in-one-result issue, so that we have a clearly focused task.
RecentChanges has many flaws (for example, it is not a reliable stream, since timestamps are not sequential, and it can't be queried by RC ID; see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/302368), but as I understand it, it is the only way to get a change stream for a wiki without setting up Kafka, etc. So I imagine that until we get containers with all that stuff working, we're stuck with RC as the only publicly available way to get changes.
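For reference, a minimal sketch of what polling list=recentchanges looks like. The query parameters are the real action API ones; the wiki URL in the usage is illustrative, and per the caveats above the timestamp-based resume point cannot be made fully reliable:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class RecentChangesPoller {
    /** Build a list=recentchanges request URL, resuming from rcstart. */
    public static String buildUrl(String apiBase, String rcstart) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("action", "query");
        params.put("list", "recentchanges");
        params.put("rcdir", "newer");           // oldest first
        params.put("rcstart", rcstart);         // resume point (timestamp)
        params.put("rcprop", "title|ids|timestamp");
        params.put("rclimit", "500");
        params.put("format", "json");
        return apiBase + "?" + params.entrySet().stream()
                .map(e -> e.getKey() + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }
}
```

Usage would be something like `buildUrl("https://www.wikidata.org/w/api.php", lastSeenTimestamp)`, fetching repeatedly and advancing the timestamp; since timestamps are not strictly sequential, a consumer has to tolerate re-seeing changes.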
@Addshore 0.3.2 should be up already.
Aug 26 2019
I tried to manually dump the mediainfo entries over the weekend: it took 376 minutes for 4 shards (a lot, but less than I expected) and produced 1,724,656 items. It does not seem to produce significant load on the DB so far, but it gives about 20 items/second per shard, which seems too slow. If we ever get all files having items, that would take about 4 days to process over 8 shards, probably more since DB access will get slower; right now it is not too slow because only 2% of files have items, so there are not too many DB queries.
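The arithmetic behind those figures, spelled out. The 55 million total-files figure in the projection is an assumption for illustration (the dump above covered only the ~2% of files that currently have mediainfo items):

```java
public class DumpThroughput {
    public static double itemsPerSecondPerShard(long items, double minutes, int shards) {
        return items / (minutes * 60.0) / shards;
    }

    public static double daysToProcess(double totalItems, double ratePerShard, int shards) {
        return totalItems / (ratePerShard * shards) / 86_400.0;
    }

    public static void main(String[] args) {
        // 1,724,656 items in 376 minutes over 4 shards.
        double rate = itemsPerSecondPerShard(1_724_656L, 376, 4);
        System.out.printf("%.1f items/s per shard%n", rate);          // ~19.1
        // Hypothetical full run: 55M files over 8 shards at that rate.
        System.out.printf("%.1f days for 55M files over 8 shards%n",
                daysToProcess(55_000_000.0, rate, 8));                // ~4.2
    }
}
```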
Aug 25 2019
@Multichill eventually yes, but since they are not being used anywhere yet, it's too early to document them. Once the RDF export is properly set up to use these prefixes, we can document them officially.
Aug 23 2019
All should be updated now.
Aug 22 2019
The immediate RDF breakage is fixed; now I'll have to update the lexemes that were affected.
OK, I am getting multiple builds taking 50+ minutes again for Wikibase, e.g.: