Mon, Sep 9
I don't think it's worth bothering with depooling. Unless the number of affected items is very large, it should be quick enough that nobody really notices.
Sat, Sep 7
This may happen because value nodes are not updated when data is updated (since they are supposed to be immutable). So if some bad data sneaked in while the problem was there, the bad value nodes (and possibly reference nodes, since they behave the same way) are still there. The best way to do it would be:
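For illustration, such leftovers can be probed for directly: value nodes that still carry data triples but are no longer referenced by any statement. A minimal sketch against the public endpoint (the time-value probe is just one value type, and a full-graph scan like this would likely time out; it's a sketch, not a production cleanup query):

import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# Find a few value nodes that still have data triples but no incoming
# reference from any statement - candidates for stale immutable leftovers.
query = """
SELECT ?v WHERE {
  ?v wikibase:timeValue ?t .
  FILTER NOT EXISTS { ?statement ?predicate ?v }
} LIMIT 10
"""
r = requests.get(ENDPOINT, params={"query": query, "format": "json"},
                 headers={"User-Agent": "stale-node-check-sketch/0.1"})
for row in r.json()["results"]["bindings"]:
    print(row["v"]["value"])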
Wed, Sep 4
Evaluating both Virtuoso and other solutions (like JanusGraph) would require that. @Gehel should know the details.
Thu, Aug 29
Looks like the new updater actually handles it better, but we need to verify that.
Loading 1 hour 25 minutes of updates from 201908010000 under both updaters shows no differences except ones that can be attributed to edits (since we always load the latest version even for old changes). So this first test seems to be a success.
Procedure for comparing journals:
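A minimal sketch of one way to do such a comparison, assuming both the old-updater and new-updater instances expose SPARQL endpoints (the host names and sample entity are hypothetical, and Blazegraph serving N-Triples for a CONSTRUCT when asked for text/plain is an assumption): export the same subject's triples from each side, sort, and diff.

import difflib
import requests

# Hypothetical endpoints for the instances loaded by the old and new updater.
OLD = "http://wdqs-old.test:9999/bigdata/namespace/wdq/sparql"
NEW = "http://wdqs-new.test:9999/bigdata/namespace/wdq/sparql"

QUERY = "CONSTRUCT WHERE { <http://www.wikidata.org/entity/Q42> ?p ?o }"

def triples(endpoint):
    r = requests.get(endpoint, params={"query": QUERY},
                     headers={"Accept": "text/plain"})  # N-Triples, one per line
    r.raise_for_status()
    return sorted(r.text.splitlines())

for line in difflib.unified_diff(triples(OLD), triples(NEW),
                                 fromfile="old", tofile="new", lineterm=""):
    print(line)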
Wed, Aug 28
Testing on wdqs-test shows the new Updater is 2x faster than the old one. I didn't verify validity yet, but the speed looks good :)
Mac OS 10.13.6 (High Sierra), Firefox 68.0.2
Tue, Aug 27
After the patch is merged and deployed, the categories DB needs to be re-loaded according to the procedure here: https://wikitech.wikimedia.org/wiki/Wikidata_query_service#Categories_reload_procedure
Looks like the DELETE SPARQL clauses that the daily dump is generating are wrong... Weird that I haven't noticed it before.
Looks like there's some problem with deletion handling. E.g. https://en.wikipedia.org/wiki/Category:Delaware_elections,_2006 has been deleted and is listed in the enwiki-20190826-daily.sparql.gz dump as deleted, but it is still present in the database. Strangely enough, the log shows the file was successfully processed - but somehow the results are not there. Will investigate further.
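To make checking this kind of case quicker, a small probe works (a sketch; the local categories namespace path is an assumption and may differ per host):

import requests

# Hypothetical local endpoint for the categories namespace.
ENDPOINT = "http://localhost:9999/bigdata/namespace/categories/sparql"

# If the dump's DELETE had actually been applied, this should return false.
query = """
ASK { <https://en.wikipedia.org/wiki/Category:Delaware_elections,_2006> ?p ?o }
"""
r = requests.get(ENDPOINT, params={"query": query, "format": "json"})
print(r.json()["boolean"])  # true = the deleted category is still in the database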
I've created T231390 (MWAPI can only match one result per page) for handling the multiple-values-in-one-result issue.
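For context, this is the shape of an MWAPI call from SPARQL and where the limitation bites: when the API response carries several matching values for one page, only a single binding per result item comes through. A sketch using the standard wikibase:mwapi syntax (the search string is arbitrary):

import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# ?title gets one binding per result item, even when the underlying API
# result carries several candidate values for that page - the T231390 issue.
query = """
SELECT ?title WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "en.wikipedia.org" ;
                    wikibase:api "Search" ;
                    mwapi:srsearch "Delaware elections" .
    ?title wikibase:apiOutput mwapi:title .
  }
} LIMIT 5
"""
r = requests.get(ENDPOINT, params={"query": query, "format": "json"})
for row in r.json()["results"]["bindings"]:
    print(row["title"]["value"])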
RecentChanges has many flaws (for example, it is not a reliable stream, since timestamps are not sequential and it can't be queried by RC ID - see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/302368), but as I understand it, it is the only way to get a change stream for a wiki without setting up Kafka, etc. So I imagine that until we get containers with all that stuff working, we're stuck with RC as the only option for getting changes publicly.
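As a sketch of what consuming RC looks like in practice - paging by timestamp with the continuation token, since RC ID can't be used as a cursor:

import requests

API = "https://www.wikidata.org/w/api.php"

# Minimal RecentChanges polling loop. A real poller also has to cope with
# the caveats above: timestamps are not sequential, and continuation is
# the only reliable cursor.
params = {
    "action": "query", "list": "recentchanges",
    "rcprop": "title|ids|timestamp", "rcdir": "newer",
    "rcstart": "2019-08-01T00:00:00Z", "rclimit": 100, "format": "json",
}
while True:
    data = requests.get(API, params=params).json()
    for change in data["query"]["recentchanges"]:
        print(change["rcid"], change["timestamp"], change["title"])
    if "continue" not in data:
        break  # caught up; a real poller would sleep and retry instead
    params.update(data["continue"])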
@Addshore 0.3.2 should be up already.
Mon, Aug 26
I tried to manually dump the mediainfo entries over the weekend; it took 376 minutes for 4 shards (a lot, but less than I expected) and produced 1724656 items. It does not seem to put significant load on the DB so far, but it gives only about 20 items/second per shard, which seems too slow. If we ever get all files having items, that'd take 4 days to process over 8 shards - probably more, since DB access will get slower. Right now it is not too slow because only about 2% of files have items, so there aren't many DB queries.
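A quick sanity check of those numbers (the total-file figure is an assumption, roughly where Commons was at the time, not something measured here):

# Back-of-envelope check of the dump throughput figures above.
items = 1_724_656          # items produced by the manual dump
minutes = 376              # wall time over 4 shards
shards = 4

per_second = items / (minutes * 60)   # ~76 items/s in total
per_shard = per_second / shards       # ~19 items/s per shard

# Hypothetical projection if every file had an item; ~55 million files
# is an assumed round figure for Commons, not from the log above.
total_files = 55_000_000
eight_shard_rate = per_shard * 8      # ~153 items/s
days = total_files / eight_shard_rate / 86_400
print(f"{per_shard:.0f} items/s per shard, ~{days:.1f} days for {total_files:,} files")

This lands at roughly 4.2 days, consistent with the 4-day estimate.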
Sun, Aug 25
@Multichill eventually yes, but since they are not being used anywhere yet, it's too early to document them. Once the RDF export is properly set up to use these prefixes, we can document them officially.
Fri, Aug 23
All should be updated now.
Thu, Aug 22
Immediate RDF breakage fixed; now I'll have to update the lexemes that were affected.
Ok I am getting multiple builds taking 50+ minutes again for Wikibase, e.g.:
RDF generated is:
wdq6: 17:16:23.890 [update 0] WARN org.wikidata.query.rdf.tool.Updater - Contained error syncing. Giving up on L60296
org.wikidata.query.rdf.tool.exception.ContainedException: RDF parsing error for https://www.wikidata.org/wiki/Special:EntityData/L60296.ttl?flavor=dump&nocache=1566494183408
    at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.collectStatementsFromUrl(WikibaseRepository.java:401)
    at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.fetchRdfForEntity(WikibaseRepository.java:457)
    at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.fetchRdfForEntity(WikibaseRepository.java:433)
    at org.wikidata.query.rdf.tool.Updater.handleChange(Updater.java:362)
    at org.wikidata.query.rdf.tool.Updater.lambda$handleChanges$0(Updater.java:236)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.openrdf.rio.RDFParseException: Default namespace used but not defined [line 42]
    at org.openrdf.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:440)
    at org.openrdf.rio.helpers.RDFParserBase.reportFatalError(RDFParserBase.java:685)
    at org.openrdf.rio.turtle.TurtleParser.reportFatalError(TurtleParser.java:1405)
    at org.openrdf.rio.helpers.RDFParserBase.getNamespace(RDFParserBase.java:342)
    at org.openrdf.rio.turtle.TurtleParser.parseQNameOrBoolean(TurtleParser.java:1032)
    at org.openrdf.rio.turtle.TurtleParser.parseValue(TurtleParser.java:643)
    at org.openrdf.rio.turtle.TurtleParser.parseSubject(TurtleParser.java:474)
    at org.openrdf.rio.turtle.TurtleParser.parseTriples(TurtleParser.java:407)
    at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:259)
    at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:214)
    at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.collectStatementsFromUrl(WikibaseRepository.java:392)
    ... 8 common frames omitted
So that's why it's not updated. I'll check why this happens.
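For reference, "Default namespace used but not defined" is what a Turtle parser reports when the empty prefix is used without a @prefix declaration. A minimal reproduction (a sketch using rdflib rather than the updater's openrdf stack):

from rdflib import Graph

# ":s" uses the default (empty) prefix, which is never declared, so the
# Turtle parser fails the same way the openrdf parser does in the trace above.
bad_turtle = ":s :p :o ."
try:
    Graph().parse(data=bad_turtle, format="turtle")
except Exception as e:
    print(type(e).__name__, e)  # rdflib raises a BadSyntax error here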
Wed, Aug 21
Tags should work, at least for now, I think - if I can filter by tag efficiently. There aren't a lot of data edits so far, compared to the overall Commons edit volume.
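What that filter would look like over the API, as a sketch (the tag name is a placeholder - whatever tag the data edits actually carry would go in rctag):

import requests

API = "https://commons.wikimedia.org/w/api.php"
r = requests.get(API, params={
    "action": "query", "list": "recentchanges",
    "rctag": "wikibase-edit",   # placeholder tag name, not confirmed
    "rcprop": "title|timestamp", "rclimit": 10, "format": "json",
})
for change in r.json()["query"]["recentchanges"]:
    print(change["timestamp"], change["title"])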
Probably not a lot. A search for English labels returns 188 results; unfortunately, searches for statements and for every label don't seem to work (probably needs a reindex?), so I don't know how many there are, but probably also not a lot. I'll check tomorrow whether I can get more specific figures.
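For reference, counts like that can be pulled from the search API's totalhits field; a minimal sketch (treat the haslabel: keyword and the File: namespace choice as assumptions about how these labels are indexed):

import requests

API = "https://commons.wikimedia.org/w/api.php"

def total_hits(search):
    # list=search returns a totalhits count alongside the results.
    r = requests.get(API, params={
        "action": "query", "list": "search", "srsearch": search,
        "srnamespace": 6,   # File: namespace
        "srinfo": "totalhits", "srlimit": 1, "format": "json",
    })
    return r.json()["query"]["searchinfo"]["totalhits"]

print(total_hits("haslabel:en"))   # files with an English label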