User Details
- User Since
- Jun 9 2015, 9:03 AM (433 w, 5 d)
- Availability
- Available
- IRC Nick
- dcausse
- LDAP User
- DCausse
- MediaWiki User
- DCausse (WMF) [ Global Accounts ]
Thu, Sep 28
@Hannah_Bast thanks for making this change! I did a quick test locally and everything seems to work fine now; re-opening this accordingly.
Wed, Sep 27
Tue, Sep 26
Mon, Sep 25
Fri, Sep 22
I think we should test what elastic does with one of its data paths broken. The ideal failure scenario for option 1 is that elastic continues to work properly but simply declares that it lost all the shards from this data path, and that hot-swapping the disk resumes operation properly. It is very possible that a shard failure is only detected when a read or write hits it. Reading https://github.com/elastic/elasticsearch/issues/71205, I understand that MDP users mostly complain about recovery time and that MDP lets them recover less data. The "Elasticsearch does not balance shards across a node’s data paths" is wha
Thu, Sep 21
Wed, Sep 20
Tue, Sep 19
Mon, Sep 18
Fri, Sep 15
Thu, Sep 14
Wed, Sep 13
@bking thanks! I can confirm that the job is running fine: the dashboards show some activity, and the test stream wdqs_streaming_updater_test_T289836 is seeing all the mutations.
Regarding H/A with zookeeper, I believe it is properly in use, since I don't see the usual k8s configmaps that were created when the KUBERNETES H/A mode was used. I believe we should be good to do some testing of the various flink operations.
Tue, Sep 12
Mon, Sep 11
The apifeatureusage hosts are logstash hosts, so it might be better to ask the o11y team for advice here. Regarding the search-loaders hosts, @EBernhardson might know best, but since they run asynchronous processes and are stateless, I suspect the upgrade should be straightforward and not too risky.
Fri, Sep 8
@Hannah_Bast sorry about this, I mixed this ticket up with another one. Supporting https://qlever.cs.uni-freiburg.de/api/dblp would require changing the Accept header that blazegraph sends during federation requests, and that does not appear to be something that can be done without patching blazegraph (which we'd like to avoid unless really necessary). It's the first time we seem to have encountered an endpoint that refuses to produce application/sparql-results+xml, and we've added almost 80 of them so far, so it sounds to me like it would be nice to implement this on your side.
Note that even if we changed blazegraph to accept multiple formats for all endpoints by setting the header you suggest (Accept: application/sparql-results+xml, application/sparql-results+json), the https://data.nlg.gr/sparql endpoint would still produce an http 500 error.
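To illustrate the idea, here is a minimal Python sketch (a hypothetical helper, not Blazegraph's actual code) of how a federation client could advertise both result formats in the Accept header and then pick a parser based on the Content-Type the endpoint returns:

```python
# Hypothetical sketch: advertise both SPARQL result formats, then choose a
# parser from the endpoint's Content-Type response header.

ACCEPT = "application/sparql-results+xml, application/sparql-results+json"

def choose_parser(content_type: str) -> str:
    """Return which result parser to use for a SPARQL results response."""
    # Strip parameters such as "; charset=utf-8" before comparing
    mime = content_type.split(";")[0].strip().lower()
    if mime == "application/sparql-results+json":
        return "json"
    if mime == "application/sparql-results+xml":
        return "xml"
    raise ValueError(f"unsupported SPARQL results format: {mime}")

# An endpoint that only serves JSON would still be usable, because JSON is
# listed in the Accept header above.
print(choose_parser("application/sparql-results+json; charset=utf-8"))  # json
print(choose_parser("application/sparql-results+xml"))                  # xml
```

This is just the client-side half of the negotiation; as noted above, an endpoint that answers http 500 regardless of the Accept header cannot be fixed this way.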
@Hannah_Bast Blazegraph does properly send the Accept: application/sparql-results+xml header, but it seems that this endpoint only works when requesting application/sparql-results+json; anything else produces an http 500 error:
Thu, Sep 7
For context, schema:url was added to help join P18 statements from wikidata with commons media items (T277665), which was almost impossible to do without it.
In order to help prioritize this request and justify the extra space usage could you elaborate on what use-cases are not possible in the absence of the triple mentioned in the ticket description?
From a query service perspective (owned by the Search Platform Team), because we are still addressing scalability issues, we are generally very conservative when it comes to adding more triples to the RDF dumps.
Process-wise, the MediaInfo RDF schema is owned by the Structured Data team and is subject to the Stable Interface Policy.
Wed, Sep 6
Going to work on improving the tooling around reconciliation of missed deletes, but I won't be working on the root cause. I agree with @Milimetric here: we need to get a better sense of the quality of the EventBus/EventGate system; 5 (*identified*) missed events over 1 month on a relatively low-volume stream seems concerning.
The search indices have been updated.
Checked the search queries mentioned in the task description and they seem to return the expected number of results. Moving this to our board's "Needs Reporting" column, but please consider the parent task unblocked.
Tue, Sep 5
Mon, Sep 4
Searching for "Moritz M. Daffinger" does not yield the expected page on the English and German Wikipedias because the term "M" is not part of the page text of en:Moritz_Michael_Daffinger nor de:Moritz_Daffinger.
CirrusSearch is configured in such a way that it requires all the query terms to appear in the page for it to be found.
For French, the reason it is found is that the term "m" alone is considered a stop word and thus not required.
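A toy Python illustration (not CirrusSearch's real implementation) of why the query matches on frwiki but not on enwiki/dewiki: every query term that is not a stop word must appear in the page text.

```python
# Toy model: a page matches only if all non-stop-word query terms appear in it.

def matches(query_terms, page_terms, stop_words):
    required = [t for t in query_terms if t not in stop_words]
    return all(t in page_terms for t in required)

query = ["moritz", "m", "daffinger"]
page = {"moritz", "michael", "daffinger"}  # the lone term "m" never appears

print(matches(query, page, stop_words=set()))   # False: "m" is required but absent
print(matches(query, page, stop_words={"m"}))   # True: "m" is dropped as a stop word
```

This mirrors the French case above, where "m" is on the stop-word list and therefore not required.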
@mfossati could you update the recovery plan to reflect what is currently happening? As far as I understand, the "let 2023-08-21's production run compute the proper delta" step did not run as expected, since you are requesting to import the full dataset in T345545. Could you also elaborate on what went wrong that could explain why the delta is not correct?
Aug 11 2023
Aug 10 2023
Aug 9 2023
At a glance I suspect that now you might get duplicated QIDs in
sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .alias("sa_and_sasc_ids")
)
Which could be explained by entities being tagged with multiple entries found in sa_and_sasc_qids.
What happens if you apply a distinct here:
sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .distinct()
    .alias("sa_and_sasc_ids")
)
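A plain-Python sketch (mirroring the PySpark logic, with made-up QIDs) of how an entity carrying several P31 values that all appear in sa_and_sasc_qids produces duplicated subject rows, and what deduplication returns instead:

```python
# Hypothetical data: Q1 is an instance of BOTH target classes, so the filter
# matches it twice; Q2 matches once.

sa_and_sasc_qids = {"Q13442814", "Q591041"}  # made-up target classes

# (subject, predicate, object) triples
triples = [
    ("Q1", "P31", "Q13442814"),
    ("Q1", "P31", "Q591041"),
    ("Q2", "P31", "Q13442814"),
]

matched = [s for (s, p, o) in triples if p == "P31" and o in sa_and_sasc_qids]
print(matched)               # ['Q1', 'Q1', 'Q2'] -> Q1 is duplicated
print(sorted(set(matched)))  # ['Q1', 'Q2'] -> what .distinct() would give
```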
Aug 8 2023
Aug 7 2023
item | deletion date |
Q10813441 | 2023-06-06T14:21:49 |
Q32994683 | 2023-05-03T16:04:27 |
Q55929561 | 2023-05-31T20:24:58 |
Q109548562 | 2023-06-06T14:22:29 |
Q111436860 | 2023-05-04T06:54:56 |
Aug 4 2023
I suspect that because the claims field is an array of complex types it can potentially be huge, and asking to generate its string representation with f"{claims}" might cause excessive memory usage and is, I believe, a very slow operation.
I would look into ways to avoid serializing it as a string: either iterate over the object representation (a Row, I suspect?) to do your filtering, or possibly ask hive to do a lateral view on the mainSnak (which is, I believe, what you're looking for?):
select id, claims_ex.mainSnak.property, claims_ex.mainSnak.dataValue.value
from wmf.wikidata_entity
lateral view explode(claims) claims_explode as claims_ex
where snapshot = '2023-07-24'
  and claims_ex.mainSnak.property = 'P31'
limit 10;
OK
id         property  value
Q38488724  P31       {"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q37619467  P31       {"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q38738598  P31       {"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q37797268  P31       {"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q38708632  P31       {"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q37781259  P31       {"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q39051969  P31       {"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q37373175  P31       {"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q38327391  P31       {"entity-type":"item","numeric-id":5,"id":"Q5"}
Q37598817  P31       {"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Then adding yet another filter on claims_ex.mainSnak.dataValue.value = '{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}' should work.
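To show what iterating the object representation (instead of serializing it with f"{claims}") could look like, here is a plain-Python sketch; the field names follow the wmf.wikidata_entity schema used above, the data itself is made up:

```python
# Made-up claims array shaped like the mainSnak structure queried above.
claims = [
    {"mainSnak": {"property": "P31",
                  "dataValue": {"value": '{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}'}}},
    {"mainSnak": {"property": "P50",
                  "dataValue": {"value": '{"entity-type":"item","numeric-id":42,"id":"Q42"}'}}},
]

def is_scholarly_article(claims):
    """Check the claims objects directly, without stringifying the whole array."""
    for claim in claims:
        snak = claim["mainSnak"]
        if snak["property"] == "P31" and '"id":"Q13442814"' in snak["dataValue"]["value"]:
            return True
    return False

print(is_scholarly_article(claims))  # True
```

In Spark the elements would be Row objects rather than dicts, but the per-claim iteration is the same idea and avoids rendering the whole array as one giant string.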
Aug 3 2023
@AndrewTavis_WMDE sure! I'll send you an invite for next Monday; in the meantime, could you share your notebook somewhere so that I can take a look before the call?
@AndrewTavis_WMDE thanks! This is really exciting, we couldn't hope for better results... it's almost a 50-50 split. On top of that, if I read your results correctly, we only have 197,583 common triples (7,521,423,558 - 7,521,225,975) that will have to be duplicated in both subgraphs.
early dashboard for inconsistencies: https://superset.wikimedia.org/superset/dashboard/451/?native_filters_key=FNd9a7DWn2Q11F_72h3t5CO2dt2DwwhiIwyUdKIvokscWf8NOjlD4Vf9ltjfBOZn
Aug 2 2023
- <http://www.wikidata.org/entity/>: prefix wd, generally referred to as the concept URI of the entity. This is how an entity (whether a property, item or lexeme) is identified, e.g. Q42 is identified as wd:Q42 -> <http://www.wikidata.org/entity/Q42>. This is the form used to link items to their statements and other constituents (can appear as a subject or an object).
- <http://www.wikidata.org/entity/statement/>: prefix s in the dumps and Special:EntityData, and wds in WDQS. These identify a wikibase statement, e.g. wds:q42-D8404CDA-25E4-4334-AF13-A3290BCD9C0F is the identity of the date-of-birth statement for Q42 (can appear as a subject or an object).
- <http://www.wikidata.org/prop/statement/>: prefix ps is what links the statement ID defined above to its simple value form, so the actual date of birth of Q42 is stored in the triple wds:q42-D8404CDA-25E4-4334-AF13-A3290BCD9C0F ps:P569 ?dateOfBirth (can appear only as a predicate).
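The chain described above can be sketched with a toy in-memory triple store in Python. The entity-to-statement link here uses the p: prefix (<http://www.wikidata.org/prop/>), which the list above does not cover, and the triples are reduced to the bare minimum:

```python
# Toy triples showing wd: -> wds: -> value for Q42's date of birth.
WD = "http://www.wikidata.org/entity/"
WDS = "http://www.wikidata.org/entity/statement/"
PS = "http://www.wikidata.org/prop/statement/"
P = "http://www.wikidata.org/prop/"   # links the entity to its statement node

triples = [
    (WD + "Q42", P + "P569", WDS + "q42-D8404CDA-25E4-4334-AF13-A3290BCD9C0F"),
    (WDS + "q42-D8404CDA-25E4-4334-AF13-A3290BCD9C0F", PS + "P569", "1952-03-11"),
]

def simple_value(entity, prop):
    """Follow entity -p:-> statement node -ps:-> simple value."""
    for s, pred, o in triples:
        if s == entity and pred == P + prop:
            for s2, pred2, o2 in triples:
                if s2 == o and pred2 == PS + prop:
                    return o2
    return None

print(simple_value(WD + "Q42", "P569"))  # 1952-03-11
```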
@LSobanski thanks for the ping, just cleaned up some files in my home dir
@BPirkle thanks for the investigations! Your conclusions do seem correct to me. I'm not very knowledgeable about the MW API subsystems, but I'm with you and Gergo here: since the REST API is "relatively new", it might be safer to apply the same normalization steps to all string params that the action API does. In fact, in this case I believe we somehow got "lucky" that CirrusSearch was very picky (not very robust, to be more correct :) by failing on such utf8 sequences; I imagine that in other scenarios/endpoints the internal MW API could simply silently accept such strings, possibly update the database with them, and pollute our datastores with these "bad" utf8 chars and sequences.
Jul 31 2023
The wikibase_rdf table has one row per triple plus an additional column named context that we use to annotate the entity the triple was extracted from while reading the dump.
The caveats are shared values and references, which have <http://wikiba.se/ontology#Value> and <http://wikiba.se/ontology#Reference> respectively set as their context column.
It is true that there are duplicates in there, so a select count(*) from wikibase_rdf will give a number greater than the number of triples stored in blazegraph.
Jul 28 2023
I believe that at first we are interested in knowing the number of triples that would be moved out if all items that satisfy the condition ?s wdt:P31 wd:Q13442814 were moved out along with all the triples belonging to those items.
The triples that belong to an entity (e.g. Q1895685) are the ones visible via https://www.wikidata.org/wiki/Special:EntityData/Q1895685.ttl?flavor=dump, with the additional complexity of shared values and references, which have to be treated separately because they might be shared with other entities.
Here you'll notice that, for instance, the triple s:Q1895685-9a482323-4d57-acf2-b6b7-bc36d578bd57 ps:P478 "171" does not reference the QID of the paper, but this triple must be counted as well.
This makes knowing which triples belong to ?s a bit tricky, but we could leverage the structure of the wikibase_rdf table for this:
- first count the number of triples that are not shared with other entities, using the context column; this column (not data available in WDQS) can help to group the triples by entity
- then count the number of triples attached to shared references and values; here we should also count the ones shared between S and not-S, because these will have to be duplicated in both graphs
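The two counting steps above can be sketched in plain Python over made-up rows of the wikibase_rdf table, represented as (subject, predicate, object, context) tuples:

```python
# Shared values/references carry the ontology#Value / ontology#Reference context.
VALUE_CTX = "http://wikiba.se/ontology#Value"
REF_CTX = "http://wikiba.se/ontology#Reference"

# Made-up rows: Q1 is the entity being moved out, Q2 stays, plus one shared value node.
rows = [
    ("s:Q1-stmt", "ps:P478", '"171"', "Q1"),
    ("wd:Q1", "wdt:P31", "wd:Q13442814", "Q1"),
    ("wd:Q2", "wdt:P31", "wd:Q5", "Q2"),
    ("v:abc", "wikibase:timeValue", '"2020"', VALUE_CTX),
]

moving_out = {"Q1"}  # entities matching ?s wdt:P31 wd:Q13442814 (hypothetical)

# Step 1: entity-owned triples, grouped via the context column
owned = sum(1 for r in rows if r[3] in moving_out)
print(owned)  # 2

# Step 2: shared value/reference triples are counted separately, since both
# subgraphs may need a copy of them
shared = sum(1 for r in rows if r[3] in (VALUE_CTX, REF_CTX))
print(shared)  # 1
```

In practice step 2 still needs a join from the shared nodes back to the entities that reference them, which the toy data above skips.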
should be resolved
Jul 27 2023
There was a stale /srv/query_service/aliases.map file with some content in it (which I copied to /root/aliases.map.T342762) that I believe was confusing nginx, causing it to replace the "wcq" part of the URI with wcqs20220915, which does not exist. Emptying this file seems to have fixed the readiness probe.
Jul 26 2023
This was initially caused by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/940254, where I started to inject the ExtensionRegistry required to load some search profiles that the GrowthExperiments extension declares. Sadly I forgot to pull the profiles for the "rescore_function_chains" component, and this is what caused the regression. The attached patch should solve the problem, sorry about that.
looking into it
moving back to needs triage as this starts to be blocking other teams