
dcausse (David Causse)
User

User Details

User Since
Jun 9 2015, 9:03 AM (433 w, 5 d)
Availability
Available
IRC Nick
dcausse
LDAP User
DCausse
MediaWiki User
DCausse (WMF) [ Global Accounts ]

Recent Activity

Thu, Sep 28

dcausse moved T346456: Improve concurrency limits configuration of the wdqs updater from In Progress to To Be Deployed on the Discovery-Search (Current work) board.
Thu, Sep 28, 6:27 PM · Patch-For-Review, Discovery-Search (Current work), wdwb-tech, Wikidata, serviceops, Wikidata-Query-Service
dcausse moved T326914: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink from Needs review to To Be Deployed on the Discovery-Search (Current work) board.
Thu, Sep 28, 6:27 PM · Patch-For-Review, Data-Platform-SRE, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
dcausse reopened T339347: qlever dblp endpoint for wikidata federated query nomination as "Open".

@Hannah_Bast thanks for making this change! I did a quick test locally and everything seems to work fine now, so I'm re-opening this accordingly.

Thu, Sep 28, 7:52 AM · Data-Platform-SRE, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
dcausse moved T339347: qlever dblp endpoint for wikidata federated query nomination from Done to Incoming on the Data-Platform-SRE board.
Thu, Sep 28, 7:52 AM · Data-Platform-SRE, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Wed, Sep 27

dcausse created T347515: The WDQS streaming updater should have a way to disable or tag side output events.
Wed, Sep 27, 5:25 PM · Wikidata, Wikidata-Query-Service

Tue, Sep 26

dcausse moved T347333: Tune process_sparql_query_hourly so that it does not get killed by yarn from In Progress to Needs review on the Discovery-Search (Current work) board.
Tue, Sep 26, 4:52 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
dcausse claimed T347333: Tune process_sparql_query_hourly so that it does not get killed by yarn.
Tue, Sep 26, 4:30 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
dcausse added a project to T347333: Tune process_sparql_query_hourly so that it does not get killed by yarn: Discovery-Search (Current work).
Tue, Sep 26, 4:29 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Mon, Sep 25

dcausse created T347333: Tune process_sparql_query_hourly so that it does not get killed by yarn.
Mon, Sep 25, 6:32 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
dcausse edited projects for T347284: Restore service for https://query.wikidata.org/bigdata/ldf, added: Wikidata-Query-Service; removed Wikidata Query UI.
Mon, Sep 25, 12:16 PM · Data-Platform-SRE, Wikidata, Wikidata-Query-Service
dcausse created T347284: Restore service for https://query.wikidata.org/bigdata/ldf.
Mon, Sep 25, 12:15 PM · Data-Platform-SRE, Wikidata, Wikidata-Query-Service
gmodena awarded T346015: [Search Update Pipeline] Consider dropping support for java8 a Like token.
Mon, Sep 25, 8:32 AM · Discovery-Search (Current work), CirrusSearch

Fri, Sep 22

dcausse added a comment to T231010: Change partitioning scheme for elasticsearch from RAID to JBOD.

I think we should test what elastic is doing with one of its data paths broken. The ideal failure scenario for option 1 is that elastic continues to work properly but simply declares that it lost all the shards from this data path, and that hot-swapping the disk resumes operation properly. It is very possible that a shard failure is only detected when a write or read happens on it. Reading https://github.com/elastic/elasticsearch/issues/71205, I understand that MDP users are mostly complaining about recovery time and that MDP allows them to recover less data. The "Elasticsearch does not balance shards across a node’s data paths" is wha

Fri, Sep 22, 7:25 AM · Discovery-Search (Current work), Data-Platform-SRE, Elasticsearch

Thu, Sep 21

dcausse updated the task description for T347034: RESTBase /v1/related endpoint should call the MW action API with a GET not a POST.
Thu, Sep 21, 12:37 PM · Maintenance-Worktype, Wikifeeds, Sustainability (Incident Followup), Discovery-Search
dcausse updated the task description for T347034: RESTBase /v1/related endpoint should call the MW action API with a GET not a POST.
Thu, Sep 21, 12:31 PM · Maintenance-Worktype, Wikifeeds, Sustainability (Incident Followup), Discovery-Search
dcausse created T347034: RESTBase /v1/related endpoint should call the MW action API with a GET not a POST.
Thu, Sep 21, 12:29 PM · Maintenance-Worktype, Wikifeeds, Sustainability (Incident Followup), Discovery-Search

Wed, Sep 20

dcausse claimed T346456: Improve concurrency limits configuration of the wdqs updater.
Wed, Sep 20, 1:39 PM · Patch-For-Review, Discovery-Search (Current work), wdwb-tech, Wikidata, serviceops, Wikidata-Query-Service

Tue, Sep 19

dcausse moved T345634: [Search Update Pipeline] Add a way to filter input events per wiki from Needs review to Needs Reporting on the Discovery-Search (Current work) board.
Tue, Sep 19, 2:57 PM · Discovery-Search (Current work), CirrusSearch
dcausse moved T326914: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink from In Progress to Needs review on the Discovery-Search (Current work) board.
Tue, Sep 19, 2:56 PM · Patch-For-Review, Data-Platform-SRE, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
dcausse moved T346015: [Search Update Pipeline] Consider dropping support for java8 from In Progress to Needs Reporting on the Discovery-Search (Current work) board.
Tue, Sep 19, 7:41 AM · Discovery-Search (Current work), CirrusSearch
dcausse created T346719: [Search Update Pipeline] Upgrade to flink 1.17.1.
Tue, Sep 19, 7:27 AM · Discovery-Search (Current work), CirrusSearch
dcausse added a subtask for T317045: [Epic] Re-architect the Search Update Pipeline: T346718: [Search Update Pipeline] Set max parallelism explicitely on operators with a state.
Tue, Sep 19, 7:22 AM · Discovery-Search (Current work), Epic
dcausse added a parent task for T346718: [Search Update Pipeline] Set max parallelism explicitely on operators with a state: T317045: [Epic] Re-architect the Search Update Pipeline.
Tue, Sep 19, 7:22 AM · Discovery-Search
dcausse added a project to T346718: [Search Update Pipeline] Set max parallelism explicitely on operators with a state: Discovery-Search.
Tue, Sep 19, 7:22 AM · Discovery-Search
dcausse created T346718: [Search Update Pipeline] Set max parallelism explicitely on operators with a state.
Tue, Sep 19, 7:21 AM · Discovery-Search
dcausse added a subtask for T317045: [Epic] Re-architect the Search Update Pipeline: T346717: [Search Update Pipeline] Name and identify operators that have a state.
Tue, Sep 19, 7:18 AM · Discovery-Search (Current work), Epic
dcausse added a parent task for T346717: [Search Update Pipeline] Name and identify operators that have a state: T317045: [Epic] Re-architect the Search Update Pipeline.
Tue, Sep 19, 7:18 AM · Discovery-Search
dcausse updated subscribers of T346717: [Search Update Pipeline] Name and identify operators that have a state.
Tue, Sep 19, 7:16 AM · Discovery-Search
dcausse created T346717: [Search Update Pipeline] Name and identify operators that have a state.
Tue, Sep 19, 7:15 AM · Discovery-Search

Mon, Sep 18

dcausse moved T346456: Improve concurrency limits configuration of the wdqs updater from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.
Mon, Sep 18, 3:49 PM · Patch-For-Review, Discovery-Search (Current work), wdwb-tech, Wikidata, serviceops, Wikidata-Query-Service
dcausse moved T326914: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink from Incoming to In Progress on the Discovery-Search (Current work) board.
Mon, Sep 18, 3:14 PM · Patch-For-Review, Data-Platform-SRE, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
dcausse moved T344284: Rename usages of whitelist to allowlist in query service rdf repo from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.
Mon, Sep 18, 3:08 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service, Data-Platform-SRE

Fri, Sep 15

dcausse updated the task description for T346456: Improve concurrency limits configuration of the wdqs updater.
Fri, Sep 15, 5:03 PM · Patch-For-Review, Discovery-Search (Current work), wdwb-tech, Wikidata, serviceops, Wikidata-Query-Service
dcausse created T346456: Improve concurrency limits configuration of the wdqs updater.
Fri, Sep 15, 5:01 PM · Patch-For-Review, Discovery-Search (Current work), wdwb-tech, Wikidata, serviceops, Wikidata-Query-Service
dcausse awarded T345195: [SPIKE] Can we identify indicators to inform an SLO for event emission and intake? a Love token.
Fri, Sep 15, 12:17 PM · Data Engineering and Event Platform Team (Sprint 3), Data-Engineering, Event-Platform

Thu, Sep 14

dcausse claimed T326914: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink.
Thu, Sep 14, 3:03 PM · Patch-For-Review, Data-Platform-SRE, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
dcausse added a project to T326914: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink: Discovery-Search (Current work).
Thu, Sep 14, 2:15 PM · Patch-For-Review, Data-Platform-SRE, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
dcausse updated the task description for T346315: Improve the flink-app chart to provide more useful defaults.
Thu, Sep 14, 9:48 AM · Patch-For-Review, Discovery-Search (Current work), serviceops, Event-Platform, Data-Engineering
dcausse created T346315: Improve the flink-app chart to provide more useful defaults.
Thu, Sep 14, 9:43 AM · Patch-For-Review, Discovery-Search (Current work), serviceops, Event-Platform, Data-Engineering
dcausse awarded T338189: WANCache stats missing for 'CirrusSearchParserOutputPageProperties' a Love token.
Thu, Sep 14, 7:56 AM · MediaWiki-Platform-Team, MediaWiki-libs-ObjectCache

Wed, Sep 13

dcausse added a comment to T345957: Restore dse-k8s' rdf-streaming-updater from savepoint/improve bootstrapping process.

@bking thanks! I can confirm that the job is running fine: the dashboards show some activity and the test stream wdqs_streaming_updater_test_T289836 is seeing all the mutations.
Regarding H/A with zookeeper, I believe it is being used properly, as I don't see the usual k8s configmaps that were created when the KUBERNETES H/A mode was used. I believe we should be good to do some testing of the various flink operations.

Wed, Sep 13, 9:53 AM · Data-Platform-SRE, Discovery-Search (Current work)

Tue, Sep 12

dcausse claimed T346015: [Search Update Pipeline] Consider dropping support for java8.
Tue, Sep 12, 2:51 PM · Discovery-Search (Current work), CirrusSearch
dcausse reassigned T345634: [Search Update Pipeline] Add a way to filter input events per wiki from dcausse to EBernhardson.
Tue, Sep 12, 2:50 PM · Discovery-Search (Current work), CirrusSearch
dcausse awarded T345956: Archive the search/MjoLniR repository (moved to gitlab) a Love token.
Tue, Sep 12, 1:47 PM · Wikimedia-GitHub, Diffusion-Repository-Administrators, Projects-Cleanup
dcausse claimed T345634: [Search Update Pipeline] Add a way to filter input events per wiki.
Tue, Sep 12, 10:11 AM · Discovery-Search (Current work), CirrusSearch

Mon, Sep 11

dcausse moved T344284: Rename usages of whitelist to allowlist in query service rdf repo from Ready for Dev -- SWE to Needs review on the Discovery-Search (Current work) board.
Mon, Sep 11, 3:14 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service, Data-Platform-SRE
dcausse added a comment to T346039: Migrate search-loader hosts to Bullseye or later.

The apifeatureusage hosts are logstash hosts, so it might be better to ask the o11y team for advice here. Regarding the search-loader hosts, @EBernhardson might know best, but since they run asynchronous processes and are stateless, I suspect that the upgrade should be straightforward and not too risky.

Mon, Sep 11, 2:12 PM · Data-Platform-SRE
dcausse claimed T344284: Rename usages of whitelist to allowlist in query service rdf repo.
Mon, Sep 11, 12:43 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service, Data-Platform-SRE
dcausse moved T342593: Five deleted Wikidata items pertaining to Wikimedia category pages still present in the Query Service from In Progress to Needs review on the Discovery-Search (Current work) board.
Mon, Sep 11, 11:58 AM · Event-Platform, Data-Engineering, Data Engineering and Event Platform Team, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
dcausse added a subtask for T317045: [Epic] Re-architect the Search Update Pipeline: T346015: [Search Update Pipeline] Consider dropping support for java8.
Mon, Sep 11, 8:12 AM · Discovery-Search (Current work), Epic
dcausse added a parent task for T346015: [Search Update Pipeline] Consider dropping support for java8: T317045: [Epic] Re-architect the Search Update Pipeline.
Mon, Sep 11, 8:12 AM · Discovery-Search (Current work), CirrusSearch
dcausse created T346015: [Search Update Pipeline] Consider dropping support for java8.
Mon, Sep 11, 8:11 AM · Discovery-Search (Current work), CirrusSearch

Fri, Sep 8

dcausse added a comment to T339347: qlever dblp endpoint for wikidata federated query nomination.

@Hannah_Bast sorry about this, I mixed this ticket up with another one. Supporting https://qlever.cs.uni-freiburg.de/api/dblp would require changing the Accept header that blazegraph sends during federation requests, and that does not appear to be something that can be done without patching blazegraph (which we'd like to avoid unless really necessary). This is the first time we seem to encounter an endpoint that refuses to produce application/sparql-results+xml, and we have added almost 80 of them so far, so it sounds to me like it would be nicer to implement this on your side.

Fri, Sep 8, 2:27 PM · Data-Platform-SRE, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
dcausse added a comment to T339347: qlever dblp endpoint for wikidata federated query nomination.

Note that even if we changed blazegraph to accept multiple formats for all endpoints by setting the header that you suggest (Accept: application/sparql-results+xml, application/sparql-results+json), the https://data.nlg.gr/sparql endpoint still produces an HTTP 500 error.

Fri, Sep 8, 9:34 AM · Data-Platform-SRE, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
dcausse added a comment to T339347: qlever dblp endpoint for wikidata federated query nomination.

@Hannah_Bast Blazegraph does properly send the header Accept: application/sparql-results+xml, but it seems that this endpoint only works when requesting application/sparql-results+json; anything else produces an HTTP 500 error:
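For reference, a minimal sketch of how one might reproduce this kind of check (the ASK query, the use of GET, and the timeout are illustrative assumptions, not the exact request Blazegraph sends during federation):

# Hedged sketch: probe a SPARQL endpoint with the two Accept headers discussed in this
# thread and print the HTTP status code returned for each.
import requests

ENDPOINT = "https://qlever.cs.uni-freiburg.de/api/dblp"
QUERY = "ASK { ?s ?p ?o }"

for accept in ("application/sparql-results+xml", "application/sparql-results+json"):
    resp = requests.get(ENDPOINT, params={"query": QUERY},
                        headers={"Accept": accept}, timeout=30)
    # A 500 for the XML variant would match the behaviour described above.
    print(accept, "->", resp.status_code)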

Fri, Sep 8, 9:26 AM · Data-Platform-SRE, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Thu, Sep 7

dcausse moved T344876: Wikibase MediaInfo should provide access to page name via query service from Incoming to RDF Model on the Wikidata-Query-Service board.

For context, schema:url was added to help join P18 statements from wikidata with commons media items (T277665), which was almost impossible to do without it.
In order to help prioritize this request and justify the extra space usage, could you elaborate on what use-cases are not possible in the absence of the triple mentioned in the ticket description?
From a query service perspective (owned by the Search Platform Team), because we are still addressing scalability issues we are generally very conservative when it comes to adding more triples to the RDF dumps.
Process-wise, the MediaInfo RDF schema is owned by the Structured Data team and is subject to the Stable Interface Policy.

Thu, Sep 7, 7:42 AM · Wikidata, Structured-Data-Backlog, Wikidata-Query-Service, WikibaseMediaInfo

Wed, Sep 6

dcausse claimed T342593: Five deleted Wikidata items pertaining to Wikimedia category pages still present in the Query Service.

Going to work on improving the tooling regarding reconciliation of missed deletes, but I won't be working on the root cause. I agree with @Milimetric here: we need to get a better sense of the quality of the EventBus/EventGate system; 5 (*identified*) missed events over 1 month on a stream that is relatively low volume seems concerning.

Wed, Sep 6, 1:55 PM · Event-Platform, Data-Engineering, Data Engineering and Event Platform Team, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
dcausse added a comment to T345188: Add Image: all wikis ran out of image recommendations.

The search indices have been updated.

Wed, Sep 6, 6:58 AM · Growth-Team (Current Sprint), GrowthExperiments-NewcomerTasks, Image-Suggestions
dcausse moved T345545: Search indices image suggestion tags differ from the dataset used to update from In Progress to Needs Reporting on the Discovery-Search (Current work) board.

Checked the search queries mentioned in the task description and they seem to return the expected number of results. Moving this to our board's "Needs Reporting" column, but please consider the parent task unblocked.

Wed, Sep 6, 6:56 AM · Discovery-Search (Current work), Section-Level-Image-Suggestions, Image-Suggestions, CirrusSearch

Tue, Sep 5

dcausse added a subtask for T317045: [Epic] Re-architect the Search Update Pipeline: T345638: [Search Update Pipeline] Add a way to configure a default http route.
Tue, Sep 5, 2:58 PM · Discovery-Search (Current work), Epic
dcausse added a parent task for T345638: [Search Update Pipeline] Add a way to configure a default http route: T317045: [Epic] Re-architect the Search Update Pipeline.
Tue, Sep 5, 2:58 PM · Discovery-Search, CirrusSearch
dcausse created T345638: [Search Update Pipeline] Add a way to configure a default http route.
Tue, Sep 5, 2:55 PM · Discovery-Search, CirrusSearch
dcausse added a subtask for T317045: [Epic] Re-architect the Search Update Pipeline: T345634: [Search Update Pipeline] Add a way to filter input events per wiki.
Tue, Sep 5, 2:24 PM · Discovery-Search (Current work), Epic
dcausse added a parent task for T345634: [Search Update Pipeline] Add a way to filter input events per wiki: T317045: [Epic] Re-architect the Search Update Pipeline.
Tue, Sep 5, 2:24 PM · Discovery-Search (Current work), CirrusSearch
dcausse created T345634: [Search Update Pipeline] Add a way to filter input events per wiki.
Tue, Sep 5, 2:24 PM · Discovery-Search (Current work), CirrusSearch

Mon, Sep 4

dcausse merged T345510: Inconsistent search results for same term across de/en/fr Wikipedias into T343148: Relax 'AND' operator in search queries.
Mon, Sep 4, 5:01 PM · Discovery-Search, CirrusSearch
dcausse merged task T345510: Inconsistent search results for same term across de/en/fr Wikipedias into T343148: Relax 'AND' operator in search queries.
Mon, Sep 4, 5:01 PM · Discovery-Search (Current work), CirrusSearch
dcausse added a comment to T345510: Inconsistent search results for same term across de/en/fr Wikipedias.

Searching for "Moritz M. Daffinger" does not yield the expected page on the English and German Wikipedias because the term "M" is not part of the page text of en:Moritz_Michael_Daffinger nor de:Moritz_Daffinger.
CirrusSearch is configured in such a way that it requires all the query terms to appear in the page for it to be found.
For French, the reason the page is found is that the term "m" alone is considered a stop word and thus not required.

Mon, Sep 4, 5:01 PM · Discovery-Search (Current work), CirrusSearch
dcausse added a comment to T345141: No ALIS for 2023-08-14 snapshot.

@mfossati could you update the recovery plan to reflect what is currently happening? As far as I understand, the "let 2023-08-21's production run compute the proper delta" step did not run as expected, since you are requesting to import the full dataset in T345545. Could you also elaborate on what went wrong that could explain why the delta is not correct?

Mon, Sep 4, 1:43 PM · Image-Suggestions, Structured-Data-Backlog (Current Work)
dcausse edited projects for T345545: Search indices image suggestion tags differ from the dataset used to update, added: Discovery-Search (Current work); removed Discovery-Search.
Mon, Sep 4, 9:48 AM · Discovery-Search (Current work), Section-Level-Image-Suggestions, Image-Suggestions, CirrusSearch

Aug 11 2023

dcausse created P50499 dcausse ssh config.
Aug 11 2023, 2:21 PM

Aug 10 2023

dcausse moved T300793: Refactor CirrusSearch WebdriverIO tests from sync to async mode from In Progress to Needs Reporting on the Discovery-Search (Current work) board.
Aug 10 2023, 8:21 PM · MW-1.41-notes (1.41.0-wmf.22; 2023-08-15), Discovery-Search (Current work), User-zeljkofilipin, CirrusSearch
dcausse updated the task description for T256626: Refactor WebdriverIO tests from sync to async mode.
Aug 10 2023, 8:20 PM · MW-1.41-notes (1.41.0-wmf.22; 2023-08-15), Epic, WMDE-TechWish-Maintenance-2023, Quality-and-Test-Engineering-Team (QTE) (Test Infrastructure), User-pwangai, User-vaughnwalters, MediaWiki-Core-Tests, Browser-Tests, Outreachy (Round 23), User-zeljkofilipin
dcausse added a comment to T342123: [Analytics] Find out the size of the Q13442814 (scholarly article) subgraph (including instances of subclasses).

Minor question on this, @dcausse: why aren't we caching df_wikidata_rdf and sa_and_sasc_ids above? My assumption is that we should, given that we're using them in multiple later calculations, but when I tried to cache them, a calculation that normally would finish lost resources and stalled with three separate stages running. Did you explicitly choose not to cache them, and if so, why not? :)

Aug 10 2023, 7:23 AM · Wikidata Analytics (Kanban), Wikidata

Aug 9 2023

dcausse added a comment to T342123: [Analytics] Find out the size of the Q13442814 (scholarly article) subgraph (including instances of subclasses).

At a glance I suspect that now you might get duplicated QIDs in

sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .alias("sa_and_sasc_ids")
)

Which could be explained by entities being tagged with multiple entries found in sa_and_sasc_qids.
What happens if you apply a distinct here:

sa_and_sasc_ids = (
    df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
    .where(col("predicate") == P31_DIRECT_URL)
    .where(col("object").isin(sa_and_sasc_qids))
    .distinct()
    .alias("sa_and_sasc_ids")
)
Aug 9 2023, 1:08 PM · Wikidata Analytics (Kanban), Wikidata

Aug 8 2023

dcausse added projects to T342593: Five deleted Wikidata items pertaining to Wikimedia category pages still present in the Query Service: Data Engineering and Event Platform Team, Data-Engineering, Event-Platform.
Aug 8 2023, 7:26 AM · Event-Platform, Data-Engineering, Data Engineering and Event Platform Team, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Aug 7 2023

dcausse added a comment to T342593: Five deleted Wikidata items pertaining to Wikimedia category pages still present in the Query Service.
item        | deletion date
Q10813441   | 2023-06-06T14:21:49
Q32994683   | 2023-05-03T16:04:27
Q55929561   | 2023-05-31T20:24:58
Q109548562  | 2023-06-06T14:22:29
Q111436860  | 2023-05-04T06:54:56
Aug 7 2023, 4:40 PM · Event-Platform, Data-Engineering, Data Engineering and Event Platform Team, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Aug 4 2023

dcausse added a comment to T342111: [Analytics] Find out the size of direct instances of Q13442814 (scholarly article).

I suspect that because the claims field is an array of complex types it can potentially be huge, and asking to generate its string representation using f"{claims}" might cause excessive memory usage and is, I believe, a very slow operation.
I would look into ways to avoid having to serialize it as a string: either iterate over the object representation (a Row, I suspect) to do your filtering, or possibly ask hive to do a lateral view on the mainSnak (which is, I believe, what you're looking for?):

select id, claims_ex.mainSnak.property, claims_ex.mainSnak.dataValue.value
from wmf.wikidata_entity lateral view explode(claims) claims_explode as claims_ex
where snapshot = '2023-07-24' AND claims_ex.mainSnak.property = 'P31' limit 10;
OK
id	property	value
Q38488724	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q37619467	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q38738598	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q37797268	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q38708632	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q37781259	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q39051969	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q37373175	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}
Q38327391	P31	{"entity-type":"item","numeric-id":5,"id":"Q5"}
Q37598817	P31	{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}

Then adding yet another filter on claims_ex.mainSnak.dataValue.value = '{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}' should work.
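For completeness, a rough PySpark equivalent of the HiveQL above (a sketch only: it assumes an existing SparkSession named spark and the same column layout, with the final equality filter being the one mentioned just above):

# Hedged sketch: explode the claims array instead of serializing it to a string,
# then keep only direct P31 = Q13442814 claims.
from pyspark.sql.functions import col, explode

wikidata = spark.table("wmf.wikidata_entity").where(col("snapshot") == "2023-07-24")

scholarly_article_ids = (
    wikidata
    .select("id", explode("claims").alias("claim"))  # one row per claim, no f"{claims}" needed
    .where(col("claim.mainSnak.property") == "P31")
    .where(col("claim.mainSnak.dataValue.value")
           == '{"entity-type":"item","numeric-id":13442814,"id":"Q13442814"}')
    .select("id")
    .distinct()
)
scholarly_article_ids.count()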

Aug 4 2023, 4:26 PM · Wikidata Analytics (Kanban), Wikidata

Aug 3 2023

dcausse added a comment to T342111: [Analytics] Find out the size of direct instances of Q13442814 (scholarly article).

@AndrewTavis_WMDE sure! I'll send you an invite for next Monday; in the meantime, could you share your notebook somewhere so that I can take a look before the call?

Aug 3 2023, 5:12 PM · Wikidata Analytics (Kanban), Wikidata
dcausse added a comment to T342111: [Analytics] Find out the size of direct instances of Q13442814 (scholarly article).

@AndrewTavis_WMDE thanks! This is really exciting, we couldn't hope for better results... it's almost a 50-50 split. And on top of that, if I read your results correctly, we only have 197,583 common triples (7,521,423,558 - 7,521,225,975) that will have to be duplicated in both subgraphs.

Aug 3 2023, 4:53 PM · Wikidata Analytics (Kanban), Wikidata
dcausse added a comment to T328330: Create SLI / SLO on Search update lag and error rate.

Early dashboard for inconsistencies: https://superset.wikimedia.org/superset/dashboard/451/?native_filters_key=FNd9a7DWn2Q11F_72h3t5CO2dt2DwwhiIwyUdKIvokscWf8NOjlD4Vf9ltjfBOZn

Aug 3 2023, 9:45 AM · Patch-For-Review, Discovery-Search (Current work)

Aug 2 2023

dcausse added a comment to T342111: [Analytics] Find out the size of direct instances of Q13442814 (scholarly article).
  • <http://www.wikidata.org/entity/>, prefix wd: this is the concept URI of an entity, and is generally how an entity (whether it's a property, item or lexeme) is identified, e.g. Q42 is identified as wd:Q42 -> <http://www.wikidata.org/entity/Q42>. This is the form used to link items to their statements and other constituents (can be seen as a subject or an object).
  • <http://www.wikidata.org/entity/statement/>, prefix s in the dumps and Special:EntityData, wds in WDQS: these identify a wikibase statement, e.g. wds:q42-D8404CDA-25E4-4334-AF13-A3290BCD9C0F identifies the date of birth statement of Q42 (can be seen as a subject or an object).
  • <http://www.wikidata.org/prop/statement/>, prefix ps: this is what actually links the statement ID defined above to its simple value form, so the actual date of birth of Q42 is stored in the triple wds:q42-D8404CDA-25E4-4334-AF13-A3290BCD9C0F ps:P569 ?dateOfBirth (can be seen only as a predicate). See the sketch below for how these fit together.
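A small sketch of how these three forms connect, using the discovery.wikibase_rdf table (subject/predicate/object columns as elsewhere in this ticket); the snapshot value, the SparkSession named spark, and the exact URI formatting stored in the table (angle brackets, casing) are assumptions:

# Hedged sketch: fetch the simple value of Q42's date of birth statement by filtering on
# the statement node (wds:...) as subject and ps:P569 as predicate.
from pyspark.sql.functions import col

rdf = spark.table("discovery.wikibase_rdf").where(col("snapshot") == "2023-07-24")

statement_node = "http://www.wikidata.org/entity/statement/q42-D8404CDA-25E4-4334-AF13-A3290BCD9C0F"
ps_p569 = "http://www.wikidata.org/prop/statement/P569"

(
    rdf.where(col("subject") == statement_node)
       .where(col("predicate") == ps_p569)
       .select("object")   # the plain date-of-birth value
       .show(truncate=False)
)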
Aug 2 2023, 1:14 PM · Wikidata Analytics (Kanban), Wikidata
dcausse added a comment to T343280: people1004 running out of disk space.

@LSobanski thanks for the ping, just cleaned up some files in my home dir

Aug 2 2023, 12:18 PM · collaboration-services
dcausse added a comment to T340185: The MW Rest API does not normalize its string request parameters.

@BPirkle thanks for the investigations! Your conclusions do seem correct to me. I'm not very knowledgeable about the MW API subsystems, but I'm with you and Gergo here: since the REST API is "relatively new", it might be safer to apply to all string params the same normalization steps that the action API does. In fact, in this case I believe we somehow got "lucky" that CirrusSearch was very picky (not very robust, to be more correct :) by failing on such utf8 sequences; I imagine that in other scenarios/endpoints the internal MW API could silently accept such strings, possibly update the database with them, and pollute our datastores with these "bad" utf8 chars and sequences.

Aug 2 2023, 8:45 AM · Platform Engineering, MediaWiki-REST-API

Jul 31 2023

dcausse added a comment to T342111: [Analytics] Find out the size of direct instances of Q13442814 (scholarly article).

Also for everyone's information, the number of duplicate triple values in discovery.wikibase_rdf is very, very small, as seen in the following snippet/output:

percent_repeat_triples = round((1 - (total_distinct_triples / total_rows)) * 100, 4)
# (1 - (15,043,046,814 / 15,043,483,216)) * 100
percent_repeat_triples
# 0.0029
Jul 31 2023, 2:09 PM · Wikidata Analytics (Kanban), Wikidata
dcausse updated subscribers of T342111: [Analytics] Find out the size of direct instances of Q13442814 (scholarly article).

@dcausse, a general point on my end: when I'm trying to run the code that you sent along via an HTML file on people.wikimedia.org, I'm getting the following output from Spark, repeated over and over again:

23/07/31 13:01:58 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 1 for reason Container killed by YARN for exceeding physical memory limits. 4.4 GB of 4.4 GB physical memory used. Consider boosting spark.executor.memoryOverhead.

This seems to be happening with your create_custom_session setup, and doesn't happen when I do a normal create_session as seen below:

spark_session = wmf.spark.create_session(type='yarn-large', app_name="wdqs-subgraph-analysis")

Would you be able to let me know if there's something in my permissions or setup that's causing this? I'm assuming that your setup will make queries faster, but we can disregard if my working setup gets me mostly there. I'm running Jupyter on stat1005, and saw that AKhatun was using stat1008, in case that's helpful information :)
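(For what it's worth, the warning above usually points at spark.executor.memoryOverhead; a hedged sketch of how a custom session could be built with more overhead follows. The exact keyword arguments accepted by wmfdata's create_custom_session may differ between versions, and the specific values are illustrative only.)

# Hedged sketch: build a Spark session with a larger executor memory overhead so YARN
# does not kill executors for exceeding their physical memory limit.
import wmfdata as wmf

spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name="wdqs-subgraph-analysis",
    spark_config={
        "spark.executor.memory": "8g",
        # headroom for off-heap/native memory, which is what YARN was killing for
        "spark.executor.memoryOverhead": "2g",
        "spark.dynamicAllocation.maxExecutors": 64,
    },
)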

Jul 31 2023, 2:08 PM · Wikidata Analytics (Kanban), Wikidata
dcausse added a comment to T342111: [Analytics] Find out the size of direct instances of Q13442814 (scholarly article).

Hi @dcausse, thank you so much, this is very helpful! \o/

I believe that at first we are interested in knowing the number of triples that would be moved out

The "number of triples that would be moved out" seems to be the primary metric of interest for the Blazegraph split. But after your explanation of the table, I now realize that this metric is not equal to the number of rows that are required to represent these triples in the table, correct? So could you quickly confirm that the "number of triples that would be moved out" (distinct triples) is actually the preferable metric for our purposes (and not, e.g., the "number of rows that would be moved out")?

The table wikibase_rdf does have one row per triple, plus an additional column named context that we use to annotate the entity the triple was extracted from while reading the dump, with the caveat of shared values and references, which have <http://wikiba.se/ontology#Value> and <http://wikiba.se/ontology#Reference> set as their context column, respectively.
It is true that duplicates are in there, and a select count(*) from wikibase_rdf will give a number greater than the number of triples stored in blazegraph.
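To make that distinction concrete, a small sketch of the two counts (assuming a SparkSession named spark and a snapshot partition; the variable names mirror the percent_repeat_triples snippet elsewhere on this page):

# Hedged sketch: rows vs. distinct (subject, predicate, object) triples in wikibase_rdf.
from pyspark.sql.functions import col

rdf = spark.table("discovery.wikibase_rdf").where(col("snapshot") == "2023-07-24")

total_rows = rdf.count()
total_distinct_triples = rdf.select("subject", "predicate", "object").distinct().count()

# the duplicate fraction, as computed in the snippet above
round((1 - (total_distinct_triples / total_rows)) * 100, 4)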

Jul 31 2023, 9:53 AM · Wikidata Analytics (Kanban), Wikidata
dcausse moved T337801: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater from In Progress to Needs review on the Discovery-Search (Current work) board.

Added a few notes at: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#Running_from_YARN

Jul 31 2023, 9:26 AM · Discovery-Search (Current work), SRE-OnFire, Sustainability
dcausse added a comment to T341227: Make local_sites_with_dupe filter configurable and count duplicates.

@dcausse, would you have a preference on where to implement the deduplication? I have identified two options so far:

  • a) inside \CirrusSearch\Search\BaseCirrusSearchResultSet::extractResults
  • b) in a dedicated subclass of \CirrusSearch\Search\BaseCirrusSearchResultSet, DeduplicatedCirrusSearchResultSet, that is used instead of the dynamic instance in \CirrusSearch\Search\FullTextResultsType::transformElasticsearchResult (which has a comment // Should we make this a concrete class? anyways)
Jul 31 2023, 7:31 AM · MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), Discovery-Search (Current work), CirrusSearch

Jul 28 2023

dcausse added a comment to T342111: [Analytics] Find out the size of direct instances of Q13442814 (scholarly article).

I believe that at first we are interested in knowing the number of triples that would be moved out if all items that verify the condition ?s wdt:P31 Q13442814 are moved out with all the triples belonging to these items.
The triples that belong to an entity (e.g. Q1895685) are the ones visible via https://www.wikidata.org/wiki/Special:EntityData/Q1895685.ttl?flavor=dump, with the additional complexity of shared values and references, which have to be treated separately because they might be shared by other entities.
There you'll notice that, for instance, the triple s:Q1895685-9a482323-4d57-acf2-b6b7-bc36d578bd57 ps:P478 "171" does not reference the QID of the paper, but this triple must be counted as well.
This makes knowing which triples belong to ?s a bit tricky, but we could leverage the structure of the wikibase_rdf table for this (see the sketch after this list):

  • first count the number of triples that are not shared with other entities using the context column (this column is not data available in WDQS, but it can help to group the triples by entity)
  • then count the number of triples attached to shared references and values; here we should also count the ones that are shared between S and not-S, because these will have to be duplicated in both graphs
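A rough sketch of those two steps against discovery.wikibase_rdf (the SparkSession spark, the snapshot, the scholarly_entities DataFrame with an entity_uri column, and the URI formatting are all assumptions):

# Hedged sketch: (1) triples grouped under the selected entities via the context column,
# (2) triples attached to shared value/reference nodes. A faithful version of step 2
# would first trace which shared nodes are reachable from each subgraph.
from pyspark.sql.functions import col

rdf = spark.table("discovery.wikibase_rdf").where(col("snapshot") == "2023-07-24")

shared_contexts = [
    "http://wikiba.se/ontology#Value",
    "http://wikiba.se/ontology#Reference",
]

direct_triples = rdf.join(
    scholarly_entities,                       # hypothetical: concept URIs of ?s wdt:P31 Q13442814
    rdf["context"] == scholarly_entities["entity_uri"],
    "left_semi",
).count()

shared_triples = rdf.where(col("context").isin(shared_contexts)).count()

print(direct_triples, shared_triples)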
Jul 28 2023, 4:25 PM · Wikidata Analytics (Kanban), Wikidata
dcausse closed T342924: Search on wikifunctions.org results in a cirrussearch-backend-error and no results, a subtask of T342820: Migrate wikifunctions.org from locked-down to limited mode, letting users edit wikitext pages and some, as Resolved.
Jul 28 2023, 10:19 AM · Epic, Abstract Wikipedia team, Wikifunctions
dcausse closed T342924: Search on wikifunctions.org results in a cirrussearch-backend-error and no results as Resolved.

Should be resolved.

Jul 28 2023, 10:19 AM · Discovery-Search, CirrusSearch, Abstract Wikipedia team, Wikifunctions

Jul 27 2023

dcausse committed rDPOM87ae983b7ffc: Use maven-resources-plugin 3.3.1 (authored by dcausse).
Use maven-resources-plugin 3.3.1
Jul 27 2023, 9:05 AM
dcausse added a comment to T342762: 404 from nginx on wcqs2001.

There was a stale /srv/query_service/aliases.map file with some content in it (which I copied to /root/aliases.map.T342762) that I believe was confusing nginx, causing it to replace the "wcq" part of the URI with wcqs20220915, which does not exist. Emptying this file seems to have fixed the readiness probe.

Jul 27 2023, 8:43 AM · sre-alert-triage, Data-Platform-SRE
dcausse closed T342744: CirrusSearch\Profile\SearchProfileException: Cannot load a profile type rescore_function_chains: growth_underlinked_chain not found as Resolved.
Jul 27 2023, 8:13 AM · MW-1.41-notes (1.41.0-wmf.19; 2023-07-25), Growth-Team (Current Sprint), GrowthExperiments, Discovery-Search, CirrusSearch, Wikimedia-production-error

Jul 26 2023

dcausse added a comment to T342744: CirrusSearch\Profile\SearchProfileException: Cannot load a profile type rescore_function_chains: growth_underlinked_chain not found.

This was initially caused by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/940254, where I started to inject the ExtensionRegistry required to load some search profiles that the GrowthExperiments extension declares. Sadly I forgot to pull the profiles for the "rescore_function_chains" component, and this is what caused the regression. The attached patch should solve the problem, sorry about that.

Jul 26 2023, 2:38 PM · MW-1.41-notes (1.41.0-wmf.19; 2023-07-25), Growth-Team (Current Sprint), GrowthExperiments, Discovery-Search, CirrusSearch, Wikimedia-production-error
dcausse added a comment to T342744: CirrusSearch\Profile\SearchProfileException: Cannot load a profile type rescore_function_chains: growth_underlinked_chain not found.

looking into it

Jul 26 2023, 2:25 PM · MW-1.41-notes (1.41.0-wmf.19; 2023-07-25), Growth-Team (Current Sprint), GrowthExperiments, CirrusSearch, Discovery-Search, Wikimedia-production-error
dcausse moved T300793: Refactor CirrusSearch WebdriverIO tests from sync to async mode from watching / waiting to needs triage on the Discovery-Search board.

Moving back to needs triage as this is starting to block other teams.

Jul 26 2023, 1:38 PM · MW-1.41-notes (1.41.0-wmf.22; 2023-08-15), Discovery-Search (Current work), User-zeljkofilipin, CirrusSearch

Jul 25 2023

dcausse updated the task description for T342620: Storage request: swift s3 bucket for flink search-update-pipeline checkpointing.
Jul 25 2023, 1:22 PM · Data-Platform-SRE, Discovery-Search (Current work), SRE-swift-storage, Data-Persistence

Jul 24 2023

dcausse created T342562: File uploads do not appear to get indexed properly.
Jul 24 2023, 5:05 PM · MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), Discovery-Search (Current work), CirrusSearch