User Details
- User Since: Jun 9 2015, 9:03 AM (470 w, 14 h)
- Availability: Available
- IRC Nick: dcausse
- LDAP User: DCausse
- MediaWiki User: DCausse (WMF) [ Global Accounts ]
Today
Triggered a reindex of all the lexemes using https://gitlab.wikimedia.org/repos/search-platform/cirrus-rerender; it might take about 3 hours to complete.
Yesterday
Thu, Jun 6
@RKemper for testing I created a smaller folder at hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/. It has only two chunks, so I hope it will help iterate a bit faster on this. The command should become:
cookbook sre.wdqs.data-reload \
    --task-id T349069 \
    --reason "Test wdqs reload based on HDFS" \
    --reload-data wikidata_full \
    --from-hdfs hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ \
    --stat-host stat1009.eqiad.wmnet \
    wdqs_host
Tue, Jun 4
Mon, Jun 3
Yes (all the images under docker-registry.wikimedia.org/wikimedia/wikidata-query-flink-rdf-streaming-updater should no longer be used and can be safely removed if needed)
Sorry to see this happening again; it is probable that we missed some edge cases when deploying T317045.
Fri, May 31
Thu, May 30
Hi, we might have a use case related to "other dumps" that could benefit from the Dumps 2.0 infrastructure. I filed T366248 with some details about it.
Wed, May 29
The system should now index lexemes properly.
We still have to reindex all the lexemes to fix the ones created/edited before the fix was applied.
@BTullis thanks! Categories are reloaded via a cronjob on all WDQS machines; the job is due to run in about 30 minutes.
Tue, May 28
Output with:
cirrus = (spark.table("discovery.cirrus_index").where('cirrus_replica="codfw" AND snapshot="20240428"'))
The search fields specific to Lexemes are currently ignored, causing this NOTICE but also preventing lexemes from being searchable (especially the new ones).
The schemas should be adapted to support these fields and the lexemes will have to be re-indexed.
@achou except for expert search users explicitly searching for topics (which I suspect are rare), the Growth team is the only team using this data in a user-facing product. It is hard to tell what the impact would be for them, but I suspect that if only a few (<100) are lost it would hardly impact anything. If you suspect that more might be lost, having duplicates is perhaps better, if that is an option for you.
Thu, May 23
Thu, May 16
Wed, May 15
Tue, May 14
Mon, May 13
May 7 2024
May 6 2024
Possible options I see so far:
- Run hdfs-rsync directly from the blazegraph hosts
  - cons: requires installing its dependencies
  - cons: opens a hole between blazegraph and the hadoop cluster
- Schedule hdfs-rsync on a stat machine copying the ttl dumps from hdfs to /srv/analytics-search/wikibase_processed_dumps/wikidata/$SNAPSHOT
  - cons: consumes some space on a stat machine
- Run hdfs-rsync on-demand to copy the ttl dump from hdfs to /srv/analytics-search/wikibase_processed_dumps/temp and clean up this folder once done (a rough sketch follows below)
  - cons: slows down the process a bit
Another approach could be to use the /mnt/hdfs mountpoint? I have been told that it might not be stable enough but perhaps it's OK for doing a copy?
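To make the on-demand option concrete, here is a rough sketch of the copy-then-cleanup step, assuming the plain hdfs dfs CLI is available on the stat host; the HDFS source path below is a placeholder, only the /srv temp folder comes from the option above:

```
# Hypothetical sketch of the on-demand option: copy the ttl dump chunks from
# HDFS to a temp folder on the stat host, and remove the folder once the
# transfer is done. The HDFS source path is a placeholder.
import shutil
import subprocess
from pathlib import Path

HDFS_TTL_DUMP = "hdfs:///path/to/ttl/dump/SNAPSHOT"  # placeholder source path
TEMP_DIR = Path("/srv/analytics-search/wikibase_processed_dumps/temp")

def copy_dump_from_hdfs() -> None:
    TEMP_DIR.mkdir(parents=True, exist_ok=True)
    # `hdfs dfs -get` downloads the dump chunks to the local filesystem.
    subprocess.run(["hdfs", "dfs", "-get", HDFS_TTL_DUMP, str(TEMP_DIR)], check=True)

def cleanup_temp_dir() -> None:
    # Free the space on the stat machine once the data has been shipped out.
    shutil.rmtree(TEMP_DIR, ignore_errors=True)
```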
May 3 2024
Looking at the constraints, I believe that 4 of them may use SPARQL:
- FormatChecker.php
- TypeChecker.php
- UniqueValueChecker.php
- ValueTypeChecker.php
May 2 2024
@BTullis @bking I plan to use a cookbook to transfer some data out of hdfs to blazegraph machines. A naive approach I thought about was to use a temp folder somewhere in /srv of a stat100x machine, populated using hdfs dfs or hdfs-rsync, and then re-use the transferpy python module.
The current dumps are about 200G; do you think that this option is viable? Can we use a folder in /srv as a temp folder for such transfers? This data is only useful for the transfer and should be deleted by the cookbook when it ends.
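For the transferpy leg of that plan, a rough sketch of what the transfer step might look like (not the actual cookbook), assuming transferpy's transfer.py CLI takes host:path source and destination arguments; the wdqs host and destination path are placeholders:

```
# Hypothetical sketch of the transfer step: ship the temp copy from the stat
# host to a blazegraph host. The transfer.py invocation, the wdqs hostname and
# the destination path are assumptions, not the real cookbook code.
import subprocess

STAT_HOST = "stat1009.eqiad.wmnet"
TEMP_DIR = "/srv/analytics-search/wikibase_processed_dumps/temp"
WDQS_HOST = "wdqs1010.eqiad.wmnet"  # placeholder blazegraph host
DEST_DIR = "/srv/wdqs-reload"       # placeholder destination folder

subprocess.run(
    ["transfer.py", f"{STAT_HOST}:{TEMP_DIR}", f"{WDQS_HOST}:{DEST_DIR}"],
    check=True,
)
# The cookbook would then delete TEMP_DIR on the stat host, as described above.
```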
Apr 30 2024
Apr 29 2024
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1024698 switches from using a scroll to a search_after approach, which should be more robust by handling retries and errors properly.
The question is whether we should do more by adding more checks. Unfortunately not all wikis build a new index and promote it; to optimize cluster operations, most wikis recycle the same index, so we don't have a chance to do such sanity checks prior to promoting.
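For reference, a minimal sketch of the search_after pattern the patch moves to; the endpoint, index name and sort field are illustrative, not the CirrusSearch code:

```
# Minimal sketch of search_after pagination; endpoint, index name and the
# unique sort field are illustrative, not the CirrusSearch implementation.
import requests

ES_URL = "http://localhost:9200"  # placeholder cluster endpoint
INDEX = "somewiki_content"        # placeholder index name

def scan_all(page_size=1000):
    search_after = None
    while True:
        body = {
            "size": page_size,
            "query": {"match_all": {}},
            "sort": [{"page_id": "asc"}],  # any total order on a unique field
        }
        if search_after is not None:
            body["search_after"] = search_after
        resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=body, timeout=30)
        resp.raise_for_status()
        hits = resp.json()["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Unlike a scroll cursor, the next page only depends on the last sort
        # values, so a failed request can simply be retried from this point.
        search_after = hits[-1]["sort"]
```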
Closing, the issue is tracked at T363516.
Apr 26 2024
tagging @serviceops for help regarding the connectivity issue and this new "delayed connect error: 113" error
completion traffic is now served from codfw, which has proper indices; lowering prio
This is still happening, raising to UBN
The errors "delayed connect error: 113" seem to have started on apr 24 21:30 right after deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023937.
The errors affect both mw@wikikube and mwmaint1002 https://logstash.wikimedia.org/goto/5ac680b477389129ffb5ddf33fa09940
I think we should switch completion traffic to codfw while we work on a more resilient version of this maint script and also understand why we get these errors.
Apr 24 2024
Quick update: a fix was deployed two weeks ago (T359580#9699108) to stop pushing these late events.
Apr 23 2024
Apr 19 2024
I think there are two issues to be discussed here: defining qualitative requirements, and how to repair inconsistencies.
Regarding qualitative requirements, for search and WDQS we don't have a good sense of what would be good enough. The only visible criterion we have at the moment is users complaining about stale data, but without a concrete measurement of the inconsistency it is hard to define a number, I guess. Could we go the other way around and start by measuring how consistent the streams are compared to the source of truth? Could this be done for some important streams like revision-create/page-delete/page-undelete/page-state, by applying techniques similar to the one used in T215001#7523796? It is probable that missed events are rare in normal conditions, but I still see huge spikes in the logs with many events failing to reach EventGate (T362977); could there be ways to improve the situation at a reasonable cost?
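As one possible way to start measuring that consistency, a sketch along the lines of the technique mentioned above; table and column names are assumptions based on the Data Lake layout, not a vetted query:

```
# Hypothetical completeness check: revisions recorded in the mediawiki_history
# source of truth that have no matching revision-create event for one month.
# Table and column names are assumptions, not a vetted query.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

history = (
    spark.table("wmf.mediawiki_history")
    .where("snapshot = '2024-03' AND event_entity = 'revision' AND event_type = 'create'")
    .where("event_timestamp >= '2024-03-01' AND event_timestamp < '2024-04-01'")
    .select(F.col("revision_id").alias("rev_id"))
)
events = (
    spark.table("event.mediawiki_revision_create")
    .where("year = 2024 AND month = 3")
    .select("rev_id")
)
missing = history.join(events, on="rev_id", how="left_anti")
print(missing.count())  # rough count of revision-create events that never arrived
```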
Apr 16 2024
Apr 15 2024
Apr 9 2024
Apr 8 2024
Reopening since it seems some of these hosts are still mentioned somewhere. The elastic settings check is complaining with:
CRITICAL - ['elastic2047.codfw.wmnet:9500', 'elastic2052.codfw.wmnet:9500', 'elastic2073.codfw.wmnet:9500', 'elastic2086.codfw.wmnet:9500', 'elastic2092.codfw.wmnet:9500', 'elastic2100.codfw.wmnet:9500', 'elastic2106.codfw.wmnet:9500']
does not match
['elastic2073.codfw.wmnet:9500', 'elastic2086.codfw.wmnet:9500', 'elastic2092.codfw.wmnet:9500', 'elastic2100.codfw.wmnet:9500', 'elastic2106.codfw.wmnet:9500']
Apr 5 2024
Two scholia queries were rewritten:
- https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples#Property_paths
- https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples#Number_of_articles_with_CiTO-annotated_citations_by_year
The pages also contain some documentation about how to approach such rewrites.
I'm boldly moving this ticket to our Needs Reporting column (prior to being closed) as I believe further explorations about how to rewrite scholia queries to support the split could perhaps be better handled in https://github.com/WDscholia/scholia.
Apr 4 2024
Thanks! I'm not very familiar with alerts being set from grafana either; I'll try to get more info on this. Worst case, we can always set up a new one directly in alertmanager just for the wdqs lag and send it to the search team, using the same formula used by updateQueryServiceLag.php.
@Lucas_Werkmeister_WMDE thanks! Do you know where we could update this to include our alert email for such alerts?
According to @Urbanecm_WMF these queries are probably emitted while running https://github.com/wikimedia/mediawiki-extensions-GrowthExperiments/blob/master/maintenance/refreshLinkRecommendations.php.
Discussing possible fixes: it would be ideal if cirrus could detect that it is being run via a maint script and possibly call something like disablePoolCountersAndLogging, but perhaps without disabling statsd, since the user script might require stats to be emitted.
Apr 3 2024
Should be working properly now
Mar 29 2024
won't be required after all