Gehel (Guillaume Lederrey)
Operations Engineer - Discovery

User Details

User Since
Nov 9 2015, 9:18 PM (157 w, 2 d)
Availability
Available
IRC Nick
gehel
LDAP User
Gehel
MediaWiki User
GLederrey (WMF)

Recent Activity

Tue, Nov 13

Gehel closed T204240: Cleanup rspec tests for tilerator and wdqs puppet modules as Resolved.
Tue, Nov 13, 6:30 PM · Patch-For-Review, Discovery-Search
Gehel moved T209030: Fix elasticsearch_hot_threads on relforge (ImportError: No module named 'yaml') from Needs review to Done on the Discovery-Search (Current work) board.

this has been fixed already

Tue, Nov 13, 6:29 PM · Discovery-Search (Current work)
Gehel moved T209257: Refactor wdqs::gui - Separate cron tasks from the module from In progress to Done on the Discovery-Search (Current work) board.
Tue, Nov 13, 6:20 PM · Patch-For-Review, Operations, Wikidata-Query-Service, Discovery-Search (Current work), Wikidata

Fri, Nov 9

Gehel added a comment to T207046: Code health metrics spike.

To demo SonarQube for the working group, I recommend setting one up on a WMCS instance.

Fri, Nov 9, 5:17 PM · Patch-For-Review, User-zeljkofilipin, Release-Engineering-Team (Kanban), Code-Health-Metrics
Gehel committed rDPOMfa297d152aee: Tune default sonar-maven-plugin. (authored by Gehel).
Tune default sonar-maven-plugin.
Fri, Nov 9, 3:03 PM
Gehel committed rDPOM822282c6aa8a: [maven-release-plugin] prepare for next development iteration (authored by Gehel).
[maven-release-plugin] prepare for next development iteration
Fri, Nov 9, 3:03 PM
Gehel committed rDPOM89caf8b6424d: Tune default sonar-maven-plugin. (authored by Gehel).
Tune default sonar-maven-plugin.
Fri, Nov 9, 3:03 PM
Gehel committed rDPOM8b59f3861c91: [maven-release-plugin] prepare release discovery-parent-pom-1.22 (authored by Gehel).
[maven-release-plugin] prepare release discovery-parent-pom-1.22
Fri, Nov 9, 3:03 PM
Gehel claimed T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it.
Fri, Nov 9, 9:24 AM · Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Gehel moved T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it from Backlog to In progress on the Discovery-Wikidata-Query-Service-Sprint board.
Fri, Nov 9, 9:23 AM · Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Gehel moved T208956: Wikidata Query Service Throttling Filter state should be sized based on real usage from Backlog to In progress on the Discovery-Wikidata-Query-Service-Sprint board.
Fri, Nov 9, 9:19 AM · Patch-For-Review, Discovery-Wikidata-Query-Service-Sprint

Thu, Nov 8

Gehel added a comment to T208496: search platform maven projects failing post merge build.

The build succeeded. The new docs are published at https://doc.wikimedia.org/search-highlighter/experimental/, the publication date has been updated, and the site looks fine (minus the usual broken links).

Thu, Nov 8, 5:17 PM · Patch-For-Review, Release-Engineering-Team, Continuous-Integration-Config, Discovery-Search (Current work)

Wed, Nov 7

Gehel triaged T208956: Wikidata Query Service Throttling Filter state should be sized based on real usage as Normal priority.
Wed, Nov 7, 3:25 PM · Patch-For-Review, Discovery-Wikidata-Query-Service-Sprint
Gehel created T208956: Wikidata Query Service Throttling Filter state should be sized based on real usage.
Wed, Nov 7, 3:25 PM · Patch-For-Review, Discovery-Wikidata-Query-Service-Sprint
Gehel moved T207834: Cleanup Wikidata Query Service logging configuration from Done to Needs review on the Discovery-Wikidata-Query-Service-Sprint board.
Wed, Nov 7, 2:34 PM · Discovery-Wikidata-Query-Service-Sprint, Wikimedia-Incident, Patch-For-Review, Operations, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
Gehel moved T207834: Cleanup Wikidata Query Service logging configuration from Needs review to Done on the Discovery-Wikidata-Query-Service-Sprint board.

@Smalyshev we're good on this from my point of view. Could you check that running updater manually (with the -S option to output to console, or with -v for verbose) works as you expect?

Wed, Nov 7, 2:34 PM · Discovery-Wikidata-Query-Service-Sprint, Wikimedia-Incident, Patch-For-Review, Operations, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
Gehel created T208938: Use maven wrapper (mvnw) to build maven based project from search platform team.
Wed, Nov 7, 9:33 AM · Discovery-Search (Current work), Release-Engineering-Team

Tue, Nov 6

Gehel moved T198352: Setup two elasticsearch clusters on relforge to test multi-instance from In progress to Needs review on the Discovery-Search (Current work) board.
Tue, Nov 6, 6:25 PM · Patch-For-Review, Discovery-Search (Current work)
Gehel claimed T208496: search platform maven projects failing post merge build.
Tue, Nov 6, 6:23 PM · Patch-For-Review, Release-Engineering-Team, Continuous-Integration-Config, Discovery-Search (Current work)

Mon, Nov 5

Gehel added a comment to T208533: Access for imarlier to wdqs servers.

Approved in weekly SRE meeting for full root access to wdqs servers

Mon, Nov 5, 5:53 PM · Patch-For-Review, Operations, SRE-Access-Requests
Gehel added a comment to T208496: search platform maven projects failing post merge build.

Why not drop the wrapper completely and just allow running an arbitrary command in the project directory? That way, we would be able to configure ./mvnw clean install && ./mvnw site site:stage (or whatever we need) and not have to care about transient state that needs to be propagated between containers. I'm not sure what value that wrapper brings, so I might be missing something.

Mon, Nov 5, 3:36 PM · Patch-For-Review, Release-Engineering-Team, Continuous-Integration-Config, Discovery-Search (Current work)

Fri, Nov 2

TJones awarded T208496: search platform maven projects failing post merge build a Like token.
Fri, Nov 2, 6:46 PM · Patch-For-Review, Release-Engineering-Team, Continuous-Integration-Config, Discovery-Search (Current work)
Gehel committed rDPOM146a2c7c5d61: Add basic configuration of pitest (authored by Gehel).
Add basic configuration of pitest
Fri, Nov 2, 4:07 PM
Gehel committed rDPOMa451c42585c6: Add basic configuration of pitest (authored by Gehel).
Add basic configuration of pitest
Fri, Nov 2, 4:07 PM
Gehel committed rDPOM79886e597080: [maven-release-plugin] prepare for next development iteration (authored by Gehel).
[maven-release-plugin] prepare for next development iteration
Fri, Nov 2, 4:07 PM
Gehel committed rDPOMad73fc0dceeb: [maven-release-plugin] prepare release discovery-parent-pom-1.21 (authored by Gehel).
[maven-release-plugin] prepare release discovery-parent-pom-1.21
Fri, Nov 2, 4:07 PM
Gehel committed rDPOM9f8056b48221: Add basic configuration of pitest (authored by Gehel).
Add basic configuration of pitest
Fri, Nov 2, 4:06 PM
Gehel removed a project from T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it: Patch-For-Review.
Fri, Nov 2, 3:12 PM · Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Pintoch awarded T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it a Like token.
Fri, Nov 2, 3:03 PM · Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Gehel added a comment to T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it.

The search interface can also be used for that thanks to the haswbstatement command. That only gets you one id per query, so it might not be suited for all tools. I don't know if the lag is lower in this interface.

Fri, Nov 2, 2:57 PM · Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Gehel added a comment to T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it.

  • Batch jobs. For example, there is an issue with the sourcemd tool at the moment. Essentially, it checks for a large number of scientific publications if they exist on Wikidata or not. If not, it creates them. Now, if people have accidentally listed the same paper twice, or if two different batch jobs check/create the same paper, then the first create is sometimes "invisible" due to SPARQL lag, and a duplicate item is created. This has apparently happened a lot in the last few days.

Fri, Nov 2, 2:38 PM · Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Gehel added a comment to T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it.

Thanks for the feedback!

Fri, Nov 2, 2:28 PM · Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Gehel added a comment to T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it.

What matters much more for this tool is getting quick results and as little downtime as possible - lag is not really a concern.

Fri, Nov 2, 1:45 PM · Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service

Thu, Nov 1

Gehel added a comment to T202764: Wikidata produces a lot of failed requests for recentchanges API.

For context, T202765 is about a bot sending annoying and somewhat expensive requests. That specific issue is now resolved.

Thu, Nov 1, 7:57 PM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), User-Addshore, Operations, Wikidata-Query-Service, Wikidata
Gehel added a comment to T208407: Check whether maps-team project requires NFS or not.

maps is a different project from maps-team. But we can still delete the maps-team project: we now have maps instances on deployment-prep, which makes more sense.

Thu, Nov 1, 2:53 PM · Patch-For-Review, Maps, Cloud-VPS
Gehel added a comment to T208407: Check whether maps-team project requires NFS or not.

Honestly, I'm not sure what the underlying issue is. What I can say is that there is no foreseeable need to have any NFS mount in current or future instances in the maps-project.

Thu, Nov 1, 2:43 PM · Patch-For-Review, Maps, Cloud-VPS
Gehel assigned T208407: Check whether maps-team project requires NFS or not to Krenair.

At this point, there are no live instances in the maps-team project, and no immediate need to use NFS. So feel free to remove it; we'll ask for it again if the need arises.

Thu, Nov 1, 2:41 PM · Patch-For-Review, Maps, Cloud-VPS
Gehel updated subscribers of T208496: search platform maven projects failing post merge build.

That change is probably reasonably easy to implement in the JJB templates, but I'm lost there. I hope @hashar can provide some support.

Thu, Nov 1, 1:25 PM · Patch-For-Review, Release-Engineering-Team, Continuous-Integration-Config, Discovery-Search (Current work)
Gehel triaged T208496: search platform maven projects failing post merge build as Normal priority.
Thu, Nov 1, 1:24 PM · Patch-For-Review, Release-Engineering-Team, Continuous-Integration-Config, Discovery-Search (Current work)
Gehel created T208496: search platform maven projects failing post merge build.
Thu, Nov 1, 1:22 PM · Patch-For-Review, Release-Engineering-Team, Continuous-Integration-Config, Discovery-Search (Current work)

Tue, Oct 30

Gehel moved T207843: increase restart interval of wdqs updater from In progress to Done on the Discovery-Search (Current work) board.
Tue, Oct 30, 5:36 PM · Discovery-Wikidata-Query-Service-Sprint, Wikimedia-Incident, Patch-For-Review, Operations, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
Gehel added a comment to T207834: Cleanup Wikidata Query Service logging configuration.

New configuration deployed, but it is raising some deprecations and needs some tuning.

Tue, Oct 30, 5:36 PM · Discovery-Wikidata-Query-Service-Sprint, Wikimedia-Incident, Patch-For-Review, Operations, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Fri, Oct 26

Gehel committed rDPOM86d96284c00e: [maven-release-plugin] prepare for next development iteration (authored by Gehel).
[maven-release-plugin] prepare for next development iteration
Fri, Oct 26, 2:29 PM
Gehel committed rDPOM22c108e5f939: [maven-release-plugin] prepare release discovery-parent-pom-1.20 (authored by Gehel).
[maven-release-plugin] prepare release discovery-parent-pom-1.20
Fri, Oct 26, 2:29 PM
Gehel committed rDPOM812f76df365c: fixed failing site generation (authored by Gehel).
fixed failing site generation
Fri, Oct 26, 2:25 PM
Gehel committed rDPOM4a31d450a6f6: [maven-release-plugin] prepare release discovery-parent-pom-1.19 (authored by Gehel).
[maven-release-plugin] prepare release discovery-parent-pom-1.19
Fri, Oct 26, 2:07 PM
Gehel committed rDPOM671e9c0f040c: [maven-release-plugin] prepare for next development iteration (authored by Gehel).
[maven-release-plugin] prepare for next development iteration
Fri, Oct 26, 2:07 PM
Gehel committed rDPOM407e9aeba3cb: upgrading plugins and dependencies to latest stable versions (authored by Gehel).
upgrading plugins and dependencies to latest stable versions
Fri, Oct 26, 1:19 PM
Gehel committed rDPOM16cad16ae6ad: add basic configuration of sonar scanner plugin (authored by Gehel).
add basic configuration of sonar scanner plugin
Fri, Oct 26, 1:19 PM
Gehel committed rDPOM93105166240a: add basic configuration of sonar scanner plugin (authored by Gehel).
add basic configuration of sonar scanner plugin
Fri, Oct 26, 1:19 PM
Gehel committed rDPOM660a89db988b: upgrading plugins and dependencies to latest stable versions (authored by Gehel).
upgrading plugins and dependencies to latest stable versions
Fri, Oct 26, 1:19 PM

Thu, Oct 25

Gehel added a comment to T207837: wdqs updater should be better isolated from blazegraph and common workload should be shared between servers.

There are 3 issues here, and maybe they should be addressed on different tickets:

Thu, Oct 25, 7:09 PM · Wikidata, Operations, Discovery-Search (Current work), Wikidata-Query-Service
Gehel triaged T207947: Switch wdqs1003 with one of the internal wdqs cluster as High priority.
Thu, Oct 25, 2:15 PM · Operations, Patch-For-Review, Discovery-Wikidata-Query-Service-Sprint, Wikidata-Query-Service, Wikidata
Gehel added a comment to T206636: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service.

@Andrew Also looks like there is some puppet issue there:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find data item lvs::configuration::lvs_service_ips in any Hiera data file and no default supplied at /etc/puppet/modules/lvs/manifests/configuration.pp:71:20 on node t206636-2.wikidata-query.eqiad.wmflabs

Judging from the names of the settings, it's not one of mine, so could you please look into what is going on there?

Thu, Oct 25, 2:05 PM · User-Smalyshev, Wikidata, cloud-services-team, Operations, Wikidata-Query-Service
Gehel added a comment to T206114: Create an Icinga check to alert on packet dropped.

So this shows that we have less than 0.04% packet loss on the elasticsearch eqiad cluster? I would expect a loss rate that low not to be an issue (which matches the fact that we don't see a functional issue on that cluster). The goal is not to raise alerts on any packet loss; it is to raise alerts only if we reach a level that is problematic.

Thu, Oct 25, 8:33 AM · Discovery-Search (Current work), Patch-For-Review, monitoring, Operations
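
As an illustration of the point made in the T206114 comment above (alert on a problematic loss rate, not on any loss at all), here is a minimal sketch. The function names and the 1% threshold are hypothetical and are not taken from the actual Icinga check; the counters are assumed to be plain packet counts over the same time window.

```python
def drop_ratio(dropped: int, total: int) -> float:
    """Return the fraction of packets dropped; 0.0 when no traffic was seen."""
    if dropped < 0 or total < 0:
        raise ValueError("packet counts must be non-negative")
    return dropped / total if total else 0.0


def should_alert(dropped: int, total: int, threshold: float = 0.01) -> bool:
    """Alert only when the drop ratio exceeds the threshold.

    The default threshold (1%) is an arbitrary placeholder, not a
    recommended value.
    """
    return drop_ratio(dropped, total) > threshold


if __name__ == "__main__":
    # 4 drops out of 10,000 packets is 0.04%, well under the placeholder threshold.
    print(should_alert(4, 10_000))
    # 200 drops out of 10,000 packets is 2%, above the placeholder threshold.
    print(should_alert(200, 10_000))
```

Expressing the check as a ratio rather than an absolute count keeps busy hosts from alerting on drop volumes that are negligible relative to their traffic.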

Wed, Oct 24

Gehel moved T207724: Investigate reducing number of servers in the elasticsearch cluster from Backlog to Done on the Discovery-Search (Current work) board.
Wed, Oct 24, 9:27 PM · Discovery-Search (Current work), Operations
Gehel added a comment to T207724: Investigate reducing number of servers in the elasticsearch cluster.

Actually, there are already some pool counter errors with 29 nodes on eqiad.

Wed, Oct 24, 5:30 PM · Discovery-Search (Current work), Operations
Gehel added a comment to T207834: Cleanup Wikidata Query Service logging configuration.

My current patch is trying to put all that logic into logback.xml, but it is definitely starting to be unreadable. And coding ifs in XML just seems wrong :/

Wed, Oct 24, 4:26 PM · Discovery-Wikidata-Query-Service-Sprint, Wikimedia-Incident, Patch-For-Review, Operations, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
Gehel added a comment to T207817: WDQS Updater ran into issue and stopped working.

Interesting! I checked the Jodatime stuff to make sure one of our Java-based pipelines handled the timestamp format change. I'm surprised that Jackson can't parse this!

Wed, Oct 24, 3:43 PM · EventBus, Core Platform Team Backlog (Later), Analytics, Services (next), Discovery-Wikidata-Query-Service-Sprint, MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Wikimedia-Incident, Operations, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
Gehel added a comment to T207817: WDQS Updater ran into issue and stopped working.

Do we have a patch or should we roll back group0?

Wed, Oct 24, 2:51 PM · EventBus, Core Platform Team Backlog (Later), Analytics, Services (next), Discovery-Wikidata-Query-Service-Sprint, MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Wikimedia-Incident, Operations, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
Gehel triaged T207843: increase restart interval of wdqs updater as High priority.
Wed, Oct 24, 11:44 AM · Discovery-Wikidata-Query-Service-Sprint, Wikimedia-Incident, Patch-For-Review, Operations, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
Gehel triaged T207837: wdqs updater should be better isolated from blazegraph and common workload should be shared between servers as High priority.
Wed, Oct 24, 9:41 AM · Wikidata, Operations, Discovery-Search (Current work), Wikidata-Query-Service
Gehel triaged T207834: Cleanup Wikidata Query Service logging configuration as High priority.
Wed, Oct 24, 9:28 AM · Discovery-Wikidata-Query-Service-Sprint, Wikimedia-Incident, Patch-For-Review, Operations, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
Gehel renamed T207656: WDQS logging should be rate limited from WDQS logging to logstash should be rate limited to WDQS logging should be rate limited.
Wed, Oct 24, 7:51 AM · Patch-For-Review, Operations, Discovery-Wikidata-Query-Service-Sprint
Gehel added a subtask for T207817: WDQS Updater ran into issue and stopped working: T207656: WDQS logging should be rate limited.
Wed, Oct 24, 7:49 AM · EventBus, Core Platform Team Backlog (Later), Analytics, Services (next), Discovery-Wikidata-Query-Service-Sprint, MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Wikimedia-Incident, Operations, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
Gehel added a parent task for T207656: WDQS logging should be rate limited: T207817: WDQS Updater ran into issue and stopped working.
Wed, Oct 24, 7:49 AM · Patch-For-Review, Operations, Discovery-Wikidata-Query-Service-Sprint

Tue, Oct 23

Gehel added a comment to T207724: Investigate reducing number of servers in the elasticsearch cluster.

With 6 servers depooled / banned, the cluster seems to be just fine. Starting at 7 nodes depooled, I see the load rising on some of the other servers. The response times don't show any significant change.

Tue, Oct 23, 2:40 PM · Discovery-Search (Current work), Operations
Gehel triaged T207724: Investigate reducing number of servers in the elasticsearch cluster as Normal priority.
Tue, Oct 23, 8:23 AM · Discovery-Search (Current work), Operations

Mon, Oct 22

Gehel added a comment to T206105: Optimize networking configuration for WDQS.

Some minimal packet drop is still seen (< 100 packets / 24h), so the situation is much better. More work needs to be done on limiting CPU usage on the blazegraph side.

Mon, Oct 22, 3:59 PM · Patch-For-Review, Wikidata, Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata-Query-Service
Gehel created T207665: Run test queries automatically on wdqs autodeployed servers.
Mon, Oct 22, 3:14 PM · Wikidata, Wikidata-Query-Service
Gehel triaged T207656: WDQS logging should be rate limited as High priority.
Mon, Oct 22, 1:46 PM · Patch-For-Review, Operations, Discovery-Wikidata-Query-Service-Sprint
Gehel created T207656: WDQS logging should be rate limited.
Mon, Oct 22, 1:45 PM · Patch-For-Review, Operations, Discovery-Wikidata-Query-Service-Sprint

Fri, Oct 19

Gehel added a comment to T206560: [Epic] Evaluate alternatives to Blazegraph.

A few wishes I have from an operations point of view for any replacement. Those are not necessarily mandatory, but we should evaluate them at some point:

Fri, Oct 19, 1:42 PM · Wikidata, Epic, Wikidata-Query-Service
Gehel added a comment to T207195: Configure LVS endpoints for new elasticsearch clusters.

After some discussion (P7699), the idea is:

Fri, Oct 19, 1:36 PM · Discovery-Search
Gehel created P7699 discussion: elasticsearch multi instance traffic routing.
Fri, Oct 19, 1:34 PM · Discovery-Search

Thu, Oct 18

Gehel triaged T207360: osmosis state file on maps1004 is outdated as High priority.
Thu, Oct 18, 8:11 AM · Reading-Infrastructure-Team-Backlog (Kanban), Maps

Wed, Oct 17

Gehel added a comment to T206880: Investigate runaway Blazegraph threads.

@Smalyshev if you could take a heap dump of blazegraph under load, we might be able to trace more precisely where this unnamed thread pool is coming from. Feel free to send me the dump for analysis.

Wed, Oct 17, 10:52 PM · User-Smalyshev, Wikidata, Discovery-Wikidata-Query-Service-Sprint, Wikidata-Query-Service
Gehel added a comment to T206114: Create an Icinga check to alert on packet dropped.

What should be the runbook/actions when this alert goes off?

Wed, Oct 17, 8:26 PM · Discovery-Search (Current work), Patch-For-Review, monitoring, Operations
Gehel moved T206105: Optimize networking configuration for WDQS from In progress to Done on the Discovery-Wikidata-Query-Service-Sprint board.
Wed, Oct 17, 1:08 PM · Patch-For-Review, Wikidata, Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata-Query-Service

Tue, Oct 16

Gehel triaged T207195: Configure LVS endpoints for new elasticsearch clusters as Normal priority.
Tue, Oct 16, 5:11 PM · Discovery-Search
Gehel updated subscribers of T198352: Setup two elasticsearch clusters on relforge to test multi-instance.

The current puppet code for tlsproxy::localssl does not allow for multiple default_server, even when on different ports.

Tue, Oct 16, 3:52 PM · Patch-For-Review, Discovery-Search (Current work)

Oct 15 2018

Gehel moved T206423: The usual Lag pattern for wdqs2003 seems to be taking another turn from Backlog to Needs review on the Discovery-Wikidata-Query-Service-Sprint board.
Oct 15 2018, 3:51 PM · Discovery-Wikidata-Query-Service-Sprint, Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Operations, Wikidata
Gehel added a project to T206423: The usual Lag pattern for wdqs2003 seems to be taking another turn: Discovery-Wikidata-Query-Service-Sprint.
Oct 15 2018, 3:51 PM · Discovery-Wikidata-Query-Service-Sprint, Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Operations, Wikidata
Gehel added a comment to T206114: Create an Icinga check to alert on packet dropped.

I would consider also making the threshold a percentage of the normal traffic.

Oct 15 2018, 10:12 AM · Discovery-Search (Current work), Patch-For-Review, monitoring, Operations
Gehel added a comment to T206114: Create an Icinga check to alert on packet dropped.

Just in eqiad there are 839 matches, so we'll likely need some filtering/tuning first.

Oct 15 2018, 10:09 AM · Discovery-Search (Current work), Patch-For-Review, monitoring, Operations
Gehel added a comment to T206187: reconfigure Icinga alert for elasticsearch_shard_size to reduce false positive alerts.

I think the proposal makes sense. This check is here so that we don't forget to reshard when needed, but there isn't a hard limit on the max shard size (well, there is the overall disk space, but we're going to be in trouble well before that). The main goal is to get a low priority alert when things are climbing too high. And "too high" isn't well defined. So we have some latitude as to what limit we want to set.

Oct 15 2018, 8:19 AM · Discovery-Search (Current work), Patch-For-Review, Elasticsearch, Operations, Icinga

Oct 12 2018

Gehel added a comment to T206114: Create an Icinga check to alert on packet dropped.

The ones I have seen are relatively short bursts of errors in the error counters of interfaces on WDQS (node_network_receive_drop in prometheus). In the case of WDQS, it seems related to CPU starvation and seems to be actionable (T206105). It looks to me like something we need to address more generally, but that's sufficiently out of my comfort zone that feedback is definitely welcome!

Oct 12 2018, 8:48 AM · Discovery-Search (Current work), Patch-For-Review, monitoring, Operations

Oct 11 2018

Gehel updated subscribers of T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it.

I think update lag is not the biggest issue. Endpoint availability and response times are more important for most users, at least short-term. If there's a lag spike that goes away, most users won't even notice (persistent lag is different of course). If however the user's queries time out, that is different.

Oct 11 2018, 7:55 PM · Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Gehel added a comment to T206114: Create an Icinga check to alert on packet dropped.

General question on how to deploy this kind of change:

Oct 11 2018, 7:50 PM · Discovery-Search (Current work), Patch-For-Review, monitoring, Operations
Gehel added a comment to T206636: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service.

The main contention point for WDQS (or investigating alternatives) seems to be IOPS. We tried setting up a wdqs test instance on WMCS, but IO contention meant that we were not able to keep up with the update flow. Our production instances consume ~3-4K IOPS just for updates. If there is a way to get this kind of throughput on our VMs, then that would be great!

Oct 11 2018, 7:28 PM · User-Smalyshev, Wikidata, cloud-services-team, Operations, Wikidata-Query-Service
Gehel moved T198352: Setup two elasticsearch clusters on relforge to test multi-instance from Backlog to In progress on the Discovery-Search (Current work) board.
Oct 11 2018, 1:05 PM · Patch-For-Review, Discovery-Search (Current work)
Gehel edited projects for T198352: Setup two elasticsearch clusters on relforge to test multi-instance, added: Discovery-Search (Current work); removed Discovery-Search.
Oct 11 2018, 1:05 PM · Patch-For-Review, Discovery-Search (Current work)
Gehel triaged T206648: Increase throttling rates for wdqs internal cluster as High priority.
Oct 11 2018, 8:50 AM · Patch-For-Review, Discovery-Wikidata-Query-Service-Sprint
Gehel moved T206648: Increase throttling rates for wdqs internal cluster from In progress to Done on the Discovery-Wikidata-Query-Service-Sprint board.
Oct 11 2018, 8:50 AM · Patch-For-Review, Discovery-Wikidata-Query-Service-Sprint

Oct 10 2018

Gehel created T206648: Increase throttling rates for wdqs internal cluster.
Oct 10 2018, 4:03 PM · Patch-For-Review, Discovery-Wikidata-Query-Service-Sprint
Gehel created T206639: Switch to unix socket connections for osmupdater / osmimporter for postgresql on maps.
Oct 10 2018, 3:40 PM · Patch-For-Review, Operations, Maps
Gehel created T206636: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service.
Oct 10 2018, 3:18 PM · User-Smalyshev, Wikidata, cloud-services-team, Operations, Wikidata-Query-Service
Gehel added a comment to T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it.

Coming back to this discussion, I'll try to make my point more clear:

Oct 10 2018, 2:45 PM · Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Gehel added a comment to T206105: Optimize networking configuration for WDQS.

With some trial and error, it looks like smp_affinity = 00ff00ff would allow the IRQ to be managed by any CPU, but it is still managed by the first one (in this case, any == CPU0). Setting each IRQ on a specific CPU (and only one) will spread them. It looks like puppet interface::rps is setting this up, but it is also setting up a few other things. Time to read some more.

Oct 10 2018, 1:11 PM · Patch-For-Review, Wikidata, Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata-Query-Service
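
To make the smp_affinity observation in the T206105 comment above concrete, here is a minimal sketch of how such hex masks map to CPUs and how one single-CPU mask per IRQ could be computed. This is not what interface::rps actually does; the helper names and the IRQ numbers are hypothetical, and the sketch ignores the comma-separated 32-bit groups that the kernel uses for masks wider than 32 CPUs.

```python
def cpus_in_mask(mask: str) -> list:
    """Return the CPU indexes whose bit is set in a hex affinity mask."""
    value = int(mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def single_cpu_mask(cpu: int) -> str:
    """Return the hex mask that selects exactly one CPU."""
    if cpu < 0:
        raise ValueError("cpu index must be non-negative")
    return format(1 << cpu, "x")


def spread_irqs(irqs, cpus):
    """Pin each IRQ to one CPU, cycling through the given CPUs round-robin."""
    cpus = list(cpus)
    if not cpus:
        raise ValueError("at least one CPU is required")
    return {
        irq: single_cpu_mask(cpus[index % len(cpus)])
        for index, irq in enumerate(irqs)
    }


if __name__ == "__main__":
    # The mask quoted in the comment allows 16 CPUs: 0-7 and 16-23.
    allowed = cpus_in_mask("00ff00ff")
    print(allowed)
    # Hypothetical IRQ numbers, each pinned to one of the allowed CPUs.
    for irq, mask in spread_irqs([40, 41, 42, 43], allowed).items():
        print(f"IRQ {irq}: smp_affinity = {mask}")
```

A mask with many bits set only says which CPUs are permitted to handle the interrupt, which is consistent with the observation that all IRQs still landed on CPU0; giving each IRQ a mask with a single bit set is what forces the spread.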

Oct 9 2018

Gehel added a comment to T206423: The usual Lag pattern for wdqs2003 seems to be taking another turn.

Looking at dropped packets, it looks like we did not have any over the last few days. So there is another cause for our lag. Also note that while the issue still seems more present on wdqs2003, we also see issues with other nodes.

Oct 9 2018, 8:11 AM · Discovery-Wikidata-Query-Service-Sprint, Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata, Operations