@Scott_French Does it seem reasonable that what I called the "control group" will always be all logs minus the target deployment's logs? I'm wondering if we need to find a way to configure both lookups separately or if it can be formulaic with only the thresholds needing to be set independently.
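To make the "formulaic" option concrete, here is a minimal sketch, assuming the control group really is just every log line that did not come from the target deployment. The `LogEntry` fields, the function names, and the error-rate metric are all hypothetical stand-ins; the real lookups are queries that are not shown in this thread.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LogEntry:
    deployment: str  # hypothetical field: which deployment emitted the line
    is_error: bool   # hypothetical field: whether the line counts as an error


def split_groups(logs: Iterable[LogEntry], target: str):
    """Partition logs into (target, control); control is all logs minus target."""
    target_logs, control_logs = [], []
    for entry in logs:
        (target_logs if entry.deployment == target else control_logs).append(entry)
    return target_logs, control_logs


def error_rate(entries) -> float:
    """Fraction of entries flagged as errors; 0.0 for an empty group."""
    return sum(e.is_error for e in entries) / len(entries) if entries else 0.0


def check(logs, target, target_threshold, control_threshold):
    """Use one formulaic split, but compare each group to its own threshold."""
    target_logs, control_logs = split_groups(logs, target)
    return {
        "target_exceeded": error_rate(target_logs) > target_threshold,
        "control_exceeded": error_rate(control_logs) > control_threshold,
    }
```

In this shape only the two thresholds need independent configuration; the control lookup is derived from the target name. If the control group ever needs to exclude more than the target deployment, the split would have to become configurable after all.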
Thu, Jun 11
Resolved per https://prometheus-alerts.wmcloud.org/?q=%40state%3Dactive&q=project%3Ddeployment-prep monitoring.
bd808@deployment-poolcounter07:~$ sudo -i puppet agent -tv
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-poolcounter07.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(e1acc46365) gitpuppet - beta: Add a wmf-beta-update-all timer and script'
Notice: Applied catalog in 6.53 seconds
Self-resolved?
Wikibugs has a lot of maintainers in theory: https://toolsadmin.wikimedia.org/tools/id/wikibugs
In T402454#12008371, @Aklapper wrote: Now I wonder who could review that.
In T256168#11985947, @dancy wrote: In production, the train-presync.service uses /usr/local/bin/systemd-timer-mail-wrapper for this purpose. I wonder if it works in beta.
Wed, Jun 10
In T428354#12005573, @aputhin wrote: We'll update the documentation to reflect the fact that --filelog should be added
Tue, Jun 9
In T426073#12000638, @atsuko wrote: So no, it wasn't a data loss, but the data was populated only on the single cluster.
In T426073#11987919, @atsuko wrote: @bd808 I noticed that the opensearch-toolhub-test.svc.codfw.wmnet wasn't populated, I've done it myself using your script.
The projects were missing any members at all. I added @DamianZaremba to both as a "member", which is the name our Keystone setup gives to sysop users. You can add other member and reader accounts as needed, Damian.
Mon, Jun 8
@dduvall The tls-server-name: 127.0.0.1 trick in the kubeconfig does not work for the executor access?
Too old to do anything about
bd808@deployment-schema-3:~$ sudo -i puppet agent -tv
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-schema-3.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(2c7ddab7b7) git-sync-upstream - [BETA HACK] haproxy: disable warn-blocked-traffic-after'
Notice: Applied catalog in 5.28 seconds
bd808@deployment-puppetserver-1.deployment-prep.eqiad1:~$ sudo -i puppet agent -tv
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(2c7ddab7b7) git-sync-upstream - [BETA HACK] haproxy: disable warn-blocked-traffic-after'
Notice: /Stage[main]/Profile::Puppetserver::Volatile/File[/srv/puppet_fileserver/volatile/datacenter_vendors]: Not removing directory; use 'force' to override
Notice: /Stage[main]/Profile::Puppetserver::Volatile/File[/srv/puppet_fileserver/volatile/datacenter_vendors]/ensure: removed (corrective)
Notice: Applied catalog in 12.12 seconds
Puppet works on the puppetserver itself again. The alert for the original error here has cleared as well.
bd808@deployment-puppetserver-1.deployment-prep.eqiad1:~$ sudo -i puppet agent -tv
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Must define a private entry in profile::puppetserver::git::repos to use profile::puppetserver::volatile (file: /srv/puppet_code/environments/production/modules/profile/manifests/puppetserver/volatile.pp, line: 30, column: 9) on node deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
It looks like maybe I messed up the cherry-pick? I'll try again.
diff --git a/deployment-prep/deployment-puppetserver.yaml b/deployment-prep/deployment-puppetserver.yaml
index bb6dcfa..3833832 100644
--- a/deployment-prep/deployment-puppetserver.yaml
+++ b/deployment-prep/deployment-puppetserver.yaml
New problems:
bd808@deployment-puppetserver-1.deployment-prep.eqiad1:~$ sudo -i puppet agent -tv
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::conftool::hiddenparma::root_token' (file: /srv/puppet_code/environments/production/modules/profile/manifests/puppetserver/volatile.pp, line: 13) on node deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
This is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1298768 conflicting with the Beta Cluster config. We need to rename our profile::conftool::hiddenparma::api_tokens to profile::conftool::hiddenparma::root_token in the deployment-puppetserver prefix hiera apparently.
Hack patch updated in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137013/6. Then I did the usual dance on the puppetserver: fetch, interactive rebase to drop the old patch, and cherry-pick the updated patch.
gitpuppet@deployment-puppetserver-1:/srv/git/operations/puppet$ git fetch
Auto packing the repository in background for optimum performance.
See "git help gc" for manual housekeeping.
gitpuppet@deployment-puppetserver-1:/srv/git/operations/puppet$ git rebase --interactive origin/production
INFO: Deploying Puppet code...
INFO: Exiting, skipping Puppet code deploy, because a git rebase is in progress
INFO: Deploying Puppet code...
INFO: Exiting, skipping Puppet code deploy, because a git rebase is in progress
INFO: Deploying Puppet code...
INFO: Exiting, skipping Puppet code deploy, because a git rebase is in progress
INFO: Deploying Puppet code...
INFO: Exiting, skipping Puppet code deploy, because a git rebase is in progress
Auto-merging modules/profile/manifests/puppetserver/volatile.pp
CONFLICT (content): Merge conflict in modules/profile/manifests/puppetserver/volatile.pp
error: could not apply 6c0417d261... [BETA HACK] Changes to profile::puppetserver::volatile
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 6c0417d261... [BETA HACK] Changes to profile::puppetserver::volatile
Fri, Jun 5
bd808@deployment-docker-mobileapps02:/var/log$ sudo rm syslog.1
bd808@deployment-docker-mobileapps02:/var/log$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            3.9G     0  3.9G   0% /dev
tmpfs           796M  624K  795M   1% /run
/dev/sda1        20G   15G  4.2G  78% /
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      124M   12M  113M  10% /boot/efi
tmpfs           796M     0  796M   0% /run/user/0
tmpfs           796M     0  796M   0% /run/user/3518
This will probably get us by for a while.
bd808@deployment-docker-mobileapps02:/var/log$ sudo du -sh *| sort -h | tail -10
28M     syslog.27.gz
29M     syslog.28.gz
30M     syslog.21.gz
42M     syslog.24.gz
45M     syslog.2.gz
56M     syslog.12.gz
1.7G    account
2.9G    syslog
3.2G    syslog.1
3.7G    journal
bd808@deployment-docker-mobileapps02:/var/log$ ls -lh account/pacct
-rw-r----- 1 root adm 1.7G Jun 4 15:12 account/pacct
bd808@deployment-docker-mobileapps02:/var/log$ sudo truncate -s 0 /var/log/account/pacct
bd808@deployment-docker-mobileapps02:/var/log$ df -h | grep sda1
/dev/sda1        20G   18G  776M  96% /
/dev/sda15      124M   12M  113M  10% /boot/efi
bd808@deployment-docker-mobileapps02:/var/log$ sudo rm syslog.??.gz
bd808@deployment-docker-mobileapps02:/var/log$ sudo -i puppet agent -tv
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-docker-mobileapps02.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(68641ed297) gitpuppet - [BETA HACK] haproxy: disable warn-blocked-traffic-after'
Notice: Applied catalog in 8.03 seconds
bd808@deployment-docker-mobileapps02:~$ sudo -i puppet agent -tv
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-docker-mobileapps02.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(68641ed297) gitpuppet - [BETA HACK] haproxy: disable warn-blocked-traffic-after'
Error: Could not prefetch package provider 'apt': Execution of '/usr/bin/apt-mark showmanual' returned 100: E: Write error - write (28: No space left on device)
E: IO Error saving source cache
E: The package lists or status file could not be parsed or opened.
...
Warning: /Stage[main]/Rsyslog/Service[rsyslog]: Skipping because of failed dependencies
Error: Failed to apply catalog: No space left on device @ dir_s_mkdir - /var/lib/puppet/state/state.yaml20260605-3112715-1t5l9pe.lock
Error: Could not save last run local report: No space left on device @ dir_s_mkdir - /var/cache/puppet/public/last_run_summary.yaml20260605-3112715-1ivm4du.lock
Error: Could not send report: No space left on device @ dir_s_mkdir - /var/lib/puppet/state/last_run_report.yaml20260605-3112715-odci0o.lock
Disk is full. Seems to be /var/log.
Thu, Jun 4
A different path would be to talk with the folks who run the wikidumpparse Cloud VPS project which is the current home of the https://humaniki.wmcloud.org project and see if your project can become part of theirs. The humaniki-2 name you have chosen makes it seem like you would like your project to be associated with theirs in the minds of users.
In T428090#11984393, @Danya wrote: I'll answer your remarks in order:
- I don't want to run a MediaWiki instance alongside my tool.
Wed, Jun 3
using an SQLite database
The Beta Cluster cache nodes are Debian Bullseye running HAProxy version 2.8.18-1~bpo11+1 2025/12/26. It looks like the prod CDN edge is using Debian Trixie and HAProxy 3.2.15-1~bpo13+1.
In T428052#11981214, @Urbanecm_WMF wrote: Pushed this on beta as a stopgap:
Tue, Jun 2
I missed seeing this for some reason. There is work in progress to upgrade the base image for shellbox to bookworm. Once that is done, this should be a relatively simple follow-up. I think this will also be the first time that we can use the createPygmentizeBundle.php script that @SD0001 built.
Mon, Jun 1
Thu, May 28
In T415293#11965049, @bd808 wrote: I haven't reviewed the whole task tree, but is there a planned step to replace the custom skin with something we generally use such as vector-2022 as part of the decomm process? It would be nice to be able to drop and archive the wikimediaapiportal skin (T259661: Restrict skin options on API Portal Beta Site) rather than carry it indefinitely as baggage in the train.
In T426073#11930022, @atsuko wrote: Hi, indices access should now work without HTTP auth. If the test cluster is working, I'll provision the prod cluster as well.
Wed, May 27
In T256168#11959969, @hashar wrote: Can we claim victory on this one, or did you have following steps in mind? The ones I think of are removing the agent in Jenkins and deleting the jobs (I can take care of that).
Tue, May 26
https://tools.wmflabs.org/toolhub was where we were publishing the initial versions of the toolinfo schema. In those early versions, we used identifiers like https://tools.wmflabs.org/toolhub/schema/1.1.1 which we intended to be usable as a URI to retrieve the specification. It looks like https://tools.wmflabs.org/toolhub/schema/1.2.0-draft01 was the last specification published there on 2020-09-15. Later we began using https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/toolhub/+/refs/heads/main/jsonschema/toolinfo as the git source tree location for schemas which ends up exposing them to the internet via URLs like https://toolhub.wikimedia.org/static/jsonschema/toolinfo/1.2.2.json.
Sat, May 23
In T427037#11948808, @valerio.bozzolan wrote:
- and if Toolforge tools are in scope anyway 🤔 - maybe a question for the Wikidata bar
Thu, May 21
WMCS has an ancient and largely locally undocumented tie to OpenStreetMap projects. The connection comes from an OpenStreetMap-Wikimedia cooperation that included OpenStreetMap tools/servers being part of the Toolserver environment. When Toolserver was disbanded, some of the OSM projects came to what we now call Cloud VPS. See also osmwiki:Collaboration with Wikipedia, which talks about some of this.
In T152581#11945389, @valerio.bozzolan wrote: → So, in all cases, we need answers. And we can produce a small export with:
- tool name
- license
- repo link
Unfortunately I do not know Toolforge internals (Striker internals? etc.) to produce such export. But this sounds like a feasible sub-task.
In T420833#11943406, @bd808 wrote: That is 11,118 distinct /24 networks each sending 2 requests in the last ~2 hours.
The current traffic is dominated by the "2 requests per /24 network" pattern. Range blocks will need to be vast again to make much difference. We really need to get requestctl and more fine-grained pattern matching tech into the Beta Cluster CDN edge.
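For anyone who wants to reproduce that kind of count from a list of client addresses, here is a minimal sketch, assuming IPv4-only input that has already been extracted from the edge logs. The function names and the `max_requests` parameter are hypothetical; this is not how the numbers above were actually produced.

```python
import ipaddress
from collections import Counter


def requests_per_slash24(client_ips):
    """Count requests per enclosing /24 network (IPv4 input assumed)."""
    counts = Counter()
    for ip in client_ips:
        # strict=False masks the host bits, so 192.0.2.77/24 becomes 192.0.2.0/24.
        network = ipaddress.ip_network(f"{ip}/24", strict=False)
        counts[network] += 1
    return counts


def low_volume_networks(client_ips, max_requests=2):
    """Return the /24 networks that sent at most max_requests requests, sorted."""
    counts = requests_per_slash24(client_ips)
    return sorted(net for net, n in counts.items() if n <= max_requests)
```

A long result from `low_volume_networks` is the signature described above: many networks each staying under any per-network rate limit, which is why only very wide range blocks make a dent.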
Wed, May 20
In T426822#11942742, @bd808 wrote: This is currently blocking an unblock request for Beta Cluster.
Tue, May 19
https://k8s-status.toolforge.org/namespaces/tool-jimmy/ is rendering today. It is very difficult to say what API crawling failure was breaking this particular tool's display 8 months ago.
Mon, May 18
In T393782#11931952, @Andrew wrote: "The new CAPI driver and the old Heat driver are compatible and can both be active on the same deployment"
I created T426638: Make it possible for users to self-claim the "Volunteer" badge for discussion of finding a way for this badge to be self-awarded again.
I suppose one option could be a tool hosted on Toolforge or elsewhere that would perform the granting action on behalf of an authenticated user. That is more moving parts to keep track of than just button clicks within Phabricator itself, but if the permissions model of Badges does not separate editing the badge from awarding the badge, then direct use might not be reasonably possible.
Sat, May 16
In T415237#11928230, @TripleCamera2022 wrote: @bd808 Could you please import all the pads referenced by mwstake.org? Special:LinkSearch can quickly list all of them.
Probably a recurrence of T410540: Wikibugs does not rejoin channels automatically following a BNC restart.
tools.wikibugs@tools-bastion-14:~$ kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
gerrit-6464bb8dc9-4dr2f     1/1     Running   0          17h
gitlab-7cb775c5f9-7kjzg     1/1     Running   0          18h
irc-66579f4f68-k4dpq        1/1     Running   0          95s
phorge-6cb946dfdb-krtd9     1/1     Running   0          20h
wikibugs-5bf6746d8f-2j6cr   1/1     Running   0          17h
znc-6658b7c4f4-jtlfh        1/1     Running   0          17h
That znc pod restart was likely the trigger.
Fri, May 15
I'm pretty sure this is a variation of the older bug at T144943: Groups and tools only refreshed at login.
