[05:12:47] serviceops: deploy1003 implementation tracking - https://phabricator.wikimedia.org/T364417#10009307 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye executed with errors: - deploy1003 (**FAIL**) - Downtimed on Icinga/A...
[10:15:51] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10009738 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=52c5c269-d4e9-4489-a397-00874b75eb1c) set by cgoubert@cumin1002 for 21 days, 0:0...
[10:16:46] serviceops: deploy1003 implementation tracking - https://phabricator.wikimedia.org/T364417#10009740 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye
[10:35:30] serviceops, Citoid, VisualEditor, VisualEditor-MediaWiki-References, and 2 others: Register Citoid as a "friendly bot" with Cloudflare - https://phabricator.wikimedia.org/T370118#10009868 (akosiaris) Adding some more info, I've gone to https://dash.cloudflare.com/?to=/:account/:zone/security/bot...
[10:44:59] serviceops, Citoid, VisualEditor, VisualEditor-MediaWiki-References, and 2 others: Register Citoid as a "friendly bot" with Cloudflare - https://phabricator.wikimedia.org/T370118#10009901 (Volans) @akosiaris with the wikimedia account we have we do have access to the `Add Verified Bot` form and p...
[10:50:59] serviceops, Citoid, VisualEditor, VisualEditor-MediaWiki-References, and 2 others: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare - https://phabricator.wikimedia.org/T370118#10009945 (akosiaris)
[11:27:33] serviceops, Citoid, VisualEditor, VisualEditor-MediaWiki-References, and 2 others: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare - https://phabricator.wikimedia.org/T370118#10010093 (akosiaris)
[11:40:49] serviceops, Citoid, VisualEditor, VisualEditor-MediaWiki-References, and 2 others: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare - https://phabricator.wikimedia.org/T370118#10010121 (akosiaris) OK, thanks, I can see that too now. I've been collecting...
[12:03:15] serviceops, Citoid, VisualEditor, VisualEditor-MediaWiki-References, and 2 others: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare - https://phabricator.wikimedia.org/T370118#10010167 (akosiaris) @ppelberg, @DLynch @zoe. The verified bot form requires entering...
[12:39:26] serviceops: deploy1003 implementation tracking - https://phabricator.wikimedia.org/T364417#10010229 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye completed: - deploy1003 (**PASS**) - Downtimed on Icinga/Alertmanager...
[13:50:30] serviceops, ChangeProp, MediaWiki-Core-HTTP-Cache, MediaWiki-Engineering, and 3 others: Reduce the number of resource_change and resource_purge events emitted due to template changes - https://phabricator.wikimedia.org/T369898#10010587 (MSantos)
[14:21:22] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T370672#10010743 (Jhancock.wm) Open→Resolved
[14:23:44] folks, as an FYI I am working with Content Transform to move Kartotherian to K8s (hopefully, if we manage to upgrade it to bookworm/node-20) - https://phabricator.wikimedia.org/T216826
[14:23:54] please lemme know if you have anything against it
[14:26:06] serviceops, ChangeProp, MediaWiki-Core-HTTP-Cache, MediaWiki-Engineering, and 3 others: Reduce the number of resource_change and resource_purge events emitted due to template changes - https://phabricator.wikimedia.org/T369898#10010778 (akosiaris) >>! In T369898#9990749, @Ottomata wrote: >> The n...
[14:26:51] serviceops: deploy1003 implementation tracking - https://phabricator.wikimedia.org/T364417#10010781 (akosiaris) Just armed keyholder, everything looks ok right now. I'll send a notification to wikitech-l and engineering in slack for a deployment server move. Not much different from what we do for the switch...
[14:27:47] elukey: I sure don't :D
[14:30:03] ack :)
[14:30:16] my main concern is long term maintenance, we'll see
[14:31:04] serviceops, MW-on-K8s: Allow running periodic jobs for mw on k8s - https://phabricator.wikimedia.org/T341555#10010805 (akosiaris)
[14:34:01] elukey: there is some work to coordinate this at some point in Q2 with WMDE (context is the nodejs upgrade indeed).
[14:35:55] akosiaris: related to kartotherian or nodejs in general?
[14:36:12] both
[15:14:47] serviceops, LPL Essential, MinT, Community Wishlist (Translations), Community-Tech (Fennec Fox (Aug 12-23, 2024)): Caching service request for MinT - https://phabricator.wikimedia.org/T370755#10011036 (jijiki) @santhosh, we'll discuss in our next week's team meeting and get back to you, thank...
[15:30:53] Hi, sorry to bother, I'm investigating https://phabricator.wikimedia.org/T370304 (the outages) and since the write spikes on the s4 master seem to be different things for different instances, but both were coming from jobs, I was wondering if the job queue concurrency setting or sharding might have bugs or something like that. Do you think this angle is worth investigating?
[16:04:20] Amir1: when you say bugs, what kind of bugs?
[16:04:47] it might also be that the configuration doesn't match your expectations? Btw here are the settings: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/changeprop-jobqueue/values.yaml#180
[16:05:44] as you can see it's multiple toggles, just for the htmlCacheUpdate job, and per the comment it has a specific higher overall concurrency than other stuff
[16:06:30] akosiaris: I meant those values not being enforced all the time
[16:06:40] concurrency is always fun
[16:07:36] We reduced the values for linter, I might reduce it for the html cache job too. I wanted to rule out anything that might be unrelated.
[16:07:57] since we've had these thresholds for years and nothing terrible happened until recently
[16:10:58] concurrency doesn't exactly mean jobs running
[16:11:05] the code is pretty simple, all it does is
[16:11:22] if (this._pendingMsgs.size < this.concurrency) {
[16:11:22] this._consume();
[16:12:40] ah, it might mean we are hitting them now
[16:12:44] cool.
makes sense
[16:50:42] serviceops, Data Products, Data-Platform-SRE, Dumps-Generation, and 2 others: Migrate current-generation dumps to run from our containerized images - https://phabricator.wikimedia.org/T352650#10011671 (Ottomata) I think it will be quite a while before we are fully able to decom Dumps 1. This tas...
[19:47:47] serviceops, MediaWiki-Platform-Team, Epic: Migrate Wikimedia production from PHP 8.1 to PHP 8.3 - https://phabricator.wikimedia.org/T360995#10012597 (bd808)
[19:47:49] serviceops, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432#10012598 (bd808)
[19:49:29] serviceops, MediaWiki-Platform-Team, Epic: Migrate Wikimedia production from PHP 8.1 to PHP 8.3 - https://phabricator.wikimedia.org/T360995#10012600 (bd808)
[20:17:41] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, SRE: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet - https://phabricator.wikimedia.org/T369011#10012706 (VRiley-WMF) a:Jclark-ctr→VRiley-WMF
[20:19:02] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, SRE: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet - https://phabricator.wikimedia.org/T369011#10012714 (VRiley-WMF) I have shut down the server and completed a flea power drain. Booted this server back u...
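Note on the 16:10-16:12 exchange about concurrency: the check quoted there can be illustrated with a minimal sketch. This is not the actual changeprop code. Only `_pendingMsgs`, `concurrency` and `_consume` come from the chat; the `BoundedConsumer` class, the `poll` and `finish` methods and the in-memory `queue` array are hypothetical stand-ins. Under those assumptions, it shows that the limit caps how many messages the consumer holds as pending at once, which is not necessarily the same thing as the number of jobs running.

```javascript
// Hypothetical sketch, not the real changeprop implementation.
class BoundedConsumer {
    constructor(queue, concurrency) {
        this.queue = queue;              // assumed: plain array standing in for the topic
        this.concurrency = concurrency;  // upper bound on pending messages
        this._pendingMsgs = new Set();   // messages taken but not yet finished
    }

    // Take one message from the queue and mark it as pending.
    _consume() {
        const msg = this.queue.shift();
        if (msg !== undefined) {
            this._pendingMsgs.add(msg);
        }
    }

    // Keep consuming while below the limit (the check quoted in the chat).
    poll() {
        while (this.queue.length > 0 && this._pendingMsgs.size < this.concurrency) {
            this._consume();
        }
    }

    // Called when handling of a message is done; frees a slot and polls again.
    finish(msg) {
        this._pendingMsgs.delete(msg);
        this.poll();
    }
}

const consumer = new BoundedConsumer(['a', 'b', 'c', 'd', 'e'], 2);
consumer.poll();
console.log(consumer._pendingMsgs.size); // 2
consumer.finish('a');
console.log([...consumer._pendingMsgs]); // [ 'b', 'c' ]
```

In this sketch a slot is only freed when `finish` is called, so a slow handler holds its slot for as long as it takes and throughput drops without the limit itself changing.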
[21:07:20] serviceops: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 (Scott_French) NEW
[21:14:03] serviceops: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962#10012895 (Scott_French)
[21:30:44] serviceops: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962#10012930 (Scott_French)
[22:21:55] serviceops, Datacenter-Switchover: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962#10013085 (Scott_French)
[23:29:55] serviceops, Charts, Shellbox, SRE: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10013287 (aude) For building this as a node service, is it still recommended to use service-template-node? I noticed that it has some security issue...