[07:31:59] serviceops, Machine-Learning-Team, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9887561 (JMeybohm) >>! In T365253#9829691, @MoritzMuehlenhoff wrote: >>>! In T365253#9829677, @elukey wrote: >> I checked the dragonfly repo and I have a ques...
[07:49:03] o/ when seeing container cpu usage like this: https://w.wiki/ANdL is this something you would consider troubling (the cpu throttle behavior), for this envoy container we set concurrency: 2, [request|limit].cpu: 1
[07:57:14] dcausse: if there is no "real" issue, I wouldn't say it's troubling
[07:59:55] jayme: ok, we believe the pipeline does not manage to achieve the throughput we expect, we're just looking at the various datapoints and found this one a bit odd because it starts to throttle before hitting its limit
[08:04:25] dcausse: that's because throttling is complex (https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limits#How_CPU_requests_and_limits_are_applied) and bound to (time) windows
[08:04:57] IIRC Peter was playing around with concurrency and resource settings for cirrus some time ago
[08:05:38] if you feel like it's limiting you, you may ofc increase the limits. But tbh the throttling does not seem like "that much"
[08:07:09] ok, thanks!
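Editorial sketch of the "bound to (time) windows" point above: with the Linux CFS default period of 100 ms, a CPU limit of 1 gives the container 100 ms of CPU time per period, and two busy threads (concurrency: 2) can ask for up to 200 ms in that same period. The toy model below is only an illustration under those assumptions; the function name, the carry-over rule and the demand figures are invented for the example and are not measurements from the envoy container.

```python
from dataclasses import dataclass


@dataclass
class ThrottleStats:
    avg_cpu: float          # average number of CPUs used over the whole run
    throttled_periods: int  # periods in which the quota was exhausted
    total_periods: int


def simulate_cfs(demand_ms, quota_ms=100.0, period_ms=100.0):
    """Toy model of CFS bandwidth control.

    demand_ms: CPU time (ms, summed over all threads) newly requested in
               each period. Hypothetical input, not real data.
    quota_ms:  CPU time allowed per period (limit cpu: 1 -> 100 ms).
    period_ms: length of the enforcement window (kernel default 100 ms).
    Work that does not fit in a period is carried over to the next one.
    """
    backlog = 0.0
    used_total = 0.0
    throttled = 0
    for new_demand in demand_ms:
        wanted = new_demand + backlog
        used = min(wanted, quota_ms)
        if wanted > quota_ms:
            throttled += 1  # quota ran out before the period ended
        backlog = wanted - used
        used_total += used
    periods = len(demand_ms)
    return ThrottleStats(used_total / (periods * period_ms), throttled, periods)


# One burst where both threads are busy for 80 ms each, then light load.
stats = simulate_cfs([160, 20, 20, 20])
print(stats)
```

In this made-up run the container is throttled in one of four periods while its average usage over the whole 400 ms stays well under 1 CPU, which is the pattern a dashboard averaged over longer intervals would show as "throttling before hitting the limit".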
[08:19:10] serviceops, Prod-Kubernetes, Kubernetes: Improve calico-typha firewall rules - https://phabricator.wikimedia.org/T365687#9887639 (JMeybohm)
[08:39:36] serviceops, SRE, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9887671 (akosiaris)
[08:49:07] serviceops, Infrastructure-Foundations, Data-Platform-SRE (2024.05.27 - 2024.06.16), Patch-For-Review: Create a helm chart for the cloudnativepg postgresql operator - https://phabricator.wikimedia.org/T364797#9887706 (Gehel) Adding #serviceops and #infrastructure-foundations to get a review
[08:59:27] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9887726 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq...
[09:08:40] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, and 2 others: Relabel eqiad wikikube worker nodes - https://phabricator.wikimedia.org/T367285#9887749 (Clement_Goubert) @VRiley-WMF Do you object to us reusing that task by reopening it whenever we have a batch of servers to relabel, or would yo...
[09:10:33] serviceops, Data-Engineering-Icebox, Machine-Learning-Team: Using docker in WMF production network outside of kubernetes - https://phabricator.wikimedia.org/T275551#9887755 (akosiaris) In Kubernetes Special Interest Group, we recently re-evaluated the approach of running Docker outside of Kubernetes....
[09:38:13] serviceops, Prod-Kubernetes, Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9887868 (brouberol)
[09:38:17] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9887869 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad....
[10:03:54] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9887946 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-ctrl2003.codfw....
[10:05:59] serviceops, decommission-hardware, Patch-For-Review: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9887949 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1002 for hosts: `mw[2281,2283-2286].codfw.wmnet` - mw2281.co...
[10:07:54] serviceops, DC-Ops, ops-codfw, SRE-OnFire, Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9887959 (kamila) >>! In T366205#9880294, @Papaul wrote: > @kamila your plan works for us as well, just depool and power the fi...
[10:26:55] serviceops, decommission-hardware, Patch-For-Review: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9888038 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1002 for hosts: `mw[2287-2290].codfw.wmnet` - mw2287.codfw.w...
[10:31:12] serviceops, DC-Ops, ops-codfw, SRE: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9888054 (Clement_Goubert) @Papaul All servers except `mw2282` decommissioned.
[10:31:16] serviceops, DC-Ops, decommission-hardware, ops-codfw, and 2 others: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9888045 (Clement_Goubert)
[10:32:16] serviceops, SRE, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9888074 (MoritzMuehlenhoff)
[10:32:33] serviceops, DC-Ops, decommission-hardware, ops-codfw, SRE: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9888049 (Clement_Goubert) a:Clement_Goubert→None
[10:32:42] serviceops, SRE, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9888076 (MoritzMuehlenhoff)
[11:12:49] serviceops, Beta-Cluster-Infrastructure, Patch-Needs-Improvement: Implement API Gateway solution for deployment-prep - https://phabricator.wikimedia.org/T254917#9888170 (hnowlan) Open→Declined I don't think this will be needed.
[11:45:39] FYI steady increase in 503s since this morning for mw-api-ext-ro, just hit the paging threshold
[11:45:51] https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-site=esams&var-cluster=text&var-origin=mw-api-ext-ro.discovery.wmnet&from=1718192748018&to=1718279148018&viewPanel=12
[12:02:09] serviceops: steady increase in 503s from mw-api-ext-ro.discovery.wmnet since 5 UTC - https://phabricator.wikimedia.org/T367401 (fgiunchedi) NEW
[12:04:25] filed ^ not a big deal as far as mw-api-ext is concerned overall though it still paged
[12:20:45] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9888356 (WDoranWMF)
[12:37:54] serviceops: steady increase in 503s from mw-api-ext-ro.discovery.wmnet since 5 UTC - https://phabricator.wikimedia.org/T367401#9888430 (fgiunchedi) 503s are gone as of ~12:20 UTC
[12:53:13] godog: check with Amir1, it's probably the circuit breaking
[12:53:39] claime: we thought so too, though that should yield 500s :| see also -sre
[12:53:48] claime: it is not circuit breaking :D
[12:53:48] yeah, catching up now
[13:21:40] serviceops, DC-Ops, ops-codfw, SRE: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9888659 (Papaul) @Clement_Goubert thank you.
[13:40:57] serviceops, DC-Ops, ops-codfw, SRE: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9888707 (Jhancock.wm) rails, power, and network cables prepped for mw2282 move.
[13:41:15] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9888709 (hnowlan) >>! In T361835#9712223, @SGupta-WMF wrote: > @WDoranWMF Ye...
[14:12:28] hi folks!
[14:12:34] Is https://docker-registry.wikimedia.org/wikimedia/mediawiki-services-parsoid/tags/ needed?
[14:12:55] or can we drop it? IIUC this runs Stretch, seems abandoned
[14:14:28] same for https://docker-registry.wikimedia.org/wikimedia/mediawiki-services-restbase/tags/
[14:14:50] https://docker-registry.wikimedia.org/wikimedia/mediawiki-services-geoshapes/tags/
[14:17:04] they're not in deployment-charts at least
[14:17:48] wtf, til there's a .pipeline/blubber.yaml file for restbase. But it's not building? O_o
[14:19:39] restbase definitely can be dropped
[14:19:46] blubber support in parsoid was dropped in 2020
[14:21:29] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9888856 (Clement_Goubert)
[14:30:11] There's still errors connecting to shellbox, but it doesn't seem overloaded
[14:31:30] I'm going to bump it by a couple of replicas, see what shakes
[14:34:38] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1043087
[14:52:15] claime: not sure if relevant but fwiw some mediawiki calls will (serially) make 10-50 calls to shellbox_constraints
[14:56:12] cdanis: looks like errors have gone with the bump in resources
[14:56:23] neat
[14:56:58] claime: what does a shellbox_constraints actually run? how many php-fpm workers
[14:57:39] 2 workers afaict
[14:57:45] oh
[14:58:02] are the requests very short usually?
[14:58:15] yes
[14:58:16] if so that might be ok
[14:58:34] mean is around 1.2ms
[15:15:36] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9889259 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq...
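Editorial back-of-the-envelope for the shellbox_constraints exchange above ("2 workers", "mean is around 1.2ms", "10-50 calls" made serially). It assumes the 1.2 ms figure is the mean time a php-fpm worker spends on one request and ignores queueing, network latency and variance, so it gives rough bounds only. The function names are invented for the example.

```python
def pod_capacity_rps(workers: int, mean_service_s: float) -> float:
    """Upper bound on requests per second one pod can serve if every
    worker is always busy (no queueing or idle gaps modelled)."""
    return workers / mean_service_s


def serial_chain_ms(calls: int, mean_service_ms: float) -> float:
    """Lower bound on the latency added to one MediaWiki request that
    makes `calls` shellbox calls one after another."""
    return calls * mean_service_ms


capacity = pod_capacity_rps(workers=2, mean_service_s=0.0012)
best_case = serial_chain_ms(calls=10, mean_service_ms=1.2)
worst_case = serial_chain_ms(calls=50, mean_service_ms=1.2)

print(f"per-pod ceiling: ~{capacity:.0f} req/s")
print(f"serial chain adds roughly {best_case:.0f}-{worst_case:.0f} ms")
```

Under these assumptions two workers per pod leave a lot of headroom for short requests, which fits the "if so that might be ok" remark; a few slow regex evaluations would change the picture quickly, since each one occupies half of a pod's workers.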
[15:21:23] <_joe_> elukey: the parsoid stuff can be burnt with fire
[15:21:44] <_joe_> same as for restbase
[15:21:52] <_joe_> but we should remove the blubber files there
[15:22:27] <_joe_> claime, cdanis shellbox-constraints is used to safely evaluate user-provided regexes away from mediawiki
[15:22:42] yeah I know of it from seeing it in traces :)
[15:22:45] <_joe_> to ensure they're both converging and they don't run too long
[15:22:47] I just removed eventgate-ci (Andrew confirmed it wasn't used)
[15:23:13] <_joe_> If it's overwhelmed, I'd suggest increasing the workers per pod probably
[15:25:21] serviceops: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427 (elukey) NEW
[15:27:14] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9889307 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad....
[15:27:22] serviceops: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427#9889308 (elukey) Dropped eventgate-ci as well (Andrew Otto confirmed that it is not used anymore since ages).
[15:28:12] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9889319 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq...
[15:35:15] serviceops, DC-Ops, ops-codfw, SRE-OnFire, Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9889387 (Papaul) @kamila no problem we can move that one. Once done we will update the task.
[15:38:00] serviceops, MediaWiki-Configuration, MediaWiki-Engineering, MW-on-K8s, and 2 others: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T346971#9889404 (brennen) Still seen in 1.43.0-wmf.9. Recent versions: {F55289448} ==== Error ==== * servi...
[15:40:40] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889411 (Jdforrester-WMF) Looks like this is now done except for "some straggling traffic" for the api-gateway? {F55289507}
[15:54:26] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889560 (Clement_Goubert) Yes, but I will close it when I'm sure I have zero internal traffic on the bare metal clusters.
[15:57:56] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889584 (hnowlan) I believe the straggling traffic here is a misnomer/a graph misunderstanding - the API gateway refers to traffic to the mediawiki API as "mwapi_cluster"...
[17:41:03] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, and 2 others: Relabel eqiad wikikube worker nodes - https://phabricator.wikimedia.org/T367285#9890137 (VRiley-WMF) @Clement_Goubert I believe it would be better to open a new task for any servers that need to be relabeled.
[18:14:16] wikikube-ctrl1003 alerted as being down, uptime though is 2hrs, _kamila was re-imaging, the box does not puppet clean. We think it is okay to leave as is
[19:10:36] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9890462 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7ffb1c0b-d404-4615-accd-65085d64f738) set by kamila@c...
[19:16:48] ^ wikikube-ctrl1003 should be fixed-ish once puppet finishes
[19:16:59] thanks kamila_
[19:27:18] serviceops, DC-Ops, ops-eqiad, SRE-OnFire, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9890504 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad....
[21:11:00] serviceops, Patch-For-Review: Alerting on under-scaled deployments - https://phabricator.wikimedia.org/T366932#9890861 (Scott_French) Next steps: * Let this soak for a bit and check the noise level of KubernetesDeploymentUnavailableReplicas [0] (currently warning, so it matches no receivers). * Follow up...