[07:31:59] <wikibugs>	 06serviceops, 06Machine-Learning-Team, 07Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9887561 (10JMeybohm) >>! In T365253#9829691, @MoritzMuehlenhoff wrote: >>>! In T365253#9829677, @elukey wrote: >> I checked the dragonfly repo and I have a ques...
[07:49:03] <dcausse>	 o/ when seeing container cpu usage like this: https://w.wiki/ANdL is this something you would consider troubling (the cpu throttle behavior), for this envoy container we set concurrency: 2, [request|limit].cpu: 1
[07:57:14] <jayme>	 dcausse: if there is no "real" issue, I wouldn't say it's troubling
[07:59:55] <dcausse>	 jayme: ok, we believe the pipeline does not manage to achieve the throughput we expect, we're just looking at the various datapoints and found this one a bit odd because it starts to throttle before hitting its limit
[08:04:25] <jayme>	 dcausse: that's because throttling is complex (https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limits#How_CPU_requests_and_limits_are_applied) and bound to (time) windows
[08:04:57] <jayme>	 IIRC Peter was playing around with concurrency and resource settings for cirrus some time ago
[08:05:38] <jayme>	 if you feel like it's limiting you, you may ofc increase the limits. But tbh the throttling does not seem like "that much"
[08:07:09] <dcausse>	 ok, thanks!
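
Note on the exchange above: a minimal sketch of why a container can be throttled while its averaged CPU usage sits below the limit. It assumes the usual CFS bandwidth model (quota = limit x period, with a 100 ms period) and the settings quoted above (2 worker threads, limit of 1 CPU). The function name and the bursty demand pattern are hypothetical, chosen only for illustration; this is not a model of the actual envoy workload.

```python
def simulate_cfs(demand_ms_per_period, threads=2, limit_cpus=1.0, period_ms=100.0):
    """Toy model of CFS bandwidth control (illustrative assumption, not kernel code).

    demand_ms_per_period: CPU-milliseconds of work arriving in each period.
    Work that does not fit into a period's quota carries over to the next one.
    Returns (average CPUs used over the whole run, number of throttled periods).
    """
    quota_ms = limit_cpus * period_ms        # CPU time the cgroup may use per period
    max_parallel_ms = threads * period_ms    # what the threads could burn if unthrottled
    backlog = 0.0
    used_total = 0.0
    throttled = 0
    for demand in demand_ms_per_period:
        pending = backlog + demand
        wanted = min(pending, max_parallel_ms)
        used = min(wanted, quota_ms)
        if wanted > quota_ms:
            throttled += 1                   # quota exhausted before the period ended
        backlog = pending - used
        used_total += used
    avg_cpus = used_total / (period_ms * len(demand_ms_per_period))
    return avg_cpus, throttled


# Hypothetical bursty load: 150 ms of CPU work arrives at once, then three idle periods.
avg, throttled = simulate_cfs([150.0, 0.0, 0.0, 0.0] * 5)
print(f"average usage: {avg} CPUs, throttled periods: {throttled} of 20")
```

In this toy run the two threads exhaust the 100 ms quota early in each burst period and are throttled, yet usage averaged over the full window stays well under 1 CPU, which is roughly the shape a dashboard with a coarser time window would show.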
[08:19:10] <wikibugs>	 06serviceops, 10Prod-Kubernetes, 07Kubernetes: Improve calico-typha firewall rules - https://phabricator.wikimedia.org/T365687#9887639 (10JMeybohm)
[08:39:36] <wikibugs>	 06serviceops, 06SRE, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9887671 (10akosiaris)
[08:49:07] <wikibugs>	 06serviceops, 06Infrastructure-Foundations, 10Data-Platform-SRE (2024.05.27 - 2024.06.16), 13Patch-For-Review: Create a helm chart for the cloudnativepg postgresql operator - https://phabricator.wikimedia.org/T364797#9887706 (10Gehel) Adding #serviceops and #infrastructure-foundations to get a review
[08:59:27] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE-OnFire, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9887726 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq...
[09:08:40] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-eqiad, 10Prod-Kubernetes, and 2 others: Relabel eqiad wikikube worker nodes - https://phabricator.wikimedia.org/T367285#9887749 (10Clement_Goubert) @VRiley-WMF Do you object to us reusing that task by reopening it whenever we have a batch of servers to relabel, or would yo...
[09:10:33] <wikibugs>	 06serviceops, 06Data-Engineering-Icebox, 06Machine-Learning-Team: Using docker in WMF production network outside of kubernetes - https://phabricator.wikimedia.org/T275551#9887755 (10akosiaris) In Kubernetes Special Interest Group, we recently re-evaluated the approach of running Docker outside of Kubernetes....
[09:38:13] <wikibugs>	 06serviceops, 10Prod-Kubernetes, 13Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9887868 (10brouberol)
[09:38:17] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE-OnFire, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9887869 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad....
[10:03:54] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE-OnFire, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9887946 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-ctrl2003.codfw....
[10:05:59] <wikibugs>	 06serviceops, 10decommission-hardware, 13Patch-For-Review: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9887949 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1002 for hosts: `mw[2281,2283-2286].codfw.wmnet` - mw2281.co...
[10:07:54] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE-OnFire, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9887959 (10kamila) >>! In T366205#9880294, @Papaul wrote: > @kamila  your plan works for us as well, just depool and power the fi...
[10:26:55] <wikibugs>	 06serviceops, 10decommission-hardware, 13Patch-For-Review: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9888038 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1002 for hosts: `mw[2287-2290].codfw.wmnet` - mw2287.codfw.w...
[10:31:12] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9888054 (10Clement_Goubert) @Papaul All servers except `mw2282` decommissioned.
[10:31:16] <wikibugs>	 06serviceops, 06DC-Ops, 10decommission-hardware, 10ops-codfw, and 2 others: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9888045 (10Clement_Goubert)
[10:32:16] <wikibugs>	 06serviceops, 06SRE, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9888074 (10MoritzMuehlenhoff)
[10:32:33] <wikibugs>	 06serviceops, 06DC-Ops, 10decommission-hardware, 10ops-codfw, 06SRE: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9888049 (10Clement_Goubert) a:05Clement_Goubert→03None
[10:32:42] <wikibugs>	 06serviceops, 06SRE, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9888076 (10MoritzMuehlenhoff)
[11:12:49] <wikibugs>	 06serviceops, 10Beta-Cluster-Infrastructure, 13Patch-Needs-Improvement: Implement API Gateway solution for deployment-prep - https://phabricator.wikimedia.org/T254917#9888170 (10hnowlan) 05Open→03Declined I don't think this will be needed.
[11:45:39] <godog>	 FYI steady increase in 503s since this morning for mw-api-ext-ro, just hit the paging threshold
[11:45:51] <godog>	 https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-site=esams&var-cluster=text&var-origin=mw-api-ext-ro.discovery.wmnet&from=1718192748018&to=1718279148018&viewPanel=12
[12:02:09] <wikibugs>	 06serviceops: steady increase in 503s from mw-api-ext-ro.discovery.wmnet since 5 UTC - https://phabricator.wikimedia.org/T367401 (10fgiunchedi) 03NEW
[12:04:25] <godog>	 filed ^ not a big deal as far as mw-api-ext is concerned overall though it still paged
[12:20:45] <wikibugs>	 06serviceops, 06SRE, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9888356 (10WDoranWMF)
[12:37:54] <wikibugs>	 06serviceops: steady increase in 503s from mw-api-ext-ro.discovery.wmnet since 5 UTC - https://phabricator.wikimedia.org/T367401#9888430 (10fgiunchedi) 503s are gone as of ~12:20 UTC
[12:53:13] <claime>	 godog: check with Amir1, it's probably the circuit breaking
[12:53:39] <godog>	 claime: we thought so too, though that should yield 500s :| see also -sre
[12:53:48] <Amir1>	 claime: it is not circuit breaking :D
[12:53:48] <claime>	 yeah, catching up now
[13:21:40] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9888659 (10Papaul) @Clement_Goubert thank you.
[13:40:57] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9888707 (10Jhancock.wm) rails, power, and network cables prepped for mw2282 move.
[13:41:15] <wikibugs>	 06serviceops, 06SRE, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9888709 (10hnowlan) >>! In T361835#9712223, @SGupta-WMF wrote: > @WDoranWMF Ye...
[14:12:28] <elukey>	 hi folks!
[14:12:34] <elukey>	 Is https://docker-registry.wikimedia.org/wikimedia/mediawiki-services-parsoid/tags/ needed?
[14:12:55] <elukey>	 or can we drop it? IIUC this runs Stretch, seems abandoned
[14:14:28] <elukey>	 same for https://docker-registry.wikimedia.org/wikimedia/mediawiki-services-restbase/tags/
[14:14:50] <elukey>	 https://docker-registry.wikimedia.org/wikimedia/mediawiki-services-geoshapes/tags/
[14:17:04] <claime>	 they're not in deployment-charts at least
[14:17:48] <hnowlan>	 wtf, til there's a .pipeline/blubber.yaml file for restbase. But it's not building? O_o
[14:19:39] <hnowlan>	 restbase definitely can be dropped 
[14:19:46] <hnowlan>	 blubber support in parsoid was dropped in 2020
[14:21:29] <wikibugs>	 06serviceops, 10MW-on-K8s, 06SRE, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9888856 (10Clement_Goubert)
[14:30:11] <claime>	 There's still errors connecting to shellbox, but it doesn't seem overloaded
[14:31:30] <claime>	 I'm going to bump it by a couple of replicas, see what shakes
[14:34:38] <claime>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1043087
[14:52:15] <cdanis>	 claime: not sure if relevant but fwiw some mediawiki calls will (serially) make 10-50 calls to shellbox_constraints
[14:56:12] <claime>	 cdanis: looks like errors have gone with the bump in resources
[14:56:23] <cdanis>	 neat
[14:56:58] <cdanis>	 claime: what does a shellbox_constraints actually run?  how many php-fpm workers
[14:57:39] <claime>	 2 workers afaict
[14:57:45] <cdanis>	 oh
[14:58:02] <cdanis>	 are the requests very short usually?
[14:58:15] <claime>	 yes
[14:58:16] <cdanis>	 if so that might be ok
[14:58:34] <claime>	 mean is around 1.2ms
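
Note on the numbers above: a back-of-the-envelope sketch of what 2 workers and a ~1.2 ms mean request time imply. It assumes each worker handles one request at a time and ignores queuing, network latency and variance in request time, so the results are rough upper and lower bounds rather than measurements. The function names are hypothetical.

```python
def pod_capacity_rps(workers, mean_service_ms):
    """Rough upper bound on requests/second one pod can serve,
    assuming each worker handles one request at a time."""
    return workers * 1000.0 / mean_service_ms


def serial_latency_ms(calls, mean_service_ms):
    """Lower bound on latency added when one caller issues `calls`
    requests one after another (no queuing or network time included)."""
    return calls * mean_service_ms


print(f"per-pod capacity: ~{pod_capacity_rps(2, 1.2):.0f} req/s")
print(f"10 serial calls add at least ~{serial_latency_ms(10, 1.2):.0f} ms")
print(f"50 serial calls add at least ~{serial_latency_ms(50, 1.2):.0f} ms")
```

With so few workers per pod, short bursts of concurrent callers can still queue even when average utilisation looks low, which is consistent with errors appearing without the service looking overloaded.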
[15:15:36] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE-OnFire, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9889259 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq...
[15:21:23] <_joe_>	 elukey: the parsoid stuff can be burnt with fire
[15:21:44] <_joe_>	 same as for restbase
[15:21:52] <_joe_>	 but we should remove the blubber files there
[15:22:27] <_joe_>	 claime, cdanis shellbox-constraints is used to safely evaluate user-provided regexes away from mediawiki
[15:22:42] <cdanis>	 yeah I know of it from seeing it in traces :)
[15:22:45] <_joe_>	 to ensure they're both converging and they don't run too long
[15:22:47] <elukey>	 I just removed eventgate-ci (Andrew confirmed it wasn't used)
[15:23:13] <_joe_>	 If it's overwhelmed, I'd suggest increasing the workers per pod probably
[15:25:21] <wikibugs>	 06serviceops: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427 (10elukey) 03NEW
[15:27:14] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE-OnFire, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9889307 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad....
[15:27:22] <wikibugs>	 06serviceops: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427#9889308 (10elukey) Dropped eventgate-ci as well (Andrew Otto confirmed that it is not used anymore since ages).
[15:28:12] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE-OnFire, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9889319 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq...
[15:35:15] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE-OnFire, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9889387 (10Papaul) @kamila no problem we can move that one. Once done we will update the task.
[15:38:00] <wikibugs>	 06serviceops, 10MediaWiki-Configuration, 06MediaWiki-Engineering, 10MW-on-K8s, and 2 others: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T346971#9889404 (10brennen) Still seen in 1.43.0-wmf.9.  Recent versions:  {F55289448}   ==== Error ====  * servi...
[15:40:40] <wikibugs>	 06serviceops, 10MW-on-K8s, 06SRE, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889411 (10Jdforrester-WMF) Looks like this is now done except for "some straggling traffic" for the api-gateway?  {F55289507}
[15:54:26] <wikibugs>	 06serviceops, 10MW-on-K8s, 06SRE, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889560 (10Clement_Goubert) Yes, but I will close it when I'm sure I have zero internal traffic on the bare metal clusters.
[15:57:56] <wikibugs>	 06serviceops, 10MW-on-K8s, 06SRE, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889584 (10hnowlan) I believe the straggling traffic here is a misnomer/a graph misunderstanding - the API gateway refers to traffic to the mediawiki API as "mwapi_cluster"...
[17:41:03] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-eqiad, 10Prod-Kubernetes, and 2 others: Relabel eqiad wikikube worker nodes - https://phabricator.wikimedia.org/T367285#9890137 (10VRiley-WMF) @Clement_Goubert I believe it would be better to open a new task for any servers that need to be relabeled.
[18:14:16] <jhathaway>	 wikikube-ctrl1003 alerted as being down, uptime though is 2hrs, _kamila was re-imaging, the box does not puppet clean. We think it is okay to leave as is
[19:10:36] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE-OnFire, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9890462 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7ffb1c0b-d404-4615-accd-65085d64f738) set by kamila@c...
[19:16:48] <kamila_>	 ^ wikikube-ctrl1003 should be fixed-ish once puppet finishes 
[19:16:59] <cdanis>	 thanks kamila_ 
[19:27:18] <wikibugs>	 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE-OnFire, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9890504 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad....
[21:11:00] <wikibugs>	 06serviceops, 13Patch-For-Review: Alerting on under-scaled deployments - https://phabricator.wikimedia.org/T366932#9890861 (10Scott_French) Next steps: * Let this soak for a bit and check the noise level of KubernetesDeploymentUnavailableReplicas [0] (currently warning, so it matches no receivers). * Follow up...