[06:32:56] <_joe_> ottomata: given the cuts to the hardware provisioning, I'm not really keen on overloading kafka-main with new stuff
[06:33:19] <_joe_> basically we'd be throwing a lot more stuff to process at an already aging cluster
[06:33:42] <_joe_> so please let's review numbers before you move to the next step (prod clusters and kafka-main)
[06:57:33] serviceops: Publish Wikimedia bookworm base Docker image - https://phabricator.wikimedia.org/T335560 (Joe) While I've added bookworm to the build process, I think I'll revert that part of my change. Every time I try to run a build there is a different symlink broken on `snapshots.debian.org` making the proc...
[07:24:54] serviceops: Publish Wikimedia bookworm base Docker image - https://phabricator.wikimedia.org/T335560 (Joe) It's much simpler than that - debuerreotype would work correctly - the problem is that snapshots are full of broken links for bookworm - again: http://snapshot.debian.org/archive/debian/20230515T030231...
[07:47:51] serviceops, Patch-For-Review: Publish Wikimedia bookworm base Docker image - https://phabricator.wikimedia.org/T335560 (Joe) Open→Stalled After discussion with @MoritzMuehlenhoff - it makes sense that snapshots might be broken right now that things change hectically for a testing distro before th...
[08:14:49] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:16:43] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:19:19] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:24:15] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[09:26:10] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (hnowlan)
[09:30:05] serviceops, serviceops-collab, Release-Engineering-Team (They Live 🕶️🧟): Gitlab downtime blocking scap backport - https://phabricator.wikimedia.org/T336162 (jnuche) Open→Resolved Change is now in prod. Scap should now complete deployments when gitlab is not available.
[10:03:16] serviceops, Observability-Metrics, SRE, User-fgiunchedi: Upgrade cadvisor to 0.44 fleetwide - https://phabricator.wikimedia.org/T336740 (fgiunchedi)
[10:03:40] hi folks, a heads up re: cadvisor upgrade in ^
[10:04:48] I'll start the rollout this afternoon, from mwdebug, I've tested the upgrade and there's no significant change
[10:05:00] in terms of actions required that is
[10:18:32] heads-up, I will be bumping memory limits for thumbor in codfw and eqiad in a few minutes: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/919808
[10:27:38] once switch maintenance is done I'll be returning thumbor-k8s eqiad to 100% by depooling the metal instances
[10:30:07] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row D sw...
[10:46:45] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row D sw...
[10:51:36] there's an unapplied change for certmanager in codfw, adding dnsNames for 15.wikipedia.org and annual.wikimedia.org. Probably safe to apply?
[10:51:43] er codfw-staging
[10:52:15] staging-codfw is our playground, I'd assume it's safe
[11:04:47] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (BTullis)
[11:55:37] _joe_: we decided that we'd be producing to kafka-jumbo, ya?
[12:02:47] <_joe_> ottomata: d'oh sorry, I went on vacation afterwards and I forgot
[12:05:36] serviceops, Service-deployment-requests: New Service Request 'iPoid' - https://phabricator.wikimedia.org/T325147 (jijiki)
[12:09:28] haha :)
[12:09:56] I think it's good we decided that. maybe one day we'll want the content events multi DC for other stuff, but for now, especially since we're new at this, jumbo is good.
[12:17:45] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ssingh)
[12:49:13] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3a841f97-aecd-4c7a-8eb4-8acd1caa15b3) set by ayounsi@cumin1001 for 2:00:00 on 18...
[12:53:19] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MatthewVernon)
[12:54:11] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[13:09:53] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (JArguello-WMF)
[13:10:18] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (JArguello-WMF)
[13:26:03] serviceops, Observability-Metrics, SRE, User-fgiunchedi: Upgrade cadvisor to 0.44 fleetwide - https://phabricator.wikimedia.org/T336740 (fgiunchedi) Open→Resolved a:fgiunchedi This is completed, we're running cadvisor `0.44.0+ds1-1~wmf1` on buster and bullseye
[13:28:05] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[13:54:38] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row D switc...
[14:10:27] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row D switc...
[14:10:43] serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (jijiki) Looks like we need to put more servers to the problem, even if it is not this specific job that is adding on utilisation, since we have the hardware to...
[14:13:37] serviceops, MW-on-K8s, Performance-Team (Radar), Wikimedia-production-error: ResourceLoader icon rasterization fails via MediaWiki-on-Kubernetes - https://phabricator.wikimedia.org/T336025 (jijiki) Added `librsvg2-bin` to php-multiversion-base
[14:24:51] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (herron)
[14:26:40] serviceops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi) Open→Resolved a:ayounsi Upgrade went very well. Thanks everybody! That was the last one!
[17:06:57] interesting accidental stress test - thumbor in eqiad had k8s at weight 10 and metal at weight 5 while codfw was depooled, and there were no issues. Queues jumped a little, but there was not much of an increase in errors beyond what was proportionate to the traffic served
[17:07:11] I've since depooled metal in eqiad now that codfw is back in - looking okay so far
[17:17:15] serviceops, Content-Transform-Team-WIP, RESTBase, SRE, and 5 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (FJoseph-WMF) I've scheduled a meeting this week for followup
[17:24:49] <_joe_> hnowlan: we should test depooling codfw anyways - one dc needs to be able to keep up with all the traffic
[21:12:07] serviceops, Security-API-Service, Kubernetes: Create helm chart for iPoid - https://phabricator.wikimedia.org/T336163 (jijiki)
[21:13:56] serviceops, Security-API-Service, Kubernetes: Create helm chart for iPoid - https://phabricator.wikimedia.org/T336163 (jijiki)
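
Note on the 17:06:57 message: the weights mentioned there (k8s at 10, metal at 5) can be turned into rough traffic shares with a little arithmetic. The sketch below assumes, and this is only an assumption since the log does not say how the load balancer interprets weights, that traffic is split in direct proportion to weight and that a depooled backend counts as weight 0. The function name `traffic_shares` and the backend labels are hypothetical, chosen for illustration.

```python
from fractions import Fraction


def traffic_shares(weights):
    """Return each backend's share of traffic, assuming shares are
    directly proportional to weight (an assumption, not confirmed by the log)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one backend needs a positive weight")
    return {name: Fraction(weight, total) for name, weight in weights.items()}


# Weights quoted in the log for thumbor in eqiad while codfw was depooled.
during_maintenance = traffic_shares({"k8s": 10, "metal": 5})

# After metal was depooled in eqiad, modelled here as weight 0 (assumption).
after_depool = traffic_shares({"k8s": 10, "metal": 0})

for label, shares in (("during", during_maintenance), ("after", after_depool)):
    for name, share in shares.items():
        print(f"{label} {name}: {share} ({float(share):.0%})")
```

Under that proportional assumption, k8s would have carried about two thirds of eqiad's thumbor traffic during the maintenance window, and all of it once metal was depooled.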