[00:05:28] serviceops, Phabricator, collaboration-services, Release-Engineering-Team (Bonus Level 🕹️): Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Dzahn) @Aklapper Well... let me take a look, something in between. We don't have to delete the entire cla...
[03:28:17] serviceops, MW-on-K8s, Patch-For-Review: Handle sidecar containers in one-off Kubernetes jobs - https://phabricator.wikimedia.org/T348284 (RLazarus) Okay, let me know if https://gerrit.wikimedia.org/r/983963 plus the most recent iteration of https://gitlab.wikimedia.org/repos/sre/k8s-controller-sidec...
[08:16:27] serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (akosiaris) For the record, in SEL this host had logged ` ------------------------------------------------------------------------------- Record: 2 Date/Time...
[08:31:41] serviceops, SRE, ops-codfw: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (akosiaris) Thanks @taavi for setting this host to inactive. The CPU 1 machine check error was also logged one more time, ` ----------------------------------------------------------------------------...
[08:52:06] serviceops, Maps, SRE, Traffic: Allow Wikimedia Maps usage on Wikimedia Commons Android app - https://phabricator.wikimedia.org/T349280 (MSantos) @Nicolas_Raoul thanks for reaching out. I am one of the main maintainers of Maps and maybe the person that can help the approval process, however I wil...
[10:51:53] serviceops, MW-on-K8s, Patch-For-Review: Handle sidecar containers in one-off Kubernetes jobs - https://phabricator.wikimedia.org/T348284 (JMeybohm) >>! In T348284#9413568, @RLazarus wrote: > Oh, I misunderstood what you meant by "enable the controller on a per namespace level" [[ #9392506 | above ]]...
[11:08:06] serviceops, Prod-Kubernetes, Kubernetes: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464 (Clement_Goubert)
[11:08:08] serviceops, Prod-Kubernetes: Rethink kubernetes etcd storage - https://phabricator.wikimedia.org/T348466 (Clement_Goubert)
[11:40:14] serviceops, Kubernetes: Outage of wikikube codfw apiservers - https://phabricator.wikimedia.org/T353233 (JMeybohm) Open→Resolved Resolving this as the immediate problem is resolved and remaining follow-ups have their own tasks
[11:43:50] serviceops, Prod-Kubernetes, Kubernetes: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464 (JMeybohm) >>! In T353464#9413167, @bking wrote: > Forgive me for the drive-by comment, but would it be possible to create high IOPS tiers for Ganeti (RAID-0?) I'd re...
[11:46:27] serviceops, Dumps-Generation, Infrastructure-Foundations, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (Clement_Goubert) >>! In T271142#9413333, @akosiaris wrote: >>>! In T271142#9382040, @Volans wrote: >> Another...
[11:53:26] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Allow to address Kubernets API servers from NetworkPolicy - https://phabricator.wikimedia.org/T287491 (JMeybohm) kube-state-metrics successfully introduced the pattern of using a calico networkpolicy with service selector to match masters...
[12:01:40] serviceops, MediaWiki-DjVu, Shellbox, Structured-Data-Backlog, and 4 others: RuntimeException: firejail is enabled, but cannot be found - https://phabricator.wikimedia.org/T352515 (Clement_Goubert) >>! In T352515#9407128, @brennen wrote: > Seeing this also for PdfHandler: > > > ==== Error ====...
[12:19:45] serviceops, envoy, observability, Patch-For-Review: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568 (akosiaris) I 've tried to change the default in 1 of the users above, namely the services proxy (I haven't even looked at the tlsproxy that does TLS termination pr...
[13:16:51] Hi! I am trying to deploy a version bump on proton but helm failed. From kubectl logs I got: `0/4 nodes are available: 2 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.` Can somebody help me with that ?
[13:16:55] This is staging btw.
[13:32:15] serviceops: Proton deployment fails on staging kubernetes - https://phabricator.wikimedia.org/T353699 (Jgiannelos)
[13:32:17] nemo-yiannis: I was wondering about that, staging is quite full wrt CPU requests
[13:33:16] we need to do some form of cleanup, either adding nodes or reevaluating CPU requests (the actual usage seems to be low)
[13:34:42] Would the same happen on eqiad/codfw prod ?
[13:37:25] no, prod has enough space
[13:37:52] https://grafana.wikimedia.org/goto/5edX9WdIz?orgId=1, see CPU requested on staging vs prod (cluster k8s-staging vs k8s)
[13:39:11] we keep an eye on prod, but apparently staging slipped our attention and just kept growing, sorry about that, I'll fix it today
[13:39:35] Any suggestions on how to proceed? Should i try prod directly, should we wait until staging is resolved?
[13:39:45] (we should probably have alerts too, those metrics are new and I haven't got to that yet '^^)
[13:40:30] I can unblock staging in <1h, does that work for you?
[13:40:45] (probably)
[13:40:51] sure
[13:40:55] ok, on it
[13:40:58] thanks!
[14:22:24] serviceops, Prod-Kubernetes, Kubernetes: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464 (akosiaris) >>! In T353464#9412778, @JMeybohm wrote: >>>! In T353464#9412761, @akosiaris wrote: >> I am not so sure we actually do scratch that memory limit now. Look...
[14:23:14] serviceops, Kubernetes: Outage of wikikube codfw apiservers - https://phabricator.wikimedia.org/T353233 (JMeybohm)
[14:23:22] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Alert on calico components being down - https://phabricator.wikimedia.org/T353463 (JMeybohm) Open→Resolved a:JMeybohm
[14:28:05] nemo-yiannis: I freed a couple CPUs, it's still quite tight but your deployment should go through I think
[14:28:18] ok trying now
[14:30:00] thanks! it picked up the last image
[14:30:36] yay
[14:30:48] now somebody should do something so it doesn't happen again in two weeks :D
[14:34:12] serviceops: Proton deployment fails on staging kubernetes - https://phabricator.wikimedia.org/T353699 (Jgiannelos) Open→Resolved a:Jgiannelos
[14:35:28] serviceops, Data-Platform-SRE (2023/24 Q2 Milestone 1), Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (pfischer) @elukey, we have an updated estimate of the expected topic size increment per wiki we publ...
[14:40:29] serviceops, Data-Platform-SRE (2023/24 Q2 Milestone 1), Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (pfischer)
[15:53:17] serviceops, SRE, ops-codfw: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (Volans) Should the icinga alert for host down and related service alerts in icinga and alertmanager be silenced given it's known and there is a task?
[15:56:03] serviceops, SRE, ops-codfw: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c526ca54-768b-461b-9bc7-1666a80b4153) set by cgoubert@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: hw fail...
[16:14:06] thanks claime :)
[16:14:13] yw
[16:45:36] serviceops, Data-Platform-SRE (23/24 Q3 Milestone 1), Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (Gehel)
[17:17:49] serviceops, Prod-Kubernetes, Kubernetes: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (Ottomata)
[20:18:38] serviceops, Phabricator, collaboration-services, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹️): Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Dzahn) >>! In T296022#9411591, @Aklapper wrote: > The `operations/puppet` repo stil...
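
Note on the 13:16:51 scheduling failure: the message quoted there lists, per reason, how many of the 4 staging nodes rejected the pod (2 lacked unreserved CPU, 2 carried the master taint). The sketch below splits such a message into per-reason counts. It is only an illustration: the function name `parse_unschedulable` is hypothetical, and it assumes the human-readable wording seen in this log, which is not a stable interface and may differ between Kubernetes versions.

```python
import re


def parse_unschedulable(message: str) -> dict:
    """Split a 'N/M nodes are available: ...' scheduler message into counts.

    Assumption: reasons are separated by ", <count> " as in the message
    quoted in this log. A reason may itself contain a comma (the taint
    reason does), so we cannot simply split on ",".
    """
    head = re.match(r"\s*(\d+)/(\d+) nodes are available: (.*)", message)
    if head is None:
        raise ValueError("not a 'nodes are available' message")
    available, total = int(head.group(1)), int(head.group(2))
    rest = head.group(3).rstrip().rstrip(".")

    reasons = {}
    # Each reason starts with a count and runs until the next ", <count> "
    # or the end of the message.
    for count, reason in re.findall(r"(\d+) (.+?)(?=, \d+ |$)", rest):
        reasons[reason] = reasons.get(reason, 0) + int(count)
    return {"available": available, "total": total, "reasons": reasons}


if __name__ == "__main__":
    msg = (
        "0/4 nodes are available: 2 Insufficient cpu, 2 node(s) had taint "
        "{node-role.kubernetes.io/master: }, that the pod didn't tolerate."
    )
    result = parse_unschedulable(msg)
    for reason, count in result["reasons"].items():
        print(f"{count} of {result['total']} nodes: {reason}")
```

Read this way, the two tainted nodes are control-plane nodes that were never candidates, so the deployment was blocked only by the CPU requests already reserved on the two remaining nodes, which matches the "staging is quite full wrt CPU requests" reply at 13:32:17.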