[07:20:26] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi)
[08:49:48] netops, Infrastructure-Foundations, SRE: Automate Netbox additions for new spine/leaf L3 networks. - https://phabricator.wikimedia.org/T333441 (cmooney) Open→Resolved I'm gonna close this for now. I used the following tooling to create the necessary in eqiad/codfw for recent expansion. An i...
[08:49:54] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[08:51:59] netops, Infrastructure-Foundations, SRE, ops-eqiad: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (cmooney) Indeed blocked on the optics arriving, but to clarify the cable runs have been done we just need the optics to slot in and connect. @Jclark-ctr correct...
[09:41:11] netops, Infrastructure-Foundations, SRE: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (cmooney) p:Triage→Low
[09:45:03] netops, Infrastructure-Foundations, SRE: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (Volans) SGTM, I would even consider a shorter time span :)
[10:39:59] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[10:48:59] (SystemdUnitFailed) firing: (2) remove_old_puppet_reports.service Failed on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:49:38] * jbond looking
[12:37:06] netops, Infrastructure-Foundations, SRE: New IP and Vlan allocations for esams knams move - https://phabricator.wikimedia.org/T343214 (cmooney) Open→Resolved
[13:27:24] netops, Infrastructure-Foundations, SRE: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (ayounsi) I thought that was not possible but it got introduced recently (in 16.1). +1
[13:37:18] netops, Infrastructure-Foundations, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (elukey)
[13:39:01] netops, Infrastructure-Foundations, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (elukey) Surely not related but I noticed that the conf2xxx nodes hold a ton (8/9k) sockets in TIME_WAIT, most of them related to nginx -> etcd local traffic....
[13:50:02] (SystemdUnitFailed) firing: (3) kube-controller-manager.service Failed on aux-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:44] (SystemdUnitCrashLoop) firing: kube-controller-manager.service crashloop on aux-k8s-ctrl1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:53:59] (SystemdUnitFailed) firing: (3) kube-controller-manager.service Failed on aux-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:00:44] (SystemdUnitCrashLoop) resolved: kube-controller-manager.service crashloop on aux-k8s-ctrl1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[14:56:43] SRE-tools, Infrastructure-Foundations, Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (Vgutierrez) having this in place would have prevented a ncredir related page already. I'm happy to have this opt-in per cookbook (personally I'...
[15:23:01] SRE-tools, Infrastructure-Foundations, Spicerack: Support cookbooks resume after user interruption - https://phabricator.wikimedia.org/T345402 (Volans) In terms of feasibility the only way to "resume" is to install a signal handler for SIGINT that asks the user to either resume or continue **but ther...
[15:35:02] (SystemdUnitFailed) firing: (2) remove_old_puppet_reports.service Failed on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:35:58] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (akosiaris) So, `ethtool -G eno1 rx 1000` apparently did the [trick](https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=con...
[16:20:01] SRE-tools, Infrastructure-Foundations: Cookbooks could be more verbose in listing the completed/missing steps - https://phabricator.wikimedia.org/T345375 (BCornwall) TBH I think that cookbooks should be *less* verbose, which will help punctuate the more important information. IMO the current output of co...
[16:24:20] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (cmooney) We removed switch asw-b1-codfw as it no longer had any servers connected (they were moved to cloudsw1-b1-codfw). The correlation between th...
[16:34:03] SRE-tools, Infrastructure-Foundations, Spicerack: Support cookbooks resume after user interruption - https://phabricator.wikimedia.org/T345402 (BCornwall) The cookbooks should already stop/prompt the user when built with `confirm_on_failure()`. Anything more interactive is probably not a good UX and...
[19:13:59] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:14:17] SRE-tools, Infrastructure-Foundations, Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (BCornwall) @Vgutierrez Is this something that should be addressed in the cookbook? Your idea of automatically including it in cookbooks with d...
[19:26:02] SRE-tools, Infrastructure-Foundations, Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (BCornwall) It looks like confirmation is already shown in the [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/hea...
[20:13:59] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:37:52] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Papaul) Thank you for putting the summary together. Another scenario I was thinking about while reading the document is up...