[09:56:27] fyi im going to test a reimage of sretest1002
[09:56:53] slyngs: i see you on there are you doing something?
[09:57:15] * jbond can use 1001 instead
[09:57:31] I'm off. I was just using it for referencing some files
[09:57:45] cool thanks
[09:57:45] pick any for me :)
[09:58:05] cheers ill continue with 1002
[09:58:14] A file... I mean /bin/false is a personal favorite
[12:16:58] Something like this to rerun puppet on failed hosts cumin -p0 -b 40 '*' 'run-puppet-agent --failed-only -q' ?
[12:20:58] slyngs: yes https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed
[12:21:19] the -p though seems a bit weird do you expect them to fail?
[12:21:59] Oh, right..
[12:22:42] Running something on 2000+ devices is still a little scary :-)
[12:24:08] are they all broken?
[12:24:19] if so just wait 30m and it will be quicker than your run
[12:24:29] if only few hundreds are broken then go ahead :D
[12:24:56] Someone may or may not have done a puppet patch that broke ... a lot of devices.
[12:25:37] puppetboard says 548 :D
[12:25:42] going down
[12:26:14] Yeah, I removed a resource the wrong way
[12:26:31] it happens, you fixed it :)
[12:26:43] blame also the reviewer(s) :-P
[12:27:11] That's trigger, because what if I review something that breaks :-)
[12:27:16] Tricky
[12:34:13] I was joking but there is indeed some shared "ownership" of a reviewed change even if in a different way:)
[12:55:54] Don't we have a "We
[12:56:08] "We're in this together" slogan or something
[13:02:33] netops, Infrastructure-Foundations, SRE: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (ayounsi)
[13:06:09] netops, Infrastructure-Foundations, SRE: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (ayounsi) Some of our transits like Lumen use MEDs so we need to make sure that a global knob doesn't impact those negatively.
Another idea is to use BG...
[13:18:35] netops, Infrastructure-Foundations, SRE, Traffic: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (ayounsi) Thanks for the task and feedback. If the issue is abuse from a limited number of providers (like in {T163312} it seems better to filter out that kin...
[13:27:34] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (Papaul) We(dc-ops) have been receiving a lot of interface alerts error in the past 1 month or so. Will it be possible to si...
[13:53:12] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (cmooney) Open→Resolved a:cmooney >>! In T333007#9238718, @Papaul wrote: > We(dc-ops) have been receiving a lot...
[14:00:52] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (ssingh)
[15:39:16] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (Jclark-ctr)
[15:49:55] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (aborrero)
[18:07:18] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=60fd6a7d-c8e6-49a7-96ff-ccbed13297a2) set by cmooney@cumin1001 f...
[18:10:11] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=01394557-10ca-4b57-b8c9-c263e86708ec) set by cmooney@cumin1001 f...
[19:29:20] netops, Infrastructure-Foundations, SRE: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1acb901c-b161-4437-8a77-d11252fb6315) set by cmooney@cumin1001 for 2:00:00 on 6 host(s...
[19:29:42] netops, Infrastructure-Foundations, SRE: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7e1738b5-8479-4892-843b-26ddc9d964ea) set by cmooney@cumin1001 for 2:00:00 on 18 host(...
[20:36:41] Heya. I'm trying to reimage ncredir5001 but it's failing to come back up and the cookbook is stuck in the wait_reboot_since forever. ncredir5002 worked fine last week (two weeks ago?) Is this a known issue?
[20:37:05] I'm having a hard time getting any output for troubleshooting as gnt-instance refuses to connect due to host key failures
[20:40:23] oh nice, it finally rebooted.
Nevermind, I guess
[20:40:36] netops, Infrastructure-Foundations, SRE: Change EPVN RR setup to use different cluster ID on each host - https://phabricator.wikimedia.org/T348583 (cmooney) p:Triage→Low
[21:26:53] Hm, it's happening again
[21:27:04] on 212/240
[21:29:24] Looks like the host is up as I'm getting responses from ssh ncredir5001.eqsin.wmnet
[21:54:09] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (cmooney) Thanks @Jclark-ctr, I can confirm things look good (including light levels and pings I've not added here). ` cmooney@ssw1-f1-eqiad> show int...
[22:37:23] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Change EPVN RR setup to use different cluster ID on each host - https://phabricator.wikimedia.org/T348583 (cmooney)
[22:38:02] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Change EPVN RR setup to use single BGP group and different cluster ID on every RR - https://phabricator.wikimedia.org/T348583 (cmooney)
[22:41:34] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Change EPVN RR setup to use single BGP group and different cluster ID on every RR - https://phabricator.wikimedia.org/T348583 (cmooney)
[22:46:11] I give up for today. Would love some feedback when you have the time!
[23:43:33] (SystemdUnitFailed) firing: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:48:33] (SystemdUnitFailed) resolved: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed