[09:56:27] <jbond>	 FYI, I'm going to test a reimage of sretest1002
[09:56:53] <jbond>	 slyngs: I see you on there, are you doing something?
[09:57:15] * jbond can use 1001 instead
[09:57:31] <slyngs>	 I'm off. I was just using it for referencing some files
[09:57:45] <jbond>	 cool thanks
[09:57:45] <volans>	 pick any for me :)
[09:58:05] <jbond>	 cheers, I'll continue with 1002
[09:58:14] <slyngs>	 A file... I mean /bin/false is a personal favorite
[12:16:58] <slyngs>	 Something like this to rerun puppet on failed hosts? cumin -p0 -b 40 '*' 'run-puppet-agent --failed-only -q'
[12:20:58] <volans>	 slyngs: yes https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed
[12:21:19] <volans>	 the -p though seems a bit weird, do you expect them to fail?
[12:21:59] <slyngs>	 Oh, right..
[12:22:42] <slyngs>	 Running something on 2000+ devices is still a little scary :-)
[12:24:08] <volans>	 are they all broken?
[12:24:19] <volans>	 if so just wait 30m and it will be quicker than your run
[12:24:29] <volans>	 if only a few hundred are broken then go ahead :D
[12:24:56] <slyngs>	 Someone may or may not have done a puppet patch that broke ... a lot of devices.
[12:25:37] <volans>	 puppetboard says 548 :D
[12:25:42] <volans>	 going down
[12:26:14] <slyngs>	 Yeah, I removed a resource the wrong way
[12:26:31] <volans>	 it happens, you fixed it :)
[12:26:43] <volans>	 blame also the reviewer(s) :-P
[12:27:11] <slyngs>	 That's tricky, because what if I review something that breaks :-)
[12:34:13] <volans>	 I was joking, but there is indeed some shared "ownership" of a reviewed change, even if in a different way :)
[12:56:08] <slyngs>	 Don't we have a "We're in this together" slogan or something?
[13:02:33] <wikibugs>	 netops, Infrastructure-Foundations, SRE: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (ayounsi)
[13:06:09] <wikibugs>	 netops, Infrastructure-Foundations, SRE: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (ayounsi) Some of our transits like Lumen use MEDs so we need to make sure that a global knob doesn't impact those negatively. Another idea is to use BG...
[13:18:35] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Traffic: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (ayounsi) Thanks for the task and feedback. If the issue is abuse from a limited number of providers (like in {T163312} it seems better to filter out that kin...
[13:27:34] <wikibugs>	 netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (Papaul) We(dc-ops) have been receiving a lot of interface alerts error in the pass 1 month or so. Will it be possible to si...
[13:53:12] <wikibugs>	 netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (cmooney) Open→Resolved a:cmooney >>! In T333007#9238718, @Papaul wrote: > We(dc-ops) have been receiving a lot...
[14:00:52] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (ssingh)
[15:39:16] <wikibugs>	 netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (Jclark-ctr)
[15:49:55] <wikibugs>	 netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (aborrero)
[18:07:18] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=60fd6a7d-c8e6-49a7-96ff-ccbed13297a2) set by cmooney@cumin1001 f...
[18:10:11] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=01394557-10ca-4b57-b8c9-c263e86708ec) set by cmooney@cumin1001 f...
[19:29:20] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1acb901c-b161-4437-8a77-d11252fb6315) set by cmooney@cumin1001 for 2:00:00 on 6 host(s...
[19:29:42] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7e1738b5-8479-4892-843b-26ddc9d964ea) set by cmooney@cumin1001 for 2:00:00 on 18 host(...
[20:36:41] <brett>	 Heya. I'm trying to reimage ncredir5001 but it's failing to come back up and the cookbook is stuck in the wait_reboot_since forever. ncredir5002 worked fine last week (two weeks ago?) Is this a known issue?
[20:37:05] <brett>	 I'm having a hard time getting any output for troubleshooting as gnt-instance refuses to connect due to host key failures
[20:40:23] <brett>	 oh nice, it finally rebooted. Never mind, I guess
[20:40:36] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Change EPVN RR setup to use different cluster ID on each host - https://phabricator.wikimedia.org/T348583 (cmooney) p:Triage→Low
[21:26:53] <brett>	 Hm, it's happening again
[21:27:04] <brett>	 on 212/240
[21:29:24] <brett>	 Looks like the host is up as I'm getting responses from ssh ncredir5001.eqsin.wmnet
[21:54:09] <wikibugs>	 netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (cmooney) Thanks @Jclark-ctr, I can confirm things look good (including light levels and pings I've not added here).  ` cmooney@ssw1-f1-eqiad> show int...
[22:37:23] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Patch-For-Review: Change EPVN RR setup to use different cluster ID on each host - https://phabricator.wikimedia.org/T348583 (cmooney)
[22:38:02] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Patch-For-Review: Change EPVN RR setup to use single BGP group and different cluster ID on every RR - https://phabricator.wikimedia.org/T348583 (cmooney)
[22:41:34] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Patch-For-Review: Change EPVN RR setup to use single BGP group and different cluster ID on every RR - https://phabricator.wikimedia.org/T348583 (cmooney)
[22:46:11] <brett>	 I give up for today. Would love some feedback when you have the time!
[23:43:33] <jinxer-wm>	 (SystemdUnitFailed) firing: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:48:33] <jinxer-wm>	 (SystemdUnitFailed) resolved: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed