[00:38:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[01:08:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[06:13:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[07:13:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[08:19:25] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: SwitchCoreInterfaceDown (instance ssw1-f1-codfw:9804) - https://phabricator.wikimedia.org/T404946 (LSobanski) NEW
[10:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:36:08] ^ expected, install1004 is being shut down, I'll silence it
[11:19:58] netops, DC-Ops, Infrastructure-Foundations, SRE: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11193185 (cmooney) So draining traffic from the node did not go as planned. This config was applied: ` set protocols bgp graceful-shutdown sender set routing-instanc...
[11:42:43] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959 (cmooney) NEW p:Triage→High
[11:43:47] netops, DC-Ops, Infrastructure-Foundations, SRE: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11193246 (BTullis) Hi. In case it helps with your investigation, I can tell you that we observed a brief loss of connectivity on the dse-k8s cluster, which may well ha...
[11:59:20] netops, DC-Ops, Infrastructure-Foundations, SRE: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11193277 (cmooney) >>! In T400783#11193246, @BTullis wrote: > Hi. In case it helps with your investigation, I can tell you that we observed a brief loss of connectivit...
[14:55:17] topranks: how difficult is it to set up port mirroring on one of our juniper switches? Supermicro suspects our network is at fault for https://onsite.supermicro.com/index.php?/Tickets/Ticket/View/185807, which I think is highly unlikely. I was thinking a packet capture from the host port would be pretty definitive.
[14:56:48] jhathaway: I've not done it before tbh so I'll need to investigate. Shouldn't be too tricky I guess.
A remote capture might be harder but probably if we get a test host connected to the same switch, mirror to that and do tcpdump on it we can see
[14:56:57] I'm just about to jump on a call I'll investigate after
[14:57:32] no rush at, happy to chat about other options when you have a moment
[14:58:08] s/at/at all/
[15:24:48] netbox, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11194244 (Papaul) {F66055737}
[15:24:54] netbox, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11194245 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6fac31b1-92f6-4bf9-bf95-d9862483e9b6) set by cmooney@cumin1003 f...
[15:31:11] FIRING: PfwCoreBGPDown: ...
[15:31:17] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.200) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[15:46:20] RESOLVED: PfwCoreBGPDown: ...
[15:46:25] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.200) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[16:04:09] netbox, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11194406 (Papaul) out put of todays' troubleshooting Last login: Tue May 20 13:04:15 on ttyu0 --- JUNOS 23.4R2.13 Kernel 64-bit JNPR-12...
[16:12:04] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[16:16:09] Mail, Infrastructure-Foundations, SRE Observability, Epic: Parse DMARC reports and create a dashboard from data - https://phabricator.wikimedia.org/T404888#11194504 (jhathaway)
[16:33:08] Mail, Infrastructure-Foundations, SRE Observability, Epic: Parse DMARC reports and create a dashboard from data - https://phabricator.wikimedia.org/T404888#11194606 (jhathaway) Observability, Would it be acceptable to store the data from the parsed DMARC reports in OpenSearch? My initial estimat...
[16:41:17] jhathaway: what host is it you might need to do the port mirror for?
[16:41:28] I don't have a super-micro account so I can't actually see the ticket
[16:41:48] sretest2001
[16:42:52] There is a smidge more info in https://phabricator.wikimedia.org/T383173
[16:43:08] happy to get you a supermicro account, or forward the thread to you
[16:44:34] I sent you the *humongous* thread, no need to read it, but if you want to search for anything in it
[16:45:09] I'm open to other ideas, or feel free to say no, just trying to figure out the best way to exonerate our network stack