[05:52:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:57:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:47:56] (HAProxyEdgeTrafficDrop) firing: 61% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:57:56] (HAProxyEdgeTrafficDrop) resolved: 53% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:58:16] (VarnishTrafficDrop) firing: Varnish traffic in esams has dropped 62.70058743580331% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[07:58:56] (HAProxyEdgeTrafficDrop) firing: 58% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:03:16] (VarnishTrafficDrop) firing: Varnish traffic in esams has dropped 18.2759318653131% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[08:28:56] (HAProxyEdgeTrafficDrop) resolved: 2% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:33:05] (PurgedHighEventLag) firing: (6) High event process lag with purged on cp3050:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[08:33:16] (VarnishTrafficDrop) resolved: (2) Varnish traffic in esams has dropped 1.4280236772736803% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[09:01:33] (PyBalBGPUnstable) firing: (6) PyBal BGP sessions on instance lvs3005 are failing - TODO - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[09:01:38] good day, FYI I've modified https://grafana.wikimedia.org/d/000000464/varnish-aggregate-client-status-code to select also drmrs by default, it was not selected and could create confusion
[09:01:58] if you have other dashboards with a similar filter it might be worth checking them ;)
[09:09:01] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (cmooney) Reboot completed successfully, currently router not showing any alarms: ` root@re0.cr2-esams> show system alarms No alarms currently active `...
[09:19:56] (HAProxyEdgeTrafficDrop) resolved: 57% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:20:00] (PyBalBGPUnstable) firing: (3) PyBal BGP sessions on instance lvs3005 are failing - TODO - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[09:24:11] (PyBalBGPUnstable) resolved: (3) PyBal BGP sessions on instance lvs3005 are failing - TODO - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[09:24:25] (HAProxyEdgeTrafficDrop) firing: 61% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:25:28] (HAProxyEdgeTrafficDrop) resolved: 51% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:26:34] (VarnishTrafficDrop) firing: Varnish traffic in drmrs has dropped 69.1325390550757% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[09:26:42] (HAProxyEdgeTrafficDrop) firing: 63% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:30:16] (VarnishTrafficDrop) firing: Varnish traffic in drmrs has dropped 45.802439588325996% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[09:30:56] (HAProxyEdgeTrafficDrop) firing: (2) 41% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:31:21] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero) >>! In T319184#8288137, @cmooney wrote: > [..] > Anyway thought I'd mention just in case you weren't aware. Thanks, double checking this now....
[10:00:16] (VarnishTrafficDrop) resolved: (2) Varnish traffic in drmrs has dropped 57.77685043500769% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[10:00:56] (HAProxyEdgeTrafficDrop) resolved: 64% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:16:49] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (ayounsi) Plan of action: General overview before/after. Red: deactivated/removed. Green: activated/added. {F35550079} We're...
[10:37:42] Traffic, SRE, conftool, Patch-For-Review, Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794 (jbond) While implementing the [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/768723/31/modules/varnish/templates/upload-fr...
[12:39:14] netops, Infrastructure-Foundations, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (aborrero)
[12:53:26] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (aborrero)
[12:59:43] Traffic, Analytics-Radar, SRE, Patch-For-Review: Consider adding X-Analytics subfield for 'has a session cookie' - https://phabricator.wikimedia.org/T319324 (Vgutierrez)
[13:14:47] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (cmooney) I don't think it's true to say the VRRP is over VXLAN here, the VRRP...
[13:36:38] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (Jclark-ctr) cableid c220756659 fpc2 - fpc8.
[14:11:33] netops, Infrastructure-Foundations, SRE, Puppet, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (aborrero)
[14:26:04] Traffic, DC-Ops, Infrastructure-Foundations, SRE, ops-ulsfo: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (BBlack) So, we have a need to move on this pretty quickly, as we have 16 new cache hosts in ulsfo pending installs on this, and then 16 more in eqsin righ...
[14:29:20] Traffic, DC-Ops, Infrastructure-Foundations, SRE, ops-ulsfo: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (MoritzMuehlenhoff) I'll take care of "Create a buster-based 4.19+5.10 boot image " tomorrow.
[14:30:56] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (cmooney) Ok yeah I see what is going on. Cloudnet1005 is running VXLAN over U...
[14:31:10] Traffic, DC-Ops, Infrastructure-Foundations, SRE, ops-ulsfo: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (BBlack) >>! In T319067#8290850, @MoritzMuehlenhoff wrote: > I'll take care of "Create a buster-based 4.19+5.10 boot image " tomorrow. Thank you!
[15:04:16] (VarnishTrafficDrop) firing: Varnish traffic in eqiad has dropped 65.58152932710347% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[15:05:56] (HAProxyEdgeTrafficDrop) firing: 52% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:09:16] (VarnishTrafficDrop) firing: Varnish traffic in eqiad has dropped 29.8698876527127% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[15:17:20] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi) Row C got moved to the new linecards with no issues, but moving cr1<->row D caused an outage. As row C cleanup, @Jclark-ctr can you rem...
[15:19:16] (VarnishTrafficDrop) resolved: (2) Varnish traffic in eqiad has dropped 55.93238599822019% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[15:20:56] (HAProxyEdgeTrafficDrop) resolved: 63% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:29:12] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (cmooney) @aborrero thanks. Reading briefly through the docs I have a better u...
[15:32:56] (HAProxyEdgeTrafficDrop) firing: 49% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:36:11] (HAProxyEdgeTrafficDrop) firing: (2) 37% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:36:20] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (aborrero) >>! In T319539#8291916, @cmooney wrote: > I gather the hypervisor ho...
[15:38:41] Traffic, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (Volans) > * Add support for it (it being whatever it takes to switch to 5.10) to the reimage cookbook stuff @BBlack the above patch should have all that's...
[15:40:16] (VarnishTrafficDrop) firing: Varnish traffic in codfw has dropped 25.602016599255517% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[15:45:16] (VarnishTrafficDrop) resolved: (2) Varnish traffic in codfw has dropped 25.602016599255517% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[15:45:18] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (cmooney) > But we do have keepalived running on cloudgw servers. So we may wan...
[15:47:16] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[15:51:11] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:05:21] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (ayounsi) This has been completed smoothly! I deleted the following VC cables from Netbox: 0315 0316 0317 0318 0320 Please...
[16:07:00] netops, Infrastructure-Foundations, SRE, SRE-OnFire, and 2 others: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (ayounsi) Open→Resolved a:ayounsi Sub-task completed successfully nothing more to do here.
[16:27:44] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi)
[16:45:57] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi) Also looks like the optic or fiber needs to be replaced, error rate is high: https://librenms.wikimedia.org/device/device=162/tab=port/p...
[16:50:13] netops, Infrastructure-Foundations, SRE, Patch-For-Review: IPv6 BFD Sessions Failing from Bird (Anycast VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501 (cmooney) Diff if the above patch is merged (running from my laptop with updated template): ` Changes for 8 devices: ['c...
[18:35:59] volans: Thanks for doing that!
[18:47:51] Traffic, DC-Ops, SRE, ops-ulsfo: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (RobH) dns4003 appears to be pushed fully into service (thanks @ssingh!) With that now seeming all green in icinga & confirmed with @BBlack , I'll move ahead and take down/decom dns4002 next tim...
[18:49:37] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (Jclark-ctr) Can this be changed at any time? I will work on netbox updates when not in data center
[22:18:17] brett: anytime :)