[05:49:56] (HAProxyEdgeTrafficDrop) firing: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:59:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:28:51] great work folks for ATS9
[06:28:53] <3
[06:59:56] (HAProxyEdgeTrafficDrop) firing: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:09:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:12:56] (HAProxyEdgeTrafficDrop) firing: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:22:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:23:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:28:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:39:46] Traffic, SRE: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 (hnowlan) Late responding on this one but thanks a lot for adding this feature!
[09:23:42] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[09:24:25] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero) The problem described on {T319300} may block work on some servers, but we have plenty of others to migrate, so we should have enough work to do.
[09:33:10] netops, Cloud-Services, Infrastructure-Foundations, SRE: Routing loop for unused WMCS IPs in 185.15.56.0/24 - https://phabricator.wikimedia.org/T315956 (cmooney) Open→Resolved
[09:34:35] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[09:35:52] netops, Infrastructure-Foundations, SRE: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (cmooney) a:cmooney Thanks @ayounsi. Yeah 9216 was default max I had used for the VXLAN stuff originally, but 9192 is more than enough to support a 9,000 byte IP packet and allow for the VXLA...
[09:36:51] netops, Infrastructure-Foundations, SRE: Validate new (anycast) IPv6 /48 announcement being accepted by transits - https://phabricator.wikimedia.org/T301900 (cmooney) Open→Resolved Thanks @ayounsi. I didn't finish checking every single one but it was accepted by all our major transits and is...
[10:27:48] Traffic, SRE, Upstream: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (Vgutierrez) ATS is supposed to perform a cache_sync_dir every 60 seconds per the undocumented config setting `proxy.config.cache.dir.sync_...
[10:33:01] Traffic, SRE, Performance-Team (Radar), Upstream: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (Vgutierrez)
[10:39:16] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2036:9331 is unreachable - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[10:47:12] Traffic: cp2036 crashed on 2022-10-05 - https://phabricator.wikimedia.org/T319394 (Vgutierrez)
[10:52:04] Traffic: cp2036 crashed on 2022-10-05 - https://phabricator.wikimedia.org/T319394 (Vgutierrez) `------------------------------------------------------------------------------- Record: 35 Date/Time: 10/05/2022 10:35:19 Source: system Severity: Ok Description: A problem was detected related to t...
[10:59:16] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2036:9331 is unreachable - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[11:02:19] Traffic: cp2036 crashed on 2022-10-05 - https://phabricator.wikimedia.org/T319394 (Vgutierrez) Open→Resolved a:Vgutierrez cp2036 came back to life after a powercycle, system looks good. I'll reopen the task if this happens again
[11:06:55] Traffic, DC-Ops, SRE, ops-eqiad, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (Vgutierrez) @wiki_willy could you help us prioritizing the remaining work on eqiad? this needs to be fixed ASAP
[11:32:26] netops, Ganeti, Infrastructure-Foundations: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi)
[12:43:06] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[14:15:05] Hello traffic :) I'm going through the SRE training checklist, and I'm falling a little short on finding info about edge traffic. Would one of y'all be available sometime to give me a rundown?
[14:16:51] can we assume you have already watched the related onboarding chats? :)
[14:17:19] Depends on which onboarding chat :p Let me make sure
[14:18:27] 00, 01, 15
[14:18:54] although 15 might be a bit outdated, I don't recall
[14:19:13] even outdated ones are probably pretty instructive in the big picture
[14:19:51] 00, 01 have been watched
[14:19:58] 15 not yet
[14:20:36] then as a reference (and not by any means to replace 1:1 interactions) all the traffic stack starts more or less here
[14:20:39] https://wikitech.wikimedia.org/wiki/Global_traffic_routing
[14:20:54] and the right menu walks you a bit around
[14:21:02] btw, I've not yet seen any scheduled Q&A session
[14:21:07] for you and the other new hires...
[14:21:15] you should ask your onboarding buddy ;) :D
[15:18:37] netops, Infrastructure-Foundations, SRE: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (Papaul)
[15:18:56] netops, Infrastructure-Foundations, SRE: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (Papaul)
[15:19:05] netops, Infrastructure-Foundations, SRE: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (Papaul) Open→Resolved @ayounsi this is complete
[15:51:15] Traffic, SRE, Performance-Team (Radar), Upstream: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (Vgutierrez) cache_dir_sync issues reported to upstream in https://github.com/apache/trafficserver/issues/9124
[16:20:26] claime: Any missing info that you glean from the videos is welcome in the wiki! That's not to say that it's your responsibility, just that it's welcome if you feel the urge :)
[16:20:52] brett: ack :)
[17:05:06] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) the 40G port in cr1-eqiad to connect asw-c2 and asw-d2 are ready ` papaul@re0.cr1-eqiad> show interfaces terse | match et-1/1/ et-1/1/0...
[17:22:16] (VarnishTrafficDrop) firing: Varnish traffic in eqiad has dropped 9.022832290219943% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[17:22:56] (HAProxyEdgeTrafficDrop) firing: 49% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:27:16] (VarnishTrafficDrop) resolved: (2) Varnish traffic in eqiad has dropped 47.587513091386846% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[17:27:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:43:36] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4003.wikimedia.org with OS buster
[17:47:03] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (cmooney) I will reboot this tomorrow morning, Oct 6th at 08:00 and we can take it from there.
[18:17:20] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (cmooney) @aborrero Just something I noticed, you may already be aware in which case ignore. I was testing out an updated puppet to netbox import script...
[18:17:30] Traffic, DC-Ops, SRE, ops-eqiad, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (Jclark-ctr) @Vgutierrez Would these 2 changes work for what is needed? If not we would have to order replacement cables longer lengths to r...
[18:31:11] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4003.wikimedia.org with OS buster completed: - dns4003 (...
[18:43:45] Traffic, DC-Ops, SRE, ops-eqiad, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (Vgutierrez) @jclark-ctr as long as both lvs1017 and lvs1020 don't get connectivity from the same switch on a single row is ok. So those look...