[06:56:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:01:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[08:35:36] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (MoritzMuehlenhoff) >>! In T295767#7529602, @ayounsi wrote: > All 3 VMs got rebuilt with larger disks, but with the default Debian Buster. > > @MoritzMuehlenh...
[08:59:17] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2041.codfw.wmnet with OS buster
[09:05:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2041:9331 is unreachable - https://alerts.wikimedia.org
[09:11:33] (host being reimaged)
[09:15:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2041:9331 is unreachable - https://alerts.wikimedia.org
[09:53:06] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2041.codfw.wmnet with OS buster c...
[09:57:56] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez)
[10:02:55] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3064.esams.wmnet with OS buster
[10:09:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3064:9331 is unreachable - https://alerts.wikimedia.org
[10:14:39] (host being reimaged)
[10:18:27] vgutierrez: o/ qq about atskafka - what is the plan for haproxy? Does it work out of the box?
[10:18:52] no plans yet TBH
[10:19:04] as we haven't picked between envoy, haproxy or even stay with ats :)
[10:19:46] ack makes sense, if we keep varnishkafka for longer time it may be worth to stop the actual test on cp3050 and simplify things for yoy
[10:19:49] *you
[10:19:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3064:9331 is unreachable - https://alerts.wikimedia.org
[10:20:11] that's e.m.a's land
[10:20:18] (he is currently on vacation)
[10:20:48] yep yep I know, we can take a decision when he's back
[10:45:11] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3064.esams.wmnet with OS buster e...
[10:46:02] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3064.esams.wmnet with OS buster
[11:13:16] netops, Infrastructure-Foundations: Enable NTP for drmrs network devices - https://phabricator.wikimedia.org/T296623 (cmooney) p:Triage→Low
[11:21:37] netops, Infrastructure-Foundations: Enable NTP for drmrs network devices - https://phabricator.wikimedia.org/T296623 (cmooney) Ok so on the switches I can see requests hitting the dns servers and they are responding: ` cmooney@dns1001:~$ sudo tcpdump -i ens2f0np0 -l -p -nn host 10.136.128.4 tcpdump: ver...
[11:22:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3064:9331 is unreachable - https://alerts.wikimedia.org
[12:00:18] netops, Infrastructure-Foundations, SRE: Enable NTP for drmrs network devices - https://phabricator.wikimedia.org/T296623 (cmooney) Ok yes it seems to be the loopback filter alright, testing the change on asw1-b13-drmrs adding a new term as advised in the KB article fixed it: ` cmooney@asw1-b13-drmrs...
[12:02:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3064:9331 is unreachable - https://alerts.wikimedia.org
[12:09:58] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3064.esams.wmnet with OS buster c...
[13:51:18] Traffic, SRE, observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (Gehel)
[13:57:10] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Enable NTP for drmrs network devices - https://phabricator.wikimedia.org/T296623 (cmooney) ^^ apologies ignore above used incorrect task ref.
[13:59:57] netops, Infrastructure-Foundations, SRE, Patch-For-Review, Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (cmooney) ^^ ignore above - pasted wrong task ID. and sorry for spam.
[15:20:19] netops, Infrastructure-Foundations, SRE, Traffic-Icebox, Patch-For-Review: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (cmooney) Seems like a sane proposal. The use of sflow and a different pipeline will keep a clean separation between it and data fr...
[18:18:19] Traffic, DNS, SRE, WMF-Communications: Setup subdomain for Foundation messaging site - https://phabricator.wikimedia.org/T296570 (Varnent) We will be sharing this site with all staff around December 1. Domain is not necessary per se as we have a temporary domain - but do we have a sense of when i...
[18:18:56] (EdgeTrafficDrop) firing: 33% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org
[18:48:57] (EdgeTrafficDrop) resolved: 58% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org
[18:57:56] (EdgeTrafficDrop) firing: (2) 37% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org
[19:02:56] (EdgeTrafficDrop) resolved: (2) 37% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org
[19:21:26] (EdgeTrafficDrop) firing: (2) 44% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org
[19:26:26] (EdgeTrafficDrop) firing: (2) 47% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org
[19:36:26] (EdgeTrafficDrop) firing: (2) 38% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org
[19:41:26] (EdgeTrafficDrop) resolved: (2) 60% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org
[20:56:51] I'm going to start doing firmware updates on cp6* systems in drmrs. My understanding is while they are online and calling into icinga, they aren't serving anything so i can just roll through them at will
[20:57:00] only dns require one to stay online at all time
[20:59:09] first a test update on cp6001, and if goes fine batches of 5 or so.
[22:30:38] ok, all the drmrs cp hosts have newest firmware and are back calling into puppet