[06:37:57] (EdgeTrafficDrop) firing: 63% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[07:07:57] (EdgeTrafficDrop) resolved: 65% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[08:25:27] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3056.esams.wmnet with OS buster
[09:02:15] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4029.ulsfo.wmnet with OS buster
[09:27:06] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3056.esams.wmnet with OS buster com...
[09:43:09] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4029.ulsfo.wmnet with OS buster com...
[10:08:05] FYI, there is an active icinga LVS alert about miscweb: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs2010&service=PyBal+IPVS+diff+check
[10:09:12] that service is gone, jayme ^^
[10:09:42] but it doesn't seem it agrees to get killed, is fighting back since yesterday ;)
[10:09:54] hm...
[10:09:56] looking
[10:10:15] I was pretty sure that I removed it
[10:11:16] ofc. I failed to give the port to ipvsadm - sorry
[10:13:18] {{done}}
[10:13:46] thx
[10:14:23] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3056.esams.wmnet with OS buster
[11:08:27] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3056.esams.wmnet with OS buster com...
[11:54:57] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3057.esams.wmnet with OS buster
[12:13:06] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4023.ulsfo.wmnet with OS buster
[12:47:16] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (ayounsi) Open→Resolved DoH is advertised from drmrs, I'll leave it to Traffic to decide about the anycast NS.
[12:49:55] XioNoX: eh sorry! I should have updated the ticket
[12:51:01] sukhe: no pb :)
[12:51:04] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3057.esams.wmnet with OS buster com...
[12:54:07] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4023.ulsfo.wmnet with OS buster com...
[13:51:03] netops, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (ayounsi) On that second part, we discussed it within Infrastructure Foundation. With the webproxies (and url-downloade...
[14:04:17] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5009.eqsin.wmnet with OS buster
[14:06:00] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5009.eqsin.wmnet with OS buster exe...
[14:09:57] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5009.eqsin.wmnet with OS buster
[14:25:12] netops, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (Isaac) Chiming in as a heavy user of the stat boxes. It's difficult for me to follow this conversation so I'm mainly as...
[14:57:39] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6016.drmrs.wmnet with OS buster
[15:09:19] netops, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (Ottomata) > malware accidentally downloaded (compromised library dependency, infected executable, etc) could easily "ph...
[15:10:27] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5009.eqsin.wmnet with OS buster com...
[15:36:05] netops, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (BTullis) >>! In T300977#7821926, @Ottomata wrote: > > I appreciate the intention here, but I'm not sure if the combo o...
[15:40:20] netops, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (Ottomata) > its primary goal is limiting the capability of any such malware to 'phone home' to a command & control endp...
[15:42:07] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6016.drmrs.wmnet with OS buster com...
[16:31:58] XioNoX: if you're around - rebooting dns3002 (with dns3001 still alive and fine) seems to have caused loss of ns2 authdns service in esams.
[16:32:15] not a huge deal for these short windows, but in the cr2-esams router config, I know we had it going to both dns3001+dns3002
[16:32:59] I guess we never really tied that to bird, it's just a static hashing to two destination IPs, so makes sense. It's probably only missing from half the world's IPs, basically.
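Note on the [16:32:59] remark: a minimal sketch of why a static two-way hash with no health check would lose roughly half of the source IPs while one destination reboots. The SHA-256 hash, the synthetic 10.0.x.x client addresses, and the function names are assumptions made for illustration; the actual hashing done by the cr2-esams router is not shown in this log.

```python
import hashlib
import ipaddress

# The two authdns destinations named in the log; used here only as labels.
BACKENDS = ["dns3001", "dns3002"]


def pick_backend(src_ip: str, backends=BACKENDS) -> str:
    """Statically map a source IP to one backend, with no health check.

    SHA-256 is a stand-in for whatever hash the router really applies.
    """
    digest = hashlib.sha256(ipaddress.ip_address(src_ip).packed).digest()
    return backends[digest[0] % len(backends)]


def fraction_lost(sources, down: str) -> float:
    """Share of sources whose hash lands on the backend that is down."""
    hits = sum(1 for ip in sources if pick_backend(ip) == down)
    return hits / len(sources)


# Synthetic client addresses, not real traffic.
sources = [f"10.0.{i // 256}.{i % 256}" for i in range(4096)]

if __name__ == "__main__":
    share = fraction_lost(sources, "dns3002")
    print(f"{share:.0%} of sources still hash to the rebooting host")
```

Because the mapping never consults backend health, sources that hash to dns3002 keep being sent there during the reboot, while the rest are served by dns3001 as usual.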
[16:35:11] so yeah I guess I don't have a real question
[16:35:15] it makes sense :)
[18:57:57] (EdgeTrafficDrop) firing: 59% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[19:00:58] ^ seems to be spurious?
[19:01:34] oh I see now
[19:01:42] it must be tracking ats-tls and not haproxy heh
[19:01:44] https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text
[19:22:57] (EdgeTrafficDrop) firing: (2) 62% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[19:24:58] so basically, edgetrafficdrop is now even more-useless than ever heh
[19:27:57] (EdgeTrafficDrop) firing: (2) 62% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[19:28:45] lol
[19:37:57] (EdgeTrafficDrop) firing: (2) 64% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[19:47:57] (EdgeTrafficDrop) firing: (2) 69% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[20:07:57] (EdgeTrafficDrop) resolved: 0% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[21:00:46] Traffic, SRE, ops-eqsin: SMART error (CurrentPendingSector) detected on host: cp5004 - https://phabricator.wikimedia.org/T303043 (BBlack) Open→Resolved This seems to have resolved itself. There's no current SMART error, and all disks seem present and working at a glance. We can revisit if i...
[21:12:21] bblack: I've found something weird when testing the new roll-restart-varnish cookbook in dry-run mode, it seems that the upload hosts don't have the varnish_main_threads_limited
[21:12:41] is that known/expected? because the assumption for the cookbook was that we had that metric for all varnishes
[21:14:53] volans: we are currently manually repooling the hosts so looking there but thanks, will check
[21:16:04] don't worry about debugging this today fwiw :)
[21:16:11] sukhe: I'm not sure I get it, are you just meaning that you're busy or the repooling might have something to do with this?
[21:16:34] sorry, I meant we are busy with that so will look at what you wrote after finishing that
[21:16:42] (just wanted to ack what you wrote)
[21:16:50] sure, no hurry at all, thanks
[21:17:30] AFAICT from grafana explore tab that metric is present only for text hosts
[21:25:32] ok, nevermind, that was a total pebcak on my side
[21:25:56] I'm sending the fix with a very loud :facepalm: :D
[21:27:33] hmmm
[21:27:45] I just checked varnishstat and it's called MAIN.threads_limited on both there, at that layer
[21:27:48] not sure about prom
[21:28:00] don't waste time on this, the fix will make you laugh
[21:28:04] ok :)
[21:28:43] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/775968
[21:29:00] * volans hides behind a very large double :facepalm:
[21:29:04] lol
[21:29:38] sometimes things work out exactly as you coded them...
[22:02:47] ok, with the current state all seems to be good to go for live testing for the sre.cdn.roll-restart-varnish cookbook
[22:03:01] I'm off tomorrow but in case you want to proceed with live-testing feel free
[22:03:34] if not I'll ping back on Monday to choose a host where to test it
[23:07:57] (EdgeTrafficDrop) firing: 57% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[23:57:57] (EdgeTrafficDrop) resolved: 69% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop