[07:12:57] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5001.eqsin.wmnet with OS buster
[07:20:56] (EdgeTrafficDrop) firing: 32% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[07:25:56] (EdgeTrafficDrop) resolved: 62% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[07:26:56] (EdgeTrafficDrop) firing: 63% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[07:28:47] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4035.ulsfo.wmnet with OS buster
[07:31:56] (EdgeTrafficDrop) firing: (2) 62% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[07:56:56] (EdgeTrafficDrop) resolved: 67% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[08:20:48] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4035.ulsfo.wmnet with OS buster com...
[08:21:44] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5001.eqsin.wmnet with OS buster com...
[08:21:52] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5001.eqsin.wmnet with OS buster exe...
[09:30:39] Traffic, netops, Infrastructure-Foundations: 2022-04-06 esams saturation leading to traffic outage/slowdown (increased latency) - https://phabricator.wikimedia.org/T305532 (jcrespo)
[09:56:57] (EdgeTrafficDrop) firing: 58% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[09:58:31] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3052.esams.wmnet with OS buster
[09:59:38] godog: what's the merge/deployment procedure on operations/alerts repo?
[09:59:48] I want to go forward with https://gerrit.wikimedia.org/r/c/operations/alerts/+/776890
[10:01:00] ^^^ expected, currently reimaging cp3052
[10:01:23] vgutierrez: merge the patch, and it will be deployed at the next puppet run on prometheus hosts
[10:01:37] awesome :) thanks
[10:01:44] sure np!
[10:10:23] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4027.ulsfo.wmnet with OS buster
[10:11:57] (EdgeTrafficDrop) firing: (2) 58% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[10:16:57] (EdgeTrafficDrop) firing: (2) 61% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[10:26:57] (EdgeTrafficDrop) resolved: (2) 66% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[11:01:42] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3052.esams.wmnet with OS buster com...
[11:12:14] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4027.ulsfo.wmnet with OS buster com...
[11:32:43] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4033.ulsfo.wmnet with OS buster
[12:17:16] hello, I've a question regarding pybal's puppetization
[12:17:34] in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/wmflib/types/service/lvs.pp#10 it says that bgp defaults to true
[12:18:10] but in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/pybal/templates/pybal.conf.erb#59 it seems it doesn't have a default
[12:18:40] except for the global one on line 5, that is unrelated
[12:19:22] so I was wondering if the comment is not correct or there is some other layer of indirection that sets the default
[12:19:49] or it just means that each service inherits the global bgp value unless overridden
[12:22:12] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4033.ulsfo.wmnet with OS buster com...
[13:07:01] volans: yeah the daemon itself defaults the config item "bgp" to true if it's not present at all, that's the real default. But I agree, the way it's worded in the puppet repo is unclear (sounds like a puppet-level default)
[13:07:49] ok, thanks bblack! I'll add a note there, I'm modifying those files' comments anyway
[13:08:10] so it doesn't inherit the global bgp value from line 5, correct?
[13:09:18] oh it wasn't clear to me the two links were to two different layers of "bgp"
[13:09:52] but either way, I don't think it defaults through in the puppet world, just within pybal itself, with both of them being true-by-default (the global being false will override any service-level True as well)
[13:10:06] yeah I meant via pybal
[13:10:13] the bgp in lvs.pp is at service level
[13:10:28] in pybal.conf.erb at line 59 is the service-level one
[13:10:42] right
[13:10:43] but there is another one at line 5 that comes from the higher level pybal's hiera
[13:11:00] and I think enables/disables bgp globally in pybal
[13:11:07] my question can be rephrased to:
[13:11:14] yes
[13:11:25] if a service doesn't have the bgp key, pybal would use the global value above?
[13:11:52] well, it's not a default in that strict sense, but operationally that's how it is from a black box perspective
[13:11:52] (that defaults to no in puppet's ERB but is set to yes in hiera, to avoid confusion :D )
[13:12:13] the global bgp key, if set to false (default true), disables bgp for the whole daemon, regardless of what any individual lvs service config says.
[13:12:34] if the global bgp key isn't false, then the per-lvs-service bgp keys matter, and they also default true.
[13:12:57] ok perfect that was the bit that was not clear to me
[13:12:59] thanks a lot
[13:14:31] the reason for the ERB vs hiera thing (no/yes) is that any lvs we bring up that's not (yet, or ever) in production should have a false here to avoid interfering with live router<->lvs traffic.
[13:14:46] so the ERB default is appropriate if we create an alternate lvs profile like lvs::testing::something::new
[13:15:25] (or we could vary that hiera per-host when rolling out replacement LVS hardware in a DC, to control which are talking to routers through the transition process)
[13:15:37] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4021.ulsfo.wmnet with OS buster
[13:19:17] (which, I admit, is kind of a dark pattern from the POV of refactoring/cleanup and repo-wide things. When you have some kind of data variance that's important but only occasionally used and there's no live examples of it, it's easy for the next person to come along and re-design everything and remove that axis of configuration entirely)
[13:19:40] this has happened multiple times, and specifically with lvs-related things, I think :)
[13:39:15] eheheh indeed
[14:01:30] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4021.ulsfo.wmnet with OS buster com...
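Editor's note: the bgp semantics described in the discussion above (a global key that can switch BGP off for the whole daemon, per-service keys that only matter when the global one is not false, both defaulting to true inside pybal, and a puppet template default of "no" that hiera overrides to "yes") can be sketched roughly as follows. This is a minimal illustration of the described behaviour, not pybal's actual code; the function names, dict-based configs and boolean values are assumptions made for the example.

```python
from typing import Mapping, Optional


def render_global_bgp(hiera_value: Optional[bool] = None) -> bool:
    """Hypothetical stand-in for the puppet template layer.

    Per the discussion, the ERB template falls back to "no" when hiera
    provides nothing, while production hiera sets it to "yes".
    """
    return False if hiera_value is None else hiera_value


def effective_bgp(global_conf: Mapping[str, bool],
                  service_conf: Mapping[str, bool]) -> bool:
    """Hypothetical model of the daemon-side behaviour described above."""
    # Both keys default to True inside the daemon when absent.
    global_bgp = global_conf.get("bgp", True)
    service_bgp = service_conf.get("bgp", True)
    # A false global key disables BGP regardless of the service setting;
    # otherwise the per-service key decides.
    return global_bgp and service_bgp


if __name__ == "__main__":
    # Production-like host: hiera sets the global key to yes.
    prod_global = {"bgp": render_global_bgp(True)}
    # Test profile without hiera data: the template default (no) applies.
    test_global = {"bgp": render_global_bgp()}

    print(effective_bgp(prod_global, {}))              # service key absent
    print(effective_bgp(prod_global, {"bgp": False}))  # service opts out
    print(effective_bgp(test_global, {"bgp": True}))   # global off wins
```

From a black-box point of view this matches the "service without a bgp key follows the global value" reading, even though no value is actually inherited: the service key is simply true unless set otherwise, and the global key acts as an override.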
[16:59:56] (HAProxyEdgeTrafficDrop) firing: 56% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:14:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:16:56] (HAProxyEdgeTrafficDrop) firing: 67% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:26:56] (HAProxyEdgeTrafficDrop) resolved: 64% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:57:21] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Review filtering for cloud-hosts on CR routers eqiad - https://phabricator.wikimedia.org/T285461 (cmooney) Unfortunately the uRPF exception command is not supported on the QFX platform, which means configuring it on top-of-rac...
[18:11:56] (HAProxyEdgeTrafficDrop) firing: 55% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[18:56:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[19:06:31] netops, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (Volans) If I may add my use case too, I would like to be able to restrict the access to the webproxies from the cumin h...
[20:27:18] Traffic, SRE: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (ssingh)
[20:58:54] Traffic, SRE: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (ssingh)
[20:59:00] Traffic, SRE, Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (ssingh)
[22:26:05] Traffic, SRE: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (Dzahn) My 2 cents: cookbook not worth it in this case, likely more work to create and debug it than the actual time savings with installs because it will just happen like once every 2 years or less a...