[01:29:57] (EdgeTrafficDrop) firing: 66% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[01:34:57] (EdgeTrafficDrop) resolved: 66% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[01:53:57] (EdgeTrafficDrop) firing: 57% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[01:58:57] (EdgeTrafficDrop) resolved: 53% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[08:31:18] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5003.eqsin.wmnet with OS buster
[08:42:22] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6008.drmrs.wmnet with OS buster
[09:25:42] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5003.eqsin.wmnet with OS buster com...
[09:26:56] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6008.drmrs.wmnet with OS buster com...
[09:27:57] (EdgeTrafficDrop) firing: 67% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[09:34:28] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Peter) Hi @vgutierrez the performance team continuously runs synthetic tests where we test the performance of a couple of Wikipedia p...
[09:37:56] (EdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[09:40:57] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez) >>! In T290005#7828034, @Peter wrote: > Hi @vgutierrez the performance team continuously runs synthetic tests where we te...
[09:49:56] (EdgeTrafficDrop) firing: 63% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[09:54:56] (EdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[10:00:33] Traffic, Prod-Kubernetes, serviceops, Kubernetes: service:.catalog entries and dnsdisc for Kubernetes sevrices under Ingress - https://phabricator.wikimedia.org/T305358 (JMeybohm)
[10:03:12] Traffic, Prod-Kubernetes, serviceops, Kubernetes: service:.catalog entries and dnsdisc for Kubernetes sevrices under Ingress - https://phabricator.wikimedia.org/T305358 (JMeybohm) p:Triage→High
[10:44:57] (EdgeTrafficDrop) firing: 67% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[10:49:56] (EdgeTrafficDrop) resolved: 66% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[11:05:08] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3054.esams.wmnet with OS buster
[11:20:57] (EdgeTrafficDrop) firing: 56% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[11:20:58] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4028.ulsfo.wmnet with OS buster
[11:35:56] (EdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[11:39:57] (EdgeTrafficDrop) firing: 59% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[11:49:56] (EdgeTrafficDrop) resolved: 65% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[12:01:19] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3054.esams.wmnet with OS buster com...
[12:05:05] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4028.ulsfo.wmnet with OS buster com...
[12:05:56] (EdgeTrafficDrop) firing: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[12:10:56] (EdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[12:38:52] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3055.esams.wmnet with OS buster
[12:49:03] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4022.ulsfo.wmnet with OS buster
[13:31:14] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4022.ulsfo.wmnet with OS buster com...
[13:34:09] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3055.esams.wmnet with OS buster com...
[14:08:46] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5008.eqsin.wmnet with OS buster
[14:24:49] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6015.drmrs.wmnet with OS buster
[15:03:37] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5008.eqsin.wmnet with OS buster com...
[15:07:34] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6015.drmrs.wmnet with OS buster com...
[15:21:11] hello, I'm back for the live-testing of the roll-restart-varnish cookbook...
[15:21:29] which host could we use for that? maybe one of those newly reimaged ones that already has an empty cache?
[15:22:40] hi volans, mmandere is taking care of the reimages so he could point you to a suitable host
[15:22:56] volans: probably the best way in terms of impact, would be to hit an underutilized datacenter which isn't having reimages lately.
[15:23:13] codfw is already fully-converted and tends to run light on load, so there's almost no impact from testing on one instance there
[15:24:07] of course if you want to take over for the testing as you know better what to check... feel free ;)
[15:27:58] yeah
[15:28:10] plus it's a better review, to have someone who didn't author it do it :)
[15:32:08] indeed
[15:32:13] example usage at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/cdn/roll-restart-varnish.py#8
[15:34:52] yeah I'm playing with it now, in dry-run mode, testing some arguments and things to see if I get it all
[15:36:20] ack, I'm here if I can be of any assistance
[15:38:01] volans: when in cookbook dry-run mode, does it really pull the threads_limited value or does it just always fail the threshold?
[15:38:14] I can't artificially create the conditions to know and I'm being lazy about reading the code
[15:38:23] it does pull it from prometheus for real
[15:38:31] the irate is usually at 0 most of the time
[15:38:36] you can trigger it passing -1
[15:38:52] ah, thanks
[15:39:11] let me check if there is any host with a positive value
[15:39:51] not currently I don't think
[15:39:55] even zero didn't give me anything
[15:40:14] because it does float(metric['value'][1]) > self._args.threads_limited
[15:40:29] we can do >= if preferred :)
[15:41:03] eh, doesn't much matter in practice
[15:41:57] yeah a typical bad value AFAIUI is like 100k
[15:42:04] so I didn't care much about the > vs >=
[15:42:05] one piece of feedback I have, is probably the default value of --grace-sleep should be higher. 60 seconds or so, just to give a little more overlap for the last one/batch to get some initial traffic burst going before pulling the next.
[15:42:27] make people that feel they need and are able to do it faster for some case be explicit about it
[15:43:24] so currently it would sleep (in dry-run sleeps are skipped but logged) 20s after depool before restart and 15s after restart before repool
[15:44:18] the default for grace_sleep is defined in the upper layer of SREBatchBase, let me check the code
[15:44:23] right, but that leaves only the 1s grace (plus some small runtime to execute) between "repool of previous host/batch" and "depool of next host/batch", which is not much for everything else to react and start getting traffic into it and get through its initial inrush
[15:45:10] there's probably no correct value for all clusters, or even all situations when dealing with one cluster
[15:45:23] yep, let me make it customizable
[15:45:24] I'm just arguing for a more-conservative default, for this case
[15:47:17] I'm guessing your brain is mainly thinking of the "threads_limited" case when writing this
[15:47:39] but if it works well, we might use it for broader and more-generic maintenance cases, too
[15:48:32] yes, one thing to consider, the current implementation was focused on varnish frontend only, so it accepts only restart_daemons and would not accept reboot as commands and doesn't touch ATS at all
[15:48:42] yeah
[15:48:46] we can easily have another one for more generic CP hosts restart all daemons / reboot
[15:49:08] there are cases where we just want to restart all varnishes, too (e.g. a config param change which can't be set at runtime)
[15:49:26] but there are cases like that for all the daemons, probably
[15:50:09] it's kind of hard to think through how abstract or specific to be there, in writing cookbooks like these
[15:51:07] with the current abstraction, it's much easier to have multiple cookbooks, it's just a few lines, so each one can do what's needed well without having to cover all cases
[15:51:14] our messy reality is every cluster/service/system/situation is different and full of corner cases
[15:51:22] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/776965 and 66 final for the grace sleep
[15:51:53] but philosophically, I lean in the direction that the ideal we should aim at (but will probably never reach!) is to standardize the clusters' operations and practices to where fewer custom scripts work across a broader array of scenarios
[15:52:39] eh in an ideal world™ yes :)
[15:53:14] well, ideal worlds are never real
[15:53:27] but aiming towards idealness as an ideal yields benefits :)
[15:56:03] anyways, along those lines, I could imagine in the future morphing roll-restart-varnish into a future sre.cdn.roll-restart which can operate on any subset of daemons (and their attendant pooling), or even do reboots as well. A generic "roll some kind of reboot/restart on the CDN clusters"
[15:56:39] and maybe if we had a standardized way of associating daemons with etcd service names, etc... that could even be genericized for any conforming cluster
[15:58:05] anyways
[15:58:11] on to testing, will try cp2027
[15:58:28] we kinda have both options already, there is sre.hosts.reboot-cluster that tries (and fails) to be a one-size-fits-all and then there is the abstraction with SREBatchBase & Co. that makes it very easy to create custom behaviours around a common base flow
[15:59:30] do you want me to merge and deploy first the sleep at 60?
[16:01:54] volans: sure
[16:02:06] volans: found one other edge case to contemplate, by chance!
[16:02:39] the puppet agent cronjob happened to fire while my attempt was running. we ended up 'successful', but that could've easily confused the process/script
[16:02:54] should probably disable-puppet (and wait for existing run to finish like it does) around things like this
[16:03:01] or puppet will try to restart stopped daemons, etc
[16:10:16] ack, I'll check it asap (based on how the meeting goes)
[16:14:46] anyways, otherwise everything seemed fine with the test run
[16:17:46] I'll add the disable/enable puppet later
[17:20:30] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (wiki_willy) a:Cmjohnson→Jclark-ctr
[17:20:57] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (wiki_willy) @Jclark-ctr - just following up Cathal's last comment >>! In T292095#7801403, @cmooney wrote: > @Jclark-ctr I'm not getting any...
[18:35:53] Traffic, SRE: Resolve issues with cp hosts and the reboot-single cookbook - https://phabricator.wikimedia.org/T305275 (ssingh) Open→Resolved We have tested the above two changes with six cp host reboots and there are no concerns, confirming that this issue has been fixed. Thanks to everyone for...
[18:36:50] volans: ^ thanks for your input!
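Note on the threads_limited check quoted at 15:40:14: the strict `>` comparison explains why a threshold of 0 matched nothing while -1 matched every host. A minimal sketch of that behaviour, assuming the usual Prometheus instant-query sample shape `[timestamp, "value"]`; the function name, the `instance` label and the sample hosts are hypothetical and not taken from the cookbook:

```python
from typing import Dict, List


def hosts_over_threshold(results: List[Dict], threshold: float) -> List[str]:
    """Return the instances whose sample value is strictly above threshold."""
    matching = []
    for metric in results:
        # Each sample is [timestamp, "value-as-string"], hence the float() cast.
        if float(metric["value"][1]) > threshold:
            # The "instance" label name is an assumption for this sketch.
            matching.append(metric["metric"]["instance"])
    return matching


# Made-up samples: the rate sits at 0 on both hosts, as it usually does.
SAMPLE = [
    {"metric": {"instance": "cp-example-1:9331"}, "value": [1649000000.0, "0"]},
    {"metric": {"instance": "cp-example-2:9331"}, "value": [1649000000.0, "0"]},
]

if __name__ == "__main__":
    print(hosts_over_threshold(SAMPLE, 0))   # strict ">" matches no host at 0
    print(hosts_over_threshold(SAMPLE, -1))  # -1 forces every host to match
```

Switching to `>=` would make a threshold of 0 match idle hosts too, which is why the choice was judged not to matter much when a typical bad value is around 100k.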
[18:37:49] sorry, it's late for you I guess, I forgot :(
[18:37:52] nothing to see here!
[18:50:19] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (Jclark-ctr) the two junipers are up now. @cmooney
[18:56:37] sukhe: lol, glad it's all fixed!
[19:09:57] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp1078:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[19:39:57] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp1079:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[19:54:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1079:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[20:10:57] (EdgeTrafficDrop) firing: 54% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[20:20:56] (EdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
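Note on the grace-sleep discussion (15:42 to 16:03): the per-host sequence described there is depool, sleep 20s, restart, sleep 15s, repool, then a grace sleep before the next host, with puppet suggested to be disabled around the whole step. The sketch below simulates that timeline with a fake clock to show how much time separates "repool of previous host" from "depool of next host". It is not the cookbook's code: every name here (`RollPlan`, `roll_restart`, `simulate`, the `cp-a`/`cp-b` hosts) is hypothetical, and only the 20s, 15s, 1s and 60s values come from the chat.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RollPlan:
    """Timings quoted in the chat; the field names are hypothetical."""
    pre_restart_sleep: float = 20.0   # after depool, before restart
    post_restart_sleep: float = 15.0  # after restart, before repool
    grace_sleep: float = 60.0         # after repool, before the next depool


def roll_restart(hosts: List[str], plan: RollPlan,
                 act: Callable[[str, str], None],
                 sleep: Callable[[float], None]) -> None:
    """Run the depool/restart/repool sequence on one host at a time."""
    for index, host in enumerate(hosts):
        # Keep puppet from restarting daemons that were stopped on purpose.
        act("disable-puppet", host)
        try:
            act("depool", host)
            sleep(plan.pre_restart_sleep)
            act("restart", host)
            sleep(plan.post_restart_sleep)
            act("repool", host)
        finally:
            act("enable-puppet", host)
        if index < len(hosts) - 1:
            # Let the repooled host take its initial traffic burst first.
            sleep(plan.grace_sleep)


def simulate(hosts: List[str], plan: RollPlan) -> List[Tuple[float, str, str]]:
    """Replay the roll against a fake clock and return (time, action, host)."""
    clock = 0.0
    events: List[Tuple[float, str, str]] = []

    def act(name: str, host: str) -> None:
        events.append((clock, name, host))

    def fake_sleep(seconds: float) -> None:
        nonlocal clock
        clock += seconds

    roll_restart(hosts, plan, act, fake_sleep)
    return events


if __name__ == "__main__":
    for grace in (1.0, 60.0):
        times = {(name, host): t
                 for t, name, host in simulate(["cp-a", "cp-b"], RollPlan(grace_sleep=grace))}
        gap = times[("depool", "cp-b")] - times[("repool", "cp-a")]
        print(f"grace_sleep={grace}: {gap}s between repool of cp-a and depool of cp-b")
```

Injecting `act` and `sleep` keeps the sketch side-effect free, much like the dry-run mode mentioned in the chat where sleeps are skipped but logged.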