[00:26:57] (EdgeTrafficDrop) firing: 33% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[00:41:57] (EdgeTrafficDrop) resolved: 67% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[01:28:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5002:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[01:57:51] Traffic, DC-Ops: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (ssingh)
[09:04:57] (EdgeTrafficDrop) firing: 66% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[09:14:57] (EdgeTrafficDrop) resolved: 60% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[09:37:57] (EdgeTrafficDrop) firing: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[09:47:56] (EdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[10:00:42] Traffic, DC-Ops, SRE, ops-eqsin: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (Vgutierrez)
[10:52:56] (EdgeTrafficDrop) firing: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[10:57:56] (EdgeTrafficDrop) resolved: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[11:20:57] (EdgeTrafficDrop) firing: 60% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[11:24:04] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5015.eqsin.wmnet with OS buster
[11:38:38] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6007.drmrs.wmnet with OS buster
[11:50:56] (EdgeTrafficDrop) resolved: 65% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[12:46:53] netops, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (ayounsi) >>! In T300977#7821702, @Isaac wrote: > Chiming in as a heavy user of the stat boxes. It's difficult for me to...
[12:49:45] bblack: FYI the change to the default sleep to 60s and disable puppet for the varnish cookbook are live.
[12:50:10] I'm not sure if the puppet disable is really needed, as the sysctl restart shouldn't trigger any bad behaviour, but that's up to you :)
[13:00:22] volans: yeah the problem comes in with all the strange service dependencies. e.g. we're restarting service "B", and in systemd and/or puppet terms, A depends on B depends on C or whatever, and puppet wants to start the currently-stopped ones in a different order than systemd is doing it and causes some mayhem
[13:01:08] as a general case, given our complexity, we've found that almost anything else we do operationally outside of puppet doesn't blend well with puppet runs
[13:02:38] ack :)
[13:20:05] netops, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (ayounsi) @BTullis > I realize that this suggestion increases the scope if the task considerably yup :) We unfortunately...
[14:08:28] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5013.eqsin.wmnet with OS buster
[14:16:56] (EdgeTrafficDrop) firing: 26% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[14:22:58] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5007.eqsin.wmnet with OS buster
[14:31:56] (EdgeTrafficDrop) resolved: 0% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[14:33:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5013:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[14:50:34] FYI re: roll-restart varnish and https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/777353 you can also include the threshold in the query for prometheus/thanos to get filtered results
[14:53:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp5013:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[14:57:36] godog: if you have handy the query feel free to comment it into the CR and I'll adapt
[14:57:54] that will get rid of some of the logic after
[14:58:03] and return a smaller set of items
[14:59:00] I had looked to optimize the query when an alias is passed to set the job/site params, but it was too delicate if let's say tomorrow we'll have an alias cp-text_esams+drmrs for example :D
[14:59:38] volans: ack, yeah I didn't want to muck the waters of the review! but essentially adding " > threshold" would DTRT
[14:59:58] I believe, I don't know if that's been tried before and failed for some reason tho
[15:00:20] agreed the alias map to job/site seems fragile
[15:00:22] I have still a shell where I was testing it, let me try
[15:00:43] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5013.eqsin.wmnet with OS buster com...
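Editor's note: the " > threshold" suggestion above relies on PromQL comparison operators acting as filters: comparing an instant vector against a scalar drops every series that fails the comparison, so the server returns only the matching items. A minimal Python sketch of the two approaches follows. The metric name, labels, and host names are hypothetical placeholders, and this is not the code from the linked Gerrit change.

```python
from typing import Dict


def with_threshold(query: str, threshold: float) -> str:
    """Append a PromQL comparison so the server returns only series above threshold."""
    # Parentheses keep the comparison applied to the whole original expression.
    return f"({query}) > {threshold}"


def filter_client_side(samples: Dict[str, float], threshold: float) -> Dict[str, float]:
    """The logic that becomes unnecessary once the threshold lives in the query."""
    return {host: value for host, value in samples.items() if value > threshold}


if __name__ == "__main__":
    # Hypothetical metric and labels, used only for illustration.
    base = 'example_metric{job="example-job",site="example-site"}'
    print(with_threshold(base, 0.5))

    # Hypothetical per-host values, as an unfiltered query might return them.
    unfiltered = {"host-a": 0.2, "host-b": 0.9, "host-c": 0.5}
    print(filter_client_side(unfiltered, 0.5))
```

With the comparison in the query, the caller receives the already-filtered set and can drop the equivalent of `filter_client_side`, which matches the "get rid of some of the logic after" and "return a smaller set of items" remarks.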
[15:01:05] ack thanks, in theory the semantics are exactly the same, can be postponed too as a change of course
[15:01:15] yeah seems to work, fixing, thx
[15:02:16] volans: ok! happy to review another change too if that's easier
[15:17:33] {sent}
[15:19:40] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5007.eqsin.wmnet with OS buster com...
[15:44:09] volans: ack, checking
[17:46:56] (EdgeTrafficDrop) firing: 57% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[17:51:56] (EdgeTrafficDrop) resolved: 67% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[18:14:34] Wikimedia-Apache-configuration, Wikimedia-Site-requests, User-Daniel: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922 (Umherirrender)
[18:43:58] Wikimedia-Apache-configuration, SRE: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176 (Umherirrender)
[20:35:45] Wikimedia-Apache-configuration, SRE, serviceops: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 (RLazarus)
[21:45:11] Traffic, Data-Engineering, SRE, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (Dzahn)
[21:55:03] Traffic, Data-Engineering, SRE, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (Dzahn) > Modify the puppet code to no longer download the databases from MaxMind and then propagate to other servers/destinations. This is done. puppet c...
[21:57:35] Traffic, Data-Engineering, SRE, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (Dzahn) a:Dzahn→None
[22:02:56] Traffic, Analytics, SRE, Patch-For-Review: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (Dzahn)