[06:51:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:56:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:59:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:04:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[08:33:25] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10SRE Observability: SCS CPU monitoring issue - https://phabricator.wikimedia.org/T285229 (10ayounsi) This regularly alerts and is not actionable as it's a monitoring glitch. The CPU usage on the device is for example: `Cpu(s): 0.3%us, 0.0%sy, 0...
[08:54:11] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2039.codfw.wmnet with OS buster
[09:00:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2039:9331 is unreachable - https://alerts.wikimedia.org
[09:08:39] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10SRE Observability: SCS CPU monitoring issue - https://phabricator.wikimedia.org/T285229 (10fgiunchedi) Agreed the librenms patch is the way to go, I won't have the bandwidth any time soon but happy to assist
[09:15:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2039:9331 is unreachable - https://alerts.wikimedia.org
[09:16:11] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2039:9331 is unreachable - https://alerts.wikimedia.org
[09:17:03] mmandere: o/
[09:17:04] around?
[09:21:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2039:9331 is unreachable - https://alerts.wikimedia.org
[09:26:41] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2039:9331 is unreachable - https://alerts.wikimedia.org
[09:28:32] elukey o/
[09:29:31] hello :)
[09:29:57] I am doing some restarts related to purged and varnishkafka on cp6* nodes, I can loop you in if you want
[09:30:31] Hi there :) no problem I am happy to help
[09:31:11] so there are a couple of things in icinga that are alerting
[09:31:15] 1) purged lag state
[09:31:24] 2) varnishkafka delivery errors
[09:31:41] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2039:9331 is unreachable - https://alerts.wikimedia.org
[09:32:03] both are using, behind the scenes, librdkafka to communicate with kafka
[09:32:21] ok
[09:32:52] we have already seen this in the past, but with our current version when some network event happens (or new nodes are set up etc..) librdkafka may get into a weird state, ending up in timeouts etc.
[09:33:14] if you check logs for the purged instance on (for example) cp6014 you'll see what I mean
[09:33:33] the "delivery errors" for varnishkafka are failures to deliver a message to kafka
[09:33:40] on it
[09:33:40] but since they are not live, it is very weird
[09:33:47] mmandere: one important thing
[09:34:08] for purged I restarted up to cp6005
[09:34:13] https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1&from=now-3h&to=now&var-datasource=drmrs%20prometheus%2Fops&var-cluster=cache_text&var-instance=cp6005
[09:34:32] if you check the graphs, purged tries to pull a lot of data to recover
[09:34:39] (from kafka I mean)
[09:35:06] so to avoid any possible bw issues on kafka main, if possible let's stagger the restarts
[09:35:31] (basically restarting new instances only once the others have already recovered, keeping 2/3 instances pulling data at the same time)
[09:35:40] for varnishkafka you can roll restart freely
[09:36:00] I already tried varnishkafka-* on cp6009, it cleared the delivery error state
[09:36:12] (lemme know if you have questions etc.)
[09:45:38] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2039.codfw.wmnet with OS buster c...
[09:47:56] (EdgeTrafficDrop) firing: 63% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[09:48:38] elukey: So we are trying to restart purged on either 2 or 3 cache instances in drmrs starting with cp6006... while checking that the messages are indeed being processed, right?
[09:49:35] mmandere: exactly yes, just checking that we don't pull too much from kafka as a precautionary/paranoid measure (since we are not really in a hurry)
[09:49:46] for varnishkafka it can be done faster
[09:51:42] ok... so in what order? purged then varnishkafka will do
[09:51:59] for the instances actively being worked on
[09:57:10] mmandere: they can be restarted independently, you can proceed with both at the same time
[10:02:39] elukey: ack! I'll proceed with the remaining cp6006-cp6016
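A minimal Python sketch of the staggered-restart approach elukey describes above: restart purged on two or three hosts at a time and wait for each batch to catch up on Kafka before touching the next one. The host list format, the Prometheus endpoint and the lag metric name are assumptions for illustration only; in practice this would go through cumin and the purged dashboards rather than ssh.

    import json
    import subprocess
    import time
    import urllib.parse
    import urllib.request

    # Placeholders/assumptions: host names, Prometheus URL and metric name.
    HOSTS = ["cp60%02d.drmrs.wmnet" % n for n in range(6, 17)]   # cp6006 .. cp6016
    PROM = "http://prometheus.example.org/api/v1/query"          # placeholder endpoint
    LAG_QUERY = 'max(purged_backlog{instance=~"%s.*"})'          # hypothetical metric name
    BATCH_SIZE = 2                                               # keep only 2-3 instances catching up at once

    def current_lag(host):
        """Return the (hypothetical) purged backlog for one host, 0.0 if absent."""
        url = PROM + "?" + urllib.parse.urlencode({"query": LAG_QUERY % host})
        with urllib.request.urlopen(url) as resp:
            result = json.load(resp)["data"]["result"]
        return float(result[0]["value"][1]) if result else 0.0

    def restart_purged(host):
        """Restart purged on a single host; the real workflow would use cumin."""
        subprocess.run(["ssh", host, "sudo", "systemctl", "restart", "purged"], check=True)

    for i in range(0, len(HOSTS), BATCH_SIZE):
        batch = HOSTS[i:i + BATCH_SIZE]
        for host in batch:
            restart_purged(host)
        # Wait until this batch has drained its backlog before restarting the next one,
        # so no more than BATCH_SIZE consumers are re-pulling from kafka-main at a time.
        while any(current_lag(host) > 0 for host in batch):
            time.sleep(30)

varnishkafka, as noted above, has no comparable recovery backlog, so it can be roll-restarted without the batching.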
[10:02:56] (EdgeTrafficDrop) resolved: 60% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[10:40:54] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3062.esams.wmnet with OS buster
[10:47:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3062:9331 is unreachable - https://alerts.wikimedia.org
[10:52:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3062:9331 is unreachable - https://alerts.wikimedia.org
[11:22:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3062:9331 is unreachable - https://alerts.wikimedia.org
[11:30:45] godog: hmm it looks like I messed up back in the day when I added trafficserver_tls_client_ttfb_bucket
[11:31:14] so le is provided in seconds but sum is in ms /o\
[11:32:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3062:9331 is unreachable - https://alerts.wikimedia.org
[11:41:28] godog: also.. something funny is going on here https://phabricator.wikimedia.org/P21619
[11:41:41] Count = 1 but every bucket seems to be empty?
[11:42:26] hmm not empty.. 0 + 0 = 0 :)
[11:44:03] hmm scratch that, shouldn't the 0.045 bucket be 1 in there?
[11:46:02] right..
[11:46:30] if I replace the TTFB value on the test input from 0ms to 1ms, then the 0.045 bucket gets the expected value
[11:46:56] https://phabricator.wikimedia.org/P21620
[11:47:00] godog: mtail_store.py bug?
[11:47:12] or am I missing something pretty obvious here?
[11:56:05] this is the offending CR: https://gerrit.wikimedia.org/r/c/operations/puppet/+/767069 :)
[11:56:30] now I'm wondering how the change is going to affect the metric and the dashboards
[12:35:30] vgutierrez: you lost me, what's the bug?
[12:37:30] godog: on https://phabricator.wikimedia.org/P21619
[12:37:41] godog: see how Count is set to 1, but all the buckets seem to be empty
[12:38:31] vgutierrez: yeah, count is the number of observations
[12:38:46] so one observation of 0 would be count == 1 and all buckets 0
[12:39:00] hmm gotcha
[12:39:58] and the seconds VS milliseconds issue?
[12:40:04] same currently happens for atsbackend
[12:41:08] checking
[12:41:37] basically buckets are defined in seconds but sum is provided in milliseconds
[12:43:30] indeed :|
[12:44:52] hmmm
[12:44:53] also
[12:45:01] and I'm sorry for going back to the count issue
[12:45:15] if you're right, the current implementation is awfully wrong
[12:45:49] https://github.com/wikimedia/puppet/blob/production/modules/mtail/files/programs/atstls.mtail#L12-L52
[12:46:01] that's not adding the value on the bucket
[12:46:34] but incrementing every time that it sees a value that fits on the bucket
[12:47:48] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3062.esams.wmnet with OS buster c...
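To make the seconds-vs-milliseconds mismatch vgutierrez describes above concrete, here is a toy Python rendering of what the exporter effectively produces. This is not the mtail code itself, and the sample TTFB values and bucket bounds are invented: the point is that the `le` bounds are in seconds while `_sum` accumulates raw milliseconds, so anything derived from `rate(_sum)/rate(_count)` is off by a factor of 1000, whereas the buckets are internally consistent.

    # Toy model of the mismatch: buckets labelled in seconds, _sum accumulated in ms.
    # Sample TTFB values and bucket bounds are invented for illustration.
    observations_ms = [12, 80, 250, 900]
    bucket_bounds_s = [0.045, 0.1, 0.25, 0.5, 1.0]

    ttfb_count = len(observations_ms)                       # 4 observations
    ttfb_sum = sum(observations_ms)                         # 1242 -- in milliseconds
    ttfb_bucket = {le: sum(1 for v in observations_ms if v / 1000.0 <= le)
                   for le in bucket_bounds_s}               # cumulative counts, le in seconds

    # A "mean latency" panel computing sum/count gets 310.5, which a dashboard
    # assuming base units would read as ~310 seconds instead of ~0.31 seconds:
    print(ttfb_sum / ttfb_count)                            # 310.5

    # The buckets agree with each other, so quantiles derived from them
    # (histogram_quantile in PromQL) remain sane despite the broken _sum:
    print(ttfb_bucket)   # {0.045: 1, 0.1: 2, 0.25: 3, 0.5: 3, 1.0: 4}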
[12:48:29] doh, my bad, you are right, at least one bucket should be one
[12:48:49] vgutierrez: ^
[12:49:16] ack
[12:50:40] so https://gerrit.wikimedia.org/r/c/operations/puppet/+/767069 should fix the seconds VS milliseconds issue on the TTFB metrics for ats-tls
[12:51:20] and we need to do the same for https://github.com/wikimedia/puppet/blob/production/modules/mtail/files/programs/atsbackend.mtail
[12:58:11] looks like it yeah
[12:58:57] so how messed up is the current metric on prometheus?
[12:59:26] I guess it's hard to tell
[13:01:22] yeah, it also depends on whether we're using its 'sum' in any dashboard, afaict histogram_quantile() on the buckets is correct at least
[13:05:06] there's a tool to search grafana dashboards, I couldn't find it committed anywhere
[13:05:10] so there's now https://gerrit.wikimedia.org/r/c/operations/software/+/767118
[13:06:17] histogram_quantile definitely looks sane
[13:17:55] neat
[13:18:02] * godog meeting, bbiab
[13:23:37] 10Traffic, 10Observability-Metrics, 10SRE: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266 (10MMandere)
[13:32:04] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1087.eqiad.wmnet with OS buster
[13:33:47] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (10cmooney) Change has now been rolled out. All seems ok, aggregate route is still being created at POPs where it was previously, and announced exter...
[13:37:51] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (10ayounsi) @cmooney thanks! @ssingh let me know when we're good to advertise DoH from drmrs @bblack let me know when we're good to advertise nsa.wiki...
[13:38:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1087:9331 is unreachable - https://alerts.wikimedia.org
[13:43:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1087:9331 is unreachable - https://alerts.wikimedia.org
[13:49:26] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1087:9331 is unreachable - https://alerts.wikimedia.org
[14:04:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1087:9331 is unreachable - https://alerts.wikimedia.org
[14:23:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6014:9331 is unreachable - https://alerts.wikimedia.org
[14:36:06] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1087.eqiad.wmnet with OS buster c...
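godog's correction at 12:48 is just standard Prometheus cumulative-histogram semantics: buckets are "less than or equal" upper bounds, so a single observation of 0 yields count == 1 and every bucket, including le=0.045, at 1 rather than 0. A minimal sketch of that behaviour (illustrative only, not the mtail implementation):

    # Minimal model of Prometheus cumulative-histogram behaviour.
    def observe(buckets, state, value):
        """Record one observation: bump count, add to sum, bump every bucket with le >= value."""
        state["count"] += 1
        state["sum"] += value
        for le in buckets:
            if value <= le:
                buckets[le] += 1

    bounds = [0.045, 0.1, 0.25, 0.5, 1.0, float("inf")]
    buckets = {le: 0 for le in bounds}
    state = {"count": 0, "sum": 0.0}

    observe(buckets, state, 0.0)     # the single 0ms observation from P21619

    print(state)     # {'count': 1, 'sum': 0.0}
    print(buckets)   # every bucket is 1 -- count == 1 with all-zero buckets would be inconsistent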
[14:38:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6014:9331 is unreachable - https://alerts.wikimedia.org
[14:47:16] 10Traffic: Prometheus Varnish Exporter fails to start on some instances in DRMRS with Out of Memory Error - https://phabricator.wikimedia.org/T302206 (10ssingh) a:03ssingh
[14:59:49] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez)
[15:00:37] 10Traffic, 10SRE, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez)
[15:01:16] 10Traffic, 10SRE, 10Upstream: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (10Vgutierrez) 05Open→03Resolved We've migrated the cp servers using envoy to HAProxy so this shouldn't be an issue anymore.
[15:13:02] godog: any precautions that I should take before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/767069?
[15:15:09] vgutierrez: I'd say perhaps a quick audit of the dashboards with the tool I linked before to see where we use the metric(s), and a git grep in puppet.git / alerts.git to double check we're not using _sum, LGTM other than that
[15:16:21] modules/profile/files/prometheus/rules_ops.yml: expr: sum by (cluster, http_status_family, cache_status, le) (rate(trafficserver_tls_client_ttfb_bucket[2m]))
[15:16:29] modules/profile/files/prometheus/rules_ops.yml: expr: sum by (cluster, http_status_family, cache_status) (rate(trafficserver_tls_client_ttfb_count[2m]))
[15:22:54] seems safe yeah
[15:55:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6016:9331 is unreachable - https://alerts.wikimedia.org
[16:03:03] godog: ack, I'll merge it tomorrow morning
[16:09:45] SGTM vgutierrez
[16:15:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6016:9331 is unreachable - https://alerts.wikimedia.org
[19:06:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6011:9331 is unreachable - https://alerts.wikimedia.org
[19:11:56] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp6011:9331 is unreachable - https://alerts.wikimedia.org
[19:16:56] (VarnishPrometheusExporterDown) firing: (3) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[19:41:57] (VarnishPrometheusExporterDown) firing: (4) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[19:56:57] (VarnishPrometheusExporterDown) firing: (3) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[20:11:57] (VarnishPrometheusExporterDown) firing: (4) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[20:26:57] (VarnishPrometheusExporterDown) firing: (3) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[20:47:56] (EdgeTrafficDrop) firing: 50% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org
[20:52:56] (EdgeTrafficDrop) resolved: 67% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org
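The pre-merge audit godog suggests at 15:15 can be scripted; the sketch below is only an illustration of that check, and the repository paths are placeholders for local checkouts of puppet.git and alerts.git.

    # Sketch of the suggested audit: grep local checkouts for uses of the _sum series
    # before merging a change that alters its unit. The paths below are placeholders.
    import subprocess

    REPOS = ["/path/to/operations-puppet", "/path/to/operations-alerts"]
    PATTERN = "trafficserver_tls_client_ttfb_sum"

    for repo in REPOS:
        result = subprocess.run(
            ["git", "-C", repo, "grep", "-n", PATTERN],
            capture_output=True,
            text=True,
        )
        # git grep exits with status 1 when there are no matches; that is the "safe" outcome here.
        matches = result.stdout.strip()
        print(repo, "->", matches if matches else "no uses of _sum found")

In the chat above, the grep only turned up `_bucket` and `_count` usage in rules_ops.yml, which is why the change was judged safe to merge.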
[21:16:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6013:9331 is unreachable - https://alerts.wikimedia.org
[21:35:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6009:9331 is unreachable - https://alerts.wikimedia.org
[21:56:17] 10netops, 10Infrastructure-Foundations: SingTel transport circuit ELINEGWR00001716 down - https://phabricator.wikimedia.org/T302841 (10CDanis)
[22:10:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6009:9331 is unreachable - https://alerts.wikimedia.org
[22:14:26] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp6009:9331 is unreachable - https://alerts.wikimedia.org
[22:39:04] (EdgeTrafficDrop) firing: 50% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[22:39:17] ^^ FWIW none of the above cp alerts exactly correspond with the blip on the GTT VPLS service in drmrs I saw earlier at 19:27.
[22:40:33] I ran a few tests from alert1001 and right now it can seemingly connect to them. Either way, if I gave the impression it was all due to GTT, I don't believe that is the case.
[22:41:57] Traffic is also going via the Telxius 10G wave, so that GTT blip shouldn't have affected it either way.
[22:41:57] https://phabricator.wikimedia.org/P21627
[23:04:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6016:9331 is unreachable - https://alerts.wikimedia.org