[06:51:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:56:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:59:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:04:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[08:33:25] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10SRE Observability: SCS CPU monitoring issue - https://phabricator.wikimedia.org/T285229 (10ayounsi) This regularly alerts and is not actionable as it's a monitoring glitch. The CPU usage on the device is for example: `Cpu(s): 0.3%us, 0.0%sy, 0...
[08:54:11] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2039.codfw.wmnet with OS buster
[09:00:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2039:9331 is unreachable - https://alerts.wikimedia.org
[09:08:39] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10SRE Observability: SCS CPU monitoring issue - https://phabricator.wikimedia.org/T285229 (10fgiunchedi) Agreed the librenms patch is the way to go, I won't have the bandwidth any time soon but happy to assist
[09:15:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2039:9331 is unreachable - https://alerts.wikimedia.org
[09:16:11] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2039:9331 is unreachable - https://alerts.wikimedia.org
[09:17:03] mmandere: o/
[09:17:04] around?
[09:21:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2039:9331 is unreachable - https://alerts.wikimedia.org
[09:26:41] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2039:9331 is unreachable - https://alerts.wikimedia.org
[09:28:32] elukey o/
[09:29:31] hello :)
[09:29:57] I am doing some restarts related to purged and varnishkafka on cp6* nodes, I can loop you in if you want
[09:30:31] Hi there :) no problem I am happy to help
[09:31:11] so there are a couple of things in icinga that are alerting
[09:31:15] 1) purged lag state
[09:31:24] 2) varnishkafka delivery errors
[09:31:41] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2039:9331 is unreachable - https://alerts.wikimedia.org
[09:32:03] both are using, behind the scenes, librdkafka to communicate with kafka
[09:32:21] ok
[09:32:52] we have already seen this in the past, but with our current version when some network event happens (or new nodes are set up etc..) librdkafka may get into a weird state, ending up in timeouts etc.
[09:33:14] if you check logs for the purged instance on (for example) cp6014 you'll see what I mean
[09:33:33] the "delivery errors" for varnishkafka are failures to deliver a message to kafka
[09:33:40] on it
[09:33:40] but since they are not live, it is very weird
[09:33:47] mmandere: one important thing
[09:34:08] for purged I restarted up to cp6005
[09:34:13] https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1&from=now-3h&to=now&var-datasource=drmrs%20prometheus%2Fops&var-cluster=cache_text&var-instance=cp6005
[09:34:32] if you check the graphs, purged tries to pull a lot of data to recover
[09:34:39] (from kafka I mean)
[09:35:06] so to avoid any possible bw issues on kafka main, if possible let's stagger the restarts
[09:35:31] (basically restarting new instances only once the others have already recovered, keeping 2/3 instances pulling data at the same time)
[09:35:40] for varnishkafka you can roll restart freely
[09:36:00] I already tried varnishkafka-* on cp6009, it cleared the delivery error state
[09:36:12] (lemme know if you have questions etc.)
[09:45:38] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2039.codfw.wmnet with OS buster c...
[09:47:56] (EdgeTrafficDrop) firing: 63% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[09:48:38] elukey: So we are trying to restart purged on either 2 or 3 cache instances in drmrs starting with cp6006... while checking that the messages are indeed being processed, right?
[09:49:35] mmandere: exactly yes, just checking that we don't pull too much from kafka as a precautionary/paranoid measure (since we are not really in a hurry)
[09:49:46] for varnishkafka it can be done faster
[09:51:42] ok... so in what order? purged then varnishkafka will do
[09:51:59] for the instances actively being worked on
[09:57:10] mmandere: they can be restarted independently, you can proceed with both at the same time
[10:02:39] elukey: ack! I'll proceed with the remaining cp6006-cp6016
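A minimal Python sketch of the staggered-restart approach elukey describes above: restart purged on two or three hosts at a time and wait for each batch to catch up on Kafka before touching the next one. The host list format, the Prometheus endpoint and the lag metric name are assumptions for illustration only; in practice this would go through cumin and the purged dashboards rather than ssh.

    import json
    import subprocess
    import time
    import urllib.parse
    import urllib.request

    # Placeholders/assumptions: host names, Prometheus URL and metric name.
    HOSTS = ["cp60%02d.drmrs.wmnet" % n for n in range(6, 17)]   # cp6006 .. cp6016
    PROM = "http://prometheus.example.org/api/v1/query"          # placeholder endpoint
    LAG_QUERY = 'max(purged_backlog{instance=~"%s.*"})'          # hypothetical metric name
    BATCH_SIZE = 2                                               # keep only 2-3 instances catching up at once

    def current_lag(host):
        """Return the (hypothetical) purged backlog for one host, 0.0 if absent."""
        url = PROM + "?" + urllib.parse.urlencode({"query": LAG_QUERY % host})
        with urllib.request.urlopen(url) as resp:
            result = json.load(resp)["data"]["result"]
        return float(result[0]["value"][1]) if result else 0.0

    def restart_purged(host):
        """Restart purged on a single host; the real workflow would use cumin."""
        subprocess.run(["ssh", host, "sudo", "systemctl", "restart", "purged"], check=True)

    for i in range(0, len(HOSTS), BATCH_SIZE):
        batch = HOSTS[i:i + BATCH_SIZE]
        for host in batch:
            restart_purged(host)
        # Wait until this batch has drained its backlog before restarting the next one,
        # so no more than BATCH_SIZE consumers are re-pulling from kafka-main at a time.
        while any(current_lag(host) > 0 for host in batch):
            time.sleep(30)

varnishkafka, as noted above, has no comparable recovery backlog, so it can be roll-restarted without the batching.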
[10:02:56] (EdgeTrafficDrop) resolved: 60% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[10:40:54] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3062.esams.wmnet with OS buster
[10:47:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3062:9331 is unreachable - https://alerts.wikimedia.org
[10:52:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3062:9331 is unreachable - https://alerts.wikimedia.org
[11:22:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3062:9331 is unreachable - https://alerts.wikimedia.org
[11:30:45] godog: hmm it looks like I messed up back in the day when I added trafficserver_tls_client_ttfb_bucket
[11:31:14] so le is provided in seconds but sum is in ms /o\
[11:32:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3062:9331 is unreachable - https://alerts.wikimedia.org
[11:41:28] godog: also.. something funny is going on here https://phabricator.wikimedia.org/P21619
[11:41:41] Count = 1 but every bucket seems to be empty?
[11:42:26] hmm not empty.. 0 + 0 = 0 :)
[11:44:03] hmm scratch that, shouldn't the 0.045 bucket be 1 in there?
[11:46:02] right..
[11:46:30] if I replace the TTFB value on the test input from 0ms to 1ms, then the 0.045 bucket gets the expected value
[11:46:56] https://phabricator.wikimedia.org/P21620
[11:47:00] godog: mtail_store.py bug?
[11:47:12] or am I missing something pretty obvious here?
[11:56:05] this is the offending CR: https://gerrit.wikimedia.org/r/c/operations/puppet/+/767069 :)
[11:56:30] now I'm wondering how the change is going to affect the metric and the dashboards
[12:35:30] vgutierrez: you lost me, what's the bug?
[12:37:30] godog: on https://phabricator.wikimedia.org/P21619
[12:37:41] godog: see how Count is set to 1, but all the buckets seem to be empty
[12:38:31] vgutierrez: yeah, count is the number of observations
[12:38:46] so one observation of 0 would be count == 1 and all buckets 0
[12:39:00] hmm gotcha
[12:39:58] and the seconds VS milliseconds issue?
[12:40:04] same currently happens for atsbackend
[12:41:08] checking
[12:41:37] basically buckets are defined in seconds but sum is provided in milliseconds
[12:43:30] indeed :|
[12:44:52] hmmm
[12:44:53] also
[12:45:01] and I'm sorry for going back to the count issue
[12:45:15] if you're right, the current implementation is awfully wrong
[12:45:49] https://github.com/wikimedia/puppet/blob/production/modules/mtail/files/programs/atstls.mtail#L12-L52
[12:46:01] that's not adding the value on the bucket
[12:46:34] but incrementing every time that it sees a value that fits on the bucket
[12:47:48] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3062.esams.wmnet with OS buster c...
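To make the seconds-vs-milliseconds mismatch vgutierrez describes above concrete, here is a toy Python rendering of what the exporter effectively produces. This is not the mtail code itself, and the sample TTFB values and bucket bounds are invented: the point is that the `le` bounds are in seconds while `_sum` accumulates raw milliseconds, so anything derived from `rate(_sum)/rate(_count)` is off by a factor of 1000, whereas the buckets are internally consistent.

    # Toy model of the mismatch: buckets labelled in seconds, _sum accumulated in ms.
    # Sample TTFB values and bucket bounds are invented for illustration.
    observations_ms = [12, 80, 250, 900]
    bucket_bounds_s = [0.045, 0.1, 0.25, 0.5, 1.0]

    ttfb_count = len(observations_ms)                       # 4 observations
    ttfb_sum = sum(observations_ms)                         # 1242 -- in milliseconds
    ttfb_bucket = {le: sum(1 for v in observations_ms if v / 1000.0 <= le)
                   for le in bucket_bounds_s}               # cumulative counts, le in seconds

    # A "mean latency" panel computing sum/count gets 310.5, which a dashboard
    # assuming base units would read as ~310 seconds instead of ~0.31 seconds:
    print(ttfb_sum / ttfb_count)                            # 310.5

    # The buckets agree with each other, so quantiles derived from them
    # (histogram_quantile in PromQL) remain sane despite the broken _sum:
    print(ttfb_bucket)   # {0.045: 1, 0.1: 2, 0.25: 3, 0.5: 3, 1.0: 4}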
[12:48:29] doh, my bad, you are right, at least one bucket should be one
[12:48:49] vgutierrez: ^
[12:49:16] ack
[12:50:40] so https://gerrit.wikimedia.org/r/c/operations/puppet/+/767069 should fix the seconds VS milliseconds issue on the TTFB metrics for ats-tls
[12:51:20] and we need to do the same for https://github.com/wikimedia/puppet/blob/production/modules/mtail/files/programs/atsbackend.mtail
[12:58:11] looks like it yeah
[12:58:57] so how messed up is the current metric on prometheus?
[12:59:26] I guess it's hard to tell
[13:01:22] yeah, it also depends on whether we're using its 'sum' in any dashboard, afaict histogram_quantile() on the buckets is correct at least
[13:05:06] there's a tool to search grafana dashboards, I couldn't find it committed anywhere
[13:05:10] so there's now https://gerrit.wikimedia.org/r/c/operations/software/+/767118
[13:06:17] histogram_quantile definitely looks sane
[13:17:55] neat
[13:18:02] * godog meeting, bbiab
[13:23:37] 10Traffic, 10Observability-Metrics, 10SRE: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266 (10MMandere)
[13:32:04] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1087.eqiad.wmnet with OS buster
[13:33:47] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (10cmooney) Change has now been rolled out. All seems ok, aggregate route is still being created at POPs where it was previously, and announced exter...
[13:37:51] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (10ayounsi) @cmooney thanks! @ssingh let me know when we're good to advertise DoH from drmrs @bblack let me know when we're good to advertise nsa.wiki...
[13:38:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1087:9331 is unreachable - https://alerts.wikimedia.org
[13:43:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1087:9331 is unreachable - https://alerts.wikimedia.org
[13:49:26] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1087:9331 is unreachable - https://alerts.wikimedia.org
[14:04:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1087:9331 is unreachable - https://alerts.wikimedia.org
[14:23:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6014:9331 is unreachable - https://alerts.wikimedia.org
[14:36:06] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1087.eqiad.wmnet with OS buster c...
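godog's correction at 12:48 is just standard Prometheus cumulative-histogram semantics: buckets are "less than or equal" upper bounds, so a single observation of 0 yields count == 1 and every bucket, including le=0.045, at 1 rather than 0. A minimal sketch of that behaviour (illustrative only, not the mtail implementation):

    # Minimal model of Prometheus cumulative-histogram behaviour.
    def observe(buckets, state, value):
        """Record one observation: bump count, add to sum, bump every bucket with le >= value."""
        state["count"] += 1
        state["sum"] += value
        for le in buckets:
            if value <= le:
                buckets[le] += 1

    bounds = [0.045, 0.1, 0.25, 0.5, 1.0, float("inf")]
    buckets = {le: 0 for le in bounds}
    state = {"count": 0, "sum": 0.0}

    observe(buckets, state, 0.0)     # the single 0ms observation from P21619

    print(state)     # {'count': 1, 'sum': 0.0}
    print(buckets)   # every bucket is 1 -- count == 1 with all-zero buckets would be inconsistent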
[14:38:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6014:9331 is unreachable - https://alerts.wikimedia.org
[14:47:16] 10Traffic: Prometheus Varnish Exporter fails to start on some instances in DRMRS with Out of Memory Error - https://phabricator.wikimedia.org/T302206 (10ssingh) a:03ssingh
[14:59:49] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez)
[15:00:37] 10Traffic, 10SRE, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez)
[15:01:16] 10Traffic, 10SRE, 10Upstream: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (10Vgutierrez) 05Open→03Resolved We've migrated the cp servers using envoy to HAProxy so this shouldn't be an issue anymore.
[15:13:02] godog: any precautions that I should take before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/767069?
[15:15:09] vgutierrez: I'd say perhaps a quick audit of the dashboards with the tool I linked before to see where we use the metric(s), and a git grep in puppet.git / alerts.git to double check we're not using _sum, LGTM other than that
[15:16:21] modules/profile/files/prometheus/rules_ops.yml: expr: sum by (cluster, http_status_family, cache_status, le) (rate(trafficserver_tls_client_ttfb_bucket[2m]))
[15:16:29] modules/profile/files/prometheus/rules_ops.yml: expr: sum by (cluster, http_status_family, cache_status) (rate(trafficserver_tls_client_ttfb_count[2m]))
[15:22:54] seems safe yeah
[15:55:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6016:9331 is unreachable - https://alerts.wikimedia.org
[16:03:03] godog: ack, I'll merge it tomorrow morning
[16:09:45] SGTM vgutierrez
[16:15:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6016:9331 is unreachable - https://alerts.wikimedia.org
[19:06:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6011:9331 is unreachable - https://alerts.wikimedia.org
[19:11:56] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp6011:9331 is unreachable - https://alerts.wikimedia.org
[19:16:56] (VarnishPrometheusExporterDown) firing: (3) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[19:41:57] (VarnishPrometheusExporterDown) firing: (4) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[19:56:57] (VarnishPrometheusExporterDown) firing: (3) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[20:11:57] (VarnishPrometheusExporterDown) firing: (4) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[20:26:57] (VarnishPrometheusExporterDown) firing: (3) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[20:47:56] (EdgeTrafficDrop) firing: 50% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org
[20:52:56] (EdgeTrafficDrop) resolved: 67% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org
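The pre-merge audit godog suggests at 15:15 can be scripted; the sketch below is only an illustration of that check, and the repository paths are placeholders for local checkouts of puppet.git and alerts.git.

    # Sketch of the suggested audit: grep local checkouts for uses of the _sum series
    # before merging a change that alters its unit. The paths below are placeholders.
    import subprocess

    REPOS = ["/path/to/operations-puppet", "/path/to/operations-alerts"]
    PATTERN = "trafficserver_tls_client_ttfb_sum"

    for repo in REPOS:
        result = subprocess.run(
            ["git", "-C", repo, "grep", "-n", PATTERN],
            capture_output=True,
            text=True,
        )
        # git grep exits with status 1 when there are no matches; that is the "safe" outcome here.
        matches = result.stdout.strip()
        print(repo, "->", matches if matches else "no uses of _sum found")

In the chat above, the grep only turned up `_bucket` and `_count` usage in rules_ops.yml, which is why the change was judged safe to merge.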
[21:16:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6013:9331 is unreachable - https://alerts.wikimedia.org
[21:35:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6009:9331 is unreachable - https://alerts.wikimedia.org
[21:56:17] 10netops, 10Infrastructure-Foundations: SingTel transport circuit ELINEGWR00001716 down - https://phabricator.wikimedia.org/T302841 (10CDanis)
[22:10:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6009:9331 is unreachable - https://alerts.wikimedia.org
[22:14:26] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp6009:9331 is unreachable - https://alerts.wikimedia.org
[22:39:04] (EdgeTrafficDrop) firing: 50% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[22:39:17] ^^ FWIW none of the above cp alerts exactly correspond with the blip on the GTT VPLS service in drmrs I saw earlier at 19:27.
[22:40:33] I ran a few tests from alert1001 and right now it can seemingly connect to them. Either way, if I gave the impression it was all due to GTT, I don't believe that is the case.
[22:41:57] Traffic is also going via the Telxius 10G wave, so that GTT blip shouldn't have affected it either way.
[22:41:57] https://phabricator.wikimedia.org/P21627
[23:04:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6016:9331 is unreachable - https://alerts.wikimedia.org