[08:45:10] Traffic, SRE: OCSP staple validity alerts/warnings misfire - https://phabricator.wikimedia.org/T304047 (fgiunchedi)
[09:10:25] sukhe, bblack, fyi https://gerrit.wikimedia.org/r/c/operations/homer/public/+/771438/1#message-3cfefb903ff0eecabf49eba39bcc2a31528bffdd so the varnishkafka issue is not related
[09:16:55] Typoed BGP session on cr2-drmrs removed
[09:18:35] I'm still looking into BFD, note that it doesn't impact BGP as BFD never came up, so I don't see it as a blocker to prod usage
[09:20:18] About kafka this works fine:
[09:20:25] ayounsi@cp6002:~$ nc -4zv kafka-jumbo1001.eqiad.wmnet 9092
[09:20:25] Connection to kafka-jumbo1001.eqiad.wmnet 9092 port [tcp/*] succeeded!
[09:20:25] ayounsi@cp6002:~$ nc -6zv kafka-jumbo1001.eqiad.wmnet 9092
[09:20:25] Connection to kafka-jumbo1001.eqiad.wmnet 9092 port [tcp/*] succeeded!
[09:22:10] so not sure what the issue is
[12:32:34] netops, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (Ottomata) Would prefer to proceed if possible, just want to finish so...
[13:43:37] netops, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (ayounsi) Sounds good, it's quite quick to apply, but first: ` promet...
[14:02:11] Traffic, SRE: OCSP staple validity alerts/warnings misfire - https://phabricator.wikimedia.org/T304047 (Vgutierrez) yep.. it's a threshold issue... OCSP warning is triggered mostly at the very same time that acme-chief fetches a new OCSP response from Let's Encrypt. The alert remains active till puppet r...
[14:14:48] netops, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (Ottomata) a:Ottomata→None
[14:18:36] netops, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (fgiunchedi) For sure, these are the IPs that the pushgateway could po...
[14:31:49] netops, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (ayounsi) Open→Resolved a:ayounsi All done.
[14:34:59] I'm still at a loss on the varnishkafka drerr data missing
[14:35:18] (for drmrs)
[14:36:40] at some point it briefly sent some data, while I was trying various things to kick it
[14:36:43] [2022-03-17 14:07:54] SERVICE ALERT: alert1001;cache_text: Varnishkafka eventlogging Delivery Errors per second -drmrs-;UNKNOWN;SOFT;1;NaN
[14:36:46] [2022-03-17 14:05:00] SERVICE ALERT: alert1001;cache_text: Varnishkafka eventlogging Delivery Errors per second -drmrs-;OK;HARD;5;(C)5 ge (W)1 ge 0
[14:36:55] I had restarted various daemons
[14:37:14] I've tried generating some local curl traffic in case it just doesn't send stats when no data is flowing, but that doesn't seem reliable either.
[14:39:59] I think it was when I restarted varnish-frontend itself that we briefly got some drerr data
[14:40:15] there's a lot of lag in the pipeline trying to line up actions and effect
[14:46:50] yeah, restarting varnish-frontend on a node gets it to send drerr stats for ~2-3 minutes
[14:47:26] this may really be a reqrate thing, and my test reqs on one node aren't enough to trip it?
[14:52:24] ottomata: any idea if that's a reasonable theory? [that lack of sufficient real user traffic would cause varnishkafka drerr to not be reported for statsv + eventlogging, but still work fine for webrequest?]
[15:00:02] nevermind
[15:00:18] so the two that are missing drerr (statsv and eventlogging) are because there's no input data
[15:00:23] because they filter on specific URIs:
[15:00:28] eventlogging has the filter:
[15:00:29] varnish.arg.q = ReqURL ~ "^/(beacon/)?event(\.gif)?\?" and ReqHeader:user-agent !~ "^Fuzz Faster U Fool"
[15:00:47] and statsv has:
[15:00:48] varnish.arg.q = ReqURL ~ "^/beacon/statsv\?"
[15:00:59] these are presumably URIs that are hit at some rate by real clients via JS stuff
[15:01:18] so DCs with real users see some traffic on these, and drmrs doesn't, and that explains the UNKNOWN state on drerr
[15:01:26] oh
[15:01:41] there is def traffic sent by varnishkafka by them
[15:01:55] so there are real requests hitting those URIs
[15:02:05] yeah but not in drmrs, because it has no real users yet
[15:02:25] (i just looked up drmrs)
[15:02:30] riihgtt
[15:02:40] my test curls don't execute the subrequests that would generate those, or the JS or whatever
[15:02:46] ahhh i see
[15:03:09] ok, so now they're known UNKNOWNs, and we can ignore that problem as a turn-up blocker :)
[15:23:56] (EdgeTrafficDrop) firing: 40% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org
[15:47:34] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (BBlack) Open→Resolved With the addition of the drmrs to the dns config in https://gerrit.wikimedia.org/r/c/operations/dns/+/771342 we're basically done w...
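Editor's note on the 15:00 discussion: the two varnish.arg.q filters quoted there can be approximated in a few lines of Python to see why plain test curls never feed the eventlogging or statsv streams. This is only a rough sketch. The regexes are copied from the log, but translating VSL query syntax into Python `re`, the function name `matching_streams`, and treating a missing User-Agent as an empty string are all assumptions made for illustration, not how varnishkafka itself evaluates the query.

```python
import re

# Regexes copied from the varnish.arg.q lines quoted in the log.
# Evaluating them with Python's re module is an assumption of this sketch.
EVENTLOGGING_URL = re.compile(r"^/(beacon/)?event(\.gif)?\?")
EVENTLOGGING_UA_EXCLUDE = re.compile(r"^Fuzz Faster U Fool")
STATSV_URL = re.compile(r"^/beacon/statsv\?")


def matching_streams(url, user_agent=""):
    """Return the (hypothetical) stream names whose filter accepts the request.

    A missing User-Agent is treated as an empty string here, which is a
    simplification and may differ from real VSL query semantics.
    """
    streams = []
    url_ok = EVENTLOGGING_URL.search(url) is not None
    ua_ok = EVENTLOGGING_UA_EXCLUDE.search(user_agent) is None
    if url_ok and ua_ok:
        streams.append("eventlogging")
    if STATSV_URL.search(url):
        streams.append("statsv")
    return streams


if __name__ == "__main__":
    # An ordinary page fetch, like a basic test curl, matches neither filter.
    print(matching_streams("/wiki/Main_Page", "curl/7.74.0"))
    # Beacon-style URIs, normally produced by client-side JS, do match.
    print(matching_streams("/beacon/event?x=1", "Mozilla/5.0"))
    print(matching_streams("/beacon/statsv?foo=1", "Mozilla/5.0"))
```

Under these assumptions, a site with no real browser traffic produces no input for either stream, which is consistent with the UNKNOWN drerr state described in the log.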
[15:47:37] Traffic, netops, Infrastructure-Foundations, SRE: drmrs: primary software task - https://phabricator.wikimedia.org/T282788 (BBlack)
[15:48:56] (EdgeTrafficDrop) resolved: 60% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org
[16:02:16] Traffic, netops, Infrastructure-Foundations, SRE: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (BBlack)
[16:16:01] Traffic, SRE, PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (Aklapper) Half a year later, is there more to do here? Anything I could (maybe) help with?
[16:24:04] Traffic, SRE, PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (BBlack) Lots left to do here, we've just been pummeled by several layers of ever-increasing high-priority things that take precedence over each other. What we're blocked on here is making time to do the tri...
[16:35:14] Acme-chief, User-dcaro, cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (bd808) >>! In T273956#7437046, @Vgutierrez wrote: > @dcaro I've implemented systemd's watchdog support on acme-chief. This is...
[17:10:34] Acme-chief, User-dcaro, cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (bd808) Toolforge admins got a notice today from Let's Encrypt that *.toolforge.org, *.tools.wmflabs.org, mail.tools.wmcloud.or...
[17:17:03] Congrats on getting drmrs live
[17:24:59] Acme-chief, User-bd808, User-dcaro, cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (bd808) Open→Resolved a:bd808 Hosts to check for/update to acme-chief 0.34-1 from https://openstack...
[22:26:35] Traffic, SRE, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus)
[22:35:12] Traffic, SRE, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus)