[00:39:35] Traffic, SRE, conftool, Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794 (CDanis)
[03:10:56] (HAProxyEdgeTrafficDrop) firing: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:20:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[12:43:50] jbond: thanks for all your patches! +1 from me on one and a couple of comments on the other. would love a +1 from either vgutierrez or bblack on https://gerrit.wikimedia.org/r/c/operations/puppet/+/832268
[12:47:06] cdanis: thanks for the review, comments make sense, I'll update later today and have another crack at the vtc test. and also _1 for getting a +1 from vgut.ierrez or bbl.ack :)
[12:47:22] * +1 for getting a +1
[12:47:49] I'll do a bit of poking at the webrequest history mining and show you my work after :)
[12:49:47] cdanis: <3 that would be great. are you doing a jupyter/data sleuthing talk next week? if not then I'd love to attend one. I have done some hbase/hadoop stuff before but it was all about 10 years ago, so very out of date, and I've not played with any of the modern tooling
[12:50:04] *if not next week, I'd love to attend one eventually
[12:50:27] but also appreciate you are overloaded like many
[12:50:36] I have a mini-version of that talk next week -- it is somewhere between a full Jupyter/hive/spark talk and the quick intro to the various places to get live webrequest data I gave at an SRE team meeting some weeks ago
[12:51:08] oh I may have missed the SRE team meeting one, do you know if it was recorded?
[12:51:19] I'm not sure actually!
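(One of those "places to get live webrequest data" is the sampled webrequest feed kept on the centrallog hosts, which the summit session mentioned just below works through with jq. A minimal sketch, not an authoritative recipe: the /srv/weblog/webrequest/sampled-1000.json path and the field names are assumed from the webrequest schema and may differ.)
    # tally the busiest uri_hosts among cache misses/passes in the last ~100k
    # sampled requests (1:1000 sampling), straight from the JSON log
    tail -n 100000 /srv/weblog/webrequest/sampled-1000.json \
      | jq -r 'select(.x_cache // "" | test("miss|pass")) | .uri_host' \
      | sort | uniq -c | sort -rn | head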
[12:51:26] the session at the summit will be laptops required, we'll do some exercises jointly and then separately in turnilo, centrallog /srv/weblog + jq, and then jupyter
[12:51:43] I can send you slides, sec
[12:51:59] please, and ack, that sounds good, I'll try to attend remotely
[12:52:07] https://docs.google.com/presentation/d/1cUNqYO1NU6d_mZ69zgexLrgxiboKxJVw9AVQEU-OLyE/edit#slide=id.g15105b408d_0_287
[12:52:13] from the sre team meeting
[12:52:19] thanks
[12:52:47] oh make sure to turn on speaker notes, there's content there
[12:53:07] ack thanks
[17:28:16] (VarnishTrafficDrop) firing: Varnish traffic in eqsin has dropped 69.00963618324154% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[17:29:56] (HAProxyEdgeTrafficDrop) firing: 55% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:33:16] (VarnishTrafficDrop) resolved: (2) Varnish traffic in eqsin has dropped 53.43758929426661% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[17:34:56] (HAProxyEdgeTrafficDrop) resolved: 65% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:35:03] we have talked about this in the past, but we should look into either improving or retiring these alarms
[17:35:13] there is real alert fatigue otherwise
[18:54:45] I suggest just deleting it, AFAIK the only thing that it consistently detects is "half an hour ago, someone finished running a scraper"
[18:55:14] I am not the most objective person on this, but I do think that the NEL alerts capture everything that a traffic drop alert is intended to do, and more
[19:26:30] hey traffic folks, puppet is disabled with no note on cp1081 for a bit over 48 hours -- is that intended?
[19:27:05] not sure who to highlight, brett maybe? :)
[19:27:07] (and the ocsp update has monitoring alerts)
[19:27:13] but just that host
[19:27:15] right yes, thanks
[19:27:56] was cp1081 a testing host in some form? the number feels familiar but now I don't see it
[19:28:46] also ran a manual /usr/local/sbin/update-ocsp-all there in an attempt to fix the alert
[19:29:55] what's even weirder tbh is the lack of entries in `last` on cp1081 or in either cumin host's cumin.log for cp1081
[19:30:33] lastlog | grep -v Never showed me nobody on it since July until now
[19:31:09] it would not be the first time that puppet is disabled by itself somehow with an empty message
[19:31:20] really?
[19:31:24] wow
[19:31:25] that's spooky
[19:31:36] I would not say super common but have seen it before
[19:31:42] then we just enabled it again
[19:32:32] dunno. maybe twice a year among all the hosts.. is my feeling
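(When a host turns up disabled like that, the agent's disable lock is the thing to inspect. A quick sketch, assuming the stock statedir of /var/lib/puppet/state; the lock path can differ depending on how the agent is packaged.)
    # show whether the agent is administratively disabled, and with what message
    LOCK=/var/lib/puppet/state/agent_disabled.lock
    if [ -e "$LOCK" ]; then
        echo "puppet disabled, reason: '$(jq -r .disabled_message "$LOCK")'"
    else
        echo "puppet enabled"
    fi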
[19:33:32] or it was disabled via cumin, and then when re-enabling via cumin it was down
[19:33:47] or did not match the message used when disabling
[19:41:49] can't find anything in logs, SAL, phab, email, IRC... I'm inclined to follow mutante's lead and go ahead and re-enable, hope that doesn't mess up anyone's ongoing work :/
[19:45:20] would be curious if that fixes the OCSP alerts or not
[19:46:00] the bonus is that broken puppet did not alert as CRIT broken puppet but:
[19:46:05] UNKNOWN - NRPE: Unable to read output
[19:46:32] other NRPE checks are working though, so it's not that nagios-nrpe needs a restart
[19:48:06] hu
[19:48:09] hi, reading
[19:48:21] sukhe: hello :) was about to re-enable puppet but waiting now :)
[19:48:32] yes please, checking
[19:50:49] yeah I was confusing this with another host, but no
[19:50:55] last shows tstarling
[19:50:58] on July 5
[19:51:34] weird indeed
[19:52:29] host is pooled as well
[19:55:03] still checking
[19:55:06] puppet just stopped 2 days ago
[19:55:13] yeah I see
[19:55:14] Sep 14 16:34:46 cp1081 puppet-agent[1649]: Disabling Puppet.
[19:55:17] and the empty message looks like it was not a human
[19:55:18] and then Sep 14 16:46:34 cp1081 puppet-agent[8234]: Enabling Puppet.
[19:55:20] a human would use "foo"
[19:55:21] :)
[19:55:24] ha
[19:56:17] wondering if the "unknown" in Icinga will go away first
[19:56:55] sukhe: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cp1081&scroll=330
[19:57:26] There are no records available to display.
[19:57:39] /admin1-> racadm getsel
[19:57:56] I know why this number 1081 is so familiar
[19:57:58] it's the lemon host
[19:58:06] the one that had 10 hardware tickets or something
[19:58:09] wasn't it
[19:58:25] yeah I think so, if you mean what I first thought, but that was cp1089
[19:58:37] T310387
[19:58:38] T310387: cp1089 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T310387
[19:58:51] I mean that still wouldn't explain the failed puppet runs, but yeah
[19:59:17] you are right, 1089
[20:00:01] --reason 'solar flares'
[20:00:04] haha
[20:00:55] hmm interesting
[20:01:01] requires an extremely unlikely series of bit flips to type that out, but a very honest one
[20:02:15] ;)
[20:03:45] I do see a period on Sep 14 12:30 ET (so 16:30 UTC) where we disabled Puppet for https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/8e063e91af1e71ef4c00e27ccba891d13ea55d2b
[20:03:50] but then it was enabled
[20:03:56] and the logs confirm that
[20:03:57] Sep 14 16:46:34 cp1081 puppet-agent[8234]: Enabling Puppet.
[20:04:05] but then it says,
[20:04:05] Sep 14 16:46:34 cp1081 puppet-agent[8238]: Skipping run of Puppet configuration client; administratively disabled (Reason: '');
[20:04:21] those timestamps
[20:04:23] 👀
[20:04:29] yep :D
[20:04:55] and cp1081 is an ATS8 host and it would have been disabled here (which is confirmed by the logs anyway)
[20:05:07] it's a rare race condition where puppet fails to re-enable
[20:05:09] some kind of race condition maybe??? if you enable it right as it checks
[20:05:11] probably that
[20:05:12] yeah
[20:05:14] and you get it once every couple of months
[20:05:16] that's wild
[20:05:17] so in any case, let's just enable it
[20:05:22] 👍
[20:05:25] +1
[20:05:26] I am going to go ahead
[20:05:27] ok?
[20:05:31] go
[20:05:32] ok thanks
[20:06:04] there are more things in heaven and earth apparently! never would have guessed that was a possibility
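(A rough sketch of how the "enable it right as it checks" window could be poked at on a throwaway host. This only illustrates the hypothesis behind those identical timestamps, not what the cumin-driven disable/enable on cp1081 actually did; whether the window is hit on any given attempt depends entirely on timing.)
    # disable, kick off an agent run, and re-enable almost at the same time;
    # if the run reads the disable lock before the enable removes it, it bails
    # out with "Skipping run ... administratively disabled" even though the
    # enable has already happened
    puppet agent --disable 'race test'
    puppet agent --test > /tmp/agent-run.log 2>&1 &
    puppet agent --enable
    wait
    grep -i 'skipping run' /tmp/agent-run.log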
[20:06:23] but puppet not running normally does not mean the timers are stopping
[20:06:50] yep
[20:06:52] -CONFIG proxy.config.ssl.client.verify.server INT 2
[20:06:52] +CONFIG proxy.config.ssl.client.verify.server INT 1
[20:07:16] so I wonder if it's normal that broken puppet means a broken ocsp update
[20:07:22] ah
[20:07:25] https://puppetboard.wikimedia.org/report/cp1081.eqiad.wmnet/8c6d4b0ecc2fe3afd15f8947ed299bcf21a9db3e
[20:07:39] how long do we save Puppet runs for?
[20:07:49] I would have expected to see runs prior to Sep 14 here
[20:07:56] after a week they fall out of puppetdb
[20:08:00] anyway, it's enabled again and I will keep an eye out for surprises
[20:08:10] thanks rzl for pointing it out, and mutante and cdanis as well
[20:08:23] thanks for the detective work!
[20:08:27] mutante: right, but even here there is one entry!? https://puppetboard.wikimedia.org/node/cp1081.eqiad.wmnet
[20:09:12] sukhe: puppet only stopped on the 14th, it looked like
[20:09:31] ocsp recovery ✅
[20:09:39] yea..hmm. let's see if it clears.. heh ;) nice
[20:12:04] sukhe: we keep only 24h of reports on puppetdb
[20:12:54] ah! TIL. thanks volans
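(For the curious, the report-retention question can also be answered straight from PuppetDB rather than via puppetboard. A minimal sketch against the v4 reports endpoint, run from wherever PuppetDB is reachable; the localhost:8080 address is the stock non-TLS listener and is an assumption here.)
    # list the report timestamps and statuses PuppetDB still holds for the host
    curl -sG 'http://localhost:8080/pdb/query/v4/reports' \
      --data-urlencode 'query=["=", "certname", "cp1081.eqiad.wmnet"]' \
      | jq -r '.[] | [.receive_time, .status] | @tsv'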