[08:32:01] Can someone start a public incident report for Friday's switch issue?
[08:39:56] is there a mechanism in puppet/CI to assert things like "this file must be valid json/yaml/whatever, and fail otherwise" ?
[08:42:02] case in point, I'd like to validate
[08:42:09] modules/profile/files/rsyslog/lookup_table_output.json as json
[08:42:50] godog: sort of, see testenv:adminschema in tox.ini
[08:43:08] in that case we specify also the schema
[08:43:18] and perform additional validation in ./modules/admin/data/data_validate.py
[08:44:00] volans: thank you, I'll start from that
[08:56:23] * godog opens taskgen.rb
[08:56:27] hold my ruby, I'm going in
[08:56:39] adios!
[08:56:48] :-P
[08:57:32] hehehe
[08:59:53] ok that's easier than I thought for just generic json parsing
[09:15:41] sent out a couple of reviews which should DTRT, the rabbit hole wasn't that deep actually, let me know what you think
[15:35:38] Anyone around for a +2 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/703912? There's a bit of manual cleanup I need to do on webperfX002 after it lands.
[15:39:15] dpifke: I can take a look in 30m or so
[15:40:11] No rush, any time in the next ~2.5 hours is perfect. Thanks!
[16:00:30] apergos: noticing some 12,000 errors in a single minute from snapshot1002 about fputs() unable to write to non-file boolean
[16:00:49] > PHP Warning: fopen(/tmp/svwiki/20210718/svwiki-20210718-stubs-meta-hist-incr.xml): failed to open stream: No such file or directory > PHP Warning: fputs() expects parameter 1 to be resource, boolean given
[16:00:54] 1002?
[16:00:59] there is no 1003
[16:01:01] snapshot1008
[16:01:06] or 2...
[16:01:16] I'm doing testing over there, please ignore
[16:01:17] my bad :)
[16:01:20] okido
[16:01:43] and I'm probably writing to some nonexistent file and don't even care
[16:01:55] sorry for the noise
[16:19:45] dpifke: done!
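The "just generic json parsing" check godog describes above could be sketched along these lines; this is a minimal, hypothetical helper (the function name and CLI-less shape are my own, not what the actual tox/CI patch does), which simply attempts to parse each file and reports any that fail:

```python
import json

def validate_json_files(paths):
    """Return a list of "path: error" strings for files that fail to parse as JSON.

    An empty return value means every file was valid; CI would fail on any
    non-empty result. Collects all errors rather than stopping at the first.
    """
    errors = []
    for path in paths:
        try:
            with open(path, encoding="utf-8") as f:
                json.load(f)
        except (OSError, ValueError) as exc:
            errors.append(f"{path}: {exc}")
    return errors
```

A CI job would call this with paths such as `modules/profile/files/rsyslog/lookup_table_output.json` and exit non-zero if the list is non-empty; schema-aware validation (as in testenv:adminschema) would layer on top of the plain parse.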
and a reminder that https://wikitech.wikimedia.org/wiki/Puppet_request_window is always available for patches like this too
[16:20:06] Thanks!
[16:20:07] Krinkle: now writing to a path that actually exists, for my test :-P
[17:24:18] Anyone around for a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/704567? Fixes a systemd timer to use `OnUnitActiveSec` instead of `OnActiveSec` so that it fires every 30 minutes as desired, instead of erroneously firing only once
[17:24:38] this is the entire diff: https://gerrit.wikimedia.org/r/c/operations/puppet/+/704567/2/modules/profile/manifests/elasticsearch/cirrus.pp
[17:27:44] ryankemper: LGTM
[17:28:01] ty
[17:29:56] jbond: looks like there's an unmerged puppet change to add a dummy logoutd pw to `hieradata/common/profile/gerrit.yaml`
[17:32:44] ryankemper: among other hosts elastic2038 is back now that the switch is back, but there are a few failed prometheus-related units, maybe you could have a look?
[17:33:22] volans: sure, usually they're a bit finicky with restart order, i'll try kicking the units over
[17:33:36] I forced a puppet run, didn't help
[17:33:51] but it fixed ferm fwiw :D
[17:36:07] at least we got something out of it then :P
[17:36:21] in wdqs streaming-updater meeting now so will take a proper look in ~25 mins
[17:40:26] ryankemper: all good now, it was ES on localhost not yet started
[17:40:37] I guess it took a while to be available again
[17:40:59] Yup, I just restarted them a couple mins ago (about to log in #ops)
[17:41:09] ah sorry, missed it
[17:41:12] Had `Jul 16 18:31:20 elastic2038 elasticsearch[957]: 2021-07-16 18:31:20,657 main ERROR Unknown GELF server hostname:udp:logstash.svc.eqiad.wmnet`, looks like it wouldn't self-heal till restart
[17:41:21] not nice
[17:41:24] So I restarted the es services themselves, then the prom stuff
[17:42:49] got it, maybe worth some investigation on ways to prevent that, if we had a dns resolver issue we shouldn't need to restart the whole ES fleet :)
[17:47:00] volans: yeah definitely agreed. off the top of your head is there anything obvious? i'd think maybe a health check that does the equivalent of `GET _cluster/health` and restarts the service if it hears nothing back?
[17:48:24] I was more thinking on the prevention side, if there is a way to tell ES to not die/get stuck
[18:03:26] for the alarm part we could also add a check, but I guess the failed units for prometheus might do the same
[18:03:59] if we want to do auto-remediation I'd rather do it centralized so that we can orchestrate it in a safe way
[18:30:08] new record: 4 pending changes on puppetmaster
[18:30:22] someone please merge it all :p
[18:31:28] cc rzl, ryankemper ^^^
[18:32:14] ah my bad I put "yes" instead of "multiple" earlier
[18:32:16] doing now
[18:32:40] thanks ryankemper :)
[18:32:42] done
[18:33:12] ryankemper: thanks!
[18:36:46] FYI, just tested https://gerrit.wikimedia.org/r/704861 and it works, empty commit messages in the private repo should be a thing of the past
[18:37:22] (and I'm glad it works, that would have been a very silly mail to ops@ if my commit had gone through)
[18:37:41] :) nice
[18:42:56] rzl: people who don't set good commit messages drive me insane. I should steal that.
[19:05:20] thanks ryankemper sorry missed your ping earlier
[19:05:57] jbond: no worries it was clear from the diff it was a no-op but was just being extra paranoid! (also forgot you're in eu :P)
[19:06:16] no worries and thanks
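ryankemper's health-check idea at 17:47:00 could be sketched as below. This is purely illustrative: the endpoint URL assumes Elasticsearch's default local port, and the actual restart action (e.g. a systemctl call) is left as an injected callback, since per volans at 18:03:59 any real remediation would be orchestrated centrally rather than run blindly per-host:

```python
import json
import urllib.error
import urllib.request

def cluster_is_responsive(url="http://localhost:9200/_cluster/health", timeout=5):
    """Do the equivalent of `GET _cluster/health`; True if we get a parseable status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            health = json.load(resp)
        return health.get("status") in ("green", "yellow", "red")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or garbage response: treat as unresponsive.
        return False

def remediate(check=cluster_is_responsive, restart=lambda: None):
    """If the health check hears nothing back, invoke the restart hook once.

    `restart` is a placeholder; a real deployment would plug in something like
    a service restart, ideally gated by central orchestration.
    """
    if check():
        return "healthy"
    restart()
    return "restarted"
```

Note this only covers the remediation side; it does nothing for the prevention question volans raises (keeping ES from getting stuck on an unresolvable GELF hostname in the first place).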