[06:35:50] Hi everybody, I am going to start the reimage of ores2001 to Buster in a few. Ping me in case something weird happens and I am not seeing it :D
[07:12:00] * kormat prepares to ping elukey if no alerts fire
[07:12:41] :)
[07:12:47] thanks for the trust, as always
[07:21:58] 💜
[07:26:01] <3
[08:54:07] ores2001 reimaged, it seems to be working fine with Buster and Python 3.7, nothing horrible in logs/logstash/etc.. registered
[08:54:19] I'll keep everything triple-checked today
[09:10:12] nice :-)
[09:32:55] Hi, I'm Simon. Today is my first day with Wikimedia :-)
[09:33:28] Hi Simon! Welcome!
[09:33:42] (Luca, Machine Learning)
[09:34:23] hello, Simon, which team are you on? :-)
[09:36:21] Oh, yeah. I'm on the infrastructure team
[09:36:36] nice to hear! Welcome
[09:37:48] Thank you :-)
[09:42:41] Hey! Welcome!
[09:46:10] slyngshede: welcome :)
[09:47:44] Welcome aboard slyngshede!
[09:51:07] indeed, welcome aboard slyngshede !
[10:01:41] welcome slyngshede
[10:07:18] RhinosF1: you scared them off
[10:08:11] kormat: be glad I'm only on irc, actually seeing me half awake probably would
[10:08:20] haha
[10:09:22] dcaro: let's continue here without all the bot msgs around :)
[10:10:21] okok
[10:10:53] dcaro: so.. reload-acme-chief-backend.timer needs to be running for you to get timely renewals of your certs
[10:11:09] it's active
[10:11:21] enabled and loaded
[10:11:32] that's deployed by https://github.com/wikimedia/puppet/blob/production/modules/acme_chief/manifests/server.pp#L139-L149
[10:11:47] and it should be triggered once every hour
[10:11:56] and that's what happens in our environment
[10:12:33] yep, all that seems to match what's there, except it getting triggered
[10:13:36] https://www.irccloud.com/pastebin/MfVt8bEO/
[10:13:41] any major difference with yours?
[10:14:53] no difference, same content
[10:20:52] restarting the underlying service (reload-acme-chief-backend.service) makes the timer have trigger times now
[10:21:25] (OnUnitInactive seems to work)
[10:21:54] but for some reason it did not do OnActive (that's the one that should bootstrap it, no?)
[10:22:47] that service just triggers a reload of acme-chief and stops
[10:22:48] so it should be inactive most of the time
[10:25:41] yep, but the first time it's activated (I'm guessing when installing it/changing it) it should then trigger itself again after 1s, to start restarting the service, no?
[10:26:28] otherwise if the service is never activated by hand, it will never get triggered (as the OnUnitInactive only happens when the unit becomes inactive, not when it is inactive iirc)
[10:29:17] dcaro: yep.. that's why OnActiveSec is there
[10:29:34] and that's what bootstraps the timer after a system reboot for instance
[10:29:53] and AFAIK that's working as expected.. we rebooted the acme-chief instances 1 month ago and the timer worked as expected
[10:30:27] well, that one was last triggered 5 months and 8 days ago
[10:31:27] maybe the OnActive has a race condition or something? (maybe the accuracy being higher than the OnActive means sometimes it gets missed?)
[10:31:42] especially on VMs with flaky time accounting xd
[10:32:37] our instances are VMs as well
[10:34:03] dcaro: as a follow up let's patch that snippet of puppetization to allow injecting a custom email instead of the hardcoded sre-traffic@wikimedia.olg
[10:34:06] *org
[10:34:10] reading the docs it does not seem that accuracy should matter much in this case
[10:34:58] that's already done I think
[10:35:00] Process: 18477 ExecStart=/usr/local/bin/systemd-timer-mail-wrapper -T root@tools-acme-chief-01.tools.eqiad.wmflabs --only-on-error /bin/systemctl reload acme-chief (code=exited, status=0/SUCCESS)
[10:35:07] that email?
[10:35:13] nope, that's the FQDN
[10:35:47] if you run a systemctl cat reload-acme-chief-backend.service
[10:35:55] you should see Environment="MAILTO=sre-traffic@wikimedia.org"
[10:36:05] yep, I see
[10:36:53] but...
[10:37:01] parser.add_argument('-T', '--mail-to', default='root@{}'.format(getfqdn()))
[10:37:11] that's from systemd-timer-mail-wrapper
[10:37:45] oh.. the code prefers MAILTO to -T
[10:37:57] so it gets overridden?
[10:38:00] indeed
[10:38:07] https://www.irccloud.com/pastebin/WTm9ifqN/
[10:43:43] So no need to change anything
[10:43:45] ?
[10:46:21] uh?
[10:46:40] " as a follow up let's patch that snippet of puppetization to allow injecting a custom email instead of the hardcoded sre-traffic@wikimedia.olg"
[10:46:40] dcaro: so tools instances shouldn't email sre-traffic@wikimedia.org
[10:46:57] but it's not, no? the email in the env is overridden by the parameter
[10:47:05] the other way around
[10:47:15] environment variable trumps the -T parameter
[10:47:22] oh, that's a weird behavior
[10:52:53] so you want it to never use the env var? (so it uses the parameter instead), make the env var configurable and set sre-traffic as default?
[10:53:04] (and keep the parameter useless)
[10:54:14] we could remove the env variable and let root@ take those
[10:54:37] or add a profile variable that allows customizing that env variable
[10:55:06] dcaro: also.. I'm wondering if https://gerrit.wikimedia.org/r/c/operations/puppet/+/788310 is useful or not in your environment
[10:57:15] it probably is, maybe not paging but alerting should be good
[10:58:37] if paging works as in our environment it won't page.. as the check isn't set as critical
[10:59:06] sounds good to me then
[10:59:08] :)
[10:59:17] sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/78831 for the email
[10:59:41] * dcaro going to get some food 🍝
[11:20:08] folks, reiterating in here too - I accidentally ran rm -rf under /srv/deployment on deploy1002 (I know, sorry), so please be very careful if you need to run scap commands
[11:29:21] urbanecm, taavi: as likely deployers ^
[11:29:35] thanks RhinosF1
[11:29:43] affects only services deployment though AFAIK
[11:29:43] np urbanecm
[11:31:21] elukey: there are backups of that
[11:31:31] should we run a recovery?
[11:58:11] BTW, unrelated to all the above, I am running a transfer from backup1002 to backup1008, hopefully at reasonable speeds - it will take ~24 hours to complete
[13:56:05] andrewbogott: hi! if you are around today, I would like to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/779936. I am happy to take care of the dnsrecursor and Wikidough part, and you can do that for the cloudservices one.
[13:56:13] if not today, then tomorrow also works :)
[13:56:32] tomorrow is probably better, we're upgrading openstack today which interacts closely with DNS
[13:56:48] yep, let's do it tomorrow then. thanks :)
[16:31:41] hashar (or others) can I shut off integration-castor04 for a few minutes? I want to migrate it to another host but it seems too busy to migrate live :)
[16:32:23] I don't see that host on https://integration.wikimedia.org/ci/ for some reason
[16:46:36] castor isn't a jenkins builder, the jobs rsync to/from castor (idk the answer to your actual question about shutting it off), so it's not in the jenkins UI
[17:50:20] legoktm: ok, thanks
[17:50:27] * andrewbogott already shut it off
[21:03:41] Assuming not every incident is going to be in an incident review ritual, how are incidents selected to be 'eligible for ritual'?
[21:53:34] Anyone around who has operator status in #wikimedia-operations?
[21:55:09] jhathaway: o/ need something?
[21:55:31] cwhite: can you mark me as being on clinic duty?
[21:56:07] cwhite: thanks
[21:56:07] done
[22:14:56] the incident response wiki template says "TODO: Add the #Sustainability (Incident Followup) and the #SRE-OnFIRE (Pending Review & Scorecard) Phabricator tag to these tasks." The second one, #SRE-OnFire .." though does not exist anymore and instead we have the tag "Wikimedia-Incident". Not sure how exactly to edit the template.
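
Note on the 10:37-10:47 exchange about systemd-timer-mail-wrapper: a minimal Python sketch of the precedence described there, where the MAILTO environment variable wins over the -T/--mail-to option. Only the add_argument line is quoted from the log. The helper name resolve_recipient, its explicit argv/environ parameters, and the exact lookup expression are assumptions made for illustration; this is not the wrapper's actual source.

```python
import argparse
from socket import getfqdn


def resolve_recipient(argv, environ):
    """Return the address a report would be mailed to (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # This line is the one quoted in the log at 10:37:01.
    parser.add_argument('-T', '--mail-to', default='root@{}'.format(getfqdn()))
    args = parser.parse_args(argv)
    # Assumed lookup order: a non-empty MAILTO trumps whatever -T provided.
    return environ.get('MAILTO') or args.mail_to


if __name__ == '__main__':
    cli = ['-T', 'root@tools-acme-chief-01.tools.eqiad.wmflabs']
    # With MAILTO set in the unit's environment, the -T value is ignored.
    print(resolve_recipient(cli, {'MAILTO': 'sre-traffic@wikimedia.org'}))
    # Without MAILTO, the -T value (or the root@<fqdn> default) is used.
    print(resolve_recipient(cli, {}))
```

Under this assumed order, passing -T has no effect for as long as the unit sets MAILTO, which is why the options raised in the log are either dropping the environment variable or making its value configurable.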