[00:18:25] FIRING: SystemdUnitFailed: podman-auto-update.service on moss-be1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:18:40] FIRING: SystemdUnitFailed: podman-auto-update.service on moss-be1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:41:28] Going to switch the s6 codfw master
[07:12:29] I'd like to reboot cumin2002 tomorrow, is there anything DB or backup related which would prevent that?
[07:14:02] I don't think so
[07:14:09] but wait for jaime to confirm the backups side
[07:14:45] sure thing
[07:43:25] RESOLVED: SystemdUnitFailed: podman-auto-update.service on moss-be1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:38] Could I get a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1041163 please? Jesse did the most recent version of the patchset, fixing my bugs :) It's templating out a bunch of spec files for cephadm to use.
[08:58:15] s1 (eqiad): 1.0 TB. The previous backup had a size of 1.2 TB, a change larger than 5.0%.
[08:58:27] -14.3 %
[08:59:43] the same, on dumps: -5.7 %. The previous backup had a size of 184.5 GB, a change larger than 5.0%.
[09:12:04] jynus: see my question from above, are you okay with a cumin2002 reboot tomorrow morning? then I'd send a notification to the sre-at-large list
[09:12:42] let me see what time
[09:13:38] what time, moritzm? lately backups have been taking a while
[09:13:59] e.g. they haven't finished today yet
[09:15:10] I wonder if it could be 11:30 UTC+
[09:17:20] I need to reorganize those to get them to run faster, but for that I first need to finish the bullseye upgrade
[09:19:06] actually for codfw, they should finish by 9:45 UTC
[09:20:04] jynus: yup, those are pagelinks \o/ I wrote some numbers in the task description of https://phabricator.wikimedia.org/T352010
[09:27:51] I'm totally flexible with the time, noon or later would also be fine with me
[09:29:45] yeah, that would be ideal if it can be chosen. Not a big worry if it has to be earlier, but if we could choose, that would create the least amount of work for me
[09:41:32] I'm completing the clouddb reboots (only 3 left: 1018, 1019, 1020)
[09:49:27] hmm s2 is stuck on a big alter table, so "STOP SLAVE" fails, shall I wait for that tx to complete?
[09:49:53] (that's trying to reboot clouddb1018)
[09:51:50] I am slowly getting out of the sinkhole that is mediabackup issues, and eqiad is almost fully healthy now
[09:57:01] I repooled clouddb1018 and will reboot it later when s2 catches up
[10:22:57] dhinus: we got alerts on -operations
[10:25:22] sorry
[10:25:49] I'm following the same procedure as yesterday, but this time the alerts must have worked
[10:25:53] let me figure out why
[10:26:21] jynus: I'll send an announcement for 13h UTC; and I'll check before I start whether all is completed
[10:26:50] thank you moritzm for your time
[10:28:35] marostegui: it's possible I simply waited a few seconds more between "systemctl stop" and the reboot
[10:29:21] dhinus: yes, yesterday the hosts were about to alert, but you downtimed them before that happened
[10:29:29] I'd suggest you downtime them before doing anything
[10:29:34] That is how I do it
[10:29:39] +1
[10:29:44] I was surprised I didn't trigger any alerts yesterday, so it's probably a good thing I triggered one today :)
[10:30:10] It was just coincidence :)
[10:30:15] (that you didn't trigger any)
[10:31:07] yes, it makes sense to downtime before starting the operation
[10:31:40] I'm updating https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host where that was recommended
[10:32:07] (I'm updating other sections, but I will keep that one)
[10:32:14] let me know if you want a reviewer!
[10:32:55] arnaudb: that would be great, I'll ping you when I've done all the changes
[10:33:03] ...should we have a cookbook for rebooting database hosts?
[10:33:45] Emperor: I was also thinking about it, but I think having good docs is a good first step :)
[10:34:06] Emperor: We are working on the foundations of that, arnaudb is on it :)
[10:34:41] Cool
[10:35:29] :D
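For reference, the order of operations being discussed above (downtime the host first, then stop replication and MariaDB, then reboot) roughly corresponds to the sketch below. The cookbook name, flags, host name, downtime duration and socket path are illustrative and from memory, not a copy of the real procedure; the authoritative steps are on https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host.

    # Sketch only: downtime the host in Alertmanager/Icinga *before* touching MariaDB,
    # so nothing alerts mid-reboot (cookbook name/flags, duration and reason are placeholders).
    sudo cookbook sre.hosts.downtime --hours 2 -r "clouddb reboot" 'clouddb1018.eqiad.wmnet'

    # Then stop replication and MariaDB cleanly before rebooting. On multi-instance
    # hosts like the clouddbs each section has its own unit and socket (e.g. s2 here).
    sudo mysql -S /run/mysqld/mysqld.s2.sock -e "STOP SLAVE"   # may block behind a long-running ALTER TABLE
    sudo systemctl stop mariadb@s2
    sudo reboot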
[11:17:10] dhinus: is it expected that clouddb1018:s2 replication is stopped?
[11:17:15] or did you maybe forget to start it?
[11:45:08] nope, I did try "stop slave" but killed it
[11:45:17] so I did not expect it to stop
[11:45:59] dhinus: ok, I just started it
[11:47:29] thanks. can you think of a reason why it stopped? the last time I checked "show processlist" it was still running a big "alter table"
[11:47:55] you tried stop slave, and then you killed that?
[11:49:16] yes
[11:49:22] then that explains it
[11:49:39] even if you killed it, it kept waiting, and once the alter table finished it went through and stopped replication
[11:49:46] ok, my bad, I didn't expect that
[11:49:50] no worries
[11:50:02] I will add a note to the wiki :)
[11:50:19] :)
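The behaviour described above is that a STOP SLAVE issued while a long ALTER TABLE is replicating queues up behind it, and killing the blocked client session does not cancel the request, so replication still stops once the ALTER completes. One way to avoid the surprise is to check for long-running statements first; the sketch below is a generic illustration (socket path and the 600-second threshold are placeholders), not the exact check used on these hosts.

    # Sketch: look for long-running statements (e.g. an ongoing ALTER TABLE) before
    # issuing STOP SLAVE; socket path and threshold are illustrative.
    sudo mysql -S /run/mysqld/mysqld.s2.sock -e "
      SELECT id, user, time, state, LEFT(info, 80) AS query
      FROM information_schema.processlist
      WHERE command <> 'Sleep' AND time > 600"

    # If a big ALTER TABLE shows up, either wait for it or expect STOP SLAVE to block;
    # as noted above, killing the blocked session won't cancel it, and replication will
    # still stop once the ALTER is done.
    sudo mysql -S /run/mysqld/mysqld.s2.sock -e "STOP SLAVE"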
[14:53:44] hi folks, FYI: I plan to update conftool (including dbctl) starting at around 15:30 UTC today for T365123. let me know if you have any questions / concerns. highlighting here, as the main change in behavior is in dbctl (it now validates changes to "external" sections just like it does for "regular" ones).
[14:53:45] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123
[16:11:23] FYI, holding off on this for now, after some discussion in -sre on how to sequence the update.
[16:13:13] swfrench-wmf: would this affect any of the ongoing dbctl commits/changes? We have several schema changes running, so it is likely you'll see activity from them
[16:16:38] my understanding / expectation is that it should not, but if you could expand a bit on the failure mode you have in mind, that might be helpful
[16:17:35] swfrench-wmf: So what I meant is... will those commits from dbctl affect your maintenance? And on the same note, would the maintenance affect the commits (e.g. the commits not going through)?
[16:24:22] marostegui: got it, thanks for explaining! so, this is "just" a Debian package update, so my expectation is that concurrent invocations of dbctl shouldn't cause any problems (and the other way around as well - I don't foresee a scenario where, e.g., a previously saved section/instance edit becomes non-committable by the new dbctl, setting aside the specific external-section misconfiguration case this release addresses).
[16:26:00] Ah cool! Thanks swfrench-wmf :)
[16:26:31] thanks for asking :)
[16:27:28] I'll keep you all posted on timing. if you'd prefer I wait until the current round of schema change scripts is done, please do let me know and I can work around that if there's an ETA for them.
[17:15:20] swfrench-wmf: I guess you went ahead already. If not, please go ahead!
[17:16:11] marostegui: thank you! I'm still holding, pending a discussion w/ v.olans :)
[17:16:29] I'll mention it here before I take any action, possibly targeting this time tomorrow
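For context on why concurrent schema-change activity came up: the schema-change scripts stage and commit pool/depool changes through dbctl, roughly as in the sketch below. The host name is a placeholder and the exact subcommands and flags are from memory and may differ; https://wikitech.wikimedia.org/wiki/Dbctl has the authoritative syntax.

    # Sketch of the usual dbctl flow the schema-change scripts rely on
    # (db1234 is a placeholder; check the wikitech Dbctl page for the real syntax):
    dbctl instance db1234 depool          # stage a change locally
    dbctl config diff                     # review what would be committed
    dbctl config commit -m "Depool db1234 for schema change"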