[00:18:25] FIRING: SystemdUnitFailed: podman-auto-update.service on moss-be1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:18:40] FIRING: SystemdUnitFailed: podman-auto-update.service on moss-be1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:41:28] Going to switch the s6 codfw master
[07:12:29] I'd like to reboot cumin2002 tomorrow, is there anything DB or backup related which would prevent that?
[07:14:02] I don't think so
[07:14:09] but wait for jaime to confirm the backups side
[07:14:45] sure thing
[07:43:25] RESOLVED: SystemdUnitFailed: podman-auto-update.service on moss-be1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:38] Could I get a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1041163 please? Jesse did the most recent version of the patchset, fixing my bugs :) It's templating out a bunch of spec files for cephadm to use.
[08:58:15] s1 (eqiad): 1.0 TB. The previous backup had a size of 1.2 TB, a change larger than 5.0%.
[08:58:27] -14.3 %
[08:59:43] the same, on dumps: -5.7 %. The previous backup had a size of 184.5 GB, a change larger than 5.0%.
[09:12:04] jynus: see my question from above, are you okay with a cumin2002 reboot tomorrow morning? then I'd send a notification to the sre-at-large list
[09:12:42] let me see what time
[09:13:38] what time, moritzm? lately backups have been taking a while
[09:13:59] e.g. they haven't finished today yet
[09:15:10] I wonder if it could be 11:30 UTC+
[09:17:20] I need to reorganize those to get them to run faster, but for that I first need to finish the bullseye upgrade
[09:19:06] actually for codfw, they should finish by 9:45 UTC
[09:20:04] jynus: yup, those are pagelinks \o/ I wrote some numbers in the task description of https://phabricator.wikimedia.org/T352010
[09:27:51] I'm totally flexible with the time, noon or later would also be fine with me
[09:29:45] yeah, that would be ideal if it can be chosen. Not a big worry if it has to be earlier, but if we could choose, that would create the least amount of work for me
[09:41:32] I'm completing the clouddb reboots (only 3 left: 1018, 1019, 1020)
[09:49:27] hmm s2 is stuck on a big alter table, so "STOP SLAVE" fails, shall I wait for that tx to complete?
[09:49:53] (that's trying to reboot clouddb1018)
[09:51:50] I am slowly getting out of the sinkhole that is mediabackup issues, and eqiad is almost fully healthy now
[09:57:01] I repooled clouddb1018 and will reboot it later when s2 catches up
[10:22:57] dhinus: we got alerts on -operations
[10:25:22] sorry
[10:25:49] I'm following the same procedure as yesterday, but this time the alerts must have worked
[10:25:53] let me figure out why
[10:26:21] jynus: I'll send an announcement for 13h UTC; and I'll check before I start whether all is completed
[10:26:50] thank you moritzm for your time
[10:28:35] marostegui: it's possible I simply waited a few seconds more between "systemctl stop" and the reboot
[10:29:21] dhinus: yes, yesterday the hosts were about to alert, but you downtimed them before that happened
[10:29:29] I'd suggest you downtime them before doing anything
[10:29:34] That is how I do it
[10:29:39] +1
[10:29:44] I was surprised I didn't trigger any alerts yesterday, so it's probably a good thing I triggered one today :)
[10:30:10] It was just coincidence :)
[10:30:15] (that you didn't trigger any)
[10:31:07] yes, it makes sense to downtime before starting the operation
[10:31:40] I'm updating https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host where that was recommended
[10:32:07] (I'm updating other sections, but I will keep that one)
[10:32:14] let me know if you want a reviewer!
[10:32:55] arnaudb: that would be great, I'll ping you when I've done all the changes
[10:33:03] ...should we have a cookbook for rebooting database hosts?
[10:33:45] Emperor: I was also thinking about it, but I think having good docs is a good first step :)
[10:34:06] Emperor: We are working on the foundations of that, arnaudb is on it :)
[10:34:41] Cool
[10:35:29] :D
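For reference, the order of operations being discussed above (downtime the host first, then stop replication and MariaDB, then reboot) roughly corresponds to the sketch below. The cookbook name, flags, host name, downtime duration and socket path are illustrative and from memory, not a copy of the real procedure; the authoritative steps are on https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host.

    # Sketch only: downtime the host in Alertmanager/Icinga *before* touching MariaDB,
    # so nothing alerts mid-reboot (cookbook name/flags, duration and reason are placeholders).
    sudo cookbook sre.hosts.downtime --hours 2 -r "clouddb reboot" 'clouddb1018.eqiad.wmnet'

    # Then stop replication and MariaDB cleanly before rebooting. On multi-instance
    # hosts like the clouddbs each section has its own unit and socket (e.g. s2 here).
    sudo mysql -S /run/mysqld/mysqld.s2.sock -e "STOP SLAVE"   # may block behind a long-running ALTER TABLE
    sudo systemctl stop mariadb@s2
    sudo reboot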
[11:17:10] dhinus: is it expected that clouddb1018:s2 replication is stopped?
[11:17:15] or did you maybe forget to start it?
[11:45:08] nope, I did try "stop slave" but killed it
[11:45:17] so I did not expect it to stop
[11:45:59] dhinus: ok, I just started it
[11:47:29] thanks. can you think of a reason why it stopped? the last time I checked "show processlist" it was still running a big "alter table"
[11:47:55] you tried stop slave, and then you killed that?
[11:49:16] yes
[11:49:22] then that explains it
[11:49:39] even if you killed it, it kept waiting, and once the alter table finished it went through and stopped replication
[11:49:46] ok, my bad, I didn't expect that
[11:49:50] no worries
[11:50:02] I will add a note to the wiki :)
[11:50:19] :)
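The behaviour described above is that a STOP SLAVE issued while a long ALTER TABLE is replicating queues up behind it, and killing the blocked client session does not cancel the request, so replication still stops once the ALTER completes. One way to avoid the surprise is to check for long-running statements first; the sketch below is a generic illustration (socket path and the 600-second threshold are placeholders), not the exact check used on these hosts.

    # Sketch: look for long-running statements (e.g. an ongoing ALTER TABLE) before
    # issuing STOP SLAVE; socket path and threshold are illustrative.
    sudo mysql -S /run/mysqld/mysqld.s2.sock -e "
      SELECT id, user, time, state, LEFT(info, 80) AS query
      FROM information_schema.processlist
      WHERE command <> 'Sleep' AND time > 600"

    # If a big ALTER TABLE shows up, either wait for it or expect STOP SLAVE to block;
    # as noted above, killing the blocked session won't cancel it, and replication will
    # still stop once the ALTER is done.
    sudo mysql -S /run/mysqld/mysqld.s2.sock -e "STOP SLAVE"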
[14:53:44] hi folks, FYI: I plan to update conftool (including dbctl) starting at around 15:30 UTC today for T365123. let me know if you have any questions / concerns. highlighting here, as the main change in behavior is in dbctl (it now validates changes to "external" sections just like it does for "regular" ones).
[14:53:45] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123
[16:11:23] FYI, holding off on this for now, after some discussion in -sre on how to sequence the update.
[16:13:13] swfrench-wmf: would this affect any of the ongoing dbctl commits/changes? We have several schema changes running, so it is likely you'll see activity from them
[16:16:38] my understanding / expectation is that it should not, but if you could expand a bit on the failure mode you have in mind, that might be helpful
[16:17:35] swfrench-wmf: So what I meant is... will those commits from dbctl affect your maintenance? And on the same note, would the maintenance affect the commits (e.g. the commits not going through)?
[16:24:22] marostegui: got it, thanks for explaining! so, this is "just" a Debian package update, so my expectation is that concurrent invocations of dbctl shouldn't cause any problems (and the other way around as well - I don't foresee a scenario where, e.g., a previously saved section/instance edit becomes non-committable by the new dbctl, setting aside the specific external-section misconfiguration case this release addresses).
[16:26:00] Ah cool! Thanks swfrench-wmf :)
[16:26:31] thanks for asking :)
[16:27:28] I'll keep you all posted on timing. if you'd prefer I wait until the current round of schema change scripts is done, please do let me know and I can work around that if there's an ETA for them.
[17:15:20] swfrench-wmf: I guess you went ahead already. If not, please go ahead!
[17:16:11] marostegui: thank you! I'm still holding, pending a discussion w/ v.olans :)
[17:16:29] I'll mention it here before I take any action, possibly targeting this time tomorrow
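For context on why concurrent schema-change activity came up: the schema-change scripts stage and commit pool/depool changes through dbctl, roughly as in the sketch below. The host name is a placeholder and the exact subcommands and flags are from memory and may differ; https://wikitech.wikimedia.org/wiki/Dbctl has the authoritative syntax.

    # Sketch of the usual dbctl flow the schema-change scripts rely on
    # (db1234 is a placeholder; check the wikitech Dbctl page for the real syntax):
    dbctl instance db1234 depool          # stage a change locally
    dbctl config diff                     # review what would be committed
    dbctl config commit -m "Depool db1234 for schema change"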