[07:19:53] <+icinga-wm> PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:19:56] checking
[07:20:16] seems to be x1
[07:20:47] Oct 28 07:15:17 cumin2001 remote-backup-mariadb[27023]: [07:15:17]: ERROR - Transfer failed!
[07:21:00] maybe a case of transfer.py failing
[07:21:04] we have a task for that
[07:21:08] I am going to retry that backup
[07:24:58] if I may add my 2 cents, I think we should exclude the backup units from the generic systemd alert and have them alert themselves in a more granular way, so they can say directly which backup failed and why
[07:26:11] volans: mind commenting on https://phabricator.wikimedia.org/T293975?
[07:28:14] sure, will do
[07:30:15] thank you
[07:59:20] marostegui: {done}
[08:03:51] thanks volans
[08:05:04] anytime!
[08:27:39] so the backup finished fine this time
[08:28:02] I am unsure about what to do now, as reloading the unit will trigger all the backups again, so I am going to do a reset-failed this time
[08:28:08] And hopefully the next run will be fine
[08:28:13] ack
[08:28:37] I don't know though why the backups weren't cleaned up from the ongoing directory
[08:28:44] if the process finished correctly
[08:32:53] root@cumin2001:~# systemctl list-units --failed
[08:32:53] 0 loaded units listed. Pass --all to see loaded but inactive units, too.
[08:33:03] we'll see if the next snapshot run goes fine this time
[08:33:11] not sure if resetting failed units is the expected action or not
[08:33:16] We'll see!
[08:33:56] seemed fair given you had re-run the failed one
[08:34:00] quick question
[08:34:32] if one fails, does it stop, or does it run all the backups with just that one failing?
[08:34:40] that might affect the actions to take on failure :)
[08:35:16] I am not sure :)
[08:35:44] So x1 failed, and now s2 and s5 are in the ongoing directory, which is only supposed to hold ongoing ones
[08:35:53] but I don't see them running
[08:36:39] and the s2 and s5 backups are marked as finished from yesterday
[08:36:42] so no idea why those are there
[08:36:52] maybe puppet tried to start the unit and created another run of those two?
[08:36:59] but they are not being shown in the backups table
[08:37:02] as ongoing
[08:37:04] So I have no idea
[08:37:29] so my guess is that puppet tried to start the unit and created an additional run for those
[08:37:41] even if they are already done (and marked as such in the table and in the latest/ directory)
[08:38:48] I definitely don't see any s2 and s5 as ongoing for today in the table, on cumin or dbprov, so those might be leftovers from that run
[08:38:54] But I don't know
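The thread above touches two mechanics worth a sketch. First, the granular-alerting idea from [07:24:58]: systemd offers a per-unit OnFailure= hook, so each backup unit could fire its own notification instead of only tripping the generic degraded-state check. A hypothetical drop-in sketch; alert@.service is a made-up notification template, not an existing unit:

    # Hypothetical per-unit failure hook; alert@.service would need to
    # be written separately to send the actual notification.
    mkdir -p /etc/systemd/system/database-backups-snapshots.service.d
    cat > /etc/systemd/system/database-backups-snapshots.service.d/alert.conf <<'EOF'
    [Unit]
    OnFailure=alert@%n.service
    EOF
    systemctl daemon-reload

Second, the reset-failed step from [08:28:02]: it only clears the unit's failed marker and starts nothing, which is why it avoids re-triggering all the backups the way a restart would. A minimal sketch of the inspect-and-clear sequence, using the unit name from the alert:

    # See which units are failed and why.
    systemctl list-units --failed
    journalctl -u database-backups-snapshots.service --since today
    # Clear the failed state without re-running the backups.
    systemctl reset-failed database-backups-snapshots.service
    # The host should now report "running" rather than "degraded".
    systemctl is-system-running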
[08:43:06] urbanecm: do you expect to create the other wikis today too? or should I go ahead and take T292418?
[08:43:07] T292418: Prepare and check storage layer for pwnwiki - https://phabricator.wikimedia.org/T292418
[08:43:15] marostegui: yes, I'm creating them now!
[08:43:20] ah sweet
[08:43:21] should i ping you when done?
[08:43:26] yes please!
[08:43:32] will do
[08:43:35] I am off tomorrow but I want to leave the sanitized ones done today
[08:43:37] thank you
[08:43:46] any time
[09:05:09] marostegui: db creation done for all three. Unfortunately, I missed the 'lmowiktionary' => 's5' rule when doing lmowiktionary. It looks like the creation script did create the DB in the right place anyway (and then errored out, because things didn't match), but I'm letting you know
[09:07:34] urbanecm: ah ok, do you want me to double-check anything specifically?
[09:07:49] marostegui: ideally, that the lmowiktionary DB exists only in s5
[09:07:56] ok, let me see
[09:09:20] urbanecm: yep, only s5 (and x1 as expected)
[09:09:57] thanks for checking marostegui, that's good to know.
[09:10:08] no problem!
[09:10:45] I will get those wikis ready for the cloud views
[09:11:03] thanks!
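The double-check at [09:07:49] amounts to asking each section's database server whether the schema exists. A hypothetical sketch; the hostnames are invented for illustration and the real s5/x1/s2 replicas are named differently:

    # Invented hostnames; substitute the real section replicas.
    for host in s5-replica.example x1-replica.example s2-replica.example; do
        echo "== $host =="
        mysql -h "$host" -e "SHOW DATABASES LIKE 'lmowiktionary'"
    done

Matching the result above, only the s5 host (plus x1, which is expected to carry the wiki as well) should return a row.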