[08:43:32] Amir1: ideally there would be a parameter to establish a 2h buffer on the icinga check, but it has not been implemented in the mariadb puppet codebase yet
[08:43:42] so I have to disable the checks manually
[08:44:34] something like delay_seconds => 7200 on the replication check
[08:46:24] on a separate, but somewhat related note, I also saw 1 backup source host getting network overloaded at times, and I will soon work to rebalance that
[08:47:32] I think it was db2101 generating noise on icinga, but it autorecovers after load is back to normal - will move some of the instances there elsewhere
[08:55:41] Amir1: the other thing I can offer, other than filing a ticket with that puppet request, is a session to discuss how backup and recovery work, if that would be helpful to you?
[10:44:50] Thanks, that sounds like a good idea; the problem is that this week I'm drowning in work. I put a reminder to ask you next week
[11:10:00] oh, both things (more puppet work and a session) are definitely not urgent :-)
[13:48:13] kormat: I am seeing that db-check-health works on some hosts (ie: db1163) but it fails on others (ie: db1100)
[13:48:36] This is the command I am seeing failing: db-check-health --port=3306 --icinga --check_read_only=true --process
[13:48:50] was that file changed recently?
[13:57:44] kormat: it does seem related to the new release of wmfmariadbpy-common: https://phabricator.wikimedia.org/P18944
[13:59:31] ah crap. this is the problem with the combination of our apt repo only carrying a single version, and upgrading all packages :/
[14:09:47] marostegui: i've re-uploaded 0.7.2 to the apt repo (with much cursing)
[14:11:55] kormat: yep, db1124 worked again
[14:12:06] kormat: we should probably check how many hosts got 0.8 and downgrade them
[14:12:15] as otherwise the check will fail
[14:12:16] currently glaring at https://debmonitor.wikimedia.org/packages/python3-wmfmariadbpy
[14:12:53] db1100 confirmed fixed too by downgrading
[14:13:40] https://phabricator.wikimedia.org/T299406 was the task for deploying 0.8, which got aborted due to that bug
[14:13:56] but i didn't count on all packages getting upgraded on hosts before i could get a fix out
[14:14:23] gotta love automation eh!!
[14:14:44] Automation lets you f*** up all the computers at once effortlessly :)
[14:15:02] Emperor: btw, saw your mail, planning to reply tomorrow if that's ok
[14:15:42] marostegui: fine, yes, thank you :)
[14:15:46] :)
[14:16:45] `cumin failed to execute: hosts must be a non-empty ClusterShell NodeSet or list, got '': `
[14:16:51] * kormat stares at volans
[14:17:08] kormat: no hosts selected?
[14:17:24] ahhh. sorry :)
[14:20:31] let me guess, debdeploy won't downgrade :/
[14:27:44] marostegui: fixed everywhere except es1022, which is unreachable
[14:28:02] yeah, it is down for maintenance
[14:28:06] I will fix it there once it is back
[14:28:07] we may want to rethink the "upgrade all packages" approach..
[14:28:07] thanks!
[14:28:20] ok. going offline again.
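
For context on the downgrade discussed above, a minimal sketch of how the affected hosts could be inspected and rolled back with cumin and apt. The `A:db-all` alias and the plain `0.7.2` version string are assumptions for illustration, not taken from the log.

```bash
# Sketch only: 'A:db-all' is a hypothetical cumin alias and '0.7.2'
# may not be the exact Debian version string in the repo.

# Show which version of the package each database host currently has.
sudo cumin 'A:db-all' 'dpkg -l python3-wmfmariadbpy wmfmariadbpy-common'

# Roll the packages back to the re-uploaded 0.7.2 build.
sudo cumin 'A:db-all' 'apt-get install -y --allow-downgrades python3-wmfmariadbpy=0.7.2 wmfmariadbpy-common=0.7.2'
```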
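
The `cumin failed to execute` error near the end is what the tool reports when the host selection resolves to an empty string; a small illustration with placeholder host names (the exact error text can vary with how cumin is invoked):

```bash
# An empty hosts query is rejected before anything runs.
sudo cumin '' 'uptime'
# -> hosts must be a non-empty ClusterShell NodeSet or list, got ''

# A non-empty query works, e.g. an explicit host or a ClusterShell range
# (host names here are placeholders, not the actual targets from the log).
sudo cumin 'db1100.eqiad.wmnet' 'uptime'
sudo cumin 'db[1100,1163].eqiad.wmnet' 'uptime'
```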
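
On the closing point about rethinking the "upgrade all packages" approach, one stopgap (a sketch, not something agreed in the channel) would be to put the packages on hold so blanket upgrade runs skip them until the fixed 0.8 release is ready:

```bash
# Skip these packages during plain 'apt-get upgrade' runs; this only
# covers apt itself, other deployment tooling may behave differently.
sudo apt-mark hold python3-wmfmariadbpy wmfmariadbpy-common

# Later, once the fixed release is deployed:
sudo apt-mark unhold python3-wmfmariadbpy wmfmariadbpy-common
```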