[07:50:17] I've hit https://phabricator.wikimedia.org/T394883 on bacula
[07:51:25] I've disabled puppet on backup1001, but giving a heads up because, despite not backups working normally, while puppet is disabled there, people won't be able to create new or remove old backup jobs
[07:51:32] *now backups
[07:52:05] ping me if you need help with that until the cloud team attends the issue
[08:01:00] ok, taavi very quickly responded to the issue, we are now back to healthy on backups (although expect some delays until the queue clears)
[08:36:42] hi oncallers, I'm moving the thanos stats_reporter_host, which may cause a few alerts to go.
[08:39:52] thanks for the heads up
[08:48:02] {{done}}
[08:54:07] very nice. Whoever review those patches did a great job :-P
[08:54:16] *reviewed
[08:55:42] :)
[08:57:18] jynus, jayme, I'm going to upgrade cr2-eqdfw, no impact expected, but probably some alerting noise
[08:57:30] ack, thanks
[08:59:15] not important, but I saw some alerts on several routers. I am guessing those are known/ongoing work, right?
[09:00:29] Connect - kubernetes-codfw and Active - kubernetes-codfw
[09:00:57] yup, unrelated/not important
[09:01:01] thanks
[09:01:26] sorry, I meant Active - kubernetes-ml-eqiad on the last. Thanks for the confirmation.
[09:23:19] cr2-eqdfw is rebooting, should be back up in ~10min or less
[09:23:53] splunk is happy so far
[09:36:05] back up
[09:41:21] all done!
[09:45:57] nice
[12:10:28] brouberol: I know you are busy, but I wonder if I should resolve the incidents of kafka-jumbo so they don't alert again?
[12:10:37] is that ok?
[13:28:06] marostegui, papaul, are you still interested in doing this https://phabricator.wikimedia.org/T378715#10524038 ?
[14:05:54] jynus: yes please! We have successfully reimaged 2 of them, and the last gave us additional complication
[14:06:02] but the 3 of them are back in in-setup mode
[14:25:41] XioNoX: I am
[18:15:14] mutante I'm seeing your puppet changes from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148902 , OK if I merge?
[18:15:27] inflatador: yes please
[18:15:45] mutante ACK, merging now
[18:15:56] thanks! those nodes are not in prod yet
[18:18:11] damn, I forgot to depool the hosts before removing them from conftool. Does anyone know if that will set off alerts? I'll keep an eye out
[18:23:47] I just depooled, hopefully no issues
[18:28:23] not sure. the one thing I know though is if the order is off you can end up with some of those .err files and then to make monitoring recover the fix is to delete them