[07:41:46] hnowlan: o/ I see a lot of alerts for tilerator on maps hosts, expected or something ongoing?
[07:42:28] arturo: o/ I see a lot of crits for cloud* nodes, mostly silenced (but popping up in the icinga page), can we ack them?
[07:43:13] (just to reduce the noise on the page, it looks like a Christmas tree at the moment)
[07:44:48] <_joe_> elukey: the tilerator stuff I think is expected because we're dismissing cassandra and tilerator on maps
[07:44:57] <_joe_> yeah I've given up on that
[07:45:08] <_joe_> it's clear people are uninterested
[07:47:19] elukey: cloud1* or cloud2*-dev nodes?
[07:47:56] taavi: cloudcontrol*-dev mostly
[07:48:14] _joe_ ack perfect, I'll follow up with Hugh to ack those alerts then
[07:48:53] those are known, feel free to ack them (I don't have access for that myself)
[07:49:04] fallout from our bullseye upgrades
[07:49:37] taavi: super, is there a task that I can use?
[07:50:29] https://phabricator.wikimedia.org/T300254
[07:51:06] <_joe_> taavi: I don't think it's elukey's duty to ack them, frankly
[07:51:19] <_joe_> but rather of the people operating and maintaining them
[07:51:48] <_joe_> a tad more responsibility in managing icinga by everyone is needed.
[07:58:51] taavi: acked, thanks :)
[08:00:58] once in a while we should clean up the unhandled alerts, just to reduce the noise
[08:01:07] it is very annoying but needed :)
[08:02:31] thanks elukey taavi
[08:02:50] yes, we're in the middle of bullseye upgrades
[08:39:10] I fixed a few unacked wmcs alerts where I had enough access to do so
[08:48:40] I am switching the m2 master in 10 minutes
[08:48:55] Affected services at: https://phabricator.wikimedia.org/T300329
[08:55:49] * akosiaris around
[10:31:30] elukey: weird, those were previously acked - thanks for the heads up
[10:33:47] maybe they flapped? to warning or unknown, for example
[10:36:30] hnowlan: <3
[10:52:44] BTW, if you see me today doing weird stuff, it's because I have swapped clinic duty just for today (but cannot update the topic)
[10:54:11] I read there are channel groups, I wonder if with those we could distribute topic rights more widely than the flag limitations allow
[10:58:24] jynus: want me to update the topic? if it's only for 1 day not sure it's worth it though :)
[10:58:44] yeah, only if you promise to change it back tomorrow
[10:58:50] if not, it is ok
[10:59:07] but I am more "worried" about updates during an outage or something else
[10:59:57] and if it is not possible just on irc, maybe we can call our software developer to create a form
[11:01:31] https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/irc/ircservserv-config/+/refs/heads/master/channels/wikimedia-operations.toml is the canonical list these days, I think you should be able to just send a patch to that repo
[11:02:37] I was told there was a limit on the number of people that could be added to operatos since the libera transition
[11:02:44] *operators
[11:03:33] I'm not aware of any limits on operators (but there is a limit of 4 "founders")
[11:04:42] jynus: topic changed, I can change it back tonight or tomorrow morning
[11:05:02] let me see if I can fix it myself for the future
[11:18:57] https://gerrit.wikimedia.org/r/c/wikimedia/irc/ircservserv-config/+/759312
[15:01:16] the lack of alphabetical order makes me sad
[15:01:28] (why yes, I do need to get out more. Why do you ask?)
[15:11:12] {{sofixit}} ;P
[15:13:20] I figured I should let j.nus' CR through first, otherwise I'll only have to fix a merge conflict :)
[15:15:24] Emperor: you can just do your patch on top of ja.me's one ;)
[15:18:24] <_joe_> volans: when picking a python library for spicerack, should I look at what's available on bullseye or on buster?
[15:21:26] _joe_: we're currently on both buster and bullseye (the cumin1001 upgrade is pending some unrelated work)
[15:21:52] but if you need something in bullseye, backporting it to buster with python is usually super trivial
[15:22:34] (T276589#7420124 for reference)
[15:22:34] T276589: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589
[15:24:10] <_joe_> volans: heh I'm considering how to tackle writing a spicerack kubernetes module
[15:24:19] <_joe_> one option is I just shell out to kubectl
[15:25:15] <_joe_> another is I choose any of these python libraries that allow talking to kubernetes
[15:25:37] <_joe_> debian has one, which I fear is horribly outdated in buster
[15:25:49] I can imagine
[15:26:03] _joe_: btw, speaking from experience, backporting a newer version of python-kubernetes than 12.0.1 (aka support for kubernetes 1.16) to bullseye is a pain since you need tons of newer dependencies too
[15:26:18] feel free to open a task for detailed discussion following https://wikitech.wikimedia.org/wiki/Spicerack#Adding_new_module_or_change_in_core_behaviour
[15:26:22] if you decide to do that for whatever reason, https://salsa.debian.org/taavi/python-kubernetes/
[15:26:55] <_joe_> taavi: yeah no, I was thinking of packaging pykube_ng in that case
[15:28:18] <_joe_> volans: I'm a bit confused, I have tons of tasks that need such a library in spicerack; do I need to open another one about what? the implementation strategy? Isn't a CR the right place to discuss such things?
[15:29:06] <_joe_> I'm not asking your team to implement it
[15:29:11] I know
[15:30:04] experience has taught us that when starting directly from a CR the friction to contribute to spicerack is higher, because it's easier to agree on the api and its integration into spicerack beforehand than to go back and re-implement something after the review
[15:30:30] <_joe_> so you want to discuss the api in a task?
[15:30:43] <_joe_> you're worried I might not properly overengineer it?
[15:33:26] how to structure the spicerack side of the api, how it's exposed to the cookbooks and such
[15:33:43] I'm not worried, I'm saying that there is a process :)
[15:35:27] I can add you to the next office hours with john and me too if you prefer to chat live about it
[17:30:15] I had a homer timeout on a decommission operation (forgot about it in a terminal and didn't type "yes" to the prompt for a while), got a "ncclient.operations.errors.TimeoutExpiredError: ncclient timed out while waiting for an rpc reply."
[17:30:52] hnowlan: ok, which host was it?
[17:31:01] volans: restbase2011
[17:31:15] there's a lock left by the operation afaict
[17:31:41] we can just re-run homer for the switch to which restbase2011 is attached
[17:32:11] and that's asw-c-codfw
[17:32:27] I can run it for you if you want
[17:33:12] volans: that'd be great, thank you!
[17:33:43] volans: let me know if you need help
[17:33:49] hnowlan: FYI I'm just running from a cumin host: homer 'asw-c-codfw*' commit "Decommission restbase2011"
[17:34:07] let's see if it works :D
[17:34:11] volans: nice, thanks!
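[Editor's note] Referring back to the spicerack Kubernetes discussion above (15:24-15:33): below is a minimal sketch of the "just shell out to kubectl" option, assuming a thin CLI wrapper is acceptable. The class name, kubeconfig path and method set are illustrative guesses, not the real spicerack API nor the module that was eventually written.

import json
import subprocess


class Kubectl:
    """Hypothetical thin wrapper around the kubectl CLI for one cluster."""

    def __init__(self, kubeconfig: str):
        # Assumed: one admin kubeconfig per cluster; the real path layout
        # on the hosts is not taken from the log.
        self.kubeconfig = kubeconfig

    def _run(self, *args: str) -> str:
        """Run kubectl and return its stdout, raising on non-zero exit."""
        result = subprocess.run(
            ["kubectl", f"--kubeconfig={self.kubeconfig}", *args],
            check=True,
            capture_output=True,
            text=True,
        )
        return result.stdout

    def get(self, kind: str, name: str, namespace: str = "default") -> dict:
        """Fetch a single object as a dict via `kubectl get ... -o json`."""
        return json.loads(self._run("get", kind, name, "-n", namespace, "-o", "json"))

    def cordon(self, node: str) -> None:
        """Mark a node unschedulable, e.g. ahead of a reimage."""
        self._run("cordon", node)

The trade-off versus python-kubernetes or pykube_ng is that nothing new needs Debian packaging, at the cost of parsing CLI output and depending on kubectl being installed wherever the cookbooks run.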
[17:34:16] very impressed by how the script copes with repeated runs after a failure though, lots of nice handling of edge cases <3
[17:34:50] XioNoX: actually in this case it fails saying that the terminal is locked
[17:35:03] configuration database locked by
[17:36:22] asw-c-codfw> request system logout pid 57293
[17:36:38] volans: you're good to go
[17:36:45] ack, re-running
[17:38:31] hnowlan: you should be good to go, all done, and the homer step is the last one
[17:40:41] volans: great, thanks! no need to re-run the cookbook then?
[17:41:16] at this point no need I'd say, in general yes, just re-running would do the trick
[17:41:26] it should be fully idempotent (and if not, feel free to open a task!)
[17:41:40] if you want you can also re-run it :D
[17:42:00] but it should be a total noop at this point
[17:42:13] great, thanks!
[17:45:44] anytime :)
[18:33:39] could someone please merge this deployment-prep-only hieradata change? https://gerrit.wikimedia.org/r/c/operations/puppet/+/759559/
[18:35:14] taavi: done
[18:35:29] thanks!
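[Editor's note] For the Homer recovery above (17:30-17:38), here is a hypothetical helper that re-runs Homer against the switch a decommissioned host was attached to, retrying while the Junos configuration database is still locked. In the log the stale lock was actually cleared by hand on the switch with `request system logout pid <pid>`; the retry loop and the error-string matching are assumptions for illustration only.

import subprocess
import sys
import time


def rerun_homer(switch: str, message: str, attempts: int = 3, wait: int = 60) -> int:
    """Re-run `homer '<switch>*' commit "<message>"` (as done by hand in the
    log) and retry while a previous session still holds the config lock."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(
            ["homer", f"{switch}*", "commit", message],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return 0
        # Assumption: Homer surfaces the Junos "configuration database locked"
        # message in its output when a stale session holds the lock.
        if "configuration database locked" in (result.stdout + result.stderr):
            print(f"[{attempt}/{attempts}] config database still locked, retrying in {wait}s")
            time.sleep(wait)
            continue
        print(result.stderr, file=sys.stderr)
        return result.returncode
    return 1


if __name__ == "__main__":
    # The switch restbase2011 was attached to, per the log.
    sys.exit(rerun_homer("asw-c-codfw", "Decommission restbase2011"))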