[07:26:05] hello folks, as FYI I am trying to move kafka logging in deployment-prep to PKI
[07:26:13] and I am of course failing
[07:49:35] reverted, need to talk with John about it
[08:26:48] jbond: o/ on Stretch nodes I see "-bash: /usr/lib/systemd/user-environment-generators/30-systemd-environment-d-generator: No such file or directory" when I log in
[09:11:41] elukey: working on a fix ATM: https://gerrit.wikimedia.org/r/c/operations/puppet/+/771821
[09:12:53] ack thanks! not a big deal, was just reporting it since it seemed weird
[09:17:45] this is part of the bigger work to allow exporting environment variables via /etc/environment.d, to eventually set fleet-wide defaults for accessing the web proxies
[09:18:29] but unfortunately systemd on stretch hosts is too old, so those systems will miss out on the feature until they get reimaged to buster/bullseye
[09:25:22] jayme: elukey: I just manually scheduled a simple pod on all of the new kubernetes nodes in eqiad
[09:25:42] nice!
[09:25:54] deploy1002:/home/akosiaris/kube1018_22.yaml for the how
[09:26:07] it's a simple ping -c 10 10.2.1.1
[09:26:11] and it worked all over
[09:26:16] so networking wise, it's fine
[09:26:46] I'll pool them in pybal and I think we can start sending workloads to them
[09:27:39] akosiaris: nice!
[09:27:45] one question is why on earth I am pinging codfw appservers from eqiad, but anyway
[09:28:10] one thing I am not clear on from yesterday: what happened that some deployment got stuck?
[09:28:28] dockerd needed to be restarted or something? something about the registry password?
[09:29:20] that was just the weird docker registry hack we have for authentication
[09:29:48] for access to /restricted to work, you need to run puppet once on the registries
[09:29:54] (puppetdb query)
[09:30:04] omg...
[09:30:11] yes
[09:30:38] ah yeah, I remember reviewing that part of the code and thinking that this is eventually consistent and might bite us at some point
[09:30:43] I did not expect it that way
[09:31:06] although tbh, those nodes shouldn't be marked as ready in the api before that simple test I did had passed
[09:31:30] it's not an eventual consistency issue but more like the process of adding a new node would benefit from a small gate
[09:31:46] even better if the gate is automated
[09:56:19] XioNoX: o/ never seen the singtel transport, but afaics the BFD session between ulsfo and eqsin is down and I can't find scheduled maintenance (not sure if it goes by another name etc..). Laser output power looks good, ok to follow up with them? (I can send an email, I see that Chris contacted them earlier in March for the same issue)
[09:56:54] nevermind, it just recovered
[09:57:32] elukey: :) thanks though!
[09:58:35] moritzm: thanks for the fix, reviewing now
[10:10:45] akosiaris: I think it's by design that nodes joining the cluster are not cordoned, for example (for things like node autoscaling). Controllers should in theory be responsible for marking nodes as not ready. IIRC calico does so on first start and removes the taint once it is started (never adding it back when it fails, unfortunately)
[10:13:49] jayme: yup. One thing we could do is run an operator that watches for new nodes, marks them as not ready/cordoned, keeps them that way for 30m and then uncordons them.
[10:14:00] also if we could run the test in the same operator... super
[10:15:09] we could also just add a taint in hiera/puppet... kubelet will apply that only when initially creating the node. It won't add it back later
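
(Sketch of the register-time taint gate discussed at 10:15, for context. Kubelet only supports register-time taints via --register-with-taints / registerWithTaints; the taint key "node.example/acceptance-pending" and the NODE placeholder are illustrative, not an existing WMF convention, and the hiera/puppet wiring is assumed rather than shown.)

    # Applied by kubelet only when the node first registers; never re-added later.
    # CLI flag form:
    #   --register-with-taints=node.example/acceptance-pending=true:NoSchedule
    # or in KubeletConfiguration:
    #   registerWithTaints:
    #   - key: node.example/acceptance-pending
    #     value: "true"
    #     effect: NoSchedule

    # Once the acceptance test (manual or automated) passes, an SRE clears the gate:
    kubectl taint nodes NODE node.example/acceptance-pending:NoSchedule-
    kubectl uncordon NODE   # only if the node was also cordoned
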
[10:16:46] and remove it manually, you mean?
[10:17:18] after whatever we define as the acceptance test is run manually by an SRE?
[10:17:54] due to human speed, that would also solve the eventual consistency issue, but not by design, rather by accident
[10:36:07] yes, removing it manually is what I meant. It's a cheap solution, absolutely
[10:37:21] but currently I only see the /restricted thing as something that prevents nodes from being usable... networking/bgp should be fine with the check calico does
[10:52:07] XioNoX: the singtel link seems to be flapping :(
[12:08:00] jbond: I think we need to enable a mod to be able to use macros 😅 apache went kaboom when I deployed it
[12:09:51] Amir1: ack, one second, macros are relatively new to me, I saw someone use them recently, let me find their patch
[12:10:37] Was it our apache magician (el.uky)?
[12:11:55] it was jha.thaway, but I think if we get to take V we may have to invoke elu.key ;)
[12:16:32] Amir1: is it working now? I noticed you re-did it without the macro
[12:16:49] yeah
[12:16:54] ahh cool
[12:18:05] Amir1: if you want to go with the macro then 771866 should work I think, but you can also just abandon it, will leave it to you
[13:53:33] <_joe_> akosiaris, jayme can't a cookbook help with all of the above?
[13:54:43] jayme: achievement unlocked, _joe_ just volunteered to write it ;-)
[13:54:55] <_joe_> nope :)
[13:55:00] <_joe_> ENOTIME
[14:34:12] elukey: I emailed our account rep to see if there is a better NOC contact to use
[14:34:18] (re Singtel)
[14:34:30] XioNoX: I was about to ask, thanks!
[14:56:20] _joe_: yeah... but ultimately it's docker-registry's fault. So it might be wise to try to remove the hack there
[15:18:36] akosiaris: elukey: are you working on kubernetes100[1-4] - especially 1002?
[15:18:59] jayme: nope, I just saw the error msg as well, weird
[15:19:23] the node is cordoned and marked not ready... (and has no pods)
[15:19:32] so no real issue... but I was wondering
[15:19:47] it is maybe getting sad that we are decommissioning it
[15:19:57] so much service and then Alex just decommed it
[15:20:46] yeah... I can't reach it via ssh either
[15:21:36] * jayme is inclined to run the decom cookbook rather than looking further
[15:43:29] I just powercycled it instead. It will not get any workload, and the patches for the decom of all nodes are already prepared - so I guess it's fine to do them all together next week
[16:03:01] mmm did we lose wikibugs on our chans?
[16:03:21] 17:28:57 *** +wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) has quit (Excess Flood)
[16:03:34] yeah I was about to paste that
[16:07:30] trying to figure out how to restart it atm
[16:07:45] I am reading https://www.mediawiki.org/wiki/Wikibugs#Restarting_wikibugs but can't access the host
[16:08:01] taavi: --^
[16:08:58] you need Toolforge membership for that https://toolsadmin.wikimedia.org/tools/membership/apply
[16:09:24] yep, I suspected something similar :)
[16:09:30] it seems to fail to connect, seeing this without any reason why: 2022-03-18 16:08:32,351 - irc3.wikibugs - CRITICAL - connection lost (22813061700128): None
[16:09:51] even after a restart?
[16:09:57] yes
[16:10:39] very clear msg
[16:11:46] "None" is the key hint
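
(On the 12:08-12:18 exchange: mod_macro ships with Apache 2.4 but must be enabled before any <Macro>/Use directives will parse, which is the likely cause of the "kaboom". A minimal sketch assuming Debian-style tooling; the macro name, hostnames and paths below are illustrative and are not the content of change 771866.)

    # Enable the module, validate the config, then reload:
    a2enmod macro
    apachectl configtest && systemctl reload apache2

    # Illustrative macro definition and use in a site config:
    <Macro RedirectToCanonical $host $target>
      <VirtualHost *:80>
        ServerName $host
        Redirect permanent / https://$target/
      </VirtualHost>
    </Macro>

    Use RedirectToCanonical old.example.org www.example.org
    UndefMacro RedirectToCanonical
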