[12:25:18] hi folks!
[12:25:36] for on-callers - I am planning to move Restbase cassandra codfw instances to PKI
[12:38:24] proceeding :)
[12:38:52] Ack. Thanks elukey.
[13:01:20] started the cookbook to roll-restart, it will take a while (cc: urandom)
[13:02:12] 👍
[13:19:31] urandom: so far all good, a few instances restarted and nodetool doesn't show anything weird
[13:24:51] Who owns Pybal? Guessing Traffic?
[13:24:56] yes
[13:26:05] cool, tagging y'all in the pybal feature request task
[13:26:17] :)
[13:27:59] we are mostly keeping pybal maintenance to fixing urgent issues and not doing much feature work, since Liberica will supersede it. but definitely go ahead and file that task, perhaps it can be useful for Liberica as well
[13:28:36] fabfur was more direct :D
[13:29:18] oh sorry, the correct answer should've been "yes, sukhe is the official owner"
[13:29:20] :D
[13:30:11] sukhe is always elegant and kind when answering on IRC, we can probably assign him the ownership without causing any opposition
[13:30:47] lol
[13:30:58] usually the way it goes, fabfur owns everything in Traffic now
[13:31:49] LOL
[13:31:56] lovely
[13:32:01] * vgutierrez goes away
[13:32:01] I think it's an easy one? Feel free to smack me down. T363697
[13:32:02] T363697: Pybal: Depool nodes outside broadcast domain - https://phabricator.wikimedia.org/T363697
[13:32:46] vgutierrez: doing s/pybal/liberica/ on any tasks now
[13:32:48] so don't go away
[13:33:07] err, pybal is a RO etcd client, so it can't depool things
[13:33:20] liberica will be another read-only etcd client, so it won't depool things either
[13:33:49] administratively pooling servers is an L8 choice :)
[13:33:57] also.. liberica doesn't have that limitation
[13:34:07] (the L2 connectivity requirement)
[13:34:21] yeah, it's a weirdness for sure
[13:34:41] L2 load balancing is a cool hack though
[13:34:50] PyBal could conceivably not attempt to *use* a pooled IP that it was unable to reach though
[13:35:15] topranks: that won't happen :)
[13:35:18] What part of the stack handles health checks, if not Pybal?
[13:35:28] yeah, "don't send traffic to unreachable hosts" is usually a good start :)
[13:35:29] although I'm not sure if we want to do the heavy lifting rather than focus on Liberica
[13:35:49] but probably wontfix as it's "fixed" by liberica?
[13:35:56] I'm making a separate ticket for monitoring/alerting on this situation
[13:35:59] is there a way we can alert on this discrepancy if it happens?
[13:36:02] health checks are handled by pybal, but due to a design limitation pybal healthchecks don't follow the same network path as production traffic
[13:36:03] inflatador: ok thanks
[13:36:14] that's also addressed/fixed in liberica
[13:36:19] ok
[13:37:16] vgutierrez: I guess it's stretching the traditional definition of "health check", but could Pybal detect L2 adjacency and refuse to pool those servers?
[13:52:13] OK, T363702 is up for the alerting part... guessing that will be a combo of icinga/python script? Happy to help w/ that if anyone can point me in the right direction
[13:52:14] T363702: LVS hosts: Monitor/alert on when pooled nodes are outside broadcast domain - https://phabricator.wikimedia.org/T363702
[13:53:10] inflatador: we are not putting more effort into pybal besides whatever is needed to help with the migration to liberica
[13:54:48] vgutierrez: understood, will focus efforts on monitoring then
[13:56:23] inflatador: yeah, your idea of the simple Icinga check is not a bad one; we use similar things in other places
[14:45:40] urandom: codfw is almost done, do we feel adventurous and do eqiad as well?
[15:02:07] elukey: I do.
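The T363702 check discussed above could be sketched roughly as follows. This is a minimal, hypothetical version: a real Icinga plugin would discover the local subnets from the LVS host's interfaces (e.g. `ip -j addr`) and fetch the pooled node IPs from conftool/etcd, while here both are passed in as plain arguments.

```python
#!/usr/bin/env python3
"""Sketch for T363702: flag pooled nodes whose IP is outside every
directly connected (L2/broadcast-domain) subnet of the load balancer.

Hypothetical inputs: pooled_ips and local_subnets would come from
conftool/etcd and the host's interfaces in a real check.
"""
import ipaddress


def unreachable_at_l2(pooled_ips, local_subnets):
    """Return the pooled IPs not covered by any local subnet."""
    nets = [ipaddress.ip_network(n) for n in local_subnets]
    return [ip for ip in pooled_ips
            if not any(ipaddress.ip_address(ip) in net for net in nets)]


def nagios_status(pooled_ips, local_subnets):
    """Map the result onto Nagios exit-code conventions (0=OK, 2=CRITICAL)."""
    bad = unreachable_at_l2(pooled_ips, local_subnets)
    if bad:
        return 2, "CRITICAL: pooled outside broadcast domain: " + ", ".join(bad)
    return 0, "OK: all pooled nodes are L2-adjacent"
```

With made-up addresses, `nagios_status(["10.2.1.5", "10.64.0.7"], ["10.2.1.0/24"])` would return the CRITICAL tuple, since 10.64.0.7 falls outside the only local subnet.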
[15:02:33] all right :)
[15:02:40] do you want to check that codfw is good?
[15:02:41] and I can preside over the restarts if you like
[15:03:11] urandom: I can start a root tmux session on cumin1002 so we can share and you can take over later on, would that be ok?
[15:03:27] sure, that would work
[15:03:56] super, when you are ready I'll start
[15:07:26] elukey: looks good to me
[15:07:39] urandom: all right, proceeding with puppet changes in private
[15:07:47] the only errors were related to prometheus200[5-6].codfw.wmnet.
[15:08:26] SSLHandshakeExceptions that tailed off a bit ago
[15:08:54] is that the cql port check or something?
[15:09:25] they gradually tailed off as the restart progressed, so I'm not concerned... mostly curious
[15:11:13] yes yes exactly, basically those are the new prometheus blackbox alerts, they are added when puppet runs (since we need to upgrade the keystore etc..) but they fail until the instances are restarted
[15:11:27] since they are tailored for the specific new cert etc..
[15:11:32] gotcha; makes sense
[15:12:32] going to run puppet and restart instances only on restbase1028, if it goes well I'll do the rest
[15:26:54] urandom: all good with 1028, creating the tmux on cumin1002 and starting with the rest
[15:27:16] sgtm
[15:35:07] started!
[15:37:04] urandom: there is a tmux session on cumin1002 called T352647 if you want to follow
[15:37:04] T352647: Move Cassandra clusters to PKI - https://phabricator.wikimedia.org/T352647
[15:39:09] elukey: attached (read-only)
[15:40:00] this is the first time I've done this using tmux, it's neat how it pads the right side to deal with disparate sizes
[15:43:02] tmux is great. I find the bindings esoteric but on the whole I enjoy it a lot
[15:46:33] urandom: https://wikitech.wikimedia.org/wiki/Collaborative_tmux_sessions ;)
[15:49:04] tmux is great for pairing sessions. Or connecting your virtual terminal to this guy: https://www.ebay.com/itm/275618371863
[15:49:52] Tmux is great, I like it.
[15:51:48] volans: you and all of your hidden documentation!
[15:51:52] :)
[15:52:05] I didn't do it :D
[15:52:37] I know, but you are the one that knows where all of this is
[15:52:44] :D
[15:52:46] I have a pile of tmux config to make tmux respond to my ancient screen muscle memory -- https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/admin/files/home/bd808/.tmux.conf https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/admin/files/home/bd808/.tmux/
[15:53:06] you're a better index of what's on wikitech than the search function provides
[15:54:14] brett: I don't think Tmux keybindings are esoteric, I think they're as natural and intuitive as Emacs keybindings.
[15:54:22] * denisse runs
[15:55:59] emacs is fine if you have 20 fingers on your hands, or pedals
[15:56:14] use your toes?
[15:56:38] emacs is the editor used by organists
[15:59:37] https://paste.debian.net/plainh/4d49308f is my somewhat minimal dotfile to make the OOB experience less awful. In particular, I loathe time-based delays such as the pane switching where you have to wait x amount of time to regain control of your keyboard. Gr
[15:59:59] ha, I still have xterm-termite hacks in there. Don't mind that...
[16:00:47] Joking aside, I think that playing the piano makes using Emacs and touch typing with all my fingers pretty straightforward...
[16:33:06] urandom: o/ I have detached from the tmux session, can you still see it etc..?
[16:33:32] we are almost halfway through, so far all good. Is it ok if I leave the rest to you?
[16:33:45] denisse: I'm better at emacs than the piano, alas :)
[16:33:47] I can, yes, and I'll keep an eye on it
[16:34:50] super, thanks!
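The canary pattern used above (restart restbase1028 first, then continue only if nodetool shows nothing weird) can be sketched as a small gate on `nodetool status` output. This is illustrative only: the sample output format is approximate, and a real cookbook would shell out to nodetool on each host rather than parse a canned string.

```python
"""Sketch of a canary gate for a Cassandra roll-restart: proceed past
the canary host only if every node in `nodetool status` output is in
state UN (Up/Normal). The parsing heuristic assumes node lines start
with a two-letter state code (U/D followed by N/L/J/M)."""


def all_up_normal(nodetool_status_output):
    """True if every node line reports state 'UN'."""
    states = []
    for line in nodetool_status_output.splitlines():
        fields = line.split()
        # node lines begin with a state code like UN, DN, UJ, UL...
        if fields and len(fields[0]) == 2 and fields[0][0] in "UD":
            states.append(fields[0])
    # no node lines at all also counts as "not safe to proceed"
    return bool(states) and all(s == "UN" for s in states)
```

A rollout loop would call this after restarting the canary and abort (leaving the rest of the fleet untouched) on a False result.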
[18:10:21] denisse: fabfur: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1025436/1/hieradata/magru.yaml
[18:10:40] denisse: please note the hostname for the prometheus node in magru. I think this should be fine
[18:11:03] Yes, it looks good to me. It follows our naming convention.
[18:11:22] +1
[18:11:23] thanks
[18:11:38] fabfur: sorry, I know you said hieradata/.yaml is missing, but I misread that as being for puppet7
[18:12:00] any other files you observed when you were looking?
[18:12:03] time to add them now :)
[18:12:48] magru is single backend, should we add the configuration for nvme too?
[18:13:05] yeah, we will in the cp commit
[18:13:11] and the other cp stuff
[18:13:19] since the hosts have now been provisioned and we have their IPs
[18:13:49] so it looks ok to me. Do you want to try to reimage cp7001? I currently have 7003 still running the d-i
[18:13:55] please do
[18:14:22] sorry, I didn't understand: will I reimage, or you?
[18:14:37] you can, but I think you are doing 7003 as well?
[18:14:43] moving back to -traffic (sorry)
[18:14:46] ack