[08:50:33] mutante: sorry about parse1002, that's my fault, I forgot to run the netbox cookbook after it was handed back to me repaired
[09:59:36] marostegui, jayme, hnowlan, elukey: could you briefly switch to a bastion other than bast3005? I'd like to reboot
[09:59:51] moritzm: sure!
[10:00:07] moritzm: just logged out from it
[10:00:13] moritzm: clear
[10:01:59] moritzm: done
[10:02:40] !incidents
[10:02:41] 3187 (RESOLVED) [FIRING:1] ProbeDown (2620:0:862:ed1a::1 ip6 text-https:443 probes/service http_text-https_ip6 ops page esams prometheus sre)
[10:02:41] 3186 (RESOLVED) [FIRING:1] ProbeDown (10.2.2.29 ip4 sessionstore:8081 probes/service http_sessionstore_ip4 ops page eqiad prometheus sre)
[10:05:43] moritzm: clear :)
[10:06:16] thanks all, rebooting now
[10:11:23] marostegui, jayme, hnowlan, elukey: bast3005 is back, you can use it again
[10:19:13] thanks moritzm
[16:53:16] claime: no problem, I just wanted to make you aware, it was also the first time for me to run that specific sync cookbook
[16:53:33] mutante: I added the last step to the dcops doc
[16:55:08] claime: thanks, great! also thanks for the review on adding a host to the docker_registry builder hosts
[17:44:17] Is there some magic required to fix a PyBal backend's health check? A backend flapped, I depooled it and it's since become healthy regardless, but the alert is persisting
[17:45:18] jbond: can I get a +1 from you on https://gerrit.wikimedia.org/r/c/operations/puppet/+/868002 ?
[17:48:54] hnowlan: under some circumstances they need pybal restarts to get fixed
[17:49:02] but also they can just take some time
[17:49:19] mutante: cool, I'll just wait it out so
[17:50:41] hnowlan: "Please" is always an option, but computers can be very rude. ;)
[17:51:14] hnowlan: what service/backend and what pybal? You can also look at raw pybal logs to get an earlier idea (than the alert) of whether the pybal healthchecks are still failing.
[17:51:36] alias please="kill -HUP"
[17:51:39] I don't know of any common scenario where a pybal restart would fix this
[17:55:42] I guess hnowlan is talking about thumbor
[17:55:48] yep
[17:56:25] and I think that the depool threshold is biting him in the arse :)
[17:56:33] yeah, that's what it looks like
[17:56:46] Dec 14 17:19:00 lvs1020 pybal[36374]: [thumbor_8800] INFO: Merged disabled server kubernetes1011.eqiad.wmnet, weight 2
[17:57:01] yep, thumbor it is :)
[17:57:23] so.. you have 4 thumbor instances outside k8s
[17:57:27] bblack@puppetmaster1001:~$ confctl select name=kubernetes1011.eqiad.wmnet get
[17:57:27] (on eqiad)
[17:57:30] {"kubernetes1011.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=kubernetes,service=kubesvc"}
[17:57:33] {"kubernetes1011.eqiad.wmnet": {"weight": 2, "pooled": "no"}, "tags": "dc=eqiad,cluster=thumbor,service=thumbor"}
[17:57:36] ^ it's not pooled in etcd
[17:57:36] and 18 on k8s
[17:57:55] and I'm assuming that the depool threshold is something like 0.5
[17:58:15] so there is no way that pybal is going to allow you to depool all the k8s instances
[17:58:58] right
[17:59:13] just double checked... depool_threshold is set to 0.5
[17:59:26] whatever the depool threshold fraction is, if etcd or healthchecks try to depool hosts past that fraction, pybal refuses and keeps them in service, and then this alert is generated
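(editor's note: the refusal behaviour described above boils down to a fraction check. Below is a minimal Python sketch of that check using the numbers quoted in the log; it is illustrative only, not PyBal's actual code, and the function name is made up.)

    # Sketch of a PyBal-style depool-threshold check (hypothetical code).
    def threshold_allows(total_servers: int, pooled_servers: int,
                         depool_threshold: float = 0.5) -> bool:
        """True if at least depool_threshold of all configured servers
        would remain pooled."""
        return pooled_servers / total_servers >= depool_threshold

    # thumbor_8800 as described above: 4 non-k8s thumbor hosts plus 5 k8s
    # hosts marked depooled in etcd. Honouring every depool would leave
    # 4/9 ~= 0.44 of the pool in service, below the 0.5 threshold, so
    # PyBal force-pools a depooled backend instead (the "Merged disabled
    # server" log line) and the alert persists.
    print(threshold_allows(total_servers=9, pooled_servers=4))  # False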
[17:59:36] aha
[17:59:51] that makes sense - I was able to depool them without issue earlier, this just started when kubernetes1011 flapped
[17:59:55] some k8s nodes are flagged as inactive, but there are still 5 flagged as depooled
[18:00:24] vgutierrez: that's expected/desired fwiw
[18:00:28] 4 thumbor nodes + 5 k8s nodes should be accounted for depool threshold purposes
[18:00:40] so yeah.. you cannot depool the 5 k8s nodes
[18:01:04] you could modify the depool threshold, but that probably does require pybal restarts to take effect
[18:01:11] yep
[18:01:18] also take into account that the current state is quite fragile
[18:01:37] would setting them back to inactive get around it? or just repooling it?
[18:01:47] it = kubernetes1011 in this case
[18:01:49] setting them to inactive will fix the issue, yes
[18:02:05] okay, I'll go with that
[18:02:11] but in that case.. 4 k8s servers flagged as depooled and 4 thumbor nodes as pooled
[18:02:25] if any thumbor node fails, it will be force-pooled even if it's not healthy
[18:02:50] that's why I was saying that it's a fragile state :)
[18:02:53] ahh
[18:02:55] I'll set all the k8s nodes to inactive in that case
[18:04:20] that makes sense, assuming that you're just doing some tests with k8s
[18:05:13] yep, they're not ready for prime time quite yet so having them accidentally pooled wouldn't be great
[18:05:41] yeah, flag them as inactive then
[18:05:50] done. Thanks for the help!
[18:06:44] {"kubernetes1014.eqiad.wmnet": {"weight": 4, "pooled": "no"}, "tags": "dc=eqiad,cluster=thumbor,service=thumbor"}
[18:06:47] I think you're missing that one
[18:07:09] ah, yep!
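(editor's note: the fix agreed above hinges on the difference between pooled=no and pooled=inactive. A minimal sketch of those semantics follows; the state names and the kubernetes1011/1014 hosts come from the log, but the remaining hostnames and the code itself are invented for illustration. In practice the state change is a conftool operation along the lines of `confctl select 'dc=eqiad,cluster=thumbor,name=kubernetes101X.eqiad.wmnet' set/pooled=inactive`, mirroring the `get` invocation quoted at 17:57:27, with the exact selector left hypothetical here.)

    # Sketch of how conftool pooled states feed the threshold math
    # (illustrative; only the state names appear in the log above):
    #   "yes"      - pooled, counts toward the pool total
    #   "no"       - depooled, but still counted, so PyBal may force-pool
    #                it again to satisfy the threshold
    #   "inactive" - dropped from PyBal's view entirely; not counted
    servers = {
        "thumbor1001": "yes", "thumbor1002": "yes",  # thumbor hostnames
        "thumbor1003": "yes", "thumbor1004": "yes",  # invented here
        "kubernetes1011": "inactive", "kubernetes1012": "inactive",
        "kubernetes1013": "inactive", "kubernetes1014": "inactive",
        "kubernetes1015": "inactive",
    }
    active = {host: s for host, s in servers.items() if s != "inactive"}
    pooled = sum(1 for s in active.values() if s == "yes")
    print(pooled / len(active))  # 1.0 -- 4 of 4 active pooled, alert clears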
[18:10:24] might be of some interest for topranks XioNoX vgutierrez brett: https://labs.ripe.net/author/kistel/five-proposals-for-a-better-ripe-atlas/
[18:11:18] *especially* the `GENERIC-HTTP` one :)
[18:13:04] hmm, yeah, interesting stuff, just had a glance through, will take a closer look later on
[18:13:17] certainly you could imagine us making good use of that GENERIC-HTTP one
[18:13:50] yeah, I also wonder if we are going to wind up on their first list for `CDN-HTTP`
[21:04:53] andrewbogott: I created https://phabricator.wikimedia.org/T325244, do you think you could find some time to help me out with it?
[21:09:46] effie: maybe? My preference would be for someone to work on https://phabricator.wikimedia.org/T237773 rather than continually playing catch-up with those outlier deployments.
[21:10:07] Is there a new memcache pooler that's now the state of the art? I think I've already rotated through three different ones...
[21:12:43] andrewbogott: help me understand, this is under profile::openstack::base::nutcracker
[21:12:57] isn't it related to openstack?
[21:13:09] Oh! um... I may be confusing two different nutcracker issues.
[21:13:14] Sorry, let me look
[21:13:22] (my question about new memcached poolers stands)
[21:14:06] we are using mcrouter to shard memcached, if that is what you are asking
[21:16:28] ok -- I need to dig a bit into what this is doing, but I can claim that bug.
[21:16:52] My hair trigger is because iirc wikitech is also using nutcracker? But maybe that's wrong.
[21:17:37] hm, nope, I upgraded that to mcrouter a while ago. So maybe I can duplicate that work for the openstack services.
[21:18:37] mcrouter is for memcached
[21:18:58] nutcracker in mediawiki was used to shard redis
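(editor's note: for readers unfamiliar with the two proxies being contrasted here: mcrouter shards a memcached keyspace by hashing each key to one backend in a pool, and nutcracker/twemproxy does the analogous job for redis. The snippet below prints a minimal config of that shape; the structure follows mcrouter's documented JSON format, but the pool name and addresses are placeholders, not WMF's production config.)

    import json

    # Minimal mcrouter-style sharding config (placeholder addresses).
    # "PoolRoute|memc" hashes each key to one server in the "memc" pool,
    # which is what "using mcrouter to shard memcached" refers to.
    mcrouter_config = {
        "pools": {"memc": {"servers": ["10.0.0.1:11211", "10.0.0.2:11211"]}},
        "route": "PoolRoute|memc",
    }
    print(json.dumps(mcrouter_config, indent=2))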
[21:19:26] 'k
[21:19:49] so if openstack is using nutcracker for redis, then I think it is as simple as copy/paste and a wee bit of fiddling
[21:19:54] nutcracker /is/ still getting used on wikitech, or at least is installed there. Do you have a different task tracking that?
[21:20:45] wikitech sessions may still be in redis...
[21:21:06] oh yeah, that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/861807/
[21:21:35] I'm making this worse by still talking about wikitech when that's not what effie pinged me about. Apparently I already complained to her about wikitech previously :/
[21:21:51] So anyway, effie, the answer to your initial question is 'yes' :)
[21:22:14] so hang on
[21:22:42] T292707 would be the epic move here ;)
[21:22:42] T292707: Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707
[21:23:40] why is it under profile::openstack::base::nutcracker
[21:24:04] ?
[21:24:24] because the role is wmcs::openstack::eqiad1::labweb?
[21:24:36] the role for wikitech ?
[21:24:57] the role for the cloudweb boxes which host wikitech, Horizon, and Striker
[21:25:29] we're talking about two things at once here. bd808 is talking about wikitech, which is https://gerrit.wikimedia.org/r/c/operations/puppet/+/861807/
[21:25:45] but effie was actually asking about nutcracker's use with openstack services, which is (mostly) unrelated.
[21:26:08] T325244 and https://phabricator.wikimedia.org/rOPUP971912ae9d9713eb9c592cf82b11588b7b375156
[21:26:09] T325244: cloudweb hosts are using the profile::mediawiki::nutcracker profile to configure nutcracker - https://phabricator.wikimedia.org/T325244
[21:26:38] I kind of doubt it is unrelated at all, but I could be wrong
[21:26:58] the fact that it is under openstack is quite misleading
[21:27:53] andrewbogott: regardless, the easy fix here is to include what's in the profile included from mediawiki
[21:28:02] in that profile, so that at least they're decoupled
[21:28:11] OK, now I'm even more lost than before
[21:28:28] I believe we were using nutcracker for memcached before the puppet change I linked above that dropped in mcrouter instead
[21:28:47] I'm in the middle of something and can't give this the attention it deserves. I will look when I'm not distracted and update the task.
[21:32:14] bd808: it seems that mcrouter is configured on cloudweb
[21:33:25] but I am not sure it is used
[21:34:14] everything about wikitech is duct tape and baling wire
[21:34:44] it needs to move in-cluster to stop the config and expectations drift
[21:35:18] *that* sounds like volunteering :D
[21:36:42] TheresNoTime: I have been shifting these rocks since 2014. If I was allowed to put it in k8s or the main cluster config I would have done so.
[21:37:24] ^^'
[21:38:07] for the time being, it is not part of any bigger plan
[21:38:27] but it might be easier with mw on k8s
[21:38:52] anyway, this specific problem has a relatively easy solution
[21:42:38] effie: can you sketch out the easy solution on the task if you haven't done that?
[21:44:06] sure, basically you can merge the two profiles into one
[21:44:16] TheresNoTime: sorry if I barked overly loudly there. I have some PTSD around the "just volunteer harder" mindset for staff and wikitech as well
[21:45:26] effie, sorry, I meant on the task vs. on irc
[21:45:34] bd808: all good, apologies too - "volunteer harder" is rarely the answer and is often just a sticking plaster over a more systemic problem :)
[21:50:21] andrewbogott: replied :)
[21:50:49] thx
[22:00:26] akosiaris, _joe_ (when back): any chance Wikitech can be next in line for k8sifying? https://phabricator.wikimedia.org/T292707
[23:20:09] legoktm: thx for the tox fixes. I've also documented from bash_history what we did last time for the new beta frontend on-wiki, in case there's room for improvement: https://www.mediawiki.org/wiki/Codesearch/Admin#Deploy_frontend