[05:43:14] hello service ops, that's for next week, but what are the depool instructions for mc2042, mc2043, wikikube-ctrl2004? - https://phabricator.wikimedia.org/T429861 thanks!
[07:47:39] FYI, the url downloader in codfw has been moved to the new trixie node (urldownloader2005)
[11:39:19] I'm moving the URL downloader in eqiad also to the new trixie node in a bit
[11:58:33] moritzm: the success ratio for citoid has gone down to 0% in the dashboard, could it be related? https://grafana.wikimedia.org/goto/afq3lqap3bdhcd?orgId=1
[12:02:30] seems most likely, I'll revert to the old servers for now
[12:03:13] Mvolz: does citoid not use the standard CNAMEs? url-downloader.eqiad / url-downloader.codfw / url-downloader
[12:05:11] this broke hCaptcha integration as well, since we couldn't verify any hCaptcha tokens
[12:05:16] I am not sure tbh :(
[12:05:38] https://logstash.wikimedia.org/goto/90b4ee68ec6d14e7cec998faaece2926
[12:06:30] This is https://phabricator.wikimedia.org/T381372 for us again because I only found out by total coincidence when I went to deploy myself and nothing was working, haha.
[12:09:59] Mvolz: reverted to the old servers for now
[12:12:11] Thanks, we're getting some successful requests now
[12:12:46] Mvolz: how does citoid pick the proxy? does it not query the CNAMEs?
[12:18:37] HCaptcha is working again too
[12:21:24] moritzm: all I know is what's in the helmfile, for codfw it says http://url-downloader.codfw.wikimedia.org:8080
[12:22:20] and default is http://url-downloader.eqiad.wikimedia.org:8080
[12:24:50] ok, I'll open a task for citoid (and hcaptcha), the IPs for the new proxies were enabled via the global network policy, but clearly these two still need something else
[12:37:11] Ok sounds good, same username for me on there.
[12:37:44] thanks moritzm for chiming in, let us know if there's any followup task where we can continue discussing
[12:59:34] XioNoX: the mc servers don't require depooling, but we have a strong preference for only one of them to be down at a time, if possible; the gutterpool should pick up the traffic (https://wikitech.wikimedia.org/wiki/SRE/Service_Operations/Documentation/Reboots#Memcache_cluster)
[13:00:20] bjensen: then there shouldn't be multiple of them behind the same switch :)
[13:01:54] hm, i'll have to ask Effie about that when she's back
[13:02:38] my intuition would be that it would be okay for them to be simultaneously down for a brief period, because the gutterpool exists, but if that should never happen, we should probably look into moving them, if possible
[13:05:05] <_joe_> bjensen: that is correct
[13:05:38] <_joe_> we have designed the distribution of mc nodes so that losing a single rack isn't problematic
[13:06:26] <_joe_> 2 servers are less than 10% of our total capacity, and less than the capacity of the gutter pool, which is IIRC 33% of the total capacity of the main pool
[13:06:55] ah, great, thanks for the clarification _joe_
[13:08:58] for wikikube-ctrl2004, the docs here have the depool instructions (the pool-depool-node cookbook) https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#Pool/Depool
[13:24:32] awesome!
[13:25:09] bjensen: nice I didn't know the ctrl nodes could be depooled with the cookbook
[13:25:51] bjensen: are they included in the request `cookbook sre.k8s.pool-depool-node --k8s-cluster wikikube-codfw depool --rack B2` ?
[13:26:19] hmmm, let me see...
[13:28:22] yes, it looks like wikikube-ctrl2004 is inculded there
[13:28:28] included even
[13:28:57] perfect!
[13:29:39] moritzm: fwiw at least for some very old nodejs services, I seem to remember they only do DNS resolution at startup time
[13:30:31] changeprop definitely suffered/suffers from that
[13:31:15] I would suspect citoid of that as well
[13:31:27] and depending on how the hcaptcha haproxies are set up, they might not either (certainly the default configuration doesn't)
[13:33:29] yuck, are these corner cases recorded anywhere? I've been following the established procedure by amending the global network policy and re-deploying external-services
[13:33:55] https://phabricator.wikimedia.org/T430045 is the task created for hcaptcha, will file a similar one for citoid in a bit
[13:35:13] but also, late DNS resolution doesn't really explain it? the old URL downloaders are still around, so if this were late DNS lookups they'd have happily proceeded to use the old ones
[13:35:31] that's fair :)
[13:36:09] thanks. I created T430041 to document the parent incident, attaching those tasks there
[13:38:11] I really hope it wasn't something like the trixie hosts were only listening on one address family
[13:40:21] I had checked that before, address families are identical across the old and new nodes
[13:53:52] * Raine checking whether `cookbook sre.k8s.pool-depool-node` does the right thing with ctrl
[13:57:58] yep, it does
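Note on the "DNS resolution only at startup" behaviour raised at 13:29:39: the following is a minimal Python sketch of the difference between a client that resolves the proxy name once and one that resolves it on every request. It is illustrative only and does not show how citoid, changeprop or the hCaptcha haproxies are implemented. The class names, the `proxy.example` hostname and the 192.0.2.x addresses (documentation range) are all hypothetical; only port 8080 comes from the helmfile values quoted above. As noted at 13:35:13, this behaviour alone would not explain the outage, since the old URL downloaders were still running.

```python
import socket
from typing import Callable, Tuple

# A resolver maps a hostname to an IP address string.
Resolver = Callable[[str], str]


def system_resolver(host: str) -> str:
    """Resolve a hostname to an IPv4 address using the system resolver."""
    return socket.gethostbyname(host)


class StartupResolvedProxy:
    """Hypothetical client that resolves the proxy name once, at construction.

    A later change to the DNS record is not noticed until the process restarts.
    """

    def __init__(self, host: str, port: int, resolve: Resolver = system_resolver):
        self.port = port
        self._ip = resolve(host)  # pinned for the lifetime of the object

    def address(self) -> Tuple[str, int]:
        return (self._ip, self.port)


class PerRequestResolvedProxy:
    """Hypothetical client that resolves the proxy name every time it is used."""

    def __init__(self, host: str, port: int, resolve: Resolver = system_resolver):
        self.host = host
        self.port = port
        self._resolve = resolve

    def address(self) -> Tuple[str, int]:
        return (self._resolve(self.host), self.port)


def demo() -> Tuple[Tuple[str, int], Tuple[str, int]]:
    """Simulate a record change using an in-memory table instead of real DNS."""
    records = {"proxy.example": "192.0.2.10"}  # made-up "old" proxy address
    pinned = StartupResolvedProxy("proxy.example", 8080, records.__getitem__)
    fresh = PerRequestResolvedProxy("proxy.example", 8080, records.__getitem__)
    records["proxy.example"] = "192.0.2.20"  # made-up "new" proxy address
    return pinned.address(), fresh.address()
```

Calling `demo()` returns the pinned client's address followed by the per-request client's address; only the second reflects the changed record. Passing the resolver in as a parameter is a choice made for this sketch so the behaviour can be exercised without touching real DNS.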