[02:29:55] serviceops, Scap: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (thcipriani)
[02:30:23] serviceops, Scap: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (thcipriani) p:Triage→Unbreak! Setting as UBN! since it blocks deploys.
[02:32:54] serviceops, Scap, conftool: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (Ladsgroup) Tentatively
[02:42:16] serviceops, Scap, conftool: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (Ladsgroup) Copying what I wrote in the ops@ email: This really looks like something like https://wikitech.wikimedia.org/wiki/Incidents/2022-09-08_codfw_appservers_degradation, even err...
[06:19:24] serviceops, Scap, conftool: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (taavi) One of the codfw LVS servers is down due to {T327001}, maybe that is causing this?
[06:36:38] serviceops, Scap, conftool: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (Joe) Yes, that's definitely the reason why this is happening: lvs2008 is down since saturday (along with all rack b2, see T327001). That's what's causing the issues as the script still...
[07:00:18] serviceops, Scap, conftool: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (Joe) Sadly, the root cause is indeed b2 being down, but the problem is that **lvs2009 at the moment cannot reach any servers in row b**. This means that lvs2009 sees a good third of al...
[07:16:07] serviceops, Infrastructure-Foundations, conftool, netops, ops-codfw: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (Joe) a:Joe
[07:19:25] serviceops, Infrastructure-Foundations, conftool, netops, and 2 others: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (ayounsi)
[07:50:17] serviceops, Infrastructure-Foundations, SRE, conftool, and 2 others: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (Joe) p:Unbreak!→High The situation is as follows: * I depooled codfw from mediawiki; before repooling, we'll need to do a scap pull...
[08:14:52] o/
[08:16:48] serviceops, Infrastructure-Foundations, SRE, conftool, and 2 others: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (Joe) Confirmed that now scap works and we can do deployments normally. Please @papaul @ayounsi ping serviceops so that we can bring things...
[09:39:08] serviceops, Wikimedia Enterprise, Performance-Team (Radar), affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (Kelson) At MWoffliner we have started to study the problem. Because of this ticket but as well becaus...
[10:58:49] doing some more thumbor pooling today
[11:49:06] <_joe_> hnowlan: in codfw we're running at half capacity btw
[11:49:22] <_joe_> see /T327041
[11:57:22] _joe_: yeah, I saw, thanks for tagging me.
Looks like it's holding up okay so far, but it's worrying regardless, especially as they are the better resourced
[11:58:16] I think we're not a million miles away from being able to pool k8s in an emergency tbh
[11:59:10] to clarify "they" in the first line is the hosts that aren't pooled
[12:39:38] serviceops, Parsoid, SRE, Scap: scap groups on bastions still needed? - https://phabricator.wikimedia.org/T327066 (MoritzMuehlenhoff)
[12:40:56] a snag with the k8s approach of fewer workers on more pods is that when nodes are tied up, the healthcheck will fail from a pybal perspective, and in a busy enough situation that will be a failed liveness probe for k8s, leading to restarts
[12:41:27] we never really hit this on metal because there are so many instances on each host
[12:42:05] which kinda means the benefits of haproxy are diminished
[13:04:32] <_joe_> maybe we need to change the liveness probe
[13:04:54] <_joe_> and just keep a readiness probe that checks how busy a pod is
[13:16:48] serviceops: sextant needs to purge unused vendored files - https://phabricator.wikimedia.org/T326291 (Joe) Open→Resolved
[13:43:56] serviceops, Sustainability (Incident Followup): Fix sre.mediawiki.restart-appservers cookbook and doc - https://phabricator.wikimedia.org/T325739 (jcrespo)
[14:42:08] _joe_: at this point is it worth re-testing just having a standard one-instance-one-pod setup without haproxy? with lots of replicas. That way at least we don't have to worry about lvs marking k8s workers as unhealthy, and k8s is already seeing thumbor containers as un-ready
[15:09:00] (also addresses the messy per-pod resource concerns)
[15:40:14] <_joe_> hnowlan: the downside being, you have no queueing and we can't rely on the k8s api to pick the "right" pod at all times
[15:40:34] <_joe_> but yeah, let me think about that for a sec too
[15:40:36] <_joe_> :)
[15:41:20] as it stands we're still at risk of k8s not picking the right pod given the un-readiness, but that's obviously something we could tweak within the existing setup
[15:47:53] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc1042.eqiad.wmnet with OS bullseye
[15:58:50] <_joe_> let me take a look at the chart one sec
[16:02:02] <_joe_> hnowlan: where are the readiness and liveness probes defined?
[16:02:46] <_joe_> oh I see in values.yaml
[16:03:16] <_joe_> so you're saying the pods are failing the liveness probe, which is... being able to accept connections?
[16:03:43] <_joe_> that is strange, I have to assume we're not allowing haproxy to use its own queue for connections
[16:04:03] <_joe_> and that makes it refuse any external connection
[16:04:27] <_joe_> if that's the case, then liveness and readiness probes will fail at the same time, which isn't what we want here, right?
[16:16:46] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc1042.eqiad.wmnet with OS bullseye completed: - mc1042 (**PASS**) - Downtimed on Icinga/Alertmanager - Disa...
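(Illustration of the probe split _joe_ suggests at 13:04: keep the liveness probe cheap so a busy-but-alive pod isn't restarted, and let a readiness probe against thumbor's /healthcheck pull a saturated pod out of the Service endpoints. The real configuration lives in the chart's values.yaml; the Python sketch below, using the kubernetes client, is only an assumed shape, with the port, thresholds and image made up for the example.)

    from kubernetes import client

    # Sketch only: liveness is a cheap TCP check (process alive, socket accepting),
    # readiness hits thumbor's /healthcheck so an overloaded pod is merely removed
    # from the Service endpoints instead of being killed and restarted.
    # Port 8800, the thresholds and the image are assumptions, not the chart's values.
    liveness = client.V1Probe(
        tcp_socket=client.V1TCPSocketAction(port=8800),
        initial_delay_seconds=5,
        period_seconds=10,
        failure_threshold=3,
    )
    readiness = client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/healthcheck", port=8800),
        period_seconds=5,
        failure_threshold=2,
    )
    container = client.V1Container(
        name="thumbor",
        image="example.invalid/thumbor:placeholder",  # placeholder image reference
        ports=[client.V1ContainerPort(container_port=8800)],
        liveness_probe=liveness,
        readiness_probe=readiness,
    )

(The point is only the asymmetry between the two probes, not the specific values.)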
[16:21:44] hmm, yeah - healthcheck is passed through to thumbor's /healthcheck
[16:23:52] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc1044.eqiad.wmnet with OS bullseye
[16:36:23] _joe_: a fix here if you have a sec https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/880498
[16:37:01] not sure if it's just a matter of a quieter day, but when testing earlier while pods were operating, the render times I was seeing were close to acceptable
[16:37:58] <_joe_> hnowlan: was it the readiness probe that failed?
[16:38:13] <_joe_> not the liveness one?
[16:38:41] yep, there'd be pods marked as unready in the kubectl lists
[16:38:59] although now that you mention it we configure /healthcheck in LVS also sooooo...
[16:41:13] then again, now that I think about it - hosts are frequently marked as unhealthy in LVS logs, but it refuses to depool them because there aren't enough hosts
[16:41:25] maybe that has been the wrong approach all along
[16:53:27] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc1044.eqiad.wmnet with OS bullseye completed: - mc1044 (**PASS**) - Downtimed on Icinga/Alertmanager - Disa...
[17:14:10] I'm wondering if the healthz endpoint should be in the metal instances too.
[17:14:27] there's plenty of things like "Jan 15 19:32:49 lvs1020 pybal[36374]: [thumbor_8800] ERROR: Monitoring instance ProxyFetch reports server thumbor1002.eqiad.wmnet (enabled/up/pooled) down: Getting http://localhost/healthcheck took longer than 10 seconds." when k8s nodes aren't pooled
[17:15:28] which I think means we're not/barely queuing at all, just using the sheer number of instances and the haproxy balancing between thumbor workers
[17:16:09] for example https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?forceLogin&from=now-30d&orgId=1&to=now&refresh=30s&viewPanel=31
[17:18:51] that'd also explain why, when we lose a thumbor instance or two, we just see more or less immediate failures rather than a slow degradation
[17:35:58] <_joe_> well yeah, we're opting for no queueing to get the maximum "performance"
[17:36:22] <_joe_> sorry I'm knee deep into coding a new scaffolding for deployment charts right now :P
[17:37:01] no worries :)
[17:38:01] are we consciously opting for it though? I think using the /healthcheck check is kinda leaving it up to chance, and also the side effects of having a small number of (healthy) hosts in lvs
[17:39:07] <_joe_> yeah which is worse than with pods
[17:39:36] <_joe_> ideally we'd have some form of autoscaling for thumbor pods
[20:04:56] serviceops, Parsoid, SRE, Scap: scap groups on bastions still needed? - https://phabricator.wikimedia.org/T327066 (Arlolra) > Bastionhosts are used by parsoid deployers to restart parsoid machines and they use the dsh groups that are maintained in scap::dsh. https://github.com/wikimedia/puppet/c...
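(For context on the autoscaling _joe_ mentions at 17:39:36, the sketch below shows roughly what a HorizontalPodAutoscaler for a thumbor Deployment could look like, again via the kubernetes Python client. The namespace, replica bounds and CPU target are invented for illustration and are not taken from the real deployment-charts setup.)

    from kubernetes import client

    # Illustrative only: scale an assumed "thumbor" Deployment on CPU utilisation,
    # so saturation adds pods rather than turning into failed healthchecks.
    # Namespace, replica bounds and the 70% target are made-up numbers.
    hpa = client.V2HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name="thumbor", namespace="thumbor"),
        spec=client.V2HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V2CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name="thumbor",
            ),
            min_replicas=8,
            max_replicas=40,
            metrics=[
                client.V2MetricSpec(
                    type="Resource",
                    resource=client.V2ResourceMetricSource(
                        name="cpu",
                        target=client.V2MetricTarget(
                            type="Utilization", average_utilization=70,
                        ),
                    ),
                )
            ],
        ),
    )
    # To apply it (requires cluster credentials loaded via kubernetes.config):
    # client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler("thumbor", hpa)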