[09:19:34] FYI, doing some pod rolling restarts in eqiad trying to reproduce https://phabricator.wikimedia.org/T366094
[09:19:56] cc marostegui ^
[09:20:38] ok
[09:20:40] thanks for the heads up
[10:37:20] vgutierrez: I restarted pybal on lvs1019 for ^
[10:37:40] I also set parse1002 to inactive. It was spamming pybal logs enough
[10:42:59] not vg but ok :)
[10:56:52] we just got a p4ge.... https://grafana.wikimedia.org/d/U7JT--knk/mediawiki-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&refresh=1m
[10:57:03] can this be related to any of the issues from k8s akosiaris effie ?
[10:57:08] almost certainly
[10:57:24] ah, that alert?
[10:57:44] yeah we need to remove that alert I think eventually. We are alerting on the wrong thing. It was a nice signal in the past
[10:58:10] It looks like it is recovering though
[10:58:51] I did notice slowness fwiw, may have been my mobile data though cause it's a bit rubbish
[11:04:36] RhinosF1: do you notice slowness right now?
[11:05:46] we did have increased latencies for a bit while debugging something
[11:08:04] akosiaris: nope, not anymore
[11:08:33] ok, you might notice it once more soon. It's the last test in my battery of tests
[11:10:28] running my test now, it has the highest chance of all my tests of causing some issues
[11:10:33] they will be transient however
[11:15:05] akosiaris: it did again yep
[11:19:32] mw-api-ext is the only thing still running btw
[11:20:49] can someone take a look at gerrit please?
[11:21:04] I am cleaning up my tests, I wanna leave things ok for mw deployers
[11:21:18] akosiaris: what is up with gerrit? it seems to be working for me
[11:21:46] jelto is on it. It loaded for me fine too just now, but pretty slowly
[11:21:53] ok
[11:23:04] !log T366094 re-undeploy otel-collector, it being around increased traffic to the API >50%
[11:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:09] T366094: k8s master capacity issues - https://phabricator.wikimedia.org/T366094
[11:23:47] the gerrit machine is seeing quite some load (traffic wise and cpu)
[11:27:12] I am re-enabling puppet and running it on lvs1019
[12:14:33] Say... how bad is it when an install/reimage ends the installer phase with a kernel panic? Asking for uh, a friend.
[12:14:42] akosiaris: I'll be getting online shortly, can you tell me more about your tests?
[12:14:54] klausman: right before the reboot? that's normal
[12:15:01] Oh, phew.
[12:34:57] hello on-callers! I am going to complete the rollout of PKI TLS certs for thanos-swift
[12:35:06] thanks elukey
[12:35:21] in the past it affected Tegola (maps) but we already have two nodes with the new certs and nothing horrible was registered
[12:59:38] thanks elukey
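For the record, the parse1002 depool and pybal restart mentioned at 10:37 would typically look roughly like this (a sketch; the exact commands used aren't in the log, and the conftool selector is an assumption):

```
# on a conftool/cumin host: mark parse1002 inactive so pybal stops probing it
sudo confctl select 'name=parse1002.eqiad.wmnet' set/pooled=inactive
# on the load balancer itself: restart pybal so it starts from a clean state
sudo systemctl restart pybal.service   # on lvs1019
```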
[13:06:07] cdanis: I am gonna wrap them up and document them in the task
[13:06:21] ok
[13:06:29] otelcol was part of the trigger, but only part of it
[13:06:32] to be clear
[13:12:30] akosiaris: is our tl;dr that, we will just have a lot of traffic/load during deploys, since we have more hosts more services etc
[13:12:54] so we scale horizontally if needed?
[13:19:54] cdanis: it was part of the issue, but it was a very large part of the issue.
[13:20:01] yes
[13:20:19] I think it was responsible for something like 50%-66% of the network traffic from the original masters
[13:20:38] effie: no recommendations yet, aside from figuring out a bit what exactly the otel collector is doing and definitely finishing up the wikikube-ctrl work
[13:20:45] but don't kill the old masters just yet
[13:21:11] I don't think we are in a rush to kill the old masters anyway
[13:23:52] akosiaris: I have a decent handle on what it is doing, fwiw, and I'm also looking at the source rn https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/k8sattributesprocessor/internal/kube/client.go
[13:33:40] cdanis: am I correct to think that running it as a daemonset means we get n times the load for the same functionality as a single replica? That's how I understood the docs, but I only looked briefly
[13:34:00] kamila_: no -- when running as a daemonset, each instance only queries the API for pods on its local node
[13:34:49] that's basically how it's designed to be deployed on k8s
[13:34:57] Ah, OK, so it does shard, thank you
[13:35:02] I think we've just never had to think about horizontal scalability of the apiserver before
[13:35:20] Yeah, makes sense
[13:38:05] we could potentially disable some of the attributes it adds, which could reduce the number of API calls it makes
[13:38:21] but otoh it's running in exactly the same configuration on codfw just fine
[13:43:07] example: 2024-05-29T11:07:15.126Z info kube/client.go:113 k8s filtering {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "labelSelector": "", "fieldSelector": "spec.nodeName=kubernetes1057.eqiad.wmnet"}
[13:43:38] and that gets applied to all the API calls (LIST / WATCH) that each collector makes
[13:46:52] hm
[13:47:03] I possibly have spotted a way to reduce the load further
[13:47:31] see `"pipeline": "metrics"` up above? we don't need that, nor one for logs
[13:47:38] (at least not at present time 😇)
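To make the sharding concrete: with that field selector, each DaemonSet instance's LIST/WATCH against the apiserver is roughly equivalent to the first command below, rather than the much heavier cluster-wide one a single replica would need (a sketch; the node name is just the one from the log line above):

```
# per-node pod watch, as each k8sattributes instance effectively does
kubectl get pods --all-namespaces --watch \
  --field-selector spec.nodeName=kubernetes1057.eqiad.wmnet

# versus the cluster-wide equivalent
kubectl get pods --all-namespaces --watch
```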
[14:06:09] akosiaris: thanks for the writeup. I have a maybe-counterintuitive hunch that, by removing CPU from the cluster, you're making the network saturation situation better
[14:09:00] Sorry, stupid questions again. I'm trying to set up a new discovery record (for apus, a new Ceph-backed S3 service), so I've got hostnames &c to use in actually setting it up. I started at https://wikitech.wikimedia.org/wiki/DNS/Discovery#Add_a_service_to_production and step one is essentially "fill out an entire Wmflib::service entry in hieradata/common/service.yaml", which needs LVS IPs. How do I go about getting those allocated?
[14:09:23] The example in the docs ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/545225 ) predates much of this, so isn't very helpful
[14:09:23] cdanis: I have the EXACT same thinking
[14:09:47] I am artificially limiting the TX from kubemaster1002 because its CPU is pegged at 100%
[14:09:54] yep!
[14:10:00] and this is supported by that 272MB/s we saw yesterday
[14:10:15] and it needs that CPU to handle the mutating operations that will cause a bunch of network TX to all the low-cpu-but-high-bw WATCH clients
[14:10:25] and so you add more CPU, you increase the rate of mutating operations, which, etc
[14:11:05] and re: balancing them, what you can do is, you kill them and make the client reconnect after a while ;)
[14:11:14] didn't we do the same thing for ATS connections to its backends? albeit on its side
[14:11:30] yet, which isn't very easy to do on k8s side without an apiserver restart
[14:11:34] yes*
[14:11:54] yeah that's why I was suggesting raising it with upstream
[14:12:54] Ah, I think I need https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service (though there is a bit of chicken-and-egg here)
[14:14:04] Emperor: you don't need LVS IPs to add a new service in discovery
[14:14:10] look at helm-charts for an example
[14:14:22] it's an LVSless discovery enabled service
[14:14:25] there are a few more
[14:14:32] Emperor: you just need the service IPs, not the LVS ones; you need the class for that
[14:15:58] Emperor: happy to help in -traffic. feel free to ping us there if required
[14:19:53] akosiaris: btw -- https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1037083/1
[14:20:22] you are right, I hadn't been factoring in that all five control plane nodes are still active in codfw
[14:20:34] if what you say on the subject is true, hurray!
[14:20:45] +1ed
[14:20:46] maybe. it is a hunch based on a cursory reading of the logs + code
[14:20:56] if you don't mind I'll try it on codfw
[14:21:00] go ahead
[14:21:03] oh, yes, the documentation and type declaration in modules/wmflib/types/service.pp don't agree as to whether the "lvs" parameter is optional or not (and confused me that the 'ip' parameter had to contain LVS IPs)
[14:21:27] Emperor: when in doubt, trust existing instantiations over the comments ;)
[14:21:51] I hear you, but I'm a bit out of my depth here and trying to RTFM rather than cloning and hacking
[14:22:25] I will take sukhe up on their offer of assistance in -traffic, though.
[14:45:58] akosiaris: before and after: https://grafana.wikimedia.org/goto/BjzDn7yIg?orgId=1
[14:47:51] akosiaris: if it sounds good to you, I'll sum up kube-apiserver network load seen in codfw with this configuration during the MW train deploy, and if it's under like idk 1.5Gbit/s? then we can re-deploy otelcol in eqiad as well
[14:48:23] btw, something *not* mentioned on the capacity crunch bug so far: apparently we have made scap deploys significantly (like 50%!) faster
[15:17:57] cdanis: SGTM
[16:20:41] jhathaway: is puppet merge taking long?
[16:21:28] merged
[16:21:34] thanks
[17:43:21] the lvs1019 thing, bunch of mwNNNN failing health probes from lvs?
[17:43:31] sorry, let's talk here
[17:43:35] not in #-operations
[17:43:43] for mwdebug
[17:44:10] yeah, the lvs hosts can't connect to port 4444 on I think probably most/all of the kube machines
[17:44:22] mediawiki-pinkunicorn-tls-service NodePort 10.64.72.109 4444:4444/TCP 449d
[17:44:32] the Kubernetes resource for the NodePort does exist in eqiad
[17:45:02] and I've checked by hand on one random k8s host (kubernetes1012) and I don't see anything in `iptables -nL` output about dport 4444
[17:45:19] is port 4444 new?
[17:45:19] whereas I do see this, for instance:
[17:45:22] no
[17:45:24] MARK tcp -- 0.0.0.0/0 0.0.0.0/0 /* cali:4qSdb5zElt3Q5jJs */ /* Policy mw-api-int/knp.default.mediawiki-main ingress */ multiport dports 4446,8080,9117:9118,9151,9181,9361 MARK or 0x10000
[17:45:31] in previous incident 2024-04-03 calico/typha down
[17:45:41] 4444 is the mw-debug port: https://wikitech.wikimedia.org/wiki/MediaWiki_On_Kubernetes#Server_groups
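For anyone replaying this later, a quick way to check both halves of the port-4444 path on a worker (a sketch; the plain `iptables -nL` above only lists the filter table, while the NodePort DNAT rules live in the nat table):

```
# look for the NodePort DNAT rule in the nat table, not the default filter table
sudo iptables -t nat -nL | grep -w 4444

# probe reachability from the LVS side; -v shows whether the connection
# is actively refused or just times out (a SYN going nowhere)
nc -vz -w 3 kubernetes1012.eqiad.wmnet 4444
```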
[17:46:04] it has been said back then "calico controller being down would only be an issue if there was also a deployment" and this was the case again?
[17:46:08] I don't see anything about mw-debug or mwdebug
[17:46:34] (which is the good news, this is only affecting mw-debug traffic atm, so it's blocking scap but no direct user impact)
[17:47:12] well, so long as it's not impacting unrelated traffic through lvs1019
[17:47:25] fair
[17:48:09] okay, apparently those kinds of iptables objects not existing everywhere is normal
[17:48:37] and they do appear on exactly 2 `A:wikikube-worker-eqiad` hosts, so, that checks out
[17:49:14] that'll be the hosts with the two mw-debug pods that are actually running, yeah
[17:49:25] doesn't seem like it should cause anything weird though
[17:50:23] I really should know the answer to this question already, but: are more than those two hosts *supposed* to be answering health checks on 4444?
[17:50:48] cdanis: -t nat and there should be rules on all nodes DNATing that port
[17:50:57] yeah from the basic end of the lvs config, lvs expects 4444 to work on the whole kubesvc fleet
[17:51:14] I am putting kids to sleep, so can't respond, just comment on IRC
[17:51:22] thanks akosiaris
[17:51:28] well I guess there's another routing layer too
[17:51:41] oh that DNAT rule will be the answer to my followup question then
[17:51:51] ("how did that ever work")
[17:52:34] And yes all nodes should answer to hits on 4444, but they aren't really answering themselves but forwarding traffic to the pods
[17:52:46] nod
[17:53:01] okay, those rules are on 100% of nodes
[17:53:05] and they are identical across nodes as well
[17:53:23] Look if those pods are so saturated or something that they are croaking and aren't able to respond
[17:53:29] so: did the forwarding stop working, or did the pods stop answering -- second one sounds more likely
[17:53:40] In that case they may be taken out of rotation
[17:53:55] In which case connection refused will be returned
[17:53:57] they look healthy: https://grafana.wikimedia.org/goto/aFxk0nySg?orgId=1
[17:54:15] workers mostly idle, low resource usage etc
[17:55:19] akosiaris: there's also no events in the kubectl get events, and connections are hanging on connect, rather than being refused
[17:55:34] the SYN packet goes nowhere, is what it looks like
[17:56:52] Hmmm, weird
[18:01:29] did parse1002 decide to wake up from the weird slumber it was in?
[18:01:54] talking about https://phabricator.wikimedia.org/T363086
[18:01:56] temp and then died again. Dell is going to send a new mainboard now
[18:02:49] well, it's up right now
[18:02:50] well, it's probably somewhat fixed because it's sending heartbeats
[18:02:57] but not entirely ok?
[18:03:04] and it does host 1 pinkunicorn pod
[18:03:34] oh, did we not re-cordon it?
[18:03:53] I just cordoned it again
[18:03:59] but it was in state READY
[18:04:28] there goes the pod
[18:04:33] forcefully deleting the 1 pod it had
[18:04:45] !log kubectl -n mw-debug delete pods mw-debug.eqiad.pinkunicorn-6d4d68cd79-nq695
[18:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:02] annnnd that was it
[18:05:04] and pybal recovered
[18:05:13] thanks akosiaris <3
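What the cleanup above amounts to as commands (a sketch; the log shows a cordon plus manual per-pod deletes, with `kubectl drain` being the usual one-step equivalent):

```
# re-cordon the half-dead node and see what is still scheduled on it
kubectl cordon parse1002.eqiad.wmnet
kubectl get pods -A -o wide --field-selector spec.nodeName=parse1002.eqiad.wmnet

# evict everything except DaemonSets in one go, instead of deleting pods by hand
kubectl drain parse1002.eqiad.wmnet --ignore-daemonsets --delete-emptydir-data
```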
[18:05:57] ok, in the incident review write a line "where we got lucky? Alex did a kubectl get pods -o wide just out of habit, and saw by chance parse1002 and put 1+2 together, said it equals to 12 and whatever from there on"
[18:06:35] and if my kids ever see the above in bash or whatever, just know that I was joking.
[18:06:39] ahahaha
[18:07:16] * akosiaris off again
[18:07:30] other pods on parse1002, apart from daemonsets: two linkrecommendation and one toolhub, both in status error
[18:07:34] I'll delete em both as well, I guess
[18:08:03] is there some deeper learning here about half-dead servers taking out a bunch of stuff?
[18:08:45] yeah, it might actually be worth writing this one up and discussing it
[18:09:00] something about silver and garlic ? or whatever it is that repels zombies?
[18:09:10] ok, enough jokes, I am off
[18:09:13] the easy process answer is "make sure we cordon a machine if it's haunted" but it also shouldn't have hurt so bad when we hit ourselves
[18:09:46] thanks again akosiaris, have a good night
[18:11:41] rzl: this is the *second* Interesting K8s Thing worth discussing in as many days
[18:11:44] thanks akosiaris
[18:12:10] yeah
[18:13:23] now we're just left with `PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster1002.eqiad.wmnet are marked down but pooled` from this morning
[18:13:32] which is probably worth us cleaning up, rzl
[18:14:05] oh, ye
[18:14:14] one sec while I put these three pods out of their misery, then refill my coffee
[18:15:37] hm it's running
[18:16:40] !log evacuate cordoned node parse1002: kubectl -n linkrecommendation delete pod linkrecommendation-internal-load-datasets-28616700-7gsqs; kubectl -n linkrecommendation delete pod linkrecommendation-internal-load-datasets-28616700-xl7t4; kubectl -n toolhub delete pod toolhub-main-crawler-28616760-jrhbb # T363086
[18:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:45] T363086: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086
[18:25:26] okay, back
[18:26:28] I am not sure what is wrong
[18:30:25] the pybal management port? (9090) which is being queried by the nrpe is indeed still reporting the critical
[18:35:40] meanwhile we also have this
[18:35:43] 💔cdanis@lvs1020.eqiad.wmnet ~ 🕝☕ sudo ipvsadm -l -t 10.2.2.8:6443 --stats
[18:35:45] Prot LocalAddress:Port Conns InPkts OutPkts InBytes OutBytes
[18:35:47] -> RemoteAddress:Port
[18:35:49] TCP kubemaster.svc.eqiad.wmnet:6 37962 327308 0 67832843 0
[18:35:51] -> kubemaster1001.eqiad.wmnet:6 22150 171479 0 34405585 0
[18:35:53] -> wikikube-ctrl1002.eqiad.wmne 3185 31781 0 7000897 0
[18:35:55] -> kubemaster1002.eqiad.wmnet:6 3184 28055 0 5917272 0
[18:35:57] -> wikikube-ctrl1001.eqiad.wmne 3184 32038 0 6962523 0
[18:35:59] so like ... it's working and being used
[18:37:05] if nothing else seems obvious, it's possible that pybal has built some bad internal state
[18:37:38] we know there's some logic flaws in there, which can be exercised when there's lots of depool/repool happening while lots of health probes failing and trying to go below the depool_threshold.
[18:37:48] sometimes it can just end up with bad internal state for some backends and will have to be restarted
[18:37:59] it's rare, but it has happened before
[18:38:06] got it
[18:38:09] bblack: https://phabricator.wikimedia.org/P63596
[18:38:10] I think that this happened here
[18:38:52] maybe restart lvs1020 first and wait a few, to make sure that one's clean + sane before 1019 blips real traffic
[18:39:01] the error isn't happening on lvs1019 :)
[18:39:03] just on lvs1020
[18:39:05] oh ok
[18:39:06] https://sal.toolforge.org/log/lioyxY8BxE1_1c7sKbsl
[18:39:08] just restart it then :)
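For reference, checking pybal's own view of the pool before restarting it (a sketch; the instrumentation URL path on port 9090 is an assumption on my part, the restart itself is what was done):

```
# ask pybal's instrumentation interface (port 9090, the one NRPE queries)
# what it currently believes about the kubemaster_6443 backends
curl -s http://localhost:9090/pools/kubemaster_6443

# if the state is wedged ("marked down but pooled"), restart pybal
sudo systemctl restart pybal.service   # on lvs1020
```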
[18:39:09] Hi folks. Different httpbb test failure during train rollout:
[18:39:09] ```
[18:39:09] 18:38:18 Check 'check_testservers_k8s' failed: Sending to mwdebug.discovery.wmnet...
[18:39:09] https://donate.wikimedia.org/w/index.php?title=Special:FundraiserRedirector&reloaded=true (/srv/deployment/httpbb-tests/appserver/test_foundation.yaml:28)
[18:39:09] Location header: expected 'https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&utm_medium=spontaneous&utm_source=fr-redir&utm_campaign=spontaneous', got 'https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&wmf_medium=spontaneous&wmf_source=fr-redir&wmf_campaign=spontaneous'.
[18:39:10] ```
[18:39:10] lvs1019 was restarted
[18:39:48] `FAIL: 131 requests sent to mwdebug.discovery.wmnet. 1 request with failed assertions.`
[18:39:59] dancy: so the params changed from `utm_` to `wmf_`, was that intended?
[18:40:13] if not the tests caught a bug, if so the tests weren't updated :)
[18:41:04] really missing the way gerrit colors diffs rn
[18:41:59] Hmm.. it would be helpful if the "expected" and "got" values were aligned in the error output.
[18:42:04] Like so:
[18:42:14] ```
[18:42:14] Location header:
[18:42:14] expected 'https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&utm_medium=spontaneous&utm_source=fr-redir&utm_campaign=spontaneous',
[18:42:14] got 'https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&wmf_medium=spontaneous&wmf_source=fr-redir&wmf_campaign=spontaneous'.
[18:42:14] ```
[18:42:48] patches welcome :D
[18:42:56] 👍🏾
[18:43:20] (or assign me a task if you'd rather, I just can't promise it'll happen right away)
[18:43:38] Seems like a simple enough one to take a stab at myself.
[18:43:45] sweet, happy to review
[18:44:12] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/httpbb/+/refs/heads/master/httpbb/main.py#185
[18:44:53] rzl: do you know if any clients are depending upon one output line == one error right now?
[18:45:17] cdanis: seems like the restart fixed it?
[18:45:21] sukhe: indeed
[18:45:46] cdanis: I don't know of anything that parses the output at all
[18:45:51] cdanis: ok!
[18:45:55] it's all human-facing afaik
[18:46:05] (except the exit status)
[18:46:06] sgtm :)
[18:48:56] rzl: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037149
[18:49:18] dancy: utm_campaign too, yeah?
[18:49:30] oh and utm_source, it's all three
[18:49:50] ah yes. thanks.. fixing
[18:56:21] rzl: Fixed. I was way off on the first attempt.
[18:57:06] dancy: lgtm, as soon as jenkins is happy I'll +2 and merge it
[18:57:13] thx
[18:57:48] normally we'd do a dance around making sure the tests are passing so the alert doesn't pop -- in this case we'll just break the test since you're about to deploy the thing that fixes it, and if it yells in the meantime we'll ignore it
[18:58:12] but if you end up rolling back the train, we'll revert this too and then be more thoughtful :)
[18:59:00] nod.. I was considering that possibility. :-/
[19:00:36] (sorry, I mean, not that you were thoughtless, this is exactly what I'd do)
[19:01:07] Nod.. understood.
[19:08:22] dancy: merged and ran puppet on deploy1002 -- httpbb now passes on mwdebug and fails on mw-web, as expected and the opposite of before, you should be good
[19:09:05] Thanks!! Proceeding.
[19:22:50] rzl: dancy: I got mildly sniped https://i.imgur.com/M8BBux8.png
[19:23:24] Nice.. I was just thinking about adding in color diff.
[19:23:33] I'll let you do that after my changes.
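Re the aligned expected/got output discussed from 18:41 onward: even without colour, splitting the two URLs on `&` makes the changed parameters jump out (a runnable sketch using the exact values from the failure above):

```
expected='https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&utm_medium=spontaneous&utm_source=fr-redir&utm_campaign=spontaneous'
got='https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&wmf_medium=spontaneous&wmf_source=fr-redir&wmf_campaign=spontaneous'

# one query parameter per line, then a plain diff of the two lists
diff <(tr '&' '\n' <<<"$expected") <(tr '&' '\n' <<<"$got")
```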
[19:23:51] whoa, neat
[19:24:01] I'm unsure about the spacing thing
[19:24:05] gonna drag this tool kicking and screaming into the late 20th century
[19:24:21] it is potentially confusing every way I've thought of to do it
[19:24:26] so I might take that bit out
[19:25:02] I think I would drop it in part for copypastability reasons
[19:25:08] yeah
[19:25:12] it's a cool idea though, I'm not 100% against it
[19:25:39] https://i.imgur.com/10ogGAC.png
[19:29:17] hmmmm what if the unit was words instead of characters
[19:29:31] it'd be nice to highlight `wmf` or even `wmf_source` rather than just `w` and `f`
[19:29:41] maybe I'm overfitting on this specific example though
[19:30:00] do we have historical logs of httpbb test failures rzl
[19:30:44] not centralized ones afaik -- you could try the systemd timer logs on the cumin hosts
[19:31:12] I think they mostly won't be semantically-interesting failures like this one though, mostly connectivity hiccups
[19:31:22] yeah
[19:51:46] rzl: FYI httpbb tests don't work as-is right now: https://integration.wikimedia.org/ci/job/tox/657/console
[19:52:59] that's odd, I guess that error message changed with a version update or something
[19:53:06] I'll fix that today, thanks for the heads up
[19:53:54] rzl: if it's good with you I'd like to do a rolling restart of a bunch of mw pods in codfw once the admin_ng/otelcol change is deployed and stable
[19:54:50] sgtm, maybe we wait until after the backport window coming up
[19:55:30] (or just use the backport as a rolling restart, depending)
[19:55:36] perrrrfect
[20:16:55] rzl: well, both patches I did today to otelcol config in codfw certainly did something: https://grafana.wikimedia.org/goto/y1f487sSR?orgId=1
[20:17:59] mmmm yes I think I can almost see the change there if I squint
[20:45:52] 20:43:42 <+jinxer-wm> FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[20:46:07] this happened during a deployment but I think it is a leftover of h.nowlan's work earlier https://sal.toolforge.org/production?p=0&q=kubernetes2032&d=
[20:46:40] yeah was just about to point you to that rename
[20:47:03] I put in a silence
[20:47:13] not sure offhand what the lingering artifact is but I agree there must be one somewhere
[20:47:38] at some point I'm going to have to block off a week, decline all my meetings, quit IRC, and just properly learn how calico works
[20:47:52] I learned a little of that today
[20:47:55] ... a little
[21:30:16] denisse: per this task, https://phabricator.wikimedia.org/T274377, do you know if the sre-email-bot account is in use?
[22:27:37] jhathaway: forwarded you an email about sre-email-bot
[23:00:18] cwhite: thanks!
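And re the word-level highlighting idea from 19:29 above: git's word-diff with a custom token regex gets most of the way there, highlighting whole parameters like `utm_source` vs `wmf_source` instead of single characters (a self-contained sketch; the output colouring depends on your terminal):

```
expected='https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&utm_medium=spontaneous&utm_source=fr-redir&utm_campaign=spontaneous'
got='https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&wmf_medium=spontaneous&wmf_source=fr-redir&wmf_campaign=spontaneous'
printf '%s\n' "$expected" > /tmp/expected.url
printf '%s\n' "$got" > /tmp/got.url

# treat anything between URL delimiters (& ? =) as a single "word"
git diff --no-index --word-diff=color --word-diff-regex='[^&?=]+' \
    /tmp/expected.url /tmp/got.url
```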