[05:47:14] wikibugs has been down for a few hours now I think?
[05:49:29] I have just restarted it
[05:51:04] Aaaand it is back
[07:58:31] <_joe_> marostegui: yeah clearly tonight there were connectivity issues
[07:58:45] <_joe_> I'm happy to see sirenbot detected that and committed seppuku
[07:58:50] <_joe_> thus it got restarted
[07:58:55] <_joe_> means my "fix" works
[09:58:15] GitLab needs a short restart at around 11:00 UTC
[11:13:33] GitLab restart done
[12:22:39] let's chat in there topranks jayme claime ?
[12:22:43] yeah
[12:22:45] jynus: ^
[12:23:03] I just redeployed mw-on-k8s, but didn't touch calico
[12:23:10] ok
[12:23:24] I'm looking at the calico-node logs rn
[12:23:35] I have a pretty clear start time
[12:23:44] whole bunch of active for the K8s cluster
[12:23:45] all down
[12:23:58] bird: Unable to open configuration file /etc/calico/confd/config/bird.cfg: No such file or directory
[12:24:13] 2023-12-14 12:23:50.316 [FATAL][1290] tunnel-ip-allocator/startsyncerclient.go 48: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
[12:24:42] We can try to redeploy the calico-node pods, jayme wdyt?
[12:24:58] <_joe_> I'm not sure it's just k8s, I see issues in non-k8s appservers too
[12:25:11] this is happening in Juniper logs
[12:25:17] https://www.irccloud.com/pastebin/COnH0Ess/
[12:25:31] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&refresh=1m&var-site=codfw&var-cluster=api_appserver&var-method=GET&var-code=200&var-php_version=All&viewPanel=9&from=now-30m&to=now
[12:25:49] claime: typha is down, that's why the calicos are failing
[12:25:58] redeploy would not help I think
[12:26:14] <_joe_> jayme: can we try to bring typha back up?
[12:26:19] _joe_: you sure bare metal is not just picking up the extra load?
[12:26:20] on it
[12:26:31] it's back
[12:26:37] <_joe_> kamila_: possible, but we'd see an increase in rps
[12:26:41] <_joe_> we see the opposite
[12:26:42] true
[12:27:42] <_joe_> jayme, claime what's the status of the calico pods now?
[12:28:03] topranks: godog I will create a status page incident
[12:28:10] 2023-12-14 12:25:12.825 [INFO][2175] confd/config.go 60: Found FELIX_TYPHAK8SSERVICENAME=calico-typha
[12:28:13] Number of hosts down is decreasing
[12:28:15] jynus: ack, thank you
[12:28:44] For instance kubernetes2048 is still rejecting BGP connection requests from the CR
[12:28:56] <_joe_> things are back afaict
[12:28:56] https://www.irccloud.com/pastebin/rClNIpIf/
[12:28:59] <_joe_> as in
[12:28:59] I think we're recovering yeah
[12:29:05] _joe_: some calicos are still in crashloop backoff but they should come back shortly
[12:29:08] <_joe_> we're back to normal levels of rps to mw on k8s
[12:29:11] "We are aware of issues with accessing some wikis, and we are investigating." as I am not yet 100% sure on the impact to end users
[12:29:12] still nothing listening on tcp:179 (presume that should be a bird process)
[12:29:25] ok it just came back on the node I was looking at - kubernetes2048
[12:29:44] <_joe_> jayme: any idea why typha crashed?
[12:29:50] bird and bird6 processes running ok
[12:30:01] _joe_: I did not look yet - wanted to restore service first
[12:30:14] <_joe_> jayme: yeah, I think we're ok-ish
[12:30:27] <_joe_> it's refreshing that the recovery is so fast
[12:30:44] <_joe_> also please !log actions you have taken
[12:31:14] should I switch the status to monitoring, or is the recovery not yet confirmed?
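[Editor's note: the snippet below is not part of the log. It is a minimal sketch of the kind of checks described above, assuming the kube-system namespace and the upstream calico k8s-app labels:]

    # Is typha up, and does its Service resolve to ready endpoints?
    kubectl -n kube-system get pods -l k8s-app=calico-typha -o wide
    kubectl -n kube-system get endpoints calico-typha

    # Which calico-node pods are crash-looping?
    kubectl -n kube-system get pods -l k8s-app=calico-node | grep -v Running

    # On an affected node: is bird listening for BGP on tcp/179?
    sudo ss -tlnp '( sport = :179 )'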
[12:31:36] all BGP sessions are back up on the CR side
[12:31:40] <_joe_> jynus: I would say we've recovered but I'd want to check if typha is ok
[12:31:47] <_joe_> because it's a huge SPOF right now
[12:31:58] jynus: let's give it another few minutes
[12:32:01] these are the first down log and last up log, to give us a timeframe for the incident
[12:32:05] https://www.irccloud.com/pastebin/e8JRaBYs/
[12:32:09] I will move it to identified
[12:32:41] 5XX on restbase now
[12:33:20] Looks like it's already on the way down though
[12:33:25] but it also seems to be recovering - maybe it was just backlog
[12:33:30] been spiking in time with everything else
[12:33:47] thumbor has been struggling a bit but is hopefully recovering
[12:33:49] yeah, or the monitoring threshold is much higher
[12:36:00] was there anything else like a spike in requests or something?
[12:36:10] <_joe_> the restbase thing is real
[12:36:29] jynus: fair to said we're in status monitoring now
[12:36:34] <_joe_> jayme: not that I can see
[12:36:34] s/said/say/
[12:36:43] godog: setting, thanks
[12:37:29] <_joe_> the correspondence with the scap deploy can't be coincidental, IMHO
[12:39:54] do we have a doc?
[12:40:11] <_joe_> jayme: I don't think so, I arrived here in the middle of the outage
[12:40:18] any alarms pending?
[12:41:12] I see a ferm one on k2016, but that may be unrelated?
[12:41:14] jynus: not afaics
[12:41:18] jayme: we do not, no
[12:41:23] <_joe_> jynus: looking at alertmanager, about 274 :P
[12:41:30] LOL
[12:41:36] jynus: Not unrelated, but I just restarted ferm
[12:41:42] It'll clear
[12:41:44] thanks
[12:41:46] <_joe_> we still have the mw edit session loss one
[12:41:51] MediaWiki edit session loss < That one...
[12:41:53] :(
[12:42:12] <_joe_> it will recover though
[12:42:21] good point for impact on doc
[12:42:22] <_joe_> https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13&from=now-1h&to=now
[12:45:07] jynus: are you creating the doc? I see you in the template.
[12:45:35] sobanski: I am always in the template :-D
[12:45:51] I didn't plan to, we have people on call for that
[12:46:48] but I can help fill in the few details I did (just the status page) when it is done
[12:46:49] <_joe_> in terms of impact, https://grafana.wikimedia.org/d/000000208/edit-count?from=1702555537685&to=1702557648189
[12:46:56] hmm...meybe it's related to me restarting the apiservers some 20min'ish earlier than the outage
[12:47:02] *maybe
[12:47:08] <_joe_> jayme: hah
[12:47:10] https://docs.google.com/document/d/1jThS61rCjtz-8GmHTbqe5L-I2DdszlPBoMlMtYOYfhI/edit#heading=h.95p2g5d67t9q
[12:47:20] <_joe_> jayme: that is actually reassuring
[12:47:40] * _joe_ goes back to the codejam
[12:47:45] well...not really. 20min between the last reboot and the problem arising is a bit strange
[12:47:48] godog: waiting for you to call it resolved to do it on the status page
[12:48:01] the ferm failure on kubernetes2016 is unrelated; it comes from a change I made which enables a new cumin node, and the puppet change to allow access from it triggers a restart of ferm (which fails to restart in some rare cases)
[12:48:05] plus I obviously did not reboot all control-planes at once
[12:48:08] <_joe_> jayme: if typha crashed, isn't that the time it took calico to go down even last time?
[12:48:30] typha crashing does that, https://github.com/projectcalico/calico/issues/6167
[12:48:39] <_joe_> yeah
[12:48:42] <_joe_> now the problem is
[12:48:56] I don't recall the timing. But then I looked, typha was still down (20min after)
[12:48:56] <_joe_> why did typha crash? if it did with the control plane restart
[12:49:18] <_joe_> it's bad but manageable
[12:49:22] I added a few pods, black magic threshold reached?
[12:49:24] <_joe_> if it did because of scap
[12:49:32] <_joe_> that is terrible
[12:49:36] <_joe_> yeah kamila_ exactly
[12:49:36] yes
[12:49:54] yes...I'm going to dig for logs
[12:49:58] <_joe_> I'd say we have an "easy" way to test this - doing a NULL scap deployment
[12:50:23] yeah but like... I deployed eqiad as well
[12:50:27] Nothing moved there
[12:50:47] <_joe_> claime: yeah...
[12:50:50] mw-api-int in eqiad serves a lot fewer requests, not sure if relevant though
[12:50:56] * kamila_ afk for a few min
[12:51:04] I doubt it's volume related
[12:51:05] <_joe_> but we DEFINITELY need to page if typha is down
[12:51:16] <_joe_> yeah volume wouldn't kill typha
[12:51:21] (as far as requests are concerned, jury's out on pod volume)
[12:51:28] <_joe_> it would kill other components of calico
[12:58:07] yeah, agreed, just wanted to mention it
[13:05:08] eqiad and codfw run about the same number of pods, so it is weird
[13:10:01] <_joe_> kamila_: it does indeed all point to this happening with the control plane restarts
[13:10:22] <_joe_> why typha crashed, or why it wasn't automagically restarted, is not clear to me at all
[13:10:39] yeah, probably, but why only in codfw?
[13:10:51] I think we're also missing logs in logstash. I see calico kube controllers starting around 12:07 but no prior termination
[13:10:55] <_joe_> because we only restarted the api server in codfw
[13:11:05] oh, okay
[13:11:06] _joe_: that's wrong
[13:11:06] godog, topranks, urandom I am going for lunch, remember the incident is still ongoing on the doc and status page (just mentioning it because I won't be able to update it)
[13:11:12] <_joe_> jayme: uh
[13:11:14] <_joe_> interesting
[13:11:20] I've restarted in both dcs
[13:11:28] ok, okay :D
[13:11:30] <_joe_> uhhh ok
[13:11:31] also, there are hours between me restarting 2001 and 2002
[13:11:49] <_joe_> jayme: loss of logs is probably due to loss of network?
[13:12:01] <_joe_> I am not sure the pods could talk to the log collector at the time
[13:12:11] <_joe_> unless the logs are sent via rsyslog from the host
[13:12:13] plus: typha did recover automatically
[13:12:29] they are sent via rsyslog
[13:12:42] <_joe_> jayme: uhm so typha recovered then crashed?
[13:12:45] yes
[13:12:57] <_joe_> do you know when it actually crashed?
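[Editor's note: a hedged sketch, not from the log, of where that answer usually lives — the container's last termination state and the pod events (assumes the upstream k8s-app=calico-typha label):]

    # The previous termination (reason, exit code, start/finish times) is kept in the
    # container status; "Reason: OOMKilled" would show up here.
    kubectl -n kube-system describe pods -l k8s-app=calico-typha | grep -A6 'Last State'

    # Restarts and kills also surface as pod events (if they haven't aged out yet).
    kubectl -n kube-system get events --sort-by=.lastTimestamp | grep -i typha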
[13:13:06] but it's still weird why all 3 replicas were down at some point
[13:13:08] not yet
[13:13:17] still collecting data
[13:18:12] I've resolved the statuspage incident, cc jynus
[13:21:20] _joe_: very facepalm...it got oom killed
[13:23:47] oops '^^
[13:24:53] so it probably is kinda related to the deployment going on, as well as maybe extra load due to reconciliation because it had to reconnect to the masters, or reconnections of the calico controllers (which were also oom killed)
[13:27:23] https://grafana-rw.wikimedia.org/d/wX9rDrIIk/wip-service-resource-usage-breakdown?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=k8s&var-namespace=kube-system&var-container=calico-typha&from=now-24h&to=now&viewPanel=13
[13:28:08] spike time fits
[13:28:50] not sure why the spike exists though
[13:34:56] I think it's the above stampede situation
[13:35:49] calico-controllers being oom killed probably means additional load on typha, which was already loaded because of scap (rolling quite a bunch of containers)
[13:36:03] that led to typha(s) being oom killed
[13:38:21] <_joe_> ok that's not great
[13:42:18] obviously not. I wonder why those were OOM targets actually, as they run with the highest priority
[13:43:52] I know
[13:44:04] it hit the memory limit
[13:44:30] root@deploy2002:~# kubectl describe deployments.apps -n kube-system calico-typha | grep -A2 Limits
[13:44:31] Limits:
[13:44:31]   cpu:     300m
[13:44:31]   memory:  150Mi
[13:44:46] now why that limit is there, I don't know that part yet
[13:46:03] <_joe_> that's... very restrictive
[13:47:38] way too close, yes
[13:48:52] yes, sure it did...but it is marked as system-critical
[13:49:33] and it wasn't very restrictive initially...we just never increased it when adding all the nodes and pods :/
[13:49:40] (oh, looks like nothing is overriding the limit in the chart)
[13:49:41] yeah
[13:50:02] we're overriding in main.yaml
[13:50:07] I've a patch ready
[13:50:13] oh, sorry
[13:50:36] ok, I'm way behind, sorry about that
[13:53:33] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/983191
[13:53:34] no
[13:53:49] no problem :)
[13:54:02] I had a head start ;)
[13:54:29] well, TIL that typha exists, so there is that :D
[13:54:31] plus I'm the one that probably should have taken care of / thought about this in the first place
[13:58:21] thanks for feeding the datapoints to the doc k.amila_ <3
[13:58:30] sure :-)
[14:02:45] meta question (this might be the wrong channel/time): should the incident template have an action items section?
[14:05:41] I guess it should :)
[14:05:51] would you mind double checking the CR for sanity kamila_?
[14:07:10] jayme: uh, I was just about to mention that CI looks funky
[14:07:53] * jayme mumbles
[14:08:07] something somewhere might be getting inherited in a weird way, so that CR is somehow coming up with a 150m CPU limit
[14:14:27] and I messed up memory...will come back to you :)
[14:17:58] ack
[14:19:15] btw, do we call it resolved now?
[14:22:35] yeah, I think we can
[14:23:19] ah, already done on statuspage, will update doc too, thanks
[14:24:43] hmmm we are at 68 nodes
[14:25:04] IIRC there was some "formula" between number of nodes and number of typha instances
[14:27:10] uh, there's also "Although Typha can be used with etcd, etcd v3 is already optimized to handle many clients so using it is redundant and not recommended."
[14:27:22] at https://docs.tigera.io/calico/latest/reference/typha/overview
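[Editor's note: a quick way to put numbers on the node-count vs. typha-replica question above — a sketch only, assuming the calico-typha Deployment in kube-system shown earlier:]

    # Current node count vs. configured typha replicas
    kubectl get nodes --no-headers | wc -l
    kubectl -n kube-system get deployment calico-typha -o jsonpath='{.spec.replicas}{"\n"}'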
[14:28:38] akosiaris: it's not a problem of not enough typhas
[14:28:52] yeah, I just saw again https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/724957
[14:28:59] kamila_: that's a different thing. calico supports using the k8s api as a datastore *or* etcd directly
[14:29:04] supposedly with 3 we can go to 100s of nodes
[14:30:12] we used to do calico <-> etcd directly in the past btw. We no longer do (for a good while now)
[14:30:25] kamila_: patch is updated (deep merge mess)
[14:31:40] * akosiaris reviewing too
[14:32:10] kube-controllers got affected too?
[14:32:19] as in directly or as an aftermath?
[14:32:55] actually, it probably doesn't matter. I doubt we want to see either throttled or OOM killed in any case
[14:33:12] directly I guess
[14:33:43] it holds watchers for endpoints and nodes...those probably grew by quite a bit
[14:34:46] the reason I'm keeping limits for memory is to keep it in the guaranteed class. Given that memory consumption was quite stable over the years, that should be okay
[14:34:53] if we keep an eye on it
[14:35:07] or some alert'ish thing
[14:45:27] the change has been rolled out
[14:46:34] <_joe_> jayme: tbh we should alert somehow if typhas are down
[14:46:46] yes, next on the list
[18:29:38] when updating a certificate (cergen / envoy) per https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate and it tells me to "cert clean" it on the Puppet CA, do I still use puppetmaster1001 or do I have to think about puppetserver* now?
[19:22:51] does anyone know how to set an nginx header only if the request doesn't already have that header? I'd like to conditionally set the Accept: header. nginx conf and testing here: https://phabricator.wikimedia.org/P54463#220330
[20:38:42] nm...think I got it. Will update my paste shortly
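[Editor's note: a closing sketch, not from the log, tying together two loose ends from earlier — the 14:34 point about keeping typha in the Guaranteed QoS class (which requires requests == limits for every container) and the 14:46 note about alerting when typha is down. It assumes the upstream k8s-app=calico-typha label; a real alert would watch equivalent metrics rather than run kubectl:]

    # Confirm the typha pods stayed in the Guaranteed QoS class after the limits change
    kubectl -n kube-system get pods -l k8s-app=calico-typha \
      -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'

    # Manual availability check for typha (the condition an alert would watch)
    kubectl -n kube-system get deployment calico-typha \
      -o jsonpath='{.status.availableReplicas} of {.spec.replicas} replicas available{"\n"}'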