[13:22:49] I will start by looking at the logstash logs
[13:23:09] ack effie, thank you
[13:23:17] thanks, looking as well
[13:23:23] so the calico pods are crashlooping afaics due to the typha ones being down
[13:23:24] is this a rerun of the issue we had with typha a few weeks ago?
[13:23:28] who's IC?
[13:23:32] not super clear in the logs why they are down
[13:24:18] I'll take IC
[13:25:13] they are failing the health probe afaics
[13:25:15] https://docs.google.com/document/d/1_cgPVSajxMKcN66tCl8ldB38Utl4RORUU3u_SsKXcsI/edit#heading=h.95p2g5d67t9q
[13:25:41] elukey: can you share which host? I want to check the bird stuff
[13:26:05] sure
[13:26:09] mw1386.eqiad.wmnet
[13:26:13] kubernetes1024.eqiad.wmnet
[13:26:17] kubernetes1022.eqiad.wmnet
[13:27:03] I'll post a status update
[13:27:05] fyi: there was a deployment when the outage started. maybe scap terminated at the wrong time?
[13:27:08] ok if I try to delete one typha pod to see if anything recovers?
[13:27:20] I think the first port of call is the calico-controller being down?
[13:27:52] I see calico-typha running on kubernetes1024. is that it?
[13:28:04] elukey: sounds like a start
[13:28:15] hnowlan: that too yes
[13:28:16] Error getting cluster information config ClusterInformation="default" error=Get "https://10.64.72.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
[13:28:57] two typha pods are running now though
[13:29:00] did anybody do anything?
[13:29:03] is etcd ok btw?
[13:29:06] elukey: not me
[13:29:22] some typhas have been up throughout I think
[13:29:31] nothing obvious from logstash at first glance, other than calico saying that typha is not running
[13:29:50] hnowlan: yeah two are running now
[13:30:21] and I see calico pods running as well, not all of them
[13:31:15] side note - shall we depool k8s services in eqiad?
[13:31:41] looking at https://grafana.wikimedia.org/d/p8RgaNXGk/calico-typha metrics are back now FWIW
[13:32:25] confirmed, all typha pods up
[13:32:34] recovering
[13:32:49] calico-nodes up
[13:32:49] what was the action here?
[13:32:59] nobody did anything IIUC
[13:33:02] I think no one did anything
[13:33:08] ok, fun :)
[13:33:11] sigh
[13:33:24] but I can't help thinking that something else triggered this
[13:33:29] thumbor is back says monitoring
[13:33:37] kubernetes did the magic? (crashlooping and restarting)
[13:33:42] !incidents
[13:33:42] 4557 (ACKED) [2x] VarnishUnavailable global sre (varnish-text)
[13:33:43] 4558 (ACKED) [2x] HaproxyUnavailable cache_text global sre ()
[13:33:43] 4559 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[13:33:43] 4560 (ACKED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[13:33:43] 4561 (ACKED) [2x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[13:33:43] 4556 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[13:33:54] from the metrics it seems that it all started at around 13:11 UTC, am I right?
[13:34:27] first page was 13:16
[13:34:29] I wonder if there was a network hiccup somewhere
[13:34:35] the timing does match the security deploy, kind of
[13:35:20] typha down again
[13:35:22] it does not make sense to be related
[13:35:27] calico-typha-75d4649699-h7vgq
[13:35:27] excellent
[13:35:29] 13:15 the 5xx spike started. now it's way down
[13:36:14] 2024-04-03 13:33:30.491 [WARNING][7] sync_server.go 479: Currently have too many connections, terminating one at random. connID=0x129 current=421 max=420 thread="numConnsGov"
[13:36:29] 2024-04-03 13:33:29.442 [INFO][7] sync_server.go 636: Failed to read from client client=10.64.0.58:56156 connID=0x3e error=read tcp 10.64.16.61:5473->10.64.0.58:56156: use of closed network connection thread="read"
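For reference, a minimal triage sketch along the lines of the checks above, assuming an admin kubeconfig for the eqiad cluster; the k8s-app label selectors are the upstream Calico defaults and are an assumption here, and the pod name is the one pasted above:

    # pod status for typha and calico-node
    kubectl -n kube-system get pods -l k8s-app=calico-typha -o wide
    kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
    # health-probe failures and restarts show up under Events
    kubectl -n kube-system describe pod calico-typha-75d4649699-h7vgq
    # logs of the previous (crashed) container instance
    kubectl -n kube-system logs --previous calico-typha-75d4649699-h7vgq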
[13:36:40] does this look interesting to anyone? the etcd dashboard seems to not be ok https://grafana-rw.wikimedia.org/d/Ku6V7QYGz/etcd3?orgId=1&var-site=eqiad&var-cluster=kubernetes&var-instance_prefix=kubetcd
[13:36:42] CPU usage on kubemaster1001 went up from 30% to CPU exhaustion at 13:06
[13:37:14] there is some throttling in the kube-system namespace: https://grafana.wikimedia.org/d/Q1HD5X3Vk/elukey-k8s-throttling?orgId=1&var-dc=thanos&var-site=eqiad&var-prometheus=k8s&var-sum_by=container&var-service=kube-system&var-container=All&from=now-3h&to=now probably calico node pods need more resources?
[13:37:14] calico nodes have 10x cpu consumption: https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s&var-namespace=kube-system&var-pod=calico-node-26zzs&var-pod=calico-node-2k4ws&var-pod=calico-node-2mfsg&var-container=All
[13:37:22] Hello everyone - I just noticed plenty of ResourceLoader exceptions - mostly “Cannot access the database: could not connect to any replica DB server”. Started recently, around 25 mins ago. Still happening - eqiad DC - I assume it’s related
[13:37:42] jynus: not yet re: incident status, still ongoing
[13:37:48] pmiazga: see status panel https://www.wikimediastatus.net/
[13:37:52] swift/thumbor spikes back down to normal
[13:37:57] pmiazga: yes related, thanks
[13:38:33] * inflatador is a bit surprised that prod control plane nodes only have 4 vCPU/12GB RAM
[13:39:06] hnowlan: could it be that typha pods are handling too many conns for some reason? That we reached a tipping point
[13:39:08] XioNoX and/or topranks have you lot seen anything on your end?
[13:39:18] but I guess that wasn't the direct cause anyway
[13:39:27] maybe just raising their replicas could be enough to step out of the mud
[13:39:51] effie: anything? :) Not that I'm aware of, no
[13:40:18] elukey: yeah I was wondering that, trying to verify now
[13:40:38] XioNoX: it is never a bad idea to ask the networking people if all is well on their end :)
[13:40:45] hnowlan: I am ready to double the typha pods
[13:41:54] appservers RED dashboard - back to normal. wikimediastatus - wiki response time - back to normal
[13:41:55] if we have an ongoing problem, this will just buy is some time
[13:42:01] us*
[13:42:11] https://grafana.wikimedia.org/d/p8RgaNXGk/calico-typha?orgId=1&from=now-1h&to=now cpu use was climbing before this kicked in
[13:42:36] a spike in updates as well
[13:42:46] what triggered all of this?
[13:42:52] unclear so far
[13:42:57] * akosiaris still reading backlog across 3 different channels
[13:42:59] typha conns
[13:43:00] elukey@mw1386:~$ sudo nsenter -t 1813037 -n netstat -tunap | wc -l
[13:43:00] 1040
[13:43:37] akosiaris: calico went into crashlooping.. then it came back
[13:43:39] akosiaris: I have noticed some error messages related to too many conns handled by typha pods, would it be safe/ok to just bump the replicas to say 6?
[13:44:09] IIRC increasing typha pods doesn't bring any trouble, and it may be a quick way to see if they are overwhelmed for some reason
[13:44:15] maybe we reached a tipping point?
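For reference, a rough way to compare the per-typha connection count against the max=420 from the log, plus a hand-applied version of the replica bump being discussed; both assume root/kubectl access, and in practice the replica change would go through deployment-charts/helmfile rather than a manual scale:

    # typha listens on host port 5473 on the k8s nodes; count established Felix connections to one instance
    sudo ss -Htn state established '( sport = :5473 )' | wc -l
    # quick mitigation sketch only: bump the typha replica count
    kubectl -n kube-system scale deployment calico-typha --replicas=6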
[13:44:42] Based on the updates then I would move the status to monitoring (not resolved), without the cause being identified
[13:44:45] IIRC the docs say that with the number of typha pods we got we should be able to serve 10x what we got
[13:45:00] jynus: agreed
[13:45:17] done
[13:45:37] akosiaris: yeah I agree, but maybe there is some constraint in the way we set up the pod that causes this issue
[13:45:51] as an additional data point, I can edit normally even if the edit rate seems to be lagging
[13:46:25] > Since one Typha instance can support hundreds of Felix instances, it reduces the load on the datastore by a large factor.
[13:47:21] unless we are hitting some bug like e.g. https://github.com/projectcalico/calico/issues/5629 and given we are in monitoring status, I don't see what immediate effect increasing the typha pods would hae
[13:47:23] have*
[13:47:42] I'd rather we gathered some evidence/logs/metrics etc first to figure out what happened
[13:48:14] akosiaris: I reported some logs above, I am not throwing this out at random :)
[13:48:41] ah this one? sync_server.go 479: Currently have too many connections, terminating one at random. connID=0x129 current=421 max=420 thread="numConnsGov" ?
[13:48:46] godog: sorry to bother you again, do we create a ticket as sort of agree or even reuse T361705?
[13:48:46] sorry, missed that one
[13:48:46] T361705: ProbeDown - https://phabricator.wikimedia.org/T361705
[13:49:00] *we sort of agree for incidents
[13:49:08] Apr 3 13:12:37 kubernetes1024 kernel: [18671255.692578] calico-typha invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=3, oom_score_adj=998
[13:49:27] late to the party
[13:49:29] akosiaris: yeah exactly, but I totally agree that it shouldn't be the case that typha is totally overwhelmed, I thought to just see if raising the replicas could bring some relief
[13:49:33] jynus: no bother at all, yes I'll be opening a tracking task
[13:49:42] does a typha failure bring down the calico controller?
[13:49:47] looking at the core routers it looks like they got TCP RST to the keepalive packets and took it down
[13:49:51] "BGP peer 10.64.32.109 (External AS 64601): Error event Broken pipe(32) for I/O session"
[13:49:56] ^^ these kinds of logs for them all
[13:49:59] I will then close the automatic one, once I check it is back up again
[13:50:34] first at 13:15:05 UTC
[13:50:47] jynus: https://phabricator.wikimedia.org/T361706
[13:50:48] hnowlan: IIRC yes, but the calico controller being down doesn't cause an immediate issue. If it is coupled with an mw deploy though, mw will suffer
[13:51:00] akosiaris: there was a deploy
[13:51:09] ah, there's our trigger then
[13:51:13] security deploy
[13:51:43] akosiaris: looks like it, but how? I mean, we deploy often
[13:52:05] jelto: we cross-posted, feel free to reopen
[13:52:19] 13:11:44.580095 ... "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"calico-typha\" with CrashLoopBackOff .. restarting failed container=calico-typha
[13:52:23] since we seem to be on our way to recovery (fingers crossed) and I have a meeting in 10, would someone mind taking over IC?
[13:52:32] it tried to restart it but failed at first.. then eventually it worked
[13:52:45] effie: not sure yet, but the sheer amount of pods might have pushed something beyond some threshold.
[13:52:46] jynus: all good, that was just an automated task for collab k8s services.
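For reference, a small sketch for confirming when and where typha was OOM-killed, assuming shell access to the kubernetes10xx nodes:

    # kernel OOM events with wall-clock timestamps (dmesg's [seconds] prefix is seconds since boot)
    sudo dmesg -T | grep -E 'calico-typha invoked oom-killer|Killed process'
    # the same events from the persisted logs
    sudo grep -h 'calico-typha invoked oom-killer' /var/log/kern.log /var/log/syslog | sort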
[13:53:08] interesting, sal is down
[13:53:13] https://sal.toolforge.org/
[13:53:48] this copy works: https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:21] i restarted sal
[13:54:26] I don't see an MW deploy though
[13:54:28] taavi: thanks!
[13:55:12] 13:06 < Dreamy_Jazz> I have two security patches to deploy. I will say once I'm done.
[13:57:10] godog: go to your meeting, I will keep updating the doc
[13:57:21] mutante: thank you <3 appreciate it
[13:57:30] thanks both! should we update wikimediastatus.net as well now?
[13:57:30] using deploy_security.py ... ?
[13:57:47] sukhe: ready when you make the call
[13:58:02] sorry, that was for mutante
[13:58:10] yea, let's update it, jynus
[13:58:12] jynus: not that I am the authority but +1 since the dashboards look good
[13:58:14] resolve it
[13:58:20] doing it
[13:58:52] akosiaris: yes, Dreamy_Jazz says yes
[13:58:55] what on earth is deploy_security.py ?
[13:59:01] akosiaris, hnowlan, effie - I was checking https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s&var-namespace=kube-system&var-pod=calico-typha-75d4649699-99xsw&var-pod=calico-typha-75d4649699-cl5rz&var-pod=calico-typha-75d4649699-h7vgq&var-container=All&from=now-3h&to=now, typha pods are running close to their memory limits
[13:59:36] mutante reported an OOM kill on a kube node which corresponds to calico-typha-75d4649699-cl5rz
[13:59:40] maybe invite him here or to a private channel for discussing details?
[13:59:52] hnowlan: yeah I confirm from mw1386's dmesg
[13:59:56] I'll reach out to him privately
[14:00:00] talking to Dreamy_Jazz
[14:00:02] thanks
[14:00:06] ah you are already? thanks
[14:00:09] [Wed Apr 3 13:09:59 2024] calico-typha invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=997
[14:00:32] this is right before the 13:11 mark in grafana, where typha pods stopped publishing
[14:00:47] elukey: ok, that's a smoking gun for why calico-typhas were killed. I assume this gradually got all 3
[14:00:48] 13:59 https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Deployment:_via_script
[14:00:53] akosiaris: ^
[14:01:02] which brings us to the other issue, what happened at ~13:00 UTC that caused all this
[14:01:16] this was pretty much what happened last time no?
[14:01:23] or was it the other way around, the controller got oomkilled
[14:01:53] IIUC since their working memory is very close to their cgroup's limits anything that caused a little more load could have caused the tip over
[14:02:10] and once one typha was down, the other two got more conns, etc..
[14:02:27] yeah, that's what I meant by gradually
[14:02:40] one got killed, traffic moved to the rest and this percolated
[14:02:43] elukey: I would argue that even if the limits were higher, we would end up with the same situation
[14:03:05] yea, so if you do a "grep oom-killer /var/log/syslog" on kubernetes1024.. it happens all the time with node.. but for calico-typha the first one is 13:10:35
[14:03:26] on kubernetes1024:
[14:03:26] [Wed Apr 3 13:08:38 2024] calico-typha invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=998
[14:03:41] the node part is easy to explain, especially if there is a zotero container running there
[14:03:43] then it tried to restart pods but somehow failed.. then it tried again and it worked and things came back.. afaict
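For reference, a sketch of how the kubelet's view of those failed restarts could be confirmed from the API while the status is still fresh, assuming kubectl access (the label selector is again the upstream default):

    # restart count plus reason/time of the last container termination (expect OOMKilled)
    kubectl -n kube-system get pods -l k8s-app=calico-typha \
      -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,LAST:.status.containerStatuses[0].lastState.terminated.reason,AT:.status.containerStatuses[0].lastState.terminated.finishedAt'
    # recent BackOff events, while they are still retained
    kubectl -n kube-system get events --field-selector reason=BackOff --sort-by=.lastTimestamp | tail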
[14:03:58] and on kubernetes1022
[14:03:59] [Wed Apr 3 13:04:48 2024] calico-typha invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=998
[14:04:09] there is one way to find out, shall we ask Dreamy_Jazz to deploy again? either way their deploy does seem to have gone well anyway
[14:04:10] that seems the winner, the one that crumbled first
[14:04:23] but we *should* in theory be able to continue with 2 typhas no?
[14:05:01] hnowlan: yes yes connection wise probably, maybe memory/cpu wise there is a bottleneck
[14:05:32] effie: I wouldn't risk another outage if we don't apply some bandaid first
[14:05:34] either way 300mb seems like too little memory for the typhas
[14:05:35] is someone here a channel operator?
[14:05:43] to invite Dreamy_Jazz maybe
[14:05:52] they say the code is half in production
[14:06:06] mutante: this is a public channel
[14:06:10] they can just /join
[14:06:29] oh, heh, it told me "you are not operator" when i did /invite :P
[14:06:37] mutante: so you restarted some typha pods? I didn't see it in the channel, do you recall more or less when and how many times?
[14:07:05] elukey: no, I did not restart, all I did was look at syslog and dashboards
[14:07:07] Just invited them
[14:07:09] I think mutante wrote "it" meaning Kubernetes
[14:07:14] Hello
[14:07:25] mutante: ahhh okok I saw that you wrote "I tried to restart pods" and I asked, thanks :)
[14:07:27] yea, I meant what Jelto said.. it did its thing to restart pods
[14:07:47] Dreamy_Jazz: so you said the code is only half deployed?
[14:07:48] Dreamy_Jazz: hi!
[14:08:03] internet went out ?
[14:08:39] Yeah. It seems that /srv/mediawiki-staging/php-1.42.0-wmf.24/extensions/CheckUser has the patch applied but /srv/patches does not
[14:08:52] My internet dropped outside the house (my router had several red lights)
[14:09:13] So the console session I had open doing the deploy disconnected
[14:09:24] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1016794 for typha memory limits
[14:10:04] would be nice if we had k8s events to see the oomkill on the pod level
[14:10:08] but we don't store that far back
[14:10:19] Dreamy_Jazz: so, I assume that restarting the process will finish the deploy ?
[14:10:21] The reason I say this is that running `git log` in `/srv/mediawiki-staging/php-1.42.0-wmf.24/extensions/CheckUser` shows the security patch that I was deploying
[14:10:44] I think it should, but the script might not handle applying the patch when the patch is already there
[14:11:05] I will also continue my meeting now that the incident has passed, but ping me if you need more hands for anything
[14:11:36] as long as the script uses scap apply-patches it should be fine
[14:11:47] I don't know if it does that.
[14:11:48] ok, what if we merge hnowlan's patch to increase memlimits (I am curious if that was the only issue), and then have Dreamy_Jazz redeploy
[14:12:06] The script is https://gitlab.wikimedia.org/repos/releng/release/-/raw/main/deploy_security.py
[14:12:35] It uses scap sync-file
[14:13:21] does redeploy mean having to revert and deploy again?
[14:13:29] of course it does not :/
[14:13:45] i think the easiest way is to manually add the patch to /srv/patches, commit, run scap apply-patches, and then scap sync-world
[14:14:05] I have a silly question I think, is that part of the documentation up to date?
[14:14:38] because I would expect something more than sync-file there
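For reference, a sketch of the manual recovery taavi describes, run on the deployment host; the source path of the patch file is hypothetical, and the 02- numbering follows the convention explained a bit further down:

    cd /srv/patches/1.42.0-wmf.24/extensions/CheckUser
    cp ~/T361479.patch 02-T361479.patch    # hypothetical source path for the new security patch
    git add 02-T361479.patch
    git commit -m 'CheckUser: security patch for T361479'
    scap apply-patches                     # a no-op for patches already applied to the checkout
    scap sync-world                        # and run it inside tmux/screen, per the later discussion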
[14:14:56] It has been the script I've used several times before to apply security patches successfully.
[14:14:57] hnowlan: maybe we could set 500m? Wondering how much free memory we need, I am also fine doubling
[14:15:17] (disclaimer, I have not been keeping up with scap changes regarding mw-on-k8s)
[14:15:28] i suspect that script is the best thing there is today. I think ideally the script should be replaced by something built into scap that does what I said
[14:15:38] For example, I used it this morning for T361293
[14:16:14] I may be missing context here, so I will stop
[14:16:29] Dreamy_Jazz: at around 10 UTC I would guess?
[14:16:46] Yeah, shortly after I had posted the comment on that task
[14:16:48] elukey: yeah, I don't mind. 600 just seems pretty cheap for something so critical
[14:17:07] `10:20 logmsgbot: dreamyjazz Deployed security patch for T361293`
[14:17:13] hnowlan: ack ack
[14:17:26] From https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:46] hnowlan: +1ed
[14:17:51] That was the point when the deployment finished (as the last stage is to make a log message)
[14:18:50] follow-up: make the security deploy script use scap apply-patches ?
[14:19:16] elukey: hnowlan https://grafana.wikimedia.org/d/p8RgaNXGk/calico-typha?orgId=1&from=now-6h&to=now&viewPanel=78
[14:19:23] there is a pattern with the deployments
[14:20:09] let's do what taavi said and get the security patches manually into /srv/patches and get this deployed all the way.. so we know it's in a stable state and we also see if something happens again?
[14:20:22] as long as people are still around
[14:20:36] I would bump typha's memory first
[14:20:42] before any other attempt
[14:20:46] ack
[14:20:51] I agree
[14:21:17] effie: yeah cpu is a concern too, I didn't see any throttling thought, that is good
[14:21:25] *though
[14:21:32] If that involves a scap backport, I wonder whether it might cause the git log to reset back to before my half-complete security deploy?
[14:22:23] Because scap backport applies the patches as one of the first few steps, so perhaps the patch not being in `/srv/patches` might mean it doesn't then get applied?
[14:23:19] effie: there are some spikes that don't correspond to deployments
[14:23:25] but there might be something there
[14:24:02] what I propose is adding your new patch to /srv/patches and then running `scap apply-patch` which is the part of backport you're describing
[14:24:12] apply-patches*
[14:24:14] hnowlan: that is what I am trying to correlate, but around ~10 UTC we had a similar situation that didn't escalate as much
[14:25:09] At 10 UTC there would be two different sync-file commands run as it is run separately for each wmf version in deploy_security.py
[14:25:12] Dreamy_Jazz: wait for hnowlan's go ahead, however you decide to move forward
[14:25:21] Ofc
[14:25:26] <3
[14:26:10] The first deploy ended at 10:06 UTC based on the server admin log `10:06 logmsgbot: dreamyjazz Deployed security patch for T361293`
[14:26:32] Which seems to correlate to the two spikes on that graph at 10 UTC
[14:26:53] I'll file a scap feature request to add a feature to do the entire patching operation in a much smarter way than what the current script does
[14:27:04] 👍
[14:27:19] yea, the spikes are at the deployment times it seems: https://grafana.wikimedia.org/d/p8RgaNXGk/calico-typha?orgId=1&from=now-12h&to=now
[14:28:19] effie, Dreamy_Jazz: increase is in place, go ahead
[14:29:10] Okay. I'll start by adding the patch file to /srv/patches
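For reference, a way to double-check the typha container's memory request/limit once the bump is rolled out, assuming kubectl access; the PromQL metric names in the comments are assumptions about what the linked Grafana panels plot:

    # current requests/limits on the live Deployment
    kubectl -n kube-system get deploy calico-typha \
      -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
    # usage vs. limit, roughly what the dashboards show:
    #   container_memory_working_set_bytes{namespace="kube-system", container="calico-typha"}
    #   kube_pod_container_resource_limits{resource="memory", namespace="kube-system", container="calico-typha"}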
[14:32:41] Added the patch to /srv/patches
[14:32:47] Only for wmf.24
[14:32:57] As the script did not reach wmf.25
[14:33:36] now you have two 1.42.0-wmf.24/extensions/CheckUser/01- patches? also you will need to commit them
[14:34:05] Yes it is 01-T361479.patch
[14:34:39] What needs to be done to commit them?
[14:35:02] The change is already in the `/srv/mediawiki-staging/php-1.42.0-wmf.25/extensions/CheckUser`
[14:35:07] *already applied
[14:35:22] the first number should be sequential for each patch, so the new `01-T361479.patch` should be `02-T361479.patch` as there is already a 01-something.patch
[14:35:41] and /srv/patches is a local git repository, so git add and commit the patch file afterwards
[14:35:45] There wasn't a file called 01-T361479.patch until I added it
[14:35:53] Ah I see.
[14:37:33] Added the commit.
[14:37:43] it's still numbered wrong?
[14:38:13] Oh do you mean the numbering is independent of the ticket number?
[14:38:28] yes
[14:38:31] Ah I see.
[14:38:35] I will fix that.
[14:39:12] I had misunderstood what the number meant (I thought it was in case there were multiple security patches for the same ticket)
[14:39:30] Fixed
[14:39:47] The directory is now `01-T361293.patch` and ` 02-T361479.patch`
[14:39:58] * `02-T361479.patch`
[14:40:10] ok, that looks correct
[14:40:32] now run `scap apply-patches`
[14:41:51] Done
[14:41:58] Output said all patches were already applied
[14:42:33] and /srv/mediawiki-staging/php-1.42.0-wmf.24/extensions/CheckUser matches what you expect?
[14:42:48] Yes
[14:43:07] Both security deploy patches on the wmf.24 origin branch
[14:43:24] good, now you can run `scap sync-world` to sync the changes out (and add `--pause-after-testserver-sync` if you need to test it first)
[14:44:13] following syslog on kubernetes1024 where the oom-killer showed up before
[14:44:36] calico still looks calm
[14:45:30] probably unrelated but wikifeeds is still paging due to 503 - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway
[14:46:31] At the stage of helmfile deployments
[14:46:39] on the scap sync-world command
[14:48:24] Command has failed
[14:48:38] Or just got a load of error outputs
[14:48:44] It seems to still be running
[14:48:53] It is saying the configuration file is group-readable
[14:49:14] in calico graphs: CPU saturation just went up minimally but far from last time
[14:49:34] A small extract of a large amount of logs:
[14:49:45] https://www.irccloud.com/pastebin/YMmCsn7o/
[14:50:14] Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
[14:50:25] ?
[14:50:29] I assume that means that we need to manually rollback the currently in-progress upgrade?
[14:50:46] Full logs:
[14:51:02] Actually the full logs are too long
[14:51:21] For IRC
[14:51:30] https://phabricator.wikimedia.org/paste/
[14:51:36] Dreamy_Jazz: use phab
[14:51:52] let me check something please
[14:52:19] https://phabricator.wikimedia.org/P59348
[14:52:55] I set visibility for the paste to all users
[14:53:06] I think nothing private was in the output
[14:54:09] Should I exit sync-world?
[14:54:13] It is still running
[14:54:40] godog: looking at that
[14:55:36] cheers hnowlan
[14:56:23] Dreamy_Jazz: let it finish for now
[14:56:28] fwiw: during the deploy I saw "node invoked oom-killer" (which seems to be normal) but nothing happened with calico-typha this time
[14:56:32] Okay.
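For reference, a sketch of how a helm release stuck in that "another operation is in progress" state is usually inspected and unblocked; the release/namespace names are placeholders, the re-apply would normally go through helmfile in deployment-charts, and the actual mending here was done by serviceops later in the log:

    # the failed upgrade leaves the release in a pending-* state
    helm -n mw-web history mw-web --max 5
    # roll back to the last revision marked "deployed", which clears the lock
    helm -n mw-web rollback mw-web <last-deployed-revision>
    # then re-apply the intended state from the service's helmfile directory
    helmfile -e eqiad apply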
[14:56:56] I just got another batch of the same logs (regarding the file being group-readable)
[14:57:02] (fwiw filed https://phabricator.wikimedia.org/T361709 in scap)
[14:57:17] adding that to the doc, thanks taavi
[14:57:53] Dreamy_Jazz: 'WARNING: Kubernetes configuration file is group-readable' is just a warning, the actual error is on the next line
[14:58:10] taavi: it is ok to ignore that
[14:58:33] Dreamy_Jazz: let me know when sync-world has exited
[14:59:11] hnowlan: I was thinking of manually deploying the failed deployments, as I see
[14:59:14] https://usercontent.irccloud-cdn.com/file/vQUtH0dc/image.png
[14:59:28] in the diff of eg mw-web
[15:00:08] or just deploy all of mw* again
[15:02:58] effie: sync-world has now finished.
[15:03:29] Dreamy_Jazz: can you please update your phab paste?
[15:03:36] Sure.
[15:04:05] calico-typha has two characteristic bumps again in the cpu usage but lower than last time
[15:04:12] Done
[15:04:21] https://grafana.wikimedia.org/d/p8RgaNXGk/calico-typha?orgId=1&from=now-6h&to=now&viewPanel=78
[15:04:53] jelto: yeah, mw* were not actually deployed
[15:07:12] wikifeeds errors on the way down
[15:08:11] wikifeeds page just resolved
[15:08:19] Dreamy_Jazz: did you run your deployments in a screen or tmux?
[15:08:32] a roll-restart fixed it, annoyingly - seems like there might be some kind of persistent envoy behaviour that was possibly marking backends as down when they had recovered?
[15:08:32] nice, thanks. what was the resolution on that?
[15:09:00] I ran scap sync-world and the security deploy without tmux or screen
[15:09:09] hnowlan: ah
[15:09:43] how is the deployment situation now? do we need to deploy all of mw* again ?
[15:09:48] Dreamy_Jazz: it would be great if you don't do that again :)
[15:09:56] mutante: we are looking into it
[15:10:01] ack, thanks
[15:10:16] Yeah. I think I've learnt my lesson on using tmux :)
[15:10:28] I think we should put a big warning there
[15:10:44] I think you should
[15:10:56] I was not aware that I should be using tmux or screen when running these commands
[15:11:24] hm, I don’t usually use a server-side tmux either…
[15:11:33] since I already use a local tmux to SSH into different systems side-by-side
[15:11:42] (using the script at https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers/Script)
[15:11:45] should I change that?
[15:12:14] so far seems like.. calico controller down AND deployment needs to happen at the same time to get an outage.. AND deployment got interrupted but that was a separate issue
[15:13:48] I dunno if they're that tightly coupled - the typha issues started *as* the deployment happened
[15:14:37] based that statement on "the calico controller being down doesn't cause an immediate issue. If it is coupled with an mw deploy though, mw will suffer"
[15:15:15] the controller itself went down because of the typha issue I believe
[15:15:48] nods
[15:17:20] Lucas_WMDE: you should be using a server-side tmux, I was not aware this is not properly documented
[15:18:07] https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers only mentions tmux / screen in the maintenance scripts section at the end, at least :S
[15:18:26] Lucas_WMDE: as I said, I was not aware
[15:18:31] should I make a task? (don’t want to interrupt the other discussion here too much)
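For reference, the server-side tmux pattern being recommended, on the deployment host:

    tmux new -s secdeploy       # start a named session (the name is arbitrary)
    # ... run deploy_security.py / scap sync-world in here ...
    # detach with Ctrl-b d; the deploy keeps running if the SSH connection drops
    tmux attach -t secdeploy    # reattach later, e.g. from a fresh SSH session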
[15:18:40] because this disconnect kept some things in an inconsistent state
[15:18:58] Lucas_WMDE: I have made a note, I will sort it later
[15:19:03] ack
[15:23:22] Is there anything else I can help with regarding the incident / deployment?
[15:23:45] i feel now like the inconsistent deployment state is all unrelated to the original incident but kind of its own incident
[15:24:24] like a successful deployment would have also triggered it
[15:25:08] Dreamy_Jazz: codfw seems to be ok
[15:25:22] so serviceops will attempt to mend eqiad
[15:25:44] Okay.
[15:33:31] I am going to call the incident resolved once deployment state is also fixed.. though it seems kind of separate now.
[15:49:01] ok so, we have a situation where helm in eqiad is left in a questionable state
[15:49:38] serviceops will depool mw-web-ro from eqiad, and attempt to fix the problem in this deployment first
[15:50:01] ack, thanks for the update
[15:50:02] if that works out, we will proceed to redeploy all of eqiad
[16:00:40] thanks!
[16:26:52] rollback was successful, we will be running scap shortly
[16:27:14] Dreamy_Jazz: seems like we mostly sorted it out
[16:27:27] Great :)
[16:27:32] :) thank you Effie. calling incident resolved
[16:27:48] well, maybe after scap, but yea
[16:28:22] If everything goes to plan, can I deploy once more in a tmux session? Or should I wait till later?
[16:28:33] *deploy security patches
[16:28:41] yes, we should be back to normal
[16:28:48] :D
[16:59:49] Are you done with the scap deploy now effie?
[17:07:20] Dreamy_Jazz: scap is finished
[17:15:24] Thanks
[23:51:01] Hi! Could someone perhaps help me find why or how iptables is blocking access from within a container to the docker host IP? This is for a change to codesearch, which is puppetised at https://gerrit.wikimedia.org/g/operations/puppet/+/HEAD/modules/codesearch/manifests/init.pp.
[23:51:40] Right now, the codesearch-frontend is talking to the local Hound instance (runs on the same server, different port) but does so via a public web proxy (WMCS dynamicproxy) and thus trips rate limits given all reqs come from the same IP.
[23:53:19] I'm looking to have it talk to localhost:3002 instead of codesearch-backend.wmcloud.org, but everything I've tried has failed so far. I've found how to programmatically obtain the docker host IP via `ifconfig` (172.17.0.1 in practice), but it seems iptables is denying access to this right now. I can't find where in puppet that is being provisioned, or how I would go about changing that within the codesearch module.
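For reference, a sketch of how the blocking rule could be tracked down on that codesearch instance, assuming the drop happens in the host's INPUT chain; the rules are Puppet/ferm-managed, so the last line is a temporary test only and the durable fix would be an equivalent ferm rule in the codesearch module:

    # see which INPUT rule drops traffic arriving from the docker bridge (172.17.0.0/16)
    sudo iptables -L INPUT -n -v --line-numbers
    sudo iptables -S INPUT
    # temporary test: allow containers on docker0 to reach the Hound port on the host
    sudo iptables -I INPUT -i docker0 -p tcp --dport 3002 -j ACCEPT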