[06:16:53] serviceops, SRE, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (elukey) Just as reminder, mw1384 was [[ https://sal.toolforge.org/log/_2yPSHoBa_6PSCT9smu4 | dep...
[07:04:17] serviceops, Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (JMeybohm) helm-charts not having a service IP was a design decision because we only have (and need) one replica per DC. IIRC there are alr...
[07:42:22] serviceops, Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (Joe) >>! In T285707#7182854, @JMeybohm wrote: > helm-charts not having a service IP was a design decision because we only have (and need)...
[08:25:22] serviceops, SRE, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Joe) The problem seems to be quite clearly caused by excessive apcu locking. Let's review the sy...
[08:28:40] serviceops, SRE, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Joe) Also: the memory gets exhausted by this operation: ` $sma_info = apcu_sma_info(); ` this m...
[08:35:35] serviceops, SRE, Release, Train Deployments, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Joe) p:High→Unbreak! The recurring problem seems to...
[08:36:36] serviceops, SRE, Release, Train Deployments, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Joe)
[08:46:44] serviceops, SRE, Release, Train Deployments, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (tstarling) >>! In T285634#7181000, @Legoktm wrote: >> and al...
[09:02:28] serviceops, CX-cxserver, Wikidata, wdwb-tech, Language-Team (Language-2021-April-June): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (Nikerabbit) Ignore above, accidentally...
[09:13:01] serviceops, SRE, Release, Train Deployments, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (jijiki) I don't know if it helps, the increasing numbers of...
[09:22:01] serviceops, SRE, Release, Train Deployments, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Joe) @jijiki I think this correlation is another hint that w...
[09:24:30] serviceops, SRE, Release, Train Deployments, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Addshore) I have a suspicion that this Wikibase cache is rel...
[09:25:53] serviceops, SRE, Wikidata, wdwb-tech, and 3 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Addshore)
[09:29:33] serviceops, SRE, Wikidata, wdwb-tech, and 5 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Ladsgroup) a:Ladsgroup On to find and revert/fix the culprit
[09:29:58] serviceops, SRE, Wikidata, wdwb-tech, and 5 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Joe) I think what @Addshore just found is a good candidate for being the sour...
[09:33:16] <_joe_> I'm going to try to look at the logs on an appserver to try and determine which requests caused the apcu flurry of gets on mw1355
[09:33:22] \o/
[09:33:23] <_joe_> anyone wants to tag along?
[09:33:31] sure, as long as it doesn't interfere
[09:36:17] <_joe_> so on mw1355, sudo -i tmux -rt logs
[09:36:22] <_joe_> use a large terminal though :D
[09:36:40] I always do (it was not me with the teeny tiny one yesterday!)
[09:38:10] you're missing an "attach" in there
[09:38:11] I'm on
[09:49:33] serviceops, SRE, Wikidata, wdwb-tech, and 5 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Joe) Scavenging the production logs, we found that `Special:EntityData` reque...
[10:02:59] <_joe_> jayme / jelto / effie do you want to replay how I found the culprit in the logs?
[10:04:38] <_joe_> even after lunch, I just realized it could be lunch time for our northern friends :)
[10:05:48] yeah we could
[10:17:40] joe: I would be interested although I don't fully understand the problem/missing a bit of context.
[10:19:56] serviceops, SRE, Wikidata, wdwb-tech, and 5 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (daniel) > Scavenging the production logs, we found that Special:EntityData re...
[10:24:04] serviceops, SRE, Wikidata, wdwb-tech, and 5 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Joe) >>! In T285634#7183188, @daniel wrote: >> Scavenging the production logs...
[10:28:07] _joe_: sure!
[10:28:37] <_joe_> ok, let's do it at 13:00Z? I want to take a break given it was a very intense morning
[10:30:11] fine with me
[10:37:45] is there a canonical way to restart all pods for a service? I know there's a few ways to do it but I was wondering what's the serviceops recommended method and is it documented anywhere?
[10:44:01] <_joe_> hnowlan: last time I did it, I didn't find anything preestablished
[10:44:06] <_joe_> so maybe we should indeed
[10:45:11] the helmfile -e eqiad foo --args --recreate-pods does no longer work, right?
[10:45:26] <_joe_> I think it didn't when I tried last
[10:45:56] joe: Onboarding Chat 07 - Elasticsearch will start at 12:45Z and is planned until 13:45, so I'm blocked at 13:00Z. Would it be possible to have the session tomorrow?
[10:46:19] <_joe_> jelto: I'm off tomorrow
[10:46:39] <_joe_> and I'm going afk right now, btw :)
[10:46:43] <_joe_> ttyl!
[10:52:38] hnowlan: as SRE you could probably do kubectl rollout restart deployment foo -n foo
[10:53:09] but I'm not sure if that will work with only service deployer credentials
[10:53:56] it does not :(
[10:54:14] ah, yeah :(
[10:54:50] the main reason I ask is I'm imagining a service owner restarting a service in an emergency rather than someone with root
[10:56:23] jayme: o/ I am checking how we deploy pod-security-policies via helmfile, and IIUC it is specifically called via a sync step (not in other ways). What I am wondering atm is if helmfile_psp.yaml could have multiple releases with different names (like pod-security-policies-istio) and when we bootstrap a cluster we choose the right one (rather than adding conditionals etc..). Another thing could
[10:56:29] be, for istio, to create a simple chart for the PSP rules, that gets deployed where needed. The more I look helmfile_psp the more it seems that it contains only very generic things
[10:57:24] it does only contain generic things as it's tailored only to production clusters (which are quite simple in terms of PSP)
[10:57:59] but helmfile_psp.yaml is included as base in helmfile.yaml IIRC, so a helmfile sync will apply that as well
[10:58:18] (no specific sync step for psp needed)
[10:58:27] yep yep that is the risk, even if we don't do it specifically on wikitech
[10:59:48] for the purpose of the current istio deployment, I'd need just to allow pod creation and mount points, all the NET_ADMIN things are related to sidecar injection and init-containers that we don't use for the moment
[11:00:39] (I mean https://github.com/istio/istio/blob/master/samples/security/psp/sidecar-psp.yaml is not needed)
[11:00:45] with risk you mean the risk of stuff being deployed to all clusters?
[11:01:39] yes exactly
[11:01:54] as you were saying, a helmfile sync (without restriction) might apply all
[11:02:09] but for istio-specific, it may be ok to have psp rules in a chart
[11:02:33] there will be a similar thing for knative and kfserving I assume
[11:03:26] You can put PSPs into charts but then you'd need broad credentials to install them ofc
[11:04:00] right right
[11:04:11] I see that allow-restricted-psp is deployed via helmfile_namespace
[11:04:30] at least I see a RoleBinding
[11:05:31] you need the service account to have access to the PSP for it to apply
[11:05:40] that's why the reference is there
[11:06:02] Q: Do you need to *override* the default PSPs from helmfile_psp.yaml?
[11:06:34] or do you just need to add some?
[11:06:54] ah right so all the service accounts for $namespace gets the privileges
[11:07:04] yes
[11:07:22] for the moment I'd need istio service accounts to be able to spin up pods in the istio-system namespace
[11:08:25] probably also other things like get secrets, mount, etc..
[11:08:54] but basic things IIUC, no idea if I need to override or using the ones already there
[11:09:33] I see...we should probably arrange some time to chat about this with a bit more detail
[11:11:07] the istio sidecar machinery needs, without the use of a cni plugin, some root privileges (like NET_ADMIN) to mess with iptables via init-container, but for the moment it is not needed by ML
[11:11:27] the gateway pod uses the envoy proxy etc.. but it is meant not to require root privileges (binds to high ports etc..)
[11:11:29] what you could do for now I guess is just add a rolebinding like the one from line 100 in helmfile_psp.yaml for the istio-system namespace
[11:12:03] that you can do manually to veryfy
[11:12:09] *verify
[11:12:47] in theory allow-restricted-psp would be sufficient right?
[11:12:47] and then maybe block an hour or so for us on friday to figure out how to properly integrate that into helmfile.d
[11:12:51] yep
[11:13:08] that's basically dwtfyw
[11:13:33] ah ok so you are saying, IIUC, that I could just create a yaml file with the rolebinding, kubectl apply -f to ml-serve-eqiad and verify
[11:13:37] then bother Janis
[11:13:52] :D
[11:15:11] if so sounds good to me, it should unblock things, and I could refine the permissions manually after testing
[11:16:00] serviceops, SRE, Wikidata, wdwb-tech, and 6 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Tarrow) >>! In T285634#7183188, @daniel wrote: > > Did the code change, or i...
[11:23:42] elukey: yeah :)
[11:24:06] one does not simply write PSPs without try and error
[11:24:29] ahahhahaha ok thanks that made my day, I just imagined you like the meme
[11:24:45] will do and report back if I succeed
[11:31:34] ack
[11:47:07] Error: failed to create discovery service: failed to create CA: failed to create a self-signed istiod CA: failed to create CA due to secret write error
[11:47:10] 2021-06-29T11:40:55.797909Z error pkica Failed to write secret to CA (error: Post "https://10.64.77.1:443/api/v1/namespaces/istio-system/secrets": x509: cannot validate certificate for 10.64.77.1 because it doesn't contain any IP SANs). Abort.
[11:47:15] ...
[11:47:42] self signed CA for the control plane
[11:47:48] TIL
[11:47:51] * elukey cries in a corner
[11:48:06] I love when security makes you discover things
[11:48:12] serviceops, SRE, Wikidata, wdwb-tech, and 6 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Ladsgroup) This is basically done, we just need to wait to see if it continue...
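A minimal sketch of the manual RoleBinding suggested in the 11:11 to 11:13 messages, under stated assumptions: the log only mentions the name allow-restricted-psp and does not show "line 100 in helmfile_psp.yaml", so treating it as a ClusterRole is a guess, and the binding name here is hypothetical. The subject is the standard Kubernetes group that covers every service account in a namespace, which matches the "all the service accounts for $namespace gets the privileges" remark at 11:06.

```yaml
# Sketch only: grants all service accounts in istio-system the right to use
# the PSP referenced by the (assumed) ClusterRole "allow-restricted-psp".
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: allow-restricted-psp          # hypothetical binding name
  namespace: istio-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole                   # assumption: the log does not state the kind
  name: allow-restricted-psp          # name taken from the 11:04 message
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts:istio-system   # built-in group for the namespace
```

Saved to a file, this would be applied with kubectl apply -f against ml-serve-eqiad, as described at 11:13, and then refined after testing.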
[13:01:19] _joe_: should we postpone the playback to like thursday, friday or next week so jel.to can join as well?
[13:01:31] <_joe_> yeah the issue is
[13:01:43] <_joe_> I might be off later in the week, and y'all are off next week :P
[13:01:54] <_joe_> anyways, we'll see, maybe I'll be back on thursday
[13:02:12] ah, indeed ... forgot about the next week
[13:02:29] <_joe_> and the week after I will be off as I'll be working next week :P
[13:08:38] this is going to be some crazy time :P
[13:19:42] serviceops, SRE, Wikidata, wdwb-tech, and 6 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Joe) Open→Resolved Data on the number of apcu gets/s normalized after...
[13:35:35] serviceops, GitLab, SRE, vm-requests, Patch-For-Review: codfw: 1 of VMs requested for gitlab - https://phabricator.wikimedia.org/T285456 (Dzahn) Next we need to reserve a service IP for this new host in netbox and after the changes above are merged we can then add that new service IP to Hiera...
[13:45:27] serviceops, SRE, Wikidata, wdwb-tech, and 6 others: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Jdforrester-WMF)
[15:42:09] serviceops, SRE, Datacenter-Switchover, Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (Legoktm) The switchover is mostly complete now, we were read only from 2021-06-29 14:21:26.671853 to 2021-06-29 14:23:23.504447, or 1m57s. The raw notes...
[16:44:32] serviceops, SRE, Datacenter-Switchover, Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (sgrabarczuk)
[17:19:52] serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Doing): Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (thcipriani) p:Triage→Medium
[17:48:54] serviceops, Prod-Kubernetes, Kubernetes, Release-Engineering-Team (Radar): CI pipeline/job to build and release helm chart artifacts - https://phabricator.wikimedia.org/T257333 (thcipriani) Is this task superseded by the cronjob on the deployment machines that publishes to the chart museum?
[20:39:11] effie: what was the conclusion around mw2383? was that server having issues or was it overloaded because of the bad weights?
[23:57:33] serviceops, MW-on-K8s: MW container image build workflow vs docker-registry caching - https://phabricator.wikimedia.org/T282824 (Urbanecm)