[00:18:56] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1023.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1...
[00:21:47] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1...
[08:59:02] 10serviceops: Remove PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert)
[08:59:47] 10serviceops: Remove PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert) 05Open→03In progress
[09:00:19] 10serviceops: Remove PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert)
[09:20:45] 10serviceops: Remove PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert) p:05Triage→03Medium
[09:38:42] 10serviceops, 10Patch-For-Review: wikikube LIST secrets latency - https://phabricator.wikimedia.org/T323706 (10JMeybohm) 05Open→03Resolved Updated resource requirements have just been deployed and the special handling in alerts has been removed. Resolving this.
[10:09:34] Should I go ahead with removing the PHP opcache health warning? https://phabricator.wikimedia.org/T324649 / https://gerrit.wikimedia.org/r/865580
[10:15:30] claime: I think we should have a short discussion before we do so
[10:15:48] let's :)
[10:16:20] I need to shoot off in a bit, but the tl;dr is that
[10:16:49] when a server does not have a 99.99% hit ratio, it is either not receiving traffic
[10:17:10] or there is a problem with our code
[10:17:32] Ok, and are we actually acting on this?
[10:17:39] Because there are 33 of them right now.
[10:18:26] I am doing a mediawiki deploy, so opcache is warming up
[10:18:42] It's not just that
[10:18:56] It's 2/3 of the volume of all SRE-tagged alarms
[10:19:00] And it's a warning
[10:19:24] I mean at that point it is just noise
[10:19:24] which is why we should probably tweak the alert, rather than remove it
[10:20:04] the warning should appear if a server's opcache hit ratio is below 99.99% after e.g. an hour
[10:20:28] that should be enough for new servers to warm up, and to figure out if there are code issues
[10:21:12] there have been a few times that it was code, which is why I am not saying to just go ahead and kill it
[10:21:35] yeah yeah I get what you're saying
[10:22:09] I'll paste our convo in the task, and we'll think on it. It's not urgent anyway
[10:22:46] I will update the task
[10:22:55] thanks
[10:22:57] I will reply, I mean
[10:23:13] too many IRC logs on a task just make it harder to read
[10:23:56] Agreed, I just didn't want to put the onus on you to write a comment, and the onus on me to remember to formulate it properly ;)
[10:25:25] it is alright
[10:29:27] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10Clement_Goubert) contint1001 crashed again today, bad DIMM, had to powercycle it from iDRAC.
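For reference, a minimal sketch of the tweaked alert discussed above (fire only after the hit ratio has been low for about an hour), written as a Prometheus alerting rule. The metric name and label values are assumptions for illustration, not the production alert definition:

```yaml
# Sketch of the "warn only after an hour below threshold" idea.
# php_opcache_hit_ratio is a hypothetical metric name.
groups:
  - name: php_opcache
    rules:
      - alert: PHPOpcacheHitRatioLow
        expr: php_opcache_hit_ratio < 0.9999
        for: 1h  # grace period so deploys and new servers can warm up
        labels:
          severity: warning
        annotations:
          summary: "opcache hit ratio on {{ $labels.instance }} below 99.99% for 1h"
```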
[10:29:45] 10serviceops, 10Patch-For-Review: Remove PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10jijiki) We should think about this a little more before removing the alert altogether. When a server's opcache hit ratio is below 99.99%, it is either: * not receiving any traffic * it is war...
[11:09:25] 10serviceops, 10Patch-For-Review: Remove PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert) I've uploaded a new PS to raise the alarm only after 6 retries at 10-minute intervals. Sounds good?
[11:10:45] 10serviceops, 10Patch-For-Review: Revisit PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert)
[11:21:10] for Debian packaging of a Go application, do we have to package all the Go module dependencies, or is there a way to instruct `dh_golang` to `go get` the dependencies from the network and bundle them in the package?
[11:21:46] it is not for WMF production but for a local use case; I would like to backport a package for local usage without the hassle of backporting all of the dependencies ;)
[11:39:46] enabling service mesh for prod thumbor: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865595
[11:48:18] <_joe_> hashar: look here https://wikitech.wikimedia.org/wiki/Helm#Importing_a_new_version
[11:48:30] <_joe_> we tend to vendor all dependencies in the git repo/package
[11:48:37] <_joe_> contrary to Debian's own recommendations
[11:49:14] <_joe_> hnowlan: so you'll have thumbor using TLS?
[11:50:06] _joe_: ohh nice. Thank you :-]
[11:50:34] <_joe_> hnowlan: uhm, wait, how do you plan to do the transition?
[11:57:37] _joe_: no - maybe I'm misunderstanding the purpose of the mesh section here. I'm hoping to enable the mesh here for connecting to swift rather than for exposing thumbor
[11:57:51] <_joe_> uhm sadly it does both
[11:58:09] <_joe_> so, we can add an if guard around the local TLS termination
[11:58:19] <_joe_> let me get there after lunch
[11:59:58] At this point maybe it makes just as much sense to connect directly to swift and not use the mesh until we move thumbor to use TLS
[12:13:18] <_joe_> well it seems like a possible use case anyway, I'll see what I can do
[12:13:31] <_joe_> it's more or less adding a switch to turn the TLS-based service on or not
[12:15:50] ah cool
[13:42:17] _joe_: in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865158 i'm bringing in the flink helm chart with changes. I'm guessing I should bring in some of the vendor scaffold templates? especially the base ones, and e.g. apply our metadata everywhere?
[13:42:44] or, should I not worry about it and just keep it close to upstream for metadata too, since I think there will only ever be one flink operator deployed
[13:43:22] flink-operator helm chart *
[13:43:31] <_joe_> ottomata: what are you modifying compared to upstream?
[13:43:46] 10serviceops, 10MW-on-K8s: Helmfile apply failing on deploy server - https://phabricator.wikimedia.org/T324553 (10Clement_Goubert) 05Open→03Resolved Just confirmed the fix worked by restarting the failed service. `Dec 07 13:42:31 deploy1002 systemd[1]: train-presync.service: Succeeded.`
[13:43:53] _joe_: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865158/4/charts/flink-kubernetes-operator/README.md#11
[13:45:08] <_joe_> ack thanks
[13:46:02] mostly just removing support for things we don't want. i could keep it exactly the same as upstream and just manage the restrictions via admin_ng values somehow.
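On the `dh_golang` question above, a minimal sketch of the vendoring approach _joe_ describes, assuming the package uses Go modules (run `go mod vendor` in the source tree and ship the vendor/ directory in the source package):

```make
#!/usr/bin/make -f
# Sketch of debian/rules for a Go package built against vendored deps.
%:
	dh $@ --buildsystem=golang --with=golang

override_dh_auto_build:
	# -mod=vendor makes the Go toolchain use the bundled vendor/ tree
	# instead of fetching modules from the network at build time
	GOFLAGS=-mod=vendor dh_auto_build
```

Recent Go toolchains pick up a vendor/ directory automatically, so the override may be redundant; it is shown here only to make the intent explicit.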
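And a sketch of the "if guard around the local TLS termination" idea for thumbor's mesh config. The value names here are hypothetical, not the actual deployment-charts mesh module:

```yaml
# Hypothetical values: split the sidecar's two roles so they can be
# toggled independently.
mesh:
  enabled: true            # run the envoy sidecar (egress to swift)
  tls_termination: false   # leave inbound TLS termination off for now
```

The chart's inbound TLS listener templates would then be wrapped in `{{- if .Values.mesh.tls_termination }} ... {{- end }}`, which is the "switch to turn the TLS-based service on or not".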
[13:49:59] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10BTullis)
[13:51:40] <_joe_> ottomata: yeah, I mean, we might work with upstream eventually if this chart changes a lot over time
[13:51:55] <_joe_> but for now, I'd just keep it close to upstream
[13:52:45] <_joe_> I am going to say, I wouldn't assume we'll only ever have one operator per cluster; especially when upgrading it, we might be happy if we can deploy more than one
[13:53:05] <_joe_> so if that is lacking and our metadata would help there, I'd consider it
[13:53:16] <_joe_> otherwise, just go with "closest to upstream"
[13:54:33] i think as long as they are deployed as different releases/namespaces, it will be fine
[13:55:10] hm, except maybe the resources they create in the flink app watchNamespaces. hm. will check
[13:55:28] maybe those need to have some uniqueness indicating which flink-operator release created them?
[13:57:37] sounds like premature optimization to me tbh.
[13:58:05] we could probably add that later on when there is actual need, no?
[14:02:05] I would also argue not to remove things from the chart that can just stay disabled/unused, to allow for easier merging of upstream changes (like imagePullSecrets and operatorVolumeMounts)
[14:06:41] <_joe_> +1
[14:06:57] <_joe_> I was about to comment on the patch :)
[14:07:06] that makes sense :)
[14:07:40] <_joe_> I would mostly keep things disabled that we can disable, and *add* feature flags allowing us to disable stuff
[15:04:30] OH, okay!
[15:04:37] great, i'll do that then.
[15:04:59] i think most things are behind feature flags already, except maybe the ability to use FlinkSessionJob, but we can just prevent that in review
[15:08:47] jayme: re webhook; i can't say I fully understand what it does, but Ben was removing it in the spark operator, so I thought we should too?
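A sketch of the "keep things disabled rather than remove them" approach for the flink-kubernetes-operator values discussed above; the key shapes are illustrative of the upstream chart's layout, not a verified copy:

```yaml
# Keep upstream keys present but disabled instead of deleting them,
# so future merges from upstream stay clean.
imagePullSecrets: []     # unused in our setup, kept for clean merges
operatorVolumeMounts:
  create: false          # disabled rather than deleted
  data: []
webhook:
  create: false          # off until its purpose is understood
```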
[15:08:51] i can add it back in
[15:19:58] I'd say we should at least understand what it does before we use it or remove it :)
[15:24:34] :0
[15:24:37] :)
[15:25:34] Madness I say
[15:28:06] * _joe_ monsieur de La Palice has entered the chat
[15:31:20] 15th-century French references, damn
[15:35:25] <_joe_> claime: we work for an encyclopedia :)
[15:35:40] <_joe_> (the reference is https://en.wikipedia.org/wiki/Jacques_de_La_Palice)
[15:36:13] _joe_: Honestly didn't think knowledge of that guy went further than France, but seeing his service record, I can see why you would know about him :')
[15:36:42] <_joe_> claime: we also use "lapalissiano" to say something's a tautology
[15:37:17] Literally the same word, "lapalissade", in French
[15:37:38] All that because of the long s
[15:37:54] "How typography shapes the world", thanks for coming to my TED talk
[16:58:04] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye
[16:58:47] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kubernetes1023.eqiad.wmnet with OS bullseye
[17:26:20] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) > It'd be nice if the value of watchNamespaces didn't have to be hardcoded when the flink-operator is deployed Oh, [[ htt...
[18:32:52] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr
[18:44:34] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10Dzahn) thanks @fgiunchedi and @Clement_Goubert. I will follow up on the hardware issue and with dcops.
[18:45:47] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye completed: - kubernetes1024 (**WARN...
[18:45:54] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1023.eqiad.wmnet with OS bullseye completed: - kubernetes1023 (**WARN...
[18:49:01] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): contint1001 hardware failures - https://phabricator.wikimedia.org/T324698 (10Dzahn)
[18:49:54] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10Dzahn) There is T313832 already, which is about setting up the replacement for this, contint1002. Also there is now T324698 to e...
[18:50:57] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): contint1001 hardware failures - https://phabricator.wikimedia.org/T324698 (10Dzahn)
[18:52:45] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): contint1001 hardware failures - https://phabricator.wikimedia.org/T324698 (10Dzahn) purchase date: 2016, not under warranty, and the replacement is already here, so there is no point in trying to get the RAM replaced afaict we can turn this...
[19:03:27] 10serviceops, 10serviceops-collab, 10Patch-For-Review, 10Release-Engineering-Team (Seen): contint1001 hardware failures (remove contint1001 from production) - https://phabricator.wikimedia.org/T324698 (10Dzahn)
[19:05:54] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10Dzahn) T313832, T324698 and T294276 in general have higher prio now and cover this. This ticket can stay about a possible swi...
[20:02:21] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): contint1001 hardware failures (remove contint1001 from production) - https://phabricator.wikimedia.org/T324698 (10hashar) @Dzahn can we stick to {T313832} for the implementation? I am fine having a task for decommissioning contint1001 but...
[20:05:33] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) > `kubernetes.operator.dynamic.namespaces.enabled` Ah, but the upstream helm chart does not work with this feature becaus...
[20:06:42] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) > Perhaps, we could wildcard the namespaces that the flink-operator is allowed to modify? E.g. namespace that starts with...
[20:09:24] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10hashar) 05Open→03Declined In 2020 we switched the service from eqiad to codfw, and this task was to switch back to eqiad...
[20:47:27] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10Cmjohnson)
[20:47:45] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10Cmjohnson) 05Open→03Resolved completed
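On the T324576 watchNamespaces discussion above, a sketch of what the dynamic-namespaces alternative would look like in the flink-kubernetes-operator values. Note the quoted comment says the upstream chart does not work with this feature, so this shows the intent only; the config key is taken from the task, the surrounding layout is assumed:

```yaml
# Sketch only: replace a hardcoded watchNamespaces list with the
# operator's dynamic namespace watching, scoped by RBAC instead.
watchNamespaces: []      # no hardcoded namespace list
defaultConfiguration:
  create: true
  flink-conf.yaml: |+
    kubernetes.operator.dynamic.namespaces.enabled: true
```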