[00:18:56] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1023.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1...
[00:21:47] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1...
[08:59:02] <wikibugs>	 10serviceops: Remove PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert)
[08:59:47] <wikibugs>	 10serviceops: Remove PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert) 05Open→03In progress
[09:00:19] <wikibugs>	 10serviceops: Remove PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert)
[09:20:45] <wikibugs>	 10serviceops: Remove PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert) p:05Triage→03Medium
[09:38:42] <wikibugs>	 10serviceops, 10Patch-For-Review: wikikube LIST secrets latency - https://phabricator.wikimedia.org/T323706 (10JMeybohm) 05Open→03Resolved Updated resource requirements have just been deployed and special handling in alerts has been removed. Resolving this.
[10:09:34] <claime>	 Should I go ahead with removing the PHP opcache health warning? https://phabricator.wikimedia.org/T324649 / https://gerrit.wikimedia.org/r/865580 
[10:15:30] <effie>	 claime: I think we should have a short discussion before we do so 
[10:15:48] <claime>	 let's :)
[10:16:20] <effie>	 I need to shoot off in a bit, but the tldr is that
[10:16:49] <effie>	 when a server does not have a 99.99% hit ratio, it is either not receiving traffic 
[10:17:10] <effie>	 or, there is a problem with our code 
[10:17:32] <claime>	 Ok, and are we actually acting on this?
[10:17:39] <claime>	 Because there's 33 of them right now.
[10:18:26] <effie>	 I am doing a mediawiki deploy, so opcache is warming up
[10:18:42] <claime>	 It's not just that
[10:18:56] <claime>	 It's 2/3 of the volume of all sre tagged alarms
[10:19:00] <claime>	 And it's a warning
[10:19:24] <claime>	 I mean at that point it is just noise
[10:19:24] <effie>	 which is why we should tweak the alert prolly, rather than remove it 
[10:20:04] <effie>	 the warning should appear if a server's opcache is below 99.99% after eg an hour 
[10:20:28] <effie>	 that should be enough for new servers to warm up, and to figure out if there code issues
[10:21:12] <effie>	 there has been a few times that it was code, which is why I am not saying just go ahead and kill it 
[10:21:35] <claime>	 yeah yeah I get what you're saying
[10:22:09] <claime>	 I'll paste our convo in the task, and we'll think on it. It's not urgent anyways
[10:22:46] <effie>	 I will update teh task 
[10:22:55] <claime>	 thanks
[10:22:57] <effie>	 I will reply I mean
[10:23:13] <effie>	 too much irc logs on a task just make it harder to read 
[10:23:56] <claime>	 Agreed, I just didn't want to put the onus on you to write a comment, and the onus on me to remember to formulate it properly ;)
[10:25:25] <effie>	 it is alright 
[10:29:27] <wikibugs>	 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10Clement_Goubert) contint1001 crashed again today, bad DIMM, had to powercycle it from iDRAC.
[10:29:45] <wikibugs>	 10serviceops, 10Patch-For-Review: Remove PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10jijiki) We should briefly think this a little bit more before removing the alert altogether. When a server's opcache hit ratio is below 99.99%, it is either:  * not receiving any traffic * it is war...
[11:09:25] <wikibugs>	 10serviceops, 10Patch-For-Review: Remove PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert) I've uploaded a new PS to raise the alarm to after 6 retries at 10 minute intervals. Sounds good?
[11:10:45] <wikibugs>	 10serviceops, 10Patch-For-Review: Revisit PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert)
[11:21:10] <hashar>	 for Debian packaging of a go application, do we have to package all the go module dependencies or is there a way to instruct `dh_golang`  to `go get` the dependencies from the network and bundle them in the package?
[11:21:46] <hashar>	 it is not for wmf production but for a local use case, I would like to backport a package for local usage without the hassle of backporting all of the dependencies ;)
[11:39:46] <hnowlan>	 enabling service mesh for prod thumbor: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865595 
[11:48:18] <_joe_>	 hashar: look here https://wikitech.wikimedia.org/wiki/Helm#Importing_a_new_version
[11:48:30] <_joe_>	 we tend to vendor all dependencies in the git repo/package
[11:48:37] <_joe_>	 contrary to debian's own recommendations
[11:49:14] <_joe_>	 hnowlan: so you'll have thumbor using TLS?
[11:50:06] <hashar>	 _joe_: ohh nice. Thank you :-]
[11:50:34] <_joe_>	 hnowlan: uhm, wait, how do yu plan to do the transition?
[11:57:37] <hnowlan>	 _joe_: no - maybe I'm misunderstanding the purpose of the mesh seciton here. I'm hoping to enable the mesh here for connecting to swift rather than for exposing thumbor
[11:57:51] <_joe_>	 uhm sadly it does both
[11:58:09] <_joe_>	 so, we can add an if guard around the local tls termination
[11:58:19] <_joe_>	 let me get there after lunch
[11:59:58] <hnowlan>	 At this point maybe it just makes as much sense to connect directly to swift and not use the mesh until we move thumbor to use tls 
[12:13:18] <_joe_>	 well it seems like a possible usecase anyways, I'll see what I can do
[12:13:31] <_joe_>	 it's more or less adding a switch to turn on the TLS-based service or not
[12:15:50] <hnowlan>	 ah cool
[13:42:17] <ottomata>	 _joe_:  in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865158 i'm bringing in the flink helm chart with changes.  I'm guessing I should  bring in some of the vendor scaffold templates?  especially the base ones, and e.g. apply our metadata everywhere?
[13:42:44] <ottomata>	 or, should I not worry about it and just keep it close to upstream for metadata too, since I think, there will only be one flink operator ever deployed
[13:43:22] <ottomata>	 flink-operator helm chart *
[13:43:31] <_joe_>	 ottomata: what are you modifying compared to upstream?
[13:43:46] <wikibugs>	 10serviceops, 10MW-on-K8s: Helmfile apply failing on deploy server - https://phabricator.wikimedia.org/T324553 (10Clement_Goubert) 05Open→03Resolved Just confirmed the fix worked by restarting the failed service. `Dec 07 13:42:31 deploy1002 systemd[1]: train-presync.service: Succeeded.`
[13:43:53] <ottomata>	 _joe_: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865158/4/charts/flink-kubernetes-operator/README.md#11
[13:45:08] <_joe_>	 ack thanks
[13:46:02] <ottomata>	 mostly just removing support for things we don't want. i could keep it exactly the same as upstream and just manage the restrictions via admin_ng values somehow. 
[13:49:59] <wikibugs>	 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10BTullis)
[13:51:40] <_joe_>	 ottomata: yeah, I mean, we might work with upstream eventually if this chart changes a lot over time
[13:51:55] <_joe_>	 but for now, I'd just keep it close to upstream
[13:52:45] <_joe_>	 I am going to say, I wouldn't assume we'll only ever have one operator per cluster, especially when upgrading it, we might be happy if we can deploy more than one
[13:53:05] <_joe_>	 so if that is lacking and our metadata would help there, I'd consider it
[13:53:16] <_joe_>	 otherwise, just go with "closest to upstream"
[13:54:33] <ottomata>	 i think as long as they are deployed as different releases/namespaces, it will be fine
[13:55:10] <ottomata>	 hm, except maybe the resources they create in the flink app watchNamespaces.  hm.  will check
[13:55:28] <ottomata>	 maybe those need to have some uniqueness indicating whiich flink-operator release created them?
[13:57:37] <jayme>	 sounds like premature optimization to me tbh.
[13:58:05] <jayme>	 we could probably add that later on when there is actual need, no?
[14:02:05] <jayme>	 I would also argue not to remove things from the chart that can just stay disabled/unused to allow for easier merging of uptream changes (like imagePullSecrets and operatorVolumeMounts)
[14:06:41] <_joe_>	 +1
[14:06:57] <_joe_>	 I was about to comment on the patch :)
[14:07:06] <jayme>	 that makes sense :)
[14:07:40] <_joe_>	 I would mostly keep things disabled that we can disable, and *add* feature flags allowing us to disable stuff
[15:04:30] <ottomata>	 OH, okay!
[15:04:37] <ottomata>	 great, i'll do that then.
[15:04:59] <ottomata>	 i thnk most things are behind feature flags already, except maybe hte ability to use FlinkSessionJob, but we can just prevent that in review 
[15:08:47] <ottomata>	 jayme:  re webhook; i can't say I fully understand what it does, but Ben was removing it in the spark operator, so I thought we should too?
[15:08:51] <ottomata>	 i can add it back in
[15:19:58] <jayme>	 I'd say we should at least understand what it does before we use it or remove it :)
[15:24:34] <ottomata>	 :0
[15:24:37] <ottomata>	 :)
[15:25:34] <claime>	 Madness I say
[15:28:06] * _joe_ mounsieur de la palice has entered the chat
[15:31:20] <claime>	 XVth century french references, damn
[15:35:25] <_joe_>	 claime: we work for an encyclopedia :)
[15:35:40] <_joe_>	 (the reference is https://en.wikipedia.org/wiki/Jacques_de_La_Palice)
[15:36:13] <claime>	 _joe_: Honestly didn't think knowledge of that guy went further than France, but seeing his service record, I can see why you would know about him :')
[15:36:42] <_joe_>	 claime: we also use "lapalissiano" to say something's a tautology
[15:37:17] <claime>	 Litteraly the same word "lapalissade" in french 
[15:37:38] <claime>	 All that because of the long s
[15:37:54] <claime>	 "How typography shapes the world", thanks for coming to my TED talk
[16:58:04] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye
[16:58:47] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kubernetes1023.eqiad.wmnet with OS bullseye
[17:26:20] <wikibugs>	 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) > It'd be nice if the value of watchNamespaces didn't have to be hardcoded when the flink-operator is deployed Oh, [[ htt...
[18:32:52] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr
[18:44:34] <wikibugs>	 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10Dzahn) thanks @fgiunchedi and @Clement_Goubert . I will follow-up on the hardware issue and with dcops.
[18:45:47] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye completed: - kubernetes1024 (**WARN...
[18:45:54] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1023.eqiad.wmnet with OS bullseye completed: - kubernetes1023 (**WARN...
[18:49:01] <wikibugs>	 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): contint1001 hardware failures - https://phabricator.wikimedia.org/T324698 (10Dzahn)
[18:49:54] <wikibugs>	 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10Dzahn) There is T313832 already which is about setting up the replacement for this, contint1002.  Also there is now T324698 to e...
[18:50:57] <wikibugs>	 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): contint1001 hardware failures - https://phabricator.wikimedia.org/T324698 (10Dzahn)
[18:52:45] <wikibugs>	 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): contint1001 hardware failures - https://phabricator.wikimedia.org/T324698 (10Dzahn) purchase date: 2016   not under warranty  and replacement is already here  so there is no point in trying to get the RAM replaced afaict  we can turn this...
[19:03:27] <wikibugs>	 10serviceops, 10serviceops-collab, 10Patch-For-Review, 10Release-Engineering-Team (Seen): contint1001 hardware failures (remove contint1001 from production) - https://phabricator.wikimedia.org/T324698 (10Dzahn)
[19:05:54] <wikibugs>	 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10Dzahn) T313832 , T324698  and T294276 in general have higher prio now and cover this.  This ticket can stay about a possible swi...
[20:02:21] <wikibugs>	 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): contint1001 hardware failures (remove contint1001 from production) - https://phabricator.wikimedia.org/T324698 (10hashar) @Dzahn can we stick to {T313832} for the implementation. I am fine having a task for decommissioning contint1001 but...
[20:05:33] <wikibugs>	 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) > `kubernetes.operator.dynamic.namespaces.enabled` Ah, but the upstream helm chart does not work with this feature becaus...
[20:06:42] <wikibugs>	 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) > Perhaps, we could wildcard the namespaces that the flink-operator is allowed to modify? E.g. namespace that starts with...
[20:09:24] <wikibugs>	 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10hashar) 05Open→03Declined In 2020 we have switched the service from eqiad to codfw and this task was to switch back to eqiad...
[20:47:27] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10Cmjohnson)
[20:47:45] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10Cmjohnson) 05Open→03Resolved completed