[00:36:57] serviceops, Parsoid: Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (ssastry)
[00:37:36] serviceops, Parsoid: Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (ssastry)
[00:54:22] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Jhancock.wm)
[07:09:39] serviceops, Parsoid: Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (hashar) Side note: don't run npm install, it has a high potential to get the machine owned/taken over or turned into a botnet agent :-] To upgrade Node you pretty much need to upgrade the OS: * Debian Buste...
[08:09:09] serviceops, Parsoid: Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (MoritzMuehlenhoff) testreduce is a VM, we can easily spin up a new testreduce1002 VM running Bookworm which would provide nodejs 18 and npm 9.
[08:10:32] claime jayme I'm looking into file-based CDS https://www.envoyproxy.io/docs/envoy/latest/start/quick-start/configuration-dynamic-filesystem which seems like it would work, I'm not sure though about the helm/k8s part, as in I thought writing a new file from the mesh module would work, and of course it does but we can't directly reference env variables or downward api from configmaps AFAICS
[08:10:38] which makes sense
[08:12:13] godog: I think what you'll have to do is reference the file in the configmap, but catch and write it from entrypoint.sh
[08:12:17] Does that make sense?
[08:12:39] yeah...it does not really solve the problem of how to get the variable (hostname) into the file
[08:13:12] jayme: downward api in deployment
[08:13:26] catch it in entrypoint.sh like we do for concurrency etc
[08:13:32] write it to a file
[08:13:39] It's disgusting
[08:13:45] yeah...that's what I meant
[08:14:01] claime: ah got it, ok
[08:14:06] Basically use this https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/
[08:14:25] it's more or less the same as passing the whole envoy.yaml from the configmap through envsubst and writing it somewhere
[08:14:35] More or less
[08:14:37] so cds is not really needed
[08:14:48] just another layer of confusion imho
[08:14:53] Fair enough
[08:15:15] which is what entrypoint.sh does already, though we're not writing envoy.yaml.tpl so nothing happens in practice
[08:15:21] I don't really have an opinion either way, I think envoy is being very difficult with its configuration options
[08:15:49] godog: yes, exactly
[08:16:01] gotta be honest, the more I dig into this the more it feels like going against the grain, as opposed to being able to use a cluster-wide dns name
[08:16:26] Can't you have a service dns name that always references the node-local deployment?
[08:16:27] I get why it is a daemonset with nodeport
[08:16:34] godog: why don't we do that btw?
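A minimal sketch of the entrypoint approach discussed above (expose the node name to the container through the downward API, then substitute it into a config template at container start). It is written in Python rather than shell purely for illustration. The variable name `NODE_NAME`, the file name envoy.yaml.tpl as an input, and both function names are assumptions, not taken from the real entrypoint.sh or chart. Unlike envsubst, `string.Template.substitute` raises on a missing variable instead of leaving it empty.

```python
import os
from pathlib import Path
from string import Template


def render_template(template_text, env):
    """Fill ${VAR} placeholders from env, roughly what envsubst does.

    Raises KeyError when a placeholder has no value, so a missing
    downward-API variable fails at startup instead of producing a
    half-rendered config.
    """
    return Template(template_text).substitute(env)


def render_file(template_path, output_path, env=None):
    """Render a template file (e.g. a hypothetical envoy.yaml.tpl) to output_path."""
    # Default to the process environment, which is where the downward API
    # would place the (assumed) NODE_NAME variable.
    env = os.environ if env is None else env
    rendered = render_template(Path(template_path).read_text(), env)
    Path(output_path).write_text(rendered)
    return rendered
```

For example, `render_template("address: ${NODE_NAME}", {"NODE_NAME": "kubernetes2025"})` returns `address: kubernetes2025` (the hostname is only an illustrative value). A literal `$` in the template would have to be written as `$$`.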
[08:16:36] I seem to remember being able to do something like this
[08:16:43] absolutely
[08:16:56] not sure why it has to be a nodeport
[08:17:00] jayme: I don't know, I've been asked to look into this last thurs
[08:17:06] eheh
[08:17:21] throw everything over like I did when you asked me to help :-p
[08:17:39] * godog touche'
[08:17:57] https://kubernetes.io/docs/concepts/services-networking/service-traffic-policy/#using-service-internal-traffic-policy
[08:18:04] is what you are looking for
[08:18:17] thank you, I'll read up on that
[08:18:44] you can use the k8s internal service name and then always reach the pod on the local node (or none)
[08:19:08] That's what I was thinking about
[08:19:14] Couldn't quite get the name to pop up
[08:19:32] Great to have people that actually know stuff!
[08:19:47] +1
[08:19:58] hrhr
[11:10:51] jayme: for the automatic certificate generation stuff in k8s - do I need to define something in the private repo for the certificates? My $name-main-tls-proxy container is failing because it can't load the certs
[11:30:13] kubetcd1005 will briefly go down for a Ganeti node reboot
[11:32:53] hnowlan: no, that should not be necessary - which deployment?
[11:39:11] jayme: media-analytics in codfw and eqiad. staging is fine, unsurprisingly
[11:39:48] actually that is a surprise :p
[11:39:54] I'll take a look
[11:40:48] thanks!
[11:42:48] hmm..in staging it does not use a cert-manager cert
[11:43:15] so I guess it does not do so in prod as well...
[11:44:21] is there a vendor dependency the chart is behind in?
[11:44:25] ah, the cassandra-http-gateway chart has not been updated yet
[11:44:33] https://phabricator.wikimedia.org/T300033
[11:44:35] yeah :/
[11:44:49] should be easy enough though I suppose
[11:45:02] ahhh right, yeah, should be grand
[11:45:08] thanks!
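A rough sketch of the Service internal traffic policy linked at 08:17:57, expressed as a Python dict that serializes to a JSON manifest. Only `spec.internalTrafficPolicy: Local` is the point here; the function names and every value passed in (service name, namespace, selector, port) are hypothetical and not taken from any chart mentioned in this channel.

```python
import json


def node_local_service(name, namespace, selector, port):
    """Build a Service manifest that only routes to endpoints on the caller's node."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": selector,
            "ports": [{"port": port, "targetPort": port, "protocol": "TCP"}],
            # Local: traffic from a pod is only sent to endpoints on the
            # same node; if there is none, the traffic is dropped.
            "internalTrafficPolicy": "Local",
        },
    }


def to_json(manifest):
    """Serialize the manifest; JSON manifests can be applied like YAML ones."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

With a daemonset behind such a Service, clients would use the ordinary cluster-internal service name and reach the pod on their own node, or nothing at all, which matches the "(or none)" caveat at 08:18:44.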
[11:45:17] [/29
[11:45:39] there are a lot of example patches and a small guide on how to do it on the task
[11:46:34] if you can, please update to mesh.configuration 1.4.0 as well right away
[11:49:10] will do
[11:49:20] ❤️
[12:02:07] I'll do the same for linkrecommendation today. Splitting up in a couple of patches too
[12:04:04] but my understanding currently is that all that we need is a mesh.certmanager.extraFQDNS = ["api.wikimedia.org"] stanza in values.yaml
[12:04:22] after ofc the vendor/ modules have been updated to 1.4.0
[12:05:45] * akosiaris would love it so much if helmfile diff could default to --context=5
[12:06:39] I've not checked the cert of linkrecommendation...if it has api.w.o in SAN then yes :)
[12:07:23] hmm, let me make sure. There is an externally visible instance of the service, accessible under the api-gateway
[12:07:30] that's why I thought so, /me making sure
[12:07:33] or you also take on https://phabricator.wikimedia.org/T302717, in which case you won't have to change extraFQDNs :-P
[12:08:21] ah no, it doesn't, cool
[12:08:25] best to check the cergen config and copy what's in there
[12:08:28] sweet
[12:11:25] serviceops: Remove tls-proxy cpu limits on eventstreams - https://phabricator.wikimedia.org/T345243 (Clement_Goubert)
[12:12:54] serviceops, MW-on-K8s: Remove tls-proxy cpu limits on eventgate - https://phabricator.wikimedia.org/T345244 (Clement_Goubert)
[12:13:15] serviceops: Remove tls-proxy cpu limits on eventgate - https://phabricator.wikimedia.org/T345244 (Clement_Goubert)
[12:13:42] serviceops: Remove tls-proxy cpu limits on eventgate - https://phabricator.wikimedia.org/T345244 (Clement_Goubert) p:Triage→Medium
[12:14:06] serviceops: Remove tls-proxy cpu limits on eventstreams - https://phabricator.wikimedia.org/T345243 (Clement_Goubert) p:Triage→Medium
[12:18:18] serviceops, Content-Transform-Team-WIP, Maintenance-Worktype, Wikimedia-Incident: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324 (jijiki)
[12:18:49] serviceops, Content-Transform-Team-WIP, Maintenance-Worktype, Wikimedia-Incident: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324 (jijiki)
[12:22:12] serviceops, Content-Transform-Team-WIP, Maintenance-Worktype, Wikimedia-Incident: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324 (jijiki) Additionally, while the key for both certificates are of the same size, during negotiation, e...
[13:00:58] serviceops, Observability-Tracing, Patch-For-Review, User-fgiunchedi: jaeger is configured to receive traces from production - https://phabricator.wikimedia.org/T344253 (JMeybohm) > Before we can complete T343302: otel collector is configured to send traces to jaeger we need to get jaeger collect...
[13:19:25] hey folks, if you are ok I can take care of the kafka-main reboots
[13:21:07] (Starting with codfw)
[13:21:50] thanks elukey - you're great!
[13:21:56] <3
[13:23:28] I am still paying the price of that redis password leak that happened the day after Alex rolled out the new password
[13:26:43] lol
[13:31:59] serviceops, Parsoid: Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (ssastry) Thanks Moritz. We would also need to copy over the test database when you create the new VM. It is not catastrophic if not done since we can always reinitialize the test set, but it would save us s...
[13:47:57] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Jhancock.wm)
[13:50:14] serviceops, Data-Persistence, Patch-For-Review: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (akosiaris) linkrecommendation done today.
[13:54:14] serviceops, Data-Persistence, Patch-For-Review: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (akosiaris)
[14:00:18] serviceops, Parsoid: Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (MoritzMuehlenhoff) Ack, I'll look into it tomorrow. For the new VM I'd simply reuse the current specs of testreduce1001, so 4 CPU cores, 6G RAM and 40G disk.
[14:47:31] serviceops, Parsoid: Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (ssastry)
[14:47:34] serviceops, Parsoid: Move testreduce to nodejs 12 - https://phabricator.wikimedia.org/T301303 (ssastry)
[14:50:42] serviceops, Data-Persistence, Performance-Team, SRE, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (kamila)
[14:51:18] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (akosiaris)
[14:52:19] serviceops, Parsoid: Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (ssastry) >>! In T345220#9130723, @MoritzMuehlenhoff wrote: > Ack, I'll look into it tomorrow. For the new VM I'd simply reuse the current specs of testreduce1001, so 4 CPU cores, 6G RAM and 40G disk. Reg d...
[14:53:16] serviceops, Parsoid: Request for additional disk space on testreduce1001 - https://phabricator.wikimedia.org/T296051 (ssastry) Open→Declined I am going to decline this on my end since I think we have found a way to work with the 40gb disk.
[15:05:22] serviceops, SRE, Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (kamila)
[15:06:44] serviceops, Data-Persistence, Performance-Team, SRE, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (kamila)
[15:13:01] serviceops, Parsoid: Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (MoritzMuehlenhoff) >>! In T345220#9131133, @ssastry wrote: > I would say 50gb is probably the minimum we need and 60gb would probably give us a bit more cushion. Let me know if I need to do anything to cha...
[15:23:07] kafka main codfw rebooted :)
[15:25:26] thanks!
[15:31:11] serviceops, Data-Persistence, Performance-Team, SRE, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (kamila) p:Triage→Medium
[15:54:52] serviceops, MW-on-K8s, SRE, observability: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (kamila)
[15:55:30] serviceops, MW-on-K8s, Observability-Logging, SRE: Keep calculating latencies for MediaWiki requests in the WikiKube environment - https://phabricator.wikimedia.org/T276095 (kamila) Open→Resolved The remaining Benthos errors are due to T340935, other than that this is working. (I still n...
[15:55:54] serviceops, Similarusers: Remove similar-users service from k8s - https://phabricator.wikimedia.org/T345274 (hnowlan)
[15:57:22] serviceops, ChangeProp, EventStreams, Image-Suggestion-API, and 5 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (hnowlan)
[16:24:21] serviceops, Data-Persistence, Performance-Team, SRE, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (kamila)
[17:01:47] serviceops, Similarusers: Remove similar-users service from k8s - https://phabricator.wikimedia.org/T345274 (kostajh) @Niharika @Tchanders any concerns with this?
[17:12:10] serviceops, SRE, ops-codfw: Decommission thumbor200[34] - https://phabricator.wikimedia.org/T344597 (wiki_willy) a:Jhancock.wm
[18:46:32] serviceops, Abstract Wikipedia team, WikiLambda, function-orchestrator: Come up with a way to make Wikifunctions calls not keep a PHP process alive whilst waiting for the backend - https://phabricator.wikimedia.org/T345289 (Jdforrester-WMF)
[20:15:19] serviceops, Abstract Wikipedia team, WikiLambda, function-orchestrator: Come up with a way to make Wikifunctions calls not keep a PHP process alive whilst waiting for the backend - https://phabricator.wikimedia.org/T345289 (daniel) The idea sounds good, but I have no idea whether it's practical....
[20:21:07] serviceops, Abstract Wikipedia team, WikiLambda, function-orchestrator: Come up with a way to make Wikifunctions calls not keep a PHP process alive whilst waiting for the backend - https://phabricator.wikimedia.org/T345289 (daniel) > The now not-really-stateless back-end service The backend serv...
[20:58:11] serviceops, Abstract Wikipedia team, WikiLambda, function-orchestrator: Come up with a way to make Wikifunctions calls not keep a PHP process alive whilst waiting for the backend - https://phabricator.wikimedia.org/T345289 (Jdforrester-WMF) >>! In T345289#9132178, @daniel wrote: > If we don't wan...
[21:57:05] serviceops, Abstract Wikipedia team, WikiLambda, function-orchestrator: Come up with a way to make Wikifunctions calls not keep a PHP process alive whilst waiting for the backend - https://phabricator.wikimedia.org/T345289 (BPirkle) >>! In T345289#9132178, @daniel wrote: > If it find nothing in m...
[22:17:58] serviceops, Abstract Wikipedia team, WikiLambda, function-orchestrator: Come up with a way to make Wikifunctions calls not keep a PHP process alive whilst waiting for the backend - https://phabricator.wikimedia.org/T345289 (Jdforrester-WMF) >>! In T345289#9132335, @BPirkle wrote: > One especially...
[22:37:01] serviceops, Abstract Wikipedia team, WikiLambda, function-orchestrator: Come up with a way to make Wikifunctions calls not keep a PHP process alive whilst waiting for the backend - https://phabricator.wikimedia.org/T345289 (BPirkle) > having to have both the PHP MW extension code and the (current...