[08:23:43] serviceops, Kubernetes: Reduction of Secret-based Service Account Tokens - https://phabricator.wikimedia.org/T345892 (JMeybohm)
[08:59:52] hi folks!
[09:00:42] I am debugging an issue with the envoy sidecar, namely proxying to thanos-swift
[09:01:30] I added more info to https://phabricator.wikimedia.org/T339890#9155267, but basically when I try to hit localhost:6022 I always get a 404 (from the main app's container)
[09:01:50] afaics it seems as if the local envoy proxy doesn't set the Host header when proxying to Thanos Swift
[09:02:24] (and IIUC envoy on the Thanos side wants it, since it specifies some host names in the listeners' config)
[09:02:32] has it happened before?
[09:05:52] FYI Puppet has been disabled on the testreduce hosts for almost 5 days with message "migration". Is that known/WIP?
[09:06:44] volans: Yeah, https://phabricator.wikimedia.org/T345831
[09:08:41] ack thanks. A reference in the disable message would have been useful :) I hope the migration goes well and doesn't require more than 2 weeks (auto-removal period from puppetdb ;) )
[09:14:06] elukey: /me looking
[09:17:56] akosiaris: thanks! It may be a pebcak, but I don't find any other explanation
[09:18:30] I am trying to figure out what envoy does by default
[09:18:38] maybe I can just turn on debugging
[09:22:04] serviceops, SRE, ops-codfw: mw2444 down - https://phabricator.wikimedia.org/T345884 (jijiki)
[09:22:09] serviceops, SRE, ops-codfw: mw2444 down - https://phabricator.wikimedia.org/T345884 (jijiki) @Jhancock.wm I am afraid the server is dead again :(
[09:23:00] rip mw2444
[09:24:11] thumbor uses swiftclient (as we do) but mesh is disabled afaics: https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/services/thumbor/values.yaml
[09:25:44] serviceops, Shellbox: Rename the shellbox service to shellbox-score - https://phabricator.wikimedia.org/T345868 (jijiki) a:jijiki
[09:25:47] serviceops, Shellbox: Rename the shellbox service to shellbox-score - https://phabricator.wikimedia.org/T345868 (jijiki) Picking up the task
[09:28:00] elukey: yeah, nobody else is using the mesh for thanos-swift as far as I can tell
[09:28:30] machinetranslation is about to, but I have de-prioritized this for other work
[09:28:49] akosiaris: mmm I am checking the cluster config in the envoy proxy's one, and I am wondering if we'd need to set the SNI
[09:31:50] on thanos-fe nodes I see that envoy requires thanos-swift.something.wmnet
[09:35:08] elukey: it definitely won't hurt to try it out
[09:35:16] nothing else uses it, so no risk
[09:35:26] ack, trying to figure out how to set it
[09:39:54] in a related SNI thing, I have posted https://gerrit.wikimedia.org/r/c/operations/puppet/+/941888
[09:40:10] but that's for the other side of it, that is ingress
[09:41:50] akosiaris: I was checking that one as well, I thought it was related to local tls proxies connecting to an ingress-enabled endpoint
[09:42:12] basically similar to my use case, set the sni explicitly when proxying
[09:42:39] (so istio requiring a proper SNI to route the request etc..)
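Note: a minimal sketch of what setting the SNI explicitly on the sidecar's upstream cluster could look like, written as a Python helper that emits an Envoy v3 cluster fragment. The helper name `upstream_cluster`, the hostname `thanos-swift.example.internal`, and port 443 are illustrative assumptions, not the values used in the puppet or chart changes linked in this log; only the `UpstreamTlsContext` type and its `sni` field come from Envoy's v3 API.

```python
import json

# Fully qualified type URL of Envoy's v3 upstream TLS context.
TLS_CONTEXT_TYPE = (
    "type.googleapis.com/"
    "envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext"
)


def upstream_cluster(name, host, port, sni=None):
    """Build an Envoy v3 static cluster as a plain dict.

    Hypothetical helper for illustration only. When `sni` is given, it is
    placed in the upstream TLS context so the TLS ClientHello carries that
    server name; when it is None, no SNI is configured on the cluster.
    """
    tls_context = {"@type": TLS_CONTEXT_TYPE}
    if sni is not None:
        tls_context["sni"] = sni

    return {
        "name": name,
        "type": "STRICT_DNS",
        "connect_timeout": "1s",
        "load_assignment": {
            "cluster_name": name,
            "endpoints": [
                {
                    "lb_endpoints": [
                        {
                            "endpoint": {
                                "address": {
                                    "socket_address": {
                                        "address": host,
                                        "port_value": port,
                                    }
                                }
                            }
                        }
                    ]
                }
            ],
        },
        "transport_socket": {
            "name": "envoy.transport_sockets.tls",
            "typed_config": tls_context,
        },
    }


if __name__ == "__main__":
    # Placeholder hostname and port, assumed for the example.
    cluster = upstream_cluster(
        "thanos-swift",
        "thanos-swift.example.internal",
        443,
        sni="thanos-swift.example.internal",
    )
    print(json.dumps(cluster, indent=2))
```

Leaving `sni` unset reproduces the situation described above, where the receiving side cannot match the connection to one of its configured host names.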
[09:44:33] I am tempted to set use_ingress: true for thanos-swift
[09:44:35] and test it
[09:45:10] not the permanent solution, but if it works I could add some extra config bit like "sni: etc.."
[09:45:20] (or set_sni: true)
[09:48:33] elukey: sure, go ahead
[09:48:58] akosiaris: created https://gerrit.wikimedia.org/r/c/operations/puppet/+/956373 !
[09:53:56] akosiaris: worked! Now I have another error :D
[09:55:08] mmm tls validation
[09:55:36] "certificate verify failed: unable to get local issuer certificate"
[09:55:55] ah ok maybe we don't use the bundle with PKI's root cert?
[09:56:40] ah no thanos uses a cergen cert
[09:57:27] ahhh I have probably forgotten wmf-certificates in the docker image
[09:57:29] * elukey cries in a corner
[09:57:40] *pat pat*
[09:57:44] is ok
[09:57:49] :p
[09:59:04] :D
[10:09:43] Good morning. I'm seeking a review for a patch to production-images please. It adds support for building multiple minor versions of spark: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/956374
[10:09:53] I am likely going to get a storm of -1s for it but https://gerrit.wikimedia.org/r/c/operations/puppet/+/956379
[10:10:05] basically renaming "uses_ingress" to "sets_sni"
[10:23:38] There's going to be a lot of chart patching to do because iirc that's used all over the place
[10:24:19] ah snap it is used in the mesh module
[10:24:23] didn't see it
[10:24:28] * elukey cries in a corner
[11:15:46] elukey: lol. There is a way out btw. Bump the module's major version (per standard semver), alter "uses_ingress" to "set_sni", set both variables in puppet with a clear note to a task and we'll gradually upgrade charts to the new mesh module version
[11:16:08] it will take longer than you originally anticipated, but at least there is hope
[12:33:28] How can I see the difference between two images from https://docker-registry.wikimedia.org/python3-build-bullseye/tags/ - the latest image seems to be breaking MinT
[12:36:25] kart_: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/+/refs/heads/master/images/python-build/bullseye/
[12:36:59] I don't see any changes though in 2 years
[12:37:10] what's the issue you are seeing?
[12:37:49] we do routinely rebuild this image btw, so master always points to an image with security updates applied
[12:38:09] but apparently nothing else has changed aside from that
[12:39:39] https://integration.wikimedia.org/ci/job/machinetranslation-pipeline-test/275/console seems like a conflict of packages?
[12:41:21] AttributeError: module 'virtualenv.create.via_global_ref.builtin.cpython.mac_os' has no attribute 'CPython3macOsBrew' ?
[12:41:48] MacOs ?
[12:42:36] ah. surely a bug from virtualenv
[12:43:49] I don't know where the bug is, but it's emitted from the run-test part
[12:44:06] Yes
[12:44:31] Let me check in detail, just wanted to confirm if it has anything to do with the updated image.
[12:44:42] the image gets built ok, so Debian packages are not to blame here
[12:44:49] ok, let us know what the issue was
[12:44:56] or if you need more help
[12:46:45] Sure!
[12:47:04] We never blame Debian when at the DebConf :D
[13:36:02] serviceops, Cassandra: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 (elukey) @Eevans I totally understand your point of view, but at the same time I am not clear what procedure we should follow when an issue like this one happen...
[13:43:49] akosiaris: good suggestion thank you :)
[13:44:32] serviceops, SRE, ops-codfw: mw2444 down - https://phabricator.wikimedia.org/T345884 (Jhancock.wm) I opened a Dell support ticket to get a replacement. I've rebooted it for now but expect it to go down again. SR: 175669963
[14:17:33] filed all changes, I can already see Janis wondering "Why Luca Why" when reviewing
[14:20:35] I'm reading here as well - so I'm prepared :-p
[14:21:40] ahahahhaah
[14:23:26] elukey: could you maybe add a separate task just for the renaming so we can properly track progress (and link those changes to it)? As this requires updating all charts before the old flag can be removed I think it might be worth it
[14:24:02] jayme: definitely yes
[14:24:19] we might get lucky as we can piggyback on 300033 and the telemetry additions
[14:24:56] T300033
[14:28:19] serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (Jhancock.wm) Open→Resolved @Clement_Goubert the defective DIMM has been replaced and booted up. Error hasn't repeated yet. `The self-heal operation suc...
[14:33:11] serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (Clement_Goubert) Thanks @Jhancock.wm !
[14:51:13] jayme: you can also tell me "it is totally crazy don't do it"
[14:51:28] I came up with the new name but anything is fine
[14:52:28] I think the new name is more speaking tbh
[14:54:14] okok
[15:06:13] serviceops, SRE, ops-codfw: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (Jhancock.wm) After working with Dell, we determined that the drive is bad and they will be sending a replacement.
[15:06:19] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jclark-ctr)
[15:07:52] serviceops, SRE, ops-codfw: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (Jhancock.wm) p:Triage→Medium
[15:32:10] serviceops, SRE, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (Trizek-WMF)
[15:47:52] serviceops, acl*WMF-FR: FRUP: Add Applepay verification code to donate wiki - https://phabricator.wikimedia.org/T346055 (RhinosF1) Per #-sre, serviceops own the config to power this
[15:47:54] serviceops, acl*WMF-FR: FRUP: Add Applepay verification code to donate wiki - https://phabricator.wikimedia.org/T346055 (Damilare)
[16:10:02] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jclark-ctr)
[16:19:15] serviceops, Security-Team, Security: FRUP: Add Applepay verification code to donate wiki - https://phabricator.wikimedia.org/T346055 (Aklapper) > Damilare added a project: acl*WMF-FR. Please see https://www.mediawiki.org/wiki/Phabricator/Help#Restricting_access_to_tasks - thanks!
[16:19:20] serviceops, SecTeam-Processed: FRUP: Add Applepay verification code to donate wiki - https://phabricator.wikimedia.org/T346055 (sbassett) p:Triage→Low
[16:19:32] serviceops, SecTeam-Processed: FRUP: Add Applepay verification code to donate wiki - https://phabricator.wikimedia.org/T346055 (sbassett)
[16:23:10] serviceops, SecTeam-Processed: FRUP: Add Applepay verification code to donate wiki - https://phabricator.wikimedia.org/T346055 (sbassett) >>! In T346055#9157172, @Damilare wrote: > A case of erring on the side of caution I guess. The #security-team is always fine with this approach :)
[16:50:55] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jclark-ctr)
[20:35:26] serviceops, Fundraising-Backlog, SecTeam-Processed: FRUP: Add Applepay verification code to donate wiki - https://phabricator.wikimedia.org/T346055 (AKanji-WMF)
[20:45:41] serviceops, Shellbox: Rename the shellbox service to shellbox-score - https://phabricator.wikimedia.org/T345868 (RLazarus) >>! In T345868#9151520, @Clement_Goubert wrote: > I propose using a `_shellbox_common_` directory like we have a `_aqs2-common_` and a `_mediawiki-common_` directory in `helmfile.d/s...
[21:43:23] serviceops, Cassandra: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 (Eevans) >>! In T345058#9156401, @elukey wrote: > ... Is your recommendation to just to let the instance depooled, stop puppet etc.. and then ping Data Persiste...