[02:39:33] serviceops: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10212955 (Scott_French) I just realized I didn't mention the obvious alternative before: if we have some confidence in the mesh support already added to the cask chart in T363996, we could go that route...
[07:09:27] serviceops, MW-on-K8s: mw-scripts SAL integration - https://phabricator.wikimedia.org/T376776 (jijiki) NEW
[07:14:30] serviceops: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10213139 (elukey) @Scott_French hi! I'd vote to introduce the mesh TLS support for echostore, it shouldn't be super hard to do given all the work done for sessionstore (I thought it was already completed...
[08:33:10] serviceops, MW-on-K8s, Sustainability (Incident Followup): mw-scripts SAL integration - https://phabricator.wikimedia.org/T376776#10213243 (Peachey88)
[09:27:52] serviceops: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10213425 (hnowlan) The main reason sessionstore didn't roll ahead with using the mesh was concern around the extremely broad impact any issues might have incurred. The risk profile for echostore is a **l...
[10:22:10] hey, I saw some issues with certmanager: https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=cert-manager&var-deployment=cfssl-issuer&orgId=1&from=now-12h&to=now
[10:22:25] I wonder if a fallout of codfw k8s issues or something else
[10:22:51] https://grafana.wikimedia.org/goto/-kssMXkNg?orgId=1
[10:23:55] Since around 3:00
[10:42:05] jynus: I
[10:42:20] jynus: I'd assume it's fallout - will take care
[10:42:25] thanks for raising!
[11:12:29] serviceops, Infrastructure-Foundations, SRE, Patch-For-Review: Timeout while retrieving the catalog from the Docker Registry - https://phabricator.wikimedia.org/T376285#10213709 (elukey) Rolled out the swift proxy change, it seems that it has solved the issue. Doing more tests before closing to...
[11:14:13] serviceops, MW-on-K8s, Sustainability (Incident Followup): mwscript-k8s creates too many resources - https://phabricator.wikimedia.org/T376795 (JMeybohm) NEW
[11:14:24] serviceops, MW-on-K8s, Sustainability (Incident Followup): mwscript-k8s creates too many resources - https://phabricator.wikimedia.org/T376795#10213724 (JMeybohm) p:Triage→High
[11:19:05] serviceops, MW-on-K8s, Sustainability (Incident Followup): mwscript-k8s creates too many resources - https://phabricator.wikimedia.org/T376795#10213731 (JMeybohm)
[11:20:39] serviceops, Data-Persistence, Patch-For-Review: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996#10213732 (elukey) The current status seems to be that only staging has mesh configs enabled, but it doesn't work (namely the deploy doesn't...
[11:21:33] serviceops: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10213733 (elukey) Totally agree with Hugh :) @Scott_French o/ I added a summary of my understanding to T363996#10213732, we can work together in finding the missing config if you want!
[11:38:03] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#10213755 (akosiaris) Noting that in Q3 FY24-25, that is the quarter starting on January 2025, we'll be refreshing mw[2291-2376], which incl...
[11:38:33] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#10213756 (akosiaris)
[12:19:21] jayme: I'll apply admin in eqiad and codfw wikikube with `--selector name=namespace-certificates` to remove the obsolete experimental query service ingress names.
[12:19:21] Btw there is quite a big diff in all wikikube clusters for admin/external-services :)
[12:20:17] ack
[12:29:39] serviceops, MoveComms-Support, Datacenter-Switchover: MoveComms support for Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T371130#10213880 (Trizek-WMF) Open→Resolved The retro is a bit late, sorry. - Great work with @Scott_French who nicely answ...
[12:57:04] serviceops, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team, and 9 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#10213962 (Mvolz) >>! In T349118#10081252, @Ottomata wrote: > FYI, [[ https://gitlab.wikimedia.org/tchin/service-utils/-...
[13:11:00] serviceops, MoveComms-Support, Datacenter-Switchover: MoveComms support for Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T371130#10213986 (akosiaris) >>! In T371130#10213880, @Trizek-WMF wrote: > The retro is a bit late, sorry. Thanks for the retro, we a...
[13:38:02] serviceops, MoveComms-Support, Datacenter-Switchover: MoveComms support for Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T371130#10214102 (Trizek-WMF) The reasons you give make a lot of sense. I'll document our process so that the person in charge books the...
[13:53:27] serviceops, Data-Engineering, Prod-Kubernetes, Data-Platform-SRE (2024.09.28 - 2024.10.18), and 3 others: Migrate Search Platform-owned helm charts to Calico Network Policies - https://phabricator.wikimedia.org/T373195#10214215 (Ahoelzl)
[14:00:36] serviceops, MoveComms-Support, Datacenter-Switchover: MoveComms support for Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T371130#10214328 (akosiaris) >>! In T371130#10214102, @Trizek-WMF wrote: > The reasons you give make a lot of sense. I'll document our p...
[14:33:25] serviceops, Data-Persistence, Patch-For-Review: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996#10214586 (elukey) https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1078944 was needed, now staging works! ` root@deploy2002:...
[14:34:43] serviceops: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10214593 (elukey) @Scott_French session store staging seems to work now, so I think we can probably proceed with echostore staging as well.
[14:36:48] serviceops: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10214595 (Scott_French) Thank you both, @elukey and @hnowlan! Yeah, if we can make the mesh support work, agreed that's the vastly preferable option, and that's great news that sessionstore staging seems...
[15:42:45] serviceops, MW-on-K8s, Patch-For-Review, Sustainability (Incident Followup): mwscript-k8s creates too many resources - https://phabricator.wikimedia.org/T376795#10214855 (JMeybohm) mwscript-cleanup already ran, but did not clean up anything probably because the releases are too fresh. Not sure if...
[15:56:07] serviceops, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team, and 9 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#10214863 (Ottomata) > What service are you using this for? https://wikitech.wikimedia.org/wiki/Event_Platform/EventStr...
[16:03:19] serviceops, DC-Ops, ops-codfw, SRE: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10214917 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-misc2001.codfw.wmnet with OS bookworm
[16:03:21] serviceops, DC-Ops, ops-codfw, SRE: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10214918 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-misc2002.codfw.wmnet with OS bookworm
[17:18:07] serviceops, Thumbor: Thumbor's use of the `expensive` poolcounter queue can break rendering formats - https://phabricator.wikimedia.org/T376828 (hnowlan) NEW
[17:18:09] serviceops, Thumbor: Thumbor's use of the `expensive` poolcounter queue can break rendering formats - https://phabricator.wikimedia.org/T376828#10215152 (hnowlan) p:Triage→High
[17:19:40] serviceops, Structured Data Engineering, Structured-Data-Backlog, Thumbor: Thumbor's use of the `expensive` poolcounter queue can break rendering formats - https://phabricator.wikimedia.org/T376828#10215153 (hnowlan)
[17:23:42] serviceops, DC-Ops, ops-codfw, SRE: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215165 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-misc2001.codfw.wmnet with OS bookworm executed with errors: - mc-misc20...
[17:23:43] serviceops, DC-Ops, ops-codfw, SRE: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-misc2002.codfw.wmnet with OS bookworm executed with errors: - mc-misc20...
[17:48:05] serviceops, Data-Persistence, Patch-For-Review: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996#10215216 (Scott_French) Alright, with https://gerrit.wikimedia.org/r/1078979 merged, sessionstore staging metrics are now back. I'll proceed...
[17:54:00] serviceops, DC-Ops, ops-codfw, SRE: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215245 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-misc2001.codfw.wmnet with OS bookworm
[17:54:02] serviceops, DC-Ops, ops-codfw, SRE: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215246 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-misc2002.codfw.wmnet with OS bookworm
[18:20:36] serviceops, Patch-For-Review: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10215305 (Scott_French) a:Scott_French Ah, a gotcha that I should have foreseen: `failed to create resource: Service "kask-staging-tls-service" is invalid: spec.ports[0].nodePor...
[19:28:07] serviceops, DC-Ops, ops-codfw, SRE: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215504 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-misc2001.codfw.wmnet with OS bookworm completed: - mc-misc2001 (**PASS*...
[19:28:09] serviceops, DC-Ops, ops-codfw, SRE: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215505 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-misc2002.codfw.wmnet with OS bookworm completed: - mc-misc2002 (**WARN*...
[19:29:05] serviceops, DC-Ops, ops-codfw, SRE: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215506 (Jhancock.wm)
[19:32:49] serviceops, DC-Ops, ops-codfw, SRE: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215511 (Jhancock.wm) Open→Resolved @jijiki this is ready for you.
[20:25:02] serviceops, Patch-For-Review: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10215648 (Scott_French) Alright, destroy / apply seems to work as expected: app is healthy, requests to `/echoseen/v1` make it to the app, and app metrics are even working. I'll sta...
[20:49:02] serviceops, MW-on-K8s, Sustainability (Incident Followup): mw-scripts SAL integration - https://phabricator.wikimedia.org/T376776#10215670 (RLazarus) A mwscript-k8s flag to log to SAL is on my to-do list -- I hadn't gotten around to filing a task, thanks. I don't think it would have helped us in thi...