[11:06:20] serviceops, MW-on-K8s: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (kamila)
[11:43:45] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (kamila) Note that we have tried updating the firmware: mw1388 is on new UEFI and iDRAC and still exhibits this problem.
[11:48:39] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (akosiaris) Pasting SEL entries for completeness here. Surprise, they aren't particularly helpful ` Record: 11 Date/Time: 01/03/2024 19:44:59 Source: system Seve...
[11:50:07] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Host rebooted by kamila@cumin1002 with reason: None
[12:19:04] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (MoritzMuehlenhoff) Is this reproducible with every reboot or just some? One thing worth doing is to connect to the serial console an then issue a reboot over Cumin. Maybe we...
[12:22:10] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (kamila) Additional findings: - the `watchdog: watchdog0: watchdog did not stop!` message seems to be a red herring, it's always there - the problem only occurs when running...
[12:24:31] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (kamila) >>! In T354413#9437404, @MoritzMuehlenhoff wrote: > Is this reproducible with every reboot or just some? One thing worth doing is to connect to the serial console an...
[12:39:57] serviceops, Content-Transform-Team-WIP, Page Content Service, RESTBase Sunsetting: Update mobileapps k8s deployment chart for Cassandra credentials - https://phabricator.wikimedia.org/T350507 (Jgiannelos) The snippet from the http gateway helm chart is not using keyspace/tables (because its user...
[13:48:29] serviceops, Kubernetes: kube-apiserver and kubelet HTTPS certificates have the default validity (672h) in staging - https://phabricator.wikimedia.org/T353314 (JMeybohm) Open→Resolved This is now fixed by approach #1 and certs with 72h expiry have been issued on kubestagemaster1001.eqiad.wmnet, al...
[14:19:15] serviceops, MW-on-K8s, MediaWiki-Platform-Team (Radar), Patch-For-Review: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690 (Krinkle) a:DAlangi_WMF→jijiki Moving to our Radar as our part is done, I believe. Feel free to move to our Inbox anytime.
[14:35:57] serviceops, Content-Transform-Team-WIP, Page Content Service, RESTBase Sunsetting: Update mobileapps k8s deployment chart for Cassandra credentials - https://phabricator.wikimedia.org/T350507 (Joe) >>! In T350507#9437424, @Jgiannelos wrote: > The snippet from the cassandra-http-gateway helm chart...
[15:51:54] serviceops, Content-Transform-Team-WIP, Page Content Service, RESTBase Sunsetting: Update mobileapps k8s deployment chart for Cassandra credentials - https://phabricator.wikimedia.org/T350507 (Jgiannelos) Sounds good I will leave it as it is on the patch
[17:07:37] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (kamila) Note that this is non-deterministic: the problem seems to happen more than half the time but far from always. So several reboots may be required to reproduce. Yay!
[17:32:27] serviceops, API Platform, CirrusSearch, MediaWiki-Configuration, Discovery-Search (Current work): Provide a method for internal services to run api requests for private wikis - https://phabricator.wikimedia.org/T345185 (EBernhardson) a:EBernhardson