[00:05:30] (update) don-vip: Update to OpenJDK 25 [toolforge-repos/spacemedia] - https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/5
[00:10:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[00:14:10] PROBLEM - Host cloudcephosd1043 is DOWN: PING CRITICAL - Packet loss = 100%
[00:15:48] RECOVERY - Host cloudcephosd1043 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms
[00:17:26] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[00:21:40] cloud-services-team, Cloud-VPS, Ceph: [ceph,eqiad1] upgrade from quincy->reef (and bookworm) - https://phabricator.wikimedia.org/T404249#11196082 (Andrew) All of Row C (rack C8) is now running reef and bookworm. Going to pause for Friday/the weekend and reimage the rest of the cluster next week.
[00:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[00:37:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[00:40:57] (update) don-vip: Update to OpenJDK 25 [toolforge-repos/spacemedia] - https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/5
[00:56:08] Cloud-Services: TOOLFORGE: WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! - https://phabricator.wikimedia.org/T405050 (pwangai) NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a mo...
[00:56:46] cloud-services-team, Toolforge: TOOLFORGE: WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! - https://phabricator.wikimedia.org/T405050#11196123 (pwangai)
[01:00:37] cloud-services-team, Toolforge: TOOLFORGE: WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! - https://phabricator.wikimedia.org/T405050#11196126 (JJMC89) Open→Invalid The bastions were replaced. See https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/I4M335NM...
[01:07:37] cloud-services-team, Toolforge: TOOLFORGE: WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! - https://phabricator.wikimedia.org/T405050#11196128 (pwangai) @JJMC89 I missed that announcement, thank you.
[01:19:56] Quarry: Queries with blank titles aren't clickable in the recent query list - https://phabricator.wikimedia.org/T405051 (Perryprog) NEW
[01:20:43] Quarry: Queries with blank titles aren't clickable in the recent query list - https://phabricator.wikimedia.org/T405051#11196149 (Perryprog)
[01:23:01] Quarry: Queries with blank titles aren't clickable in the recent query list - https://phabricator.wikimedia.org/T405051#11196153 (Perryprog) →Duplicate dup:T197029
[01:23:06] Quarry: Define in a single place the pseudoname of unnamed queries - https://phabricator.wikimedia.org/T197029#11196155 (Perryprog)
[01:37:41] Quarry: Define in a single place the pseudoname of unnamed queries - https://phabricator.wikimedia.org/T197029#11196157 (Perryprog) It's worth mentioning that at least one component of this issue is the fact that all whitespace query names are allowed, which is actually what causes the blank entries to show...
[01:38:18] (update) don-vip: Update to OpenJDK 25 [toolforge-repos/spacemedia] - https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/5
[01:51:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[02:00:01] Quarry: Define in a single place the pseudoname of unnamed queries - https://phabricator.wikimedia.org/T197029#11196164 (Cryptic) The more immediate problem - as reported in both current duplicates - is that if you add a title and then either blank it or replace it with whitespace, it doesn't go back to "Cli...
[02:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[03:01:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[03:29:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[03:54:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[06:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[06:32:19] cloud-services-team, Cloud-VPS (Debian Buster Deprecation): Buster VMs in cloud-vps PKI project - https://phabricator.wikimedia.org/T405017#11196353 (elukey) Hey Andrew! These are used by deployment-prep's TLS certificates, so we should probably upgrade them to say bookworm. We have to do the upgrade in...
[06:41:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[06:50:54] cloud-services-team, Cloud-VPS (Debian Buster Deprecation): Buster VMs in cloud-vps PKI project - https://phabricator.wikimedia.org/T405017#11196378 (fgiunchedi) a:fgiunchedi→None I have not interacted with this project in a long time, can't judge the status/usefulness of the VMs
[06:53:55] (PS1) Lokal Profil: comitting local changes from toolforge [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189630
[06:57:04] !log godog@r5 testlabs START - Cookbook wmcs.nfs.add_server
[06:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Testlabs/SAL
[07:08:05] !log godog@r5 testlabs END (PASS) - Cookbook wmcs.nfs.add_server (exit_code=0)
[07:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Testlabs/SAL
[07:09:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[07:14:07] !log godog@r5 testlabs START - Cookbook wmcs.nfs.add_server
[07:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Testlabs/SAL
[07:19:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[07:22:26] !log godog@r5 testlabs END (FAIL) - Cookbook wmcs.nfs.add_server (exit_code=99)
[07:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Testlabs/SAL
[07:24:06] (CR) Jean-Frédéric: [C:+2] comitting local changes from toolforge [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189630 (owner: Lokal Profil)
[07:26:08] (Merged) jenkins-bot: comitting local changes from toolforge [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189630 (owner: Lokal Profil)
[07:28:20] (CR) Stevemunene: [V:+2 C:+2] Add a dummy Ceph user keys for the cephcsi plugin to use [labs/private] - https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576) (owner: Stevemunene)
[08:02:44] Tool-unwatchlist, MediaWiki-Watchlist, Moderator-Tools-Team, Technical-Tool-Request: Create a tool to unwatchlist large numbers of pages - https://phabricator.wikimedia.org/T401274#11196518 (Chlod) https://unwatchlist.toolforge.org has been created and deployed but it's currently [pending OAuth a...
[08:15:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-11 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[08:23:01] (PS1) Filippo Giunchedi: wmcs_libs: add get_address_family [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189791 (https://phabricator.wikimedia.org/T404584)
[08:23:03] (PS1) Filippo Giunchedi: nfs: make add_server idempotent for service IP [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189792 (https://phabricator.wikimedia.org/T404584)
[08:26:22] (CR) CI reject: [V:-1] nfs: make add_server idempotent for service IP [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189792 (https://phabricator.wikimedia.org/T404584) (owner: Filippo Giunchedi)
[08:27:01] (CR) CI reject: [V:-1] wmcs_libs: add get_address_family [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189791 (https://phabricator.wikimedia.org/T404584) (owner: Filippo Giunchedi)
[08:38:01] (PS2) Filippo Giunchedi: wmcs_libs: add get_address_family [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189791 (https://phabricator.wikimedia.org/T404584)
[08:38:01] (PS2) Filippo Giunchedi: nfs: make add_server idempotent for service IP [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189792 (https://phabricator.wikimedia.org/T404584)
[08:41:31] (CR) CI reject: [V:-1] nfs: make add_server idempotent for service IP [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189792 (https://phabricator.wikimedia.org/T404584) (owner: Filippo Giunchedi)
[08:45:05] (CR) Filippo Giunchedi: "FTR the CI failures are these:" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189792 (https://phabricator.wikimedia.org/T404584) (owner: Filippo Giunchedi)
[08:55:02] (PS1) Slyngshede: Add cdn_private_git_token dummy [labs/private] - https://gerrit.wikimedia.org/r/1189798
[08:56:02] (PS2) Slyngshede: Add cdn_private_git_token dummy [labs/private] - https://gerrit.wikimedia.org/r/1189798
[09:03:59] (CR) Slyngshede: [V:+2 C:+2] Add cdn_private_git_token dummy [labs/private] - https://gerrit.wikimedia.org/r/1189798 (owner: Slyngshede)
[09:18:52] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11196668 (fgiunchedi) Something else I noticed is memory spikes spaced 2h apart, which correspond to block compaction, that also might contribute to push memory usage over the...
[09:33:45] cloud-services-team, Toolforge, Patch-For-Review: Eliminate single point of failure from Toolforge front proxy - https://phabricator.wikimedia.org/T283948#11196703 (taavi) a:taavi
[09:34:17] (Abandoned) Majavah: Read running tools from grid-webservices tool [software/tools-manifest] - https://gerrit.wikimedia.org/r/703189 (https://phabricator.wikimedia.org/T284564) (owner: Majavah)
[09:48:37] FIRING: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-6:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:56:38] Tool-unwatchlist, MediaWiki-Watchlist, Moderator-Tools-Team, Technical-Tool-Request: Create a tool to unwatchlist large numbers of pages - https://phabricator.wikimedia.org/T401274#11196730 (Samwalton9-WMF) >>! In T401274#11196518, @Chlod wrote: > https://unwatchlist.toolforge.org has been create...
[09:59:57] cloud-services-team, Toolforge: [components-api] restart rather than delete/create continuous jobs - https://phabricator.wikimedia.org/T403321#11196740 (DamianZaremba) I have a suspicion that this is the reason for 5min outage on cluebotng-review this morning. Checking the current events (don't have the...
[10:18:37] RESOLVED: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-6:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:29:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:443 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:04:07] Toolforge, tools-infrastructure-team: Eliminate single point of failure from Toolforge front proxy - https://phabricator.wikimedia.org/T283948#11196910 (taavi)
[11:05:43] Toolforge, tools-infrastructure-team: Rebuild Toolforge HAProxies to support IPv6 - https://phabricator.wikimedia.org/T405078 (taavi) NEW
[11:05:56] Toolforge, tools-infrastructure-team: Rebuild Toolforge HAProxies to support IPv6 - https://phabricator.wikimedia.org/T405078#11196924 (taavi) p:Triage→Medium
[11:08:09] (open) taavi: channels: Update -cloud-feed project tags [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/61
[11:08:13] (update) taavi: channels: Update -cloud-feed project tags [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/61
[11:08:28] (update) taavi: channels: Update -cloud-feed project tags [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/61
[12:02:27] (PS1) Lokal Profil: fix deploy message [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189851
[12:06:46] (PS2) Lokal Profil: fix deploy message [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189851
[12:07:29] Tools: [Wudele] Clarify timezone information for date/time options in Wudele polls - https://phabricator.wikimedia.org/T405088 (Wikitanvir) NEW
[12:35:10] (CR) Jean-Frédéric: [C:+2] fix deploy message [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189851 (owner: Lokal Profil)
[12:37:08] (Merged) jenkins-bot: fix deploy message [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189851 (owner: Lokal Profil)
[13:00:47] (PS1) Filippo Giunchedi: wmcs_libs: add network_id to NeutronPort [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189866 (https://phabricator.wikimedia.org/T404584)
[13:00:49] (PS1) Filippo Giunchedi: wmcs_libs: add optional ttl to recordset_create [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189867 (https://phabricator.wikimedia.org/T404584)
[13:00:51] (PS1) Filippo Giunchedi: nfs: do DNS flip on network migration [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189868
[13:03:44] !log filippo@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-11
[13:04:38] (CR) CI reject: [V:-1] nfs: do DNS flip on network migration [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189868 (owner: Filippo Giunchedi)
[13:04:38] (CR) CI reject: [V:-1] wmcs_libs: add network_id to NeutronPort [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189866 (https://phabricator.wikimedia.org/T404584) (owner: Filippo Giunchedi)
[13:04:53] (CR) CI reject: [V:-1] wmcs_libs: add optional ttl to recordset_create [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189867 (https://phabricator.wikimedia.org/T404584) (owner: Filippo Giunchedi)
[13:09:47] !log filippo@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-11
[13:10:20] (PS3) Filippo Giunchedi: nfs: make add_server idempotent for service IP [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189792 (https://phabricator.wikimedia.org/T404584)
[13:10:20] (PS2) Filippo Giunchedi: wmcs_libs: add network_id to NeutronPort [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189866 (https://phabricator.wikimedia.org/T404584)
[13:10:20] (PS2) Filippo Giunchedi: wmcs_libs: add optional ttl to recordset_create [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189867 (https://phabricator.wikimedia.org/T404584)
[13:10:21] (PS2) Filippo Giunchedi: nfs: do DNS flip on network migration [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189868
[13:14:09] (CR) CI reject: [V:-1] nfs: do DNS flip on network migration [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189868 (owner: Filippo Giunchedi)
[13:19:28] (CR) Filippo Giunchedi: "Aside from the CI failure, this is a sketch of a solution as I'd like input on whether I'm on the right track with the code." [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189868 (owner: Filippo Giunchedi)
[13:35:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-11 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[14:00:24] cloud-services-team (Hardware), Cloud-VPS, Goal: eqiad1: procure 1 additional cloudlb server - https://phabricator.wikimedia.org/T341062#11197583 (taavi) Open→Invalid I don't think we have interest in this at the moment.
[14:00:44] cloud-services-team: WMCS hardware services: 3-node HA redundancy model - https://phabricator.wikimedia.org/T377570#11197591 (taavi) Open→Invalid
[14:01:31] cloud-services-team, Cloud-VPS, DC-Ops, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [cookbooks.ceph] create a script to get the list of rbd images affected by stuck/inactive PGs - https://phabricator.wikimedia.org/T331636#11197592 (taavi)
[14:40:21] Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11197657 (akosiaris) Thanks for this writeup! Couple of inline replies >>! In T404726#11194763, @dcaro wrote: > In the team meeting from today we decided that we should fi...
[15:07:42] cloud-services-team, Cloud-VPS (Debian Buster Deprecation): Buster VMs in cloud-vps PKI project - https://phabricator.wikimedia.org/T405017#11197702 (Andrew) >>! In T405017#11196353, @elukey wrote: > Hey Andrew! These are used by deployment-prep's TLS certificates, so we should probably upgrade them to...
[15:17:25] (CR) Andrew Bogott: [V:+2 C:+2] Fix .gitreview after a merge mishap [openstack/horizon/horizon] (rebuild) - https://gerrit.wikimedia.org/r/1189291 (owner: Andrew Bogott)
[15:17:38] (Abandoned) Andrew Bogott: Fix .gitreview after a merge mishap [openstack/horizon/horizon] (rebuild) - https://gerrit.wikimedia.org/r/1189291 (owner: Andrew Bogott)
[15:18:51] Cloud-VPS (Project-requests): Request creation of gitlab-runners-staging VPS project - https://phabricator.wikimedia.org/T404386#11197729 (Andrew) Open→Resolved @dduvall I think you're all set but please re-open if I've missed anything.
[15:34:10] cloud-services-team (Hardware), Cloud-VPS: wmcs codfw hardware changes proposal - https://phabricator.wikimedia.org/T377568#11197779 (taavi) a:Andrew `lang=irc andrewbogott: T377568 is no longer relevant, right? <+stashbot> T377568: wmcs codfw hardware changes proposal - https://phabricator.w...
[15:44:18] cloud-services-team, Cloud-VPS, patch-welcome, Python3-Porting: Upgrade various Cloud VPS Python 2 scripts to Python 3 - https://phabricator.wikimedia.org/T218426#11197818 (taavi) Open→Resolved Calling this done as no Cloud VPS infrastructure host even has Python 2 installed these days.
[15:44:58] cloud-services-team: KernelErrors Server cloudcephosd1052 logged kernel errors - https://phabricator.wikimedia.org/T404745#11197822 (taavi) Open→Invalid
[15:56:44] (PS1) Andrew Bogott: Fix gitreview to point back at wmf gerrit [openstack/horizon/magnum-ui] - https://gerrit.wikimedia.org/r/1189895
[15:56:44] (PS1) Andrew Bogott: remove MANIFEST.in [openstack/horizon/magnum-ui] - https://gerrit.wikimedia.org/r/1189896
[15:56:44] (PS1) Andrew Bogott: Fix merge issue with tox.ini [openstack/horizon/magnum-ui] - https://gerrit.wikimedia.org/r/1189897
[15:57:36] (PS2) Andrew Bogott: Fix gitreview to point back at wmf gerrit [openstack/horizon/magnum-ui] - https://gerrit.wikimedia.org/r/1189895
[15:57:36] (PS2) Andrew Bogott: remove MANIFEST.in [openstack/horizon/magnum-ui] - https://gerrit.wikimedia.org/r/1189896
[15:57:36] (PS2) Andrew Bogott: Fix merge issue with tox.ini [openstack/horizon/magnum-ui] - https://gerrit.wikimedia.org/r/1189897
[15:58:01] (CR) Andrew Bogott: [V:+2 C:+2] Fix gitreview to point back at wmf gerrit [openstack/horizon/magnum-ui] - https://gerrit.wikimedia.org/r/1189895 (owner: Andrew Bogott)
[15:58:07] (CR) Andrew Bogott: [V:+2 C:+2] remove MANIFEST.in [openstack/horizon/magnum-ui] - https://gerrit.wikimedia.org/r/1189896 (owner: Andrew Bogott)
[15:58:14] (CR) Andrew Bogott: [V:+2 C:+2] Fix merge issue with tox.ini [openstack/horizon/magnum-ui] - https://gerrit.wikimedia.org/r/1189897 (owner: Andrew Bogott)
[16:10:01] Tool-unwatchlist, MediaWiki-Watchlist, Moderator-Tools-Team, Technical-Tool-Request: Create a tool to unwatchlist large numbers of pages - https://phabricator.wikimedia.org/T401274#11197879 (Chlod) >>! In T401274#11196730, @Samwalton9-WMF wrote: >>>! In T401274#11196518, @Chlod wrote: >> https://...
[16:39:29] (merge) bd808: channels: Update -cloud-feed project tags [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/61 (owner: taavi)
[17:04:03] (PS1) Ologuie Arlette: fix production page design [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1189907
[18:06:04] (PS1) Arendpieter: Use IDP for authentication [labs/striker] - https://gerrit.wikimedia.org/r/1189915 (https://phabricator.wikimedia.org/T359554)
[18:07:00] (CR) CI reject: [V:-1] Use IDP for authentication [labs/striker] - https://gerrit.wikimedia.org/r/1189915 (https://phabricator.wikimedia.org/T359554) (owner: Arendpieter)
[18:09:31] (CR) Arendpieter: "I’d really appreciate it if you could review my changes. I’m a Python developer myself, but I haven’t worked with Django before. Your feed" [labs/striker] - https://gerrit.wikimedia.org/r/1189915 (https://phabricator.wikimedia.org/T359554) (owner: Arendpieter)
[18:25:45] cloud-services-team, Horizon: Update our Horizon release to 2025.2 - https://phabricator.wikimedia.org/T405117 (Andrew) NEW
[18:26:19] cloud-services-team, Horizon: Update our Horizon release to 2025.2 - https://phabricator.wikimedia.org/T405117#11198174 (Andrew) The magnum UI in the latest horizon currently doesn't work; it requires an API microversion that we aren't running yet. Once magnum is updated to 2025.2/Flamingo the magnum pan...
[18:32:30] cloud-services-team, Horizon: Update our Horizon release to 2025.2 - https://phabricator.wikimedia.org/T405117#11198180 (Andrew) p:Triage→Medium
[18:40:40] cloud-services-team (Hardware), Cloud-VPS: wmcs codfw hardware changes proposal - https://phabricator.wikimedia.org/T377568#11198190 (Andrew) As of today: == What happened == [x] cloudcontrol2006-dev: increase memory in-place, or replace with another server with higher memory [x] cloudcontrol2007-dev:...
[18:46:07] cloud-services-team, Cloud-VPS: Cloud VPS project creation cookbook times out really often - https://phabricator.wikimedia.org/T398712#11198198 (Andrew) Open→Resolved I believe this to be fixed
[18:51:05] cloud-services-team, Cloud-VPS: OpenStack services should use system users to talk to Keystone - https://phabricator.wikimedia.org/T273150#11198203 (Andrew)
[20:41:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-2 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:47:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[20:48:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[20:52:55] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[20:53:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[21:30:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[21:31:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[22:03:59] Toolforge-standards-committee: Define a process for keeping the committee membership "fresh" - https://phabricator.wikimedia.org/T379844#11198539 (bd808)
[23:20:55] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[23:21:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown