[00:16:42] (CloudVPSDesignateLeaks) firing: (2) Detected 16 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:37:48] Grid-Engine-to-K8s-Migration: Migrate hazard-bot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319785#9556499 (Hazard-SJ) As far as I recall, the migration was partially done: some tasks were migrated, but others were still yet to be migrated. I don't recall offhand w...
[02:06:31] (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 25553 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[02:21:53] RECOVERY - ensure kvm processes are running on cloudvirt1032 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[03:07:46] (CR) Eugene233: "Thanks for submitting this fix @Ketulucas. A comments was left for you." [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1003601 (https://phabricator.wikimedia.org/T248587) (owner: Ketulucas)
[04:16:43] (CloudVPSDesignateLeaks) firing: (2) Detected 16 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:25:34] (PS1) Eugene233: Add user on callback after registering identity [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1004695 (https://phabricator.wikimedia.org/T357948)
[04:25:57] (CR) CI reject: [V: -1] Add user on callback after registering identity [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1004695 (https://phabricator.wikimedia.org/T357948) (owner: Eugene233)
[04:29:52] (PS2) Eugene233: Add user on callback after registering identity [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1004695 (https://phabricator.wikimedia.org/T357948)
[05:06:31] (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 36353 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[07:28:45] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138#9557114 (YonatanWMIL) I'm starting to fill the gaps that we have due to the DB going down. Should take about a day. Afterwards i'll reactivate the cron job and update.
[07:43:15] Cloud-Services, Toolforge, Wikimedia-Mailing-lists: Create a labs-announce-l mailing list - https://phabricator.wikimedia.org/T91864#9557473 (JJMC89) The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and...
[08:06:31] (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 45239 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[08:16:58] (CloudVPSDesignateLeaks) firing: (2) Detected 16 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:36:45] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-88
[08:37:28] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-88
[08:38:07] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[08:47:04] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-44.tools.eqiad1.wikimedia.cloud to the cluster
[08:47:04] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[08:47:15] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-89
[08:47:53] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-89
[08:48:07] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[08:52:28] (PS2) Ketulucas: fix application instructions [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1003601 (https://phabricator.wikimedia.org/T248587)
[08:52:30] (PS1) Ketulucas: clean up readme file [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1005028
[08:52:32] (PS1) Ketulucas: Bug: T248587. Added application instructions [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1005029 (https://phabricator.wikimedia.org/T248587)
[08:57:47] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-45.tools.eqiad1.wikimedia.cloud to the cluster
[08:57:48] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[08:59:10] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-90
[08:59:48] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-90
[09:00:07] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:02:50] !log taavi@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster
[09:05:10] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:15:44] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-46.tools.eqiad1.wikimedia.cloud to the cluster
[09:15:44] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[09:29:46] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-91
[09:30:26] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-91
[09:31:20] (PS1) Majavah: toolforge: Label and taint new ingress nodes [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1005033 (https://phabricator.wikimedia.org/T357425)
[09:31:21] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:31:42] Toolforge (Toolforge iteration 05), Patch-For-Review: Automatically add required taints and labels to ingress nodes - https://phabricator.wikimedia.org/T357425#9558185 (taavi) p:Triage→Medium a:taavi
[09:33:45] (PS1) Slyngshede: P:idp Add dummy OIDC secret for superset-next. [labs/private] - https://gerrit.wikimedia.org/r/1005034
[09:35:12] (CR) Muehlenhoff: [C: +1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/1005034 (owner: Slyngshede)
[09:35:33] (CR) Slyngshede: [V: +2] P:idp Add dummy OIDC secret for superset-next. [labs/private] - https://gerrit.wikimedia.org/r/1005034 (owner: Slyngshede)
[09:35:43] (CR) Slyngshede: [V: +2 C: +2] P:idp Add dummy OIDC secret for superset-next. [labs/private] - https://gerrit.wikimedia.org/r/1005034 (owner: Slyngshede)
[09:37:27] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a ingress role in the tools cluster
[09:38:46] (CR) Arturo Borrero Gonzalez: [C: +1] "Nice. LGTM." [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1005033 (https://phabricator.wikimedia.org/T357425) (owner: Majavah)
[09:41:00] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-47.tools.eqiad1.wikimedia.cloud to the cluster
[09:41:00] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[09:42:18] cloud-services-team, Infrastructure-Foundations, SRE, User-aborrero: ACPI kernel failure on debian installer last step - https://phabricator.wikimedia.org/T357896#9558219 (LSobanski)
[09:46:28] !log taavi@cloudcumin1001 tools Added a new k8s ingress tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud to the cluster
[09:46:28] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a ingress role in the tools cluster
[09:52:23] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-ingress-6
[09:53:06] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-ingress-6
[09:59:28] (InstanceDown) firing: Project tools instance tools-k8s-ingress-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:04:28] (InstanceDown) resolved: Project tools instance tools-k8s-ingress-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:15:57] (SystemdUnitDown) firing: The systemd unit wmf_auto_restart_virtlogd.service on node cloudvirt1032 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1032 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[10:44:25] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-92
[10:45:05] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-92
[10:45:24] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[10:47:40] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[10:54:48] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:56:34] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-48.tools.eqiad1.wikimedia.cloud to the cluster
[10:56:34] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[10:56:50] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-93
[10:57:28] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-93
[10:57:35] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-94
[10:58:13] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-94
[11:04:36] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-95
[11:05:17] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-95
[11:05:28] (InstanceDown) firing: Project tools instance tools-k8s-worker-94 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:05:47] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[11:06:31] (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 56039 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[11:10:28] (InstanceDown) resolved: Project tools instance tools-k8s-worker-94 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:16:51] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-49.tools.eqiad1.wikimedia.cloud to the cluster
[11:16:51] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[11:16:57] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[11:21:56] RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[11:26:30] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-50.tools.eqiad1.wikimedia.cloud to the cluster
[11:26:30] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[11:26:50] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[11:28:24] !log aborrero@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[11:28:43] !log aborrero@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[11:30:00] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.post-reimage preparing cloudvirt cloudvirt1032.eqiad.wmnet for duty (nova discovery, canary VM) Pending aggregates though. (T319184)
[11:30:05] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184
[11:30:28] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.post-reimage (exit_code=0) preparing cloudvirt cloudvirt1032.eqiad.wmnet for duty (nova discovery, canary VM) Pending aggregates though. (T319184)
[11:31:03] !log aborrero@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[11:31:22] !log aborrero@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[11:36:24] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-51.tools.eqiad1.wikimedia.cloud to the cluster
[11:36:24] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[11:36:52] !log aborrero@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[11:37:11] !log aborrero@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[11:43:00] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-96
[11:43:40] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-96
[11:44:11] cloud-services-team, User-aborrero: wmcs.openstack.cloudvirt.lib.ensure_canary cookbook creates multiple canary VMs - https://phabricator.wikimedia.org/T357970#9558528 (aborrero)
[11:45:18] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[11:45:22] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=99)
[11:45:28] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[11:45:34] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[11:45:49] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[11:55:58] !log taavi@runko toolsbeta START - Cookbook wmcs.toolforge.k8s.worker.drain for node toolsbeta-test-k8s-worker-nfs-1
[11:56:00] Toolforge, Patch-For-Review: Tool Labs users .bashrc file does not exist for tools accounts - https://phabricator.wikimedia.org/T131561#9558585 (dcaro)
[11:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:56:08] Cloud-Services, Toolforge: Use skeleton home directories and PAM as a base for maintain-dbusers and maintain-kubeusers - https://phabricator.wikimedia.org/T91235#9558582 (dcaro) Open→Resolved a:dcaro Decided to go in a simpler way.
[11:56:18] Toolforge, Patch-For-Review, User-dcaro: Tool Labs users .bashrc file does not exist for tools accounts - https://phabricator.wikimedia.org/T131561#9558589 (dcaro) Open→In progress
[11:56:18] !log taavi@runko toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node toolsbeta-test-k8s-worker-nfs-1
[11:56:20] Toolforge, Patch-For-Review, User-dcaro: Tool Labs users .bashrc file does not exist for tools accounts - https://phabricator.wikimedia.org/T131561#2170929 (dcaro) a:dcaro
[11:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:56:41] !log taavi@runko toolsbeta START - Cookbook wmcs.toolforge.k8s.worker.drain for node toolsbeta-test-k8s-worker-nfs-1
[11:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:56:48] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-52.tools.eqiad1.wikimedia.cloud to the cluster
[11:56:49] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[11:56:49] Toolforge (Toolforge iteration 05), Patch-For-Review, User-dcaro: Tool Labs users .bashrc file does not exist for tools accounts - https://phabricator.wikimedia.org/T131561#9558591 (dcaro)
[11:56:55] !log taavi@runko toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node toolsbeta-test-k8s-worker-nfs-1
[11:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:57:10] Toolforge (Toolforge iteration 05), Patch-For-Review, User-dcaro: Tool Labs users .bashrc file does not exist for tools accounts - https://phabricator.wikimedia.org/T131561#9558578 (CodeReviewBot) dcaro opened https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/13...
[11:58:17] Toolforge (Toolforge iteration 05), Patch-For-Review, User-dcaro: Tool Labs users .bashrc file does not exist for tools accounts - https://phabricator.wikimedia.org/T131561#2170929 (dcaro)
[12:01:48] cloud-services-team, Infrastructure-Foundations, SRE, netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9558616 (aborrero)
[12:05:26] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-97
[12:06:06] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-97
[12:07:39] (PS1) Majavah: toolforge: k8s: Pass a cluster name to drain cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1005058
[12:09:13] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[12:13:28] (InstanceDown) firing: Project tools instance tools-k8s-worker-97 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:15:52] cloud-services-team, User-aborrero: openstack: nova refuses to admit a compute node after a reimage - https://phabricator.wikimedia.org/T357631#9558687 (dcaro) Is this because we split the cookbook in three steps? (as in, would it be enough to store the value on the first step, and reuse in the next ones?)
[12:16:58] (CloudVPSDesignateLeaks) firing: (2) Detected 20 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:18:28] (InstanceDown) resolved: Project tools instance tools-k8s-worker-97 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:18:46] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-53.tools.eqiad1.wikimedia.cloud to the cluster
[12:18:46] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[12:19:08] Toolforge: 2024-02-19: toolforge NFS cleanup - https://phabricator.wikimedia.org/T357882#9558694 (DB111)
[12:19:17] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-98
[12:19:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-k8s-worker-nfs-51 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:19:59] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-98
[12:20:05] Tools: wiki-osm.pl: Use of uninitialized value within @kml in lc at /data/project/osm4wiki/public_html/cgi-bin/wiki/wiki-osm.pl line 166. - https://phabricator.wikimedia.org/T357899#9558691 (DB111) Open→Resolved a:DB111 Still seeing no errors in error.log after some days. Maybe related to tempora...
[12:20:35] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[12:21:18] (CR) David Caro: [C: +1] "LGTM, there's sometimes some ambivalence between using a node name and discover the cluster, or using the cluster name directly, but in th" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1005058 (owner: Majavah)
[12:21:22] (CR) Nikerabbit: [C: +1] Fix a lego message [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1003743 (owner: Amire80)
[12:23:19] Tools: Parameter l (level) not working for osm4wiki.toolforge.org/cgi-bin/wiki/wiki-osm.pl - https://phabricator.wikimedia.org/T321972#9558708 (DB111) Open→Declined
[12:26:54] (CR) Majavah: [C: +2] toolforge: k8s: Pass a cluster name to drain cookbook (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1005058 (owner: Majavah)
[12:29:22] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-54.tools.eqiad1.wikimedia.cloud to the cluster
[12:29:22] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[12:29:30] Cloud-Services, Toolforge: Collect and display basic metrics for all tools (service groups) - https://phabricator.wikimedia.org/T129630#9558717 (dcaro) I wonder if nowadays all this should be ingested by prometheus and exposed through grafana dashboards instead of implementing specific tools to show each...
[12:29:37] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-99
[12:30:13] (Merged) jenkins-bot: toolforge: k8s: Pass a cluster name to drain cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1005058 (owner: Majavah)
[12:30:18] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-99
[12:30:31] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[12:30:56] Cloud-VPS, Toolforge, cloud-services-team, Sustainability (Incident Followup): Write some labs tests that monitor login and sudo permissions - https://phabricator.wikimedia.org/T127716#9558726 (dcaro) p:Medium→Low
[12:32:49] Toolforge, Tools-Kubernetes, Kubernetes: Run https://github.com/kubernetes/node-problem-detector on all our nodes - https://phabricator.wikimedia.org/T140249#9558727 (dcaro) p:Medium→Low
[12:34:26] Toolforge, Observability-Logging, Wikimedia-Logstash, observability: [toolforge.root] Setup ELK based logging for tool labs infrastructure components - https://phabricator.wikimedia.org/T141500#9558729 (dcaro)
[12:36:24] Cloud-Services, Toolforge: /data/project/aaaaaa and /data/project/prometheus are owned by root - https://phabricator.wikimedia.org/T152170#9558732 (dcaro) Open→Declined Will reopen if needed
[12:40:01] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-55.tools.eqiad1.wikimedia.cloud to the cluster
[12:40:01] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[12:46:21] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-100
[12:47:05] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-100
[12:47:16] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[12:54:38] cloud-services-team, User-aborrero: openstack: nova refuses to admit a compute node after a reimage - https://phabricator.wikimedia.org/T357631#9558773 (aborrero) This is a recent change in openstack, apparently, see https://docs.openstack.org/nova/latest/admin/compute-node-identification.html I think t...
[12:57:21] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-56.tools.eqiad1.wikimedia.cloud to the cluster
[12:57:21] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[13:11:00] Toolforge: become should have a better error message when homedir doesn't exist - https://phabricator.wikimedia.org/T149511#9558804 (dcaro) Open→Declined Will reopen if needed (we might move become inside the toolforge cli)
[13:14:39] Toolforge, cloud-services-team: Replace Toolschecker alerts with Prometheus based ones - https://phabricator.wikimedia.org/T313030#9558812 (dcaro) p:Triage→Medium
[13:18:53] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the toolsbeta cluster
[13:26:11] !log taavi@cloudcumin1001 toolsbeta Added a new k8s worker toolsbeta-test-k8s-worker-10.toolsbeta.eqiad1.wikimedia.cloud to the cluster
[13:26:11] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the toolsbeta cluster
[13:26:25] (CR) Eugene233: [C: +2] Adding prominence immediatley after adding new item fails [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1003859 (https://phabricator.wikimedia.org/T357871) (owner: Eugene233)
[13:26:40] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-test-k8s-worker-9
[13:26:41] !log taavi@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host tools-test-k8s-worker-9
[13:26:51] (Merged) jenkins-bot: Adding prominence immediatley after adding new item fails [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1003859 (https://phabricator.wikimedia.org/T357871) (owner: Eugene233)
[13:26:54] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_node for host toolsbeta-test-k8s-worker-9
[13:26:54] !log taavi@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=2) for host toolsbeta-test-k8s-worker-9
[13:28:47] (PS1) Majavah: toolforge: k8s: Fix arguments being passed to drain cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1005087
[13:29:00] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_node for host toolsbeta-test-k8s-worker-9
[13:30:23] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the toolsbeta cluster
[13:31:34] Toolforge: [toolforge] create fullstack tests - https://phabricator.wikimedia.org/T357977#9558896 (dcaro)
[13:31:40] Toolforge: [toolforge] create fullstack tests - https://phabricator.wikimedia.org/T357977#9558908 (dcaro) p:Triage→High
[13:31:44] cloud-services-team: toolschecker: naming refresh - https://phabricator.wikimedia.org/T277542#9558917 (dcaro)
[13:31:50] Toolforge, Tools-Kubernetes, Kubernetes: Setup monitoring for kubernetes core components. - https://phabricator.wikimedia.org/T131929#9558918 (dcaro)
[13:33:26] Toolforge: [toolforge] create fullstack tests - https://phabricator.wikimedia.org/T357977#9558973 (dcaro)
[13:33:30] Toolforge, cloud-services-team: Replace Toolschecker alerts with Prometheus based ones - https://phabricator.wikimedia.org/T313030#9558972 (dcaro)
[13:34:01] Toolforge, Tools-Kubernetes, cloud-services-team, Kubernetes: Build replacement for the webservice toolschecker test - https://phabricator.wikimedia.org/T142164#9558914 (dcaro) Open→Declined This is replaced with {T357977}
[13:36:34] Toolforge, Documentation: Record video tutorial(s) of basic Toolforge access and use - https://phabricator.wikimedia.org/T162654#9558991 (dcaro) p:Medium→Low
[13:38:27] !log taavi@cloudcumin1001 toolsbeta Added a new k8s worker toolsbeta-test-k8s-worker-11.toolsbeta.eqiad1.wikimedia.cloud to the cluster
[13:38:27] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the toolsbeta cluster
[13:38:46] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a control role in the toolsbeta cluster
[13:46:48] !log taavi@cloudcumin1001 toolsbeta Added a new k8s control toolsbeta-test-k8s-control-9.toolsbeta.eqiad1.wikimedia.cloud to the cluster
[13:46:48] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a control role in the toolsbeta cluster
[13:48:00] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_node for host toolsbeta-test-k8s-control-6
[13:48:08] !log taavi@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=2) for host toolsbeta-test-k8s-control-6
[13:48:46] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_node for host toolsbeta-test-k8s-control-6
[13:50:15] (PS2) Majavah: toolforge: k8s: Fix arguments being passed to drain cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1005087
[13:53:37] Toolforge, Kubernetes: Decide on upgrade policy for Kubernetes - https://phabricator.wikimedia.org/T133598#9559028 (dcaro) Open→Resolved a:dcaro We decided not to decide, and keep the upgrades "best effort", trying to keep up with the 4-months cadence of the major releases, but falling back t...
[13:54:55] Toolforge: Apply pretty 'banned' error page to user-agent bans - https://phabricator.wikimedia.org/T122583#9559035 (dcaro) Open→Invalid I think this does not apply anymore, will reopen if needed.
[13:56:51] Data-Services: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-02-20 - https://phabricator.wikimedia.org/T357979#9559042 (fnegri)
[13:57:12] Data-Services: [toolsdb] Replica is frequently lagging behind the primary - https://phabricator.wikimedia.org/T357624#9559055 (fnegri)
[13:57:30] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-02-20 - https://phabricator.wikimedia.org/T357979#9559053 (fnegri)
[13:57:43] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-02-20 - https://phabricator.wikimedia.org/T357979#9559053 (fnegri) Open→Stalled p:Triage→Medium
[13:58:16] Toolforge, cloud-services-team, Patch-For-Review: [toolforge] Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#9559057 (dcaro)
[14:00:15] Toolforge, cloud-services-team: allow tool users to attach strace to their processes (at least on exec hosts) - https://phabricator.wikimedia.org/T114401#9559078 (dcaro) Open→Declined The grid will be shutting down on Mar 14th 2024.
[14:00:37] 10Data-Services: [toolsdb] Replica is frequently lagging behind the primary - https://phabricator.wikimedia.org/T357624#9559081 (10fnegri) [14:00:45] 10Toolforge: [toolforge] Provide easier way(s) to contact people abusing resources - https://phabricator.wikimedia.org/T114560#9559086 (10dcaro) [14:01:58] 10Toolforge: [toolforge] Automate getting the maintainers from the tool accounts/uids - https://phabricator.wikimedia.org/T114560#1699391 (10dcaro) [14:02:42] 10cloud-services-team, 10User-aborrero: wmcs.openstack.cloudvirt.lib.ensure_canary cookbook creates multiple canary VMs - https://phabricator.wikimedia.org/T357970#9559093 (10aborrero) p:05Triage→03Medium [14:17:30] 10Toolforge (Toolforge iteration 05), 10cloud-services-team, 10Kubernetes, 10Patch-For-Review: Toolforge k8s: Migrate workers to Containerd and Bookworm - https://phabricator.wikimedia.org/T284656#9559173 (10taavi) [14:19:56] 10Toolforge (Toolforge iteration 05), 10Toolforge Jobs framework: Allow using file logs with build service images - https://phabricator.wikimedia.org/T353537#9559178 (10dcaro) a:03dcaro [14:21:29] 10Toolforge (Toolforge iteration 05): [jobs] Enable filelog for buildservice-based images - https://phabricator.wikimedia.org/T357897#9559184 (10dcaro) 05duplicate→03Resolved [14:29:48] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service: Build service: Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016#9559227 (10dcaro) 05Stalled→03In progress [14:29:50] 10Grid-Engine-to-K8s-Migration: Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320140#9559228 (10dcaro) [14:31:10] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service, 10cloud-services-team, 10Cloud-Services-Origin-Team, and 2 others: [builds-api] Automatically deploy the webservice when the image is built - 
https://phabricator.wikimedia.org/T341065#9559231 (10dcaro) 05Stalled→03In progress [14:31:14] 10Cloud Services Proposals, 10Toolforge Build Service, 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-Services-Origin-Team, and 4 others: [Epic] Make Toolforge a proper platform as a service with push-to-deploy and build packs - https://phabricator.wikimedia.org/T194332#9559232 (10dcaro) [14:31:51] 10Toolforge (Toolforge iteration 05), 10User-aborrero: [toolforge API] Investigate ways to present our multiple Openapi definitions to a future consolidated CLI client - https://phabricator.wikimedia.org/T354745#9559249 (10aborrero) [14:35:07] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service: [tbs][builds-api] Refactor `internal/builds.go` - https://phabricator.wikimedia.org/T352762#9559271 (10dcaro) 05In progress→03Resolved [14:35:11] 10Toolforge (Toolforge iteration 05), 10Toolforge Jobs framework, 10Patch-For-Review, 10User-aborrero: toolforge: introduce OpenAPI to jobs framework - https://phabricator.wikimedia.org/T356523#9559248 (10aborrero) 05Open→03In progress [14:39:26] 10Toolforge (Toolforge iteration 05): [toolforge-cd] gitlab-ci refactor - https://phabricator.wikimedia.org/T353514#9559304 (10dcaro) a:05Raymond_Ndibe→03None [14:39:56] 10Toolforge: [toolforge-cd] gitlab-ci refactor - https://phabricator.wikimedia.org/T353514#9408988 (10dcaro) [14:42:12] 10Toolforge (Toolforge iteration 05), 10User-aborrero: [toolforge] several tools get periods of connection refused (104) when connecting to wikis - https://phabricator.wikimedia.org/T356164#9559316 (10aborrero) Maybe an idea: have a per-tool network quota for concurrent connections. We don't have any semantics... 
[14:46:07] 10Toolforge (Toolforge iteration 05): [builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972#9559334 (10dcaro) https://spec.openapis.org/oas/v3.0.3#info-object [14:49:17] 10Toolforge (Toolforge iteration 05): [builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972#9559346 (10dcaro) We can try to set it to the number of commits that changed the open-api definition file to make it more stable w... [14:52:58] 10PAWS: jupyterlab to 4.1.2 - https://phabricator.wikimedia.org/T357990#9559379 (10rook) [14:54:03] 10PAWS: jupyterlab to 4.1.2 - https://phabricator.wikimedia.org/T357990#9559393 (10github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/378 [14:54:49] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [14:54:55] (03PS1) 10Slyngshede: IDP: Add superset_k8s dummy secret. [labs/private] - 10https://gerrit.wikimedia.org/r/1005102 [14:55:13] vivian-rook opened https://github.com/toolforge/paws/pull/378 [14:58:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/1005102 (owner: 10Slyngshede) [14:59:02] 10Toolforge (Toolforge iteration 05): [builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972#9559421 (10fnegri) I found some conversations online about what to put in the `info.version` field, but no conclusive answer: * h... [15:00:33] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] IDP: Add superset_k8s dummy secret. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1005102 (owner: 10Slyngshede) [15:05:02] (03PS1) 10Slyngshede: IDP: Add Superset next k8s dummy secret. [labs/private] - 10https://gerrit.wikimedia.org/r/1005105 [15:05:36] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] IDP: Add Superset next k8s dummy secret. [labs/private] - 10https://gerrit.wikimedia.org/r/1005105 (owner: 10Slyngshede) [15:19:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-k8s-worker-nfs-51 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:21:52] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.refresh_puppet_certs on tools-k8s-worker-nfs-51.tools.eqiad1.wikimedia.cloud [15:22:15] 10Toolforge (Toolforge iteration 05), 10cloud-services-team: Migrate remaining tools off Gridengine - https://phabricator.wikimedia.org/T313405#9559546 (10dcaro) p:05Medium→03High [15:23:09] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on tools-k8s-worker-nfs-51.tools.eqiad1.wikimedia.cloud [15:28:30] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:28:36] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.41 ms [15:29:28] (PuppetAgentNoResources) resolved: No Puppet resources found on instance tools-k8s-worker-nfs-51 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:29:28] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster [15:31:02] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:32:45] (03PS1) 10Josefanthony: Change machine vision request url [labs/tools/Isa] - 10https://gerrit.wikimedia.org/r/1004699 [15:33:36] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.28 ms [15:38:16] !log taavi@cloudcumin1001 
tools Added a new k8s worker tools-k8s-worker-102.tools.eqiad1.wikimedia.cloud to the cluster [15:38:16] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster [15:39:05] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-102 [15:39:37] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-102 [15:40:28] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster [15:46:11] 10Cloud-VPS, 10Data-Services, 10cloud-services-team (FY2023/2024-Q3-Q4), 10Patch-For-Review: [toolsdb] [cinder] [ceph] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904#9559762 (10Andrew) Backy2 always persists one snapshot for each volume in order to do incremental backups. So as... [15:48:58] !log taavi@cloudcumin1001 tools Added a new k8s worker tools-k8s-worker-102.tools.eqiad1.wikimedia.cloud to the cluster [15:48:58] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster [15:50:05] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-101 [15:50:45] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-101 [16:00:44] (03CR) 10Josefanthony: "I modified the old API endpoint URL 'https://m2c.wikimedia.se/extract/' with the new URL 'https://m2c.wmcloud.org/extract/' in the url fie" [labs/tools/Isa] - 10https://gerrit.wikimedia.org/r/1004699 (owner: 10Josefanthony) [16:03:10] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster [16:04:44] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-102 
[16:04:52] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-102 [16:07:55] 10Toolforge (Toolforge iteration 05), 10cloud-services-team, 10Kubernetes, 10Patch-For-Review: Toolforge k8s: Migrate workers to Containerd and Bookworm - https://phabricator.wikimedia.org/T284656#9559920 (10taavi) [16:12:31] !log taavi@cloudcumin1001 tools Added a new k8s worker tools-k8s-worker-103.tools.eqiad1.wikimedia.cloud to the cluster [16:12:31] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster [16:16:58] (CloudVPSDesignateLeaks) firing: (2) Detected 24 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [16:17:22] 10PAWS: jupyterlab to 4.1.2 - https://phabricator.wikimedia.org/T357990#9560026 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/378 [16:17:31] 10PAWS: jupyterlab to 4.1.2 - https://phabricator.wikimedia.org/T357990#9560028 (10rook) 05Open→03Resolved [16:17:42] vivian-rook closed https://github.com/toolforge/paws/pull/378 [16:18:48] (03CR) 10Brouberol: "Thanks for pushing these patches. I should have done so in the first place." [labs/private] - 10https://gerrit.wikimedia.org/r/1005102 (owner: 10Slyngshede) [16:36:46] 10Tool-Pageviews, 10Wikimedia-Hackathon-2024: Create tool to get total media requests of all media in a category - https://phabricator.wikimedia.org/T245698#9560202 (10TBurmeister) I believe the WMF data products team are already working on this as part of the Commons Impact Metrics project, see T355560 and ht... 
[16:51:43] (CloudVPSDesignateLeaks) firing: (2) Detected 24 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [16:56:43] (CloudVPSDesignateLeaks) resolved: (2) Detected 24 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:13:47] 10Toolforge, 10Documentation, 10Wikimedia-Hackathon-2024, 10good first task: Find and fix inaccuracies in Toolforge Django tutorial - https://phabricator.wikimedia.org/T245683#9560411 (10TBurmeister) [17:24:30] 10Cloud-VPS (Project-requests): Request creation of mdwiki-offline VPS project - https://phabricator.wikimedia.org/T358023#9560449 (10Harej) [17:54:49] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [18:12:56] PROBLEM - toolschecker: start a job and verify on buster on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/buster - 177 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [18:22:56] RECOVERY - toolschecker: start a job and verify on buster on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.780 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [18:47:42] 10Toolforge, 10Documentation: Update and Improve Toolforge and Cloud VPS Technical Documentation 
- https://phabricator.wikimedia.org/T203131#9560961 (10TBurmeister) [18:47:44] 10Toolforge, 10Documentation: Document how to install Python modules in a tool's home directory/virtual environment - https://phabricator.wikimedia.org/T63824#9560962 (10TBurmeister) [18:51:35] 10Tool-Wikidata-Periodic-Table, 10Wikidata, 10Documentation, 10Patch-For-Review, 10Wikimedia-Hackathon-2024: Improve documentation of Wikidata periodic table - https://phabricator.wikimedia.org/T99847#9560980 (10TBurmeister) [18:56:24] 10Tools, 10Tech-Docs-Team, 10Wikimedia-Hackathon-2024: [Hackathon 2024] Improve technical documentation of tools - https://phabricator.wikimedia.org/T358040#9561003 (10TBurmeister) [19:03:00] 10Tools, 10Tech-Docs-Team, 10Wikimedia-Hackathon-2024: [Hackathon 2024] Improve technical documentation of tools - https://phabricator.wikimedia.org/T358040#9561003 (10TBurmeister) [19:03:42] 10Tools, 10Tech-Docs-Team, 10Documentation, 10Wikimedia-Hackathon-2024: [Hackathon 2024] Improve technical documentation of tools - https://phabricator.wikimedia.org/T358040#9561003 (10TBurmeister) [19:04:06] 10Tools, 10Tech-Docs-Team, 10Documentation, 10Wikimedia-Hackathon-2024: [Hackathon 2024] Improve technical documentation of tools - https://phabricator.wikimedia.org/T358040#9561027 (10TBurmeister) 05Open→03In progress p:05Triage→03Medium [20:24:52] 10Tool-Global-user-contributions, 10Stewards-and-global-tools, 10Temporary accounts, 10Trust and Safety Product Team, 10XTools: Display results according to final designs - https://phabricator.wikimedia.org/T358047#9561255 (10Tchanders) [20:26:28] 10Tool-Global-user-contributions, 10Stewards-and-global-tools, 10Temporary accounts, 10Trust and Safety Product Team, 10XTools: Add filters to search form according to final designs - https://phabricator.wikimedia.org/T358048#9561268 (10Tchanders) [21:48:55] (03PS1) 10VolkerE: releases: Bump Code to 1.3.3 [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/1005174 [22:09:52] (03CR) 
10LWatson: [C: 03+2] releases: Bump Code to 1.3.3 [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/1005174 (owner: 10VolkerE) [22:10:26] (03Merged) 10jenkins-bot: releases: Bump Code to 1.3.3 [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/1005174 (owner: 10VolkerE) [22:20:53] 10Tool-Pageviews, 10Data Products, 10Data-Engineering, 10Pageviews-API: No Pageviews data since 2024-02-17 - https://phabricator.wikimedia.org/T357910#9561660 (10Framawiki) 05Open→03Resolved a:03Sfaci Looks fixed for me, thanks @Sfaci and @BTullis for the quick fix. [23:11:50] 10Tool-global-search: 500: Internal Server Error on (Gadgets-definition|.*\.(js|css|json)) - https://phabricator.wikimedia.org/T358061#9561755 (10stjn)