[00:15:28] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:16:36] (PS1) BryanDavis: wikibugs: Extract XACT to page anchor mappings from data-javelin-init-data [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/1003127 (https://phabricator.wikimedia.org/T199007)
[00:17:29] (PS2) BryanDavis: wikibugs: Extract XACT to page anchor mappings from data-javelin-init-data [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/1003127 (https://phabricator.wikimedia.org/T199007)
[00:20:28] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:22:30] (PS1) BryanDavis: ci: bump tested python version to 3.9 [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/1003128
[00:30:05] Wikibugs, Patch-For-Review, User-bd808: Frequent exception while trying to extract anchors from task - https://phabricator.wikimedia.org/T199007 (bd808) Open→In progress a:bd808
[00:34:39] Wikibugs: Get anchors from API instead of screen scraping - https://phabricator.wikimedia.org/T1177 (bd808)
[00:41:37] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[00:50:04] Wikibugs: Get anchors from API instead of screen scraping - https://phabricator.wikimedia.org/T1177 (bd808) `maniphest.gettasktransactions` does seem to return the needed data currently. The endpoint takes a list of numeric task ids as input and returns a dict of numeric task id to array of transactions data...
[00:54:49] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[01:00:28] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate Puppet CA: paws-puppetmaster-01.paws.eqiad.wmflabs is about to expire in 27d 20h 58m 23s - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:11:18] Grid-Engine-to-K8s-Migration: Migrate smallem from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320048 (Klein) Resolved→Open Unfortunately I found out that only one of four tasks has been migrated successfully.
[01:13:54] Wikibugs, Patch-For-Review, User-bd808: Frequent exception while trying to extract anchors from task - https://phabricator.wikimedia.org/T199007 (bd808) {T1177} is a better long term fix than the patch I have proposed as this is the screen scraping that task proposes to replace.
[01:17:55] Grid-Engine-to-K8s-Migration: Migrate smallem from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320048 (Klein) @aborrero, @dcaro, we communicated on the support channel and asked me to open a ticket about my issue so I reopened this. As said there, tried again my 4 tasks...
[01:26:25] Grid-Engine-to-K8s-Migration: Migrate croptool from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319653 (Soda) Open→Resolved This should be done based on the merged PR and the lack of jobs at https://grid-deprecation.toolforge.org/t/croptool. There are still a few...
[02:51:55] Toolforge: Toolforge SSH connection error - https://phabricator.wikimedia.org/T357493 (Philipnelson99)
[02:54:22] Toolforge: Toolforge SSH connection error - https://phabricator.wikimedia.org/T357493 (Philipnelson99)
[03:07:29] Toolforge: Toolforge SSH connection error - https://phabricator.wikimedia.org/T357493 (RoySmith) This is totally fascinating. When I try that command on my Mac desktop, it also fails in the same way. I obviously don't have the .ssh/wmf key file, but that shouldn't cause the timeout... ` % ssh -F /dev/nul...
[03:40:41] Toolforge: Toolforge SSH connection error - https://phabricator.wikimedia.org/T357493 (Philipnelson99) I think the issue was I was using the .com tld and not .org. Really sorry about that. shout out to Jeremy for catching it.
[03:41:37] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:42:58] Toolforge: Toolforge SSH connection error - https://phabricator.wikimedia.org/T357493 (Philipnelson99) Open→Resolved
[03:43:27] Toolforge: Toolforge SSH connection error - https://phabricator.wikimedia.org/T357493 (JJMC89) Resolved→Invalid
[04:00:28] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate Puppet CA: paws-puppetmaster-01.paws.eqiad.wmflabs is about to expire in 27d 17h 58m 23s - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:18:24] (PS2) Eugene233: Campaigns and contributions are added with no users [labs/tools/Isa] - https://gerrit.wikimedia.org/r/954644 (https://phabricator.wikimedia.org/T304090)
[05:38:22] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[05:43:22] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[06:28:27] Cloud-VPS, cloud-services-team (Hardware), SRE, ops-eqiad: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Jclark-ctr) @Andrew following up to see if this has been put back into service?
[06:41:38] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[07:00:28] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate Puppet CA: paws-puppetmaster-01.paws.eqiad.wmflabs is about to expire in 27d 14h 58m 23s - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:43:00] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-68
[07:43:44] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-68
[07:44:12] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[07:53:55] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-24.tools.eqiad1.wikimedia.cloud to the cluster
[07:53:55] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[07:54:02] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-69
[07:54:43] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-69
[07:56:00] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[08:00:28] (InstanceDown) firing: Project tools instance tools-k8s-worker-69 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:05:28] (InstanceDown) resolved: Project tools instance tools-k8s-worker-69 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:05:28] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance tools-k8s-worker-nfs-25 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[08:05:48] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-25.tools.eqiad1.wikimedia.cloud to the cluster
[08:05:48] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[08:06:46] Grid-Engine-to-K8s-Migration: Migrate phetools from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319965 (Xover) >>! In T319965#9536932, @Soda wrote: > Based on looking at the code, and playing around with the buildpacks/the jobs framework, I'm a bit pessimistic that we wi...
[08:07:38] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-70
[08:08:18] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-70
[08:09:04] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[08:10:28] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance tools-k8s-worker-nfs-25 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[08:11:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:14:28] (InstanceDown) firing: Project tools instance tools-k8s-worker-70 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:16:41] (CloudVPSDesignateLeaks) firing: (2) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:19:28] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-69 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:21:20] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-26.tools.eqiad1.wikimedia.cloud to the cluster
[08:21:21] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[08:22:11] Grid-Engine-to-K8s-Migration: Migrate huggle from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319797 (Petrb) @dcaro ok but can we somehow check if this URL https://huggle.toolforge.org/ is grid or k8s? I think grid was this URL - http://tools.wmflabs.org/huggle/ which w...
[08:22:15] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-71
[08:22:54] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-71
[08:23:03] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[08:27:22] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[08:28:14] tool-wdlocator, translatewiki.net: Add wdlocator to translatewiki.net - https://phabricator.wikimedia.org/T357495 (Samwilson)
[08:28:58] (InstanceDown) firing: (2) Project tools instance tools-k8s-worker-70 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:29:45] tool-wdlocator, Technical-Tool-Request: Make a tool to browse Wikidata item geometry on OpenStreetMap - https://phabricator.wikimedia.org/T344222 (Samwilson) Open→Resolved
[08:32:22] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[08:32:45] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-27.tools.eqiad1.wikimedia.cloud to the cluster
[08:32:46] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[08:32:54] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-72
[08:33:30] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'P{O:wmcs::openstack::codfw1dev::virt_ceph}'
[08:33:39] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-72
[08:33:48] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[08:33:58] (InstanceDown) resolved: Project tools instance tools-k8s-worker-71 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:43:14] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-28.tools.eqiad1.wikimedia.cloud to the cluster
[08:43:14] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[08:43:48] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-73
[08:44:30] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-73
[08:44:33] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[08:47:57] (SystemdUnitDown) firing: The service unit systemd-timedated.service is in failed status on host clouddumps1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[08:50:28] (InstanceDown) firing: Project tools instance tools-k8s-worker-73 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:53:06] tool-wdlocator, translatewiki.net, Language-Team (Language-2024-January-March), Localization Infrastructure FY2023-24, Unplanned-Sprint-Work: Add wdlocator to translatewiki.net - https://phabricator.wikimedia.org/T357495 (Nikerabbit) p:Triage→Medium
[08:54:54] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-29.tools.eqiad1.wikimedia.cloud to the cluster
[08:54:54] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[08:55:23] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-74
[08:55:28] (InstanceDown) resolved: Project tools instance tools-k8s-worker-73 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:55:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-k8s-worker-nfs-25 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[08:56:04] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-74
[08:56:57] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:02:56] (SystemdUnitDown) resolved: The service unit systemd-timedated.service is in failed status on host clouddumps1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:07:20] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-30.tools.eqiad1.wikimedia.cloud to the cluster
[09:07:20] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[09:07:56] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[09:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:13:06] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[09:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:13:14] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.refresh_puppet_certs on tools-k8s-worker-nfs-25.tools.eqiad1.wikimedia.cloud
[09:14:29] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on tools-k8s-worker-nfs-25.tools.eqiad1.wikimedia.cloud
[09:20:08] Grid-Engine-to-K8s-Migration: Migrate smallem from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320048 (Klein) As the time of speaking the one task that "never started" has started and is working fine so I believe 2 are successful and 2 fail.
[09:22:50] Grid-Engine-to-K8s-Migration: Migrate huggle from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319797 (dcaro) >>! In T319797#9540970, @Petrb wrote: > @dcaro ok but can we somehow check if this URL https://huggle.toolforge.org/ is grid or k8s? > > I think grid was this UR...
[09:32:58] (PuppetAgentNoResources) resolved: No Puppet resources found on instance tools-k8s-worker-nfs-25 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[09:33:19] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'P{O:wmcs::openstack::codfw1dev::virt_ceph}'
[09:41:38] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:08:49] (PS4) Majavah: openstack: cloudvirt: add support for batch reboots [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1003047
[10:09:28] (CR) David Caro: [C: +1] "LGTM" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1003047 (owner: Majavah)
[10:11:38] (CR) Majavah: [C: +2] openstack: cloudvirt: add support for batch reboots [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1003047 (owner: Majavah)
[10:14:52] (Merged) jenkins-bot: openstack: cloudvirt: add support for batch reboots [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1003047 (owner: Majavah)
[10:15:56] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'P{O:wmcs::openstack::eqiad1::virt_ceph}'
[10:35:47] Toolforge Jobs framework, User-aborrero: toolforge jobs current image aliases - https://phabricator.wikimedia.org/T357388 (dcaro) > Having users specify versions does not magically turn them into diligent maintainers. Agree, though the problem here is not a tool stopping to work (as most tools will stop...
[10:36:54] PROBLEM - Host cloudvirt1031 is DOWN: PING CRITICAL - Packet loss = 100%
[10:37:11] Toolforge Build Service: Build service: Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016 (dcaro) >>! In T356016#9539392, @LucasWerkmeister wrote: > (Note: At the moment T320140 isn’t actually blocked on th...
[10:38:50] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1031 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[10:38:52] RECOVERY - Host cloudvirt1031 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[10:43:38] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'P{O:wmcs::openstack::eqiad1::virt_ceph}'
[10:44:34] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review: [maintain-harbor] Improvements to subcommands and config validation - https://phabricator.wikimedia.org/T353059 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merg...
[10:54:34] (DiskSpace) firing: Disk space cloudbackup1004:9100:/ 5.543% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:55:36] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'P{O:wmcs::openstack::eqiad1::virt_ceph}'
[10:57:39] Grid-Engine-to-K8s-Migration: Migrate bawolff from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319584 (dcaro) I'll add you to the list of tools to 'extend for a bit', let me know if you have issues, need some guidance/advice and such.
[10:58:50] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1031 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[10:59:57] PROBLEM - Host cloudvirt1031 is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:50] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1031 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[11:02:05] RECOVERY - Host cloudvirt1031 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[11:11:12] (PS4) Majavah: openstack: cloudvirt: safe_reboot: Downtime during reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991016 (https://phabricator.wikimedia.org/T347490)
[11:11:13] (CloudVPSDesignateLeaks) resolved: (2) Detected 28 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:11:14] (PS3) Majavah: openstack: cloudvirt: don't restore maintenance aggregate [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991020
[11:11:16] (PS4) Majavah: openstack: cloudvirt: set_maintenance: Abort if already in maintenance [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991021
[11:11:18] (PS4) Majavah: openstack: cloudvirt: set_maintenance: Remove real aggregates [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991022
[11:18:36] (CR) Majavah: openstack: cloudvirt: safe_reboot: Downtime during reboot (2 comments) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991016 (https://phabricator.wikimedia.org/T347490) (owner: Majavah)
[11:19:34] (DiskSpace) resolved: Disk space cloudbackup1004:9100:/ 5.884% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:21:02] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'P{O:wmcs::openstack::eqiad1::virt_ceph}'
[11:21:50] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1031 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[11:29:09] (CR) David Caro: [C: +1] openstack: cloudvirt: safe_reboot: Downtime during reboot (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991016 (https://phabricator.wikimedia.org/T347490) (owner: Majavah)
[11:33:35] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1032.eqiad.wmnet}'
[11:36:20] PROBLEM - Host cloudvirt1032 is DOWN: PING CRITICAL - Packet loss = 100%
[11:38:30] RECOVERY - Host cloudvirt1032 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[11:38:36] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1032.eqiad.wmnet}'
[11:38:50] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1032 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[11:39:51] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1033.eqiad.wmnet}'
[11:41:13] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM." [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991020 (owner: Majavah)
[11:42:15] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM." [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991021 (owner: Majavah)
[11:42:59] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM." [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991022 (owner: Majavah)
[11:52:26] (PS4) Majavah: openstack: cloudvirt: Don't restore maintenance aggregate [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991020
[11:52:28] (PS5) Majavah: openstack: cloudvirt: set_maintenance: Abort if already in maintenance [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991021
[11:52:30] (PS5) Majavah: openstack: cloudvirt: set_maintenance: Remove real aggregates [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991022
[11:57:27] (CR) Majavah: [C: +2] openstack: cloudvirt: Don't restore maintenance aggregate [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991020 (owner: Majavah)
[11:57:32] (CR) Majavah: [C: +2] openstack: cloudvirt: set_maintenance: Abort if already in maintenance [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991021 (owner: Majavah)
[11:57:38] (CR) Majavah: [C: +2] openstack: cloudvirt: set_maintenance: Remove real aggregates [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/991022 (owner: Majavah)
[11:58:50] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1032 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[11:58:54] Grid-Engine-to-K8s-Migration: Migrate robokobot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320014 (dcaro) Hi @Thibaut120094! I'm looking a bit into this. In order to get some logs I ran this: ` tools.robokobot@tools-sgebastion-10:~$ toolforge jobs run --image tool...
[12:01:35] (03Merged) 10jenkins-bot: openstack: cloudvirt: Don't restore maintenance aggregate [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/991020 (owner: 10Majavah) [12:01:37] (03Merged) 10jenkins-bot: openstack: cloudvirt: set_maintenance: Abort if already in maintenance [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/991021 (owner: 10Majavah) [12:01:39] (03Merged) 10jenkins-bot: openstack: cloudvirt: set_maintenance: Remove real aggregates [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/991022 (owner: 10Majavah) [12:06:17] PROBLEM - Host cloudvirt1033 is DOWN: PING CRITICAL - Packet loss = 100% [12:06:38] 10Grid-Engine-to-K8s-Migration: Migrate robokobot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320014 (10dcaro) Okok, that should be fixed now, it was not that issue you were hitting, but a different one that was already fixed, but required us to rebuild the pywikibot im... [12:07:58] (03PS5) 10Majavah: openstack: cloudvirt: safe_reboot: Downtime during reboot [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/991016 (https://phabricator.wikimedia.org/T347490) [12:08:37] RECOVERY - Host cloudvirt1033 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [12:08:48] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1033.eqiad.wmnet}' [12:09:44] (InterfaceSpeedError) firing: brq7425e328-56 on cloudvirt1033:9100 has the wrong speed: 1.25e+06. 
- https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError [12:09:50] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1033 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [12:09:51] 10cloud-services-team: InterfaceSpeedError brq7425e328-56 on cloudvirt1033:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T357521 (10phaultfinder) [12:11:19] 10cloud-services-team: InterfaceSpeedError brq7425e328-56 on cloudvirt1033:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T357521 (10taavi) 05Open→03Resolved a:03taavi [12:13:33] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1034.eqiad.wmnet}' [12:14:44] (InterfaceSpeedError) resolved: brq7425e328-56 on cloudvirt1033:9100 has the wrong speed: 1.25e+06. 
- https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError [12:14:50] (NeutronAgentDown) firing: (2) Neutron neutron-linuxbridge-agent on cloudvirt1032 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [12:17:07] 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE-tools, 10Spicerack, 10Patch-For-Review: Extend "test-cookbook" to support wmcs-cookbooks - https://phabricator.wikimedia.org/T345069 (10taavi) 05Open→03Resolved a:03taavi [12:29:50] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1033 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [12:31:58] PROBLEM - Host cloudvirt1034 is DOWN: PING CRITICAL - Packet loss = 100% [12:34:11] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1034.eqiad.wmnet}' [12:34:34] RECOVERY - Host cloudvirt1034 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [12:35:02] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1035.eqiad.wmnet}' [12:38:20] (NeutronAgentDown) firing: (2) Neutron neutron-linuxbridge-agent on cloudvirt1033 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [12:41:37] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [12:45:48] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge API] Investigate ways to present our multiple Openapi definitions to a future consolidated CLI client - https://phabricator.wikimedia.org/T354745 (aborrero) My concerns are also related to maintainability. In particular, if we solve by hand th... [12:53:20] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1034 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [12:56:20] (PS1) Jforrester: releases: Add jsdoc and jsdoc-wmf-theme [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/1003410 (https://phabricator.wikimedia.org/T357524) [12:57:42] PROBLEM - Host cloudvirt1035 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:44] RECOVERY - Host cloudvirt1035 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [13:00:00] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1035.eqiad.wmnet}' [13:00:41] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1036.eqiad.wmnet}' [13:01:20] (NeutronAgentDown) firing: (2) Neutron neutron-linuxbridge-agent on cloudvirt1034 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:16:42] Toolforge Jobs framework, User-aborrero: toolforge-jobs: fix pkg_resources deprecation warning - https://phabricator.wikimedia.org/T357387 (aborrero)
Open→Invalid I couldn't reproduce the warning in all tool accounts. Example account with the warning: `lang=shell-session tools.smallem@tools-sgeb... [13:20:26] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, Observability-Alerting, and 2 others: [wmcs-cookbooks] Downtime alerts from cloudcumins - https://phabricator.wikimedia.org/T347490 (taavi) a:taavi [13:20:37] PROBLEM - Host cloudvirt1036 is DOWN: PING CRITICAL - Packet loss = 100% [13:21:20] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1035 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:22:28] Toolforge Build Service: Build service: Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016 (dcaro) I think that the issue comes from the procfile buildpack, it's generating a `metadata.toml` file with proces...
[13:22:49] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1036.eqiad.wmnet}' [13:22:50] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1036 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:22:53] RECOVERY - Host cloudvirt1036 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [13:23:07] PROBLEM - ensure kvm processes are running on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:23:12] Grid-Engine-to-K8s-Migration: Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320140 (dcaro) [13:23:37] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1037.eqiad.wmnet}' [13:24:18] Toolforge (Toolforge iteration 05), Toolforge Build Service: Build service: Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016 (dcaro) Open→In progress a:dcaro [13:27:07] RECOVERY - ensure kvm processes are running on cloudvirt1036 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:27:50] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1035 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:34:57] Toolforge (Toolforge iteration 05), Toolforge
Build Service: Build service: Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016 (dcaro) It's actually quite tricky, and upstream might take a bit to fix it:... [13:46:11] PROBLEM - Host cloudvirt1037 is DOWN: PING CRITICAL - Packet loss = 100% [13:48:03] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1037.eqiad.wmnet}' [13:48:09] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1038.eqiad.wmnet}' [13:48:45] RECOVERY - Host cloudvirt1037 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [13:49:50] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1037 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:50:00] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [13:50:07] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [13:54:50] (NeutronAgentDown) firing: (2) Neutron neutron-linuxbridge-agent on cloudvirt1036 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:54:57] Toolforge (Toolforge iteration 05), Toolforge Build Service: Build service: Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016 (dcaro) So, to clarify myself, current behavior is: * If no parameters are p...
[13:55:42] Tool-extjsonuploader: extjsonuploader complains about "Duplicate extension name 'SomeExtension' detected in these files" - https://phabricator.wikimedia.org/T357095 (Tgr) Yeah but that question is already decided somehow. Having a list of duplicates isn't useless - one of the versions is probably outdated a... [13:57:55] Grid-Engine-to-K8s-Migration: Migrate etwikt from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319731 (komla) Archived tool [13:58:25] Grid-Engine-to-K8s-Migration: Migrate etwikt from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319731 (komla) Open→Resolved a:komla [13:58:45] Grid-Engine-to-K8s-Migration: Migrate farticle from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319740 (komla) Archived [13:58:57] Grid-Engine-to-K8s-Migration: Migrate farticle from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319740 (komla) Open→Resolved [13:59:50] Grid-Engine-to-K8s-Migration: Migrate fun from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319758 (komla) Archived [14:00:02] Grid-Engine-to-K8s-Migration: Migrate fun from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319758 (komla) Open→Resolved [14:05:11] PROBLEM - Host cloudvirt1038 is DOWN: PING CRITICAL - Packet loss = 100% [14:06:41] RECOVERY - Host cloudvirt1038 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:06:54] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1038.eqiad.wmnet}' [14:07:23] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1039.eqiad.wmnet}' [14:07:56] (SystemdUnitDown) firing: The service unit designate_floating_ip_ptr_records_updater.service is in failed
status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:09:50] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1037 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [14:09:52] Toolforge (Toolforge iteration 05), Toolforge Build Service: Build service: Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016 (dcaro) @LucasWerkmeister I have updated the docs here https://wikitech.wiki... [14:12:56] (SystemdUnitDown) resolved: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:15:49] Grid-Engine-to-K8s-Migration: Migrate smallem from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320048 (dcaro) When did you create the tasks? They seem quite new (a bit more than 4h) I can see that all are set as `@daily`, that picks a random time, and repeats every day... [14:17:12] Grid-Engine-to-K8s-Migration: Migrate smallem from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320048 (dcaro) We could add that info to the jobs cli also: ` tools.smallem@tools-sgebastion-10:~$ toolforge jobs list -o long /usr/bin/toolforge-jobs:15: DeprecationWarning: p...
[14:23:56] Grid-Engine-to-K8s-Migration: Migrate gutrs from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319782 (komla) Archived [14:24:28] Grid-Engine-to-K8s-Migration: Migrate gutrs from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319782 (komla) Open→Resolved a:komla [14:25:20] Grid-Engine-to-K8s-Migration: Migrate iplookup from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319816 (komla) Archived [14:25:35] Grid-Engine-to-K8s-Migration: Migrate iplookup from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319816 (komla) Open→Resolved a:komla [14:26:32] Grid-Engine-to-K8s-Migration: Migrate historyview from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319790 (komla) Open→Resolved a:komla Archived [14:26:33] PROBLEM - Host cloudvirt1039 is DOWN: PING CRITICAL - Packet loss = 100% [14:28:11] Toolforge (Toolforge iteration 05), Toolforge Build Service: Build service: Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016 (dcaro) [14:28:22] Grid-Engine-to-K8s-Migration: Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320140 (dcaro) [14:28:25] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1039.eqiad.wmnet}' [14:28:34] RECOVERY - Host cloudvirt1039 is UP: PING OK - Packet loss = 0%, RTA = 1.86 ms [14:28:50] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1040.eqiad.wmnet}' [14:29:01] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review, User-aborrero: eqiad1: fix PTR delegations for 185.15.56.0/24 -
https://phabricator.wikimedia.org/T341338 (taavi) With the patch I just deployed the new-style DNS records are live. It seems to work great: `lang=shell-session,li... [14:29:25] Toolforge (Toolforge iteration 05), Toolforge Build Service: Build service: Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016 (dcaro) In progress→Stalled [14:29:50] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1039 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [14:35:58] (PuppetCertificateAboutToExpire) resolved: Puppet CA certificate Puppet CA: paws-puppetmaster-01.paws.eqiad.wmflabs is about to expire in 27d 7h 32m 23s - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:39:38] Grid-Engine-to-K8s-Migration: Migrate khanomalumat from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319843 (komla) Open→Resolved a:komla Archived [14:40:05] Grid-Engine-to-K8s-Migration, urbanecmbot, User-Urbanecm: Migrate urbanecmbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320108 (dcaro) >>! In T320108#9539081, @Urbanecm wrote: > Thanks @dcaro! I was figuring out how to make this work in k8s instead, and it...
[14:46:46] PROBLEM - Host cloudvirt1040 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:42] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1040.eqiad.wmnet}' [14:48:47] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1041.eqiad.wmnet}' [14:48:49] RECOVERY - Host cloudvirt1040 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [14:49:50] (NeutronAgentDown) firing: (2) Neutron neutron-linuxbridge-agent on cloudvirt1039 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:04:50] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1039 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:05:07] PROBLEM - Host cloudvirt1041 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:05] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1041.eqiad.wmnet}' [15:07:10] RECOVERY - Host cloudvirt1041 is UP: PING OK - Packet loss = 0%, RTA = 7.99 ms [15:07:40] PROBLEM - ensure kvm processes are running on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:14:06] (ProbeDown) firing: Service toolserver-proxy-01:443 has failed probes (http_toolserver_org_redirects_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolserver-proxy-01:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:19:06] (ProbeDown) resolved: Service toolserver-proxy-01:443 has failed probes (http_toolserver_org_redirects_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolserver-proxy-01:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:21:09] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [15:21:28] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [15:21:45] RECOVERY - ensure kvm processes are running on cloudvirt1041 is OK: PROCS OK: 2 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:25:26] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1042.eqiad.wmnet}' [15:29:32] Data-Services: Make user_email_authenticated status visible on labs - https://phabricator.wikimedia.org/T70876 (Xaosflux) Any update on this? Being able to query the "emailable" status for an enduser is already publicly available (e.g. https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json...
[15:30:42] Cloud-VPS, cloud-services-team, SRE, ops-eqiad: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644 (aborrero) [15:31:44] cloud-services-team, Infrastructure-Foundations, SRE, netops, User-aborrero: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero) [15:31:52] cloud-services-team, Infrastructure-Foundations, SRE, netops, User-aborrero: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero) Stalled→Open In a 2024-02-14 network sync meeting we decided to continue moving older cloudvirts into the new single NI... [15:40:43] Grid-Engine-to-K8s-Migration: Migrate activity from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319481 (komla) Open→Resolved Archived [15:41:38] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:41:58] Grid-Engine-to-K8s-Migration: Migrate alexabotsi from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319549 (komla) Open→Resolved a:komla Archived [15:49:05] PROBLEM - Host cloudvirt1042 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:57] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1042.eqiad.wmnet}' [15:50:57] RECOVERY - Host cloudvirt1042 is UP: PING OK - Packet loss = 0%, RTA = 6.63 ms [15:51:29] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1043.eqiad.wmnet}' [16:02:26] Grid-Engine-to-K8s-Migration, urbanecmbot, User-Urbanecm: Migrate urbanecmbot from Toolforge GridEngine to Toolforge
Kubernetes - https://phabricator.wikimedia.org/T320108 (Urbanecm) >>! In T320108#9542469, @dcaro wrote: >>>! In T320108#9539081, @Urbanecm wrote: >> Thanks @dcaro! I was figuring out h... [16:06:45] cloud-services-team, Infrastructure-Foundations, netops, User-aborrero: clouddb: evaluate moving them into cloud-private - https://phabricator.wikimedia.org/T357543 (aborrero) [16:07:29] cloud-services-team, Infrastructure-Foundations, netops, User-aborrero: clouddb: evaluate moving them into cloud-private - https://phabricator.wikimedia.org/T357543 (aborrero) p:Triage→Medium [16:07:37] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt1043.eqiad.wmnet}' [16:08:35] Grid-Engine-to-K8s-Migration: Migrate smallem from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320048 (Klein) >>! In T320048#9542281, @dcaro wrote: > When did you create the tasks? They seem quite new (a bit more than 4h) Fairly recently because I made a change to the co... [16:27:57] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1043.eqiad.wmnet}' [16:30:00] PROBLEM - Host cloudvirt1043 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:18] RECOVERY - Host cloudvirt1043 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:32:36] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1043.eqiad.wmnet}' [16:33:25] Grid-Engine-to-K8s-Migration: Migrate smallem from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320048 (dcaro) >>! In T320048#9543097, @Klein wrote: >>>! In T320048#9542281, @dcaro wrote: >> When did you create the tasks? They seem quite new (a bit more than 4h) > Fairly...
[16:33:50] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1043 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [16:53:50] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1043 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [17:06:31] (CR) Merlijn van Deen: [C: +1] "Code looks fine." [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/1003127 (https://phabricator.wikimedia.org/T199007) (owner: BryanDavis) [17:06:45] PAWS: Replace NFS with object storage? - https://phabricator.wikimedia.org/T342106 (rook) For our usecase this likely won't work very well [17:06:55] PAWS: Replace NFS with object storage? - https://phabricator.wikimedia.org/T342106 (rook) Open→Declined [17:08:19] Wikibugs, Patch-For-Review, User-bd808: Frequent exception while trying to extract anchors from task - https://phabricator.wikimedia.org/T199007 (valhallasw) > I also think this may have been broken for 6 years now without too much notice except when folks are staring at the bot's logs for some othe...
[17:11:32] (CR) Merlijn van Deen: [C: +2] ci: bump tested python version to 3.9 [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/1003128 (owner: BryanDavis) [17:12:10] !log fran@wmf3169 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirtlocal1001.eqiad.wmnet}' (T356975) [17:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:13:14] !log fran@wmf3169 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirtlocal1001.eqiad.wmnet}' (T356975) [17:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:25:45] vivian-rook opened https://github.com/toolforge/paws/pull/375 [17:26:15] Grid-Engine-to-K8s-Migration: Migrate addletterboxdfilmidbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357549 (komla) [17:27:33] Grid-Engine-to-K8s-Migration: Migrate aka from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357550 (komla) [17:28:33] Grid-Engine-to-K8s-Migration: Migrate arbclerkbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357551 (komla) [17:30:51] Grid-Engine-to-K8s-Migration: Migrate backup-bot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357553 (komla) [17:32:05] Grid-Engine-to-K8s-Migration: Migrate ganfilter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357554 (komla) [17:34:10] Grid-Engine-to-K8s-Migration: Migrate gergesbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357555 (komla) [17:35:11] Grid-Engine-to-K8s-Migration: Migrate himowd from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357556 (komla) [17:35:51] Grid-Engine-to-K8s-Migration: Migrate hnatsumi-bot from
Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357558 (komla) [17:36:33] Grid-Engine-to-K8s-Migration: Migrate pagecounts from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357559 (komla) [17:37:15] Grid-Engine-to-K8s-Migration: Migrate recoin from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357560 (komla) [17:39:23] Grid-Engine-to-K8s-Migration: Migrate spacemedia from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357561 (komla) [17:40:04] Grid-Engine-to-K8s-Migration: Migrate unblock-zh-status from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357562 (komla) [17:43:10] Grid-Engine-to-K8s-Migration: Migrate updatewikiprojectmovies from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357563 (komla) [17:46:53] Grid-Engine-to-K8s-Migration: Migrate mbh from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319883 (dcaro) Hi @MBH! I was able to start working on some solution (not awesome, but kinda works) in https://github.com/Saisengen/wikibots/pull/1 Tested locally only though, it... [17:48:57] Tool-Global-user-contributions, Stewards-and-global-tools, Temporary accounts, XTools, and 2 others: [Design] Prototype and user testing plan - https://phabricator.wikimedia.org/T356099 (KColeman-WMF) [17:49:28] Grid-Engine-to-K8s-Migration, urbanecmbot, User-Urbanecm: Migrate urbanecmbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320108 (dcaro) > > That's a fairly accurate impression. The issue is that the URL is hardcoded in some places already, so the current U...
[17:50:23] Grid-Engine-to-K8s-Migration: Migrate wikiflix from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357566 (komla) [17:52:50] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirtlocal1001 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [17:58:49] Grid-Engine-to-K8s-Migration: Migrate xiplus from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357567 (komla) [17:59:30] Grid-Engine-to-K8s-Migration: Migrate zhwiki-perm-qualicheck from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357568 (komla) [18:07:20] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirtlocal1002 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [18:12:20] (NeutronAgentDown) firing: (2) Neutron neutron-linuxbridge-agent on cloudvirtlocal1001 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [18:27:20] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirtlocal1003 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [18:37:20] (NeutronAgentDown) firing: (2) Neutron neutron-linuxbridge-agent on cloudvirtlocal1002 is down -
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [18:41:37] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [18:54:20] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirtlocal1003 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [18:58:56] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:13:57] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:32:28] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_grid_node for tools-sgeweblight-10-17, tools-sgeweblight-10-30 [19:32:37] !log andrew@bullseye admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [19:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:34:25] !log andrew@bullseye admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [19:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:34:31] !log andrew@bullseye admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) [19:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:34:49] !log andrew@bullseye admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [19:34:52] !log andrew@bullseye admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) [19:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:36:53] !log andrew@bullseye admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [19:36:56] !log andrew@bullseye admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) [19:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:38:39] !log andrew@bullseye admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [19:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:38:48] !log andrew@bullseye admin END (FAIL) - Cookbook 
wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) [19:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:39:28] (InstanceDown) firing: Project tools instance tools-sgeweblight-10-17 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:40:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1063.eqiad.wmnet}' [19:42:38] PROBLEM - Host cloudvirt1063 is DOWN: PING CRITICAL - Packet loss = 100% [19:43:31] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt1063.eqiad.wmnet}' [19:44:20] RECOVERY - Host cloudvirt1063 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:44:28] (InstanceDown) resolved: Project tools instance tools-sgeweblight-10-17 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:45:22] !log andrew@bullseye admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [19:45:26] !log andrew@bullseye admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) [19:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:45:50] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1063 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [19:48:35] !log andrew@bullseye admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) [19:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:49:21] !log andrew@bullseye admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [19:49:24] !log andrew@bullseye admin 
END (ERROR) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=97) [19:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:49:45] !log andrew@bullseye admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [19:49:47] !log andrew@bullseye admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) [19:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:51:43] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirtXXXX.eqiad.wmnet}' [19:51:50] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=97) on hosts matched by 'D{cloudvirtXXXX.eqiad.wmnet}' [19:51:59] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1063.eqiad.wmnet}' [19:55:18] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1063.eqiad.wmnet}' [19:55:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1067.eqiad.wmnet}' [19:56:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1066.eqiad.wmnet}' [19:59:24] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1067.eqiad.wmnet}' [19:59:53] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1066.eqiad.wmnet}' [20:00:29] !log andrew@cloudcumin1001 admin START - Cookbook 
wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1065.eqiad.wmnet}' [20:00:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1064.eqiad.wmnet}' [20:00:44] (InterfaceSpeedError) firing: brq7425e328-56 on cloudvirt1067:9100 has the wrong speed: 1.25e+06. - https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError [20:00:59] cloud-services-team: InterfaceSpeedError brq7425e328-56 on cloudvirt1067:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T357579 (phaultfinder) [20:03:10] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [20:03:13] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [20:05:44] (InterfaceSpeedError) resolved: brq7425e328-56 on cloudvirt1067:9100 has the wrong speed: 1.25e+06.
- https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError [20:05:50] (NeutronAgentDown) firing: (3) Neutron neutron-linuxbridge-agent on cloudvirt1063 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:07:00] (PS1) Amire80: Remove unnecessary space from the end of a message [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1003518 (https://phabricator.wikimedia.org/T299863) [20:07:17] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr) @dcaro here is an update from dell I found a couple of online articles: What Does Uncorrectable Se... [20:07:57] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1065.eqiad.wmnet}' [20:08:23] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1062.eqiad.wmnet}' [20:09:38] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr) https://www.minitool.com/lib/uncorrectable-sector-count.html https://community.wd.com/t/how-to-interp...
[20:10:50] (NeutronAgentDown) firing: (4) Neutron neutron-linuxbridge-agent on cloudvirt1063 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:11:42] (CloudVPSDesignateLeaks) firing: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:16:19] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1064.eqiad.wmnet}' [20:16:38] (ProbeDown) firing: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:16:42] (CloudVPSDesignateLeaks) firing: (2) Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:24:16] (TektonDown) firing: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown [20:24:20] (ToolforgeKubernetesNodeNotReady) firing: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [20:28:15] 
!log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1062.eqiad.wmnet}' [20:30:47] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1062 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:36:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1060.eqiad.wmnet}' [20:36:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1061.eqiad.wmnet}' [20:39:16] (TektonDown) resolved: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown [20:40:47] (NeutronAgentDown) firing: (2) Neutron neutron-linuxbridge-agent on cloudvirt1062 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:41:23] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt1060.eqiad.wmnet}' [20:42:20] Tool-bub2: Integrate Google Books and Trove to Wikisource - https://phabricator.wikimedia.org/T357582 (Okerekechinweotito) [20:44:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1060.eqiad.wmnet}' [20:44:27] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt1060.eqiad.wmnet}' [20:45:50] (NeutronAgentDown) firing: (3) Neutron neutron-linuxbridge-agent
on cloudvirt1062 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:47:19] (TektonDown) firing: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown [20:49:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1060.eqiad.wmnet}' [20:50:50] (NeutronAgentDown) resolved: (2) Neutron neutron-linuxbridge-agent on cloudvirt1062 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:51:38] Cloud-VPS, cloud-services-team (Hardware), SRE, ops-eqiad: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Andrew) It's back in service but only as of today.
[20:51:42] (CloudVPSDesignateLeaks) firing: (2) Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:56:42] (CloudVPSDesignateLeaks) resolved: (2) Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:57:40] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1061.eqiad.wmnet}' [20:59:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1059.eqiad.wmnet}' [21:05:52] Tool-bub2: Integrate Google Books and Trove to Wikisource - https://phabricator.wikimedia.org/T357582 (theprotonade) Open→In progress [21:08:52] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1060.eqiad.wmnet}' [21:10:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1058.eqiad.wmnet}' [21:17:20] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1060 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [21:21:11] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1059.eqiad.wmnet}' [21:22:20] (NeutronAgentDown) firing: (2) Neutron neutron-linuxbridge-agent on
cloudvirt1059 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [21:23:28] (PuppetAgentFailure) firing: Puppet agent failure detected on instance toolsbeta-test-k8s-haproxy-4 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [21:27:20] (NeutronAgentDown) firing: (3) Neutron neutron-linuxbridge-agent on cloudvirt1059 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [21:28:36] (PS2) BryanDavis: ci: bump tested python version to 3.9 [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/1003128 [21:34:23] (ToolforgeKubernetesNodeNotReady) resolved: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [21:36:09] (CR) BryanDavis: [V: +2] ci: bump tested python version to 3.9 [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/1003128 (owner: BryanDavis) [21:36:13] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1058.eqiad.wmnet}' [21:36:27] (CR) Majavah: [C: +2] "*poke*" [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/1003128 (owner: BryanDavis) [21:36:38] (ProbeDown) firing: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:37:16] (Merged) jenkins-bot: ci: bump tested python version to 3.9 [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/1003128 (owner: BryanDavis) [21:37:19] (TektonDown) resolved: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown [21:37:20] (NeutronAgentDown) firing: (3) Neutron neutron-linuxbridge-agent on cloudvirt1058 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [21:38:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1056.eqiad.wmnet}' [21:38:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1057.eqiad.wmnet}' [21:39:26] (PS3) BryanDavis: wikibugs: Extract XACT to page anchor mappings from data-javelin-init-data [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/1003127 (https://phabricator.wikimedia.org/T199007) [21:42:20] (NeutronAgentDown) firing: (3) Neutron neutron-linuxbridge-agent on cloudvirt1058 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [21:43:28] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance toolsbeta-test-k8s-haproxy-4 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [21:45:16] wikitech.wikimedia.org, Content-Transform-Team-WIP, DiscussionTools, Parsoid-Read-Views (Phase 1 - DiscussionTools support), Patch-For-Review: Use Parsoid for DiscussionTools on
wikitech - https://phabricator.wikimedia.org/T355374 (bd808) [21:53:47] (CR) BryanDavis: "Test fixture updated in PS3." [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/1003127 (https://phabricator.wikimedia.org/T199007) (owner: BryanDavis) [21:57:20] (NeutronAgentDown) resolved: (2) Neutron neutron-linuxbridge-agent on cloudvirt1058 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [21:59:06] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1056.eqiad.wmnet}' [22:00:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1057.eqiad.wmnet}' [22:01:35] (NeutronAgentDown) firing: (2) Neutron neutron-linuxbridge-agent on cloudvirt1057 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [22:21:35] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1057 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [22:48:05] (PS1) Eugene233: Frontend handles revision id as string instead of int [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1003038 (https://phabricator.wikimedia.org/T357592) [22:51:32] (CR) Eugene233: [C: +2] Frontend handles revision id as string instead of int [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1003038 (https://phabricator.wikimedia.org/T357592) (owner: Eugene233) [22:52:34]
(Merged) jenkins-bot: Frontend handles revision id as string instead of int [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1003038 (https://phabricator.wikimedia.org/T357592) (owner: Eugene233) [22:59:28] (CR) Eugene233: [C: +2] Remove unnecessary space from the end of a message [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1003518 (https://phabricator.wikimedia.org/T299863) (owner: Amire80) [23:00:29] (Merged) jenkins-bot: Remove unnecessary space from the end of a message [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1003518 (https://phabricator.wikimedia.org/T299863) (owner: Amire80) [23:24:59] Toolforge Jobs framework, User-aborrero: toolforge jobs current image aliases - https://phabricator.wikimedia.org/T357388 (tstarling) >>! In T357388#9541226, @dcaro wrote: > Would that be acceptable for you? Well, it's not my product, and ultimately I don't get to choose the architectural direction whic... [23:25:26] Grid-Engine-to-K8s-Migration: Migrate adminstats from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319484 (komla) This tool has been disabled from running on the Grid. If you are the maintainer and you want this re-enabled so that you can work on migrating it off the gr... [23:25:28] Grid-Engine-to-K8s-Migration: Migrate aivhelperbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319490 (komla) This tool has been disabled from running on the Grid. If you are the maintainer and you want this re-enabled so that you can work on migrating it off the...