[00:11:32] RESOLVED: ToolsNfsAlmostFull: Toolforge NFS is 0.8556459001503274/1 full - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNfsAlmostFull - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNfsAlmostFull
[00:34:57] RESOLVED: HarborDown: Harbor is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborDown
[00:43:08] (open) raymond-ndibe: d/changelog: bump to 16.1.6 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/82 (https://phabricator.wikimedia.org/T362621)
[00:46:14] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[00:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:53:52] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli
[00:54:02] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[01:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[01:01:32] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli
[01:01:58] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[01:10:35] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli
[01:11:39] (approved) raymond-ndibe: d/changelog: bump to 16.1.6 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/82 (https://phabricator.wikimedia.org/T362621)
[01:11:43] (merge) raymond-ndibe: d/changelog: bump to 16.1.6 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/82 (https://phabricator.wikimedia.org/T362621)
[01:48:52] (approved) raymond-ndibe: cli: Drop support for --canonical [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/70 (https://phabricator.wikimedia.org/T384788) (owner: taavi)
[01:48:58] (merge) raymond-ndibe: cli: Drop support for --canonical [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/70 (https://phabricator.wikimedia.org/T384788) (owner: taavi)
[01:50:13] (update) raymond-ndibe: shell: wrap the shell in a launcher for buildservices [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/61 (https://phabricator.wikimedia.org/T360488) (owner: dcaro)
[05:35:16] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10502876 (Arinaigu) > >>>! In T376267#10499563, @Arinaigu wrote: >> |**Wikitech account/LDAP:**| arinaigum| >> |**SUL account**| AIgumenshcheva-WMF| >> |**Account linked on [[ https://idm....
[05:39:29] Tool-mwcli: Plan command nomenclature - https://phabricator.wikimedia.org/T384898#10502877 (Samwilson)
[07:00:32] Tool-events-impact-report: Toolforge EIR quarter plan: Jan-Mar 2025 - https://phabricator.wikimedia.org/T384862#10502964 (Arinaigu)
[07:20:43] cloud-services-team, Toolforge, Community-Tech, WS Export: Add 'Content-Length' in ws-export HTTP Response - https://phabricator.wikimedia.org/T384803#10502988 (Xover) >>! In T384803#10495669, @Samwilson wrote: > I've not looked into this deeply yet, but it appears it may be related to the Toolfo...
[08:24:59] (open) samwilson: Update PHP dependencies [toolforge-repos/mwcli] - https://gitlab.wikimedia.org/toolforge-repos/mwcli/-/merge_requests/3
[09:37:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-75 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[10:02:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-75 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[10:06:12] Tool-schedule-deployment, serviceops: Extend functionality to support MediaWiki infrastructure Windows and related repos - https://phabricator.wikimedia.org/T385007 (jijiki) NEW
[10:06:25] Tool-schedule-deployment, serviceops: Extend functionality to support MediaWiki infrastructure Windows and related repos - https://phabricator.wikimedia.org/T385007#10503290 (jijiki)
[10:08:00] Tool-schedule-deployment, serviceops: Extend functionality to support MediaWiki infrastructure Windows and related repos - https://phabricator.wikimedia.org/T385007#10503313 (jijiki)
[10:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:53:31] (update) raymond-ndibe: shell: wrap the shell in a launcher for buildservices [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/61 (https://phabricator.wikimedia.org/T360488) (owner: dcaro)
[10:55:57] (update) raymond-ndibe: shell: wrap the shell in a launcher for buildservices [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/61 (https://phabricator.wikimedia.org/T360488) (owner: dcaro)
[10:56:45] (update) raymond-ndibe: shell: wrap the shell in a launcher for buildservices [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/61 (https://phabricator.wikimedia.org/T360488) (owner: dcaro)
[11:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:58:01] (open) raymond-ndibe: [components-cli] no-op commit just to create a dummy release [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/14
[12:12:59] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[12:18:35] (update) raymond-ndibe: [jobs-api] remove wait_for_job from runtime methods [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/138 (https://phabricator.wikimedia.org/T359804)
[12:19:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[12:19:33] (update) raymond-ndibe: [jobs-api] remove wait_for_job from runtime methods [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/138 (https://phabricator.wikimedia.org/T359804)
[12:24:11] cloud-services-team, Toolforge, Tools: Flickr blocking image requests from Toolforge k8s, breaking multiple tools - https://phabricator.wikimedia.org/T384468#10503805 (Andrew) ` Hi, Our engineers confirmed that this should be resolved. Can you let me know how it is looking on your end? Warm regar...
[12:24:49] cloud-services-team, Toolforge, Tools: Flickr blocking image requests from Toolforge k8s, breaking multiple tools - https://phabricator.wikimedia.org/T384468#10503806 (Andrew) @AntiCompositeNumber shall I reply to Tara that things are resolved on our end? Or do you have remaining concerns?
[12:33:15] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356#10503832 (aborrero) >>! In T382356#10501284, @aborrero wrote: > reminder: verify VLAN trunk on the NIC of the cloudgw servers....
[12:48:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[12:49:28] (update) raymond-ndibe: [jobs-api] create seperate api.py and move flask things there [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804)
[12:49:34] (update) raymond-ndibe: [jobs-api] create seperate api.py and move flask things there [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804)
[12:49:42] (update) raymond-ndibe: [jobs-api] create seperate api.py and move flask things there [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804)
[13:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:58:54] cloud-services-team, Cloud-VPS, SRE Observability (FY2024/2025-Q3): Remove librenms -> graphite integration, replace with gnmi - https://phabricator.wikimedia.org/T372457#10504240 (cmooney) >>! In T372457#10501597, @cmooney wrote: > A downside of this is we don't get the built-in "target" tag anymore...
[14:18:57] (CR) Andrew Bogott: [V:+2 C:+2] launch-instance-model: support default network ID in the network panel [openstack/horizon/horizon] (2024.1) - https://gerrit.wikimedia.org/r/1110465 (https://phabricator.wikimedia.org/T380081) (owner: Andrew Bogott)
[14:19:23] (CR) Andrew Bogott: [V:+2 C:+2] Re-enable the network panel for instance creation [openstack/horizon/horizon] (2024.1) - https://gerrit.wikimedia.org/r/1110464 (https://phabricator.wikimedia.org/T380081) (owner: Andrew Bogott)
[14:27:32] cloud-services-team, Data-Services (Quota-requests): User has exceeded the 'max_user_connections' (10) on Toolforge DB replicas - https://phabricator.wikimedia.org/T384119#10504438 (fnegri) Open→Declined I'm closing as "Declined", I think the current value `max_user_connections=10` is big eno...
[14:29:18] cloud-services-team: KernelErrors Server cloudcephosd1021 logged kernel errors - https://phabricator.wikimedia.org/T384971#10504461 (fnegri) Open→Resolved a:fnegri The server is being reimaged and throwing some expected errors: ` fnegri@cloudcephosd1021:~$ sudo journalctl -k -perr --boot all...
[14:33:25] cloud-services-team, DC-Ops, ops-eqiad, SRE: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805#10504480 (Papaul) @Andrew anything dc-ops need to do on this task?
[14:40:42] cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T384968#10504618 (fnegri) Open→Resolved a:fnegri cloudcephosd1022 and cloudcephosd1023 were rebooted in {T348643} and logged some kernel errors. The `FW version command failed` is a new one I have not seen before. Judging...
[14:40:55] cloud-services-team: KernelErrors Server cloudcephosd1022 logged kernel errors - https://phabricator.wikimedia.org/T384953#10504639 (fnegri) →Duplicate dup:T384968
[14:40:59] cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T384968#10504641 (fnegri)
[14:41:19] cloud-services-team: NodeDown Node cloudcephosd1023 has been down for long. - https://phabricator.wikimedia.org/T384955#10504647 (fnegri) Open→Resolved a:fnegri Server was temporarily shut down in {T348643}. It's now back up.
[14:44:52] Striker: [toolsadmin] Striker cannot create Developer accounts or tools with names matching existing SUL accounts - https://phabricator.wikimedia.org/T380384#10504695 (RoyZuo) Now I wanna sign up. I wanna stick to my username. So if I follow the workaround, create an account "RoyZuo" at https://idm.wikimedi...
[14:45:33] cloud-services-team, DC-Ops, ops-eqiad, SRE: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805#10504734 (Andrew) >>! In T380805#10504480, @Papaul wrote: > @Andrew anything dc-ops need to do on this task? Not immediately! Valerie has already moved and set up two of them, we...
[15:14:38] (merge) samwilson: Update PHP dependencies [toolforge-repos/mwcli] - https://gitlab.wikimedia.org/toolforge-repos/mwcli/-/merge_requests/3
[15:16:44] PAWS: Missing notebooks for an account - https://phabricator.wikimedia.org/T385048 (Isaac) NEW
[15:40:50] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Patch-For-Review: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356#10505041 (fnegri) a:fnegri→None
[15:41:36] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Patch-For-Review: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356#10505042 (fnegri) a:aborrero
[15:44:34] cloud-services-team, Cloud-VPS, IPv6, Patch-For-Review: openstack: network problems when introducing new networks - https://phabricator.wikimedia.org/T380728#10505048 (aborrero) today @cmooney reported this was maybe caused by some inconsistency on the edge routing configuration for cloudsw devices.
[16:16:05] Striker: [toolsadmin] Striker cannot create Developer accounts or tools with names matching existing SUL accounts - https://phabricator.wikimedia.org/T380384#10505206 (bd808) >>! In T380384#10504695, @RoyZuo wrote: > Now I wanna sign up. > > I wanna stick to my username. So if I follow the workaround, creat...
[17:30:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.reboot_node on hosts matched by 'D{cloudcontrol1005.eqiad.wmnet}' (T384946) [17:33:04] PROBLEM - Host cloudcontrol1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:33:10] FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [17:33:22] FIRING: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:35:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.reboot_node (exit_code=0) on hosts matched by 'D{cloudcontrol1005.eqiad.wmnet}' (T384946) [17:35:32] RECOVERY - Host cloudcontrol1005 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:35:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.reboot_node on hosts matched by 'D{cloudcontrol1006.eqiad.wmnet}' (T384946) [17:36:56] FIRING: SystemdUnitDown: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:37:07] FIRING: [2x] KernelErrors: Server cloudcontrol1005 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [17:37:16] 06cloud-services-team: KernelErrors Server cloudcontrol1005 logged kernel errors - https://phabricator.wikimedia.org/T385074 (10phaultfinder) 03NEW [17:37:54] PROBLEM - Host cloudcontrol1006 is DOWN: PING CRITICAL - Packet loss = 100% [17:38:10] FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [17:38:22] FIRING: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:39:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.reboot_node on hosts matched by 'D{cloudcontrol2005-dev.codfw.wmnet}' (T384946) [17:40:58] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.reboot_node (exit_code=0) on hosts matched by 'D{cloudcontrol1006.eqiad.wmnet}' (T384946) [17:41:13] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.reboot_node on hosts matched by 'D{cloudcontrol1007.eqiad.wmnet}' (T384946) [17:41:22] RECOVERY 
- Host cloudcontrol1006 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [17:42:07] FIRING: [2x] KernelErrors: Server cloudcontrol1006 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [17:42:12] 06cloud-services-team: KernelErrors Server cloudcontrol1006 logged kernel errors - https://phabricator.wikimedia.org/T385075 (10phaultfinder) 03NEW [17:43:50] PROBLEM - Host cloudcontrol1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:45:43] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.reboot_node (exit_code=0) on hosts matched by 'D{cloudcontrol2005-dev.codfw.wmnet}' (T384946) [17:45:57] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.reboot_node on hosts matched by 'D{cloudcontrol2004-dev.codfw.wmnet}' (T384946) [17:51:20] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.reboot_node (exit_code=0) on hosts matched by 'D{cloudcontrol1007.eqiad.wmnet}' (T384946) [17:51:20] RECOVERY - Host cloudcontrol1007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [17:51:23] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.reboot_node (exit_code=0) on hosts matched by 'D{cloudcontrol2004-dev.codfw.wmnet}' (T384946) [17:51:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.reboot_node on hosts matched by 'D{cloudcontrol2006-dev.codfw.wmnet}' (T384946) [17:54:31] FIRING: [2x] KernelErrors: Server cloudcontrol1007 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcontrol1007 - 
https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [17:54:36] 06cloud-services-team: KernelErrors Server cloudcontrol1007 logged kernel errors - https://phabricator.wikimedia.org/T385079 (10phaultfinder) 03NEW [17:55:58] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.reboot_node (exit_code=0) on hosts matched by 'D{cloudcontrol2006-dev.codfw.wmnet}' (T384946) [17:58:08] (03open) 10salelya: init [toolforge-repos/multilingual-missing-articles] - 10https://gitlab.wikimedia.org/toolforge-repos/multilingual-missing-articles/-/merge_requests/1 [17:58:22] RESOLVED: [13x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:04:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.reboot_node on hosts matched by 'D{cloudcontrol2009-dev.codfw.wmnet}' (T384946) [18:04:40] RESOLVED: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [18:07:10] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.reboot_node (exit_code=0) on hosts matched by 'D{cloudcontrol2009-dev.codfw.wmnet}' (T384946) [18:14:58] PROBLEM - Host cloudrabbit1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:16:20] RECOVERY - Host cloudrabbit1003 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [18:18:51] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17): [lima-kilo] some containers are not restarting when restarting the VM - https://phabricator.wikimedia.org/T385082 (10fnegri) 03NEW [18:20:06] (03update) 10salelya: 
init [toolforge-repos/multilingual-missing-articles] - 10https://gitlab.wikimedia.org/toolforge-repos/multilingual-missing-articles/-/merge_requests/1 [18:20:08] PROBLEM - Host cloudrabbit1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:21:21] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17): [lima-kilo] some containers are not restarting when restarting the VM - https://phabricator.wikimedia.org/T385082#10505800 (10fnegri) [18:21:35] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17): [lima-kilo] some containers are not restarting when restarting the VM - https://phabricator.wikimedia.org/T385082#10505802 (10fnegri) [18:21:44] RECOVERY - Host cloudrabbit1001 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [18:21:51] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17): [lima-kilo] some containers are not restarting when restarting the VM - https://phabricator.wikimedia.org/T385082#10505803 (10fnegri) [18:22:07] FIRING: [2x] KernelErrors: Server cloudrabbit1001 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudrabbit1001 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [18:22:12] 06cloud-services-team: KernelErrors Server cloudrabbit1001 logged kernel errors - https://phabricator.wikimedia.org/T385083 (10phaultfinder) 03NEW [18:23:36] PROBLEM - Host cloudrabbit1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:24:02] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1067.eqiad.wmnet}' (T384946) [18:24:19] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1066.eqiad.wmnet}' (T384946) [18:25:02] RECOVERY - Host cloudrabbit1002 is UP: PING OK - Packet loss = 0%, 
RTA = 0.37 ms [18:27:07] FIRING: [2x] KernelErrors: Server cloudrabbit1002 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudrabbit1002 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [18:27:16] 06cloud-services-team: KernelErrors Server cloudrabbit1002 logged kernel errors - https://phabricator.wikimedia.org/T385085 (10phaultfinder) 03NEW [18:27:53] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17): [lima-kilo] some containers are not restarting when restarting the VM - https://phabricator.wikimedia.org/T385082#10505854 (10fnegri) [18:28:05] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17): [lima-kilo] some containers are not restarting when restarting the VM - https://phabricator.wikimedia.org/T385082#10505855 (10fnegri) [18:29:06] PROBLEM - Host cloudvirtlocal1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:34] RECOVERY - Host cloudvirtlocal1003 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [18:32:07] FIRING: [2x] KernelErrors: Server cloudvirtlocal1003 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudvirtlocal1003 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [18:32:14] 06cloud-services-team: KernelErrors Server cloudvirtlocal1003 logged kernel errors - https://phabricator.wikimedia.org/T385087 (10phaultfinder) 03NEW [18:35:18] PROBLEM - Host cloudvirtlocal1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:38] RECOVERY - Host cloudvirtlocal1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [18:36:46] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17): [lima-kilo] some containers are not restarting 
when restarting the VM - https://phabricator.wikimedia.org/T385082#10505896 (10fnegri) Restarting all kind containers, haproxy is still having issues connecting to the k8s control pla... [18:39:31] FIRING: [2x] KernelErrors: Server cloudvirtlocal1002 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudvirtlocal1002 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [18:39:41] 06cloud-services-team: KernelErrors Server cloudvirtlocal1002 logged kernel errors - https://phabricator.wikimedia.org/T385088 (10phaultfinder) 03NEW [18:41:32] PROBLEM - Host cloudvirtlocal1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:56] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17): [lima-kilo] some containers are not restarting when restarting the VM - https://phabricator.wikimedia.org/T385082#10505919 (10fnegri) a:03fnegri [18:42:21] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17): [lima-kilo] some containers are not restarting when restarting the VM - https://phabricator.wikimedia.org/T385082#10505922 (10fnegri) 05Open→03In progress p:05Triage→03High [18:42:55] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.29 - https://phabricator.wikimedia.org/T362868#10505926 (10fnegri) 05Open→03In progress [18:42:56] 06cloud-services-team, 10Toolforge: [k8s,infra] Upgrade Toolforge to Uwubernetes (1.30) - https://phabricator.wikimedia.org/T362869#10505928 (10fnegri) [18:43:10] RECOVERY - Host cloudvirtlocal1001 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:44:31] FIRING: [2x] KernelErrors: Server cloudvirtlocal1001 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - 
https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudvirtlocal1001 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [18:44:36] 06cloud-services-team: KernelErrors Server cloudvirtlocal1001 logged kernel errors - https://phabricator.wikimedia.org/T385091 (10phaultfinder) 03NEW [18:45:26] RESOLVED: SystemdUnitDown: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:56:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}' [19:00:23] (03update) 10raymond-ndibe: [toolforge-weld] add custom resources version to k8sclient [repos/cloud/toolforge/toolforge-weld] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/51 (https://phabricator.wikimedia.org/T359650) [19:03:48] PROBLEM - Host cloudservices1006 is DOWN: PING CRITICAL - Packet loss = 100% [19:05:22] RECOVERY - Host cloudservices1006 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [19:06:30] PROBLEM - Check DNS auth via UDP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:06:34] PROBLEM - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:07:20] RECOVERY - Check DNS auth via UDP of 
k8s.svc.tools.eqiad1.wikimedia.cloud on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is OK: DNS OK - 0.030 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:07:24] RECOVERY - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is OK: DNS OK - 0.031 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:08:42] PROBLEM - Host cloudservices1005 is DOWN: PING CRITICAL - Packet loss = 100% [19:09:31] FIRING: [2x] KernelErrors: Server cloudservices1006 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudservices1006 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [19:09:42] 06cloud-services-team: KernelErrors Server cloudservices1006 logged kernel errors - https://phabricator.wikimedia.org/T385094 (10phaultfinder) 03NEW [19:10:10] RECOVERY - Host cloudservices1005 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:11:14] PROBLEM - Check if anycast-healthchecker and all configured threads are running on cloudservices1005 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [19:11:26] PROBLEM - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:11:26] PROBLEM - Check DNS auth via TCP 
of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:11:28] PROBLEM - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:12:07] FIRING: [2x] KernelErrors: Server cloudservices1005 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudservices1005 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [19:12:11] cloud-services-team: KernelErrors Server cloudservices1005 logged kernel errors - https://phabricator.wikimedia.org/T385095 (phaultfinder) NEW [19:12:14] RECOVERY - Check if anycast-healthchecker and all configured threads are running on cloudservices1005 is OK: OK: UP (pid=3327) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [19:12:16] RECOVERY - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.072 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:12:16] RECOVERY - Check DNS auth via TCP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.063 seconds response time (login.toolforge.org. 3600 IN CNAME bastion.toolforge.org.)
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:12:18] RECOVERY - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.035 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:28:52] cloud-services-team, Cloud-VPS, SRE Observability (FY2024/2025-Q3): Remove librenms -> graphite integration, replace with gnmi - https://phabricator.wikimedia.org/T372457#10506098 (cmooney) >>! In T372457#10467497, @dcaro wrote: > @cmooney Just noticed that all the `drop` and `discards` metrics there... [19:32:12] FIRING: KernelErrors: Server cloudbackup1002-dev logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [19:33:10] PROBLEM - Host cloudbackup1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:14] PROBLEM - Host cloudbackup1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:56] PROBLEM - Host cloudbackup2003 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:57] PROBLEM - Host cloudbackup2004 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:44] RECOVERY - Host cloudbackup1003 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [19:36:26] RECOVERY - Host cloudbackup1004 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [19:37:07] FIRING: [4x] KernelErrors: Server cloudbackup1003 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [19:37:16] cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T385097 (phaultfinder) NEW [19:37:26] RECOVERY - Host cloudbackup2003 is UP: PING OK - Packet loss = 0%,
RTA = 30.38 ms [19:38:26] RECOVERY - Host cloudbackup2004 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms [19:38:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [19:42:01] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}' [19:42:58] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [19:43:06] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [19:43:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}' [19:44:37] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}' [19:49:17] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt1066.eqiad.wmnet}' (T384946) [19:49:32] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt1067.eqiad.wmnet}' (T384946) [19:51:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:54:20] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [19:54:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1067.eqiad.wmnet}' (T384946) [19:56:35] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [19:57:27] !log andrew@cloudcumin1001 admin END 
(PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [20:03:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}' [20:04:13] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}' [20:05:16] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) [20:07:15] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [20:10:20] PROBLEM - Host cloudcephosd1023 is DOWN: PING CRITICAL - Packet loss = 100% [20:11:20] RECOVERY - Host cloudcephosd1023 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [20:14:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [20:14:11] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:15:43] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T385100 (Danners430) NEW [20:24:12] (open) raymond-ndibe: [jobs-api] use pydantic for core job model [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/139 (https://phabricator.wikimedia.org/T359804) [20:29:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [20:30:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in
warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [20:32:12] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [20:32:44] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}' [20:37:54] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}' [20:38:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}' [20:39:53] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}' [20:42:40] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}' [20:43:40] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}' [20:46:38] Striker: [toolsadmin] Striker cannot create Developer accounts or tools with names matching existing SUL accounts - https://phabricator.wikimedia.org/T380384#10506288 (RoyZuo) Thank you very much for the explanations! I will try it out after this weekend. Unfortunately ran out of time for now.
[20:51:58] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt2005-dev.codfw.wmnet}' [20:54:53] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt2005-dev.codfw.wmnet}' [20:59:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-45 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [21:09:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-45 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [21:20:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:27:07] FIRING: [2x] KernelErrors: Server cloudbackup1002-dev logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [21:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - 
https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:32:07] RESOLVED: KernelErrors: Server cloudcephosd1021 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1021 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [22:13:54] PAWS: Missing notebooks for an account - https://phabricator.wikimedia.org/T385048#10506635 (Isaac) just adding that the user in question confirmed that she does not remember deleting anything herself [22:42:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [22:42:06] cloud-services-team: NovafullstackSustainedFailures Novafullstack tests have been failing for more than 5hours in eqiad - https://phabricator.wikimedia.org/T385123 (phaultfinder) NEW [22:46:43] cloud-services-team, Toolforge, Lingua-Libre: Clean up Toolforge tools.lingua-libre ? - https://phabricator.wikimedia.org/T385124 (Yug) NEW [22:47:38] cloud-services-team, Toolforge, Lingua-Libre: Clean up Toolforge directory tools.lingua-libre ? - https://phabricator.wikimedia.org/T385124#10506775 (Yug) [22:48:32] cloud-services-team, Toolforge, Lingua-Libre: Clean up Toolforge directory tools.lingua-libre ? - https://phabricator.wikimedia.org/T385124#10506776 (Yug) [22:50:47] cloud-services-team, Toolforge, Lingua-Libre: Clean up Toolforge directory tools.lingua-libre ? - https://phabricator.wikimedia.org/T385124#10506783 (bd808) The `$HOME/replica.my.cnf` file is infrastructure maintained by Toolforge.
If it is removed a monitoring job will notice and recreate it. Is the... [23:45:59] (update) raymond-ndibe: [jobs-api] custom resource definition deployment templates [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/101 (https://phabricator.wikimedia.org/T359650) [23:49:30] (update) raymond-ndibe: [jobs-api] use pydantic for core job model [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/139 (https://phabricator.wikimedia.org/T359804) [23:49:56] FIRING: SystemdUnitDown: The service unit rabbitmq_detect_partition.service is in failed status on host cloudrabbit1002. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1002 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:53:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [23:54:56] RESOLVED: SystemdUnitDown: The service unit rabbitmq_detect_partition.service is in failed status on host cloudrabbit1002. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1002 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:59:15] Cloud-Services, Lingua-Libre: Migrate from WMFR-OVH server to WMF Toolforge or WMF Cloud VPS ? - https://phabricator.wikimedia.org/T385064#10506910 (Yug) The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and...