[00:07:54] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services
[00:08:41] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:08:49] FIRING: NeutronAgentDown: Neutron neutron-metadata-agent on cloudnet1006 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[00:10:22] FIRING: [7x] HAProxyBackendUnavailable: HAProxy service glance-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:15:22] RESOLVED: [7x] HAProxyBackendUnavailable: HAProxy service glance-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:19:00] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services
[00:27:11] RESOLVED: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:54:10] FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch -
https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[00:54:22] FIRING: [15x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:59:10] FIRING: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[03:20:40] RESOLVED: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[03:33:17] FIRING: JobUnavailable: Reduced availability for job maintain_dbusers_eqiad in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:34:10] FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[03:34:22] FIRING: [15x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO -
https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[03:38:17] FIRING: [2x] JobUnavailable: Reduced availability for job maintain_dbusers_eqiad in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:39:22] FIRING: [19x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[03:44:22] RESOLVED: [2x] HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1011.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[04:36:47] RESOLVED: [2x] JobUnavailable: Reduced availability for job maintain_dbusers_eqiad in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:54:40] FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[05:41:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services
[05:45:40] RESOLVED: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary -
https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[05:46:52] FIRING: [29x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1011.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[05:48:28] FIRING: InstanceDown: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:51:07] RESOLVED: [21x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1011.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[05:52:13] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services
[05:53:28] RESOLVED: InstanceDown: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:54:22] FIRING: HAProxyServiceUnavailable: HAProxy service designate-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[05:59:22] RESOLVED: HAProxyServiceUnavailable: HAProxy service designate-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[06:08:41] FIRING: CloudVPSDesignateLeaks: Detected 13 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:18:41] RESOLVED: CloudVPSDesignateLeaks: Detected
13 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:50:27] cloud-services-team, Horizon: Page on cloudweb/horizon down - https://phabricator.wikimedia.org/T411470#11427029 (fgiunchedi) I dug into this a little, currently: * the service::catalog entry for `labweb-ssl` is `page: false` because that would page SRE, not WMCS. Proper fix is resolving (by yours truly...
[09:50:29] cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS, VideoCutTool: [alerting] Create alerts for cloud-vps/VideoCutTool app - https://phabricator.wikimedia.org/T409668#11427249 (fnegri) Resolved→In progress
[09:51:36] cloud-services-team, Cloud-VPS: Replace 'download' cloud-vps project after we support per-tool object storage - https://phabricator.wikimedia.org/T367593#11427255 (taavi)
[09:52:11] Cloud-VPS, tools-infrastructure-team: Publish machine-readable version of Cloud VPS IP space - https://phabricator.wikimedia.org/T411590 (taavi) NEW
[09:58:05] cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS, VideoCutTool: [alerting] Create alerts for cloud-vps/VideoCutTool app - https://phabricator.wikimedia.org/T409668#11427290 (fnegri) In progress→Resolved > sure @fnegri can you please update the runbook url with this Done!
[10:10:42] !log taavi@cloudcumin1001 cloudinfra START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'meta'
[10:15:26] Cloud-VPS, tools-infrastructure-team: Publish machine-readable version of Cloud VPS IP space - https://phabricator.wikimedia.org/T411590#11427357 (taavi) a:taavi
[10:16:26] !log taavi@cloudcumin1001 cloudinfra END (PASS) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=0) with prefix 'meta'
[10:35:57] cloud-services-team, Wikimedia Enterprise, Wikimedia Enterprise Volunteer Request: Toolforge no longer has IP-based access to Wikimedia Enterprise - https://phabricator.wikimedia.org/T410994#11427414 (RThomas-WMF) Fixed {F70833355}
[10:36:38] cloud-services-team, Wikimedia Enterprise, Wikimedia Enterprise Volunteer Request: Toolforge no longer has IP-based access to Wikimedia Enterprise - https://phabricator.wikimedia.org/T410994#11427416 (RThomas-WMF) Open→In progress p:Triage→Medium a:RThomas-WMF
[10:39:04] cloud-services-team, Wikimedia Enterprise Volunteer Request, Wikimedia Enterprise (WME Kanban): Toolforge no longer has IP-based access to Wikimedia Enterprise - https://phabricator.wikimedia.org/T410994#11427427 (RThomas-WMF)
[10:46:28] (open) taavi: cloudinfra: New security group for metadata web hosts [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285 (https://phabricator.wikimedia.org/T411590)
[10:46:32] (update) taavi: cloudinfra: New security group for metadata web hosts [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285 (https://phabricator.wikimedia.org/T411590)
[10:47:38] (update) taavi: cloudinfra: New security group for metadata web hosts [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285 (https://phabricator.wikimedia.org/T411590)
[10:48:17] !log
taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285
[10:48:30] (PS1) Majavah: vps: Properly separate commit message header from body [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1214477
[10:48:43] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285
[10:48:59] (update) taavi: cloudinfra: New security group for metadata web hosts [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285 (https://phabricator.wikimedia.org/T411590)
[10:52:23] cloud-services-team (FY2025/26-Q1-Q2), Toolforge: Move all Toolforge alerts to the toolforge/alerts git repo - https://phabricator.wikimedia.org/T410505#11427507 (fnegri) Open→In progress
[10:53:16] cloud-services-team, Horizon: Page on cloudweb/horizon down - https://phabricator.wikimedia.org/T411470#11427511 (taavi)
[10:54:11] (approved) filippo: cloudinfra: New security group for metadata web hosts [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285 (https://phabricator.wikimedia.org/T411590) (owner: taavi)
[10:54:58] (merge) taavi: cloudinfra: New security group for metadata web hosts [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285 (https://phabricator.wikimedia.org/T411590)
[10:55:02] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[10:55:33] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[11:19:55] Cloud-VPS, tools-infrastructure-team, Patch-For-Review: Publish
machine-readable version of Cloud VPS IP space - https://phabricator.wikimedia.org/T411590#11427641 (taavi) Open→Resolved
[11:58:01] Toolforge, tools-infrastructure-team: Publish machine-readable information for Toolforge worker IPs - https://phabricator.wikimedia.org/T411610 (taavi) NEW
[12:21:48] FIRING: PuppetFailure: Puppet has failed on cloudcontrol2010-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:21:59] (CR) FNegri: [C:+1] "Thanks for spotting this!" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1214477 (owner: Majavah)
[12:22:49] (CR) Majavah: [C:+2] vps: Properly separate commit message header from body [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1214477 (owner: Majavah)
[12:26:10] (Merged) jenkins-bot: vps: Properly separate commit message header from body [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1214477 (owner: Majavah)
[12:31:29] Toolforge, tools-infrastructure-team: Publish machine-readable information for Toolforge worker IPs - https://phabricator.wikimedia.org/T411610#11428003 (taavi) Open→Resolved
[12:45:27] cloud-services-team, Patch-For-Review: Audit and standardize on UTC timezone for grafana.wmcloud.org dashboards - https://phabricator.wikimedia.org/T411274#11428076 (taavi) Open→Resolved a:taavi After merging the default settings patch above I went through all grafana.wmcloud.org dashboar...
[13:36:51] (update) miiswom: Proof of concept: Add author form [toolforge-repos/paulina] - https://gitlab.wikimedia.org/toolforge-repos/paulina/-/merge_requests/131
[13:37:59] cloud-services-team, Cloud-VPS: Octavia network public access inconsistency - https://phabricator.wikimedia.org/T411509#11428237 (taavi) p:Triage→Medium
[14:31:48] RESOLVED: PuppetFailure: Puppet has failed on cloudcontrol2010-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:50:17] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.etcd.remove_node_from_hiera (T375217)
[14:50:20] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.etcd.remove_node_from_hiera (exit_code=0) (T375217)
[14:50:22] T375217: Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[14:51:33] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.etcd.remove_node_from_hiera (T375217)
[14:51:38] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.etcd.remove_node_from_hiera (exit_code=0) (T375217)
[14:52:42] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (T375217)
[14:58:50] !log andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (exit_code=99)
[17:10:34] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (T375217)
[17:10:39] T375217: Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[17:17:10] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (exit_code=0)
[17:18:04] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (T375217)
[17:18:08] T375217:
Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[17:23:51] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (exit_code=0)
[17:24:42] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (T375217)
[17:26:38] T375217: Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[17:28:55] (open) fnegri: Import existing NFS and ToolsDB alert rules [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/51 (https://phabricator.wikimedia.org/T410505)
[17:28:58] (update) fnegri: Import existing NFS and ToolsDB alert rules [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/51 (https://phabricator.wikimedia.org/T410505)
[17:29:24] (update) fnegri: Import existing NFS and ToolsDB alert rules [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/51 (https://phabricator.wikimedia.org/T410505)
[17:30:11] (update) fnegri: Import existing NFS and ToolsDB alert rules [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/51 (https://phabricator.wikimedia.org/T410505)
[17:31:42] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (exit_code=99)
[17:32:15] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (T375217)
[17:32:59] T375217: Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[17:39:13] PROBLEM - toolschecker: All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 484 bytes in 0.010
second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[17:39:27] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (exit_code=0)
[18:20:30] (update) fnegri: Import existing NFS and ToolsDB alert rules [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/51 (https://phabricator.wikimedia.org/T410505)
[18:20:31] (update) fnegri: Clean up and adapt imported alerts [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/52 (https://phabricator.wikimedia.org/T410505)
[18:20:32] (open) fnegri: Clean up and adapt imported alerts [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/52 (https://phabricator.wikimedia.org/T410505)
[18:20:38] (update) fnegri: Clean up and adapt imported alerts [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/52 (https://phabricator.wikimedia.org/T410505)
[18:20:39] (update) fnegri: Import existing NFS and ToolsDB alert rules [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/51 (https://phabricator.wikimedia.org/T410505)
[18:35:35] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirtlocal1001']
[18:36:24] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirtlocal1001']
[18:42:51] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T375217)
[18:42:56] T375217: Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[18:43:10] !log
andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=99)
[18:46:20] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T375217)
[18:52:36] (update) fnegri: Clean up and adapt imported alerts [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/52 (https://phabricator.wikimedia.org/T410505)
[19:01:48] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=0)
[19:13:35] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.etcd.add_node_to_cluster
[19:24:25] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.etcd.add_node_to_cluster (exit_code=0)
[19:31:44] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T375217)
[19:31:49] T375217: Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[19:49:14] RECOVERY - toolschecker: All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.387 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[19:49:36] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=0)
[19:52:17] FIRING: [2x] JobUnavailable: Reduced availability for job pdns in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:57:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job pdns in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:24:17] FIRING: JobUnavailable: Reduced availability
for job pdns_rec in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:57:00] cloud-services-team, Toolforge (Quota-requests): Elasticsearch credential request for gutensearch - https://phabricator.wikimedia.org/T411445#11430302 (Ijon) Thank you, @taavi -- and by "my credentials" do you mean the same credentials from replica.my.cnf? Or were other credentials sent to me?
[22:31:30] VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728 (SomeRandomDeveloper) NEW
[22:32:10] VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431025 (SomeRandomDeveloper)
[22:40:47] RESOLVED: [2x] JobUnavailable: Reduced availability for job pdns in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:07:16] VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431117 (Ladsgroup) Well, I can't even ssh into the host to check what's going on 😢
[23:08:36] VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431120 (Dzahn) please leave it for this moment. This is good timing because I wanted to try and extend the disk anyways and basically announce downtime.. then it was already down.
[23:13:23] VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431127 (Ladsgroup) ah okay, I leave it now.
FWIW it's inode: ` ladsgroup@codesearch9:~$ df -i | grep -i srv /dev/sdb 5242880 5242879 1 100% /srv `
[23:20:16] VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431132 (Dzahn) Yes, this is still T411047 and follow-up after we got more quota. (linked from there) shutting instance down to attempt resizing volume .. in progress.
[23:20:59] VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431139 (Dzahn)
[23:21:00] VPS-project-Codesearch, collaboration-services: "error: No space left on device" for codesearch9:/srv - https://phabricator.wikimedia.org/T411047#11431138 (Dzahn)
[23:28:43] VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431154 (Dzahn) Open→Resolved a:Dzahn successfully resized /dev/sda to double its size (80 -> 160GB) in Horizon (possible after we got the project quota) remounted volume and ran `re...
[23:29:55] VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431161 (Dzahn) - shutdown -h now - click "resize volume" in web UI - start instance - volume gets mounted automatically - resize2fs /dev/sda - mount -o remount /dev/sda
[23:30:42] VPS-project-Codesearch, collaboration-services: "error: No space left on device" for codesearch9:/srv - https://phabricator.wikimedia.org/T411047#11431162 (Dzahn) - successfully resized /dev/sda to double its size (80 -> 160GB) in Horizon (possible after we got the project quota) - remounted volume and...
[23:31:06] VPS-project-Codesearch, collaboration-services: "error: No space left on device" for codesearch9:/srv - https://phabricator.wikimedia.org/T411047#11431165 (Dzahn) Open→Resolved a:Dzahn ` df -i /srv/ Filesystem Inodes IUsed IFree IUse% Mounted on /dev/sda 10485760 5253986...
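
Editor's note on the Codesearch entries: the `df -i` output quoted in the log shows /srv with 1 free inode out of 5242880, so the "No space left on device" error came from inode exhaustion rather than from full data blocks. The same check can be sketched in Python as below. This is a minimal illustration that assumes only the standard library's `os.statvfs` (Unix only); the helper name `inode_usage` and the 95% warning threshold are hypothetical and do not come from the log or from any WMCS tooling.

```python
import os


def inode_usage(path: str) -> float:
    """Return the fraction of inodes in use on the filesystem holding `path`.

    Hypothetical helper for illustration; it mirrors the IUse% column of
    `df -i`, computed as (total inodes - free inodes) / total inodes.
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        # Some filesystems do not report an inode total at all.
        return 0.0
    return (st.f_files - st.f_ffree) / st.f_files


if __name__ == "__main__":
    # "/srv" is the mount point from the log; fall back to "/" elsewhere.
    target = "/srv" if os.path.exists("/srv") else "/"
    used = inode_usage(target)
    print(f"{target}: {used:.1%} of inodes in use")
    if used >= 0.95:  # illustrative threshold, not taken from the log
        print("warning: filesystem is close to inode exhaustion")
```

Run against a filesystem in the state shown in the log, the reported fraction would round to 100%, which is the condition that preceded the volume resize.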