[00:07:54] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services
[00:08:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:08:49] <jinxer-wm>	 FIRING: NeutronAgentDown: Neutron neutron-metadata-agent on cloudnet1006 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[00:10:22] <jinxer-wm>	 FIRING: [7x] HAProxyBackendUnavailable: HAProxy service glance-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:15:22] <jinxer-wm>	 RESOLVED: [7x] HAProxyBackendUnavailable: HAProxy service glance-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:19:00] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services
[00:27:11] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:54:10] <jinxer-wm>	 FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[00:54:22] <jinxer-wm>	 FIRING: [15x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:59:10] <jinxer-wm>	 FIRING: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[03:20:40] <jinxer-wm>	 RESOLVED: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[03:33:17] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job maintain_dbusers_eqiad in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:34:10] <jinxer-wm>	 FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[03:34:22] <jinxer-wm>	 FIRING: [15x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[03:38:17] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job maintain_dbusers_eqiad in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:39:22] <jinxer-wm>	 FIRING: [19x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[03:44:22] <jinxer-wm>	 RESOLVED: [2x] HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1011.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[04:36:47] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job maintain_dbusers_eqiad in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:54:40] <jinxer-wm>	 FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[05:41:50] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services
[05:45:40] <jinxer-wm>	 RESOLVED: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[05:46:52] <jinxer-wm>	 FIRING: [29x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1011.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[05:48:28] <wmcs-alerts>	 FIRING: InstanceDown: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:51:07] <jinxer-wm>	 RESOLVED: [21x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1011.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[05:52:13] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services
[05:53:28] <wmcs-alerts>	 RESOLVED: InstanceDown: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:54:22] <jinxer-wm>	 FIRING: HAProxyServiceUnavailable: HAProxy service designate-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[05:59:22] <jinxer-wm>	 RESOLVED: HAProxyServiceUnavailable: HAProxy service designate-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[06:08:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 13 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:18:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 13 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:50:27] <wikibugs>	 cloud-services-team, Horizon: Page on cloudweb/horizon down - https://phabricator.wikimedia.org/T411470#11427029 (fgiunchedi) I dug into this a little, currently:  * the service::catalog entry for `labweb-ssl` is `page: false` because that would page SRE, not WMCS. Proper fix is resolving (by yours truly...
[09:50:29] <wikibugs>	 cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS, VideoCutTool: [alerting] Create alerts for cloud-vps/VideoCutTool app - https://phabricator.wikimedia.org/T409668#11427249 (fnegri) Resolved→In progress
[09:51:36] <wikibugs>	 cloud-services-team, Cloud-VPS: Replace 'download' cloud-vps project after we support per-tool object storage - https://phabricator.wikimedia.org/T367593#11427255 (taavi)
[09:52:11] <wikibugs>	 Cloud-VPS, tools-infrastructure-team: Publish machine-readable version of Cloud VPS IP space - https://phabricator.wikimedia.org/T411590 (taavi) NEW
[09:58:05] <wikibugs>	 cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS, VideoCutTool: [alerting] Create alerts for cloud-vps/VideoCutTool app - https://phabricator.wikimedia.org/T409668#11427290 (fnegri) In progress→Resolved > sure @fnegri can you please update the runbook url with this  Done!
[10:10:42] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 cloudinfra START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'meta'
[10:15:26] <wikibugs>	 Cloud-VPS, tools-infrastructure-team: Publish machine-readable version of Cloud VPS IP space - https://phabricator.wikimedia.org/T411590#11427357 (taavi) a:taavi
[10:16:26] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 cloudinfra END (PASS) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=0) with prefix 'meta'
[10:35:57] <wikibugs>	 cloud-services-team, Wikimedia Enterprise, Wikimedia Enterprise Volunteer Request: Toolforge no longer has IP-based access to Wikimedia Enterprise - https://phabricator.wikimedia.org/T410994#11427414 (RThomas-WMF) Fixed   {F70833355}
[10:36:38] <wikibugs>	 cloud-services-team, Wikimedia Enterprise, Wikimedia Enterprise Volunteer Request: Toolforge no longer has IP-based access to Wikimedia Enterprise - https://phabricator.wikimedia.org/T410994#11427416 (RThomas-WMF) Open→In progress p:Triage→Medium a:RThomas-WMF
[10:39:04] <wikibugs>	 cloud-services-team, Wikimedia Enterprise Volunteer Request, Wikimedia Enterprise (WME Kanban): Toolforge no longer has IP-based access to Wikimedia Enterprise - https://phabricator.wikimedia.org/T410994#11427427 (RThomas-WMF)
[10:46:28] <wikibugs>	 (open) taavi: cloudinfra: New security group for metadata web hosts [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285 (https://phabricator.wikimedia.org/T411590)
[10:46:32] <wikibugs>	 (update) taavi: cloudinfra: New security group for metadata web hosts [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285 (https://phabricator.wikimedia.org/T411590)
[10:47:38] <wikibugs>	 (update) taavi: cloudinfra: New security group for metadata web hosts [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285 (https://phabricator.wikimedia.org/T411590)
[10:48:17] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285
[10:48:30] <wikibugs>	 (PS1) Majavah: vps: Properly separate commit message header from body [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1214477
[10:48:43] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285
[10:48:59] <wikibugs>	 (update) taavi: cloudinfra: New security group for metadata web hosts [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285 (https://phabricator.wikimedia.org/T411590)
[10:52:23] <wikibugs>	 cloud-services-team (FY2025/26-Q1-Q2), Toolforge: Move all Toolforge alerts to the toolforge/alerts git repo - https://phabricator.wikimedia.org/T410505#11427507 (fnegri) Open→In progress
[10:53:16] <wikibugs>	 cloud-services-team, Horizon: Page on cloudweb/horizon down - https://phabricator.wikimedia.org/T411470#11427511 (taavi)
[10:54:11] <wikibugs>	 (approved) filippo: cloudinfra: New security group for metadata web hosts [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285 (https://phabricator.wikimedia.org/T411590) (owner: taavi)
[10:54:58] <wikibugs>	 (merge) taavi: cloudinfra: New security group for metadata web hosts [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/285 (https://phabricator.wikimedia.org/T411590)
[10:55:02] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[10:55:33] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[11:19:55] <wikibugs>	 Cloud-VPS, tools-infrastructure-team, Patch-For-Review: Publish machine-readable version of Cloud VPS IP space - https://phabricator.wikimedia.org/T411590#11427641 (taavi) Open→Resolved
[11:58:01] <wikibugs>	 Toolforge, tools-infrastructure-team: Publish machine-readable information for Toolforge worker IPs - https://phabricator.wikimedia.org/T411610 (taavi) NEW
[12:21:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cloudcontrol2010-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:21:59] <wikibugs>	 (CR) FNegri: [C:+1] "Thanks for spotting this!" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1214477 (owner: Majavah)
[12:22:49] <wikibugs>	 (CR) Majavah: [C:+2] vps: Properly separate commit message header from body [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1214477 (owner: Majavah)
[12:26:10] <wikibugs>	 (Merged) jenkins-bot: vps: Properly separate commit message header from body [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1214477 (owner: Majavah)
[12:31:29] <wikibugs>	 Toolforge, tools-infrastructure-team: Publish machine-readable information for Toolforge worker IPs - https://phabricator.wikimedia.org/T411610#11428003 (taavi) Open→Resolved
[12:45:27] <wikibugs>	 cloud-services-team, Patch-For-Review: Audit and standardize on UTC timezone for grafana.wmcloud.org dashboards - https://phabricator.wikimedia.org/T411274#11428076 (taavi) Open→Resolved a:taavi After merging the default settings patch above I went through all grafana.wmcloud.org dashboar...
[13:36:51] <wikibugs>	 (update) miiswom: Proof of concept: Add author form [toolforge-repos/paulina] - https://gitlab.wikimedia.org/toolforge-repos/paulina/-/merge_requests/131
[13:37:59] <wikibugs>	 cloud-services-team, Cloud-VPS: Octavia network public access inconsistency - https://phabricator.wikimedia.org/T411509#11428237 (taavi) p:Triage→Medium
[14:31:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on cloudcontrol2010-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:50:17] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.etcd.remove_node_from_hiera (T375217)
[14:50:20] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.etcd.remove_node_from_hiera (exit_code=0) (T375217)
[14:50:22] <stashbot>	 T375217: Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[14:51:33] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.etcd.remove_node_from_hiera (T375217)
[14:51:38] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.etcd.remove_node_from_hiera (exit_code=0) (T375217)
[14:52:42] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (T375217)
[14:58:50] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (exit_code=99)
[17:10:34] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (T375217)
[17:10:39] <stashbot>	 T375217: Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[17:17:10] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (exit_code=0)
[17:18:04] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (T375217)
[17:18:08] <stashbot>	 T375217: Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[17:23:51] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (exit_code=0)
[17:24:42] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (T375217)
[17:26:38] <stashbot>	 T375217: Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[17:28:55] <wikibugs>	 (open) fnegri: Import existing NFS and ToolsDB alert rules [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/51 (https://phabricator.wikimedia.org/T410505)
[17:28:58] <wikibugs>	 (update) fnegri: Import existing NFS and ToolsDB alert rules [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/51 (https://phabricator.wikimedia.org/T410505)
[17:29:24] <wikibugs>	 (update) fnegri: Import existing NFS and ToolsDB alert rules [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/51 (https://phabricator.wikimedia.org/T410505)
[17:30:11] <wikibugs>	 (update) fnegri: Import existing NFS and ToolsDB alert rules [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/51 (https://phabricator.wikimedia.org/T410505)
[17:31:42] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (exit_code=99)
[17:32:15] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (T375217)
[17:32:59] <stashbot>	 T375217: Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[17:39:13] <icinga-wm>	 PROBLEM - toolschecker: All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 484 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[17:39:27] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.etcd.depool_and_remove_node (exit_code=0)
[18:20:30] <wikibugs>	 (update) fnegri: Import existing NFS and ToolsDB alert rules [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/51 (https://phabricator.wikimedia.org/T410505)
[18:20:31] <wikibugs>	 (update) fnegri: Clean up and adapt imported alerts [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/52 (https://phabricator.wikimedia.org/T410505)
[18:20:32] <wikibugs>	 (open) fnegri: Clean up and adapt imported alerts [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/52 (https://phabricator.wikimedia.org/T410505)
[18:20:38] <wikibugs>	 (update) fnegri: Clean up and adapt imported alerts [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/52 (https://phabricator.wikimedia.org/T410505)
[18:20:39] <wikibugs>	 (update) fnegri: Import existing NFS and ToolsDB alert rules [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/51 (https://phabricator.wikimedia.org/T410505)
[18:35:35] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirtlocal1001']
[18:36:24] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirtlocal1001']
[18:42:51] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T375217)
[18:42:56] <stashbot>	 T375217: Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[18:43:10] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=99)
[18:46:20] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T375217)
[18:52:36] <wikibugs>	 (update) fnegri: Clean up and adapt imported alerts [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/52 (https://phabricator.wikimedia.org/T410505)
[19:01:48] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=0)
[19:13:35] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.etcd.add_node_to_cluster
[19:24:25] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.etcd.add_node_to_cluster (exit_code=0)
[19:31:44] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T375217)
[19:31:49] <stashbot>	 T375217: Complete upgrading WMCS bare metal hosts to Trixie - https://phabricator.wikimedia.org/T375217
[19:49:14] <icinga-wm>	 RECOVERY - toolschecker: All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.387 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[19:49:36] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=0)
[19:52:17] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job pdns in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:57:17] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job pdns in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:24:17] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pdns_rec in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:57:00] <wikibugs>	 cloud-services-team, Toolforge (Quota-requests): Elasticsearch credential request for gutensearch - https://phabricator.wikimedia.org/T411445#11430302 (Ijon) Thank you, @taavi -- and by "my credentials" do you mean the same credentials from replica.my.cnf? Or were other credentials sent to me?
[22:31:30] <wikibugs>	 VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728 (SomeRandomDeveloper) NEW
[22:32:10] <wikibugs>	 VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431025 (SomeRandomDeveloper)
[22:40:47] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job pdns in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:07:16] <wikibugs>	 VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431117 (Ladsgroup) Well, I can't even ssh into the host to check what's going on 😢
[23:08:36] <wikibugs>	 VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431120 (Dzahn) please leave it for this moment.  This is good timing because I wanted to try and extend the disk anyways and basically announce downtime.. then it was already down.
[23:13:23] <wikibugs>	 VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431127 (Ladsgroup) ah okay, I leave it now. FWIW it's inode: ` ladsgroup@codesearch9:~$ df -i | grep -i srv /dev/sdb       5242880 5242879       1  100% /srv `
[23:20:16] <wikibugs>	 VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431132 (Dzahn) Yes, this is still T411047 and follow-up after we got more quota. (linked from there)  shutting instance down to attempt resizing volume .. in progress.
[23:20:59] <wikibugs>	 VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431139 (Dzahn)
[23:21:00] <wikibugs>	 VPS-project-Codesearch, collaboration-services: "error: No space left on device" for codesearch9:/srv - https://phabricator.wikimedia.org/T411047#11431138 (Dzahn)
[23:28:43] <wikibugs>	 VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431154 (Dzahn) Open→Resolved a:Dzahn successfully resized /dev/sda to double its size (80 -> 160GB) in Horizon (possible after we got the project quota)  remounted volume and ran `re...
[23:29:55] <wikibugs>	 VPS-project-Codesearch: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728#11431161 (Dzahn) - shutdown -h now  - click "resize volume" in web UI - start instance - volume gets mounted automatically - resize2fs /dev/sda - mount -o remount /dev/sda
[23:30:42] <wikibugs>	 VPS-project-Codesearch, collaboration-services: "error: No space left on device" for codesearch9:/srv - https://phabricator.wikimedia.org/T411047#11431162 (Dzahn) - successfully resized /dev/sda to double its size (80 -> 160GB) in Horizon (possible after we got the project quota)  - remounted volume and...
[23:31:06] <wikibugs>	 VPS-project-Codesearch, collaboration-services: "error: No space left on device" for codesearch9:/srv - https://phabricator.wikimedia.org/T411047#11431165 (Dzahn) Open→Resolved a:Dzahn `  df -i /srv/ Filesystem       Inodes   IUsed   IFree IUse% Mounted on /dev/sda       10485760 5253986...