[00:04:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[00:10:56] FIRING: SystemdUnitDown: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:15:56] FIRING: [2x] SystemdUnitDown: The service unit purge_vm_backup.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:44:52] (PS1) Bovimacoco: T388189 BE - Curate accepted language codes Bug= T388189 [labs/tools/wdaudiolex-be] - https://gerrit.wikimedia.org/r/1133584
[01:06:03] (PS1) Bovimacoco: T388191 BE - search for lexemes forms matching word Bug :T388191 [labs/tools/wdaudiolex-be] - https://gerrit.wikimedia.org/r/1133586
[01:16:14] (update) raymond-ndibe: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (https://phabricator.wikimedia.org/T359650)
[01:28:50] (PS1) Bovimacoco: T386329 = Remove app pycache files from git bug: T386329 [labs/tools/wdaudiolex-be] - https://gerrit.wikimedia.org/r/1133587
[01:33:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[01:35:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[01:40:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[01:41:47] PROBLEM - Host cloudbackup1003 is DOWN: PING CRITICAL - Packet loss = 100%
[01:43:25] RECOVERY - Host cloudbackup1003 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[01:45:56] FIRING: [2x] SystemdUnitDown: The service unit purge_vm_backup.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:56:11] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:05:56] FIRING: SystemdUnitDown: The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[02:06:05] cloud-services-team: SystemdUnitDown The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. - https://phabricator.wikimedia.org/T390921 (phaultfinder) NEW
[02:10:56] RESOLVED: SystemdUnitDown: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[02:32:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[02:47:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[02:48:32] Tool-fa-speed, Future-Audiences: [2025 Apr 01-Apr11] Committed sprint work - https://phabricator.wikimedia.org/T390687#10706541 (DLin-WMF) Hi @Aklapper ! Wanted to ask how I might get admin access? I would like to organize Future Audiences' Phab board, and it seems that I need additional permissions. Tha...
[03:11:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[03:25:40] Tool-fa-speed, Future-Audiences: [2025 Apr 01-Apr11] Committed sprint work - https://phabricator.wikimedia.org/T390687#10706568 (Anoop) >>! In T390687#10706541, @DLin-WMF wrote: > Hi @Aklapper ! Wanted to ask how I might get admin access? I would like to organize Future Audiences' Phab board, and it seem...
[03:26:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[03:28:21] (approved) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/46 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:28:23] (update) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/46 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:28:25] (merge) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/46 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:28:32] (update) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/234 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:28:33] (approved) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/234 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:28:36] (merge) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/234 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:29:31] (update) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/69 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:29:37] (approved) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/69 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:29:40] (merge) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/69 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:31:02] (update) group_203_bot_4866fc124f4b41659f667468a6115cf3: maintain-harbor: bump to 0.0.20-20250403032838-697134e5 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/729
[03:31:06] (open) group_203_bot_4866fc124f4b41659f667468a6115cf3: maintain-harbor: bump to 0.0.20-20250403032838-697134e5 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/729
[03:31:19] (open) group_203_bot_4866fc124f4b41659f667468a6115cf3: builds-builder: bump to 0.0.129-20250403032952-133498e3 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/730
[03:32:01] (update) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/62 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:04] (approved) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/62 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:08] (merge) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/62 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:12] (update) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/volume-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/28 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:12] (approved) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/volume-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/28 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:16] (merge) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/volume-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/28 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:26] (update) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/envvars-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/merge_requests/20 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:27] (approved) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/envvars-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/merge_requests/20 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:30] (merge) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/envvars-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/merge_requests/20 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:41] (update) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/151 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:42] (approved) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/151 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:46] (merge) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/151 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:51] (update) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/73 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:51] (approved) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/73 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:32:54] (merge) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/73 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:33:00] (update) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/jobs-emailer] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/19 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:33:01] (approved) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/jobs-emailer] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/19 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:33:05] (merge) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/jobs-emailer] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/19 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:33:10] (update) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/registry-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/22 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:33:11] (approved) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/registry-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/22 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:33:15] (merge) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/registry-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/22 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:33:35] (update) ahecht: Draft: Cache database queries [toolforge-repos/afdstats] - https://gitlab.wikimedia.org/toolforge-repos/afdstats/-/merge_requests/3
[03:34:05] (update) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/26 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:34:07] (approved) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/26 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:34:11] (merge) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/26 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:34:26] (update) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/ingress-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/merge_requests/19 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:34:27] (approved) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/ingress-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/merge_requests/19 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:34:30] (merge) raymond-ndibe: pre-commit: Autoupdate [repos/cloud/toolforge/ingress-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/merge_requests/19 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[03:37:12] (open) group_203_bot_4866fc124f4b41659f667468a6115cf3: jobs-emailer: bump to 0.0.55-20250403033317-d07b1ca6 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/731
[03:37:16] (update) group_203_bot_4866fc124f4b41659f667468a6115cf3: jobs-api: bump to 0.0.361-20250403033301-9e9d8a56 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/732
[03:37:17] (open) group_203_bot_4866fc124f4b41659f667468a6115cf3: jobs-api: bump to 0.0.361-20250403033301-9e9d8a56 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/732
[03:38:44] (open) group_203_bot_4866fc124f4b41659f667468a6115cf3: envvars-admission: bump to 0.0.27-20250403033242-e9f9396b [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/733
[03:39:18] (open) group_203_bot_4866fc124f4b41659f667468a6115cf3: volume-admission: bump to 0.0.65-20250403033228-feb1ec0f [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/734
[03:39:41] (open) group_203_bot_4866fc124f4b41659f667468a6115cf3: registry-admission: bump to 0.0.59-20250403033355-18ac6c34 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/735
[03:56:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[04:01:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[04:06:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[04:21:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[04:47:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[04:52:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[04:53:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[04:57:26] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[06:33:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:35:18] cloud-services-team, Cloud-VPS: Enable use of web proxy for wikipeoplestats.org domain - https://phabricator.wikimedia.org/T390800#10706648 (Danielyepezgarces) DNS changed to the required one, but it constantly redirects to [[ https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_introduction|Help:Clou...
[06:36:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[06:43:41] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:46:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[06:50:07] Tool-fa-speed, Future-Audiences: [2025 Apr 01-Apr11] Committed sprint work - https://phabricator.wikimedia.org/T390687#10706656 (Aklapper) @DLin-WMF: For organizing an existing board, membership in #Trusted-Contributors might be sufficient, so I added you. (Please ask general questions about Phab unrelat...
[06:59:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[07:04:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[07:57:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[08:23:47] Tool-documentation, WMA-Hackathon-2025: [Documentation] Write user docs for A search engine for translations from the English Wiktionary - https://phabricator.wikimedia.org/T390456#10706847 (Mndetatsin) >>! In T390456#10705255, @Erutuon wrote: > Unfortunately this tool is out of date. It is showing t...
[08:33:13] cloud-services-team, Cloud-VPS: Options/thoughts for faster VM provisioning - https://phabricator.wikimedia.org/T390822#10706879 (fgiunchedi) >>! In T390822#10706184, @Andrew wrote: > @fgiunchedi are you already using a puppetless base image? If not, would you like to? I did try the puppetless image how...
[08:43:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[08:46:53] !log fnegri@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-13
[08:47:02] cloud-services-team, Cloud-VPS: Enable use of web proxy for wikipeoplestats.org domain - https://phabricator.wikimedia.org/T390800#10706906 (aborrero) hey @Danielyepezgarces would it be OK if we only configure this for the mentioned domains? ` - 'quote.wikipeoplestats.org' - 'source.wikipeoplestats.org'...
[08:51:07] !log fnegri@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-13
[09:02:18] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:07:18] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:16:45] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283)
[09:21:35] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283)
[09:22:23] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283)
[09:27:52] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283)
[09:28:28] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283)
[09:31:34] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283)
[09:34:40] PROBLEM - toolschecker: Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/nfs/dumps - 280 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[09:35:30] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283)
[09:37:18] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:58:03] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283)
[10:06:31] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283)
[10:11:01] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283)
[10:23:18] (update) aborrero: flavors: move eqiad1 flavors [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/166 (https://phabricator.wikimedia.org/T375283)
[10:24:54] (update) aborrero: flavors: move eqiad1 flavors [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/166 (https://phabricator.wikimedia.org/T375283)
[10:33:42] (update) aborrero: flavors: move eqiad1 flavors [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/166 (https://phabricator.wikimedia.org/T375283)
[10:34:35] (update) aborrero: flavors: move eqiad1 flavors [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/166 (https://phabricator.wikimedia.org/T375283)
[10:35:39] (update) aborrero: flavors: move eqiad1 flavors [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/166 (https://phabricator.wikimedia.org/T375283)
[10:37:24] (update) aborrero: flavors: move eqiad1 flavors [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/166 (https://phabricator.wikimedia.org/T375283)
[10:41:39] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283)
[10:43:48] (update) aborrero: flavors: move eqiad1 flavors [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/166 (https://phabricator.wikimedia.org/T375283)
[12:01:18] cloud-services-team, Data-Services, DBA, Wikidata: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954 (Ladsgroup) NEW
[12:03:09] cloud-services-team, Cloud-VPS: toolschecker looks for a nonexistent file in dumps - https://phabricator.wikimedia.org/T390955 (fnegri) NEW
[12:09:04] cloud-services-team, Data-Services, DBA, Wikidata: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954#10707568 (Marostegui) Yeah, let's wait until the split in production is done, otherwise we have to replicate the entire dataset as well.
[12:11:21] cloud-services-team, Cloud-VPS: toolschecker looks for a nonexistent file in dumps - https://phabricator.wikimedia.org/T390955#10707576 (taavi) It looks like the 20250103 dump for enwiki failed to generate somehow? Can we drop that directory entirely from distribution?
[12:12:14] cloud-services-team, Data-Services, DBA, Wikidata: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954#10707578 (Ladsgroup)
[12:14:35] cloud-services-team, Data-Services, DBA, Wikidata: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954#10707580 (taavi) We can add a CNAME for `x3.{analytics,web}.db.svc.wikimedia.cloud` to the s8 address.. but is that going to be the endpoint users are going to use...
[12:14:50] cloud-services-team, Data-Services, DBA, Wikidata: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954#10707581 (Ladsgroup)
[12:20:12] cloud-services-team, Data-Services, DBA, Wikidata: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954#10707592 (Ladsgroup) >>! In T390954#10707580, @taavi wrote: > We can add a CNAME for `x3.{analytics,web}.db.svc.wikimedia.cloud` to the s8 address.. but is that goi...
[12:21:26] cloud-services-team, Cloud-VPS: Enable use of web proxy for wikipeoplestats.org domain - https://phabricator.wikimedia.org/T390800#10707595 (Danielyepezgarces) It would not fulfill the same function, for example Cloudflare does not support SSL for subdomain wildcards, that is why I generated the certific...
[12:22:42] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/ranker] - https://gitlab.wikimedia.org/toolforge-repos/ranker/-/merge_requests/14
[12:26:29] cloud-services-team, Cloud-VPS: Enable use of web proxy for wikipeoplestats.org domain - https://phabricator.wikimedia.org/T390800#10707639 (aborrero) If we enable this setup the TLS certificate will be managed using acme-chief from within the Cloud VPS proxy setup. I don't think the Cloudflare certifica...
[12:29:17] cloud-services-team, Data-Services, DBA, Wikidata, Patch-For-Review: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954#10707645 (taavi) So what I was trying to ask for was an equivalent to the `wikidatawiki.{analytics,web}.db.svc.wikimedia.cloud` service names...
[12:31:43] cloud-services-team, Data-Services, DBA, Wikidata, Patch-For-Review: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954#10707655 (Marostegui) >>! In T390954#10707645, @taavi wrote: > So what I was trying to ask for was an equivalent to the `wikidatawiki.{analyti...
[12:36:56] cloud-services-team, Data-Services, DBA, Wikidata, Patch-For-Review: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954#10707684 (Ladsgroup) >>! In T390954#10707645, @taavi wrote: > So what I was trying to ask for was an equivalent to the `wikidatawiki.{analytic...
[12:40:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[13:17:47] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[13:24:41] (PS3) Andrew Bogott: upgrade_openstack_node: don't lock tables when backing up [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1133432
[13:24:41] (PS1) Andrew Bogott: setup.py: pin spicerack version [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1133908
[13:29:35]
06cloud-services-team, 10Cloud-VPS: toolschecker looks for a nonexistent file in dumps - https://phabricator.wikimedia.org/T390955#10707921 (10fnegri) Would that be `/srv/dumps/xmldatadumps/public/enwiki/20250103` on both clouddumps1001 and clouddumps1002? Or just one and the other will sync automatically? [13:30:42] 06cloud-services-team, 10Data-Services, 06DBA, 10Wikidata: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954#10707925 (10taavi) `lang=shell-session taavi@tools-bastion-12:~ $ host x3.analytics.db.svc.wikimedia.cloud x3.analytics.db.svc.wikimedia.cloud is an alias for s8.anal... [13:33:37] 06cloud-services-team, 10Data-Services, 06DBA, 10Wikidata: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954#10707949 (10taavi) [13:35:59] 06cloud-services-team, 10Cloud-VPS: toolschecker looks for a nonexistent file in dumps - https://phabricator.wikimedia.org/T390955#10707956 (10fnegri) Or should we delete it somewhere upstream of clouddumps? cc #dumps-generation [13:50:28] (03CR) 10Jelto: "sounds good to me! See Id8979165b96d737addc676f3abf3f088a48eda48." 
[labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [14:03:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:13:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:16:35] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 19): Upgrade "tools" cluster to k8s 1.29.15 - https://phabricator.wikimedia.org/T390214#10708138 (10fnegri) [14:22:43] RECOVERY - toolschecker: Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [14:24:27] 06cloud-services-team, 10Cloud-VPS: toolschecker looks for a nonexistent file in dumps - https://phabricator.wikimedia.org/T390955#10708185 (10fnegri) Following a nice suggestion from @BTullis I moved the dirs in clouddumps100[12] to /root/temp-T390955 so we have a backup in case we need it. Ben expects the di... [14:25:07] 10Tool-fault-tolerance: Low priority: new elastic hosts not showing in web UI - https://phabricator.wikimedia.org/T390902#10708188 (10Ladsgroup) We need to update the records of hosts manually from time to time. The plan is to actually productionize the service soonTM so we can leave the netbox token in the code... 
[14:25:23] 06cloud-services-team, 10Cloud-VPS: toolschecker looks for a nonexistent file in dumps - https://phabricator.wikimedia.org/T390955#10708196 (10fnegri) 05Open→03Resolved a:03fnegri And the alert is gone! I'll mark this as Resolved. [14:44:48] (03PS2) 10Andrew Bogott: setup.py: pin spicerack version [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1133908 [14:44:48] (03PS4) 10Andrew Bogott: upgrade_openstack_node: don't lock tables when backing up [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1133432 [14:48:21] 06cloud-services-team, 10Toolforge, 03Wikimedia-Hackathon-2025: [Session] Introducing and exploring Toolforge UI with prospective users - https://phabricator.wikimedia.org/T383149#10708345 (10debt) Sounds great, I've got you scheduled - let me know if it needs to be changed! [14:56:37] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 19): Upgrade "tools" cluster to k8s 1.29.15 - https://phabricator.wikimedia.org/T390214#10708394 (10fnegri) This upgrade is scheduled for Monday, April 7th at 11:00 UTC. [15:04:15] 06cloud-services-team, 10Cloud-VPS: Enable use of web proxy for wikipeoplestats.org domain - https://phabricator.wikimedia.org/T390800#10708452 (10Andrew) a:05aborrero→03Andrew [15:14:54] 06cloud-services-team, 10Cloud-VPS: Enable use of web proxy for wikipeoplestats.org domain - https://phabricator.wikimedia.org/T390800#10708600 (10bd808) >>! In T390800#10707595, @Danielyepezgarces wrote: > It would not fulfill the same function, for example Cloudflare does not support SSL for subdomain wildca... 
[15:49:41] 10Tools, 10Wikidata, 07Security: Blocked Wikidata user sockpuppets are doing automated misconduct with QuickStatements - https://phabricator.wikimedia.org/T386978#10709010 (10taavi) [15:53:30] 10Tools, 10Wikidata, 07Security: Blocked Wikidata user sockpuppets are doing automated misconduct with QuickStatements - https://phabricator.wikimedia.org/T386978#10709051 (10taavi) Following up from my comments from the private duplicate: if the only impact from those users is that they're DoSing the tool i... [16:07:33] (03CR) 10FNegri: [C:03+1] setup.py: pin spicerack version [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1133908 (owner: 10Andrew Bogott) [16:10:56] (03CR) 10FNegri: upgrade_openstack_node: don't lock tables when backing up (031 comment) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1133432 (owner: 10Andrew Bogott) [16:20:32] 10Tools: templatecount - https://phabricator.wikimedia.org/T390963#10709235 (10Pppery) 05Open→03Invalid https://commons.wikimedia.org/wiki/Special:WhatLinksHere?target=Template%3APD-GallicaScan&namespace=&hidelinks=1&limit=50 [16:58:36] (03update) 10fnegri: [jobs-api] move core logic to seperate core module [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804 https://phabricator.wikimedia.org/T359808) (owner: 10raymond-ndibe) [17:29:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [17:40:22] FIRING: [3x] HAProxyBackendUnavailable: HAProxy service trove-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:41:22] FIRING: [2x] HAProxyServiceUnavailable: HAProxy service trove-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [17:41:28] 06cloud-services-team: HAProxyServiceUnavailable - https://phabricator.wikimedia.org/T390877#10709621 (10phaultfinder) [17:44:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [18:15:22] RESOLVED: [3x] HAProxyBackendUnavailable: HAProxy service trove-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:16:22] RESOLVED: [2x] HAProxyServiceUnavailable: HAProxy service trove-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [19:06:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:26:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS 
- https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:26:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudrabbit1001.eqiad.wmnet' (T381499) [19:26:40] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [19:33:38] PROBLEM - Host cloudrabbit1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudrabbit1001.eqiad.wmnet' (T381499) [19:34:48] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [19:35:06] RECOVERY - Host cloudrabbit1001 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [19:35:38] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudrabbit1002.eqiad.wmnet' (T381499) [19:44:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudrabbit1002.eqiad.wmnet' (T381499) [19:44:23] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [19:45:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudrabbit1003.eqiad.wmnet' (T381499) [19:53:28] PROBLEM - Host cloudrabbit1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:54] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudrabbit1003.eqiad.wmnet' (T381499) [19:54:58] RECOVERY - Host cloudrabbit1003 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:55:00] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [19:59:30] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on 
deployment eqiad1 for all services [19:59:39] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99) on deployment eqiad1 for all services [20:00:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services [20:00:49] FIRING: [48x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:11:40] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.restart_openstack (exit_code=97) on deployment eqiad1 for all services [20:11:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services [20:12:15] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.restart_openstack (exit_code=97) on deployment eqiad1 for all services [20:12:27] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services [20:19:02] PROBLEM - nova-compute proc minimum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:19:02] PROBLEM - nova-compute proc minimum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:19:02] PROBLEM - nova-compute proc minimum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:22:52] PROBLEM - nova-compute proc minimum on cloudvirt1054 is 
CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:22:52] PROBLEM - nova-compute proc minimum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:22:52] FIRING: HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [20:22:52] PROBLEM - nova-compute proc minimum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:22:52] PROBLEM - nova-compute proc minimum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:22:52] PROBLEM - nova-compute proc minimum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:22:52] PROBLEM - nova-compute proc minimum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:22:52] FIRING: CephSlowOps: Ceph cluster in eqiad has 486 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [20:41:59] PROBLEM - 
nova-compute proc minimum on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1067 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1066 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] FIRING: CloudVirtDown: Cloudvirt node cloudvirt1063 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CloudVirtDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DCloudVirtDown [20:41:59] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.restart_openstack (exit_code=97) on deployment eqiad1 for all services [20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] FIRING: InstanceDown: Project cloudinfra instance syslog-server-audit02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:41:59] FIRING: [4x] InstanceDown: Project tools instance tools-elastic-4 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:41:59] FIRING: WidespreadInstanceDown: Widespread instances down in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [20:41:59] FIRING: [2x] InstanceDown: Project gitlab-runners instance runner-1023 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:41:59] FIRING: InstanceDown: 
Project toolsbeta instance toolsbeta-puppetdb-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:41:59] FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [20:41:59] FIRING: PawsNFSDown: No paws nfs services running found - https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsNFSDown [20:41:59] PROBLEM - nova-compute proc maximum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc maximum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc maximum on 
cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1062 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1066 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc maximum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1058 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] FIRING: [2x] HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] FIRING: InstanceDown: Project extdist instance extdist-06 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:41:59] FIRING: InstanceDown: Project paws instance paws-nfs-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:41:59] FIRING: InstanceDown: Project cvn instance cvn-apache10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] FIRING: WidespreadInstanceDown: Widespread instances down in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] FIRING: WidespreadInstanceDown: Widespread instances down in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [20:41:59] FIRING: InstanceDown: Project project-proxy instance proxy-03 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] FIRING: WidespreadInstanceDown: Widespread instances down in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 
processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1066 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc minimum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] FIRING: WidespreadInstanceDown: Widespread instances down in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1067 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1066 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] FIRING: InstanceDown: Project metricsinfra instance metricsinfra-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:41:59] FIRING: WidespreadInstanceDown: Widespread instances down in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[20:41:59] FIRING: WidespreadInstanceDown: Widespread instances down in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1056 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] FIRING: CephClusterInError: #page Ceph cluster in eqiad is in error status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInError - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInError
[20:41:59] RESOLVED: CloudVirtDown: Cloudvirt node cloudvirt1063 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CloudVirtDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DCloudVirtDown
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] FIRING: [5x] InstanceDown: Project gitlab-runners instance runner-1022 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:41:59] FIRING: [16x] InstanceDown: Project toolsbeta instance toolsbeta-cumin-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] FIRING: ToolsNFSDown: No tools nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNFSDown
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1061 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] PROBLEM - nova-compute proc maximum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] FIRING: SystemdUnitDown: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1057 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1063 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1054 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1062 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1058 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1064 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1067 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1067 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1066 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1033 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1031 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1041 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1042 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1054 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1043 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1048 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1066 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1045 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1052 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] FIRING: [2x] HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1040 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1039 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1062 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1058 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1059 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] FIRING: [3x] InstanceDown: Project cvn instance cvn-apache10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1050 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1059 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RESOLVED: [48x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1036 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1064 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1063 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc maximum on cloudvirt1032 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:41:59] FIRING: [8x] NeutronAgentDown: Neutron neutron-metadata-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[20:41:59] FIRING: [7x] InstanceDown: Project metricsinfra instance metricsinfra-alertmanager-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:41:59] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:41:59] FIRING: [9x] InstanceDown: Project gitlab-runners instance gitlab-runners-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:41:59] FIRING: [26x] InstanceDown: Project toolsbeta instance toolsbeta-acme-chief-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:41:59] FIRING: [3x] HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[20:41:59] FIRING: ToolsbetaNFSDown: No toolsbeta nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsbetaNFSDown
[20:41:59] FIRING: [48x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[20:41:59] FIRING: [9x] InstanceDown: Project metricsinfra instance metricsinfra-alertmanager-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:41:59] FIRING: [2x] HAProxyServiceUnavailable: HAProxy service radosgw-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[20:41:59] FIRING: [29x] InstanceDown: Project toolsbeta instance toolsbeta-acme-chief-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:41:59] FIRING: [4x] HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[20:42:03] FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[20:42:03] FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[20:42:09] FIRING: [3x] InstanceDown: Project paws instance bastion is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:42:09] RESOLVED: ToolsNFSDown: No tools nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNFSDown
[20:42:09] FIRING: [95x] InstanceDown: Project tools instance abogott-nstesting is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:42:13] FIRING: [14x] InstanceDown: Project cloudinfra instance cloudinfra-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:42:16] FIRING: [4x] InstanceDown: Project project-proxy instance maps-proxy-04 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:43:28] RESOLVED: WidespreadInstanceDown: Widespread instances down in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[20:43:28] RESOLVED: [10x] InstanceDown: Project metricsinfra instance metricsinfra-alertmanager-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:44:50] FIRING: [5x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:45:09] RESOLVED: CephClusterInError: #page Ceph cluster in eqiad is in error status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInError - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInError
[20:45:22] RESOLVED: [2x] HAProxyServiceUnavailable: HAProxy service radosgw-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[20:45:28] RESOLVED: [9x] InstanceDown: Project gitlab-runners instance gitlab-runners-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:45:28] RESOLVED: WidespreadInstanceDown: Widespread instances down in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[20:45:28] RESOLVED: [30x] InstanceDown: Project toolsbeta instance toolsbeta-acme-chief-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:45:59] RECOVERY - nova-compute proc maximum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:45:59] RECOVERY - nova-compute proc minimum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:46:00] RECOVERY - nova-compute proc minimum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:46:00] RECOVERY - nova-compute proc minimum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:46:03] RECOVERY - nova-compute proc maximum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:46:13] RECOVERY - nova-compute proc maximum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:46:22] RESOLVED: [4x] HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[20:46:27] RESOLVED: ToolsbetaNFSDown: No toolsbeta nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsbetaNFSDown
[20:46:28] RESOLVED: InstanceDown: Project extdist instance extdist-06 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:46:28] RESOLVED: [3x] InstanceDown: Project cvn instance cvn-apache10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:46:28] RESOLVED: WidespreadInstanceDown: Widespread instances down in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[20:47:03] RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[20:47:03] RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[20:47:09] RESOLVED: WidespreadInstanceDown: Widespread instances down in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[20:47:09] RESOLVED: WidespreadInstanceDown: Widespread instances down in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[20:47:09] RESOLVED: [3x] InstanceDown: Project paws instance bastion is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:47:09] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 75 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[20:47:12] FIRING: [95x] InstanceDown: Project tools instance abogott-nstesting is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:47:16] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[20:47:20] RESOLVED: [4x] InstanceDown: Project project-proxy instance maps-proxy-04 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:47:23] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[20:47:27] RESOLVED: [14x] InstanceDown: Project cloudinfra instance cloudinfra-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:47:30] RESOLVED: WidespreadInstanceDown: Widespread instances down in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[20:47:34] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[20:47:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 almost out of cpu #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[20:48:19] RESOLVED: [3x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1055 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[20:49:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[20:49:51] FIRING: [5x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:52:07] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-55
[20:52:25] RESOLVED: PawsNFSDown: No paws nfs services running found - https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsNFSDown
[20:52:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 almost out of cpu #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[20:53:23] FIRING: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-55 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[20:54:50] FIRING: [5x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:55:56] RESOLVED: SystemdUnitDown: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[20:57:59] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-55
[20:59:50] RESOLVED: [5x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[21:02:09] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-55 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:03:23] RESOLVED: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-55 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[21:04:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[21:18:32] 10VPS-project-Phabricator, 06collaboration-services: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#10710700 (10Pppery) (Should we have a #vps-project-phabricator -> #collaboration-services bot?
It seems that otherwise tasks here get ignored) [21:19:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-68 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [21:36:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirtlocal1003.eqiad.wmnet' (T381499) [21:36:07] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [21:36:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [21:41:17] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-68 [21:43:10] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirtlocal1003.eqiad.wmnet' (T381499) [21:43:15] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [21:43:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:44:03] FIRING: [18x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - 
https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [21:46:36] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-68 [21:49:03] FIRING: [23x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [21:49:10] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirtlocal1002.eqiad.wmnet' (T381499) [21:49:16] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [21:49:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt2004-dev.codfw.wmnet}' [21:52:43] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-39, tools-k8s-worker-nfs-32, tools-k8s-worker-nfs-70, tools-k8s-worker-nfs-57, tools-k8s-worker-nfs-74 [21:54:03] FIRING: [14x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-24 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [21:54:34] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt2004-dev.codfw.wmnet}' 
[21:56:53] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirtlocal1002.eqiad.wmnet' (T381499) [21:57:03] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [21:59:03] RESOLVED: [14x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-24 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [22:00:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirtlocal1001.eqiad.wmnet' (T381499) [22:02:18] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-14, tools-k8s-worker-nfs-71 [22:07:25] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirtlocal1001.eqiad.wmnet' (T381499) [22:07:31] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [22:09:16] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-39, tools-k8s-worker-nfs-32, tools-k8s-worker-nfs-70, tools-k8s-worker-nfs-57, tools-k8s-worker-nfs-74 [22:11:28] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [22:12:18] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-43 [22:12:59] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for 
tools-k8s-worker-nfs-14, tools-k8s-worker-nfs-71 [22:14:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [22:16:52] !log root@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-33 [22:17:48] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-43 [22:22:10] !log root@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-33 [22:22:27] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-14 [22:23:33] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-14 [22:25:01] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-43 [22:25:51] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-43 [22:26:36] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all nodes [22:34:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-22 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [22:36:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on 
deployment eqiad1 for all services [22:37:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-78 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:43:52] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services [22:48:58] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-76 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:08:58] FIRING: [2x] InstanceDown: Project tools instance tools-k8s-worker-nfs-71 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:13:58] RESOLVED: [2x] InstanceDown: Project tools instance tools-k8s-worker-nfs-71 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:26:58] FIRING: [3x] InstanceDown: Project tools instance tools-k8s-worker-nfs-68 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:27:13] RESOLVED: [3x] InstanceDown: Project tools instance tools-k8s-worker-nfs-68 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:33:13] FIRING: [3x] InstanceDown: Project tools instance tools-k8s-worker-nfs-67 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:36:58] RESOLVED: [3x] InstanceDown: Project tools instance tools-k8s-worker-nfs-67 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:47:23] 10VPS-project-Phabricator, 06collaboration-services: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#10711197 (10Dzahn) The mail config appears to be: ` $mail_config = [ { 'key' => 'wikimedia-smtp',...
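The recurring ToolforgeKubernetesWorkerTooManyDProcesses alerts in this log key off the number of processes stuck in uninterruptible sleep (the "D" state in the `ps` STAT column), which on NFS workers usually means tasks blocked on NFS/IO. As a rough local illustration only (the real alert is evaluated in Prometheus from node metrics, not by a script like this; the function name and the `ps`-based method are assumptions), the condition quoted in the alert text can be checked on a suspect worker with:

```shell
#!/bin/sh
# Count processes in uninterruptible sleep ("D" state).
# grep -c prints 0 but exits non-zero when nothing matches, so
# tolerate that exit code.
count_d_state() {
    ps -eo stat= | grep -c '^D' || true
}

n=$(count_d_state)
# 12 is the threshold quoted in the alert text above
if [ "$n" -ge 12 ]; then
    echo "ALERT: $n processes in D state; check NFS mounts and IO"
fi
```

In the alerting pipeline itself the expression is presumably built on a node-exporter style metric (likely something like `node_procs_blocked`); the interactive `ps` version is only useful for confirming what the alert is reporting while logged in to the node.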