[00:07:22] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[00:07:23] FIRING: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[00:07:27] FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[00:12:23] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[00:12:23] RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[00:12:28] RESOLVED: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[00:21:50] FIRING: TfInfraTestApplyFailed: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:32:53] FIRING: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources
[00:37:53] RESOLVED: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources
[00:40:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:49:11] FIRING: SystemdUnitDown: The systemd unit neutron-openvswitch-agent.service on node cloudvirt1062 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:08:28] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:39:37] cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10089280 (Andrew) Deep in the guts of the heat agent it is trying to get json containing a cert, and...
[02:04:28] cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10089289 (Andrew) Testing by hand, that curl works some of the time and fails some of the time. So ma...
[02:10:51] cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10089290 (Andrew) On the server side, a failed curl looks like this: ` {"message": "Server-side err...
[02:51:19] cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10089302 (Andrew) I didn't do a rabbitmq overhaul. This is silly, but I restarted all the magnum-cond...
[03:03:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:42:28] FIRING: InstanceDown: Project toolsbeta instance toolsbeta-harbor-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:47:28] RESOLVED: InstanceDown: Project toolsbeta instance toolsbeta-harbor-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:49:11] FIRING: SystemdUnitDown: The systemd unit neutron-openvswitch-agent.service on node cloudvirt1062 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:08:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:08:56] FIRING: SystemdUnitDown: The service unit backup_vms.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:13:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:18:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:37:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-harbor-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[05:55:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-2 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[06:18:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[06:57:08] Data-Services, Data-Engineering, DBA: Prepare and check storage layer for cswikivoyage - https://phabricator.wikimedia.org/T370912#10089343 (Count_Count) Can this be completed? cswikivoyage is still not listed in the meta_p.wiki table while events are being sent out. This causes problems for my spamc...
[07:03:56] FIRING: [2x] SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:04:05] cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T373237 (phaultfinder) NEW
[07:21:57] FIRING: [2x] HarborComponentDown: A Harbor component is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[07:28:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:46:57] RESOLVED: [2x] HarborComponentDown: A Harbor component is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[08:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:08:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:13:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:13:39] FIRING: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:18:39] RESOLVED: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:39:28] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:55:44] FIRING: Toolforge Kyverno unknown state: Toolforge Kyverno has unknown state. Kyverno might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_unknown_state - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+unknown+state
[09:59:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:00:43] RESOLVED: Toolforge Kyverno unknown state: Toolforge Kyverno has unknown state. Kyverno might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_unknown_state - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+unknown+state
[10:02:29] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:52:29] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:04:11] FIRING: [2x] SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[11:11:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:16:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:19:33] Tool-stimmberechtigung: Merge Hgzh Github Pull Request to Tool-stimmberechtigung - https://phabricator.wikimedia.org/T373240 (doctaxon) NEW
[11:20:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:23:28] Tool-stimmberechtigung: Solve bug in Tool-stimmberechtigung as reported - https://phabricator.wikimedia.org/T373241 (doctaxon) NEW
[11:23:49] Tool-stimmberechtigung: Solve bug in Tool-stimmberechtigung as reported - https://phabricator.wikimedia.org/T373241#10089408 (doctaxon) a:Count_Count
[11:27:19] Tool-stimmberechtigung: Migrate tool-stimmberechtigung from GitHub to Wikimedia Gitlab - https://phabricator.wikimedia.org/T373242 (doctaxon) NEW
[12:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:55:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:10:28] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:15:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:21:27] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243 (ArthurPSmith) NEW
[13:28:03] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10089437 (LucasWerkmeister) Side note: `tools.db.svc.wikimedia.cloud` is currently an alias for `tools.db.svc.eqiad.wmflabs`, so the legacy hostname in the error message is...
[13:29:06] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10089438 (ArthurPSmith) Ah - thanks for the note on the proper db name. Where should I have been watching to be notified about that?
[13:32:44] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10089439 (ArthurPSmith) I've updated the hostname there, let's see if that helps.
[13:33:39] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10089440 (LucasWerkmeister) >>! In T373243#10089438, @ArthurPSmith wrote: > Ah - thanks for the note on the proper db name. Where should I have been watching to be notified...
[13:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:55:08] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10089441 (ArthurPSmith) Changing the db hostname didn't solve the problem: "Fatal error: Uncaught mysqli_sql_exception: php_network_getaddresses: getaddrinfo for tools.db.s...
[13:57:41] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10089442 (ArthurPSmith) I am subscribed to cloud-announce. Here's at least one message on cloud-announce mentioning the name change - I guess I missed it (though the email...
[14:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:56:58] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373247 (ActivelyDisinterested) NEW
[14:57:43] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373247#10089518 (Pppery) →Duplicate dup:T373233
[14:59:11] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10089520 (Pppery)
[15:04:11] FIRING: [2x] SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:08:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:25:40] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373248 (Pigsonthewing) NEW
[15:27:40] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10089540 (Pppery)
[15:28:10] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373248#10089538 (Pppery) →Duplicate dup:T373233
[15:28:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:57:57] FIRING: [2x] HarborComponentDown: A Harbor component is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[16:07:45] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10089543 (Soda) CropTool has been having similar issues and is unable to connect to mediawiki.org and/or commons.wikimedia.org. See [[https://commons.wikimedia.org/wiki/Com...
[16:07:57] RESOLVED: [2x] HarborComponentDown: A Harbor component is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[16:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:36:57] FIRING: [2x] HarborComponentDown: A Harbor component is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[16:46:57] RESOLVED: [2x] HarborComponentDown: A Harbor component is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[16:59:35] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373249 (Significa_liberdade) NEW
[17:01:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-26 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[17:13:34] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10089560 (JJMC89)
[17:13:35] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373249#10089558 (JJMC89) →Duplicate dup:T373233
[17:16:40] Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250 (Urbanecm) NEW
[17:16:49] Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10089572 (Urbanecm) p:Triage→Unbreak!
[17:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:51:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-26 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[17:51:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-26 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:04:11] FIRING: [2x] SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:08:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:11:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-26 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[19:13:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:06:44] (CR) Lokal Profil: "> Patch Set 1: Code-Review+1" (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064471 (https://phabricator.wikimedia.org/T174633) (owner: Lokal Profil)
[20:15:11] Toolforge: Toolforge job fails to find library installed via aptfile - https://phabricator.wikimedia.org/T373251 (tchin) NEW
[20:16:13] Toolforge: Toolforge job fails to find library installed via aptfile - https://phabricator.wikimedia.org/T373251#10089612 (tchin)
[20:18:38] Toolforge (Toolforge iteration 14): Toolforge Aptfile not producing working copy of `ffmpeg` - https://phabricator.wikimedia.org/T365633#10089614 (tchin) > @tchin can you open a new task with the code/packages that you are seeing issues with? Sure here it is: T373251
[20:21:01] (CR) Lokal Profil: [C:+1] "looks good. Just reacting to `build-php.sh` showing up in the change-set despite no difference in code. did permission flags or similar ch" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065124 (owner: Jean-Frédéric)
[20:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:23:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[22:13:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:04:12] FIRING: [2x] SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[23:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks