[00:01:29] !log taavi@cloudcumin1001 admin Added a new k8s worker tools-k8s-worker-89.tools.eqiad1.wikimedia.cloud to the cluster
[00:01:29] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster
[00:02:27] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster
[00:04:12] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster
[00:04:56] !log taavi@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker role in the tools cluster
[00:05:38] !log taavi@cloudcumin1001 admin Added a new k8s worker tools-k8s-worker-90.tools.eqiad1.wikimedia.cloud to the cluster
[00:05:38] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster
[00:05:50] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster
[00:06:17] !log taavi@cloudcumin1001 admin Added a new k8s worker tools-k8s-worker-91.tools.eqiad1.wikimedia.cloud to the cluster
[00:06:17] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster
[00:07:26] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster
[00:08:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:08:11] !log taavi@cloudcumin1001 tools END (ERROR) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=97) for a worker role in the tools cluster
[00:08:18] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster
[00:13:27] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:15:20] !log taavi@cloudcumin1001 admin Added a new k8s worker tools-k8s-worker-92.tools.eqiad1.wikimedia.cloud to the cluster
[00:15:20] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster
[00:18:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:18:15] !log taavi@cloudcumin1001 admin Added a new k8s worker tools-k8s-worker-93.tools.eqiad1.wikimedia.cloud to the cluster
[00:18:15] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster
[00:18:27] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:21:08] !log taavi@cloudcumin1001 admin Added a new k8s worker tools-k8s-worker-94.tools.eqiad1.wikimedia.cloud to the cluster
[00:21:08] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster
[00:23:27] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:24:00] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[00:28:26] Toolforge: Monitoring and alerting is needed for Kubernetes cluster capacity - https://phabricator.wikimedia.org/T352581 (bd808) Not sure what levels to trigger on, but pending pods is probably one thing we should at least soft alert on. Grafana shows we peaked around 450 pending today which seems like far t...
[00:28:27] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:38:27] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:48:27] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:50:49] Grid-Engine-to-K8s-Migration: Migrate deletion-notification-bot-2 from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T352564 (mdaniels5757) Acknowledging that this is happening. I will have the time to migrate in the next month or so, but probably not before the 14th.
[02:40:19] Grid-Engine-to-K8s-Migration: Migrate mbh from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319883 (MBH) @komla Where could I request a quota increase? And what could you say about point 3? Thanks for the help with point 2.
[02:47:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[02:55:14] Grid-Engine-to-K8s-Migration: Migrate mbh from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319883 (JJMC89) > Where could I request a quota increase? See #toolforge-quota-requests
[03:33:27] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[05:00:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[05:05:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[05:47:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:47:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:09:23] cloud-services-team: Shinken is unavailable (404 - no proxy is configured) - https://phabricator.wikimedia.org/T352594 (valerio.bozzolan) p:Triage→Low
[10:39:13] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster
[10:41:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:50:07] !log taavi@cloudcumin1001 admin Added a new k8s worker tools-k8s-worker-95.tools.eqiad1.wikimedia.cloud to the cluster
[10:50:07] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster
[10:50:41] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster
[11:02:43] !log taavi@cloudcumin1001 admin Added a new k8s worker tools-k8s-worker-96.tools.eqiad1.wikimedia.cloud to the cluster
[11:02:43] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster
[11:06:58] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_grid_node for tools-sgeexec-10-13, tools-sgeweblight-10-20
[11:15:17] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_grid_node for tools-sgeweblight-10-22
[11:18:17] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster
[11:23:03] (InstanceDown) firing: Project tools instance tools-sgeweblight-10-22 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:28:03] (InstanceDown) resolved: Project tools instance tools-sgeweblight-10-22 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:28:48] !log taavi@cloudcumin1001 admin Added a new k8s worker tools-k8s-worker-97.tools.eqiad1.wikimedia.cloud to the cluster
[11:28:48] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster
[11:42:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[11:47:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[11:47:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:15:39] PROBLEM - Host cloudvirt1063 is DOWN: PING CRITICAL - Packet loss = 100%
[12:21:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-96 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:21:03] (InstanceDown) firing: Project project-proxy instance project-proxy-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:21:25] (NodeDown) firing: The node cloudvirt1063 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[12:21:25] (NodeDown) firing: Cloudvirt node cloudvirt1063 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[12:21:30] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T352595 (phaultfinder)
[12:26:38] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T352595 (taavi) ` ------------------------------------------------------------------------------- Record: 6 Date/Time: 12/02/2023 12:14:27 Source: system Severity: Critical Description: CPU 2 has a thermal trip (over-temperature...
[12:28:40] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1063 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[12:30:12] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T352595 (taavi) `lang=shell-session taavi@cloudcontrol1006 ~ $ os server list --all --host cloudvirt1063 +--------------------------------------+-------------------------------+--------+----------------------------------------+----------------...
[12:30:13] RECOVERY - Host cloudvirt1063 is UP: PING OK - Packet loss = 0%, RTA = 2.15 ms
[12:30:25] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[12:31:25] (NodeDown) resolved: The node cloudvirt1063 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[12:31:25] (NodeDown) resolved: Cloudvirt node cloudvirt1063 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[12:35:53] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[12:36:03] (InstanceDown) resolved: Project project-proxy instance project-proxy-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:36:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-96 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:37:03] (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[12:48:40] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1063 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[12:52:03] (WidespreadPuppetAgentFailure) resolved: Widespread puppet agent failures in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[14:41:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[14:44:29] cloud-services-team: Shinken is unavailable (404 - no proxy is configured) - https://phabricator.wikimedia.org/T352594 (bd808) Shinken was removed from Cloud VPS by {T236547}. Our replacement is broadly described at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS.
[14:47:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:45:06] Tool-bub2: Non-alphanumeric titles or authors not getting uploaded to IA - https://phabricator.wikimedia.org/T352580 (wassan.anmol117) Open→Resolved a:wassan.anmol117 This is fixed.
[17:47:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:16:27] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[20:47:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[23:47:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown