[00:08:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [00:18:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [00:26:45] 10Cloud-VPS, 10cloud-services-team: Implement mail aliases for Cloud-VPS projects (@wmcloud.org) - https://phabricator.wikimedia.org/T47828 (10bd808) >>! In T47828#7203503, @taavi wrote: > 2. Add support for username@wmcloud.org forwarding, like username@toolforge.org currently forwards to the LDA... [00:30:33] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [00:31:34] 10Cloud-VPS, 10cloud-services-team: Implement mail aliases for Cloud-VPS projects (@wmcloud.org) - https://phabricator.wikimedia.org/T47828 (10bd808) When poking at {T347512} I realized that we will probably also need to create SPF records for each `.wmcloud.org` subdomain if we go with t... [00:35:34] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [00:40:33] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:55:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [05:00:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [05:33:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [05:53:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [06:03:07] 10Grid-Engine-to-K8s-Migration: Migrate coverage from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319649 (10CodeReviewBot) legoktm merged https://gitlab.wikimedia.org/toolforge-repos/coverage/-/merge_requests/1 Rewrite it in Rust & move to Toolforge jobs [06:55:21] 10Grid-Engine-to-K8s-Migration: Migrate coverage from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319649 (10Legoktm) 05Open→03Resolved ` tools.coverage@tools-sgebastion-10:~$ crontab -l no crontab for tools.coverage tools.coverage@tools-sgebastion-10:~$ toolforge jobs li... 
[08:16:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [08:21:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [09:34:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [09:38:37] (CephSlowOps) firing: Ceph cluster in eqiad has 19 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [09:38:42] 10cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352436 (10phaultfinder) [09:43:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 19 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [09:52:17] !log taavi@cloudcumin1001 cloudinfra START - Cookbook wmcs.vps.remove_user_from_project for user 'jbond' (T352508) [09:52:28] !log taavi@cloudcumin1001 cloudinfra END (PASS) - Cookbook wmcs.vps.remove_user_from_project (exit_code=0) for user 'jbond' (T352508) [09:53:42] (OpenstackAPIResponse) firing: Openstack API 
average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:57:52] (PS1) Majavah: Remove John's root key [labs/private] - https://gerrit.wikimedia.org/r/979303 (https://phabricator.wikimedia.org/T352508)
[09:57:58] (PS1) Majavah: Remove root keys for some former staff [labs/private] - https://gerrit.wikimedia.org/r/979304
[10:14:23] cloud-services-team, Observability-Alerting: Alertmanager Phabricator integration for WMCS alerts is too spammy - https://phabricator.wikimedia.org/T352059 (fgiunchedi) Adding my post-merge comment here too: you can also use group_by in child routes if that's desired, i.e. change grouping based on some c...
[10:19:09] cloud-services-team, Observability-Alerting: Automatically close stale alertmanager created tasks - https://phabricator.wikimedia.org/T352079 (fgiunchedi) Agreed that'd be nice, currently phalerts explicitly does not support that, though maybe we can tackle this feature as part of {T351389} (essentially...
[10:29:10] (CR) Muehlenhoff: [C: +1] "Looks good" [labs/private] - https://gerrit.wikimedia.org/r/979303 (https://phabricator.wikimedia.org/T352508) (owner: Majavah)
[10:29:43] (CR) Muehlenhoff: [C: +1] "Looks good" [labs/private] - https://gerrit.wikimedia.org/r/979304 (owner: Majavah)
[10:30:00] (CR) Majavah: [V: +2 C: +2] Remove John's root key [labs/private] - https://gerrit.wikimedia.org/r/979303 (https://phabricator.wikimedia.org/T352508) (owner: Majavah)
[10:30:07] (CR) Majavah: [V: +2 C: +2] Remove root keys for some former staff [labs/private] - https://gerrit.wikimedia.org/r/979304 (owner: Majavah)
[11:41:50] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (fnegri)
[11:41:54] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri)
[11:43:43] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), Goal: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 (fnegri) In progress→Resolved Both codfw and eqiad are now running Antelope!
[11:49:15] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bookworm
[11:50:40] (NeutronAgentDown) firing: (50) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[11:56:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[11:56:43] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[11:57:56] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[12:01:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[12:34:12] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bookworm completed: - cloudvirt1046 (**WARN**...
[12:43:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[12:43:23] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (fnegri) In progress→Resolved I tried reimaging again and it worked!
[12:45:13] Cloud-VPS: tf-infra-test to tofu - https://phabricator.wikimedia.org/T352528 (rook)
[12:48:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[12:55:18] cloud-services-team (FY2023/2024-Q1-Q2), Goal: Move WMCS off of Icinga and introduce alertmanager - https://phabricator.wikimedia.org/T328502 (taavi) a:taavi
[12:55:56] cloud-services-team (FY2023/2024-Q1-Q2), Goal: Move WMCS off of Icinga and introduce alertmanager - https://phabricator.wikimedia.org/T328502 (taavi)
[12:56:18] cloud-services-team (FY2023/2024-Q1-Q2), Goal: Move WMCS off of Icinga and introduce alertmanager - https://phabricator.wikimedia.org/T328502 (taavi)
[12:57:41] cloud-services-team (FY2023/2024-Q1-Q2), Observability-Alerting, Goal: Move WMCS off of Icinga and introduce alertmanager - https://phabricator.wikimedia.org/T328502 (taavi)
[12:57:43] cloud-services-team (FY2023/2024-Q1-Q2), Observability-Alerting, Goal: Move WMCS off of Icinga and introduce alertmanager - https://phabricator.wikimedia.org/T328502 (taavi)
[13:34:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet -
https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [13:40:42] (NeutronAgentDown) resolved: (50) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:41:10] (NeutronAgentDown) firing: (50) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:46:10] (NeutronAgentDown) resolved: (50) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:46:27] (PrometheusRestarted) firing: Prometheus/cloud restarted: beware monitoring artifacts. 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted [13:47:10] (NeutronAgentDown) firing: (50) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:50:58] (NeutronAgentDown) resolved: (50) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:52:10] (NeutronAgentDown) firing: (50) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:53:42] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [13:54:32] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [13:55:58] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [13:56:27] (PrometheusRestarted) firing: (2) Prometheus/cloud restarted: beware monitoring artifacts. 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted [14:07:46] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:07:46] PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:07:47] PROBLEM - ensure kvm processes are running on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:11:28] (PrometheusRestarted) firing: (2) Prometheus/cloud restarted: beware monitoring artifacts. 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted [14:12:12] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:12:12] RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:13:40] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:18:27] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T351171) [14:18:33] T351171: cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 [14:19:32] PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:19:51] !log fnegri@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=97) (T351171) [14:20:27] !log admin fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T351171) [14:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [14:21:27] (PrometheusRestarted) resolved: (2) Prometheus/cloud restarted: beware monitoring artifacts. 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted [14:24:45] !log admin fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) (T351171) [14:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [14:24:51] T351171: cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 [14:31:32] 10Tool-bub2: Redesign the FAQs page - https://phabricator.wikimedia.org/T340385 (10PMenon-WMF) 05In progress→03Resolved Hi @Aklapper, thanks for the reminder, but a [[ https://github.com/coderwassananmol/BUB2/pull/209 | patch has been merged ]] for this task. So marking this as resolved. [14:31:49] 10Tool-bub2: Redesign the UI to be more minimalistic and cleaner - https://phabricator.wikimedia.org/T340387 (10PMenon-WMF) [14:32:56] 10Tool-bub2: Fix peer dependencies and remove deprecation warnings - https://phabricator.wikimedia.org/T344116 (10PMenon-WMF) 05Open→03Resolved [14:35:14] 10Tool-bub2, 10Outreach-Programs-Projects, 10Outreachy (Round 27): Add search bar in queue - https://phabricator.wikimedia.org/T315134 (10PMenon-WMF) 05Open→03Resolved a:05DO-NOT-CHANGE→03None PR by @Okerekechinweotito [[ https://github.com/coderwassananmol/BUB2/pull/180 | merged ]] [14:35:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:38:48] 10Cloud-VPS: tf-infra-test to tofu - https://phabricator.wikimedia.org/T352528 (10rook) https://github.com/toolforge/tf-infra-test/pull/7 [14:38:57] 10Cloud-VPS: tf-infra-test to tofu - https://phabricator.wikimedia.org/T352528 (10rook) 05Open→03Resolved [14:40:19] 
(HAProxyBackendUnavailable) firing: (2) HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:41:02] 10cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352436 (10taavi) 05Open→03Resolved a:03taavi [14:41:15] 10cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T352323 (10taavi) 05Open→03Resolved a:03taavi [14:41:20] 10cloud-services-team: SystemdUnitDown Unit backup_glance_images.service on node cloudbackup1003 has been down for long. - https://phabricator.wikimedia.org/T352261 (10taavi) 05Open→03Resolved a:03taavi [14:41:28] 10cloud-services-team: SystemdUnitDownForLong - https://phabricator.wikimedia.org/T352185 (10taavi) 05Open→03Resolved a:03taavi [14:41:36] 10cloud-services-team: SystemdUnitDownForLong Unit purge_vm_backup.service on node cloudbackup1003 has been down for long. 
- https://phabricator.wikimedia.org/T352158 (10taavi) 05Open→03Resolved a:03taavi [14:41:38] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:41:39] RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:43:06] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:09:54] (PawsJupyterHubDown) firing: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [15:10:19] (HAProxyBackendUnavailable) firing: (10) HAProxy service mysql backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:10:20] (HAProxyServiceUnavailable) firing: (6) HAProxy service neutron-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [15:10:35] (HarborDown) resolved: Harbor is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborDown [15:10:36] (HarborDown) resolved: Harbor is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborDown [15:11:50] (PawsJupyterHubDown) resolved: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [15:11:54] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:13:40] (GaleraClusterSizeMismatch) firing: (2) Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [15:15:04] (SystemdUnitDown) firing: (5) The service unit nova-api-metadata.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:15:20] (HAProxyBackendUnavailable) firing: (10) HAProxy service mysql backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:15:20] (HAProxyServiceUnavailable) resolved: (6) HAProxy service neutron-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [15:17:11] (NeutronAgentDown) firing: (51) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:20:19] (SystemdUnitDown) firing: (5) The service unit nova-api-metadata.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:22:14] RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:22:15] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:35:57] (NeutronAgentDown) resolved: (51) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:41:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-32.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [15:49:31] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors [15:49:33] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=0) [15:49:39] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors [15:49:42] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=0) [15:51:56] (ToolsGridQueueProblem) resolved: Grid queue webgrid-lighttpd@tools-sgeweblight-10-32.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [15:53:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [15:55:19] (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:56:26] 10Cloud-VPS, 10cloud-services-team: Alert when Galera is applying no writes - https://phabricator.wikimedia.org/T352552 (10taavi) [15:56:59] 10Cloud-VPS, 10cloud-services-team: Alert when Galera is applying no writes - https://phabricator.wikimedia.org/T352552 (10taavi) p:05Triage→03High a:03taavi [15:59:06] RECOVERY - ensure kvm processes are running on cloudvirt1046 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:04:19] PROBLEM - ensure kvm processes are running on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:09:31] RECOVERY - ensure kvm processes are running on cloudvirt1046 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:11:22] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (10ops-monitoring-bot) Host rebooted by fnegri@cumin1001 with reason: Rebooting to test the host is stable [16:12:55] 10Cloud-VPS, 10cloud-services-team: Alert when Galera is applying no writes - https://phabricator.wikimedia.org/T352552 (10taavi) 
Open→Resolved
[16:12:57] Cloud-VPS, cloud-services-team: 2023-12-01 Cloud VPS network outage - https://phabricator.wikimedia.org/T352539 (taavi) In progress→Resolved a:taavi
[16:12:59] Cloud-VPS, cloud-services-team: 2023-12-01 Cloud VPS network outage - https://phabricator.wikimedia.org/T352539 (taavi)
[16:19:45] !log admin fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T351171)
[16:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:19:51] T351171: cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171
[16:19:54] !log admin fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) (T351171)
[16:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:20:01] PROBLEM - ensure kvm processes are running on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:22:10] Cloud-VPS: SPF records for wmcloud.org and wmflabs.org are out of sync - https://phabricator.wikimedia.org/T352555 (bd808)
[16:26:51] (Abandoned) Ladsgroup: Use ORES in production instead of the cloud VPS setup [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/454866 (https://phabricator.wikimedia.org/T202653) (owner: Ladsgroup)
[16:30:19] (HAProxyBackendUnavailable) firing: (15) HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[16:35:19] (HAProxyBackendUnavailable) firing: (15) HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[16:38:40]
(GaleraClusterSizeMismatch) resolved: (2) Galera in has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [16:40:19] (HAProxyBackendUnavailable) resolved: (15) HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:40:40] (NeutronAgentDown) firing: (51) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [16:46:04] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [16:48:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [16:57:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [16:58:19] RECOVERY - ensure kvm processes are running on cloudvirt1046 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:00:05] 10Cloud-VPS, 10cloud-services-team: SPF records for wmcloud.org and wmflabs.org are out of sync - https://phabricator.wikimedia.org/T352555 (10bd808) [17:00:40] (NeutronAgentDown) resolved: (51) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [17:01:10] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook 
wmcs.openstack.restart_openstack (exit_code=0) [17:08:19] (HAProxyBackendUnavailable) firing: (2) HAProxy service keystone-admin-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:10:37] Cloud-VPS, cloud-services-team: 2023-12-01 Cloud VPS network outage - https://phabricator.wikimedia.org/T352539 (fnegri) [17:11:56] Cloud-VPS, cloud-services-team: 2023-12-01 Cloud VPS network outage - https://phabricator.wikimedia.org/T352539 (fnegri) [17:13:20] (HAProxyBackendUnavailable) resolved: (4) HAProxy service keystone-admin-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:15:22] Cloud-VPS, cloud-services-team: 2023-12-01 Cloud VPS network outage - https://phabricator.wikimedia.org/T352539 (fnegri) [17:34:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [17:37:55] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [17:43:19] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [17:47:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [17:48:19] (HAProxyBackendUnavailable) firing: (2) HAProxy service keystone-admin-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is
down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:53:20] (HAProxyBackendUnavailable) resolved: (2) HAProxy service keystone-admin-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:58:23] Toolforge Build Service: Add command/arguments to allow a script to wait on build completion/failure - https://phabricator.wikimedia.org/T352561 (bd808) [18:02:30] Grid-Engine-to-K8s-Migration: Migrate commons-delinquent from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319640 (mdaniels5757) This is actually NOT done completely, the category removal job still needs to be migrated. I will likely have time to get around to it in the n... [18:03:13] Grid-Engine-to-K8s-Migration: Migrate mdanielsbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319885 (mdaniels5757) Acknowledging that this exists. I will have the time to migrate in the next month or so, but probably not before the 14th. [18:03:18] Grid-Engine-to-K8s-Migration: Migrate mdanielsbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319885 (mdaniels5757) Acknowledging that this exists. I will have the time to migrate in the next month or so, but probably not before the 14th. [18:03:57] Grid-Engine-to-K8s-Migration: Migrate commons-delinquent from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319640 (taavi) Resolved→Open >>! In T319640#9375859, @mdaniels5757 wrote: > In the meantime, can someone please re-open this (and perhaps add me to Trusted-C...
[18:17:23] Grid-Engine-to-K8s-Migration: Migrate deletion-notification-bot-2 from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T352564 (mdaniels5757) [19:14:17] Toolforge (Toolforge iteration 02): Add command/arguments to allow a script to wait on build completion/failure - https://phabricator.wikimedia.org/T352561 (Slst2020) [19:16:37] (CephSlowOps) firing: Ceph cluster in eqiad has 2 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [19:16:42] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352570 (phaultfinder) [19:21:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 4 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [19:53:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [20:18:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [20:47:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:00:28] (CephSlowOps) firing: Ceph cluster in eqiad has 1091 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [21:00:28] (ProbeDown) firing: (4) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:01:10] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 340 bytes in 60.072 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [21:01:10] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.669 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [21:01:40] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352570 (phaultfinder) [21:02:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status -
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [21:03:33] (SystemdUnitDown) firing: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:04:45] (ProbeDown) resolved: (4) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:07:31] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.161 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [21:07:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [21:08:33] (SystemdUnitDown) resolved: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:08:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 288 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [21:16:51] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all workers [21:23:55] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [21:28:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:34:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [21:42:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-72 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:47:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-72 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:48:27] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [22:03:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-65 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:08:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-65 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:11:38] Tool-bub2: Google Books and Trove integration to Commons - https://phabricator.wikimedia.org/T352578 (Okerekechinweotito) [22:23:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-54 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:28:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-54 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:35:33] (InstanceDown) firing: (2) Project tools instance tools-k8s-worker-49 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:40:33] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-49 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:42:17] Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (Kanashimi) @taavi Sorry, could you please help me to see if it is because of my configuration that k8s-20201008.fix-anchor.archives.simple will not wait for k8s-20201008.fix-an...
[22:46:15] Tool-bub2: Non-alphanumeric titles or authors not getting uploaded to IA - https://phabricator.wikimedia.org/T352580 (wassan.anmol117) [22:51:37] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for all workers [23:47:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [23:50:04] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster [23:52:49] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster [23:55:04] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster