[00:08:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:18:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:44:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:06:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:41:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:06:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:38:42] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:44:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:46:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:06:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:41:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[06:23:07] Data-Services, DBA, Data-Engineering: Prepare and check storage layer for tlywiki - https://phabricator.wikimedia.org/T345169 (Marostegui) @btullis can you give this some priority? It's been sitting here for a while and {T349424} is blocked on it.
[06:44:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[06:51:25] Cloud-VPS, cloud-services-team, User-fgiunchedi: Linting problems found for OpenstackAPIResponse - https://phabricator.wikimedia.org/T349801 (fgiunchedi) See also {T343885} which I believe is related
[06:51:34] Cloud-VPS, cloud-services-team: Linting problems found for OpenstackAPIResponse - https://phabricator.wikimedia.org/T349801 (fgiunchedi)
[06:53:37] Cloud-VPS, cloud-services-team, Observability-Metrics, Patch-For-Review, User-fgiunchedi: Move labs/wmcs (OpenStack) Prometheus instance off cloudmetrics hosts to prometheus* hosts - https://phabricator.wikimedia.org/T336854 (fgiunchedi) As far as I'm concerned the `prometheus` hosts bits of...
[06:57:05] Cloud-VPS, cloud-services-team, Observability-Metrics, Patch-For-Review: Current status of cloudmetrics and its components - https://phabricator.wikimedia.org/T336774 (fgiunchedi)
[07:06:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:24:40] Cloud-VPS, cloud-services-team, Observability-Metrics, Patch-For-Review: Current status of cloudmetrics and its components - https://phabricator.wikimedia.org/T336774 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudmetrics1003.eqiad.wmnet wit...
[07:38:26] PROBLEM - Check unit status of wmcs_monitoring_graphite_rsync on cloudmetrics1004 is CRITICAL: CRITICAL: Status of the systemd unit wmcs_monitoring_graphite_rsync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:38:42] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:42:52] PROBLEM - Check systemd state on cloudmetrics1004 is CRITICAL: CRITICAL - degraded: The following units failed: wmcs_monitoring_graphite_rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:46:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:55:13] Cloud-VPS, cloud-services-team, Observability-Metrics, Patch-For-Review: Current status of cloudmetrics and its components - https://phabricator.wikimedia.org/T336774 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS...
[07:55:51] Cloud-VPS, cloud-services-team, Observability-Metrics, Patch-For-Review: Current status of cloudmetrics and its components - https://phabricator.wikimedia.org/T336774 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudmetrics1004.eqiad.wmnet wit...
[07:58:15] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro)
[08:00:20] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro)
[08:26:08] Cloud-VPS, cloud-services-team, Observability-Metrics, Patch-For-Review: Current status of cloudmetrics and its components - https://phabricator.wikimedia.org/T336774 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudmetrics1004.eqiad.wmnet with OS...
[08:26:36] Cloud-VPS, cloud-services-team, Observability-Metrics, Patch-For-Review, User-fgiunchedi: Move labs/wmcs (OpenStack) Prometheus instance off cloudmetrics hosts to prometheus* hosts - https://phabricator.wikimedia.org/T336854 (taavi)
[08:46:48] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[08:47:11] !log taavi@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[08:55:42] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[08:55:58] !log taavi@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[09:04:32] !log admin dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node
[09:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:05:35] !log admin dcaro@urcuchillay END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99)
[09:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:06:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:11:24] !log toolsbeta dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console
[09:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[09:18:57] !log toolsbeta dcaro@urcuchillay END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[09:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[09:21:04] !log toolsbeta dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console
[09:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[09:21:33] !log toolsbeta dcaro@urcuchillay END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[09:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[09:26:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:30:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:32:44] Cloud-VPS, cloud-services-team: Remove the WMCS statsd/Graphite service - https://phabricator.wikimedia.org/T326266 (taavi) Open→Resolved
[09:32:49] Cloud-VPS, cloud-services-team, Observability-Metrics: Current status of cloudmetrics and its components - https://phabricator.wikimedia.org/T336774 (taavi)
[09:35:52] Cloud-VPS, cloud-services-team: Linting problems found for OpenstackAPIResponse - https://phabricator.wikimedia.org/T349801 (taavi) a:taavi This seems to be fallout of the cloudlb migration. Thanks for spotting, I'll write a patch to fix the alert.
[09:36:18] cloud-services-team (FY2023/2024-Q1), Epic, Goal: openstack eqiad1: introduce cloud-private and cloudlb - https://phabricator.wikimedia.org/T341060 (taavi)
[09:36:20] Cloud-VPS, cloud-services-team: Linting problems found for OpenstackAPIResponse - https://phabricator.wikimedia.org/T349801 (taavi)
[09:45:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:49:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:08:42] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:09:31] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:09:49] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[10:10:22] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[10:14:31] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:17:11] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[10:18:09] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[10:18:23] RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[10:18:42] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:51:48] Cloud-VPS, cloud-services-team: Linting problems found for OpenstackAPIResponse - https://phabricator.wikimedia.org/T349801 (fgiunchedi) Open→Resolved Alert is gone, resolving
[11:51:51] cloud-services-team (FY2023/2024-Q1), Epic, Goal: openstack eqiad1: introduce cloud-private and cloudlb - https://phabricator.wikimedia.org/T341060 (fgiunchedi)
[11:51:57] Cloud-VPS, cloud-services-team, Observability-Metrics, User-fgiunchedi: Move labs/wmcs (OpenStack) Prometheus instance off cloudmetrics hosts to prometheus* hosts - https://phabricator.wikimedia.org/T336854 (fgiunchedi)
[12:03:37] Grid-Engine-to-K8s-Migration: Migrate ytcleaner from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320199 (Mbch331) Open→Resolved Bot now runs in k8s and no longer in gridengine
[12:11:34] (PS3) David Caro: mypy: skip build directory [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966132
[12:11:36] (PS3) David Caro: alerts: don't fail if host already downtimed or uptimed [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966133
[12:11:38] (PS7) David Caro: openstack: don't pass the new project when creating it [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427)
[12:11:40] (CR) David Caro: openstack: don't pass the new project when creating it (2 comments) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: David Caro)
[12:11:42] (PS3) David Caro: ceph: Adapt to multi-level crush tree [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966135 (https://phabricator.wikimedia.org/T331145)
[12:11:44] (PS3) David Caro: ceph: add drain/undrain host and rack cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966136 (https://phabricator.wikimedia.org/T329709)
[12:11:46] (PS1) David Caro: ceph: add missing cumin_params [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969321
[12:11:47] !log admin dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node
[12:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:13:03] !log admin dcaro@urcuchillay END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99)
[12:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:15:34] (CR) CI reject: [V: -1] ceph: add missing cumin_params [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969321 (owner: David Caro)
[12:26:40] !log admin dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node
[12:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:29:33] Toolforge, cloud-services-team: tools-nfs-2 almost out of disk space (October 2023 edition) - https://phabricator.wikimedia.org/T349895 (taavi) p:Triage→High
[12:49:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[12:53:35] Cloud Services Proposals: Decision Request - Incident Response Process - https://phabricator.wikimedia.org/T348887 (dcaro) So to try to keep the subject going, I'll define a bit what 'incident response process' means to me (feel free to tell me otherwise!): An Incidence response process is a document that d...
[12:57:02] Tools: 'deletion-notification-bot-2' tool uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T349898 (taavi)
[13:07:22] Tools: 'digero' tool uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T349899 (taavi)
[13:09:42] Tools: eatchabot using a lot of NFS storage - https://phabricator.wikimedia.org/T284968 (taavi)
[13:09:44] Toolforge, cloud-services-team: tools-nfs-2 almost out of disk space (October 2023 edition) - https://phabricator.wikimedia.org/T349895 (taavi)
[13:09:54] !log admin dcaro@urcuchillay END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0)
[13:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:18:53] Tool-Global-user-contributions, IP Masking, Stewards-and-global-tools, XTools, Design: [Design EPIC] Global User Contributions - https://phabricator.wikimedia.org/T349901 (KColeman-WMF)
[13:21:34] !log admin dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node
[13:21:35] (CR) FNegri: [C: -1] openstack: don't pass the new project when creating it (2 comments) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: David Caro)
[13:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:26:00] Tool-Global-user-contributions, IP Masking, Stewards-and-global-tools, XTools, and 2 others: [Design] Create user flows for different GUC scenarios - https://phabricator.wikimedia.org/T349902 (KColeman-WMF)
[13:27:59] Tool-Global-user-contributions, IP Masking, Stewards-and-global-tools, XTools, Design: [Design EPIC] Global User Contributions - https://phabricator.wikimedia.org/T349901 (KColeman-WMF)
[13:28:21] Tool-Global-user-contributions, Design-Research, IP Masking, Stewards-and-global-tools, and 3 others: [Design research] Understand usage of current GUC tool - https://phabricator.wikimedia.org/T347618 (KColeman-WMF) Open→Resolved
[13:28:29] Tool-Global-user-contributions, IP Masking, Stewards-and-global-tools, XTools, Epic: [Epic] Implement global user contributions feature - https://phabricator.wikimedia.org/T337089 (KColeman-WMF)
[13:30:07] Tool-Global-user-contributions, IP Masking, Stewards-and-global-tools, XTools, and 2 others: [Design] Create user flows for different GUC scenarios - https://phabricator.wikimedia.org/T349902 (KColeman-WMF) a:KColeman-WMF
[13:36:51] PROBLEM - Host cloudcephosd1021 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:41] Tool-Global-user-contributions, IP Masking, Stewards-and-global-tools, XTools, and 2 others: [Design] Comparative review - https://phabricator.wikimedia.org/T349907 (KColeman-WMF)
[13:40:11] Tool-Global-user-contributions, IP Masking, Stewards-and-global-tools, XTools, and 2 others: [Design] Comparative review - https://phabricator.wikimedia.org/T349907 (KColeman-WMF) a:KColeman-WMF
[13:40:19] Tool-Global-user-contributions, IP Masking, Stewards-and-global-tools, XTools, Design: [Design EPIC] Global User Contributions - https://phabricator.wikimedia.org/T349901 (KColeman-WMF)
[13:44:05] RECOVERY - Host cloudcephosd1021 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[13:44:54] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned, and 5 others: Cloud Ceph outage 2023-02-13 - https://phabricator.wikimedia.org/T329535 (Aklapper) >>! In T329535#9246104, @dcaro wrote: > I'll close this for now @dcaro: But you didn't...
[13:52:37] (CephSlowOps) firing: Ceph cluster in eqiad has 43 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[13:52:43] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[13:56:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:01:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:02:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 42 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[14:02:45] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr)
[14:18:42] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[14:23:02] Tools: is 'img-usage' tool still in use? - https://phabricator.wikimedia.org/T349912 (taavi)
[14:29:49] Cloud Services Proposals, Toolforge: Decision request - Toolforge external infrastructure domain usage - https://phabricator.wikimedia.org/T306039 (nskaggs) @taavi This has been added to the next weekly agenda per https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Decision_Making#How_does_the_d...
[14:31:32] Tools: 'hoiscript' tool uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T349913 (taavi)
[14:34:13] Tools: 'hoiscript' tool uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T349913 (taavi) I have truncated the `/data/project/hoiscript/ndlcrawl.err` log file which was 225G alone. Please reduce the logging verbosity or add a cleanup job to ensure it doesn't grow that huge again....
[14:39:31] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[14:44:31] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[14:49:13] Tools: 'digero' tool uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T349899 (jberkel) We don't really need to keep all the old dumps around, I've started the deletion of all dump files before 2023. Different files are needed different purposes: for the stats, and for the "wante...
[14:49:31] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[15:09:59] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned, and 5 others: Cloud Ceph outage 2023-02-13 - https://phabricator.wikimedia.org/T329535 (dcaro) In progress→Resolved :facepalm:
[15:10:14] Toolforge, cloud-services-team (FY2022/2023-Q3), Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned, and 2 others: Toolforge grid: start webservices after outage - https://phabricator.wikimedia.org/T329611 (dcaro)
[15:10:16] PAWS: PAWS down - https://phabricator.wikimedia.org/T329581 (dcaro)
[15:10:20] Cloud-VPS, cloud-services-team (FY2022/2023-Q3), Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned: grafana.wmcloud.org offline following cloud wide outage - https://phabricator.wikimedia.org/T329590 (dcaro)
[15:10:24] Cloud-VPS, cloud-services-team: gerrit copy of cloud/instance-puppet stopped replicating - https://phabricator.wikimedia.org/T329589 (dcaro)
[15:10:28] cloud-services-team (Hardware), DC-Ops, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [ceph] Move cloudcephosd1001 (b7) and cloudcephosd1002 (b4) to rack e4 - https://phabricator.wikimedia.org/T329498 (dcaro)
[15:10:36] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (dcaro) a:dcaro
[15:26:58] !log admin dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.undrain_node
[15:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:27:07] !log admin dcaro@urcuchillay END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99)
[15:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:27:56] Tools: 'wikitanvirbot' tool missing pywikibot config - https://phabricator.wikimedia.org/T349916 (taavi)
[15:28:03] !log admin dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.undrain_node
[15:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:29:41] !log admin dcaro@urcuchillay END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[15:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:29:50] !log admin dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.undrain_node
[15:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:31:26] !log admin dcaro@urcuchillay END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[15:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:34:31] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro)
[15:38:42] !log admin dcaro@urcuchillay END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99)
[15:38:49] !log admin dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node
[15:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:40:52] Tools: 'hoiscript' tool uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T349913 (Hoi) I have deleted the stale logs. The 20k PDF are public domain books to be uploaded to Commons. I think it would take a few more days to complete.
[15:46:43] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] Enable disk failure prediciton - https://phabricator.wikimedia.org/T349694 (dcaro)
[15:49:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[16:26:26] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[16:52:53] Data-Services, DBA, Data-Engineering: Prepare and check storage layer for tlywiki - https://phabricator.wikimedia.org/T345169 (BTullis) Open→Resolved a:BTullis Apologies for the delay. This work is done now. Since the reorg I'm now in the #data-platform-sre team, so I didn't see this tick...
[17:01:26] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[18:13:54] RECOVERY - Disk space on cloudbackup2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudbackup2001&var-datasource=codfw+prometheus/ops
[18:39:41] Cloud-VPS, Beta-Cluster-Infrastructure: New Cloud VPS instance has "Failed to start Execute cloud user/final script." in its error log - https://phabricator.wikimedia.org/T349937 (Tgr)
[18:40:52] Cloud-VPS, Beta-Cluster-Infrastructure: New Cloud VPS instance has "Failed to start Execute cloud user/final script." in its error log - https://phabricator.wikimedia.org/T349937 (Tgr) I also got this error as part of the login message, but it seems unrelated: ` -bash: warning: setlocale: LC_ALL: cannot...
[18:49:31] (OpenstackAPIResponse) firing: (5) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[18:49:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[20:10:37] (CephSlowOps) firing: Ceph cluster in eqiad has 2 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[20:10:43] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[20:15:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 2 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[21:01:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[21:12:51] Cloud-VPS, Beta-Cluster-Infrastructure: New Cloud VPS instance has "Failed to start Execute cloud user/final script." in its error log - https://phabricator.wikimedia.org/T349937 (Tgr) Open→Invalid Okay so I found https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster#Step_2:_Setup_a_pu...
[21:49:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[22:06:02] (PS2) Dwisehaupt: Add dummy secrets for community_civicrm [labs/private] - https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486)
[22:08:10] (CR) Dwisehaupt: "jgreen, if you can verify these, we can go ahead and merge this up." [labs/private] - https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[22:53:43] (OpenstackAPIResponse) firing: (5) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse