[00:05:10] Toolforge (Toolforge iteration 02): maintain-harbor: code refactor for readability and quality - https://phabricator.wikimedia.org/T351277 (Raymond_Ndibe)
[00:08:53] Toolforge (Toolforge iteration 02), Patch-For-Review: maintain-harbor: code refactor for readability and quality - https://phabricator.wikimedia.org/T351277 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/19 [maintain-harbor] minor...
[00:09:10] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:17:10] (TfInfraTestDestroyFailed) resolved: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:19:10] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:28:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:33:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:43:59] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[00:49:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:51:59] (PuppetFailure) firing: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:54:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:58:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[01:03:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[01:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:18:55] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:29:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[01:34:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[01:53:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:54:03] cloud-services-team: PuppetFailure cloudcontrol2004-dev:9100 Puppet failure on cloudcontrol2004-dev:9100 - https://phabricator.wikimedia.org/T351280 (phaultfinder)
[02:06:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:13:43] cloud-services-team: galera lock-up in codfw1dev - https://phabricator.wikimedia.org/T351281 (Andrew)
[02:16:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:26:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[02:31:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[02:38:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[02:43:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:14:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:15:57] cloud-services-team, Goal, Surveys: 2022 Cloud Services Survey - https://phabricator.wikimedia.org/T322500 (komla)
[03:16:23] cloud-services-team, Goal, Surveys: 2022 Cloud Services Survey - https://phabricator.wikimedia.org/T322500 (komla) In progress→Resolved
[03:19:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:24:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:44:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:58:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[04:08:59] cloud-services-team: galera lock-up in codfw1dev - https://phabricator.wikimedia.org/T351281 (Andrew) update: I'm now pretty sure that this is not a blocked port issue.
[04:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[04:13:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[04:52:14] (PuppetFailure) firing: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:18:55] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[05:48:05] Tool-iw, Toolforge: iw.toolforge.org does not support URL-encoded query parameters ([[toolforge:foo?bar]]) - https://phabricator.wikimedia.org/T345783 (Legoktm) Switching to path-based input is something that tool developers would need to support in each tool
[05:54:14] (PuppetFailure) firing: Puppet has failed on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:06:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:17:14] (PuppetFailure) firing: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:01:35] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:04:31] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[07:05:15] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING WARNING - Packet loss = 77%, RTA = 22.30 ms
[07:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[07:13:17] PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:57:37] (CephSlowOps) firing: Ceph cluster in eqiad has 6 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[07:57:43] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[08:02:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 6 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[08:24:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[08:29:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[08:34:09] Tool-masto-collab: masto-collab - 422 error when trying to approve posts - https://phabricator.wikimedia.org/T351012 (Peachey88) Open→Invalid Post was over content length which was causing the error
[08:38:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[08:43:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[08:52:14] (PuppetFailure) firing: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:14:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[09:18:55] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:19:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[09:37:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[09:50:54] Cloud-VPS, cloud-services-team, Patch-For-Review: Migrate cloudlb hosts to nftables - https://phabricator.wikimedia.org/T351087 (taavi) Open→Resolved
[09:50:57] Data-Services, cloud-services-team, Patch-For-Review: Move wiki replicas behind cloudlb - https://phabricator.wikimedia.org/T346947 (taavi)
[09:51:04] Data-Services, cloud-services-team, Data-Platform-SRE, Patch-For-Review: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (taavi) a:taavi
[09:51:07] Data-Services, cloud-services-team, Patch-For-Review: Move wiki replicas behind cloudlb - https://phabricator.wikimedia.org/T346947 (taavi) a:taavi
[09:53:21] Data-Services, cloud-services-team, Patch-For-Review: Move wiki replicas behind cloudlb - https://phabricator.wikimedia.org/T346947 (taavi)
[09:53:50] Data-Services, cloud-services-team, Infrastructure-Foundations: nftables ignores drange filter for IPv6 if drange only has IPv4 addresses - https://phabricator.wikimedia.org/T351094 (taavi) Open→Resolved a:jbond
[09:54:14] (PuppetFailure) firing: Puppet has failed on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:57:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[10:06:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:17:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[10:17:14] (PuppetFailure) firing: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:22:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[10:26:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[10:36:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[10:57:49] Data-Services, cloud-services-team (FY2023/2024-Q1), Data-Persistence: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (ABran-WMF) I think creating a synthetic indicator of `slave_io_running + slave_sql_running` and check if the value is `< 2` c...
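The synthetic indicator proposed in the T350943 comment above could be sketched roughly as follows. This is only an illustration: it assumes both replication flags are available as 0/1 values, and the function names are hypothetical, not taken from the task or from any exporter.

```python
def replication_indicator(slave_io_running: int, slave_sql_running: int) -> int:
    """Sum the two replication thread flags (each assumed to be 0 or 1)."""
    return slave_io_running + slave_sql_running


def should_alert(slave_io_running: int, slave_sql_running: int) -> bool:
    """Alert when at least one replication thread is not running (sum < 2)."""
    return replication_indicator(slave_io_running, slave_sql_running) < 2
```

With this shape, a stopped IO thread alone (`should_alert(0, 1)`) is enough to trigger the alert, which is the case the task title describes.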
[11:28:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[11:38:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[11:51:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[12:01:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[12:08:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[12:18:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[12:52:14] (PuppetFailure) firing: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:07:19] Toolforge (Toolforge iteration 02), Documentation: [tbs] Create a tutorial on how to deploy a ruby on rails tool using build service - https://phabricator.wikimedia.org/T347402 (Slst2020) I've been able to get a sample ruby on rails buildpack app running at https://sample-ruby-rails-buildpack-app.toolfor...
[13:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[13:18:55] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:22:37] (CephSlowOps) firing: Ceph cluster in eqiad has 13 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[13:22:42] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[13:27:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 13 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[13:54:14] (PuppetFailure) firing: Puppet has failed on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:21:02] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:21:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:41:05] cloud-services-team, decommission-hardware, Patch-For-Review: decommission cloudvirt1025-cloudvirt1030.eqiad.wmnet - https://phabricator.wikimedia.org/T351010 (Andrew)
[14:43:07] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri)
[14:43:59] (PuppetFailure) resolved: Puppet has failed on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:44:52] Toolforge, DBA: I can't log in to srwiki's DB replica on Toolforge - https://phabricator.wikimedia.org/T351316 (MBH)
[14:45:34] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri)
[14:46:59] (PuppetFailure) resolved: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:47:07] Cloud-Services, Data Products: I can't log in to srwiki's DB replica on Toolforge - https://phabricator.wikimedia.org/T351316 (Ladsgroup) DBAs don't maintain wikireplicas, we can't be of help to you sorry :(
[14:47:18] Data-Services, cloud-services-team, Data Products: I can't log in to srwiki's DB replica on Toolforge - https://phabricator.wikimedia.org/T351316 (taavi)
[15:01:44] Cloud-VPS, cloud-services-team, SRE, observability, and 2 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (Southparkfan)
[15:16:52] Cloud-VPS, cloud-services-team, SRE, observability, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (MoritzMuehlenhoff)
[15:18:38] Cloud-VPS, cloud-services-team, SRE, observability, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (MoritzMuehlenhoff) As part of the Puppet migration we already switched all Buster clients (where version of GNUTLS had problems with the new cert) toward...
[15:43:58] cloud-services-team, decommission-hardware, ops-eqiad: decommission cloudvirt1025-cloudvirt1030.eqiad.wmnet - https://phabricator.wikimedia.org/T351010 (Andrew) a:Andrew→None
[16:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[16:18:37] (CephSlowOps) firing: Ceph cluster in eqiad has 2 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[16:19:35] Toolforge Jobs framework, cloud-services-team, Pywikibot, Patch-For-Review, User-Raymond_Ndibe: Create Docker image for Toolforge that is purpose built to run pywikibot scripts - https://phabricator.wikimedia.org/T249787 (taavi) p:Medium→High a:taavi Per today's team WMCS meeting.
[16:20:00] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[16:23:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 2 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[16:26:43] Toolforge Jobs framework, cloud-services-team (FY2023/2024-Q1), Pywikibot, Patch-For-Review, User-Raymond_Ndibe: Create Docker image for Toolforge that is purpose built to run pywikibot scripts - https://phabricator.wikimedia.org/T249787 (fnegri)
[16:37:19] Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (Andrew) a:taavi→komla
[16:45:27] Data-Services, cloud-services-team (FY2023/2024-Q1), Data-Persistence: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (fnegri) @ABran-WMF I thought of adding something similar (`slave_io_running + slave_sql_running > 2`), but I suspect then I w...
[16:45:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:51:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:52:15] (PuppetFailure) firing: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:53:45] Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (komla) >>! In T350484#9321217, @Kanashimi wrote: > Thank you! I changed the settings so it now looks like I need to increase the number of continuous jobs. > > ` > # toolforg...
[16:56:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[17:06:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[17:07:25] Cloud-VPS (Project-requests), GitLab, Release-Engineering-Team, User-brennen: Request creation of releng-data VPS project - https://phabricator.wikimedia.org/T351330 (brennen)
[17:07:55] Cloud-VPS (Project-requests), GitLab, Release-Engineering-Team (Quid Pro Crow 🦃), User-brennen: Request creation of releng-data VPS project - https://phabricator.wikimedia.org/T351330 (brennen)
[17:12:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[17:18:55] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[17:20:07] Tools, GeoData, Community-Wishlist-Survey-2016, Maps (Kartographer): Better interface and visualisation for coordinates and map - https://phabricator.wikimedia.org/T157844 (Aklapper)
[17:27:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[17:52:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[17:57:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[18:03:26] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (bking) Resolved→In progress a:Jclark-ctr→bking
[18:05:43] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (bking) Reopening as cloudelastic1008-1010 don't appear to have reimaged properly, and we may need them for T350826.
[18:10:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:11:25] 10Tools, 10WMDE-TechWish-Maintenance, 10WMDE-TechWish-Maintenance-2023: Check technischewuensche tool code and publish in a public repo - https://phabricator.wikimedia.org/T350352 (10Aklapper) In case there's no capacity soon to review, https://www.mediawiki.org/wiki/GitLab/Hosting_a_project_on_GitLab#GitLab... [18:13:31] 10Tools, 10WMDE-TechWish-Maintenance, 10WMDE-TechWish-Maintenance-2023: Check technischewuensche tool code and publish in a public repo - https://phabricator.wikimedia.org/T350352 (10taavi) > The Toolforge rules request that The [[ https://wikitech.wikimedia.org/wiki/Help:Toolforge/Rules | Toolforge rules ]... [18:15:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:19:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:21:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:29:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:38:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project 
metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[18:48:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[18:51:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[18:54:16] Grid-Engine-to-K8s-Migration: Migrate zygserv from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320213 (komla) This is a `test`
[18:56:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[19:01:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[19:02:19] Cloud-VPS (Project-requests), GitLab, Release-Engineering-Team (Quid Pro Crow 🦃), User-brennen: Request creation of releng-data VPS project - https://phabricator.wikimedia.org/T351330 (bd808) @brennen, To avoid the appearance of `"Umbrella" projects with a broad scope, such as all the work to be...
[19:04:33] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:05:51] Cloud-VPS (Project-requests), GitLab, Release-Engineering-Team (Quid Pro Crow 🦃), User-brennen: Request creation of releng-data VPS project - https://phabricator.wikimedia.org/T351330 (bd808) > gerrit-stats involves a clone of all WMF repos from Gerrit, which comes to 34G. Spitballing, it would p...
[19:06:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[19:08:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[19:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[19:13:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[19:16:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[19:16:17] PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:21:03]
(PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[19:23:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[19:28:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[19:35:03] (InstanceDown) firing: Project quarry instance quarry-worker-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:40:03] (InstanceDown) resolved: Project quarry instance quarry-worker-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:45:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[19:46:26] PAWS: PAWS terraform to opentofu? - https://phabricator.wikimedia.org/T351249 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/349
[19:46:31] vivian-rook closed https://github.com/toolforge/paws/pull/349
[19:50:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[19:52:44] PAWS: PAWS terraform to opentofu?
- https://phabricator.wikimedia.org/T351249 (rook) Open→Resolved
[19:54:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[19:58:55] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[19:59:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:00:04] cloud-services-team: PuppetFailure cloudcontrol2001-dev:9100 Puppet failure on cloudcontrol2001-dev:9100 - https://phabricator.wikimedia.org/T351346 (phaultfinder)
[20:04:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[20:11:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[20:21:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[20:27:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[20:32:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over
24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[20:37:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[20:42:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[20:45:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[20:46:14] (PuppetFailure) firing: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:49:59] (PuppetFailure) resolved: Puppet has failed on cloudcontrol2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:50:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[20:51:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:52:59] (PuppetFailure) firing: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
- https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:58:59] Cloud-VPS (Project-requests), GitLab, Release-Engineering-Team (Quid Pro Crow 🦃), User-brennen: Request creation of releng-data VPS project - https://phabricator.wikimedia.org/T351330 (brennen) > would y'all be ok with calling the project something like "devel-stats" as a riff on https://www.medi...
[20:59:15] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (bking) In progress→Resolved Not sure what happened, but the cloudelastic1008-1010 hosts are up after a reim...
[21:07:54] wikitech.wikimedia.org: Requesting content administrator permissions access for Triciaburmeister - https://phabricator.wikimedia.org/T347346 (TBurmeister) @nskaggs we talked about this in our 1:1 on Oct 3 but I still don't have these rights; can you help? Thanks!
[21:12:26] Cloud-VPS (Project-requests), GitLab, Release-Engineering-Team (Quid Pro Crow 🦃), User-brennen: Request creation of devel-stats VPS project - https://phabricator.wikimedia.org/T351330 (bd808)
[21:13:56] Cloud-VPS (Project-requests), GitLab, Release-Engineering-Team (Quid Pro Crow 🦃), User-brennen: Request creation of devel-stats VPS project - https://phabricator.wikimedia.org/T351330 (bd808) +1 from me for creating the project
[21:14:54] Cloud-VPS (Project-requests), cloud-services-team, GitLab, Release-Engineering-Team (Quid Pro Crow 🦃), User-brennen: Request creation of devel-stats VPS project - https://phabricator.wikimedia.org/T351330 (bd808)
[21:20:41] cloud-services-team, Patch-For-Review: galera lock-up in codfw1dev - https://phabricator.wikimedia.org/T351281 (taavi)
[21:20:43] Cloud-VPS, cloud-services-team: Instance deletion times out in codfw1dev - https://phabricator.wikimedia.org/T351061 (taavi)
[22:10:59] (PuppetFailure) resolved: Puppet
has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[22:21:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:58:23] wikitech.wikimedia.org, User-bd808: Requesting content administrator permissions access for Triciaburmeister - https://phabricator.wikimedia.org/T347346 (bd808) Open→In progress a:bd808
[23:02:25] wikitech.wikimedia.org, User-bd808: Requesting content administrator permissions access for Triciaburmeister - https://phabricator.wikimedia.org/T347346 (bd808) In progress→Resolved https://wikitech.wikimedia.org/w/index.php?title=Special:Log&logid=964019
[23:18:22] cloud-services-team (FY2023/2024-Q1), wikitech.wikimedia.org: [wikitech] administrator rights for WMCS - https://phabricator.wikimedia.org/T347557 (bd808) `editprotected` is technically the right that is needed to edit pages marked visually as "Allow only administrators". On wikitech most folks with this...
[23:26:54] cloud-services-team (FY2023/2024-Q1), wikitech.wikimedia.org: [wikitech] administrator rights for WMCS - https://phabricator.wikimedia.org/T347557 (bd808) >>! In T347557#9235094, @fnegri wrote: > Thanks, then maybe the note in the page could be "This page is protected, if you need to edit it please [cont...