[00:04:00] FIRING: [3x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:04:22] RESOLVED: [3x] HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:05:41] RESOLVED: CloudVPSDesignateLeaks: Detected 8 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:06:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:08:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[00:10:21] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[00:14:12] Cloud-Services: Prepare "What's new with Wikimedia Cloud Services" presentation for WikiConNA 2024 - https://phabricator.wikimedia.org/T373159 (bd808) NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and...
[00:17:00] RESOLVED: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[00:18:15] Cloud-Services: Prepare "What's new with Wikimedia Cloud Services" presentation for WikiConNA 2024 - https://phabricator.wikimedia.org/T373159#10086719 (bd808) p:Triage→Medium a:bd808
[00:18:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:20:19] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[00:21:50] FIRING: TfInfraTestApplyFailed: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:23:49] FIRING: TfInfraTestDestroyFailed: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:26:56] FIRING: SystemdUnitDown: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:30:27] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[00:32:43] Cloud-Services: Prepare "What's new with Wikimedia Cloud Services" presentation for WikiConNA 2024 - https://phabricator.wikimedia.org/T373159#10086732 (bd808) @Andrew, @dcaro, @Slst2020, @komla, @taavi: ideas for topics to cover are very welcome. At a high level I have been thinking about telling the story...
[00:39:55] FIRING: MaxConntrack: Max conntrack at 80.1% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:44:55] RESOLVED: MaxConntrack: Max conntrack at 80.1% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:48:26] RESOLVED: SystemdUnitDown: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:08:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:13:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[02:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:08:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:13:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:43:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:23:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:29:56] (CR) Jean-Frédéric: [C:+1] "I think ./toolbox should also be considered MIT? It was the continuation of the ./api. But we can also deal with it later." [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064471 (https://phabricator.wikimedia.org/T174633) (owner: Lokal Profil)
[07:30:39] (CR) Jean-Frédéric: [C:+2] Update documentation on localhost address [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064474 (owner: Lokal Profil)
[07:32:27] (CR) Jean-Frédéric: [C:+2] "> With patch 669b549 the localhost addresses changed from localhost:8000:80" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064474 (owner: Lokal Profil)
[07:32:56] (Merged) jenkins-bot: Update documentation on localhost address [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064474 (owner: Lokal Profil)
[08:08:41] FIRING: PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[08:09:03] (PS1) Jean-Frédéric: Use toolforge-jobs to install requirements in deployment process [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065124
[08:09:03] (PS1) Jean-Frédéric: Remove `composer update` step from build-php script [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065125
[08:09:28] (PS2) Jean-Frédéric: Use toolforge-jobs to install requirements during deployment [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065124
[08:09:28] (PS2) Jean-Frédéric: Remove `composer update` step from build-php script [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065125
[08:33:41] RESOLVED: PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[09:05:57] Toolforge: ChieBot: Intermittent connection reset by peer errors - https://phabricator.wikimedia.org/T356163#10087095 (Leloiandudu) I haven't seen these for a few months but started getting them every couple of hours today. The error message is slightly different now: `7:16:58 AM Got 'Resource temporarily un...
[09:08:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:13:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:17:14] FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[09:17:14] FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[09:18:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:22:14] RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[09:22:14] RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[09:28:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:40:00] FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[09:43:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:45:00] RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[10:20:09] Toolforge-standards-committee: Adoption request for Yapperbot - https://phabricator.wikimedia.org/T361426#10087333 (DavidTornheim) >>! In T361426#10086510, @bd808 wrote: >>>! In T361426#10086509, @bd808 wrote: >> Someone wrote to the page that causes the bot to halt. The edit looks like a vandal: https://en....
[10:38:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:11:28] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[11:18:28] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:18:59] Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10087457 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/quarry/pull/67
[11:19:03] vivian-rook opened https://github.com/toolforge/quarry/pull/67
[11:28:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:36:13] (CR) Jforrester: [C:+2] build: Updating composer dependencies [labs/tools/coverme] - https://gerrit.wikimedia.org/r/1061184 (owner: Libraryupgrader)
[13:48:49] RESOLVED: TfInfraTestDestroyFailed: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[14:08:59] Toolforge: Toolforge buildservice logs error - https://phabricator.wikimedia.org/T373201 (Bawolff) NEW
[14:42:13] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1062.eqiad.wmnet' (T369044)
[14:42:19] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044
[14:44:53] Cloud-VPS (Debian Buster Deprecation), Wikispore: Rebuild Wikispore Vagrant boxes on Bullseye or Bookworm - https://phabricator.wikimedia.org/T365934#10088082 (Andrew) *bump*
[14:47:45] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "dumps" project Buster deprecation - https://phabricator.wikimedia.org/T367528#10088093 (Andrew)
[14:48:27] PROBLEM - nova-compute proc minimum on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:48:42] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1062.eqiad.wmnet' (T369044)
[14:48:52] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044
[14:49:27] RECOVERY - nova-compute proc minimum on cloudvirt1062 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:50:06] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "dumps" project Buster deprecation - https://phabricator.wikimedia.org/T367528#10088096 (Andrew) Open→Resolved
[14:53:56] FIRING: SystemdUnitDown: The service unit neutron-openvswitch-agent.service is in failed status on host cloudvirt1062. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:08:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:13:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:20:41] FIRING: CloudVPSDesignateLeaks: Detected 9 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:20:56] cloud-services-team (FY2024/2025-Q1-Q2), Puppet-Infrastructure, Patch-For-Review: Ownership confusion on cloud-local puppet servers - https://phabricator.wikimedia.org/T364492#10088196 (Andrew) Open→Resolved I think this is a little better after the last round of fixes
[15:55:57] VPS-Projects: magnum clusters not deploying in eqiad1 - https://phabricator.wikimedia.org/T373207 (rook) NEW
[16:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 15 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:37:11] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [tf-infra-tests] Failing to destroy - volumes stuck - https://phabricator.wikimedia.org/T352895#10088407 (rook) This appears to have been repaired somewhere along the line. Tofu seems to be...
[16:37:12] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [tf-infra-tests] Failing to destroy - volumes stuck - https://phabricator.wikimedia.org/T352895#10088408 (rook) In progress→Resolved
[16:48:56] FIRING: SystemdUnitDown: The systemd unit neutron-openvswitch-agent.service on node cloudvirt1062 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:49:03] cloud-services-team: SystemdUnitDown Unit neutron-openvswitch-agent.service on node cloudvirt1062 has been down for long. - https://phabricator.wikimedia.org/T373214 (phaultfinder) NEW
[17:09:29] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:14:29] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:19:29] RESOLVED: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:26:58] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:27:41] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Goal: [harbor] Create backups and/or replication - https://phabricator.wikimedia.org/T336668#10088664 (Raymond_Ndibe)
[18:27:41] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Goal: [harbor] Deploy with Helm - https://phabricator.wikimedia.org/T356301#10088665 (Raymond_Ndibe)
[18:28:40] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Goal: [harbor] Create backups and/or replication - https://phabricator.wikimedia.org/T336668#10088667 (Raymond_Ndibe)
[18:28:41] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Goal: [harbor] Move harbor data to object storage service - https://phabricator.wikimedia.org/T350687#10088668 (Raymond_Ndibe)
[18:30:17] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Goal: [harbor] Create backups and/or replication - https://phabricator.wikimedia.org/T336668#10088675 (Raymond_Ndibe)
[18:30:18] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Goal: [harbor] Move harbor data to object storage service - https://phabricator.wikimedia.org/T350687#10088676 (Raymond_Ndibe)
[19:01:58] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:04:43] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure, Patch-For-Review: Remove or replace deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation) - https://phabricator.wikimedia.org/T370460#10088786 (Eevans) Ok, it seems to be working n...
[19:19:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:23:11] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure, Patch-For-Review: Remove or replace deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation) - https://phabricator.wikimedia.org/T370460#10088831 (Eevans) I think we can mark this clo...
[19:34:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:38:18] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Goal: [harbor] Move harbor data to object storage service - https://phabricator.wikimedia.org/T350687#10088905 (Raymond_Ndibe) == Possible Steps == **Toolsbeta:** [x] create `harborstorage` object storage on horizon [] figure out authentication for...
[20:19:17] cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure: Provisioning of Kubernetes cluster via Magnum stopped working around time time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227 (bd808) NEW
[20:19:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:20:22] FIRING: [2x] HAProxyBackendUnavailable: HAProxy service heat-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[20:20:45] cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure: Provisioning of Kubernetes cluster via Magnum stopped working around time time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10089005 (bd808) @Andrew, any idea about where I should start looking for hints about what might...
[20:23:56] FIRING: [2x] SystemdUnitDown: The service unit heat-api.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[20:34:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:49:11] FIRING: SystemdUnitDown: The systemd unit neutron-openvswitch-agent.service on node cloudvirt1062 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:09:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:23:56] FIRING: [2x] SystemdUnitDown: The service unit heat-api.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:25:22] FIRING: [2x] HAProxyBackendUnavailable: HAProxy service heat-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[21:28:56] RESOLVED: [2x] SystemdUnitDown: The service unit heat-api.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:30:22] RESOLVED: [2x] HAProxyBackendUnavailable: HAProxy service heat-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[21:30:41] FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[21:31:11] FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[21:31:11] FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[21:34:04] cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure: Provisioning of Kubernetes cluster via Magnum stopped working around time time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10089087 (bd808) https://docs.openstack.org/magnum/2024.1/admin/troubleshooting-guide.html#heat-...
[21:34:20] cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10089088 (bd808)
[21:34:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:35:41] RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[21:36:11] RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[21:36:11] RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[21:49:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:55:10] cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10089118 (bd808)
[22:04:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[23:34:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:34:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[23:40:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown