[00:34:57] 10Toolforge: Weird error HTTP 405 Method Not Allowed on Toolforge - https://phabricator.wikimedia.org/T349452 (10Albertoleoncio) [00:46:10] PROBLEM - Check unit status of backup_cinder_volumes on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:27:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-1 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [01:50:03] (PuppetAgentFailure) firing: (2) Puppet agent failure detected on instance tools-sgeweblight-10-14 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [02:19:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [02:38:39] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [04:27:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-1 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [04:50:03] (PuppetAgentFailure) firing: (2) Puppet agent failure detected on instance tools-sgeweblight-10-14 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [05:19:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [06:38:39] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [07:27:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-1 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [07:50:03] (PuppetAgentFailure) firing: (2) Puppet agent failure detected on instance tools-sgeweblight-10-14 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [08:07:05] 10Toolforge: Weird error HTTP 405 Method Not Allowed on Toolforge - https://phabricator.wikimedia.org/T349452 (10taavi) a:03taavi This seems to be coming from the proxy error handler. 
[08:15:40] 10Toolforge: Weird error HTTP 405 Method Not Allowed on Toolforge - https://phabricator.wikimedia.org/T349452 (10taavi) 05Open→03Resolved Fixed with https://gitlab.wikimedia.org/toolforge-repos/fourohfour/-/commit/c9b6facdcdc6c302c1d1510603b84fea255d2993. [08:19:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [08:20:27] (PrometheusRestarted) firing: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted [08:24:39] 10Toolforge, 10cloud-services-team, 10Acme-chief, 10Patch-For-Review: toolforge acme-chief: Failed to generate additional resources using 'eval_generate': Could not intern_multiple from application/json: 416: unexpected token at '{"checksum":{"type":"md5","val' - https://phabricator.wikimedia.org/T349384 (1... [08:28:27] (PrometheusRestarted) firing: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted [08:30:03] (PuppetAgentFailure) resolved: (2) Puppet agent failure detected on instance tools-sgeweblight-10-14 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [08:45:27] (PrometheusRestarted) resolved: Prometheus/cloud restarted: beware monitoring artifacts. 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted [08:49:18] 10Toolforge (Toolforge iteration 01), 10Patch-For-Review: Upgrade harbor from 2.5 to 2.9 - https://phabricator.wikimedia.org/T346241 (10CodeReviewBot) sstefanova merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/19 harbor: upgrade to 2.9.0 [08:53:27] (PrometheusRestarted) resolved: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted [08:58:04] 10cloud-services-team: CephSlowOps Ceph cluster in has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349425 (10dcaro) [08:58:11] 10Cloud-VPS, 10cloud-services-team, 10DC-Ops, 10SRE, 10ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) [09:09:31] 10Tool-bub2: Redesign the FAQs page - https://phabricator.wikimedia.org/T340385 (10Aklapper) [09:09:35] 10Tool-bub2: Redesign the UI to be more minimalistic and cleaner - https://phabricator.wikimedia.org/T340387 (10Aklapper) [09:21:12] 10Toolforge (Toolforge iteration 01): Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (10Slst2020) a:03Slst2020 [09:27:59] (PuppetFailure) firing: Puppet has failed on cloudgw1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:28:09] 10cloud-services-team: PuppetFailure cloudgw1002:9100 Puppet failure on cloudgw1002:9100 - 
https://phabricator.wikimedia.org/T349484 (10phaultfinder) [09:31:40] 10cloud-services-team: PuppetFailure cloudgw1002:9100 Puppet failure on cloudgw1002:9100 - https://phabricator.wikimedia.org/T349484 (10taavi) 05Open→03Resolved ` Oct 23 09:06:45 cloudgw1002 systemd[1]: prometheus-node-textfile-check-nft.timer: Timer unit lacks value setting. Refusing. ` seems to be related... [09:32:59] (PuppetFailure) resolved: Puppet has failed on cloudgw1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:55:28] 10Cloud-VPS, 10cloud-services-team, 10Observability-Metrics, 10Patch-For-Review, 10User-fgiunchedi: Move labs/wmcs (OpenStack) Prometheus instance off cloudmetrics hosts to prometheus* hosts - https://phabricator.wikimedia.org/T336854 (10fgiunchedi) [09:58:39] (OpenstackAPIResponse) firing: (12) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [10:13:39] (OpenstackAPIResponse) firing: (12) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [10:17:57] (PrometheusNotConnectedToAM) firing: Prometheus is failing to connect to AlertManager - https://wikitech.wikimedia.org/wiki/Alertmanager#Alerts - https://grafana.wikimedia.org/d/eea-9_sik/alertmanager - https://alerts.wikimedia.org/?q=alertname%3DPrometheusNotConnectedToAM [10:17:57] (PrometheusNotConnectedToAM) firing: Prometheus is failing to connect to AlertManager - https://wikitech.wikimedia.org/wiki/Alertmanager#Alerts - https://grafana.wikimedia.org/d/eea-9_sik/alertmanager - https://alerts.wikimedia.org/?q=alertname%3DPrometheusNotConnectedToAM [10:18:06] 10cloud-services-team: PrometheusNotConnectedToAM prometheus1006:9904 Prometheus is failing to connect to AlertManager - https://phabricator.wikimedia.org/T349490 (10phaultfinder) [10:20:31] 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, 10netops, and 2 others: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10cmooney) Commands for change later on: ` wmcs-openstack port unset ca4cb8c7-bfb8-440b-8e41-74bb8e834717 --fixed-ip subnet=clo... 
[10:22:57] (PrometheusNotConnectedToAM) resolved: Prometheus is failing to connect to AlertManager - https://wikitech.wikimedia.org/wiki/Alertmanager#Alerts - https://grafana.wikimedia.org/d/eea-9_sik/alertmanager - https://alerts.wikimedia.org/?q=alertname%3DPrometheusNotConnectedToAM [10:22:57] (PrometheusNotConnectedToAM) resolved: Prometheus is failing to connect to AlertManager - https://wikitech.wikimedia.org/wiki/Alertmanager#Alerts - https://grafana.wikimedia.org/d/eea-9_sik/alertmanager - https://alerts.wikimedia.org/?q=alertname%3DPrometheusNotConnectedToAM [10:24:19] 10cloud-services-team: PrometheusNotConnectedToAM prometheus1006:9904 Prometheus is failing to connect to AlertManager - https://phabricator.wikimedia.org/T349490 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Looks like this was transient, likely happened while puppet was updating prometheus configuratio... [10:27:54] 10Cloud-VPS, 10cloud-services-team, 10Observability-Metrics, 10Patch-For-Review, 10User-fgiunchedi: Move labs/wmcs (OpenStack) Prometheus instance off cloudmetrics hosts to prometheus* hosts - https://phabricator.wikimedia.org/T336854 (10fgiunchedi) [10:28:44] 10Cloud-VPS, 10cloud-services-team, 10Observability-Metrics, 10Patch-For-Review, 10User-fgiunchedi: Move labs/wmcs (OpenStack) Prometheus instance off cloudmetrics hosts to prometheus* hosts - https://phabricator.wikimedia.org/T336854 (10fgiunchedi) [10:53:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-puppetdb-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [10:58:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-puppetdb-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:14:29] (OpenstackAPIResponse) firing: (10) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [11:19:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [11:31:56] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [labs/tools/massmailer] - 10https://gerrit.wikimedia.org/r/967892 (owner: 10L10n-bot) [11:31:58] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [labs/tools/commons-mass-description] - 10https://gerrit.wikimedia.org/r/967890 (owner: 10L10n-bot) [11:32:00] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [labs/tools/watch-translations] - 10https://gerrit.wikimedia.org/r/967895 (owner: 10L10n-bot) [11:58:25] 10Toolforge (Toolforge iteration 01): Weird error HTTP 405 Method Not Allowed on Toolforge - https://phabricator.wikimedia.org/T349452 (10taavi) [11:58:48] 10Toolforge (Toolforge iteration 01), 10cloud-services-team: Weird error HTTP 405 Method Not Allowed on Toolforge - https://phabricator.wikimedia.org/T349452 (10taavi) [12:14:46] (03CR) 10D3r1ck01: [C: 03+2] Add missing library repos from https://doc.wikimedia.org/#libraries [labs/codesearch] - 10https://gerrit.wikimedia.org/r/965809 (owner: 10Gergő Tisza) [12:15:23] (03CR) 10D3r1ck01: [C: 03+2] Add more missing library repos from Iae28fa6b31 [labs/codesearch] - 10https://gerrit.wikimedia.org/r/965841 (owner: 10Gergő Tisza) [12:15:44] (03Merged) 10jenkins-bot: Add missing library repos from https://doc.wikimedia.org/#libraries [labs/codesearch] - 10https://gerrit.wikimedia.org/r/965809 (owner: 10Gergő Tisza) [12:16:03] (InstanceDown) firing: Project tools instance 
tools-k8s-worker-83 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:16:18] (03Merged) 10jenkins-bot: Add more missing library repos from Iae28fa6b31 [labs/codesearch] - 10https://gerrit.wikimedia.org/r/965841 (owner: 10Gergő Tisza) [12:17:11] 10Cloud-VPS, 10cloud-services-team, 10Observability-Metrics, 10Patch-For-Review, 10User-fgiunchedi: Move labs/wmcs (OpenStack) Prometheus instance off cloudmetrics hosts to prometheus* hosts - https://phabricator.wikimedia.org/T336854 (10fgiunchedi) [12:21:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-83 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:26:37] (CephSlowOps) firing: Ceph cluster in eqiad has 2 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [12:26:42] 10cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (10phaultfinder) [12:29:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [12:31:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-puppetmaster-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:31:56] (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState [12:36:03] 
(PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-puppetmaster-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:36:33] (PuppetAgentStaleLastRun) firing: (2) Last Puppet run was over 24 hours ago on instance metricsinfra-haproxy-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:38:03] (InstanceDown) firing: Project tools instance tools-sgewebgen-10-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:38:03] (InstanceDown) firing: Project cloudinfra instance cloudinfra-db03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:39:03] (InstanceDown) firing: Project metricsinfra instance metricsinfra-puppetmaster-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:41:33] (PuppetAgentStaleLastRun) resolved: (2) Last Puppet run was over 24 hours ago on instance metricsinfra-haproxy-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:41:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 118 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [12:41:56] (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState [12:43:03] (InstanceDown) resolved: Project cloudinfra instance cloudinfra-db03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:43:03] (InstanceDown) resolved: Project tools instance tools-sgewebgen-10-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:44:03] 
(InstanceDown) resolved: Project metricsinfra instance metricsinfra-puppetmaster-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:44:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [13:21:22] 10Toolforge (Toolforge iteration 01): Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (10Slst2020) project/getProjectSummary seems to be the best endpoint for quota info [13:22:58] 10Toolforge (Toolforge iteration 01): Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (10taavi) Hopefully `tool` not `project` since this is Toolforge and not Cloud VPS? [13:33:02] 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, 10netops, 10User-dcaro: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10dcaro) [13:33:13] 10Toolforge (Toolforge iteration 01): Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (10Slst2020) >>! In T341068#9272587, @taavi wrote: > Hopefully `tool` not `project` since this is Toolforge and not Cloud VPS? The above was just a sloppy comment to myself about which Harbor AP... [13:33:15] 10cloud-services-team: cloudgw improvements - https://phabricator.wikimedia.org/T347469 (10dcaro) [13:33:54] 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, 10netops, 10User-dcaro: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10dcaro) 05In progress→03Resolved This went as expected, and all the changes have been applied :) Thanks a lot @cmooney ! 
[13:46:40] 10Toolforge (Toolforge iteration 01): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (10fnegri) I don't have a strong preference, but I vote for Option 1 as I would prefer to tackle the different objectives separately: 1. consolidating the CLI 2. migrating to Go (can... [13:58:29] 10Toolforge (Toolforge iteration 01): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (10dcaro) I vote for option 3, as it's the one that will require less effort duplication, given that the api definition is something that we want to do anyhow. It achieves the cli conso... [14:08:24] 10Toolforge (Toolforge iteration 01): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (10taavi) What would "autogenerated CLI" mean? I remain sceptical that you can reliably do something more than generate an OOP wrapper around the API methods, which would do nothing a... [14:13:14] 10Toolforge (Toolforge iteration 01): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (10nskaggs) For option 3, once the openapi specification is complete, I presume you could also generate a python client? It seems the implied goal is to end up with go client binary, b... [14:19:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [14:24:41] 10Toolforge (Toolforge iteration 01), 10cloud-services-team, 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Maintenance, 10User-dcaro: [builds-builder] apt buildpack does not fail when it fails to fetch packages - https://phabricator.wikimedia.org/T348746 (10CodeReviewBot) dcaro merged https://g... 
[14:26:42] 10Toolforge (Toolforge iteration 01): [gitlab,toolforge-deploy] Create a process to open an MR to toolforge-deploy when a new release of a component happens - https://phabricator.wikimedia.org/T347392 (10CodeReviewBot) dcaro merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_reques... [14:33:40] 10cloud-services-team (FY2023/2024-Q1), 10SRE, 10ops-eqiad, 10Goal: cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10Jclark-ctr) 05Open→03Resolved [14:33:43] 10cloud-services-team (FY2023/2024-Q1), 10Epic, 10Goal: openstack eqiad1: introduce cloud-private and cloudlb - https://phabricator.wikimedia.org/T341060 (10Jclark-ctr) [14:40:53] 10Cloud-VPS, 10cloud-services-team, 10DC-Ops, 10SRE, 10ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) @dcaro We need to start updating firmwares on servers they will need to be restarted to finalize installation. would y... 
[14:41:33] (03PS3) 10FNegri: openstack: don't pass the new project when creating it [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: 10David Caro) [14:47:26] (03PS4) 10FNegri: openstack: don't pass the new project when creating it [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: 10David Caro) [14:50:22] (03CR) 10FNegri: openstack: don't pass the new project when creating it (031 comment) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: 10David Caro) [14:50:28] (03CR) 10CI reject: [V: 04-1] openstack: don't pass the new project when creating it [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: 10David Caro) [14:53:44] 10Toolforge (Toolforge iteration 01): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (10dcaro) > What would "autogenerated CLI" mean? I remain sceptical that you can reliably do something more than generate an OOP wrapper around the API methods, which would do nothing... 
[14:54:11] (03PS5) 10FNegri: openstack: don't pass the new project when creating it [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: 10David Caro) [14:54:14] !log admin fran@wmf3169 START - Cookbook wmcs.vps.create_project for project catalyst in eqiad1 [14:54:17] !log admin fran@wmf3169 END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project catalyst in eqiad1 [14:57:17] (03CR) 10CI reject: [V: 04-1] openstack: don't pass the new project when creating it [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: 10David Caro) [15:06:19] !log admin fran@wmf3169 START - Cookbook wmcs.vps.create_project for project catalyst in eqiad1 [15:06:23] !log admin fran@wmf3169 END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project catalyst in eqiad1 [15:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [15:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [15:10:02] 10Cloud-VPS, 10cloud-services-team, 10SRE-OnFire, 10Patch-For-Review, 10Sustainability (Incident Followup): Add external meta-monitoring for metricsinfra - https://phabricator.wikimedia.org/T288053 (10BCornwall) [15:10:10] 10Cloud-VPS, 10cloud-services-team, 10SRE-OnFire, 10Sustainability (Incident Followup), 10User-dcaro: monitoring: find out how we could have been paged for outage "Multiple CloudVPS instances lost their IPs" - https://phabricator.wikimedia.org/T347694 (10BCornwall) [15:10:51] 10cloud-services-team, 10SRE-OnFire, 10Sustainability (Incident Followup), 10User-dcaro: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681 (10dcaro) a:03dcaro [15:10:58] 10cloud-services-team, 10SRE-OnFire, 10Sustainability (Incident Followup), 10User-dcaro: Cloud VPS: NFS servers: the current 
setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681 (10dcaro) [15:11:18] 10cloud-services-team, 10SRE-OnFire, 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Unplanned, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681 (10dcaro) [15:11:30] 10cloud-services-team, 10SRE-OnFire, 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Unplanned, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681 (10dcaro) [15:18:39] (OpenstackAPIResponse) firing: (10) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [15:22:54] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [15:24:54] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99) [15:27:43] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [15:27:53] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [15:33:10] !log toolsbeta dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-builder [15:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [15:33:44] !log toolsbeta dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-builder [15:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [15:34:48] 10Cloud-VPS (Project-requests): Request creation of catalyst VPS project - https://phabricator.wikimedia.org/T349378 (10Slst2020) 
a:03Slst2020 [15:39:32] !log tools dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-builder [15:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:39:43] 10Cloud-VPS (Project-requests): Request creation of catalyst VPS project - https://phabricator.wikimedia.org/T349378 (10Slst2020) 05Open→03Resolved Done :) [15:40:07] !log tools dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-builder [15:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:43:01] 10Toolforge (Toolforge iteration 01), 10cloud-services-team, 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Maintenance, 10User-dcaro: [builds-builder] apt buildpack does not fail when it fails to fetch packages - https://phabricator.wikimedia.org/T348746 (10CodeReviewBot) dcaro merged https://g... [15:43:11] 10Toolforge (Toolforge iteration 01), 10Patch-For-Review: Upgrade harbor from 2.5 to 2.9 - https://phabricator.wikimedia.org/T346241 (10CodeReviewBot) dcaro merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/120 builds-builder: bump to 0.0.80-20231023142438-55d11e16 [15:43:46] 10Toolforge (Toolforge iteration 01), 10cloud-services-team, 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Maintenance, 10User-dcaro: [builds-builder] apt buildpack does not fail when it fails to fetch packages - https://phabricator.wikimedia.org/T348746 (10dcaro) 05In progress→03Resolved [15:58:39] (OpenstackAPIResponse) firing: (10) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [16:03:39] (OpenstackAPIResponse) firing: (10) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [16:13:39] (OpenstackAPIResponse) firing: (10) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [16:14:29] (OpenstackAPIResponse) firing: (10) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [16:18:39] (OpenstackAPIResponse) firing: (10) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [16:38:11] (03PS6) 10FNegri: openstack: don't pass the new project when creating it [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: 10David Caro) [16:46:28] 10Toolforge (Toolforge iteration 01), 10Patch-For-Review: [envvars-api] Add statistics - https://phabricator.wikimedia.org/T346228 (10CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/16 [envvars-api]: Add prometheus [16:46:33] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [16:48:28] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99) [17:17:28] 10PAWS: update z2jh chart to 3.1.0 - https://phabricator.wikimedia.org/T349545 (10rook) [17:18:48] 10PAWS: Is PAWS culler workng? 
- https://phabricator.wikimedia.org/T345838 (10rook) Looks like this is noted in the changelog for the z2jh chart. T349545 may resolve this. [17:19:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [17:22:20] 10PAWS: update z2jh chart to 3.1.0 - https://phabricator.wikimedia.org/T349545 (10github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/342 [17:22:25] vivian-rook opened https://github.com/toolforge/paws/pull/342 [17:27:58] 10Toolforge: Tool (k8s-status or a new one) to display details about buildservice pipelines and Harbor images - https://phabricator.wikimedia.org/T336133 (10Raymond_Ndibe) [17:28:04] 10Toolforge: Tool (k8s-status or a new one) to display details about buildservice pipelines and Harbor images - https://phabricator.wikimedia.org/T336133 (10Raymond_Ndibe) [17:28:24] 10Toolforge (Toolforge iteration 01), 10Patch-For-Review: [envvars-api] Add statistics - https://phabricator.wikimedia.org/T346228 (10Raymond_Ndibe) 05Open→03Stalled [17:29:17] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, and 3 others: [builds-api.start] Add statistics - https://phabricator.wikimedia.org/T337390 (10Raymond_Ndibe) 05In progress→03Stalled [18:35:02] 10PAWS: update z2jh chart to 3.1.0 - https://phabricator.wikimedia.org/T349545 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/342 [18:35:07] vivian-rook closed https://github.com/toolforge/paws/pull/342 [18:35:20] 10PAWS: update z2jh chart to 3.1.0 - https://phabricator.wikimedia.org/T349545 (10rook) 05Open→03Resolved [18:36:23] 10PAWS: Remove old cluster - https://phabricator.wikimedia.org/T349551 (10rook) [18:36:34] 10PAWS: update z2jh chart to 
3.1.0 - https://phabricator.wikimedia.org/T349545 (rook) [18:36:40] PAWS: Remove old cluster - https://phabricator.wikimedia.org/T349551 (rook) [18:40:22] Toolforge (Toolforge iteration 01): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (Raymond_Ndibe) In my opinion I think we should go with **Option 1** in the short term and **Option 3** in the long term. **Option 2** is totally out of the question in my opinion be... [19:01:17] RECOVERY - Check unit status of backup_cinder_volumes on cloudbackup2001 is OK: OK: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:36:25] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (VRiley-WMF) Hey @taavi and @cmooney Just wanted to see if there was a timeframe for us to move these servers. Any specific time when we know the servers... [19:37:23] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (taavi) >>! In T346948#9274072, @VRiley-WMF wrote: > Just wanted to see if there was a timeframe on this move. Like, a specific time when we know the server... [20:18:39] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [20:19:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [21:04:29] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:08:39] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:14:29] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [22:12:42] Tool-bub2: QueueTable Footer inconsistent style - https://phabricator.wikimedia.org/T349567 (Okerekechinweotito) [22:17:07] Tool-bub2: QueueTable Footer Pagination inconsistent style - https://phabricator.wikimedia.org/T349567 (Okerekechinweotito) [22:21:28] Tool-bub2: QueueTable Footer Pagination inconsistent style - https://phabricator.wikimedia.org/T349567 (Okerekechinweotito) [22:23:44] Tool-bub2: QueueTable Footer Pagination inconsistent style - https://phabricator.wikimedia.org/T349567 (Okerekechinweotito) [22:34:12] Tool-bub2: QueueTable Footer Pagination inconsistent style - https://phabricator.wikimedia.org/T349567 (Okerekechinweotito) I have made a PR that fixes this issue PR here - https://github.com/coderwassananmol/BUB2/pull/229 [23:24:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [23:37:56] Toolforge (Toolforge iteration 01), cloud-services-team: Weird error HTTP 405 Method Not Allowed on Toolforge - 
https://phabricator.wikimedia.org/T349452 (Albertoleoncio) Resolved→Open >>! In T349452#9271561, @taavi wrote: > Fixed with https://gitlab.wikimedia.org/toolforge-repos/fourohfour/-/... [23:43:39] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [23:44:29] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [23:48:39] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse