[00:08:37] (CodesearchBackendDown) firing: (2) Codesearch backend design is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchBackendDown
[00:08:37] (CodesearchConfigWriteFailed) firing: codesearch-write-config.service failed on codesearch8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchConfigWriteFailed
[00:11:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:11:56] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-14.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[00:16:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:16:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:30:46] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (Papaul) @cmooney hey i am working on 2 nodes cloudvirt1063 and 64 same rack E4 getting the message below. can you please see why those nodes can not the...
[01:18:03] (PuppetAgentFailure) firing: (3) Puppet agent failure detected on instance tools-sgeweblight-10-22 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[01:43:03] (PuppetAgentFailure) firing: (2) Puppet agent failure detected on instance tools-sgeweblight-10-28 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[01:48:03] (InstanceDown) firing: Project tools instance tools-sgeweblight-10-28 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[02:01:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-sgeweblight-10-22 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[02:08:03] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance tools-sgeweblight-10-30 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[02:33:37] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:03:33] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:08:37] (CodesearchConfigWriteFailed) firing: codesearch-write-config.service failed on codesearch8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchConfigWriteFailed
[03:11:56] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-14.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[03:13:37] (CodesearchBackendDown) firing: (2) Codesearch backend design is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchBackendDown
[03:16:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[03:19:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-30 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[04:48:03] (InstanceDown) firing: Project tools instance tools-sgeweblight-10-28 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:01:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-sgeweblight-10-22 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[05:33:37] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[06:11:56] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-14.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[06:13:37] (CodesearchBackendDown) firing: (2) Codesearch backend design is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchBackendDown
[06:13:37] (CodesearchConfigWriteFailed) firing: codesearch-write-config.service failed on codesearch8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchConfigWriteFailed
[06:16:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[06:19:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-30 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[07:03:33] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:14:43] Toolforge (Toolforge iteration 01), cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-User, Cloud-Services-Worktype-Maintenance, User-dcaro: [webservice] Error shown when restarting buildpack-based tool - https://phabricator.wikimedia.org/T348312 (kostajh) Seen also with `earlywarni...
[07:34:20] Tool-wikiloves, Wiki Loves Monuments FY 2022-2023: WLM 2022: Armenia missing from wikiloves - https://phabricator.wikimedia.org/T318944 (JeanFred)
[07:35:17] Tool-wikiloves: Armenia missing in Statistik list, WLM 2022 - https://phabricator.wikimedia.org/T319144 (JeanFred)
[07:35:22] Tool-wikiloves, Wiki Loves Monuments FY 2022-2023: WLM 2022: Armenia missing from wikiloves - https://phabricator.wikimedia.org/T318944 (JeanFred)
[07:36:47] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance tools-sgeweblight-10-30 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[07:47:17] (PS1) Stevemunene: Add dummy keytabs for new druid101[0-1] [labs/private] - https://gerrit.wikimedia.org/r/965460 (https://phabricator.wikimedia.org/T336042)
[07:51:47] (InstanceDown) firing: Project tools instance tools-sgeweblight-10-28 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:55:02] cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, User-dcaro: [ceph] export number of bad sectors per-disk - https://phabricator.wikimedia.org/T348716 (dcaro) p:Triage→High
[08:01:47] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-sgeweblight-10-22 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[08:08:01] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Infrastructure-Foundations, SRE Observability (FY2023/2024-Q2): [wmcs-cookbooks] Downtime alerts from cloudcumins - https://phabricator.wikimedia.org/T347490 (fgiunchedi) The easiest at the moment is to add cloudcumin hosts to `profile::alertmanage...
[08:13:54] Toolforge (Quota-requests), User-dcaro: Request increased quota for deltabot Toolforge tool - https://phabricator.wikimedia.org/T347951 (dcaro) Open→In progress
[08:14:39] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ayounsi) @Papaul can you ping us when you're around so we can look into it? Did you check the vlan config on the switch? Is it not able to reach anything e...
[08:19:29] Toolforge (Quota-requests), User-dcaro: Request increased quota for deltabot Toolforge tool - https://phabricator.wikimedia.org/T347951 (dcaro) Done, sorry for the delay, let me know if you have any issues with it :)
[08:19:33] Toolforge (Quota-requests), User-dcaro: Request increased quota for deltabot Toolforge tool - https://phabricator.wikimedia.org/T347951 (dcaro) In progress→Resolved
[08:20:53] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, Patch-For-Review, User-dcaro: [toolsdb] ToolsDB replication is broken on tools-db-2 (errno 1032) - 2023-08-17 - https://phabricator.wikimedia.org/T344411 (dcaro) # I think this is already so...
[08:20:59] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, Patch-For-Review, User-dcaro: [toolsdb] ToolsDB replication is broken on tools-db-2 (errno 1032) - 2023-08-17 - https://phabricator.wikimedia.org/T344411 (dcaro) In progress→Resolved
[08:21:47] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance tools-sgeweblight-10-22 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[08:36:47] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:00:34] Cloud Services Proposals, Toolforge (Toolforge iteration 01), cloud-services-team, Cloud-Services-Origin-Team, and 2 others: Decision request – Toolforge (re)architecture - https://phabricator.wikimedia.org/T346153 (dcaro) >>! In T346153#9243224, @Slst2020 wrote: > I lean towards Option 1 as my t...
[09:06:47] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance tools-sgeweblight-10-22 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[09:16:47] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[09:16:47] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-14.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[09:31:19] Cloud Services Proposals, Toolforge (Toolforge iteration 01), cloud-services-team, Cloud-Services-Origin-Team, and 2 others: Decision request – Toolforge (re)architecture - https://phabricator.wikimedia.org/T346153 (Slst2020) >>! In T346153#9245363, @dcaro wrote: >>>! In T346153#9243224, @Slst202...
[09:48:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-30 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[09:48:37] Cloud-VPS: [wmcs-cookbooks] tox is failing - https://phabricator.wikimedia.org/T348726 (fnegri)
[09:49:21] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [wmcs-cookbooks] tox is failing - https://phabricator.wikimedia.org/T348726 (fnegri) p:Triage→Medium
[09:55:24] cloud-services-team, User-dcaro, User-fgiunchedi: [wmcs][alerting] Allow volunteer admins silencing alerts from cloudvps/toolforge/paws/quarry - https://phabricator.wikimedia.org/T320973 (dcaro) Open→In progress
[09:55:29] cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, Epic, and 2 others: Streamline WMCS Alerting and Paging - https://phabricator.wikimedia.org/T313444 (dcaro)
[10:28:57] (TfInfraTestApplyFailed) firing: (2) Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:38:57] (TfInfraTestApplyFailed) resolved: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:43:57] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:53:57] (InstanceDown) firing: Project tools instance tools-sgeweblight-10-28 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:03:33] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:19:13] (DiskSpace) firing: Disk space cloudbackup1004:9100:/ 5.659% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:30:47] (CR) CI reject: [V: -1] Localisation updates from https://translatewiki.net. [labs/tools/watch-translations] - https://gerrit.wikimedia.org/r/965493 (owner: L10n-bot)
[11:35:56] Tool-bub2, Outreach-Programs-Projects, Outreachy (Round 27): Change Books.js component to React hooks component - https://phabricator.wikimedia.org/T348414 (Pamisijohn) I made a PR to fix this: https://github.com/coderwassananmol/BUB2/pull/222
[11:44:11] Cloud-VPS, cloud-services-team, Patch-For-Review, Sustainability (Incident Followup): High availability for the main cloud vps web proxy - https://phabricator.wikimedia.org/T316982 (taavi) `lang=shell-session taavi@project-proxy-puppetmaster-01:~$ curl --connect-to ::172.16.5.140 https://deb-tool...
[11:44:13] (DiskSpace) resolved: Disk space cloudbackup1004:9100:/ 5.886% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:53:57] (PuppetAgentNoResources) resolved: No Puppet resources found on instance tools-sgeweblight-10-22 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:53:57] (InstanceDown) resolved: Project tools instance tools-sgeweblight-10-28 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:54:33] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-sgeweblight-10-22 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:59:33] (PuppetAgentNoResources) resolved: No Puppet resources found on instance tools-sgeweblight-10-22 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:59:39] Cloud-VPS, cloud-services-team, Sustainability (Incident Followup), User-dcaro: monitoring: find out how we could have been paged for outage "Multiple CloudVPS instances lost their IPs" - https://phabricator.wikimedia.org/T347694 (dcaro) I have enabled paging for MainProxyDown alert on metricsinf...
[11:59:47] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors
[11:59:49] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=0)
[12:00:06] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors
[12:00:08] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=0)
[12:03:57] (ToolsGridQueueProblem) resolved: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-14.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[12:03:57] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance tools-sgeweblight-10-30 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[12:18:57] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[12:36:56] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (fnegri) I finally managed to ssh into testdb03, by adding the "ssh-from-anywhere" security group to the Nova instance. This showed...
[12:56:52] Toolforge (Toolforge iteration 01), Toolforge Jobs framework, Patch-For-Review: jobs: Add option to disable NFS mounts - https://phabricator.wikimedia.org/T348250 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/2 Add --mount option
[13:04:06] Cloud Services Proposals, Toolforge (Toolforge iteration 01), cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: Decision request – Toolforge (re)architecture - https://phabricator.wikimedia.org/T346153 (Slst2020)
[13:04:29] Cloud Services Proposals, Toolforge (Toolforge iteration 01), cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: Decision request – Toolforge (re)architecture - https://phabricator.wikimedia.org/T346153 (Slst2020) In progress→Resolved
[13:04:36] Cloud Services Proposals, Toolforge (Toolforge iteration 01), cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: Toolforge beyond build service - https://phabricator.wikimedia.org/T342077 (Slst2020)
[13:09:30] (CR) Abijeet Patro: [V: +2] Localisation updates from https://translatewiki.net. [labs/tools/watch-translations] - https://gerrit.wikimedia.org/r/965493 (owner: L10n-bot)
[13:10:21] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.apt.copy_to_main_repo for package 'toolforge-jobs-framework-cli' version '15'
[13:10:36] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.apt.copy_to_main_repo (exit_code=0) for package 'toolforge-jobs-framework-cli' version '15'
[13:15:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-sgeweblight-10-21 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:15:18] Toolforge (Toolforge iteration 01), Toolforge Jobs framework: jobs: Add option to disable NFS mounts - https://phabricator.wikimedia.org/T348250 (taavi) Open→Resolved
[13:17:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[13:19:47] cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, User-dcaro: [builds-builder] apt buildpack does not fail when it fails to fetch packages - https://phabricator.wikimedia.org/T348746 (dcaro) p:Triage→High
[13:20:02] Toolforge (Toolforge iteration 01), cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, User-dcaro: [builds-builder] apt buildpack does not fail when it fails to fetch packages - https://phabricator.wikimedia.org/T348746 (dcaro)
[13:28:56] Quarry: Remove gerrit git from quarry puppet - https://phabricator.wikimedia.org/T348748 (rook)
[13:29:56] Toolforge (Toolforge iteration 01): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (Slst2020)
[13:30:02] Toolforge (Toolforge iteration 01), cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, User-dcaro: [builds-builder] apt buildpack does not fail when it fails to fetch packages - https://phabricator.wikimedia.org/T348746 (CodeReviewBot) dcaro updated https://...
[13:31:17] Toolforge (Toolforge iteration 01), cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, User-dcaro: [builds-builder] apt buildpack does not fail when it fails to fetch packages - https://phabricator.wikimedia.org/T348746 (CodeReviewBot) dcaro updated https://...
[13:33:29] Toolforge (Toolforge iteration 01): [tbs][builder] Refactor task yaml template - https://phabricator.wikimedia.org/T348750 (Slst2020)
[13:35:48] Toolforge (Toolforge iteration 01), cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, User-dcaro: [builds-builder] apt buildpack does not fail when it fails to fetch packages - https://phabricator.wikimedia.org/T348746 (dcaro) Open→In progress
[13:36:05] Toolforge (Toolforge iteration 01), Toolforge Build Service, cloud-services-team: toolsbeta harbor instance ran out of disk - https://phabricator.wikimedia.org/T348337 (dcaro) a:dcaro
[13:36:33] Toolforge (Toolforge iteration 01): [toolforge] add changelog page to send small updates for projects - https://phabricator.wikimedia.org/T348537 (dcaro) Open→Resolved
[13:36:50] Toolforge (Toolforge iteration 01), Toolforge Build Service, cloud-services-team: toolsbeta harbor instance ran out of disk - https://phabricator.wikimedia.org/T348337 (dcaro) Open→In progress
[13:38:15] Toolforge (Toolforge iteration 01), cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 3 others: [tbs][builder] Inject nodejs buildpack - https://phabricator.wikimedia.org/T346635 (CodeReviewBot) sstefanova merged https://gitlab.wikimedia.org/re...
[13:38:37] Cloud Services Proposals, Toolforge Build Service, cloud-services-team, Cloud-Services-Origin-Team, and 3 others: [Epic] Make Toolforge a proper platform as a service with push-to-deploy and build packs - https://phabricator.wikimedia.org/T194332 (Slst2020)
[13:38:56] Toolforge (Toolforge iteration 01), cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 3 others: [tbs] User story - I can use multiple language stacks for my application - https://phabricator.wikimedia.org/T325799 (Slst2020)
[13:39:01] Cloud Services Proposals, Toolforge (Toolforge iteration 01), cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: Toolforge beyond build service - https://phabricator.wikimedia.org/T342077 (Slst2020) Stalled→Resolved
[13:39:07] Toolforge (Toolforge iteration 01), cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 3 others: [tbs][builder] Inject nodejs buildpack - https://phabricator.wikimedia.org/T346635 (Slst2020) In progress→Resolved
[13:39:26] Toolforge (Toolforge iteration 01): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (Slst2020) Open→In progress
[13:42:21] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned, and 5 others: Cloud Ceph outage 2023-02-13 - https://phabricator.wikimedia.org/T329535 (dcaro) I'll close this for now, it might be related to {T348643} as that could explain the cluste...
[13:42:47] Toolforge (Toolforge iteration 01): Upgrade harbor - https://phabricator.wikimedia.org/T346241 (Slst2020)
[13:43:57] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[13:48:31] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (Papaul) @ayounsi yes i checked the vlan config on the switch and confirmed that the interface is in the right vlan. The reason you can not ssh into the...
[13:50:59] Toolforge (Toolforge iteration 01): Upgrade harbor - https://phabricator.wikimedia.org/T346241 (Slst2020)
[13:56:06] Toolforge, Toolforge-standards-committee, cloud-services-team, Security-Team, Security: Standard process for dealing with public OAuth consumer secrets - https://phabricator.wikimedia.org/T348752 (taavi)
[14:01:25] Toolforge (Toolforge iteration 01), Toolforge Build Service, cloud-services-team: toolsbeta harbor instance ran out of disk - https://phabricator.wikimedia.org/T348337 (dcaro) Actually, the VM had a 40G volume assigned that was not being used, I mounted it and moved the data there, that should give u...
[14:02:27] Toolforge (Toolforge iteration 01), Toolforge Jobs framework: jobs: Add option to disable NFS mounts - https://phabricator.wikimedia.org/T348250 (taavi)
[14:02:33] Toolforge, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 3 others: [tbs] Define what NFS access to enable and how users will interact with it - https://phabricator.wikimedia.org/T334081 (taavi)
[14:03:36] Toolforge Jobs framework: toolforge-jobs – wikihistory needs a container with both php7 and mono - https://phabricator.wikimedia.org/T305780 (taavi) The toolforge build service should make this possible: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service
[14:07:24] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye
[14:19:13] Toolforge Jobs framework: Run webservices via the jobs framework - https://phabricator.wikimedia.org/T348755 (taavi)
[14:20:25] Toolforge Jobs framework: Run webservices via the jobs framework - https://phabricator.wikimedia.org/T348755 (taavi)
[14:20:28] Toolforge Jobs framework: Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 (taavi)
[14:26:52] Toolforge Jobs framework: Support tool-internal networking - https://phabricator.wikimedia.org/T348758 (taavi)
[14:28:05] Toolforge Jobs framework: Support tool-internal networking - https://phabricator.wikimedia.org/T348758 (taavi)
[14:28:30] Toolforge Jobs framework: Run webservices via the jobs framework - https://phabricator.wikimedia.org/T348755 (taavi)
[14:28:34] Toolforge Jobs framework: Support tool-internal networking - https://phabricator.wikimedia.org/T348758 (taavi)
[14:33:07] Toolforge (Toolforge iteration 01), Toolforge Build Service, cloud-services-team: toolsbeta harbor instance ran out of disk - https://phabricator.wikimedia.org/T348337 (dcaro) In progress→Resolved
[14:35:37] (CephClusterInWarning) firing: The ceph cluster in is in warning status, that means that its high availability is compromised, things should still be working as expected. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWar
[14:36:03] (InstanceDown) firing: (2) Project toolsbeta instance toolsbeta-harbor-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:38:03] (InstanceDown) firing: Project tools instance tools-sgeweblight-10-30 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:40:03] (InstanceDown) firing: Project project-proxy instance project-proxy-acme-chief-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:41:03] (InstanceDown) firing: (3) Project toolsbeta instance toolsbeta-harbor-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:43:32] Cloud-VPS, cloud-services-team: ceph slow ops 2023-10-11 - https://phabricator.wikimedia.org/T348634 (dcaro)
[14:44:35] Cloud-VPS, cloud-services-team: ceph slow ops 2023-10-11 - https://phabricator.wikimedia.org/T348634 (dcaro) It's just happening again, two osds affected: ` root@cloudcephmon1001:~# ceph osd find 111 { "osd": 111, "addrs": { "addrvec": [ { "type": "v2",...
[14:45:03] (InstanceDown) resolved: Project project-proxy instance project-proxy-acme-chief-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:45:37] (CephClusterInWarning) resolved: The ceph cluster in is in warning status, that means that it's high availability is compromised, things should still be working as expected. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInW [14:45:52] (CephClusterInWarning) firing: The ceph cluster in is in warning status, that means that it's high availability is compromised, things should still be working as expected. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWar [14:46:03] (InstanceDown) resolved: (3) Project toolsbeta instance toolsbeta-harbor-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:46:07] (CephClusterInWarning) resolved: The ceph cluster in is in warning status, that means that it's high availability is compromised, things should still be working as expected. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInW [14:48:03] (InstanceDown) resolved: Project tools instance tools-sgeweblight-10-30 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:56:27] 10cloud-services-team (Hardware), 10DC-Ops, 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1007.... [14:56:42] 10cloud-services-team (Hardware), 10DC-Ops, 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1007.wiki... [14:57:14] 10cloud-services-team (Hardware), 10DC-Ops, 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1007.... [14:59:13] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (10fnegri) Some of the options in the configuration group are not accepted by MariaDB when found in the my.cnf configuration file. I c... [15:03:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-70 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:03:33] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [15:08:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-70 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:09:03] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye executed with erro... [15:18:57] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [15:26:28] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (fnegri) I think those `latin1` values can be overridden on the client side, e.g. if I connect to the server using the `mariadb` cli... [15:31:23] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1007.wiki... [15:31:47] Toolforge, Toolforge-standards-committee, cloud-services-team, Security-Team, Security: Standard process for dealing with public OAuth consumer secrets - https://phabricator.wikimedia.org/T348752 (sbassett) +1 to this as a general process to follow. The first two procedures are what have org... 
[15:33:15] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (Jclark-ctr) @bking @Papaul I was able to change netbox to Public Vlan redoing most of the steps for setting up... [15:57:06] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye [16:14:20] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (cmooney) > Networking Setup: 2 connections, 10G. public1-*-eqiad This is incorrect. All these hosts should have a single con... [16:14:56] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009.... [16:15:02] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010.... 
[16:15:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-sgeweblight-10-21 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:15:10] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.... [16:15:57] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (cmooney) [16:16:41] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (Papaul) [16:17:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [16:19:21] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [16:29:04] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye completed: - cloud... 
[16:36:00] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (Papaul) [16:36:34] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:41:07] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye [16:43:57] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [16:51:33] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:55:29] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye completed: - cloud... 
[16:55:55] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (Papaul) [17:13:35] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye completed: - cloud... [17:15:07] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (Papaul) [17:16:41] !log admin fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285) [17:16:43] !log admin fran@wmf3169 END (ERROR) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=97) (T341285) [17:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:16:46] T341285: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 [17:16:49] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (Papaul) Open→Resolved a:Papaul @Jclark-ctr @Andrew this now complete. I update the switch ports as recommended @ https://wikitech.wikimedia.org... 
[17:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:17:39] Quarry: Quarry suggests invalid database names, and doesn't suggest some valid database names - https://phabricator.wikimedia.org/T289943 (github-toolforge-bot) siddharthvp closed https://github.com/toolforge/quarry/pull/24 [17:17:39] siddharthvp closed https://github.com/toolforge/quarry/pull/24 [17:18:02] siddharthvp closed https://github.com/toolforge/quarry/pull/26 [17:18:11] Quarry, cloud-services-team: Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158 (github-toolforge-bot) siddharthvp closed https://github.com/toolforge/quarry/pull/26 [17:35:06] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wiki... [17:35:11] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wiki... [17:35:17] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki... [17:43:11] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010.... 
[17:43:18] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009.... [17:43:38] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.... [17:58:12] (PS1) LWatson: releases: Bump Codex to 1.0.0-rc.1 [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/965553 [17:59:56] (CR) VolkerE: [C: +2] releases: Bump Codex to 1.0.0-rc.1 [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/965553 (owner: LWatson) [18:00:31] (Merged) jenkins-bot: releases: Bump Codex to 1.0.0-rc.1 [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/965553 (owner: LWatson) [18:23:57] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [18:54:07] vivian-rook closed https://github.com/toolforge/quarry/pull/27 [18:54:11] Quarry: git-crypt for config.yaml files - https://phabricator.wikimedia.org/T348476 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/27 [18:54:36] Quarry: git-crypt for config.yaml files - https://phabricator.wikimedia.org/T348476 (rook) Open→Resolved a:rook [18:56:05] Quarry: update readme with notes on server setup - https://phabricator.wikimedia.org/T348798 (rook) [18:59:54] Quarry: update readme with notes on server setup - https://phabricator.wikimedia.org/T348798 
(rook) https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Quarry#Deployment felt the more appropriate place. [19:00:02] Quarry: update readme with notes on server setup - https://phabricator.wikimedia.org/T348798 (rook) Open→Resolved [19:00:48] Quarry: Quarry suggests invalid database names, and doesn't suggest some valid database names - https://phabricator.wikimedia.org/T289943 (rook) With PR-24 closed, should this task be closed as resolved? [19:01:38] Quarry, cloud-services-team: Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158 (rook) With PR-26 closed, should this task be closed as resolved? [19:03:20] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wiki... [19:03:30] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wiki... [19:03:33] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [19:03:38] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki... 
[19:15:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-sgeweblight-10-21 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [19:17:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [19:45:36] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009.... [19:45:49] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010.... [19:48:57] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [20:26:07] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wiki... 
[20:26:13] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wiki... [20:29:21] VPS-project-Codesearch, GitLab (Integrations): Figure out the future of codesearch in a GitLab world - https://phabricator.wikimedia.org/T268196 (kostajh) >>! In T268196#8922105, @hashar wrote: > If we don't want to rely on GitHub search, then I guess codesearch should index Gitlab repositories. What wo... [20:38:55] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.... [21:28:57] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [21:35:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance tools-sgeweblight-10-21 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [21:53:33] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:59:08] cloud-services-team (Hardware), DC-Ops, Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki... [22:17:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [22:44:23] Tool-Global-user-contributions, Design-Research, IP Masking, Stewards-and-global-tools, and 3 others: [Design research] Understand usage of current GUC tool - https://phabricator.wikimedia.org/T347618 (KColeman-WMF) [22:48:57] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed