[00:01:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:14:25] (NodeDownForLong) firing: The node cloudvirt1063 has been unreachable for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDownForLong
[01:16:40] (NeutronAgentDownForLong) firing: Neutron neutron-linuxbridge-agent on cloudvirt1063 has been down for more than 2h - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDownForLong
[03:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:17:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[03:22:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[05:18:45] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[05:23:45] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[06:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[07:03:00] Toolforge (Toolforge iteration 02), Patch-For-Review, Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417 (Slst2020) This has been solved now, haven't tested it yet though. https://github.com/goharbor/harbor/pull/19799
[07:22:51] Toolforge (Toolforge iteration 02): [harbor] update to 2.10.x - https://phabricator.wikimedia.org/T354507 (Slst2020)
[07:23:30] Toolforge (Toolforge iteration 02), Patch-For-Review, Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417 (Slst2020)
[07:23:46] Toolforge (Toolforge iteration 02): [harbor] update to 2.10.x - https://phabricator.wikimedia.org/T354507 (Slst2020) a:Slst2020→None
[07:23:50] Toolforge (Toolforge iteration 02): [harbor] update to 2.10.x - https://phabricator.wikimedia.org/T354507 (Slst2020) Open→Stalled
[07:24:59] Toolforge (Toolforge iteration 02): [harbor] update to 2.10.x - https://phabricator.wikimedia.org/T354507 (Slst2020) Marking as stalled for now as we're waiting on upstream. **Do not** upgrade to 2.10.0, as it doesn't solve our issues.
[07:25:27] Toolforge (Toolforge iteration 02): [harbor] upgrade to 2.10.x - https://phabricator.wikimedia.org/T354507 (Slst2020)
[08:33:48] Grid-Engine-to-K8s-Migration: Migrate mbh from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319883 (Ghuron) I'm a colleague of @MBH, we are developing tools together. I apologize for a long text, but it looks like - we know how things are working for us, but lack basic k...
[09:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:14:29] Grid-Engine-to-K8s-Migration: Migrate wahrani from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320133 (fnegri) Open→Resolved a:fnegri @wahrani that file is only to track the tools where the Grid Engine functionalities are disabled. We should probably have cal...
[09:26:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:50:15] Tool-bub2, Patch-For-Review: Make the queue refresh automatically - https://phabricator.wikimedia.org/T344119 (Aklapper) @wassan.anmol117 Could the PR please get a review? Thanks
[09:58:38] (CR) Aklapper: "Thanks for the review! Just FYI I will not get back to this in the next weeks; earliest would be late March." [labs/striker] - https://gerrit.wikimedia.org/r/971912 (https://phabricator.wikimedia.org/T320915) (owner: Aklapper)
[09:59:13] (CR) Aklapper: "Heh all cool, thanks! Just FYI after this very week I will not get back to this until late March earliest." [labs/striker] - https://gerrit.wikimedia.org/r/987145 (https://phabricator.wikimedia.org/T344610) (owner: Aklapper)
[10:09:26] Cloud-VPS (Quota-requests), cloud-services-team: Quota increase request for project 'monitoring' - https://phabricator.wikimedia.org/T354412 (fnegri) a:fnegri
[10:11:29] !log fnegri@cloudcumin1001 monitoring START - Cookbook wmcs.openstack.quota_increase (T354412)
[10:11:32] !log fnegri@cloudcumin1001 monitoring END (FAIL) - Cookbook wmcs.openstack.quota_increase (exit_code=99) (T354412)
[10:11:34] T354412: Quota increase request for project 'monitoring' - https://phabricator.wikimedia.org/T354412
[10:13:29] Cloud-VPS: [wmcs-cookbook] increase_quota cookbook fails - https://phabricator.wikimedia.org/T352840 (fnegri)
[10:13:41] Cloud-VPS: [wmcs-cookbooks] quota_show fails to parse openstack CLI output - https://phabricator.wikimedia.org/T353833 (fnegri)
[10:14:13] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2): [wmcs-cookbook] increase_quota cookbook fails - https://phabricator.wikimedia.org/T352840 (fnegri) p:Triage→Medium a:fnegri
[10:15:23] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2): [wmcs-cookbook] increase_quota cookbook fails - https://phabricator.wikimedia.org/T352840 (fnegri) As I stated in the duplicate task T353833, I suspect the JSON format returned by `openstack quota show -f json` has changed in OpenStack version Antelope....
[10:16:27] Cloud-VPS (Quota-requests), cloud-services-team: Quota increase request for project 'monitoring' - https://phabricator.wikimedia.org/T354412 (fnegri) The cookbook failed because of {T352840}, I will update the quotas manually.
[10:22:19] Cloud-VPS (Quota-requests), cloud-services-team: Quota increase request for project 'monitoring' - https://phabricator.wikimedia.org/T354412 (fnegri) Open→Resolved ` fnegri@cloudcontrol1005:~$ sudo wmcs-openstack quota set --ram 96256 --cores 48 --instance 42 monitoring `
[10:24:03] (InstanceDown) firing: Project tools instance tools-sgeexec-10-21 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:24:43] Toolforge (Toolforge iteration 02), Toolforge Jobs framework, Patch-For-Review: Allow using file logs with build service images - https://phabricator.wikimedia.org/T353537 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/50 command: Always use...
[10:26:14] Toolforge (Toolforge iteration 02), Toolforge Jobs framework, Patch-For-Review: Allow using file logs with build service images - https://phabricator.wikimedia.org/T353537 (CodeReviewBot) taavi updated https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/51 api: allow file lo...
[10:27:21] Toolforge (Toolforge iteration 02), Toolforge Jobs framework, Patch-For-Review: Allow using file logs with build service images - https://phabricator.wikimedia.org/T353537 (CodeReviewBot) project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolf...
[10:27:56] Toolforge (Toolforge iteration 02), Toolforge Jobs framework, Patch-For-Review: Allow using file logs with build service images - https://phabricator.wikimedia.org/T353537 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/51 api: allow file log...
[10:29:03] (InstanceDown) resolved: Project tools instance tools-sgeexec-10-21 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:30:54] Toolforge (Toolforge iteration 02), Toolforge Jobs framework, Patch-For-Review: Allow using file logs with build service images - https://phabricator.wikimedia.org/T353537 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/164 jobs-api:...
[10:30:57] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component jobs-api
[10:31:09] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component jobs-api
[10:47:28] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component jobs-api
[10:47:41] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component jobs-api
[10:51:32] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component jobs-api
[10:51:45] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component jobs-api
[11:00:36] VPS-project-Codesearch, Special:NewLexeme revival, wmde-wikidata-tech: Please add wmde/new-lexeme-special-page to codesearch index - https://phabricator.wikimedia.org/T351938 (Lucas_Werkmeister_WMDE) Thanks!
[12:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[12:20:31] Toolforge Build Service, cloud-services-team: tools-harbor-1.tools.eqiad1.wikimedia.cloud overloaded - https://phabricator.wikimedia.org/T354151 (taavi)
[12:20:47] Toolforge (Toolforge iteration 02), cloud-services-team (FY2023/2024-Q1-Q2), User-dcaro: [harbor] Redis using all available memory - https://phabricator.wikimedia.org/T354176 (taavi)
[12:21:08] Toolforge (Toolforge iteration 02), cloud-services-team (FY2023/2024-Q1-Q2), User-dcaro: [harbor] Redis using all available memory - https://phabricator.wikimedia.org/T354176 (taavi) Is there a reason we could not configure the maximum memory limit for Redis?
[12:26:17] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_grid_node for tools-sgeweblight-10-27, tools-sgeweblight-10-28
[12:27:54] (CR) CI reject: [V: -1] Localisation updates from https://translatewiki.net. [labs/tools/commons-mass-description] - https://gerrit.wikimedia.org/r/988477 (owner: L10n-bot)
[12:34:03] (InstanceDown) firing: Project tools instance tools-sgeweblight-10-27 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:34:36] Cloud-VPS, cloud-services-team: Check Cloud VPS running kernels for ext4 data corruption bug - https://phabricator.wikimedia.org/T353178 (taavi) Open→Resolved a:taavi
[12:39:03] (InstanceDown) resolved: Project tools instance tools-sgeweblight-10-27 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:22:41] Cloud-VPS, Toolforge, cloud-services-team: Ensure Toolforge and Cloud VPS comply with Google's new email sender guidelines - https://phabricator.wikimedia.org/T354112 (taavi)
[13:22:49] Toolforge, Patch-For-Review: Require mail sent via the Toolforge mail servers uses a Toolforge domain - https://phabricator.wikimedia.org/T341004 (taavi)
[13:24:07] Cloud-VPS, Toolforge, cloud-services-team: Ensure Toolforge and Cloud VPS comply with Google's new email sender guidelines - https://phabricator.wikimedia.org/T354112 (taavi) a:taavi
[13:26:22] Cloud-VPS, Toolforge, cloud-services-team: Ensure Toolforge and Cloud VPS comply with Google's new email sender guidelines - https://phabricator.wikimedia.org/T354112 (taavi)
[13:26:42] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:45:55] (PS1) Majavah: Add fake toolforge-rsa DKIM keys [labs/private] - https://gerrit.wikimedia.org/r/988494 (https://phabricator.wikimedia.org/T354112)
[13:52:37] Toolforge (Toolforge iteration 02), cloud-services-team (FY2023/2024-Q1-Q2), User-dcaro: [harbor] Redis using all available memory - https://phabricator.wikimedia.org/T354176 (dcaro) >>! In T354176#9441291, @taavi wrote: > Is there a reason we could not configure the maximum memory limit for Redis?...
[14:04:12] Toolforge (Toolforge iteration 02), cloud-services-team (FY2023/2024-Q1-Q2), User-dcaro: [harbor] Redis using all available memory - https://phabricator.wikimedia.org/T354176 (fnegri) > Is there a reason we could not configure the maximum memory limit for Redis? According to the [[ https://phabricat...
[14:05:27] (CR) FNegri: [C: +1] Add fake toolforge-rsa DKIM keys [labs/private] - https://gerrit.wikimedia.org/r/988494 (https://phabricator.wikimedia.org/T354112) (owner: Majavah)
[14:05:40] (CR) Majavah: [V: +2 C: +2] Add fake toolforge-rsa DKIM keys [labs/private] - https://gerrit.wikimedia.org/r/988494 (https://phabricator.wikimedia.org/T354112) (owner: Majavah)
[14:06:27] (PS1) David Caro: lighthttpd: don't remove environment vars [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/988498 (https://phabricator.wikimedia.org/T354320)
[14:07:03] (PuppetSyncFailure) firing: Failed to update Puppet repository /var/lib/git/operations/puppet on instance toolsbeta-puppetmaster-04 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[14:19:10] (CR) Andrew Bogott: [C: +1] "!" [labs/private] - https://gerrit.wikimedia.org/r/988084 (https://phabricator.wikimedia.org/T84536) (owner: Dzahn)
[14:23:47] Tool-Global-user-contributions, Stewards-and-global-tools, Temporary accounts, XTools, Design: [Design] UX exploration and wireframes - https://phabricator.wikimedia.org/T354531 (KColeman-WMF)
[14:25:51] Tool-Global-user-contributions, Stewards-and-global-tools, Temporary accounts, XTools, Design: [Design EPIC] Global User Contributions - https://phabricator.wikimedia.org/T349901 (KColeman-WMF)
[14:26:13] Tool-Global-user-contributions, Stewards-and-global-tools, Temporary accounts, XTools, Design: [Design EPIC] Global User Contributions - https://phabricator.wikimedia.org/T349901 (KColeman-WMF)
[14:27:48] Tool-Global-user-contributions, Stewards-and-global-tools, Temporary accounts, XTools, and 2 others: [Design] Create user flows for different GUC scenarios - https://phabricator.wikimedia.org/T349902 (KColeman-WMF)
[14:31:32] Tool-Global-user-contributions, Stewards-and-global-tools, Temporary accounts, XTools, and 2 others: [Design] UX exploration and wireframes - https://phabricator.wikimedia.org/T354531 (KColeman-WMF)
[14:48:06] Toolforge, cloud-services-team (FY2023/2024-Q1-Q2), Patch-For-Review: wmcs-wheel-of-misfortune kills system processes - https://phabricator.wikimedia.org/T354430 (fnegri) From IRC: `lang=irc [19:22:45] dhinus: we have the adduser class in Puppet which defines these ranges (both for the ad...
[15:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[15:07:37] Toolforge, cloud-services-team (FY2023/2024-Q1-Q2), Patch-For-Review: wmcs-wheel-of-misfortune kills system processes - https://phabricator.wikimedia.org/T354430 (fnegri) In progress→Resolved
[15:08:01] Cloud-VPS (Quota-requests), cloud-services-team: Quota increase request for project 'monitoring' - https://phabricator.wikimedia.org/T354412 (fgiunchedi) Thank you folks! Appreciate it
[15:08:27] Cloud-VPS, cloud-services-team (Hardware), SRE, ops-eqiad: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Jclark-ctr) Reopened Ticket with Dell
[15:36:43] Cloud-VPS, cloud-services-team, Cumin, Infrastructure-Foundations, Patch-For-Review: [cumin] [openstack] Openstack backend fails when project is not set - https://phabricator.wikimedia.org/T346453 (Volans)
[15:37:58] Cloud-VPS, cloud-services-team, Cumin, Infrastructure-Foundations, Patch-For-Review: Cumin/Openstack: multi-project commands are extremely slow - https://phabricator.wikimedia.org/T325773 (Volans) p:Triage→Medium
[15:38:39] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr) Self dispatched 8 new drives for cloudcephosd1028
[15:39:40] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro) Scheduled next thursday to do the swap of the drives, will get the host out of the cluster before that.
[15:44:26] (NodeDown) resolved: The node cloudvirt1063 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[15:46:42] cloud-services-team, Infrastructure-Foundations, LDAP, Patch-Needs-Improvement: Rename ldap-labs cluster - https://phabricator.wikimedia.org/T295150 (MoritzMuehlenhoff) p:Triage→Low a:MoritzMuehlenhoff
[15:52:28] (NodeDown) firing: Cloudvirt node cloudvirt1063 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[15:53:19] (HAProxyBackendUnavailable) firing: HAProxy service nova-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[16:10:02] Cloud-VPS: Simple multiroot cfssl PKI setup for Cloud-VPS projects - https://phabricator.wikimedia.org/T340742 (joanna_borun)
[16:23:19] (HAProxyBackendUnavailable) resolved: HAProxy service nova-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[16:44:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[16:49:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[17:07:03] (PuppetSyncFailure) firing: Failed to update Puppet repository /var/lib/git/operations/puppet on instance toolsbeta-puppetmaster-04 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[17:26:42] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[17:55:47] Cloud-VPS, cloud-services-team: ceph slow ops 2023-10-11 - https://phabricator.wikimedia.org/T348634 (fnegri)
[17:56:22] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (fnegri) p:Triage→High
[17:56:44] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (fnegri) Open→In progress
[18:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[18:27:27] (CR) Dzahn: [C: +2] "!:)" [labs/private] - https://gerrit.wikimedia.org/r/988084 (https://phabricator.wikimedia.org/T84536) (owner: Dzahn)
[18:27:53] (CR) Dzahn: [V: +2 C: +2] secret: delete fake keys for hosts in Tampa(!) [labs/private] - https://gerrit.wikimedia.org/r/988084 (https://phabricator.wikimedia.org/T84536) (owner: Dzahn)
[19:06:23] Striker, wikitech.wikimedia.org, MediaWiki-extensions-OATHAuth, TestMe: Wikitech 2FA does not appear to allow recovery with recovery codes - https://phabricator.wikimedia.org/T204682 (Reedy) I've just tested logging into wikitech with a recovery token, worked fine... I then burned another loggin...
[19:08:16] (CR) Dzahn: [V: +2 C: +2] secret: remove passwords and fake key for ganglia [labs/private] - https://gerrit.wikimedia.org/r/988085 (https://phabricator.wikimedia.org/T253555) (owner: Dzahn)
[19:08:48] (PS4) Dzahn: secret: remove passwords and fake key for ganglia [labs/private] - https://gerrit.wikimedia.org/r/988085 (https://phabricator.wikimedia.org/T253555)
[19:19:14] (CR) Dzahn: [V: +2] secret: remove passwords and fake key for ganglia [labs/private] - https://gerrit.wikimedia.org/r/988085 (https://phabricator.wikimedia.org/T253555) (owner: Dzahn)
[19:20:56] superset.wmcloud.org: Superset to tofu - https://phabricator.wikimedia.org/T354444 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/superset-deploy/pull/15
[19:21:04] vivian-rook opened https://github.com/toolforge/superset-deploy/pull/15
[19:22:23] Grid-Engine-to-K8s-Migration: Migrate dplbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319701 (russblau) Open→Resolved No jobs running on the grid at this time, and crontab has been blanked.
[19:54:45] Tool-gitlab-account-approval, User-bd808: Consider adding Gerrit Trusted-Users group as source of trust - https://phabricator.wikimedia.org/T353914 (bd808) In progress→Resolved
[20:07:03] (PuppetSyncFailure) firing: Failed to update Puppet repository /var/lib/git/operations/puppet on instance toolsbeta-puppetmaster-04 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[20:12:03] (PuppetSyncFailure) resolved: Failed to update Puppet repository /var/lib/git/operations/puppet on instance toolsbeta-puppetmaster-04 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[20:21:08] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: builds log streaming times out when time between two loglines exceeds ~1min - https://phabricator.wikimedia.org/T354189 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_...
[20:21:57] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: builds log streaming times out when time between two loglines exceeds ~1min - https://phabricator.wikimedia.org/T354189 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge...
[20:37:55] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: builds log streaming times out when time between two loglines exceeds ~1min - https://phabricator.wikimedia.org/T354189 (Raymond_Ndibe) Open→In progress
[21:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[21:29:11] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[22:16:48] superset.wmcloud.org: Superset to tofu - https://phabricator.wikimedia.org/T354444 (rook) Moving to new cluster as part of this as old cluster got a little weird. Seems like disk filled but containers.conf was in place.
[22:16:59] superset.wmcloud.org: Superset to tofu - https://phabricator.wikimedia.org/T354444 (rook) Open→In progress a:rook
[22:26:48] superset.wmcloud.org: Superset to tofu - https://phabricator.wikimedia.org/T354444 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/superset-deploy/pull/15
[22:26:54] vivian-rook closed https://github.com/toolforge/superset-deploy/pull/15
[22:27:58] superset.wmcloud.org: remove old cluster - https://phabricator.wikimedia.org/T354574 (rook)
[22:28:08] superset.wmcloud.org: Superset to tofu - https://phabricator.wikimedia.org/T354444 (rook) In progress→Resolved
[23:19:11] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[23:26:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[23:31:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[23:47:13] Cloud-VPS (Quota-requests): Increase disk qouta for math - https://phabricator.wikimedia.org/T354579 (Physikerwelt)