[00:08:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-sgecron-2 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[00:09:49] (TfInfraTestApplyFailed) resolved: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:16:57] Toolforge (Toolforge iteration 04), Toolforge Build Service, Patch-For-Review: [tbs][builds-api] Refactor `internal/builds.go` - https://phabricator.wikimedia.org/T352762 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/72 [builds-ap...
[00:22:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-sgecron-02 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[03:08:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-sgecron-2 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[03:12:28] (PuppetAgentNoResources) resolved: No Puppet resources found on instance toolsbeta-sgecron-02 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[03:23:28] (PuppetAgentNoResources) resolved: No Puppet resources found on instance tools-sgecron-2 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[08:58:52] Grid-Engine-to-K8s-Migration: Migrate panoviewer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319953 (tstarling) >>! In T319953#9479694, @tstarling wrote: > I hit a problem — Hugin is missing from Ubuntu 22.04, which is also the only distro available for Toolforge bu...
[09:26:04] Data-Services, Toolforge, cloud-services-team: Requesting SQL code review for application on Toolforge - https://phabricator.wikimedia.org/T355779 (taavi) Open→Resolved a:taavi Hi. We talked about this in our team meeting yesterday. This kind of usage is fine (and relatively common thing...
[10:40:40] Toolforge (Toolforge iteration 04), Toolforge Build Service, Patch-For-Review: [apt-buildpack] Not sourcing /layers/fagiani_apt/apt/.profile.d/000_apt.sh - https://phabricator.wikimedia.org/T355214 (CodeReviewBot) dcaro merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_...
[10:41:58] Toolforge (Toolforge iteration 04), Toolforge Build Service, Patch-For-Review: [apt-buildpack] Not sourcing /layers/fagiani_apt/apt/.profile.d/000_apt.sh - https://phabricator.wikimedia.org/T355214 (CodeReviewBot) project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/r...
[10:45:47] PAWS: move paws-dev to pawsdev - https://phabricator.wikimedia.org/T355543 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/368
[10:45:58] vivian-rook opened https://github.com/toolforge/paws/pull/368
[10:48:39] (CR) Filippo Giunchedi: [V: +2 C: +2] deployment_server: add dummy oauth2-proxy secrets for jaeger [labs/private] - https://gerrit.wikimedia.org/r/992699 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[11:00:30] !log dcaro@urcuchillay toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-builder
[11:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:01:06] !log dcaro@urcuchillay toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-builder
[11:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:12:32] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-builder
[11:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:13:07] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-builder
[11:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:21:27] Toolforge, WMCH-Infrastructure: Toolfoge "wmch" tool is offline now. - https://phabricator.wikimedia.org/T355856 (ValerioBoz-WMCH) It seems that also Petscan is offline (?) https://petscan.toolforge.org/
[11:23:43] Toolforge (Toolforge iteration 04), Toolforge Build Service, Patch-For-Review: [apt-buildpack] Not sourcing /layers/fagiani_apt/apt/.profile.d/000_apt.sh - https://phabricator.wikimedia.org/T355214 (CodeReviewBot) dcaro merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merg...
[11:30:37] Toolforge (Toolforge iteration 03), Toolforge Build Service: apt buildpack (Aptfile support): not installing dependencies of packages already present on the build image - https://phabricator.wikimedia.org/T353847 (dcaro)
[11:31:59] Toolforge (Toolforge iteration 04), Toolforge Build Service: [apt-buildpack] Not sourcing /layers/fagiani_apt/apt/.profile.d/000_apt.sh - https://phabricator.wikimedia.org/T355214 (dcaro) In progress→Resolved Okok, now we support both using procfile entries (strongly recommended), and passing ad-...
[11:32:50] Toolforge (Toolforge iteration 04), Toolforge Build Service: [dev][harbor] reconcile harbor install methods - https://phabricator.wikimedia.org/T354942 (dcaro) Open→Resolved
[11:33:32] Toolforge (Toolforge iteration 04), Toolforge Build Service: [apt-buildpack] Does not handle virtual packages correctly - https://phabricator.wikimedia.org/T355575 (dcaro) Open→In progress
[11:39:54] (PS2) Majavah: vps: refresh_puppet_certs: Parse SAL project from FQDN [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992697
[12:24:13] !log taavi@runko toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the toolsbeta cluster
[12:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:24:39] !log taavi@runko toolsbeta END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the toolsbeta cluster
[12:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:27:34] (CR) CI reject: [V: -1] Localisation updates from https://translatewiki.net. [labs/tools/Isa] - https://gerrit.wikimedia.org/r/992915 (owner: L10n-bot)
[12:27:51] !log taavi@runko toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the toolsbeta cluster
[12:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:28:18] !log taavi@runko toolsbeta END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the toolsbeta cluster
[12:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:30:20] !log taavi@runko toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the toolsbeta cluster
[12:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:30:42] !log taavi@runko toolsbeta END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the toolsbeta cluster
[12:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:46:06] (CR) Nikerabbit: [V: +2] Localisation updates from https://translatewiki.net. [labs/tools/Isa] - https://gerrit.wikimedia.org/r/992915 (owner: L10n-bot)
[12:48:19] !log taavi@runko admin Added a new k8s worker-nfs toolsbeta-test-k8s-worker-nfs-1.toolsbeta.eqiad1.wikimedia.cloud to the cluster
[12:48:20] !log taavi@runko admin END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the toolsbeta cluster
[12:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:51:45] Toolforge: Fix deprecated Kubelet flags - https://phabricator.wikimedia.org/T355881 (taavi)
[12:56:54] Toolforge: Create a pool of NFS-less Toolforge Kubernetes workers - https://phabricator.wikimedia.org/T355883 (taavi)
[12:57:06] Toolforge: Create a pool of NFS-less Toolforge Kubernetes workers - https://phabricator.wikimedia.org/T355883 (taavi) p:Triage→Medium
[13:00:09] Toolforge: Create a pool of NFS-less Toolforge Kubernetes workers - https://phabricator.wikimedia.org/T355883 (taavi)
[13:00:18] Toolforge (Toolforge iteration 04), cloud-services-team, Kubernetes, Patch-For-Review: Toolforge k8s: Migrate workers to Containerd and Bookworm - https://phabricator.wikimedia.org/T284656 (taavi)
[13:02:55] (PS1) Majavah: toolforge: add_k8s_node: Add support for containerd [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992923 (https://phabricator.wikimedia.org/T284656)
[13:02:57] (PS1) Majavah: wmcs_libs: k8s: Fix Kubernetes role usage [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992924
[13:02:59] (PS1) Majavah: Add worker-nfs Toolforge Kubernetes role/prefix [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992925 (https://phabricator.wikimedia.org/T355883)
[13:03:01] (PS1) Majavah: toolforge: add_k8s_node: Allow passing --network [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992926 (https://phabricator.wikimedia.org/T284656)
[13:04:04] !log taavi@runko tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[13:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:15:43] !log taavi@runko admin Added a new k8s worker-nfs tools-k8s-worker-nfs-1.tools.eqiad1.wikimedia.cloud to the cluster
[13:15:43] !log taavi@runko admin END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[13:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:17:13] !log taavi@runko tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[13:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:24:43] (CR) FNegri: [C: +1] "LGTM!" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992697 (owner: Majavah)
[13:27:50] !log taavi@runko admin Added a new k8s worker-nfs tools-k8s-worker-nfs-2.tools.eqiad1.wikimedia.cloud to the cluster
[13:27:50] !log taavi@runko admin END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[13:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:30:15] (CR) Majavah: [C: +1] "+1. This means that any tools doing something stupid like `var_dump( $_SERVER );` will now leak their database credentials but that feels " [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/988498 (https://phabricator.wikimedia.org/T354320) (owner: David Caro)
[13:32:47] (CR) David Caro: [C: +2] lighthttpd: don't remove environment vars [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/988498 (https://phabricator.wikimedia.org/T354320) (owner: David Caro)
[13:32:59] (CR) FNegri: toolsdb: add cookbook to retrieve stuck table+query (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215 (owner: David Caro)
[13:35:37] !log taavi@runko testlabs START - Cookbook wmcs.openstack.quota_increase
[13:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Testlabs/SAL
[13:35:51] !log taavi@runko testlabs END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0)
[13:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Testlabs/SAL
[13:36:45] (CR) Majavah: [C: +1] quota_show: fix change in openstack cli return value [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992085 (owner: David Caro)
[13:37:42] (PS1) Pwangai: This patch allows the bot to fetch coverage numbers when the Quality Gate includes coverage. The coverage estimate after merge is displayed under the condition that the value is > 0.0% [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/992929 (https://phabricator.wikimedia.org/T355803)
[13:38:03] (CR) CI reject: [V: -1] This patch allows the bot to fetch coverage numbers when the Quality Gate includes coverage. The coverage estimate after merge is displayed under the condition that the value is > 0.0% [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/992929 (https://phabricator.wikimedia.org/T355803) (owner: Pwangai)
[13:39:59] (PS2) Pwangai: Append coverage value [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/992929 (https://phabricator.wikimedia.org/T355803)
[13:40:08] (Merged) jenkins-bot: lighthttpd: don't remove environment vars [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/988498 (https://phabricator.wikimedia.org/T354320) (owner: David Caro)
[13:40:15] (CR) David Caro: toolsdb: add cookbook to retrieve stuck table+query (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215 (owner: David Caro)
[13:40:23] (CR) CI reject: [V: -1] Append coverage value [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/992929 (https://phabricator.wikimedia.org/T355803) (owner: Pwangai)
[13:41:32] (CR) Majavah: [C: +2] vps: refresh_puppet_certs: Parse SAL project from FQDN [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992697 (owner: Majavah)
[13:41:41] Toolforge (Toolforge iteration 04), Toolforge Build Service: `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701 (Slst2020) Open→In progress
[13:42:36] Toolforge (Toolforge iteration 04): [ci] Add shellcheck to pre-commit where missing - https://phabricator.wikimedia.org/T353052 (Slst2020) Open→In progress
[13:44:59] (Merged) jenkins-bot: vps: refresh_puppet_certs: Parse SAL project from FQDN [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992697 (owner: Majavah)
[13:45:51] (CR) FNegri: toolsdb: add cookbook to retrieve stuck table+query (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215 (owner: David Caro)
[14:04:12] (PS3) Pwangai: Append coverage value [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/992929 (https://phabricator.wikimedia.org/T355803)
[14:21:58] Toolforge, cloud-services-team: Toolforge: Ensure long-running Kubernetes pods get container updates applied - https://phabricator.wikimedia.org/T314705 (taavi) This is still something I think we should make happen, but I really don't know if my initial approach was the correct approach or if it should b...
[14:29:13] (PS2) David Caro: inventory: split into submodules [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992638
[14:29:15] (PS1) David Caro: toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932
[14:29:25] (CR) CI reject: [V: -1] inventory: split into submodules [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992638 (owner: David Caro)
[14:29:27] (CR) CI reject: [V: -1] toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932 (owner: David Caro)
[14:50:12] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr) Submitted all new tsr reports along with smartctl data
[14:59:11] PAWS: jupyterlab to 4.0.11 - https://phabricator.wikimedia.org/T355890 (rook)
[15:11:47] (PS2) David Caro: toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932
[15:11:56] (CR) CI reject: [V: -1] toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932 (owner: David Caro)
[15:13:57] (PS4) David Caro: toolsdb: add cookbook to retrieve stuck table+query [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215
[15:13:59] (CR) David Caro: toolsdb: add cookbook to retrieve stuck table+query (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215 (owner: David Caro)
[15:14:01] (PS3) David Caro: inventory: split into submodules [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992638
[15:14:03] (PS3) David Caro: toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932
[15:14:10] (CR) CI reject: [V: -1] inventory: split into submodules [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992638 (owner: David Caro)
[15:14:12] (CR) CI reject: [V: -1] toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932 (owner: David Caro)
[15:15:59] (PS5) David Caro: toolsdb: add cookbook to retrieve stuck table+query [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215
[15:18:13] (PS6) David Caro: toolsdb: add cookbook to retrieve stuck table+query [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215
[15:18:15] (PS4) David Caro: inventory: split into submodules [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992638
[15:18:17] (PS4) David Caro: toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932
[15:18:27] (CR) CI reject: [V: -1] inventory: split into submodules [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992638 (owner: David Caro)
[15:18:29] (CR) CI reject: [V: -1] toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932 (owner: David Caro)
[15:19:02] (CR) David Caro: "I don't know why it says that there's a merge conflict :/, I don't see it locally" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215 (owner: David Caro)
[15:20:43] Cloud-VPS, cloud-services-team (Hardware), SRE, ops-eqiad: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Jclark-ctr) updated system settings server is back up now
[15:23:16] RECOVERY - Host cloudvirt1063 is UP: PING OK - Packet loss = 0%, RTA = 2.89 ms
[15:28:56] (NodeDown) resolved: Cloudvirt node cloudvirt1063 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[15:29:17] (NodeDownForLong) resolved: The node cloudvirt1063 has been unreachable for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDownForLong
[15:39:23] Cloud-VPS, cloud-services-team (Hardware), SRE, ops-eqiad: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Andrew) thanks! Let's let this sit w/out workload for a week or so and see if stays up, then we can try giving it some work to do.
[15:47:40] (PS1) Stevemunene: Remove dummy-keytabs for decommissioned druid hosts [labs/private] - https://gerrit.wikimedia.org/r/992968 (https://phabricator.wikimedia.org/T336043)
[15:49:48] (CR) Alexandros Kosiaris: [C: +1] deployment_server: add dummy oauth2-proxy secrets for jaeger [labs/private] - https://gerrit.wikimedia.org/r/992699 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[15:56:50] (CR) Muehlenhoff: [C: +1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/992968 (https://phabricator.wikimedia.org/T336043) (owner: Stevemunene)
[16:28:01] Striker: Allow marking all notifications as read - https://phabricator.wikimedia.org/T355895 (RPI2026F1)
[16:29:15] Striker: Allow marking all notifications as read - https://phabricator.wikimedia.org/T355895 (taavi)
[16:29:46] Striker: Mark all alerts as read - https://phabricator.wikimedia.org/T332579 (taavi)
[16:32:03] (PS3) Majavah: enc.py: rename project_name arg to project_id [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/988047 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott)
[16:32:09] (CR) Majavah: [C: +2] enc.py: rename project_name arg to project_id [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/988047 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott)
[16:36:32] (Merged) jenkins-bot: enc.py: rename project_name arg to project_id [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/988047 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott)
[16:40:40] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[16:45:31] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[16:51:34] PROBLEM - Check systemd state on cloudservices1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-wmcs-dnsleaks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:53:56] (SystemdUnitDown) firing: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudservices1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:59:17] (CR) Stevemunene: [V: +2 C: +2] Remove dummy-keytabs for decommissioned druid hosts [labs/private] - https://gerrit.wikimedia.org/r/992968 (https://phabricator.wikimedia.org/T336043) (owner: Stevemunene)
[17:03:09] (PS7) David Caro: toolsdb: add cookbook to retrieve stuck table+query [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215
[17:03:11] (PS5) David Caro: inventory: split into submodules [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992638
[17:03:13] (PS5) David Caro: toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932
[17:03:56] (SystemdUnitDown) resolved: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudservices1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:04:01] (PS8) David Caro: toolsdb: add cookbook to retrieve stuck table+query [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215
[17:04:03] (PS6) David Caro: inventory: split into submodules [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992638
[17:04:05] (PS6) David Caro: toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932
[17:04:17] (CR) David Caro: "So much many more rebasing xd" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215 (owner: David Caro)
[17:04:33] Cloud-VPS, cloud-services-team, DC-Ops, SRE, and 2 others: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by taavi@cumin1002 for hosts: `cloudrabbit[1001-1002].wikimedia.org` - clou...
[17:06:34] Cloud-VPS, cloud-services-team, DC-Ops, SRE, and 2 others: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (taavi)
[17:07:26] (CR) CI reject: [V: -1] inventory: split into submodules [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992638 (owner: David Caro)
[17:07:28] (CR) CI reject: [V: -1] toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932 (owner: David Caro)
[17:12:39] (PS7) David Caro: inventory: split into submodules [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992638
[17:12:41] (PS7) David Caro: toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932
[17:16:10] (CR) CI reject: [V: -1] inventory: split into submodules [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992638 (owner: David Caro)
[17:16:25] (CR) CI reject: [V: -1] toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932 (owner: David Caro)
[17:22:37] RECOVERY - Check systemd state on cloudservices1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:33:17] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [clouddb-service-puppetmaster-2] Renew puppet CA certificates - https://phabricator.wikimedia.org/T355410 (Andrew) Some doc links (that I haven't finished reading): https://wikite...
[17:57:22] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[18:02:23] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[18:03:25] (PS8) David Caro: inventory: split into submodules [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992638
[18:03:27] (PS8) David Caro: toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932
[18:06:38] (CR) CI reject: [V: -1] toolsdb: load the inventory dynamically [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992932 (owner: David Caro)
[19:29:42] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (VRiley-WMF)
[19:29:47] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (VRiley-WMF)
[19:31:35] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (VRiley-WMF) cloudrabbit1002 is now in E4 U17 CableID 2M-20220016 Port 3
[19:45:55] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [clouddb-service-puppetmaster-2] Renew puppet CA certificates - https://phabricator.wikimedia.org/T355410 (Andrew) Adapting that sysbee doc to our platform, I got down to this: `...
[19:57:09] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (VRiley-WMF) cloudrabbit1001 is now in C8 U19 CableID 5336 Port 21
[20:56:52] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [clouddb-service-puppetmaster-2] Renew puppet CA certificates - https://phabricator.wikimedia.org/T355410 (Andrew) https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster#...
[21:31:05] Toolforge (Toolforge iteration 04), Toolforge Build Service: [apt-buildpack] Not sourcing /layers/fagiani_apt/apt/.profile.d/000_apt.sh - https://phabricator.wikimedia.org/T355214 (LucasWerkmeister) Sounds good, thanks!
[21:37:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:47:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[23:05:50] (ProbeDown) firing: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[23:10:50] (ProbeDown) resolved: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[23:12:56] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[23:17:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown