[00:07:28] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:12:28] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[05:22:28] (InstanceDown) firing: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:32:28] (InstanceDown) resolved: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:24:10] Cloud-VPS, Patch-For-Review: cloudcumin can't reach bastion-restricted itself - https://phabricator.wikimedia.org/T361831#9696438 (taavi) Open→Resolved
[08:44:36] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9696495 (CodeReviewBot) dcaro opened https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/23 d/changelog: bump to 16.0.5
[08:50:49] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9696523 (CodeReviewBot) dcaro merged https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/23 d/changelog: bump to 16.0.5
[08:57:20] Cloud-VPS, DNS, SRE, Traffic: DNS name resolution failure with www.spacecom.mil from Cloud VPS - https://phabricator.wikimedia.org/T346471#9696527 (taavi) `www.spacecom.mil` seems to work now: `lang=shell-session taavi@tools-bastion-12:~ $ dig www.spacecom.mil ; <<>> DiG 9.18.24-1-Debian <<>> ww...
[09:05:20] Toolforge, Documentation, good first task: Update Help:Access to Toolforge instances with PuTTY and WinSCP - https://phabricator.wikimedia.org/T334697#9696546 (fnegri)
[09:06:00] Toolforge, Documentation, good first task: Update Help:Access to Toolforge instances with PuTTY and WinSCP - https://phabricator.wikimedia.org/T334697#9696550 (fnegri)
[09:06:29] Cloud-VPS, Toolforge, Documentation, good first task: Update Help:Access to Toolforge instances with PuTTY and WinSCP - https://phabricator.wikimedia.org/T334697#9696551 (fnegri)
[09:09:36] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Cumin, Infrastructure-Foundations, Patch-For-Review: [cumin] [openstack] Openstack backend fails when project is not set - https://phabricator.wikimedia.org/T346453#9696565 (fnegri) a:Volans→fnegri
[09:13:09] Tool-ldap: Indicate if account is locked - https://phabricator.wikimedia.org/T362046 (taavi) NEW
[09:32:29] Toolforge: [buildservice] Determine the least invasive/smallest extra output buildpack needed to pair with Apt - https://phabricator.wikimedia.org/T361409#9696622 (dcaro) > I would like to know which of the buildpacks that qualify for enabling apt-buildpack places the least number of bytes in the resulting i...
[09:33:13] Toolforge: [buildservice] Determine the least invasive/smallest extra output buildpack needed to pair with Apt - https://phabricator.wikimedia.org/T361409#9696626 (dcaro) p:Triage→Low
[09:33:23] Toolforge: [builds-builder] Determine the least invasive/smallest extra output buildpack needed to pair with Apt - https://phabricator.wikimedia.org/T361409#9696627 (dcaro)
[09:37:27] Toolforge (Toolforge iteration 08), Patch-For-Review: [builds-builder,builds-admission] Remove direct access to tekton from tools and remove the admission controller - https://phabricator.wikimedia.org/T360329#9696646 (dcaro)
[09:38:04] Toolforge (Toolforge iteration 08), Patch-For-Review: [builds-builder,builds-admission] Remove direct access to tekton from tools and remove the admission controller - https://phabricator.wikimedia.org/T360329#9696649 (CodeReviewBot) dcaro updated https://gitlab.wikimedia.org/repos/cloud/toolforge/builds...
[09:38:41] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9696650 (aborrero) hey, I just noticed the `dumps` operation now shows an invalid YAML that cannot be loaded back: `lang=shell-session local.tf-test@lima-lima-...
[09:39:10] Toolforge: [webservice] Allow configuration of Promethus scraping of a specific webservice endpoint for publication in grafana.wmcloud.org - https://phabricator.wikimedia.org/T362012#9696648 (taavi) The Prometheus side of this seems easy to implement, but should be done on a separate pair of Prometheus VMs i...
[09:43:02] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9696656 (dcaro) >>! In T335592#9696650, @aborrero wrote: > hey, I just noticed the `dumps` operation now shows an invalid YAML that cannot be loaded back: > >...
[09:54:26] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9696703 (CodeReviewBot) aborrero opened https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/24 dump: handle new health-check
[10:12:02] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api] Remove flask-restful - https://phabricator.wikimedia.org/T359806#9696719 (CodeReviewBot) project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/234 jobs...
[10:12:38] cloud-services-team, Toolforge: review pod templates for stricter security - https://phabricator.wikimedia.org/T362050 (aborrero) NEW
[10:12:51] Toolforge, Epic: [component] First iteration of the component API - https://phabricator.wikimedia.org/T362051 (dcaro) NEW p:Triage→High
[10:15:08] cloud-services-team, Toolforge: Upgrade Toolforge (Elastic|Open)Search cluster to Debian Bullseye - https://phabricator.wikimedia.org/T311905#9696747 (taavi) a:taavi Tentatively claiming. Seems like we have packages and Puppetization available for OpenSearch 2 on Bookworm, will need to check how to m...
[10:16:37] !log dcaro@urcuchillay toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component jobs-api
[10:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[10:17:11] !log dcaro@urcuchillay toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component jobs-api
[10:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[10:26:10] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component jobs-api
[10:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:26:45] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component jobs-api
[10:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:58:35] cloud-services-team, Toolforge: review pod templates for stricter security - https://phabricator.wikimedia.org/T362050#9697074 (dcaro) > readOnlyRootFilesystem: true We probably don't want to enforce this, so people can create temporary files and similar without the need of mounting volumes
[12:08:15] cloud-services-team, Toolforge: review pod templates for stricter security - https://phabricator.wikimedia.org/T362050#9697095 (aborrero) >>! In T362050#9697074, @dcaro wrote: >> readOnlyRootFilesystem: true > > We probably don't want to enforce this, so people can create temporary files and similar w...
[12:08:21] cloud-services-team, Toolforge: review pod templates for stricter security - https://phabricator.wikimedia.org/T362050#9697096 (aborrero)
[12:17:11] cloud-services-team, Cloud-VPS: Allow authenticated write access from the wikiprod network to metricsinfra alertmanager API - https://phabricator.wikimedia.org/T362061 (taavi) NEW
[12:17:57] cloud-services-team, Cloud-VPS: Allow authenticated write access from the wikiprod network to metricsinfra alertmanager API - https://phabricator.wikimedia.org/T362061#9697122 (taavi)
[12:17:57] cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: [wmcs][alerting] Allow silencing alerts metricsinfra alerts on alerts.wikimedia.org - https://phabricator.wikimedia.org/T320973#9697123 (taavi)
[12:26:27] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9697161 (CodeReviewBot) aborrero merged https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/24 dump: handle new health-check
[12:31:01] Toolforge, Epic: [component] First iteration of the component API - https://phabricator.wikimedia.org/T362051#9697167 (dcaro)
[12:32:55] Toolforge, Epic: [jobs-api,webservice] Run webservices via the jobs framework - https://phabricator.wikimedia.org/T348755#9697182 (dcaro)
[12:32:56] Toolforge, Epic: [component] First iteration of the component API - https://phabricator.wikimedia.org/T362051#9697181 (dcaro)
[12:34:09] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9697202 (dcaro) Open→In progress
[12:34:54] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api] Remove flask-restful - https://phabricator.wikimedia.org/T359806#9697187 (dcaro) In progress→Resolved
[12:35:59] Toolforge (Toolforge iteration 08), Patch-For-Review:
[builds-builder,builds-admission] Remove direct access to tekton from tools and remove the admission controller - https://phabricator.wikimedia.org/T360329#9697205 (dcaro) Open→Resolved
[12:36:19] Toolforge: [component-api] Develop the webhook mechanism to trigger a deploy (unrefined) - https://phabricator.wikimedia.org/T362066 (dcaro) NEW
[12:46:57] Toolforge: [component-api] Get a skeleton of API webservice (unrefined) - https://phabricator.wikimedia.org/T362069 (dcaro) NEW
[12:47:27] Toolforge, Epic: [component-api] First iteration of the component API - https://phabricator.wikimedia.org/T362051#9697282 (dcaro)
[12:48:32] (PS1) Arturo Borrero Gonzalez: toolforge.k8s.component.deploy: report project in runtime description [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1017849
[12:51:01] (CR) Majavah: [C:-1] "messages are already logged to the project-specific SAL and each project currently only has one project, so this just seems redundant info" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1017849 (owner: Arturo Borrero Gonzalez)
[12:51:48] (Abandoned) Arturo Borrero Gonzalez: toolforge.k8s.component.deploy: report project in runtime description [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1017849 (owner: Arturo Borrero Gonzalez)
[12:55:15] Toolforge: [component-api] Get a minimal version of the config with build-only data - https://phabricator.wikimedia.org/T362070 (dcaro) NEW
[12:57:28] Toolforge: [component-api] Extend the list of build triggers (unrefined) - https://phabricator.wikimedia.org/T362071 (dcaro) NEW
[12:58:13] Toolforge: [component-api] Extend the list of build triggers (unrefined) - https://phabricator.wikimedia.org/T362071#9697329 (dcaro)
[12:58:14] Toolforge: [component-api] Get a skeleton of API webservice (unrefined) - https://phabricator.wikimedia.org/T362069#9697330 (dcaro)
[13:00:32] Toolforge: [component-api] Add support for non-public
services - https://phabricator.wikimedia.org/T362072 (dcaro) NEW
[13:00:36] Toolforge: [component-api] Add support for non-public services - https://phabricator.wikimedia.org/T362072#9697346 (dcaro)
[13:00:41] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support services in jobs - https://phabricator.wikimedia.org/T348758#9697347 (dcaro)
[13:01:55] Toolforge: [component-api] Add support for non-public services - https://phabricator.wikimedia.org/T362072#9697349 (dcaro)
[13:09:08] PAWS: Reduce cluster size - https://phabricator.wikimedia.org/T361952#9697370 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/397
[13:09:18] vivian-rook opened https://github.com/toolforge/paws/pull/397
[13:10:22] Toolforge (Toolforge iteration 08): [builds-builder,jobs-api,upstream] Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016#9697377 (dcaro)
[13:10:43] cloud-services-team, Toolforge (Toolforge iteration 08), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [builds-api,components-api] Automatically deploy the webservice when the image is built - https://phabricator.wikimedia.org/T341065#9697378 (dcaro)
[13:11:55] (CR) Btullis: [C:+1] Remove obsolete stub secret [labs/private] - https://gerrit.wikimedia.org/r/1016312 (https://phabricator.wikimedia.org/T360412) (owner: Muehlenhoff)
[13:12:08] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[13:12:12] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207
[13:12:46] (CR) Btullis: [C:+1] schema: Remove dummy cert [labs/private] - https://gerrit.wikimedia.org/r/1016316 (https://phabricator.wikimedia.org/T360412) (owner: Muehlenhoff)
[13:12:58] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook
wmcs.toolforge.add_k8s_etcd_node (exit_code=99)
[13:14:18] PAWS: Reduce cluster size - https://phabricator.wikimedia.org/T361952#9697397 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/397
[13:14:28] vivian-rook closed https://github.com/toolforge/paws/pull/397
[13:15:42] PAWS: Reduce cluster size - https://phabricator.wikimedia.org/T361952#9697399 (rook) Open→Resolved a:rook
[13:16:26] (PS1) Majavah: hieradata: add fake metricsinfra irc credentials [labs/private] - https://gerrit.wikimedia.org/r/1017859
[13:16:52] Toolforge: [component-api] add one-off, scheduled and continuous jobs support to the yaml + api (unrefined) - https://phabricator.wikimedia.org/T362075 (dcaro) NEW
[13:17:05] Toolforge: [component-api] add one-off, scheduled and continuous jobs support to the yaml + api (unrefined) - https://phabricator.wikimedia.org/T362075#9697414 (dcaro)
[13:17:05] Toolforge: [component-api] Get a skeleton of API webservice (unrefined) - https://phabricator.wikimedia.org/T362069#9697415 (dcaro)
[13:17:12] (CR) Majavah: [V:+2 C:+2] hieradata: add fake metricsinfra irc credentials [labs/private] - https://gerrit.wikimedia.org/r/1017859 (owner: Majavah)
[13:18:27] Toolforge: [components-api] Add support for pre-build images (to refine) - https://phabricator.wikimedia.org/T362076 (dcaro) NEW
[13:19:08] Toolforge: [component-api] add one-off, scheduled and continuous jobs support to the yaml + api (unrefined) - https://phabricator.wikimedia.org/T362075#9697439 (dcaro)
[13:19:09] Toolforge: [components-api] Add support for pre-build images (to refine) - https://phabricator.wikimedia.org/T362076#9697438 (dcaro)
[13:19:43] Toolforge, Kubernetes: [jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime - https://phabricator.wikimedia.org/T306391#9697442 (SD0001) @dcaro I have a stuck pod `bot-monitor-28525880-5gm4t` in my tool account `sdzerobot`.
The job generally takes only a couple of minutes. But today, I saw...
[13:19:54] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[13:19:58] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207
[13:20:12] Toolforge: [component-api] Add webservice support (to refine) - https://phabricator.wikimedia.org/T362077 (dcaro) NEW
[13:20:26] Toolforge: [component-api] Add webservice support (to refine) - https://phabricator.wikimedia.org/T362077#9697455 (dcaro)
[13:20:27] Toolforge: [component-api] Get a skeleton of API webservice (unrefined) - https://phabricator.wikimedia.org/T362069#9697456 (dcaro)
[13:21:14] Toolforge: [component-api] Add webservice support (to refine) - https://phabricator.wikimedia.org/T362077#9697459 (dcaro)
[13:21:16] Toolforge, Epic: [jobs-api,webservice] Run webservices via the jobs framework - https://phabricator.wikimedia.org/T348755#9697460 (dcaro)
[13:22:06] Toolforge, Epic: [component-api] First iteration of the component API - https://phabricator.wikimedia.org/T362051#9697461 (dcaro)
[13:22:34] Toolforge: [component-api] Add support for pre-build images (to refine) - https://phabricator.wikimedia.org/T362076#9697462 (taavi)
[13:24:23] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=99)
[13:28:47] Toolforge, Kubernetes: [jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime - https://phabricator.wikimedia.org/T306391#9697475 (dcaro) >>! In T306391#9697442, @SD0001 wrote: > @dcaro I have a stuck pod `bot-monitor-28525880-5gm4t` in my tool account `sdzerobot`. The job generally takes on...
[13:28:54] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[13:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:29:01] !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255)
[13:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:29:11] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[13:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:29:14] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[13:29:16] !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255)
[13:29:17] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207
[13:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:29:34] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[13:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:29:39] !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255)
[13:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:29:48] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[13:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:31:27] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=99)
[13:32:05] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[13:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:32:08] !log dcaro@urcuchillay tools START - Cookbook
wmcs.openstack.cloudvirt.vm_console
[13:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:35:10] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[13:35:13] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207
[13:36:57] Toolforge, Kubernetes: [jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime - https://phabricator.wikimedia.org/T306391#9697531 (dcaro) It seems that the issue is related to the NFS server going away and leaving stuck processes in the k8s worker, looking
[13:37:15] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=99)
[13:37:37] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[13:40:53] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=99)
[13:43:42] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[13:43:46] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207
[13:45:52] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=99)
[13:46:00] (CR) Muehlenhoff: [V:+2 C:+2] schema: Remove dummy cert [labs/private] - https://gerrit.wikimedia.org/r/1016316 (https://phabricator.wikimedia.org/T360412) (owner: Muehlenhoff)
[13:47:04] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[13:47:28] (InstanceDown) firing: Project tools instance tools-k8s-etcd-20 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:49:12] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=99)
[13:49:33] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[13:49:36] T349207:
[infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207
[13:51:42] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=99)
[13:52:26] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-56
[13:52:28] (InstanceDown) resolved: Project tools instance tools-k8s-etcd-20 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:53:43] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[13:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:53:54] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-56
[13:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:54:31] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[13:56:17] Tool-Global-user-contributions, Stewards-and-global-tools, Temporary accounts, XTools, and 2 others: [Design] Update wireframes with user testing learnings - https://phabricator.wikimedia.org/T359827#9697589 (KColeman-WMF)
[13:56:28] (PuppetAgentStaleLastRun) firing: (2) Last Puppet run was over 24 hours ago on instance tools-k8s-etcd-19 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[13:56:47] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=99)
[13:57:16] Toolforge: [components-api] Add minimal cli - https://phabricator.wikimedia.org/T362082 (dcaro) NEW
[13:57:49] Toolforge: [component-api] Get a skeleton of API webservice (unrefined) - https://phabricator.wikimedia.org/T362069#9697616 (dcaro)
[13:57:50] Toolforge: [components-api] Add minimal cli -
https://phabricator.wikimedia.org/T362082#9697615 (dcaro)
[14:05:28] (PuppetSyncFailure) firing: Failed to update Puppet repository /srv/git/operations/puppet on instance metricsinfra-puppetserver-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[14:06:56] ^ me
[14:10:28] (PuppetSyncFailure) resolved: Failed to update Puppet repository /srv/git/operations/puppet on instance metricsinfra-puppetserver-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[14:13:49] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-21
[14:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:14:59] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-21
[14:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:16:58] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[14:17:01] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207
[14:21:28] (PuppetAgentStaleLastRun) resolved: (2) Last Puppet run was over 24 hours ago on instance tools-k8s-etcd-19 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[14:24:28] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:28:10] Cloud-VPS, Patch-For-Review: Support downtiming metricsinfra alerts in wmcs-cookbooks - https://phabricator.wikimedia.org/T360932#9697754 (joanna_borun)
[14:29:28] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:32:06] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_etcd_node
(exit_code=0)
[14:32:11] Cloud-VPS, Infrastructure-Foundations, Spicerack, SRE-tools: spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are: Another puppet instance is already running and the waitforlock setting is set to 0; e... - https://phabricator.wikimedia.org/T361218#9697786
[14:32:14] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[14:32:17] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207
[14:36:47] cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, Spicerack, SRE-tools, Patch-For-Review: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9697825 (Volans) p:Triage→Medium a:Volans
[14:36:53] Quarry, Toolforge, ChangeProp, collaboration-services, and 10 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9697832 (joanna_borun) p:Triage→Medium
[14:41:18] Toolforge, Epic: [component-api] First iteration of the component API - https://phabricator.wikimedia.org/T362051#9697843 (aborrero)
[14:48:32] Tools, Wikidata, Wikidata Dev Team, wmde-wikidata-tech: [SW] [GENERAL] Deprecate connecting senses prototype - https://phabricator.wikimedia.org/T351829#9697871 (Lucas_Werkmeister_WMDE) Open→Stalled This is stalled until Itamar comes back – the [only maintainers](https://toolsadmin.wikime...
[14:49:00] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=0)
[14:49:57] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[14:50:00] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207
[14:54:20] Tools, Wikidata, Wikidata Dev Team, wmde-wikidata-tech: [SW] [GENERAL] Deprecate connecting senses prototype - https://phabricator.wikimedia.org/T351829#9697923 (Lucas_Werkmeister_WMDE)
[14:59:33] Toolforge: [webservice] Allow configuration of Promethus scraping of a specific webservice endpoint for publication in grafana.wmcloud.org - https://phabricator.wikimedia.org/T362012#9697957 (bd808)
[15:07:52] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=0)
[15:21:38] cloud-services-team, Toolforge: Upgrade Toolforge (Elastic|Open)Search cluster to Debian Bullseye - https://phabricator.wikimedia.org/T311905#9698083 (bd808) >>! In T311905#9696747, @taavi wrote: > Also cc-ing @bd808 in case you have a tool that could be used as a canary here. #stashbot + sal.toolforge....
[15:26:07] Tools, Wikidata, SecTeam-Processed, Security, Vuln-Infoleak: connecting-senses tool OAuth credentials were world-readable - https://phabricator.wikimedia.org/T362089#9698106 (sbassett) p:Triage→Low
[15:29:50] PROBLEM - toolschecker: All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 177 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[15:31:11] Toolforge, Kubernetes: [jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime - https://phabricator.wikimedia.org/T306391#9698133 (dcaro) >>!
In T306391#9697531, @dcaro wrote: > It seems that the issue is related to the NFS server going away and leaving stuck processes in the k8s worker, loo...
[15:33:24] Cloud-VPS (Debian Buster Deprecation), collaboration-services: replace buster machines in devtools project - https://phabricator.wikimedia.org/T360964#9698138 (LSobanski) a:Dzahn
[15:35:07] Cloud-VPS (Debian Buster Deprecation), collaboration-services: replace buster machines in devtools project - https://phabricator.wikimedia.org/T360964#9698143 (LSobanski) p:Triage→Medium
[15:38:34] Toolforge (Toolforge iteration 08): [infra] Add alert when workers have a sustained large amount of D processes - https://phabricator.wikimedia.org/T362093 (dcaro) NEW
[15:39:24] Toolforge (Toolforge iteration 08): [infra] Add alert when workers have a sustained large amount of D processes - https://phabricator.wikimedia.org/T362093#9698194 (dcaro)
[15:39:27] Toolforge (Toolforge iteration 08): [infra] Add alert when workers have a sustained large amount of D processes - https://phabricator.wikimedia.org/T362093#9698190 (dcaro) Open→In progress p:Triage→High
[15:44:50] RECOVERY - toolschecker: All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.320 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[15:44:57] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node
[15:51:11] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=0)
[15:52:35] Tools, Wikidata, Wikidata Dev Team, wmde-wikidata-tech: [SW] [GENERAL] Deprecate connecting senses prototype - https://phabricator.wikimedia.org/T351829#9698282 (Lucas_Werkmeister_WMDE) > In case anyone is using this prototype tool, a cursory deprecation notice should be given. Due to T362089 an...
[15:59:50] PROBLEM - toolschecker: All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 177 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[16:01:12] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node
[16:04:18] Wikibugs: Replace Redis queue with custom http solution - https://phabricator.wikimedia.org/T361518#9698337 (CodeReviewBot) bd808 merged https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/28 Replace Redis queue with custom http solution
[16:04:34] Quarry, Toolforge, ChangeProp, collaboration-services, and 10 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9698336 (CodeReviewBot) bd808 merged https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/28...
[16:04:50] RECOVERY - toolschecker: All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.631 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [16:08:27] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=0) [16:09:01] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_etcd_node [16:09:38] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=99) [16:11:35] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_etcd_node [16:20:59] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=99) [16:22:37] (03CR) 10BryanDavis: [C:04-2] "test" [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/1008016 (https://phabricator.wikimedia.org/T90594) (owner: 10BryanDavis) [16:22:56] 10Wikibugs: Wikibugs testing task - https://phabricator.wikimedia.org/T90594#9698399 (10bd808) test [16:24:27] (03PS1) 10Andrew Bogott: etcd depool_and_remove: add --shutdown option [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1017884 [16:24:50] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_etcd_node [16:24:50] PROBLEM - toolschecker: All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 177 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [16:29:37] (03CR) 10CI reject: [V:04-1] etcd depool_and_remove: add --shutdown option [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1017884 (owner: 10Andrew Bogott) [16:35:01] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=0) [16:39:49] RECOVERY - toolschecker: All 
k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.540 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [16:58:30] 10Toolforge, 07Kubernetes: [jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime - https://phabricator.wikimedia.org/T306391#9698495 (10MusikAnimal) The use case mentioned in the task (#copypatrol) will soon no longer be an issue as we're moving to Cloud VPS. I still have a similar situation for... [17:13:41] (CloudVPSDesignateLeaks) firing: (2) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:18:41] (CloudVPSDesignateLeaks) firing: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:48:19] (03PS2) 10Andrew Bogott: etcd depool_and_remove: add --shutdown option [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1017884 [18:11:41] 10Quarry: refreshing a running query changes favicon from orange to blue - https://phabricator.wikimedia.org/T362101 (10Novem_Linguae) 03NEW [18:14:53] 10PAWS: Remove prometheus migrate logic - https://phabricator.wikimedia.org/T362102 (10rook) 03NEW [18:34:56] (SystemdUnitDown) firing: The service unit postgresql@15-main.service is in failed status on host cloudbackup1001-dev. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1001-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:39:56] (SystemdUnitDown) resolved: The service unit postgresql@15-main.service is in failed status on host cloudbackup1001-dev. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1001-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:46:56] (SystemdUnitDown) firing: The service unit postgresql@15-main.service is in failed status on host cloudbackup1001-dev. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1001-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:51:56] (SystemdUnitDown) firing: (2) The service unit postgresql@15-main.service is in failed status on host cloudbackup1001-dev. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:53:41] (CloudVPSDesignateLeaks) firing: (3) Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [18:57:04] 10Tools, 06Tech-Docs-Team, 07Documentation, 03Wikimedia-Hackathon-2024: [Hackathon 2024] Improve technical documentation of tools - https://phabricator.wikimedia.org/T358040#9698778 (10TBurmeister) Draft of Tool Docs guide is now ready finalized at https://www.mediawiki.org/wiki/Documentation/Tool_docs. I... 
[18:58:41] (CloudVPSDesignateLeaks) resolved: (3) Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:06:56] (SystemdUnitDown) firing: (3) The service unit backup_cinder_volumes.service is in failed status on host cloudbackup1001-dev. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:11:17] 10VPS-project-Codesearch: Codesearch ignores space at the end of the regex - https://phabricator.wikimedia.org/T343057#9698809 (10Novem_Linguae) I encountered this today when searching for `\$[A-Za-z_]+ = \$[A-Za-z_]+ = `. I would like that space on the end to narrow down my search, but it is getting trimmed. h... [19:11:56] (SystemdUnitDown) resolved: (3) The service unit backup_cinder_volumes.service is in failed status on host cloudbackup1001-dev. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:13:41] (CloudVPSDesignateLeaks) firing: (2) Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:18:41] (CloudVPSDesignateLeaks) firing: (3) Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:43:24] 10Quarry: [bug] "Access denied" for quarry database user - https://phabricator.wikimedia.org/T362111 (10bvibber) 03NEW [20:57:31] 10Quarry: [bug] "Access denied" for quarry database user - https://phabricator.wikimedia.org/T362111#9699071 (10rook) I've restarted the deployments. See how it behaves now? [21:34:02] 10Quarry: 14[bug] "Access denied" for quarry database user - 14https://phabricator.wikimedia.org/T362111#9699167 (10bvibber) 05Open→03Resolved a:03bvibber 14Confirmed good now. Thanks! [22:48:36] 10Toolforge (Toolforge iteration 08), 13Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9699268 (10Raymond_Ndibe) >>! In T335592#9691106, @bd808 wrote: >>>! In T335592#9691103, @bd808 wrote: >> @Raymond_Ndibe I think this feature deserves a section on... [22:52:59] 10Toolforge (Toolforge iteration 08), 13Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9699271 (10Raymond_Ndibe) >>! In T335592#9692471, @taavi wrote: > Re-opening since I think documentation needs to be added to https://wikitech.wikimedia.org/wiki/H... 
[22:54:32] 10Toolforge (Toolforge iteration 08), 13Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9699275 (10bd808) >>! In T335592#9699268, @Raymond_Ndibe wrote: > @bd808 the script will be executed inside the pod. you can either provide an inline script (`--he... [22:58:01] 10Toolforge (Toolforge iteration 08), 13Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9699278 (10Raymond_Ndibe) >>! In T335592#9699275, @bd808 wrote: >>>! In T335592#9699268, @Raymond_Ndibe wrote: >> @bd808 the script will be executed inside the pod...