[00:09:28] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:10:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:14:28] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:14:28] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:19:28] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:22:18] Cloud-VPS: Reset default Security Group rules for the openvas Cloud VPS project - https://phabricator.wikimedia.org/T360694#9652518 (bd808) Open→Resolved a:bd808 >>! In T360694#9651988, @KHurd-WMF wrote: > Those are the 4 that are not able to be re-added. The "any" option is not availa...
[03:10:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[06:15:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[07:59:56] Tools: 'hoiscript' tool uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T349913#9653080 (Hoi) Metadata of these files are not well-structured. I was managing to tidy them up so that template fields can be filled properly and appropriate categories can be created, which makes me...
[08:44:09] Quarry, Toolforge, ChangeProp, collaboration-services, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9653141 (Jelto)
[08:53:13] cloud-services-team, Cloud-VPS (Debian Buster Deprecation): Replace or remove Debian Buster VMs in 'monitoring' cloud-vps project - https://phabricator.wikimedia.org/T360703#9653248 (fgiunchedi) Thank you for the heads up; for context I'm working on {T352640} which will enable us to rebuild the whole o11...
[08:54:25] 10tool-wscontest: WS Contest has stopped updating its score - https://phabricator.wikimedia.org/T360749#9653269 (10Peachey88) [09:01:24] (03PS4) 10David Caro: ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 [09:04:57] (03CR) 10CI reject: [V:04-1] ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 (owner: 10David Caro) [09:06:10] (03PS5) 10David Caro: ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 [09:09:01] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T348643) [09:09:09] (03CR) 10CI reject: [V:04-1] ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 (owner: 10David Caro) [09:09:52] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T348643) [09:15:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [09:16:36] (03PS6) 10David Caro: ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 [09:18:10] 10Toolforge (Quota-requests): Request increased quota for pm20-* Toolforge tool - https://phabricator.wikimedia.org/T359785#9653373 (10Jneubert) Hi @dcaro , sorry for the late respose - I was some days off and missed the notification. The request is about a standing instance, to be shared among at least two per... 
[09:19:47] (03CR) 10CI reject: [V:04-1] ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 (owner: 10David Caro) [09:21:38] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T348643) [09:22:09] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T348643) [09:22:12] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T348643) [09:25:35] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) (T348643) [09:40:41] (CloudVPSDesignateLeaks) firing: Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [09:45:41] (CloudVPSDesignateLeaks) firing: (5) Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [09:48:38] (03PS7) 10David Caro: ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 [09:51:46] (03CR) 10CI reject: [V:04-1] ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 (owner: 10David Caro) [09:54:25] (03PS8) 10David Caro: ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 [09:57:45] (03CR) 10CI reject: [V:04-1] ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 (owner: 10David Caro) [10:00:15] (03PS9) 10David Caro: ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 [10:03:21] (03PS1) 10Muehlenhoff: Remove 
obsolete dummy key tabs [labs/private] - 10https://gerrit.wikimedia.org/r/1013511 (https://phabricator.wikimedia.org/T331613) [10:06:31] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete dummy key tabs [labs/private] - 10https://gerrit.wikimedia.org/r/1013511 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [10:20:35] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T348643) [10:20:38] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T348643) [10:22:30] (03PS10) 10David Caro: ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 [10:22:31] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T348643) [10:26:36] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 06DC-Ops, 10ops-eqiad, 06SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9653483 (10dcaro) [10:26:43] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) (T348643) [10:26:51] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T348643) [10:34:09] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [10:44:09] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [10:45:56] (SystemdUnitDown) firing: The service unit 
ceph-osd@270.service is in failed status on host cloudcephosd1034. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1034 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:46:17] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 06DC-Ops, 10ops-eqiad, 06SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9653542 (10dcaro) TSR and performance tests sent to DELL, bringing all the hosts back online. [10:50:45] 06cloud-services-team, 10Toolforge: Upgrade Toolforge legacy URL redirectors to Debian Bullseye or later - https://phabricator.wikimedia.org/T311909#9653551 (10taavi) a:03taavi Taking this. I'll also merge the two roles together during this. [10:55:56] (SystemdUnitDown) resolved: The service unit ceph-osd@270.service is in failed status on host cloudcephosd1034. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1034 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [11:06:14] 10Toolforge (Quota-requests): Request increased quota for pm20-* Toolforge tool - https://phabricator.wikimedia.org/T359785#9653581 (10dcaro) Thanks, good to know. Given that, there's two possible ways to go forward: * Creating your own database in a VM on CloudVPS: ** You have full control (for good and bad,... 
[11:09:40] 06cloud-services-team, 10Toolforge (Toolforge iteration 07): Upgrade Toolforge legacy URL redirectors to Debian Bullseye or later - https://phabricator.wikimedia.org/T311909#9653593 (10taavi) [11:15:28] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:19:40] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.vps.refresh_puppet_certs on toolsbeta-legacy-redirector-2.toolsbeta.eqiad1.wikimedia.cloud [11:21:10] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on toolsbeta-legacy-redirector-2.toolsbeta.eqiad1.wikimedia.cloud [11:24:28] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance toolsbeta-legacy-redirector-2 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:24:34] (DiskSpace) firing: Disk space cloudbackup1004:9100:/srv 5.986% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [11:29:28] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance toolsbeta-legacy-redirector-2 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:30:10] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T348643) [11:30:21] (HarborProbeUnknown) firing: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown [11:30:21] (HarborComponentDown) firing: No data about Harbor components found. 
#page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown [11:35:20] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console [11:35:21] (HarborProbeUnknown) resolved: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown [11:35:21] (HarborComponentDown) resolved: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown [11:35:28] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:36:11] (CloudVPSDesignateLeaks) resolved: (5) Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [11:37:08] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [11:37:50] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console [11:38:16] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [11:39:15] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console [11:49:14] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T348643) [11:57:43] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) (T348643) [11:59:26] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T348643) [12:06:07] !log dcaro@urcuchillay tools END (PASS) - 
Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [12:06:09] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [12:08:55] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T348643) [12:09:50] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T348643) [12:15:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [12:27:40] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T348643) [12:28:34] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T348643) [12:55:11] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS (Debian Buster Deprecation), 10Toolforge, 07Epic, 05Goal: Toolforge: migrate to Debian Bullseye or later - https://phabricator.wikimedia.org/T311897#9653676 (10taavi) [12:55:19] 06cloud-services-team, 10Toolforge (Toolforge iteration 07), 13Patch-For-Review: Upgrade Toolforge legacy URL redirectors to Debian Bullseye or later - https://phabricator.wikimedia.org/T311909#9653675 (10taavi) 05Open→03In progress [13:03:33] 10Cloud-VPS: 14Reset default Security Group rules for the openvas Cloud VPS project - 14https://phabricator.wikimedia.org/T360694#9653866 (10KHurd-WMF) 14They are, I really appreciate it. I promise not to touch it again! 
[13:13:29] 06cloud-services-team, 10Toolforge (Toolforge iteration 07), 13Patch-For-Review: 14toolschecker: retrieve the list of etcd nodes from hiera - 14https://phabricator.wikimedia.org/T279078#9653892 (10taavi) 05Open→03Resolved a:03taavi [13:41:10] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T348643) [13:41:20] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T348643) [13:41:39] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T348643) [13:46:09] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [14:02:50] (ProbeDown) firing: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:07:50] (ProbeDown) resolved: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:15:22] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T348643) [14:22:09] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - 
https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [14:26:38] (03PS11) 10David Caro: ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 [14:27:09] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [14:33:57] 06cloud-services-team, 10Cloud-VPS (Debian Buster Deprecation): Cloud-vps Buster deprecation - https://phabricator.wikimedia.org/T331738#9654125 (10Andrew) As of today there are 230 active buster hosts and 23 shut down hosts. [14:42:12] 06cloud-services-team, 10Cloud-VPS (Debian Buster Deprecation): Replace or remove Debian Buster VMs in 'monitoring' cloud-vps project - https://phabricator.wikimedia.org/T360703#9654148 (10Andrew) >>! In T360703#9653248, @fgiunchedi wrote: > Thank you for the heads up; for context I'm working on {T352640} whic... [14:43:55] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9654150 (10Jhancock.wm) @Andrew thanks for the update. Can I bug you to update the site.pp as well? Thanks! [15:05:24] 10Toolforge (Quota-requests): Request increased quota for pm20-* Toolforge tool - https://phabricator.wikimedia.org/T359785#9654197 (10Jneubert) That sounds great! Project name "pm20database" would be fine. I'd try a Trove instance first, and hope that backups are possible via pgadmin. Load will be low, so I d... 
[15:15:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[15:21:44] Cloud-VPS (Quota-requests): Temporarily increase quota for dwl Buster migration - https://phabricator.wikimedia.org/T360788 (Giftpflanze) NEW
[15:24:49] (DiskSpace) firing: Disk space cloudbackup1004:9100:/srv 5.219% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:27:25] cloud-services-team, Toolforge: Upgrade Toolforge apt repository (tools-services hosts) to Debian Bullseye or later - https://phabricator.wikimedia.org/T311914#9654353 (taavi) So I'm wondering whether we should take the opportunity to migrate the repository to [[ https://wikitech.wikimedia.org/wiki/Repre...
[15:37:38] cloud-services-team, VPS-Projects, Continuous-Integration-Infrastructure, Release-Engineering-Team, Puppet (Puppet 7.0): Update Integration project puppetmaster - https://phabricator.wikimedia.org/T360461#9654405 (Andrew) I've built the new puppetserver for this project. The old one (integrat...
[15:50:28] (InstanceDown) firing: (2) Project cloudinfra instance cloud-puppetmaster-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:55:28] (InstanceDown) resolved: (2) Project cloudinfra instance cloud-puppetmaster-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:13:39] Toolforge (Software install/update): Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#9654517 (Anomie) > `webservice perl5.32 shell` Just noting what we talked about, that "toolforge-shell" would be a lot more discoverable as a command name for t...
[16:33:58] cloud-services-team, wikitech.wikimedia.org, Epic: Set up a bitu instance for codfw1dev - https://phabricator.wikimedia.org/T360795 (Andrew) NEW
[16:46:21] Wikibugs, Software-Licensing: Relicense Wikibugs from MIT to GPL-3.0-or-later after approval by all substantive contributors - https://phabricator.wikimedia.org/T360718#9654631 (valhallasw) I have no objections to relicensing code that I have contributed under GPL-3.
[16:57:43] 10PAWS: Update nb_serverproxy_openrefine - https://phabricator.wikimedia.org/T360798 (10rook) 03NEW [16:58:46] 10PAWS: Update jupyter-rsession-proxy - https://phabricator.wikimedia.org/T360800 (10rook) 03NEW [16:59:30] 10PAWS: Update jupyter-rsession-proxy - https://phabricator.wikimedia.org/T360800#9654719 (10rook) [17:01:38] 10PAWS: Update nb_serverproxy_openrefine - https://phabricator.wikimedia.org/T360798#9654737 (10github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/393 [17:01:51] vivian-rook opened https://github.com/toolforge/paws/pull/393 [17:08:02] 06cloud-services-team, 10Cloud-VPS (Quota-requests): Temporarily increase quota for dwl Buster migration - https://phabricator.wikimedia.org/T360788#9654745 (10bd808) +1 [17:08:59] 10PAWS: PAWS partially down - https://phabricator.wikimedia.org/T360803 (10rook) 03NEW [17:10:41] (CloudVPSDesignateLeaks) firing: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:15:41] (CloudVPSDesignateLeaks) firing: (5) Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:26:35] (03PS2) 10Dzahn: delete doc.discovery dummy key, migrated to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1013419 (https://phabricator.wikimedia.org/T360413) [17:26:44] (03CR) 10Dzahn: [V:03+2 C:03+2] delete doc.discovery dummy key, migrated to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1013419 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:28:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 1 deleted 
instances on project-proxy-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [17:30:33] 06cloud-services-team, 10wikitech.wikimedia.org, 07Epic: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859#9654886 (10bd808) [17:32:31] 10Striker, 06Infrastructure-Foundations, 07LDAP, 13Patch-For-Review: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048#9654890 (10bd808) a:05bd808→03None [17:33:28] (PuppetStaleCertificates) firing: (2) Found non-revoked Puppet certificates for 1 deleted instances on project-proxy-puppetmaster-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [17:36:17] 10Toolforge (Software install/update): Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#9654901 (10Anomie) > Possibly I could set up my ~/.ssh/config with an entry that would use ProxyCommand to ssh→become→webservice shell→sshd -i though... First att... 
[17:38:23] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE, Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9654909 (Jhancock.wm)
[17:51:19] Wikibugs: Update irc task to use AsyncRedisQueue - https://phabricator.wikimedia.org/T359982#9654947 (CodeReviewBot) bd808 merged https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/15 Revamp irc bot plugin and queue
[17:51:21] Wikibugs: Explore replacing asyncio-redis with redis.asyncio - https://phabricator.wikimedia.org/T360074#9654948 (CodeReviewBot) bd808 merged https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/15 Revamp irc bot plugin and queue
[17:51:50] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE, Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9654949 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudbackup2003.codfw....
[17:51:55] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE, Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9654950 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudbackup2004.codfw....
[17:57:05] Wikibugs: Update irc task to use AsyncRedisQueue - https://phabricator.wikimedia.org/T359982#9654959 (bd808) In progress→Resolved
[17:57:18] Wikibugs: Explore replacing asyncio-redis with redis.asyncio - https://phabricator.wikimedia.org/T360074#9654960 (bd808) In progress→Resolved
[18:03:35] (PS2) Dzahn: delete releases.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013418 (https://phabricator.wikimedia.org/T360413)
[18:03:59] (CR) Dzahn: [V:+2 C:+2] delete releases.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013418 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[18:15:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[18:36:07] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE, Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9655092 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudbackup2004.codfw.wmne...
[18:37:09] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE, Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9655105 (Jhancock.wm)
[19:14:18] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE, Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9655164 (Jhancock.wm) cloudbackup2003 has os after a few attempts. had to delete and redo the virtual disks twice before it took. b...
[19:24:49] (DiskSpace) firing: Disk space cloudbackup1004:9100:/srv 3.963% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:08:36] 10Toolforge (Software install/update): Consider adding `kubectl`, `webservice`, and `toolforge` binaries to shell container images - https://phabricator.wikimedia.org/T360818 (10bd808) 03NEW [20:17:39] 10Toolforge (Software install/update): Consider adding `kubectl`, `webservice`, and `toolforge` binaries to shell container images - https://phabricator.wikimedia.org/T360818#9655327 (10bd808) It would be a breaking change, but this could also be seen as a way to remove vim, emacs, and other primarily interactiv... [20:24:28] (InstanceDown) firing: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:33:28] (PuppetStaleCertificates) firing: (2) Found non-revoked Puppet certificates for 1 deleted instances on project-proxy-puppetmaster-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [20:34:28] (InstanceDown) resolved: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:51:34] PROBLEM - Disk space on cloudbackup1004 is CRITICAL: DISK CRITICAL - free space: /srv 647777 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudbackup1004&var-datasource=eqiad+prometheus/ops [20:52:33] 10Cloud-VPS (Quota-requests), 10Release-Engineering-Team (Radar): Increase instance and volume quota in devtools project for puppetmaster upgrade - https://phabricator.wikimedia.org/T360823 
(10brennen) 03NEW [20:53:10] 10Cloud-VPS (Quota-requests), 06collaboration-services, 10Release-Engineering-Team (Radar): Increase instance and volume quota in devtools project for puppetmaster upgrade - https://phabricator.wikimedia.org/T360823#9655432 (10brennen) [20:54:50] 10Cloud-VPS (Quota-requests), 06collaboration-services, 10Release-Engineering-Team (Radar): Increase instance and volume quota in devtools project for puppetmaster upgrade - https://phabricator.wikimedia.org/T360823#9655434 (10brennen) [20:54:52] 06cloud-services-team, 10VPS-Projects, 06collaboration-services, 10Puppet (Puppet 7.0): Update devtools project puppetmaster - https://phabricator.wikimedia.org/T360470#9655433 (10brennen) [21:06:15] 06cloud-services-team, 10Cloud-VPS (Quota-requests): Temporarily increase quota for dwl Buster migration - https://phabricator.wikimedia.org/T360788#9655459 (10Giftpflanze) It turns out that the quota increase is actually not needed anymore. If you wish you can instead decrease the quota to 5 instances, 65 CPU... 
[21:15:41] (CloudVPSDesignateLeaks) firing: (5) Detected 14 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:15:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [21:20:41] (CloudVPSDesignateLeaks) firing: (5) Detected 14 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:25:41] (CloudVPSDesignateLeaks) resolved: (5) Detected 14 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:29:14] 10PAWS: PAWS partially down - https://phabricator.wikimedia.org/T360803#9655477 (10rook) quay.io appears well again [21:29:20] 10PAWS: 14PAWS partially down - 14https://phabricator.wikimedia.org/T360803#9655478 (10rook) 05Open→03Resolved [21:44:34] (DiskSpace) resolved: Disk space cloudbackup1004:9100:/srv 5.979% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [21:51:34] RECOVERY - Disk space on cloudbackup1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudbackup1004&var-datasource=eqiad+prometheus/ops
[22:16:28] (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[22:31:28] (WidespreadPuppetAgentFailure) resolved: Widespread puppet agent failures in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[22:35:07] Cloud-VPS (Quota-requests), collaboration-services, Release-Engineering-Team (Radar): Increase instance and volume quota in devtools project for puppetmaster upgrade - https://phabricator.wikimedia.org/T360823#9655548 (taavi) +1
[22:37:48] cloud-services-team, VPS-Projects, collaboration-services, Puppet (Puppet 7.0): Update devtools project puppetmaster - https://phabricator.wikimedia.org/T360470#9655564 (Andrew)
[22:39:01] Cloud-VPS (Quota-requests), collaboration-services, Release-Engineering-Team (Radar): Increase instance and volume quota in devtools project for puppetmaster upgrade - https://phabricator.wikimedia.org/T360823#9655561 (Andrew) Open→Resolved a:Andrew all set
[22:40:41] (CloudVPSDesignateLeaks) firing: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:45:41] (CloudVPSDesignateLeaks) firing: (5) Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:54:11] (PS1) Operator873: Fix log_params issue causing StewardBot to crash [labs/tools/stewardbots] - 
https://gerrit.wikimedia.org/r/1013612 (https://phabricator.wikimedia.org/T360790)
[22:56:04] (CR) Majavah: [C:+1] "Given that log processing is real-time I don't think we need any backwards compat here." [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/1013612 (https://phabricator.wikimedia.org/T360790) (owner: Operator873)
[22:56:28] (CR) Urbanecm: [C:+2] Fix log_params issue causing StewardBot to crash [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/1013612 (https://phabricator.wikimedia.org/T360790) (owner: Operator873)
[22:57:01] (Merged) jenkins-bot: Fix log_params issue causing StewardBot to crash [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/1013612 (https://phabricator.wikimedia.org/T360790) (owner: Operator873)
[23:15:26] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE, Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9655633 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudbackup2003.codfw....
[23:33:28] (PuppetStaleCertificates) firing: (2) Found non-revoked Puppet certificates for 1 deleted instances on project-proxy-puppetmaster-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates
[23:43:37] (PS1) Operator873: Fix log_action as well [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/1013623 (https://phabricator.wikimedia.org/T360790)
[23:44:08] (CR) CI reject: [V:-1] Fix log_action as well [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/1013623 (https://phabricator.wikimedia.org/T360790) (owner: Operator873)
[23:46:46] (PS1) Operator873: Fix log_action as well [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/1013625 (https://phabricator.wikimedia.org/T360790)
[23:47:02] (Abandoned) Operator873: Fix log_action as well [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/1013623 (https://phabricator.wikimedia.org/T360790) (owner: Operator873)
[23:52:57] (CR) Urbanecm: [C:+2] Fix log_action as well [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/1013625 (https://phabricator.wikimedia.org/T360790) (owner: Operator873)
[23:59:00] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE, Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9655682 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudbackup2003.codfw.wmne...