[00:14:39] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942772 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1037.eqiad.wmnet with OS bullseye comp...
[00:16:26] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942773 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1038.eqiad.wmnet with OS bullseye comp...
[00:16:44] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942777 (Jclark-ctr)
[00:16:56] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942778 (Jclark-ctr) a:Jclark-ctr
[00:17:19] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942779 (Jclark-ctr)
[00:18:15] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942781 (Jclark-ctr) @VRiley-WMF if you can update with 2nd network connection then hand over to @cmooney
[00:27:21] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9942790 (Jclark-ctr) @Andrew @dcaro thank you for providing update did you have host names for this and please update preseed.yaml, and site.pp
[01:05:11] FIRING: SystemdUnitDown: The systemd unit backup_cinder_volumes.service on node cloudbackup1002-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:19:56] FIRING: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:05:11] FIRING: SystemdUnitDown: The systemd unit backup_cinder_volumes.service on node cloudbackup1002-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:19:57] FIRING: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:43:09] cloud-services-team, Data-Services, Infrastructure Security: wikireplicas root access - https://phabricator.wikimedia.org/T344599#9943079 (Marostegui) >>! In T344599#9940511, @fnegri wrote: >> If I may @fnegri, the issue is that those hosts are in a way special > > @jcrespo you absolutely may :) I'm...
[05:46:19] Data-Services: [wikireplicas] Automated tests for views - https://phabricator.wikimedia.org/T368050#9943085 (Marostegui) >>! In T368050#9940598, @fnegri wrote: > @Marostegui thanks that's very useful! I can try implementing this as a Python script called by a systemd timer on each clouddb host. Then we can p...
[06:44:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-elastic-4 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[07:24:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance tools-elastic-4 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[07:53:24] Data-Services, Data-Engineering-Icebox: Discuss labsdb visibility of rev_text_id and ar_comment - https://phabricator.wikimedia.org/T158166#9943298 (Marostegui) Open→Declined All those fields are gone ar_comment {T233135} rev_text_id https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Delet...
[08:07:26] Data-Services: CONVERT_TZ fails because named time zones have not been loaded - https://phabricator.wikimedia.org/T323183#9943342 (Blahma) Thank you, @fnegri, for looking into this! Now that we know that a "solution exists", I will be happy to see it implemented one day. My specific purpose is a research i...
[08:10:47] Data-Services, Data-Engineering-Icebox: Discuss labsdb visibility of rev_text_id and ar_comment - https://phabricator.wikimedia.org/T158166#9943352 (Zache) I am still interested for archive comments as it makes possible to for example analyse if there were notability discussion before page was delete...
[08:31:16] cloud-services-team, Technical-blog-posts: Tech blog post: "Wikimedia Toolforge: migrating Kubernetes from PodSecurityPolicy to kyverno" - https://phabricator.wikimedia.org/T368948#9943427 (aborrero) a:debt Thanks for the review, replied in the document.
[08:58:13] (merge) aborrero: deployment: drop PSP [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/47 (https://phabricator.wikimedia.org/T368142)
[08:59:10] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9943498 (cmooney) Resolved→Open
[08:59:21] Data-Services, Data-Engineering-Icebox: Discuss labsdb visibility of rev_text_id and ar_comment - https://phabricator.wikimedia.org/T158166#9943500 (Marostegui) Would you mind creating a task for that field? Just to have a clearer task, as this one is a bit messy and can be confusing with some of the...
[09:00:00] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: builds-builder: bump to 0.0.106-20240702085825-e1519ac7 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/371 (https://phabricator.wikimedia.org/T368142)
[09:05:11] FIRING: SystemdUnitDown: The systemd unit backup_cinder_volumes.service on node cloudbackup1002-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:08:31] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-builder
[09:08:48] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-builder
[09:10:24] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9943509 (cmooney) >>! In T363341#9936269, @Jclark-ctr wrote: > cloudcephosd1039 > 2nd cable serial#20220008 port 1 > cloudcephosd1040 > 2nd cable serial#...
[09:10:37] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-builder
[09:10:55] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-builder
[09:11:22] (merge) aborrero: builds-builder: bump to 0.0.106-20240702085825-e1519ac7 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/371 (https://phabricator.wikimedia.org/T368142) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[09:15:14] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9943513 (cmooney) Open→Resolved
[09:19:57] FIRING: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:21:46] (open) aborrero: cert-manager: drop internal PSP definitions [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/372 (https://phabricator.wikimedia.org/T368142)
[09:22:05] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component cert-manager
[09:22:24] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component cert-manager
[09:22:43] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component cert-manager
[09:23:00] (merge) aborrero: cert-manager: drop internal PSP definitions [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/372 (https://phabricator.wikimedia.org/T368142)
[09:23:00] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component cert-manager
[09:24:21] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "wikidumpparse" project Buster deprecation - https://phabricator.wikimedia.org/T367561#9943555 (notconfusing) Hello, sorry to be running quite late to this thread. (I just had the birth of my second child). I really intend to keep humaniki online. I might nee...
[09:45:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance project-proxy-puppetserver-1 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[09:46:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance paws-puppetserver-1 on project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[09:50:29] FIRING: PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-internal-puppetserver-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[09:54:07] (open) aborrero: wmcs-k8s-metrics: kube-state-metrics: drop internal PSP definition [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/373 (https://phabricator.wikimedia.org/T368142)
[09:56:05] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component wmcs-k8s-metrics
[09:56:19] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component wmcs-k8s-metrics
[09:56:45] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component wmcs-k8s-metrics
[09:56:59] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component wmcs-k8s-metrics
[09:57:17] (merge) aborrero: wmcs-k8s-metrics: kube-state-metrics: drop internal PSP definition [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/373 (https://phabricator.wikimedia.org/T368142)
[09:59:28] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance tools-elastic-4 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:02:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-puppetserver-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:04:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance gitlab-runners-puppetserver-01 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:05:29] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-cloudvps-puppetserver-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:10:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance metricsinfra-puppetserver-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:16:54] (open) aborrero: tekton-pipelines: drop internal PSP definitions [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/48 (https://phabricator.wikimedia.org/T368142)
[10:19:07] (merge) aborrero: tekton-pipelines: drop internal PSP definitions [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/48 (https://phabricator.wikimedia.org/T368142)
[10:20:15] (update) phuedx: Add test stream configs for dogfooding [toolforge-repos/mwdemo] - https://gitlab.wikimedia.org/toolforge-repos/mwdemo/-/merge_requests/1 (owner: cjming)
[10:26:00] (update) phuedx: Add test stream configs for dogfooding [toolforge-repos/mwdemo] - https://gitlab.wikimedia.org/toolforge-repos/mwdemo/-/merge_requests/1 (owner: cjming)
[10:30:33] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: builds-builder: bump to 0.0.107-20240702102918-afd8fe1a [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/374 (https://phabricator.wikimedia.org/T368142)
[10:37:56] (update) phuedx: Add test stream configs for dogfooding [toolforge-repos/mwdemo] - https://gitlab.wikimedia.org/toolforge-repos/mwdemo/-/merge_requests/1 (owner: cjming)
[10:48:06] cloud-services-team, Data-Services, SRE: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9943798 (Marostegui) Just one addition: sanitarium hosts also have replication filters to exclude tables or entire databases (private wikis).
[10:48:32] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services: [wikireplicas] Views flaggedpage_pending and flaggedtemplates are broken - https://phabricator.wikimedia.org/T368939#9943794 (fnegri) a:fnegri @Marostegui thanks for the patch. I will remove those tables from `maintain-views.yaml` and run the co...
[10:53:18] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-builder
[10:53:35] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-builder
[10:53:58] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-builder
[10:54:08] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-builder
[10:54:20] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-builder
[10:54:39] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-builder
[10:54:52] (merge) aborrero: builds-builder: bump to 0.0.107-20240702102918-afd8fe1a [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/374 (https://phabricator.wikimedia.org/T368142) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[11:01:36] cloud-services-team, Toolforge, Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9943831 (aborrero)
[11:01:42] cloud-services-team, Toolforge, Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9943832 (aborrero) In progress→Resolved
[11:02:11] cloud-services-team, Toolforge: [k8s,infra] track PSP migration plan - https://phabricator.wikimedia.org/T364297#9943855 (aborrero) In progress→Resolved
[11:03:21] cloud-services-team, Toolforge, Patch-For-Review: [infra] Replace PodSecurityPolicy in Toolforge Kubernetes - https://phabricator.wikimedia.org/T279110#9943857 (aborrero) In progress→Resolved everything done.
[11:07:08] (open) aborrero: kind: drop TTLAfterFinished feature gate flag [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/154 (https://phabricator.wikimedia.org/T349197)
[11:18:02] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9943936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin10...
[11:30:28] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance metricsinfra-puppetserver-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:30:28] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance project-proxy-puppetserver-1 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:31:29] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance paws-puppetserver-1 on project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:35:29] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-cloudvps-puppetserver-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:42:26] cloud-services-team, Toolforge, Kubernetes, Patch-For-Review: [infra] Remove TTLAfterFinished from config before upgrade to 1.25 - https://phabricator.wikimedia.org/T349197#9944012 (aborrero) Open→In progress a:aborrero
[11:44:28] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance tools-elastic-4 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:44:45] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9944030 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 f...
[11:47:28] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-puppetserver-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:48:36] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9944039 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin10...
[11:50:30] RESOLVED: [2x] PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-cloudvps-puppetserver-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:50:35] Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: [reimage,ceph] reimaging cloudcephosd hosts gets stuck in network configuration screen - https://phabricator.wikimedia.org/T369026 (dcaro) NEW
[11:52:02] Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: [reimage,ceph] reimaging cloudcephosd hosts gets stuck in network configuration screen - https://phabricator.wikimedia.org/T369026#9944055 (dcaro) Open→In progress p:Triage→High
[11:54:28] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance gitlab-runners-puppetserver-01 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:56:23] cloud-services-team, Toolforge, Kubernetes, Patch-For-Review: [infra] Remove TTLAfterFinished from config before upgrade to 1.25 - https://phabricator.wikimedia.org/T349197#9944064 (aborrero) removed from ops puppet in commit: https://gerrit.wikimedia.org/r/c/operations/puppet/+/791368
[11:56:38] (merge) aborrero: kind: drop TTLAfterFinished feature gate flag [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/154 (https://phabricator.wikimedia.org/T349197)
[11:56:44] Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: [reimage,ceph] reimaging cloudcephosd hosts gets stuck in network configuration screen - https://phabricator.wikimedia.org/T369026#9944065 (dcaro)
[11:57:01] cloud-services-team, Toolforge, Kubernetes, Patch-For-Review: [infra] Remove TTLAfterFinished from config before upgrade to 1.25 - https://phabricator.wikimedia.org/T349197#9944067 (aborrero) In progress→Resolved
[11:59:35] cloud-services-team, Toolforge: toolforge: kyverno: enable monitoring - https://phabricator.wikimedia.org/T368515#9944079 (aborrero)
[12:02:46] cloud-services-team, Toolforge: toolforge: kyverno: enable monitoring - https://phabricator.wikimedia.org/T368515#9944084 (aborrero) Open→In progress p:Triage→Medium a:aborrero
[12:11:08] (approved) dcaro: pre-commit: Autoupdate [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/26 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[12:11:11] (merge) dcaro: pre-commit: Autoupdate [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/26 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[12:11:31] (approved) dcaro: pre-commit: Autoupdate [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/99 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[12:11:34] (merge) dcaro: pre-commit: Autoupdate [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/99 (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[12:13:12] (update) project_1317_bot_df3177307bed93c3f34e421e26c86e38: api-gateway: bump to 0.0.26-20240702121120-cebdcedc [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/375
[12:13:14] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: api-gateway: bump to 0.0.26-20240702121120-cebdcedc [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/375
[12:18:31] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: builds-api: bump to 0.0.158-20240702121148-b93443d8 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/376
[12:26:23] (open) aborrero: kyverno: enable prometheus monitoring [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/377 (https://phabricator.wikimedia.org/T368515)
[12:29:14] (close) aborrero: kyverno: enable prometheus monitoring [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/377 (https://phabricator.wikimedia.org/T368515)
[12:52:50] cloud-services-team, Toolforge, Patch-For-Review: toolforge: kyverno: enable monitoring - https://phabricator.wikimedia.org/T368515#9944384 (aborrero) created dashboard: https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools
[12:53:31] cloud-services-team, Toolforge, Patch-For-Review: toolforge: kyverno: enable monitoring - https://phabricator.wikimedia.org/T368515#9944385 (aborrero)
[13:01:27] (open) aborrero: utils/update_component.sh: fail if no chartVersion tag was found at all [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/378
[13:03:01] Toolforge (Toolforge iteration 11): envvars-api 0.0.50 depends on unreleased envvars-cli changes - https://phabricator.wikimedia.org/T367961#9944396 (dcaro) Open→In progress
[13:05:11] FIRING: SystemdUnitDown: The systemd unit backup_cinder_volumes.service on node cloudbackup1002-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:05:43] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "dumps" project Buster deprecation - https://phabricator.wikimedia.org/T367528#9944402 (Andrew) btw, ideally the VMs should be replaced with new builds rather than upgraded in place. in-place upgrades still show up in our reporting as running the old OS, and g...
[13:07:32] (update) aborrero: utils/update_component.sh: fail if no chartVersion tag was found at all [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/378
[13:12:40] cloud-services-team, Toolforge (Toolforge iteration 11), Patch-For-Review: toolforge: kyverno: enable monitoring - https://phabricator.wikimedia.org/T368515#9944425 (aborrero)
[13:12:49] cloud-services-team, Toolforge (Toolforge iteration 11): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.25 - https://phabricator.wikimedia.org/T316107#9944426 (aborrero)
[13:13:50] cloud-services-team, Toolforge (Toolforge iteration 11): [infra,k8s] package k9s for use in kubernetes - https://phabricator.wikimedia.org/T366061#9944432 (aborrero)
[13:14:10] cloud-services-team, Toolforge (Toolforge iteration 11): toolforge: kubernetes can't revoke certificates - https://phabricator.wikimedia.org/T365681#9944437 (aborrero)
[13:15:15] Toolforge (Toolforge iteration 11), Patch-For-Review: [builds-api,jobs-api,envvars-api,api-gateway] Figure out and document how to do non-backwards compatible changes - https://phabricator.wikimedia.org/T356974#9944435 (Raymond_Ndibe) In progress→Stalled
[13:16:05] Toolforge (Toolforge iteration 11): [jobs-cli] enforce proper validation for load jobs before calculate_changes - https://phabricator.wikimedia.org/T366211#9944439 (Raymond_Ndibe) Open→In progress
[13:16:48] cloud-services-team, Toolforge (Toolforge iteration 11), Patch-For-Review: Upgrade Toolforge (Elastic|Open)Search cluster to Debian Bullseye - https://phabricator.wikimedia.org/T311905#9944451 (Andrew)
[13:17:13] Toolforge (Toolforge iteration 11): [jobs-api] Save business models in a DB - https://phabricator.wikimedia.org/T359650#9944445 (Raymond_Ndibe) Open→In progress
[13:19:01] cloud-services-team, Toolforge (Toolforge iteration 11): toolforge: kubernetes can't revoke certificates - https://phabricator.wikimedia.org/T365681#9944488 (fnegri) a:aborrero
[13:19:57] FIRING: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:23:26] cloud-services-team, Toolforge (Toolforge iteration 11): [infra,k8s] package k9s for use in kubernetes - https://phabricator.wikimedia.org/T366061#9944497 (dcaro) a:aborrero
[13:23:41] cloud-services-team, Toolforge (Toolforge iteration 11), Patch-For-Review: [toolforge,storage] Provide per-tool access to cloud-vps object storage - https://phabricator.wikimedia.org/T358496#9944499 (Andrew)
[13:23:50] cloud-services-team, Toolforge (Toolforge iteration 11): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.25 - https://phabricator.wikimedia.org/T316107#9944500 (dcaro) p:Triage→High
[13:23:58] cloud-services-team, Toolforge (Toolforge iteration 11): toolforge: kubernetes can't revoke certificates - https://phabricator.wikimedia.org/T365681#9944501 (dcaro) p:Triage→High
[13:24:21] cloud-services-team, Toolforge (Toolforge iteration 11), Patch-For-Review: Upgrade Toolforge (Elastic|Open)Search cluster to Debian Bullseye - https://phabricator.wikimedia.org/T311905#9944507 (dcaro) p:Triage→High
[13:25:18] Toolforge (Toolforge iteration 11), Patch-For-Review: [envvars-api] version 0.0.50 introduces breaking changes that need adapting for replica_cnf service - https://phabricator.wikimedia.org/T368516#9944511 (dcaro) p:Triage→High
[13:25:32] Toolforge (Toolforge iteration 11): [toolforge,replica_cnf] Use tool-prefixed urls for envvars - https://phabricator.wikimedia.org/T368909#9944508 (dcaro) p:Triage→High a:Slst2020→dcaro
[13:26:30] cloud-services-team, Cloud-VPS: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 (Andrew) NEW
[13:29:36] Toolforge (Toolforge iteration 11): [toolforge,replica_cnf] Use tool-prefixed urls for envvars - https://phabricator.wikimedia.org/T368909#9944536 (dcaro) Open→In progress
[13:29:56] Toolforge (Toolforge iteration 11): [toolforge-deploy] envvars functional tests fail when out of quota - https://phabricator.wikimedia.org/T367169#9944551 (dcaro) p:Triage→Low
[13:31:03] Toolforge (Toolforge iteration 11), Patch-For-Review: [envvars-api] version 0.0.50 introduces breaking changes that need adapting for replica_cnf service - https://phabricator.wikimedia.org/T368516#9944533 (dcaro) Open→In progress
[13:31:11] Toolforge (Toolforge iteration 11): [toolforge-deploy] envvars functional tests fail when out of quota - https://phabricator.wikimedia.org/T367169#9944575 (dcaro) @Slst2020 do you think that this is still relevant?
[13:32:41] 10Toolforge (Toolforge iteration 11): [builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972#9944597 (10dcaro) a:03dcaro [13:32:50] 10Toolforge (Toolforge iteration 11): Toolforge Aptfile not producing working copy of `ffmpeg` - https://phabricator.wikimedia.org/T365633#9944584 (10dcaro) 05Open→03In progress [13:33:29] 10Toolforge (Toolforge iteration 11), 07Epic: [jobs-cli,builds-cli,toolforge-cli,webservice] Consolidate the Toolforge CLIs - https://phabricator.wikimedia.org/T356262#9944600 (10dcaro) a:03Slst2020 [13:34:06] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Toolforge: [docs] Create a tutorial on how to deploy a Node.js app using Build Service - https://phabricator.wikimedia.org/T353313#9944601 (10dcaro) [13:35:40] 06cloud-services-team, 10Toolforge (Toolforge iteration 11): [api-gateway] add alert for uptime - https://phabricator.wikimedia.org/T348633#9944607 (10dcaro) a:03Slst2020 [13:36:18] 10Toolforge (Toolforge iteration 11), 07Documentation: [harbor,docs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092#9944604 (10dcaro) 05In progress→03Stalled [13:55:22] 10Toolforge: [builds-api, envvars-api] Use the aliases instead of ResponseMessage type - https://phabricator.wikimedia.org/T366870#9944782 (10dcaro) a:03Slst2020 [13:55:34] 10Toolforge: [builds-api, envvars-api] Use the aliases instead of ResponseMessage type - https://phabricator.wikimedia.org/T366870#9944784 (10dcaro) a:05Slst2020→03None [13:55:41] 10Toolforge: [builds-api, envvars-api] Use the aliases instead of ResponseMessage type - https://phabricator.wikimedia.org/T366870#9944785 (10dcaro) p:05Triage→03Low [13:59:16] 06cloud-services-team, 10Toolforge (Toolforge iteration 12), 13Patch-For-Review: [toolforge,storage] Provide per-tool access to cloud-vps object storage - https://phabricator.wikimedia.org/T358496#9944798 (10dcaro) [13:59:16] 
cloud-services-team, Toolforge (Toolforge iteration 12), Patch-For-Review: Upgrade Toolforge (Elastic|Open)Search cluster to Debian Bullseye - https://phabricator.wikimedia.org/T311905#9944799 (dcaro)
[13:59:19] cloud-services-team, Toolforge (Toolforge iteration 12): toolforge: kubernetes can't revoke certificates - https://phabricator.wikimedia.org/T365681#9944800 (dcaro)
[13:59:21] cloud-services-team, Toolforge (Toolforge iteration 12): [infra,k8s] package k9s for use in kubernetes - https://phabricator.wikimedia.org/T366061#9944801 (dcaro)
[13:59:24] cloud-services-team, Toolforge (Toolforge iteration 12): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.25 - https://phabricator.wikimedia.org/T316107#9944802 (dcaro)
[13:59:25] Toolforge (Toolforge iteration 12): [builds-api] Remove authentication and use the api-gateway provided headers - https://phabricator.wikimedia.org/T367182#9944803 (dcaro)
[13:59:29] Toolforge (Toolforge iteration 12): [envvars-api] Remove authentication and use api-gateway provided headers - https://phabricator.wikimedia.org/T367181#9944804 (dcaro)
[13:59:31] Toolforge (Toolforge iteration 12): [jobs-api] Remove authentication and use the api-gateway provided headers - https://phabricator.wikimedia.org/T367180#9944805 (dcaro)
[13:59:33] Toolforge (Toolforge iteration 12): [toolforge-deploy] envvars functional tests fail when out of quota - https://phabricator.wikimedia.org/T367169#9944806 (dcaro)
[13:59:37] Toolforge (Toolforge iteration 12): [builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972#9944807 (dcaro)
[13:59:45] Toolforge (Toolforge iteration 12), Epic: [jobs-cli,builds-cli,toolforge-cli,webservice] Consolidate the Toolforge CLIs - https://phabricator.wikimedia.org/T356262#9944808 (dcaro)
[13:59:49] cloud-services-team, Toolforge (Toolforge iteration 12): [api-gateway] add
alert for uptime - https://phabricator.wikimedia.org/T348633#9944809 (dcaro)
[13:59:53] Toolforge (Toolforge iteration 12), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [maintain-harbor,docs] Document current setup and admin procedures - https://phabricator.wikimedia.org/T329176#9944810 (dcaro)
[14:01:14] Toolforge (Toolforge iteration 12): [jobs-cli] enforce proper validation for load jobs before calculate_changes - https://phabricator.wikimedia.org/T366211#9944826 (dcaro)
[14:02:09] Toolforge (Toolforge iteration 12): [jobs-api] Save business models in a DB - https://phabricator.wikimedia.org/T359650#9944828 (dcaro)
[14:02:25] Toolforge (Toolforge iteration 12), Patch-For-Review: [jobs-api] move jobs load feature to the backend - https://phabricator.wikimedia.org/T366209#9944834 (dcaro)
[14:02:31] Toolforge (Toolforge iteration 12): Toolforge Aptfile not producing working copy of `ffmpeg` - https://phabricator.wikimedia.org/T365633#9944832 (dcaro)
[14:02:48] cloud-services-team, Toolforge (Toolforge iteration 12): toolforge: kyverno: enable monitoring - https://phabricator.wikimedia.org/T368515#9944830 (dcaro)
[14:03:18] Toolforge (Toolforge iteration 12): [toolforge,replica_cnf] Use tool-prefixed urls for envvars - https://phabricator.wikimedia.org/T368909#9944836 (dcaro)
[14:03:48] Toolforge (Toolforge iteration 12): envvars-api 0.0.50 depends on unreleased envvars-cli changes - https://phabricator.wikimedia.org/T367961#9944842 (dcaro)
[14:04:10] Toolforge (Toolforge iteration 12): [jobs-api,builds-api,envvars-api] consolidate api paths - https://phabricator.wikimedia.org/T365014#9944840 (dcaro)
[14:04:28] Toolforge (Toolforge iteration 12), Patch-For-Review: [jobs-api] Split the API, business, and k8s models - https://phabricator.wikimedia.org/T359808#9944848 (dcaro)
[14:04:44] Toolforge (Toolforge iteration 12): [api-gateway] Move authentication from the APIs -
https://phabricator.wikimedia.org/T367179#9944844 (dcaro)
[14:05:14] Toolforge (Toolforge iteration 12), Patch-For-Review: [envvars-api] version 0.0.50 introduces breaking changes that need adapting for replica_cnf service - https://phabricator.wikimedia.org/T368516#9944838 (dcaro)
[14:05:20] Toolforge (Toolforge iteration 12), Patch-For-Review: [toolforge] Investigate authentication - https://phabricator.wikimedia.org/T363983#9944850 (dcaro)
[14:05:33] cloud-services-team, Toolforge (Toolforge iteration 12): [infra,k8s,monitoring] Add an alert to warn when the prometheus k8s cert is about to expire - https://phabricator.wikimedia.org/T366579#9944856 (dcaro)
[14:05:49] Toolforge (Toolforge iteration 12), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#9944846 (dcaro)
[14:05:52] Toolforge (Toolforge iteration 12): [toolforge-cli,jobs-cli,builds-cli,envvars-cli] Explore OpenAPI SDK tooling for client consolidation - https://phabricator.wikimedia.org/T356261#9944852 (dcaro)
[14:06:25] Toolforge (Toolforge iteration 12), Patch-For-Review, Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417#9944876 (dcaro)
[14:06:30] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 12): Intermittent redis connection timeouts in Toolforge - https://phabricator.wikimedia.org/T318479#9944854 (dcaro)
[14:06:43] Toolforge (Toolforge iteration 12): [toolforge] simplify calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377#9944880 (dcaro)
[14:07:13] Toolforge (Toolforge iteration 12), Upstream: [builds-builder] golang based images get infinite nested loops for procfile entries - https://phabricator.wikimedia.org/T363417#9944878 (dcaro)
[14:07:19] Toolforge (Toolforge iteration 12),
Patch-For-Review: [builds-api,jobs-api,envvars-api,api-gateway] Figure out and document how to do non-backwards compatible changes - https://phabricator.wikimedia.org/T356974#9944874 (dcaro)
[14:07:21] Toolforge (Toolforge iteration 12): [builds-cli,builds-api] `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701#9944872 (dcaro)
[14:07:27] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 12), Goal: [infra] Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664#9944886 (dcaro)
[14:08:10] Toolforge (Toolforge iteration 12), Upstream: [builds-builder,jobs-api,upstream] Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016#9944882 (dcaro)
[14:08:57] Toolforge (Toolforge iteration 12), Documentation: [harbor,docs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092#9944884 (dcaro)
[14:09:08] cloud-services-team, Toolforge (Toolforge iteration 12): toolforge: Refresh certs that are not controlled by kubeadm (mid 2024 edition) - https://phabricator.wikimedia.org/T309782#9944891 (dcaro)
[14:09:24] cloud-services-team, Toolforge (Toolforge iteration 12), Patch-For-Review: Toolforge: Replace all bastion with grid-less bookworm based bastion hosts - https://phabricator.wikimedia.org/T314665#9944888 (dcaro)
[14:10:38] Toolforge (Toolforge iteration 12): [envvars-api, envvars-cli] Prefix all endpoints with `/tool/` - https://phabricator.wikimedia.org/T363809#9944897 (dcaro)
[14:10:50] Toolforge (Toolforge iteration 12), Patch-For-Review: [jobs-api, jobs-cli] Prefix all endpoints with `/tool/` - https://phabricator.wikimedia.org/T363346#9944895 (dcaro)
[14:12:02] Toolforge (Toolforge iteration 12), Patch-For-Review: [builds-api, builds-cli] Prefix all endpoints with `/tool/` -
https://phabricator.wikimedia.org/T363808#9944899 (dcaro)
[14:15:58] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9944985 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 f...
[14:19:53] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9945018 (ops-monitoring-bot) Host rebooted by dcaro@cumin1002 with reason: upgraded packages
[14:24:12] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T309789)
[14:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:24:18] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[14:29:22] cloud-services-team, Toolforge: kyverno: explore change from per-namespace policy resource to a single ClusterPolicy resource - https://phabricator.wikimedia.org/T368135#9945052 (aborrero) p:Triage→Low
[14:30:04] cloud-services-team, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9945048 (aborrero) Open→Stalled marking as stalled, because the work on ceph nodes wont be progressing for a while.
[14:30:07] cloud-services-team, Toolforge (Toolforge iteration 12): [toolforge,infra] Fix deprecated Kubelet flags - https://phabricator.wikimedia.org/T355881#9945057 (aborrero)
[14:31:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:32:53] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, DC-Ops, ops-eqiad, SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945091 (CDanis) >>! In T348643#9931318, @dcaro wrote: > Any ideas/recommendations on how to proceed next? > > I...
[14:33:55] cloud-services-team, Toolforge (Toolforge iteration 12): [toolforge,infra] Fix deprecated Kubelet flags - https://phabricator.wikimedia.org/T355881#9945099 (aborrero) I believe the kubelet setup is maintained by kubeadm. We don't configure anything ourselves directly, no?
[14:35:15] Toolforge (Toolforge iteration 12), Patch-For-Review: [toolforge,replica_cnf] Use tool-prefixed urls for envvars - https://phabricator.wikimedia.org/T368909#9945095 (dcaro) In progress→Resolved
[14:39:34] cloud-services-team, Cloud-VPS, collaboration-services: VMs in Cloud VPS share the same machine-id - https://phabricator.wikimedia.org/T351507#9945158 (Andrew) a:Andrew
[14:41:10] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:45:14] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "dumps" project Buster deprecation - https://phabricator.wikimedia.org/T367528#9945179 (Hydriz) @Nemo_bis We can proceed to delete those VMs and spin up new ones later. The archiving scripts haven't been working for some time and I am still in the midst of rev...
[14:46:09] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component envvars-api
[14:46:13] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=99) for component envvars-api
[14:47:29] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component envvars-api
[14:47:38] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component envvars-api
[14:50:39] (update) dcaro: envvars-api: bump to 0.0.50-20240619035607-42829b67 again [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/354 (https://phabricator.wikimedia.org/T368516)
[15:00:47] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, DC-Ops, ops-eqiad, SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945281 (dcaro) >>! In T348643#9945091, @CDanis wrote: >>>! In T348643#9931318, @dcaro wrote: >> Any ideas/recomme...
[15:01:18] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component envvars-api
[15:01:29] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component envvars-api
[15:01:31] (close) cjming: Add test stream configs for dogfooding [toolforge-repos/mwdemo] - https://gitlab.wikimedia.org/toolforge-repos/mwdemo/-/merge_requests/1
[15:05:07] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, DC-Ops, ops-eqiad, SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945304 (CDanis) Yeah okay, that's all pretty messy to potentially clean up from. Have you tried the `ceph-syn` t...
[15:12:51] (approved) dcaro: envvars-api: bump to 0.0.50-20240619035607-42829b67 again [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/354 (https://phabricator.wikimedia.org/T368516)
[15:12:56] (merge) dcaro: envvars-api: bump to 0.0.50-20240619035607-42829b67 again [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/354 (https://phabricator.wikimedia.org/T368516)
[15:15:51] Toolforge (Toolforge iteration 12), Patch-For-Review: [envvars-api] version 0.0.50 introduces breaking changes that need adapting for replica_cnf service - https://phabricator.wikimedia.org/T368516#9945328 (dcaro) In progress→Resolved
[15:25:35] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Research: Replace or remove Debian Buster VMs in 'wmf-research-tools' cloud-vps project - https://phabricator.wikimedia.org/T367444#9945402 (Miriam) a:Isaac
[15:28:09] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Research: Replace or remove Debian Buster VMs in 'wmf-research-tools' cloud-vps project - https://phabricator.wikimedia.org/T367444#9945404 (Miriam) @Isaac will coordinate with the rest of the team to take care of this, by July 15th we shoul...
[15:40:41] cloud-services-team, Toolforge (Toolforge iteration 12), Patch-For-Review: [toolforge,infra] Fix deprecated Kubelet flags - https://phabricator.wikimedia.org/T355881#9945467 (aborrero) >>! In T355881#9945099, @aborrero wrote: > I believe the kubelet setup is maintained by kubeadm. We don't configure...
[15:43:13] cloud-services-team, Data-Services, Infrastructure Security: wikireplicas root access - https://phabricator.wikimedia.org/T344599#9945474 (fnegri) > Yes, they are special and will always remain like that for many reasons including: I want to clarify that in my previous comment "special" was meant as...
[15:50:52] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Research: Replace or remove Debian Buster VMs in 'wmf-research-tools' cloud-vps project - https://phabricator.wikimedia.org/T367444#9945492 (Isaac) Thanks for flagging this @Andrew -- apologies, I hadn't seen something from StrikerBot so thi...
[15:59:25] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Quarry: Allow Quarry to query ToolsDB public databases - https://phabricator.wikimedia.org/T348407#9945538 (fnegri) In progress→Resolved > I will give people a 2-week notice for this change, and enable access to all _p databases on M...
[16:13:57] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, DC-Ops, ops-eqiad, SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945668 (dcaro) >>! In T348643#9945304, @CDanis wrote: > Yeah okay, that's all pretty messy to potentially clean u...
[16:32:14] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, DC-Ops, ops-eqiad, SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945806 (CDanis) Ah okay sorry. Maybe experiment with running `rados bench` and slowly increasing the number of n...
[16:34:17] Cloud-VPS (Debian Buster Deprecation), Research: Cloud VPS "research-collaborations-api" project Buster deprecation - https://phabricator.wikimedia.org/T367551#9945817 (Isaac)
[16:44:28] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance tools-elastic-4 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[16:44:29] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-elastic-4 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[16:44:31] Toolforge (Toolforge iteration 12), Patch-For-Review: [toolforge] Investigate authentication - https://phabricator.wikimedia.org/T363983#9945834 (dcaro)
[16:48:06] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, DC-Ops, ops-eqiad, SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945840 (dcaro) >>! In T348643#9945806, @CDanis wrote: > Ah okay sorry. Maybe experiment with running `rados benc...
[16:48:14] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api
[16:48:26] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api
[16:50:14] Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: [reimage,ceph] reimaging cloudcephosd hosts gets stuck in network configuration screen - https://phabricator.wikimedia.org/T369026#9945841 (dcaro)
[16:51:26] Tool-masto-collab: HTTP status client error (422 Unprocessable Entity) on posting with media - https://phabricator.wikimedia.org/T363314#9945842 (Legoktm) Yeah, https://github.com/mastodon/mastodon/issues/6569 is the upstream ticket about not supporting SVG. We can handle this better in the tool though to di...
[16:51:34] Tool-masto-collab: HTTP status client error (422 Unprocessable Entity) on posting with SVG media - https://phabricator.wikimedia.org/T363314#9945843 (Legoktm)
[16:53:33] Toolforge (Toolforge iteration 12), Patch-For-Review: [toolforge] Investigate authentication - https://phabricator.wikimedia.org/T363983#9945846 (dcaro)
[16:53:55] Cloud-VPS (Quota-requests): puppet-diffs quota request for buster migration - https://phabricator.wikimedia.org/T368669#9945847 (fnegri) Open→In progress a:fnegri
[16:54:28] Cloud-VPS (Quota-requests): puppet-diffs quota request for buster migration - https://phabricator.wikimedia.org/T368669#9945851 (dcaro) +1
[16:54:30] Cloud-VPS (Quota-requests): puppet-diffs quota request for buster migration - https://phabricator.wikimedia.org/T368669#9945852 (fnegri) p:Triage→Medium
[16:55:17] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api
[16:55:29] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api
[16:55:30] (update) dcaro: builds-api: bump to 0.0.158-20240702121148-b93443d8 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/376 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[16:58:41] (update) dcaro: api: auth and proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[16:59:44] (approved) dcaro: builds-api: bump to 0.0.158-20240702121148-b93443d8 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/376 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[16:59:47] (merge) dcaro: builds-api: bump to
0.0.158-20240702121148-b93443d8 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/376 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[16:59:48] (update) dcaro: builds-api: bump to 0.0.158-20240702121148-b93443d8 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/376 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[16:59:51] !log fnegri@cloudcumin1001 puppet-diffs START - Cookbook wmcs.openstack.quota_increase (T368669)
[16:59:54] T368669: puppet-diffs quota request for buster migration - https://phabricator.wikimedia.org/T368669
[16:59:56] Tool-masto-collab: HTTP status client error (422 Unprocessable Entity) on posting with SVG media - https://phabricator.wikimedia.org/T363314#9945859 (Legoktm) We could also download the rasterized PNG version via the thumbnails 🤔
[16:59:59] !log fnegri@cloudcumin1001 puppet-diffs END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) (T368669)
[17:00:10] (update) dcaro: api-gateway: bump to 0.0.26-20240702121120-cebdcedc [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/375 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[17:00:20] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component api-gateway
[17:00:31] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component api-gateway
[17:01:50] Cloud-VPS (Quota-requests): puppet-diffs quota request for buster migration - https://phabricator.wikimedia.org/T368669#9945868 (fnegri) In progress→Resolved > Increased quotas by 5 cores, 20 gigabytes, 1 instances, 12288 ram
[17:05:12] FIRING: SystemdUnitDown: The systemd unit backup_cinder_volumes.service on node
cloudbackup1002-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:07:05] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component api-gateway
[17:07:17] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component api-gateway
[17:15:02] (update) dcaro: api-gateway: bump to 0.0.26-20240702121120-cebdcedc [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/375 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[17:15:04] (merge) dcaro: api-gateway: bump to 0.0.26-20240702121120-cebdcedc [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/375 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[17:19:57] FIRING: CloudVPSDesignateLeaks: Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[18:13:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[18:18:10] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning -
https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[18:46:29] Cloud-VPS (Debian Buster Deprecation), Research: Cloud VPS "research-collaborations-api" project Buster deprecation - https://phabricator.wikimedia.org/T367551#9946618 (Isaac) In progress→Resolved Being bold and resolving. Both instances have been migrated and when the report next updates (so...
[19:04:57] RESOLVED: SystemdUnitDown: The systemd unit backup_cinder_volumes.service on node cloudbackup1002-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:17:29] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T309789)
[19:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[19:17:35] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[19:27:29] FIRING: InstanceDown: Project tools instance tools-elastic-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:30:24] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), video2commons: Replace or remove Debian Buster VMs in 'video' cloud-vps project - https://phabricator.wikimedia.org/T360711#9946843 (Don-vip) My plan to solve this, with current status: # **Done**: Suspend gfg instance to check it is not us...
[19:32:29] RESOLVED: InstanceDown: Project tools instance tools-elastic-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:19:44] (update) raymond-ndibe: Draft: [jobs-cli] refactor handle_http_exception [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/46
[20:20:24] (update) raymond-ndibe: [jobs-cli] refactor handle_http_exception [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/46
[21:14:04] cloud-services-team, Technical-blog-posts: Tech blog post: "Wikimedia Toolforge: migrating Kubernetes from PodSecurityPolicy to kyverno" - https://phabricator.wikimedia.org/T368948#9947286 (debt) @aborrero @Andrew Can you take a look at the [[ https://techblog.wikimedia.org/?p=2474&preview=true | blog po...
[21:19:57] FIRING: CloudVPSDesignateLeaks: Detected 12 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:50:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[22:04:59] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "globaleducation" project Buster deprecation - https://phabricator.wikimedia.org/T367531#9947441 (Ragesoss) I'm most of the way there, and hope to finish up tomorrow.
[22:05:29] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "globaleducation" project Buster deprecation - https://phabricator.wikimedia.org/T367531#9947444 (Ragesoss)
[22:37:28] FIRING: PuppetStaleCertificates: Found non-revoked Puppet certificates for 4 deleted instances on toolsbeta-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates
[23:42:11] (open) raymond-ndibe: [lima-kilo] update bookworm arm64 image [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/155
[23:57:50] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "globaleducation" project Buster deprecation - https://phabricator.wikimedia.org/T367531#9947555 (Ragesoss)
[23:59:29] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "globaleducation" project Buster deprecation - https://phabricator.wikimedia.org/T367531#9947556 (Ragesoss) Open→Resolved a:Ragesoss Done!