[00:00:45] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: add on-wiki edits of toolforge tools to toolviews report - https://phabricator.wikimedia.org/T317953#10460789 (10bd808) A Toolforge job for this task was enabled which seemed to do nothing except crash and spam my mailbox. @Andrew disabled the job. [00:09:43] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: add on-wiki edits of toolforge tools to toolviews report - https://phabricator.wikimedia.org/T317953#10460801 (10Raymond_Ndibe) >>! In T317953#10460784, @bd808 wrote: > A Toolforge job for this task was enabled which seemed to do nothing except crash an... [00:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [00:59:22] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:08:44] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:11:42] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:12:05] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:13:11] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests 
[repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:18:02] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:24:41] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: add on-wiki edits of toolforge tools to toolviews report - https://phabricator.wikimedia.org/T317953#10460865 (10bd808) >>! In T317953#10460801, @Raymond_Ndibe wrote: > I was actively working on this just an hour ago @bd808 . Thanks for bringing it to m... [01:24:44] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:28:13] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:31:17] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:33:46] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:35:03] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:47:25] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests 
[repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:54:00] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:55:21] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:56:47] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [01:59:57] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [02:00:53] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [02:12:43] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [02:14:48] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [02:27:16] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 
10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [02:29:09] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [02:32:49] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [02:33:33] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [02:38:52] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [02:53:38] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [02:54:59] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [03:05:15] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [03:12:31] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [03:13:33] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [03:21:58] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy 
(exit_code=0) for component jobs-api [03:22:53] (03approved) 10raymond-ndibe: jobs-api: bump to 0.0.345-20250113175346-77c98100 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/655 (https://phabricator.wikimedia.org/T364204) (owner: 10group_203_bot_4866fc124f4b41659f667468a6115cf3) [03:22:57] (03merge) 10raymond-ndibe: jobs-api: bump to 0.0.345-20250113175346-77c98100 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/655 (https://phabricator.wikimedia.org/T364204) (owner: 10group_203_bot_4866fc124f4b41659f667468a6115cf3) [03:24:05] (03update) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [03:34:06] (03update) 10raymond-ndibe: maintain-harbor: bump to 0.0.20-20250113175254-7d5dce92 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/654 (owner: 10group_203_bot_4866fc124f4b41659f667468a6115cf3) [03:34:46] (03approved) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [03:34:50] (03merge) 10raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [03:35:40] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-harbor [03:36:30] (03update) 10raymond-ndibe: maintain-harbor: bump to 0.0.20-20250113175254-7d5dce92 [repos/cloud/toolforge/toolforge-deploy] - 
10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/654 (owner: 10group_203_bot_4866fc124f4b41659f667468a6115cf3) [03:36:40] !log raymond-ndibe@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component maintain-harbor [03:36:45] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-harbor [03:39:33] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-harbor [03:47:55] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-harbor [03:50:38] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-harbor [03:57:05] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-harbor [04:02:57] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-harbor [04:06:56] FIRING: SystemdUnitDown: The service unit purge_vm_rbd_images.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:12:18] (03update) 10raymond-ndibe: scheduled jobs: add timeout option [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/129 (https://phabricator.wikimedia.org/T306391) (owner: 10dcaro) [04:13:37] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-harbor [04:21:56] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-harbor [04:22:06] (03approved) 10raymond-ndibe: maintain-harbor: bump to 0.0.20-20250113175254-7d5dce92 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/654 (owner: 10group_203_bot_4866fc124f4b41659f667468a6115cf3) [04:22:13] (03merge) 10raymond-ndibe: maintain-harbor: bump to 0.0.20-20250113175254-7d5dce92 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/654 (owner: 10group_203_bot_4866fc124f4b41659f667468a6115cf3) [05:35:30] (03update) 10raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081) [06:01:56] FIRING: SystemdUnitDown: The systemd unit purge_vm_rbd_images.service on node cloudcontrol1005 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [06:02:07] 06cloud-services-team: SystemdUnitDown The systemd unit purge_vm_rbd_images.service on node cloudcontrol1005 has been failing for more than two hours. - https://phabricator.wikimedia.org/T383751 (10phaultfinder) 03NEW [08:14:47] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17), 07Epic: [Hypotesis] 6.3.5 Obtain a shortlist of categories for the Toolforge sustainability scoring framework - https://phabricator.wikimedia.org/T376896#10461092 (10Slst2020) [08:23:47] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17), 07Epic: [Hypotesis] 6.3.5 Obtain a shortlist of categories for the Toolforge sustainability scoring framework - https://phabricator.wikimedia.org/T376896#10461120 (10Slst2020) 05In progress→03Resolved a:03Slst2020 This... [09:09:07] (03update) 10dcaro: bump_version: copy from jobs-api [repos/cloud/toolforge/toolforge-weld] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/67 [09:41:25] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: add on-wiki edits of toolforge tools to toolviews report - https://phabricator.wikimedia.org/T317953#10461337 (10dcaro) >>! In T317953#10460865, @bd808 wrote: >>>! In T317953#10460801, @Raymond_Ndibe wrote: >> I was actively working on this just an hour... [10:01:56] FIRING: SystemdUnitDown: The systemd unit purge_vm_rbd_images.service on node cloudcontrol1005 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:37:07] 06cloud-services-team, 10wikitech.wikimedia.org, 07Epic, 07Security: sustainability of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T363125#10461543 (10Aklapper) >>! In T363125#10028983, @nshahquinn-wmf wrote: >> Plan has been draften in the "Wikitech Migration Plan" document > > Thank y... [10:42:25] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 13Patch-For-Review: Remove hardcoded NFT rules related to PAWS workers - https://phabricator.wikimedia.org/T383261#10461556 (10fnegri) 05Open→03Resolved [13:46:03] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [wmcs-cookbooks] wmcs.openstack.cloudvirt.vm_console cookbook is not working from cloudcumin hosts - https://phabricator.wikimedia.org/T379570#10462102 (10fnegri) 05In progress→03Resolved The cookbook is now working: ` root@cloudcumin1001:~# cook... [14:01:56] FIRING: SystemdUnitDown: The systemd unit purge_vm_rbd_images.service on node cloudcontrol1005 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:29:15] 06cloud-services-team, 10Toolforge: [components-api] Add minimal cli with build-only features - https://phabricator.wikimedia.org/T362082#10462251 (10dcaro) 05Open→03Resolved a:03dcaro [14:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:51:56] RESOLVED: SystemdUnitDown: The systemd unit purge_vm_rbd_images.service on node cloudcontrol1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:58:54] 06cloud-services-team, 10Toolforge, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project: [toolforge,storage] Add storage capabilities for tools - https://phabricator.wikimedia.org/T293670#10462505 (10dcaro) [14:58:58] 10Toolforge: [toolforge,storage,swift,s3] Object store? 
- https://phabricator.wikimedia.org/T225190#10462506 (10dcaro) [15:02:07] 06cloud-services-team, 10Toolforge, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project: [toolforge,storage] Add storage capabilities for tools - https://phabricator.wikimedia.org/T293670#10462514 (10dcaro) [15:02:28] 06cloud-services-team, 10Toolforge, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project: [toolforge,storage] Add storage capabilities for tools - https://phabricator.wikimedia.org/T293670#10462520 (10dcaro) [15:02:30] 06cloud-services-team, 10Toolforge: Toolforge: consider introducing some semantics for persistent storage - https://phabricator.wikimedia.org/T337192#10462518 (10dcaro) →14Duplicate dup:03T293670 [15:05:10] 06cloud-services-team, 10Cloud-VPS, 10Cumin, 06Infrastructure-Foundations: Revive the HostFile backend on cloudcuminXXXX - https://phabricator.wikimedia.org/T380789#10462531 (10fnegri) a:05fnegri→03None [15:05:39] 10Cloud Services Proposals, 10cloud-services-team (FY2024/2025-Q3-Q4), 06Data-Persistence, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Decision request - Who runs wikireplicas cookbooks - https://phabricator.wikimedia.org/T382607#10462534 (10fnegri) [15:05:54] 06cloud-services-team, 10Toolforge, 10observability: [toolforge.infra] Provide centralized logging (logstash) for Toolforge platform - https://phabricator.wikimedia.org/T97861#10462536 (10dcaro) [15:06:00] 10Cloud Services Proposals, 10cloud-services-team (FY2024/2025-Q3-Q4), 06Data-Persistence, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Decision request - Who runs wikireplicas cookbooks - https://phabricator.wikimedia.org/T382607#10462538 (10fnegri) 05Open→03In progress [15:09:43] 06cloud-services-team, 10Toolforge, 10observability: [toolforge.infra] Provide centralized logging (logstash) for Toolforge platform - https://phabricator.wikimedia.org/T97861#10462546 (10dcaro) [15:10:12] 06cloud-services-team, 10Toolforge, 10observability: [toolforge.infra] Provide 
centralized logging for Toolforge platform - https://phabricator.wikimedia.org/T97861#10462547 (10taavi) [15:13:07] 06cloud-services-team, 10Toolforge: toolforge webservice logs -f not robust to invalid output - https://phabricator.wikimedia.org/T383742#10462554 (10dcaro) [15:13:09] 06cloud-services-team, 10Toolforge: Toolforge buildservice logs error - https://phabricator.wikimedia.org/T373201#10462557 (10dcaro) →14Duplicate dup:03T383742 [15:13:49] 06cloud-services-team, 10Toolforge: toolforge webservice logs -f not robust to invalid output - https://phabricator.wikimedia.org/T383742#10462562 (10dcaro) [15:13:52] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge: [toolforge] webservice logs crashes with some unicode chars - https://phabricator.wikimedia.org/T364609#10462565 (10dcaro) →14Duplicate dup:03T383742 [15:14:11] 06cloud-services-team, 10Toolforge: toolforge webservice logs -f not robust to invalid output - https://phabricator.wikimedia.org/T383742#10462567 (10dcaro) p:05Triage→03Medium [15:15:48] 06cloud-services-team, 10Tool-spacemedia, 10Toolforge: Toolforge jobs logs -f almost always ends in error - https://phabricator.wikimedia.org/T364468#10462586 (10dcaro) p:05Triage→03Low [15:18:10] 06cloud-services-team, 10Toolforge: toolforge webservice logs -f not robust to invalid output - https://phabricator.wikimedia.org/T383742#10462608 (10dcaro) Might be related to {T362521} (they reuse some of the code) [15:20:13] 06cloud-services-team, 10Toolforge: [jobs-cli,toolforge-weld] `toolforge jobs ...` should use named loggers and always show timestamps and logger names - https://phabricator.wikimedia.org/T359963#10462613 (10dcaro) [15:20:20] 06cloud-services-team, 10Toolforge: [toolforge,jobs] toolforge jobs logs read timeout error - https://phabricator.wikimedia.org/T356503#10462616 (10dcaro) [15:20:22] 06cloud-services-team, 10Tool-spacemedia, 10Toolforge: Toolforge jobs logs -f almost always ends in error - 
https://phabricator.wikimedia.org/T364468#10462623 (10dcaro) →14Duplicate dup:03T356503 [15:20:24] 06cloud-services-team, 10Toolforge: `toolforge jobs logs -f` crashes after a while with internal k8s api errors - https://phabricator.wikimedia.org/T359953#10462621 (10dcaro) →14Duplicate dup:03T356503 [15:23:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [15:23:47] 06cloud-services-team, 10Toolforge, 10observability: [toolforge.infra] Provide centralized logging for Toolforge platform - https://phabricator.wikimedia.org/T97861#10462640 (10dcaro) [15:43:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [16:00:39] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10462926 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudcephosd2004-dev.codfw.wmnet with OS bullsey... [16:32:32] 06cloud-services-team: KernelError Server cloudvirt1055 may have kernel errors - https://phabricator.wikimedia.org/T383739#10463097 (10fnegri) 05Open→03Resolved a:03fnegri The server was manually rebooted in T383583 and that caused the following error message to be logged: ` fnegri@cloudvirt1055:~$ su... [16:33:51] 06cloud-services-team, 06DC-Ops, 10ops-eqiad: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10463105 (10Andrew) DC people, is this anything? 
This same alert has popped up a few times in the last few days. [16:36:29] 06cloud-services-team: race condition in purge_vm_rbd_images.service? - https://phabricator.wikimedia.org/T383796 (10Andrew) 03NEW [16:39:59] 06cloud-services-team, 10Cloud-VPS: race condition in purge_vm_rbd_images.service? - https://phabricator.wikimedia.org/T383796#10463133 (10taavi) [16:40:58] 06cloud-services-team: KernelError Server cloudcontrol1011 may have kernel errors - https://phabricator.wikimedia.org/T383270#10463136 (10fnegri) 05Open→03Resolved a:03fnegri This host was being reimaged and it logged a few kernel errors. They all seem innocuous. ` fnegri@cloudcontrol1011:~$ sudo jour... [16:43:07] 06cloud-services-team: SystemdUnitDown The systemd unit purge_vm_rbd_images.service on node cloudcontrol1005 has been failing for more than two hours. - https://phabricator.wikimedia.org/T383751#10463142 (10fnegri) →14Duplicate dup:03T383796 [16:43:16] 06cloud-services-team, 10Cloud-VPS: race condition in purge_vm_rbd_images.service? - https://phabricator.wikimedia.org/T383796#10463145 (10fnegri) [16:43:35] 06cloud-services-team, 10Cloud-VPS: race condition in purge_vm_rbd_images.service? - https://phabricator.wikimedia.org/T383796#10463152 (10fnegri) p:05Triage→03Medium [16:44:05] 06cloud-services-team, 07affects-Kiwix-and-openZIM: SystemdUnitDown kiwix-mirror-update.service - https://phabricator.wikimedia.org/T381212#10463153 (10fnegri) 05Open→03Resolved a:03fnegri [16:44:42] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10463156 (10fnegri) [16:49:54] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [wmcs-cookbooks] wmcs.openstack.cloudvirt.vm_console cookbook is not working from cloudcumin hosts - https://phabricator.wikimedia.org/T379570#10463163 (10dcaro) 🎉 thanks! 
[16:56:32] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10463174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi... [16:57:26] 06cloud-services-team: PuppetFailure Puppet has failed on cloudcontrol2006-dev:9100 - https://phabricator.wikimedia.org/T383432#10463177 (10fnegri) This was the error: ` 2025-01-10T18:20:52.316283+00:00 cloudcontrol2006-dev puppet-agent[1041956]: (/Stage[main]/Profile::Openstack::Base::Opentofu/Git::Clone[r... [16:57:27] 06cloud-services-team: PuppetFailure Puppet has failed on cloudcontrol2006-dev:9100 - https://phabricator.wikimedia.org/T383432#10463180 (10fnegri) →14Duplicate dup:03T373815 [16:57:30] 06cloud-services-team, 10Cloud-VPS: Puppet fails on cloudcontrol when updating /srv/tofu-infra - https://phabricator.wikimedia.org/T373815#10463182 (10fnegri) [16:59:22] 06cloud-services-team, 10Toolforge: [jobs-api] treat URLs with and without a trailing slash the same - https://phabricator.wikimedia.org/T383798 (10Slst2020) 03NEW [16:59:31] 10wikitech.wikimedia.org, 06serviceops: Wikitech displays desktop site on mobile devices - https://phabricator.wikimedia.org/T383656#10463197 (10jijiki) p:05Triage→03Low [17:04:54] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10463225 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100... [17:06:00] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10463230 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi... 
[17:26:25] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10463376 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100... [17:26:42] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10463377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi... [17:30:44] 06cloud-services-team, 10Toolforge: [jobs-api] treat URLs with and without a trailing slash the same - https://phabricator.wikimedia.org/T383798#10463429 (10Slst2020) 05Open→03In progress [17:36:42] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10463528 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100... [17:58:25] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10463740 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi... [18:22:41] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10463846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100... 
[18:22:58] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10463851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi... [18:50:33] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10463992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100... [18:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [18:51:35] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10463993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi... [19:08:11] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10464032 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100... [19:08:41] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10464033 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi... 
[19:15:34] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10464035 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100... [19:15:50] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10464036 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi... [19:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:30:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:31:11] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:46:47] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10464109 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100... 
[19:47:09] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10464110 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi...
[19:54:57] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10464118 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100...
[19:55:13] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10464119 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi...
[20:15:49] cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: partman vs cloudcephosd1012 - https://phabricator.wikimedia.org/T383817 (Andrew) NEW
[20:18:45] cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: partman vs cloudcephosd1012 - https://phabricator.wikimedia.org/T383817#10464305 (Andrew) > is the new partman recipe likely to work for all the rest of our osd nodes? Do they all conform to...
[20:28:19] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10464338 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100...
[21:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:35:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:52:50] FIRING: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[21:57:50] RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[23:23:05] cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: partman vs cloudcephosd1012 - https://phabricator.wikimedia.org/T383817#10464917 (Andrew) p:Triage→Medium
[23:23:21] cloud-services-team, Cloud-VPS: [2025-01-13] tools nfs outage - https://phabricator.wikimedia.org/T383625#10464918 (Andrew) Open→Resolved a:Andrew
[23:23:59] wikitech.wikimedia.org, serviceops: Wikitech displays desktop site on mobile devices - https://phabricator.wikimedia.org/T383656#10464927 (bd808) I'm not sure Wikitech has ever had a consistent mobile frontend experience.
[23:24:01] cloud-services-team, Toolforge: [jobs-api] treat URLs with and without a trailing slash the same - https://phabricator.wikimedia.org/T383798#10464928 (Andrew) p:Triage→Medium
[23:25:08] cloud-services-team, Cloud-VPS, VPS-Projects: Wikidocumentaries wiki is VERY slow - https://phabricator.wikimedia.org/T223378#10464935 (bd808) Resolved→Invalid
[23:25:09] cloud-services-team, Horizon: Horizon: obsessive redirects during logins - https://phabricator.wikimedia.org/T383370#10464936 (Andrew) p:Triage→Medium