[00:03:16] FIRING: [3x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[00:08:16] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[00:13:31] FIRING: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[00:18:31] RESOLVED: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[00:24:36] Tool-techcontribs: I break the tool :( - https://phabricator.wikimedia.org/T384554#10487048 (Chlod) Hi, @Reedy! This only ever happens to people with contributions numbering the thousands and spanning many years. I've set timeouts (30 seconds) internally to make sure that Tech Contribs doesn't pull too much...
[00:29:39] FIRING: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[00:31:35] FIRING: NetworkOutSaturated: Outgoing network saturation detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DNetworkOutSaturated
[00:33:40] Tool-techcontribs: Add reporting on gitlab.wikimedia.org activity - https://phabricator.wikimedia.org/T383935#10487057 (Chlod) p:Triage→Medium This is up next on the list of services to check, right after I finish the (ongoing) GitHub one. I had this written down in [the repo README](https://gitlab.w...
[00:34:39] RESOLVED: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[00:36:35] RESOLVED: NetworkOutSaturated: Outgoing network saturation detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DNetworkOutSaturated
[00:40:06] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[00:41:57] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10487086 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100...
[00:42:22] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10487088 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi...
[00:45:06] RESOLVED: [3x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[00:49:37] Tool-techcontribs: Add reporting on SAL entries - https://phabricator.wikimedia.org/T384113#10487094 (Chlod) p:Triage→Low This one's pretty cool! Checking around on what tools currently exist for SAL, there's [stashbot](https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/stashbot/+/refs/heads/m...
[00:52:06] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[00:54:51] cloud-services-team, Toolforge, Tools: Flickr blocking image requests from Toolforge k8s, breaking multiple tools - https://phabricator.wikimedia.org/T384468#10487098 (Andrew) Using the help request form: ` Hello! My name is Andrew Bogott, and I'm an SRE at the wikimedia foundation. I'm one of the...
[00:57:06] RESOLVED: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[00:59:46] Tool-techcontribs: pie charts with overlapping/unreadable numbers - https://phabricator.wikimedia.org/T384557 (Reedy) NEW
[00:59:53] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10487113 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100...
[01:00:34] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10487119 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi...
[01:02:36] FIRING: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[01:06:25] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10487124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100...
[01:06:46] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10487125 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi...
[01:07:36] RESOLVED: [3x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[01:14:06] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[01:19:06] RESOLVED: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[01:20:14] Tool-techcontribs: Add reporting on SAL entries - https://phabricator.wikimedia.org/T384113#10487129 (bd808) >>! In T384113#10487094, @Chlod wrote: > The question on my mind is how we'd be able to tie IRC usernames to developer accounts using some source of truth. Profiles here on Phabricator can have a irc...
[01:28:28] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-idp-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[01:28:58] Tool-techcontribs: Add reporting on activity on technical mailing lists - https://phabricator.wikimedia.org/T384112#10487136 (Chlod) p:Triage→Low There was supposed to be a long comment here about how it doesn't seem like Hyperkitty supports searching by sender, but after trudging through a lot of AP...
[01:40:42] Tool-techcontribs: Add reporting on SAL entries - https://phabricator.wikimedia.org/T384113#10487146 (Chlod) Thanks Bryan! It seems like I forgot about the Phabricator profile option. That's definitely a good starting point to use! Adding in IRC nick tracking on Bitu sounds like a good feature to have, but...
[01:50:13] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10487148 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumi...
[01:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:34:20] cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: partman vs cloudcephosd1012 - https://phabricator.wikimedia.org/T383817#10487163 (Andrew) After some more hatchet work this now images correctly. No idea if it'll work for other bigger osd n...
[02:35:46] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10487170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin100...
[02:51:19] FIRING: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[02:52:07] tool-wscontest: I can't edit or view scores of a contest despite being made an administrator by the creator - https://phabricator.wikimedia.org/T384310#10487183 (Samwilson) Sorry about this! Does your username as displayed at the top have an underscore or space in it? {F58254813} Because it looks like as...
[02:56:19] RESOLVED: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[03:14:39] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:19:39] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:29:18] cloud-services-team, Toolforge (Toolforge iteration 17): [jobs-api,jobs-cli] increased exit code 137 rate since 2024-12-14 due to low ephemeral-storage - https://phabricator.wikimedia.org/T382865#10487194 (JJMC89) Open→Resolved
[03:29:19] cloud-services-team, Toolforge (Toolforge iteration 17): [jobs-api,jobs-cli] increased exit code 137 rate since 2024-12-14 due to low ephemeral-storage - https://phabricator.wikimedia.org/T382865#10487196 (JJMC89) I haven't seen any new 137 errors since I last updated the task.
[03:36:16] FIRING: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:41:16] RESOLVED: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:48:35] FIRING: NetworkOutSaturated: Outgoing network saturation detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DNetworkOutSaturated
[04:14:14] FIRING: KernelError: Server cloudcephosd1012 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudcephosd1012 - https://alerts.wikimedia.org/?q=alertname%3DKernelError
[04:14:25] cloud-services-team: KernelError Server cloudcephosd1012 may have kernel errors - https://phabricator.wikimedia.org/T384560 (phaultfinder) NEW
[05:02:00] RESOLVED: NetworkOutSaturated: Outgoing network saturation detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DNetworkOutSaturated
[06:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:49:35] (PS1) Muehlenhoff: Add stub secrets for new master_bookworm roles [labs/private] - https://gerrit.wikimedia.org/r/1113740 (https://phabricator.wikimedia.org/T381565)
[07:50:03] (PS2) Muehlenhoff: Add stub secrets for new master_bookworm roles [labs/private] - https://gerrit.wikimedia.org/r/1113740 (https://phabricator.wikimedia.org/T381565)
[08:00:35] (CR) Muehlenhoff: [V:+2 C:+2] Add stub secrets for new master_bookworm roles [labs/private] - https://gerrit.wikimedia.org/r/1113740 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:07:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[08:14:14] FIRING: KernelError: Server cloudcephosd1012 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudcephosd1012 - https://alerts.wikimedia.org/?q=alertname%3DKernelError
[08:32:14] Tool-masto-collab: Masto-Collab: Logging in results in 500 - etag is missing - https://phabricator.wikimedia.org/T384568 (Peachey88) NEW
[08:35:04] Tool-techcontribs: I break the tool :( - https://phabricator.wikimedia.org/T384554#10487824 (Peachey88) @Chlod There is a replica for gerrit at gerrit-replica it just doesn't have a web interface but everything else is the same iirc, @hashar (or maybe @Dzahn iirc) would probably be your best bets to learn ab...
[09:02:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:21:08] cloud-services-team, Toolforge: [builds-builder] Support adding repositories for Apt buildpack - https://phabricator.wikimedia.org/T363027#10487906 (dcaro) > I actually think all three solutions would prove beneficial to the Toolforge community and not just the tools I am attempting to maintain. And tha...
[10:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:34:31] Tools: PetScan returns "This web service cannot be reached" - https://phabricator.wikimedia.org/T384464#10488052 (M2k_dewiki) Hello, also see * https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#PetScan_nicht_erreichbar * https://de.wikipedia.org/w/index.php?title=Wikipedia%3AFragen_zur_Wikiped...
[11:11:34] Tool-techcontribs: pie charts with overlapping/unreadable numbers - https://phabricator.wikimedia.org/T384557#10488239 (Bugreporter2) The examples shown are why pie charts are often not a good idea when categories are massively different in size. There are often better alternatives.
[11:25:23] (PS1) Vgutierrez: secret: Add dummy pki.goog private key [labs/private] - https://gerrit.wikimedia.org/r/1113764
[11:25:41] (CR) Vgutierrez: [V:+2 C:+2] secret: Add dummy pki.goog private key [labs/private] - https://gerrit.wikimedia.org/r/1113764 (owner: Vgutierrez)
[11:27:40] wikitech.wikimedia.org, Wikidata, Wikidata Integration in Wikimedia projects, Wikimedia-Interwiki-links: Enable interwiki links to/from Wikitech - https://phabricator.wikimedia.org/T290147#10488262 (Lucas_Werkmeister_WMDE)
[11:51:09] cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: partman vs cloudcephosd1012 - https://phabricator.wikimedia.org/T383817#10488312 (dcaro) >>! In T383817#10487163, @Andrew wrote: > After some more hatchet work this now images correctly. No...
[11:54:52] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [dbaas] Add DB as a service capabilities to toolforge - https://phabricator.wikimedia.org/T384586 (dcaro) NEW p:Triage→High
[11:55:37] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [dbaas] Add DB as a service capabilities to toolforge - https://phabricator.wikimedia.org/T384586#10488345 (dcaro)
[11:58:17] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10488383 (Steph_Tyszka) |**Wikitech account/LDAP:**| Steph Tyszka| |**SUL account**| Steph Tyszka| |**Account linked on [[ https://idm.wikimedia.org/ | IDM ]]** |Yes| |**I have visited [[ h...
[12:00:41] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [dbaas,toolsdb] Add support for management of toolsdb databases within toolforge - https://phabricator.wikimedia.org/T384591 (dcaro) NEW
[12:07:00] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [dbaas,toolsdb] Add support for management of toolsdb databases within toolforge - https://phabricator.wikimedia.org/T384591#10488445 (fnegri) The way to create a database at the momen...
[12:14:14] FIRING: KernelError: Server cloudcephosd1012 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudcephosd1012 - https://alerts.wikimedia.org/?q=alertname%3DKernelError
[12:18:50] FIRING: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:21:20] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [labs/tools/commons-mass-description] - https://gerrit.wikimedia.org/r/1113784 (owner: L10n-bot)
[12:24:28] Striker, Tool-phab-ban, MediaWiki-Action-API, Stashbot: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10488522 (Oudedutchman)
[12:43:04] Tool-techcontribs: I break the tool :( - https://phabricator.wikimedia.org/T384554#10488573 (Bugreporter2) @Reedy when writing 🐛 bug 🎫 tickets, it's a good idea to describe the bug properly in the name. It's not immediately obvious here what the referents of "I" or "the tool" are. 🤷🤦
[12:46:30] tool-wscontest: I can't edit or view scores of a contest despite being made an administrator by the creator - https://phabricator.wikimedia.org/T384310#10488577 (Ninovolador) That must be it! At the top it is displayed with a space, and at the admin list, with an underscore.
[12:46:41] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [dbaas,toolsdb] Add support for management of toolsdb databases within toolforge - https://phabricator.wikimedia.org/T384591#10488578 (dcaro)
[12:49:23] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [dbaas] Add DB as a service capabilities to toolforge - https://phabricator.wikimedia.org/T384586#10488581 (dcaro)
[12:51:07] cloud-services-team, Toolforge, Patch-For-Review: [toolforge.infra] create fullstack tests - https://phabricator.wikimedia.org/T357977#10488585 (dcaro)
[12:51:44] cloud-services-team, Toolforge, Patch-For-Review: [toolforge.infra] run and monitor our own sample tools - https://phabricator.wikimedia.org/T357977#10488587 (dcaro)
[12:56:28] cloud-services-team, Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [toolforge,storage,infra,k8s] Investigate persistent volume support - https://phabricator.wikimedia.org/T384596 (dcaro) NEW
[12:57:41] VPS-project-Codesearch, collaboration-services: Graduate codesearch to production - https://phabricator.wikimedia.org/T268199#10488620 (LSobanski) p:Low→High
[12:58:11] tool-wscontest: I can't edit or view scores of a contest despite being made an administrator by the creator - https://phabricator.wikimedia.org/T384310#10488621 (Samwilson) Ah great. I'll fix it, but for now can you ask them to amend that contest and add your username without an underscore?
[12:58:45] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [dbaas] Add DB as a service capabilities to toolforge - https://phabricator.wikimedia.org/T384586#10488625 (dcaro)
[12:59:48] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [dbaas] Add DB as a service capabilities to toolforge - https://phabricator.wikimedia.org/T384586#10488630 (dcaro)
[13:02:16] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [dbaas] Add DB as a service capabilities to toolforge - https://phabricator.wikimedia.org/T384586#10488649 (Andrew) Can you elaborate on how 'ToolsDB management' differs from what we h...
[13:02:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[13:05:47] PROBLEM - Host cloudcephosd1012 is DOWN: PING CRITICAL - Packet loss = 100%
[13:07:27] RECOVERY - Host cloudcephosd1012 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[13:09:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[13:11:21] cloud-services-team, wikitech.wikimedia.org, Infrastructure-Foundations, Epic: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859#10488676 (Ladsgroup) I haven't figured out what's exactly causing this but some of the users don't have a corresponding SUL account. For example `WM...
[13:12:56] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10488677 (Ladsgroup) Force attached your account.
[13:14:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[13:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:21:26] cloud-services-team, Toolforge (Toolforge iteration 17): add on-wiki edits of toolforge tools to toolstats report - https://phabricator.wikimedia.org/T317953#10488711 (Raymond_Ndibe)
[13:21:34] cloud-services-team, Toolforge (Toolforge iteration 17): add on-wiki edits of toolforge tools to toolstats report - https://phabricator.wikimedia.org/T317953#10488714 (Raymond_Ndibe) Open→Resolved
[13:21:41] (approved) raymond-ndibe: [jobs-api] convert all quotas to appropriate units [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/119 (https://phabricator.wikimedia.org/T361120)
[13:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:43:33] cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: partman vs cloudcephosd1012 - https://phabricator.wikimedia.org/T383817#10488775 (Andrew) Open→Resolved a:Andrew
[13:48:00] (update) raymond-ndibe: [jobs-api] convert all quotas to appropriate units [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/119 (https://phabricator.wikimedia.org/T361120)
[13:48:20] (update) raymond-ndibe: [jobs-api] convert all quotas to appropriate units [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/119 (https://phabricator.wikimedia.org/T361120)
[13:49:10] (approved) raymond-ndibe: [jobs-api] support http health check [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/136 (https://phabricator.wikimedia.org/T362621)
[13:50:31] (merge) raymond-ndibe: [jobs-api] convert all quotas to appropriate units [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/119 (https://phabricator.wikimedia.org/T361120)
[13:51:26] (update) raymond-ndibe: [jobs-api] support http health check [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/136 (https://phabricator.wikimedia.org/T362621)
[13:53:09] (open) group_203_bot_4866fc124f4b41659f667468a6115cf3: jobs-api: bump to 0.0.346-20250123135045-edb3fcc8 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/656 (https://phabricator.wikimedia.org/T361120)
[13:53:50] (merge) raymond-ndibe: [jobs-api] support http health check [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/136 (https://phabricator.wikimedia.org/T362621)
[14:03:09] (update) group_203_bot_4866fc124f4b41659f667468a6115cf3: jobs-api: bump to 0.0.346-20250123135045-edb3fcc8 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/656 (https://phabricator.wikimedia.org/T361120)
[14:03:13] (update) group_203_bot_4866fc124f4b41659f667468a6115cf3: jobs-api: bump to 0.0.346-20250123135045-edb3fcc8 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/656 (https://phabricator.wikimedia.org/T361120 https://phabricator.wikimedia.org/T362621)
[14:13:25] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[14:14:27] cloud-services-team: KernelError Server cloudcephosd1012 may have kernel errors - https://phabricator.wikimedia.org/T384560#10488832 (fnegri) Open→Resolved a:fnegri New server, triggered one of the known error messages that can be ignored: ` Jan 23 13:07:03 cloudcephosd1012 kernel: x86/cpu: V...
[14:21:38] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[14:26:35] RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:30:00] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[14:32:55] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[14:39:30] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api
[14:44:01] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[14:47:09] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api
[14:52:02] (PS1) David Caro: ceph: do `osd reweight` and `osd crush reweight` when undraining [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1113817
[14:53:12] (CR) Andrew Bogott: [C:+1] ceph: do `osd reweight` and `osd crush reweight` when undraining [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1113817 (owner: David Caro)
[14:57:25] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[14:57:27] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[15:04:40] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node
[15:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:10:43] (PS2) David Caro: ceph: take into account the osd reweight when considering osds drained [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1113817
[15:11:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[15:11:42] (PS3) David Caro: ceph: take into account the osd reweight when considering osds drained [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1113817
[15:15:40] (CR) CI reject: [V:-1] ceph: take into account the osd reweight when considering osds drained [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1113817 (owner: David Caro)
[15:21:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[15:21:15] cloud-services-team: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T384608 (phaultfinder) NEW
[15:31:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook -
https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [15:32:03] 10Tool-techcontribs: I break the tool :( - https://phabricator.wikimedia.org/T384554#10489186 (10Reedy) The tool is defined by the tag. I is pretty clear it means me. [15:32:37] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [15:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [15:40:09] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate pontoon-puppet-01.monitoring.eqiad.wmflabs is about to expire in 23d 0h 36m 27s - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetCertificateAboutToExpire - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:41:26] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [15:42:38] (03PS4) 10David Caro: ceph: take into account the osd reweight when considering osds drained [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1113817 [15:53:21] 06cloud-services-team: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T384608#10489383 (10fnegri) →14Duplicate dup:03T383723 [15:53:24] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10489385 (10fnegri) [15:59:23] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project, 07Epic: [dbaas] Add DB as a service capabilities to toolforge - https://phabricator.wikimedia.org/T384586#10489423 
(10fnegri) [16:02:41] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [16:04:23] (03CR) 10David Caro: [V:03+1] "Fixed the tests xd" [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1113817 (owner: 10David Caro) [16:06:26] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [16:06:33] 06cloud-services-team: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T384623 (10phaultfinder) 03NEW [16:06:39] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10489448 (10fnegri) This is firing again today. [16:29:08] 06cloud-services-team, 10wikitech.wikimedia.org, 06Infrastructure-Foundations, 07Epic: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859#10489589 (10bd808) >>! In T161859#10488676, @Ladsgroup wrote: > I haven't figured out what's exactly causing this but some of the users don't have a c... [16:42:47] 10Tool-techcontribs: I break the tool :( - https://phabricator.wikimedia.org/T384554#10489675 (10Dzahn) Please keep using regular gerrit and not the replica unless you explicitly hear otherwise from releng or sre-collab. A WMF IP should not be throttled. Since nothing secret is inside Gerrit there might be ways... 
[16:49:03] 10Striker, 10Tool-phab-ban, 10Bitu, 10MediaWiki-Action-API, 10Stashbot: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10489716 (10taavi) [16:58:36] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: Kernel error metrics have overlapping definitions - https://phabricator.wikimedia.org/T382961#10489750 (10fnegri) 05In progress→03Resolved This is all done. [17:03:12] 06cloud-services-team, 06Release-Engineering-Team, 10GitLab (CI & Job Runners): Kokkuri feature request: pipeline-configurable repo credentials - https://phabricator.wikimedia.org/T384396#10489769 (10dancy) 05Open→03Resolved a:03dancy [17:09:11] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10489797 (10fnegri) Moving it out of #wmcs-hardware and back to #cloud-services-team because otherwise @phaultfinder keeps on creating new tasks for this alert. 
[17:09:41] 06cloud-services-team: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T384623#10489800 (10fnegri) →14Duplicate dup:03T383723 [17:09:42] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10489802 (10fnegri) [17:13:41] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [17:21:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [17:25:40] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) [17:31:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [17:36:30] (03update) 10dcaro: deploy-token: prevent accidental token overwrites [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/49 (owner: 10sstefanova) [17:40:54] 10Toolforge (Toolforge iteration 17): [components-api] Add support for non-public services - https://phabricator.wikimedia.org/T362072#10489925 (10dcaro) [17:42:53] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17), 07Epic: [components-api] Trigger builds on every deploy - 
https://phabricator.wikimedia.org/T384634 (10dcaro) 03NEW [17:44:29] 10Toolforge (Toolforge iteration 17): Support HTTP health checks in jobs framework - https://phabricator.wikimedia.org/T362621#10489950 (10Raymond_Ndibe) Reminder: Add to changelog on wikitech [18:00:50] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17), 07Epic: [components-api] Trigger builds on every deploy - https://phabricator.wikimedia.org/T384634#10489976 (10dcaro) [18:05:02] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node [18:06:32] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 17), 07Epic: [components-api] Trigger builds on every deploy - https://phabricator.wikimedia.org/T384634#10489990 (10dcaro) p:05Triage→03High [18:07:16] 10Tool-Pageviews: Data problems with dumps and siteviews tool - https://phabricator.wikimedia.org/T384636 (10agray) 03NEW [18:07:17] 10Toolforge (Toolforge iteration 17): [components-api] Add support for non-public services - https://phabricator.wikimedia.org/T362072#10490007 (10dcaro) p:05Triage→03Medium a:05dcaro→03Raymond_Ndibe [18:08:24] 10Toolforge (Toolforge iteration 17): [components-api] Add support for port/helathcheck for continuous jobs in tool config/depolyment - https://phabricator.wikimedia.org/T362072#10490010 (10dcaro) [18:10:06] 10Toolforge (Toolforge iteration 17): [components-api] Add support for port/helathcheck for continuous jobs in tool config/depolyment - https://phabricator.wikimedia.org/T362072#10490014 (10dcaro) [18:12:17] 10Toolforge (Toolforge iteration 17): [components-api] Trigger builds on every deploy - https://phabricator.wikimedia.org/T384634#10490028 (10dcaro) a:05dcaro→03Raymond_Ndibe [18:22:57] (03CR) 10David Caro: [V:03+1 C:03+2] ceph: take into account the osd reweight when considering osds drained [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1113817 (owner: 10David Caro) [18:27:33] (03Merged) 10jenkins-bot: ceph: take into 
account the osd reweight when considering osds drained [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1113817 (owner: 10David Caro) [18:30:26] 10Toolforge (Toolforge iteration 17): [infra,harbor] upgrade to latest - https://phabricator.wikimedia.org/T384327#10490087 (10Raymond_Ndibe) a:03Raymond_Ndibe [18:34:05] 10wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10490117 (10cjming) |**Wikitech account/LDAP:**| cjming (shell) or Clare Ming| |**SUL account**| CMing (WMF)| |**Account linked on [[ https://idm.wikimedia.org/ | IDM ]]** |Yes - today| |**I... [18:34:55] 06cloud-services-team, 10Toolforge (Toolforge iteration 17): [k8s, infra] update pause image to 3.6 - https://phabricator.wikimedia.org/T374193#10490119 (10Raymond_Ndibe) [18:48:25] 06cloud-services-team, 10Toolforge, 07Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10490171 (10dcaro) [18:51:37] 10Toolforge (Toolforge iteration 17): [components-cli] Add the `refresh` subcommand to the autocomplete file - https://phabricator.wikimedia.org/T384641 (10dcaro) 03NEW [18:51:43] 10Toolforge (Toolforge iteration 17): [components-cli] Add the `refresh` subcommand to the autocomplete file - https://phabricator.wikimedia.org/T384641#10490199 (10dcaro) p:05Triage→03Low [18:52:15] 06cloud-services-team, 10Toolforge: [replica_cnf,functional-tests] Run replica_cnf functional tests in lima-kilo with the rest of functional tests - https://phabricator.wikimedia.org/T369800#10490208 (10Raymond_Ndibe) yeaaaa I think there is a task somewhere about moving `replica_cnf` out of puppet. maybe it's... 
[18:52:36] 06cloud-services-team, 10Toolforge: [replica_cnf,functional-tests] Run replica_cnf functional tests in lima-kilo with the rest of functional tests - https://phabricator.wikimedia.org/T369800#10490212 (10Raymond_Ndibe) [18:53:07] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) [18:54:56] 10Horizon: petscan5 unresponsive - https://phabricator.wikimedia.org/T384642 (10Magnus) 03NEW [18:55:05] 10Horizon: petscan5 unresponsive - https://phabricator.wikimedia.org/T384642#10490230 (10Magnus) p:05Triage→03Unbreak! [18:57:36] 10Cloud-Services: petscan5 unresponsive - https://phabricator.wikimedia.org/T384642#10490235 (10Magnus) The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task.... [18:58:15] 06cloud-services-team, 10Cloud-VPS: petscan5 unresponsive - https://phabricator.wikimedia.org/T384642#10490237 (10Magnus) [19:05:11] 06cloud-services-team, 10Cloud-VPS: petscan5 unresponsive - https://phabricator.wikimedia.org/T384642#10490267 (10taavi) p:05Unbreak!→03Triage [19:05:46] 06cloud-services-team, 10Toolforge (Toolforge iteration 17): [k8s, infra] update pause image to 3.9 - https://phabricator.wikimedia.org/T374193#10490269 (10Raymond_Ndibe) [19:05:56] 06cloud-services-team, 10Toolforge (Toolforge iteration 17): [k8s, infra] update pause image to 3.9 - https://phabricator.wikimedia.org/T374193#10490270 (10Raymond_Ndibe) a:03Raymond_Ndibe [19:08:44] 06cloud-services-team, 10Toolforge (Toolforge iteration 17), 13Patch-For-Review: [k8s, infra] update pause image to 3.9 - https://phabricator.wikimedia.org/T374193#10490284 (10Raymond_Ndibe) 05Open→03In progress [19:31:31] 06cloud-services-team, 10Toolforge, 07Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10490376 
(10RoySmith) Just out of curiosity, what's wrong with setting up a syslog server and letting all the tools log to it? [19:34:46] 10Toolforge (Toolforge iteration 17): [infra,harbor] upgrade to v2.10.1 - https://phabricator.wikimedia.org/T384327#10490382 (10Raymond_Ndibe) [19:36:22] (03open) 10raymond-ndibe: [builds-builder::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [repos/cloud/toolforge/builds-builder] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/64 (https://phabricator.wikimedia.org/T384327) [19:37:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node [20:10:41] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for all NFS workers (T370245) [20:10:47] T370245: [infra,k8s] remove deprecated kubelet flags before 1.28 upgrade (we might be able to remove all custom ones) - https://phabricator.wikimedia.org/T370245 [20:15:59] !log raymond-ndibe@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for all NFS workers (T370245) [20:16:19] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for all nodes (T370245) [20:34:58] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) [20:41:14] FIRING: [2x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: toolsbeta-test-k8s-ingress-9.toolsbeta.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [20:43:23] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for all nodes (T370245) [20:43:26] T370245: [infra,k8s] remove deprecated kubelet flags before 1.28 upgrade (we might be able to remove 
all custom ones) - https://phabricator.wikimedia.org/T370245 [20:44:55] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node [20:46:14] RESOLVED: [3x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: toolsbeta-test-k8s-ingress-9.toolsbeta.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [20:46:58] 06cloud-services-team, 10Toolforge, 07Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10490544 (10bd808) >>! In T127367#10490376, @RoySmith wrote: > Just out of curiosity, what's wrong with setting up a syslog... [20:56:07] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-harbor [21:04:30] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-harbor [21:52:48] 06cloud-services-team, 10Toolforge, 07Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10490668 (10RoySmith) OK, that makes sense. 
[22:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:35:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:50:39] FIRING: [3x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [22:55:39] FIRING: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [23:00:39] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [23:05:54] FIRING: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [23:10:54] FIRING: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [23:11:09] FIRING: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [23:15:54] FIRING: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [23:16:09] FIRING: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [23:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [23:20:54] 
FIRING: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [23:33:21] (03open) 10dancy: only_image_publish.yaml: Remove KOKKURI_REGISTRY_INTERNAL variable [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/47 [23:33:23] (03update) 10dancy: only_image_publish.yaml: Remove KOKKURI_REGISTRY_INTERNAL variable [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/47 [23:35:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks