[00:03:16] FIRING: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [00:08:16] RESOLVED: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [00:18:27] 10Toolforge (Toolforge iteration 17), 13Patch-For-Review, 07Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417#10561971 (10Raymond_Ndibe) a:03Raymond_Ndibe [00:19:34] 10Toolforge (Toolforge iteration 17), 13Patch-For-Review, 07Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417#10561977 (10Raymond_Ndibe) 05Stalled→03In progress [00:20:54] 06cloud-services-team, 10Toolforge: [harbor,infra] Find a way to manage toolforge project policies with code - https://phabricator.wikimedia.org/T360509#10561986 (10Raymond_Ndibe) [00:21:28] 06cloud-services-team, 10Toolforge: [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#10561990 (10Raymond_Ndibe) a:03Raymond_Ndibe [00:22:15] 06cloud-services-team, 10Toolforge (Toolforge iteration 17): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#10561997 (10Raymond_Ndibe) [00:22:44] 06cloud-services-team, 10Toolforge: [maintain-harbor] Have maintain-harbor use a robot account - https://phabricator.wikimedia.org/T361698#10561998 (10Raymond_Ndibe) a:03Raymond_Ndibe [00:23:09] 
06cloud-services-team, 10Toolforge (Toolforge iteration 17): [maintain-harbor] Have maintain-harbor use a robot account - https://phabricator.wikimedia.org/T361698#10562003 (10Raymond_Ndibe) [00:25:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [00:30:55] FIRING: MaxConntrack: Max conntrack at 80.14% on cloudvirt1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [00:35:55] RESOLVED: MaxConntrack: Max conntrack at 80.36% on cloudvirt1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [00:59:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [01:02:51] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [01:12:09] 06cloud-services-team, 10Toolforge (Toolforge iteration 17): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#10562052 (10Raymond_Ndibe) **maintain-harbor current and future jobs and required robot account permissions:** 1. delete-empty-tool-projects * g... 
[01:18:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.reboot_node on hosts matched by 'D{cloudcontrol2005-dev.codfw.wmnet}' [01:23:16] FIRING: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [01:28:16] FIRING: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [01:32:44] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.reboot_node (exit_code=0) on hosts matched by 'D{cloudcontrol2005-dev.codfw.wmnet}' [01:33:16] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [02:07:06] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [02:12:06] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [02:19:18] 06cloud-services-team, 10Toolforge (Toolforge iteration 17): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#10562104 (10Raymond_Ndibe) 05Open→03In progress [02:19:21] 06cloud-services-team, 10Toolforge (Toolforge iteration 17): [maintain-harbor] Have maintain-harbor use a robot account - https://phabricator.wikimedia.org/T361698#10562106 (10Raymond_Ndibe) 05Open→03In progress [02:28:42] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: openstack: wmfkeystonehooks: project ids rather than names are being used in LDAP group creation - https://phabricator.wikimedia.org/T379030#10562113 (10Andrew) The above two patches implement option #2. They leave /etc/hosts subsequently unmanaged whic... 
[02:47:39] FIRING: [3x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [02:52:39] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [03:23:54] (03update) 10raymond-ndibe: [jobs-api] custom resource definition deployment templates [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/101 (https://phabricator.wikimedia.org/T359650) [04:00:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [04:02:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [04:06:56] FIRING: SystemdUnitDown: The service unit purge_vm_rbd_images.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:07:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [04:10:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [06:01:56] FIRING: SystemdUnitDown: The systemd unit purge_vm_rbd_images.service on node cloudcontrol1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:10:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [08:24:34] 10Tool-erinnermich: [ErinnerMichBot] Possible support for other languages and projects? 
- https://phabricator.wikimedia.org/T384842#10562403 (10Tkarcher) Bot is renamed now: https://meta.wikimedia.org/wiki/Special:CentralAuth/Pense-B%C3%B4t [08:37:38] 06cloud-services-team, 10Toolforge, 07Privacy: Make tools-static fontcdn/ and cdnjs/ redact UA - https://phabricator.wikimedia.org/T210959#10562447 (10Xover) >>! In T210959#8920374, @TheDJ wrote: > google fonts api returns woff2 for Windows, but returns woff for macOS > This apparently has to do with issues... [08:46:35] 06cloud-services-team, 10Toolforge, 07Epic, 13Patch-For-Review: loki into lima-kilo - https://phabricator.wikimedia.org/T386480#10562465 (10dcaro) > Do we think we should merge these and continue from here or take a different approach? It should be ok, we could not install loki by default too and avoid peo... [08:51:34] 06cloud-services-team, 10Toolforge, 07Epic, 13Patch-For-Review: [o11y,logging,infra] loki into lima-kilo - https://phabricator.wikimedia.org/T386480#10562477 (10dcaro) [08:52:06] 06cloud-services-team, 10Toolforge (Toolforge iteration 17), 07Epic, 13Patch-For-Review: [o11y,logging,infra] loki into lima-kilo - https://phabricator.wikimedia.org/T386480#10562480 (10dcaro) [08:52:32] 06cloud-services-team, 10Toolforge (Toolforge iteration 17), 07Epic, 13Patch-For-Review: [o11y,logging,infra] loki into lima-kilo - https://phabricator.wikimedia.org/T386480#10562482 (10dcaro) p:05Triage→03High [09:36:38] 06cloud-services-team, 10Toolforge, 07Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10562687 (10aborrero) hey @rook I see some activity happening on {T386480}, and I'm curious about the architecture of this s... 
[09:53:12] 06cloud-services-team, 10Toolforge, 07Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10562720 (10rook) > * Is the new system going to follow the ideas outlined at https://wikitech.wikimedia.org/wiki/User:Taavi... [10:01:57] FIRING: SystemdUnitDown: The systemd unit purge_vm_rbd_images.service on node cloudcontrol1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:27:39] 06cloud-services-team, 10Toolforge, 07Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10562767 (10dcaro) > Haven't considered this part thus far no. Current focus is getting loki into lima-kilo which I don't be... [10:28:41] 06cloud-services-team, 10Toolforge, 07Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10562768 (10aborrero) >>! In T127367#10562720, @rook wrote: > > Haven't considered this part thus far no. Current focus is... [10:55:56] 10wikitech.wikimedia.org: Decide what to do with SUL attached Wikitech accounts that Bitu associates with a different SUL account - https://phabricator.wikimedia.org/T386026#10562857 (10ayounsi) Thanks @bd808 Please detach 'Ayounsi' from SUL, rename it to 'AYounsi (WMF)', and reattach to SUL. [10:59:55] 10wikitech.wikimedia.org: Decide what to do with SUL attached Wikitech accounts that Bitu associates with a different SUL account - https://phabricator.wikimedia.org/T386026#10562877 (10Ladsgroup) >>! In T386026#10562857, @ayounsi wrote: > Thanks @bd808 > Please detach 'Ayounsi' from SUL, rename it to 'AYounsi (... 
[11:13:01] 06cloud-services-team: Onboard Chuck Onwumelu - https://phabricator.wikimedia.org/T386715#10562927 (10aborrero) [11:32:33] 06cloud-services-team, 10Cloud-VPS: SystemdUnitDown The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T386543#10562973 (10fnegri) 05In progress→03Resolved This is now fixed. [11:43:54] 06cloud-services-team, 10Cloud-VPS, 10Observability-Metrics, 10SRE Observability (FY2024/2025-Q3): Remove librenms -> graphite integration, replace with gnmi - https://phabricator.wikimedia.org/T372457#10562998 (10dcaro) Did a round with @cmooney on the current dashboards we have to make sure we are not mi... [11:48:26] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 13Patch-For-Review: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356#10563023 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for ho... [11:50:09] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 13Patch-For-Review: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356#10563028 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for ho... 
[12:02:31] 06cloud-services-team: Chuck Onwumelu internship: experiments with sample Toolforge tools - https://phabricator.wikimedia.org/T386805 (10aborrero) 03NEW [12:02:38] 06cloud-services-team: Chuck Onwumelu internship: experiments with sample Toolforge tools - https://phabricator.wikimedia.org/T386805#10563091 (10aborrero) p:05Triage→03Medium [12:02:44] 06cloud-services-team, 10Data-Services: [wikireplicas] Create views for new wiki satwiktionary - https://phabricator.wikimedia.org/T386634#10563093 (10Ladsgroup) The DBA side is done now @fnegri [12:08:25] 06cloud-services-team: Chuck Onwumelu internship: experiments with Toolsbeta and lima-kilo - https://phabricator.wikimedia.org/T386806 (10aborrero) 03NEW [12:08:32] 06cloud-services-team: Chuck Onwumelu internship: experiments with Toolsbeta and lima-kilo - https://phabricator.wikimedia.org/T386806#10563119 (10aborrero) p:05Triage→03Medium [12:09:09] 06cloud-services-team: Chuck Onwumelu internship: experiments with Toolsbeta and lima-kilo - https://phabricator.wikimedia.org/T386806#10563120 (10aborrero) [12:09:10] 06cloud-services-team: Chuck Onwumelu internship: experiments with sample Toolforge tools - https://phabricator.wikimedia.org/T386805#10563121 (10aborrero) [12:09:11] 06cloud-services-team: Onboard Chuck Onwumelu - https://phabricator.wikimedia.org/T386715#10563122 (10aborrero) [12:27:21] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Data-Services: [wikireplicas] Create views for new wiki satwiktionary - https://phabricator.wikimedia.org/T386634#10563159 (10fnegri) a:03fnegri [12:27:51] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 13Patch-For-Review: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356#10563161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host c... 
[12:28:47] 06cloud-services-team: PuppetDisabled Puppet disabled on cloudgw1002:9100 - https://phabricator.wikimedia.org/T386111#10563165 (10fnegri) 05Open→03Resolved a:03fnegri cloudgw1002 is no longer active, so we can ignore this Puppet error. See {T382356}. [12:28:50] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 13Patch-For-Review: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356#10563163 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host c... [12:30:46] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356#10563172 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudgw1003.eqiad.w... [12:32:49] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Data-Services: [wikireplicas] Create views for new wiki satwiktionary - https://phabricator.wikimedia.org/T386634#10563176 (10fnegri) p:05Triage→03Medium [12:35:25] 10Cloud-VPS (Quota-requests): Increase object storage quota for project spacemedia - https://phabricator.wikimedia.org/T386588#10563178 (10fnegri) LGTM, +1 [12:36:52] 06cloud-services-team: SystemdUnitDown The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://phabricator.wikimedia.org/T386396#10563180 (10fnegri) 05Open→03Resolved a:03fnegri This seems to be ok now. [13:00:47] 10Cloud-VPS (Quota-requests): Increase object storage quota for project spacemedia - https://phabricator.wikimedia.org/T386588#10563220 (10Andrew) 05Open→03Resolved a:03Andrew Your quotas are now: ` "user_quota": { "enabled": true, "check_on_raw": false, "max_size": 858993... 
[13:09:13] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356#10563235 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudgw1003.eqiad.wmnet... [13:20:58] 06cloud-services-team, 10decommission-hardware: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810 (10Andrew) 03NEW [13:21:04] 06cloud-services-team, 10decommission-hardware: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10563275 (10Andrew) [13:21:06] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356#10563274 (10Andrew) [13:24:29] 06cloud-services-team: Chuck Onwumelu internship: experiments with Toolsbeta and lima-kilo - https://phabricator.wikimedia.org/T386806#10563307 (10aborrero) [13:24:42] 06cloud-services-team: Chuck Onwumelu internship: experiments with sample Toolforge tools - https://phabricator.wikimedia.org/T386805#10563308 (10aborrero) [13:26:46] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356#10563323 (10aborrero) 05Open→03Resolved [13:29:08] 06cloud-services-team, 10Cloud-VPS: cloudgw: suspected network problems - https://phabricator.wikimedia.org/T381078#10563345 (10aborrero) 05Open→03Resolved a:03aborrero The most accepted theory is that we had faulty hardware, which was replaced in {T382356} [13:31:20] 06cloud-services-team, 10decommission-hardware: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10563355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: `cloudgw1001.eqiad.wmnet` - cloudgw1001.eqiad.wmnet (**PASS**) - Downtimed host on I... 
[13:41:08] 10VPS-project-Wikistats: Add satwiktionary to wikistats - https://phabricator.wikimedia.org/T386636#10563391 (10Dzahn) 05Open→03Resolved a:03Dzahn ` MariaDB [wikistats]> insert into wiktionaries (prefix, lang, loclang, method) values ("sat","Santali","ᱥᱟᱱᱛᱟᱲ&#x... [13:42:10] 06cloud-services-team, 10decommission-hardware, 13Patch-For-Review: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10563397 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: `cloudgw1002.eqiad.wmnet` - cloudgw1002.eqiad.wmnet (**FAIL**)... [13:43:52] 06cloud-services-team, 06DC-Ops, 10decommission-hardware, 10ops-eqiad, 13Patch-For-Review: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10563401 (10Andrew) a:05Andrew→03None [13:48:49] FIRING: PuppetDisabled: Puppet disabled on cloudservices2005-dev:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wmcs&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [13:48:54] 06cloud-services-team: PuppetDisabled Puppet disabled on cloudservices2005-dev:9100 - https://phabricator.wikimedia.org/T386816 (10phaultfinder) 03NEW [13:59:54] 06cloud-services-team, 10Cloud-VPS, 10Continuous-Integration-Infrastructure, 10ci-test-error (WMF-deployed Build Failure), and 2 others: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for ou... - https://phabricator.wikimedia.org/T374830#10563448 [14:02:07] 06cloud-services-team, 10Toolforge, 10Tools: Flickr blocking image requests from Toolforge k8s, breaking multiple tools - https://phabricator.wikimedia.org/T384468#10563456 (10Andrew) Thank you @Don-vip, I've relayed that back to flickr support. 
[14:04:01] 06cloud-services-team: SystemdUnitDown The systemd unit purge_vm_rbd_images.service on node cloudcontrol1005 has been failing for more than two hours. - https://phabricator.wikimedia.org/T386601#10563459 (10Andrew) →14Duplicate dup:03T383796 [14:04:07] 06cloud-services-team, 10Cloud-VPS: race condition in purge_vm_rbd_images.service? - https://phabricator.wikimedia.org/T383796#10563461 (10Andrew) [14:08:49] FIRING: PuppetDisabled: Puppet disabled on cloudservices2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wmcs&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [14:08:56] 06cloud-services-team: PuppetDisabled Puppet disabled on cloudservices2004-dev:9100 - https://phabricator.wikimedia.org/T386819 (10phaultfinder) 03NEW [14:10:54] 10Cloud-VPS (Quota-requests): Increase object storage quota for project spacemedia - https://phabricator.wikimedia.org/T386588#10563511 (10Don-vip) Thank you! It works, I was able to upload the entire collection. 
[14:10:54] (03update) 10raymond-ndibe: [jobs-api] custom resource definition deployment templates [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/101 (https://phabricator.wikimedia.org/T359650) [14:13:49] RESOLVED: PuppetDisabled: Puppet disabled on cloudservices2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wmcs&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [14:14:22] (03update) 10raymond-ndibe: [jobs-api] replace load with diff_job runtime method [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T359804) [14:20:03] (03update) 10raymond-ndibe: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (split_logic_from_api) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (https://phabricator.wikimedia.org/T359650) [14:20:49] (03update) 10raymond-ndibe: [jobs-api] create seperate api.py and move flask things there [repos/cloud/toolforge/jobs-api] (diff_job_runtime_method) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804) [14:21:24] (03update) 10raymond-ndibe: [jobs-api] custom resource definition deployment templates [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/101 (https://phabricator.wikimedia.org/T359650) [14:21:47] (03update) 10raymond-ndibe: [jobs-api] custom resource definition deployment templates [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/101 (https://phabricator.wikimedia.org/T359650) [15:18:23] 10Toolforge (Toolforge iteration 17): [components-api] Rename the CRDs groups 
to be `components-api.toolforge.org` - https://phabricator.wikimedia.org/T386829 (10dcaro) 03NEW [15:18:27] 10Toolforge (Toolforge iteration 17): [components-api] Rename the CRDs groups to be `components-api.toolforge.org` - https://phabricator.wikimedia.org/T386829#10563821 (10dcaro) p:05Triage→03Medium [15:24:46] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-02-12 - https://phabricator.wikimedia.org/T386240#10563871 (10fnegri) 05In progress→03Resolved Replication has been working normally in the past 2 days, I'm marking this task as Resolved. {F584... [15:27:12] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: race condition in purge_vm_rbd_images.service? - https://phabricator.wikimedia.org/T383796#10563901 (10Andrew) 05Open→03Resolved Should be fixed with the above patch. [15:29:26] RESOLVED: SystemdUnitDown: The systemd unit purge_vm_rbd_images.service on node cloudcontrol1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:31:00] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge: [toolsdb] mariadb crashing repeatedly (innodb_fatal_semaphore_wait_threshold) - https://phabricator.wikimedia.org/T385900#10563956 (10fnegri) 05In progress→03Resolved There were no more crashes this week. I could not clearly identify the root caus... 
[15:37:10] (03update) 10raymond-ndibe: [jobs-api] replace load with diff_job runtime method [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T359804) [15:42:45] (03update) 10raymond-ndibe: [jobs-api] create seperate api.py and move flask things there [repos/cloud/toolforge/jobs-api] (diff_job_runtime_method) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804) [15:47:22] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: openstack: wmfkeystonehooks: project ids rather than names are being used in LDAP group creation - https://phabricator.wikimedia.org/T379030#10564039 (10bd808) >>! In T379030#10560867, @Andrew wrote: > Yes, the microservice approach was what I found unr... [15:49:18] FIRING: [3x] KernelErrors: Server cloudgw1003 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudgw1003 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [15:49:22] 06cloud-services-team: KernelErrors Server cloudgw1003 logged kernel errors - https://phabricator.wikimedia.org/T386838 (10phaultfinder) 03NEW [15:49:45] (03update) 10raymond-ndibe: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (split_logic_from_api) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (https://phabricator.wikimedia.org/T359650) [15:55:17] (03update) 10raymond-ndibe: [jobs-api] replace load with diff_job runtime method [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T359804) [15:57:44] (03update) 10raymond-ndibe: [jobs-api] create 
seperate api.py and move flask things there [repos/cloud/toolforge/jobs-api] (diff_job_runtime_method) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804) [16:01:14] (03update) 10raymond-ndibe: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (split_logic_from_api) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (https://phabricator.wikimedia.org/T359650) [16:01:56] (03update) 10raymond-ndibe: [jobs-api] replace load with diff_job runtime method [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T359804) [16:04:51] !log andrew@cloudcumin1001 wikitextexp START - Cookbook wmcs.openstack.quota_increase [16:04:59] !log andrew@cloudcumin1001 wikitextexp END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) [16:07:23] (03open) 10raymond-ndibe: [toolforge-weld] update jobs custom resources version in k8sclient [repos/cloud/toolforge/toolforge-weld] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/75 (https://phabricator.wikimedia.org/T359650) [16:08:33] !log andrew@cloudcumin1001 wikitextexp START - Cookbook wmcs.openstack.quota_increase [16:08:38] !log andrew@cloudcumin1001 wikitextexp END (FAIL) - Cookbook wmcs.openstack.quota_increase (exit_code=99) [16:09:08] (03update) 10raymond-ndibe: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (split_logic_from_api) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (https://phabricator.wikimedia.org/T359650) [16:09:13] 06cloud-services-team, 10Cloud-VPS (Quota-requests), 10Content-Transform-Team (Work In Progress), 07OKR-Work: If necessary, bump down quota for wikitextexp now that we've migrated from parsing-qa-02 -> ctt-qa-03 - 
https://phabricator.wikimedia.org/T386030#10564197 (10Andrew) 05Open→03Resolved a:03... [16:09:21] (03update) 10raymond-ndibe: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (split_logic_from_api) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (https://phabricator.wikimedia.org/T359650) [16:09:51] 06cloud-services-team: KernelErrors Server cloudgw1003 logged kernel errors - https://phabricator.wikimedia.org/T386838#10564202 (10Andrew) 05Open→03Resolved a:03Andrew false alarm, host was reimaged [16:09:52] 06cloud-services-team: PuppetDisabled Puppet disabled on cloudservices2004-dev:9100 - https://phabricator.wikimedia.org/T386819#10564205 (10Andrew) 05Open→03Resolved a:03Andrew [16:09:54] 06cloud-services-team: PuppetDisabled Puppet disabled on cloudservices2005-dev:9100 - https://phabricator.wikimedia.org/T386816#10564207 (10Andrew) 05Open→03Resolved a:03Andrew [16:10:48] 06cloud-services-team, 10Cloud-VPS, 10Ceph, 06DC-Ops, and 2 others: evaluate new drives in cloudcephosd102[123] - https://phabricator.wikimedia.org/T386725#10564210 (10Andrew) p:05Triage→03Medium [16:11:36] !log andrew@cloudcumin1001 wikitextexp START - Cookbook wmcs.openstack.quota_increase [16:11:44] !log andrew@cloudcumin1001 wikitextexp END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) [16:13:18] (03approved) 10dcaro: [toolforge-weld] update jobs custom resources version in k8sclient [repos/cloud/toolforge/toolforge-weld] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/75 (https://phabricator.wikimedia.org/T359650) (owner: 10raymond-ndibe) [16:13:27] 10Toolforge (Toolforge iteration 17): [components-api] Rename the CRDs groups to be `components-api.toolforge.org` - https://phabricator.wikimedia.org/T386829#10564218 (10Raymond_Ndibe) a:03Raymond_Ndibe [16:21:54] 06cloud-services-team, 10Toolforge: jjtest tool not getting deleted - 
https://phabricator.wikimedia.org/T386557#10564244 (Andrew) p:Triage→Low a:Andrew
[16:23:41] (open) andrew: archive step: still declare success if there's nothing to archive [repos/cloud/toolforge/disable-tool] - https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/22 (https://phabricator.wikimedia.org/T386557)
[16:24:44] cloud-services-team, Cloud-VPS, VPS-Projects, Catalyst: metricsinfra: send alerts for the catalyst project to catalyst@w.o email - https://phabricator.wikimedia.org/T386416#10564271 (Andrew) p:Triage→Medium
[16:28:09] cloud-services-team, Data-Services: [wikireplicas] Create views for new wiki sylwiki - https://phabricator.wikimedia.org/T386467#10564275 (fnegri) a:fnegri
[16:28:44] cloud-services-team, Data-Services: [wikireplicas] Create views for new wiki sylwiki - https://phabricator.wikimedia.org/T386467#10564279 (Andrew) p:Triage→High
[16:29:12] (update) dcaro: add prometheus stats [repos/cloud/toolforge/jobs-emailer] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/10 (https://phabricator.wikimedia.org/T320284 https://phabricator.wikimedia.org/T379924)
[16:33:43] (update) rook: Adding loki to install [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/669 (https://phabricator.wikimedia.org/T386480)
[16:40:33] cloud-services-team, Toolforge (Toolforge iteration 17), Epic, Patch-For-Review: [o11y,logging,infra] loki into lima-kilo - https://phabricator.wikimedia.org/T386480#10564330 (rook) As I consider it more I guess it doesn't make much of a difference if it is merged now, as about the biggest "risk"...
[16:42:00] cloud-services-team, Toolforge (Toolforge iteration 17): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#10564348 (Raymond_Ndibe) toolsbeta-image-builder tools-image-builder gitlab_deploy gitlab_ci taavi-test
[16:42:23] (update) dcaro: add prometheus stats [repos/cloud/toolforge/jobs-emailer] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/10 (https://phabricator.wikimedia.org/T320284 https://phabricator.wikimedia.org/T379924)
[16:46:27] (update) dcaro: add prometheus stats [repos/cloud/toolforge/jobs-emailer] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/10 (https://phabricator.wikimedia.org/T320284 https://phabricator.wikimedia.org/T379924)
[16:50:20] (update) dcaro: add prometheus stats [repos/cloud/toolforge/jobs-emailer] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/10 (https://phabricator.wikimedia.org/T320284 https://phabricator.wikimedia.org/T379924)
[17:00:03] (update) dcaro: add prometheus stats [repos/cloud/toolforge/jobs-emailer] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/10 (https://phabricator.wikimedia.org/T320284 https://phabricator.wikimedia.org/T379924)
[17:06:29] (approved) dcaro: add prometheus stats [repos/cloud/toolforge/jobs-emailer] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/10 (https://phabricator.wikimedia.org/T320284 https://phabricator.wikimedia.org/T379924)
[17:06:34] (merge) dcaro: add prometheus stats [repos/cloud/toolforge/jobs-emailer] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/10 (https://phabricator.wikimedia.org/T320284 https://phabricator.wikimedia.org/T379924)
[17:09:15] (open) group_203_bot_4866fc124f4b41659f667468a6115cf3: jobs-emailer: bump to 0.0.52-20250219170643-a14ae54d
[repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/670 (https://phabricator.wikimedia.org/T320284 https://phabricator.wikimedia.org/T379924)
[17:11:30] cloud-services-team, Toolforge (Toolforge iteration 17): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#10564493 (Raymond_Ndibe) [] **all required permissions for local-image-builder robot account** * ( GET /projects/{project_name}/repositories/{re...
[17:11:37] cloud-services-team, Toolforge (Toolforge iteration 17): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#10564507 (Raymond_Ndibe) [] **all required permissions for toolsbeta-image-builder robot account** * ( GET /projects/{project_name}/repositories...
[17:11:53] cloud-services-team, Toolforge (Toolforge iteration 17): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#10564509 (Raymond_Ndibe) [] **all required permissions for tools-image-builder robot account** * ( GET /projects/{project_name}/repositories/{re...
[17:12:00] cloud-services-team, Cloud-VPS: [monitoring] KernelErrors alerts trigger incorrectly when a host is reimaged - https://phabricator.wikimedia.org/T386850 (fnegri) NEW
[17:12:07] cloud-services-team: KernelErrors Server cloudgw1003 logged kernel errors - https://phabricator.wikimedia.org/T386838#10564528 (fnegri) This should not have fired with the filters we have in `kernel-messages-ignore-regex.txt`. I opened {T386850}.
[17:12:24] cloud-services-team: KernelErrors Server cloudgw1003 logged kernel errors - https://phabricator.wikimedia.org/T386838#10564530 (fnegri)
[17:12:28] cloud-services-team, Cloud-VPS: [monitoring] KernelErrors alerts trigger incorrectly when a host is reimaged - https://phabricator.wikimedia.org/T386850#10564531 (fnegri)
[17:22:11] !log dcaro@urcuchillay toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-emailer (T320284)
[17:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[17:22:15] T320284: [jobs-api,jobs-emailer] Prometheus monitoring toolforge-jobs server side components - https://phabricator.wikimedia.org/T320284
[17:25:18] FIRING: [3x] KernelErrors: Server cloudgw1001 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudgw1001 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[17:25:24] cloud-services-team: KernelErrors Server cloudgw1001 logged kernel errors - https://phabricator.wikimedia.org/T386852 (phaultfinder) NEW
[17:30:40] !log dcaro@urcuchillay toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-emailer (T320284)
[17:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[17:30:44] T320284: [jobs-api,jobs-emailer] Prometheus monitoring toolforge-jobs server side components - https://phabricator.wikimedia.org/T320284
[17:35:34] (open) fnegri: Draft: Upgrade Kubernetes to 1.29 [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/227
[18:12:16] Striker, Continuous-Integration-Infrastructure, Toolhub, ci-test-error (WMF-deployed Build Failure): striker-pipeline-test failing to load pipelinelib with git error -
https://phabricator.wikimedia.org/T386755#10564925 (bd808) https://integration.wikimedia.org/ci/job/wikimedia-toolhub-pipeline-te...
[18:13:30] Striker, Continuous-Integration-Infrastructure, Toolhub, ci-test-error (WMF-deployed Build Failure): Multiple *-pipeline-test jobs failing to load pipelinelib with git error - https://phabricator.wikimedia.org/T386755#10564932 (bd808)
[18:14:11] Striker, Continuous-Integration-Infrastructure, Toolhub, ci-test-error (WMF-deployed Build Failure): Multiple *-pipeline-test jobs failing to load pipelinelib with git error - https://phabricator.wikimedia.org/T386755#10564934 (bd808)
[18:16:48] Striker, Continuous-Integration-Infrastructure, Language and Product Localization, Toolhub, ci-test-error (WMF-deployed Build Failure): Multiple *-pipeline-test jobs failing to load pipelinelib with git error - https://phabricator.wikimedia.org/T386755#10564954 (bd808) p:Triage→Unbreak...
[18:31:49] Striker, Continuous-Integration-Infrastructure, Language and Product Localization, Toolhub, ci-test-error (WMF-deployed Build Failure): Multiple *-pipeline-test jobs failing to load pipelinelib with git error - https://phabricator.wikimedia.org/T386755#10565037 (bd808) Trying to narrow down w...
[18:35:12] Striker, Continuous-Integration-Infrastructure, Language and Product Localization, Toolhub, ci-test-error (WMF-deployed Build Failure): Multiple *-pipeline-test jobs failing to load pipelinelib with git error - https://phabricator.wikimedia.org/T386755#10565060 (taavi) This feels like very fa...
[18:51:11] Striker, Continuous-Integration-Infrastructure, Language and Product Localization, Toolhub, ci-test-error (WMF-deployed Build Failure): Multiple *-pipeline-test jobs failing to load pipelinelib with git error - https://phabricator.wikimedia.org/T386755#10565212 (dduvall) a:dduvall
[19:16:22] Striker, Continuous-Integration-Infrastructure, Language and Product Localization, Toolhub, ci-test-error (WMF-deployed Build Failure): Multiple *-pipeline-test jobs failing to load pipelinelib with git error - https://phabricator.wikimedia.org/T386755#10565346 (taavi)
[19:24:43] Striker, Continuous-Integration-Infrastructure, Language and Product Localization, Toolhub, ci-test-error (WMF-deployed Build Failure): Multiple *-pipeline-test jobs failing to load pipelinelib with git error - https://phabricator.wikimedia.org/T386755#10565364 (dduvall) >>! In T386755#105650...
[19:32:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[19:33:36] Striker, Continuous-Integration-Infrastructure, Language and Product Localization, Toolhub, ci-test-error (WMF-deployed Build Failure): Multiple *-pipeline-test jobs failing to load pipelinelib with git error - https://phabricator.wikimedia.org/T386755#10565388 (dduvall) Seeing https://github...
[19:49:18] FIRING: [3x] KernelErrors: Server cloudgw1003 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudgw1003 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[19:49:41] cloud-services-team: KernelErrors Server cloudgw1003 logged kernel errors - https://phabricator.wikimedia.org/T386865 (phaultfinder) NEW
[20:09:13] Striker, Continuous-Integration-Infrastructure, Language and Product Localization, Toolhub, ci-test-error (WMF-deployed Build Failure): Multiple *-pipeline-test jobs failing to load pipelinelib with git error - https://phabricator.wikimedia.org/T386755#10565475 (dduvall) Open→Resol...
[20:20:43] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-37
[20:22:55] cloud-services-team: KernelErrors Server cloudgw1003 logged kernel errors - https://phabricator.wikimedia.org/T386865#10565516 (aborrero)
[20:22:58] cloud-services-team, Cloud-VPS: [monitoring] KernelErrors alerts trigger incorrectly when a host is reimaged - https://phabricator.wikimedia.org/T386850#10565517 (aborrero)
[20:25:17] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-37
[20:25:31] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-55
[20:27:30] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-55
[20:57:24] cloud-services-team, Cloud-VPS, VPS-Projects, Catalyst: metricsinfra: send alerts for the catalyst project to catalyst@w.o email - https://phabricator.wikimedia.org/T386416#10565604 (EBomani) Hello David, not sure how the `catalyst-qte` email alerts are configured right now. I will ask other peop...
[21:27:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[21:32:31] (CR) BryanDavis: templates: Remove unused Phabricator policy formatting code [labs/striker] - https://gerrit.wikimedia.org/r/1119865 (owner: Majavah)
[21:32:44] (CR) BryanDavis: [C:+2] "retry #4" [labs/striker] - https://gerrit.wikimedia.org/r/1119865 (owner: Majavah)
[21:35:26] (Merged) jenkins-bot: templates: Remove unused Phabricator policy formatting code [labs/striker] - https://gerrit.wikimedia.org/r/1119865 (owner: Majavah)
[21:39:24] cloud-services-team, Cloud-VPS, VPS-Projects, Catalyst: metricsinfra: send alerts for the catalyst project to catalyst@w.o email - https://phabricator.wikimedia.org/T386416#10565738 (jeena) Hi @dcaro, just to add some clarification if you need it, we do get emails for puppet failures from the cat...
[23:13:59] (CR) BryanDavis: [C:-1] Get openstack project list from keystone (1 comment) [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/1093997 (https://phabricator.wikimedia.org/T379030) (owner: Andrew Bogott)
[23:19:23] cloud-services-team, Toolforge (Toolforge iteration 17): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#10566027 (Raymond_Ndibe) [] **all required permissions for gitlab_deploy robot account** * ( GET /projects/{project_name}/repositories )[ projec...
[23:19:33] cloud-services-team, Toolforge (Toolforge iteration 17): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#10566028 (Raymond_Ndibe) [] **all required permissions for gitlab_ci robot account** * ( GET /projects/{project_name}/repositories )[ project-pe...
[23:24:05] cloud-services-team, Toolforge (Toolforge iteration 17): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#10566031 (Raymond_Ndibe) [] **all required permissions for taavi-test robot account (toolsbeta)** * (not sure what permission should be assigned...
[23:30:34] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10566063 (Gryllida) Yes
[23:58:13] (update) raymond-ndibe: [builds-builder::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/64 (https://phabricator.wikimedia.org/T384327)
[23:58:18] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10566114 (bd808) >>! In T376267#10558107, @Gryllida wrote: > |**Wikitech account/LDAP:**| Svetlana Tkachenko| > |**SUL account**| Gryllida| > |**Account linked on [[ https://idm.wikimedia.o...
[23:58:32] (approved) raymond-ndibe: [builds-builder::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/64 (https://phabricator.wikimedia.org/T384327)
[23:58:55] (update) raymond-ndibe: [builds-builder::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/64 (https://phabricator.wikimedia.org/T384327)
[23:59:37] (merge) raymond-ndibe: [builds-builder::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/64 (https://phabricator.wikimedia.org/T384327)