[00:12:57] Toolforge (Toolforge iteration 03): Create a kubernetes container with mono and dotnet - https://phabricator.wikimedia.org/T311466 (Hawkeye7) Oh. Thanks for that. I am not in the habit of calling the main program Program.cs (the default) because I usually build more than one in the same solution. Having a se...
[01:37:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-84 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:42:23] (ToolforgeKubernetesNodeNotReady) firing: Kubernetes node tools-k8s-worker-84 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[01:44:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[01:49:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[02:03:04] (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[02:08:00] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:09:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[02:14:45] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[02:47:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[02:52:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:02:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:07:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:20:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:25:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:56:45] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[04:01:45] (ProbeDown) resolved: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[04:23:59] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: connect to address wikitech-static.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:24:11] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:37:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-84 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:42:23] (ToolforgeKubernetesNodeNotReady) firing: Kubernetes node tools-k8s-worker-84 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[04:44:08] Toolforge (Toolforge iteration 03), Toolforge Build Service: Toolforge Build Service: add the locale buildpack - https://phabricator.wikimedia.org/T354128 (Earwig) Open→Resolved I was able to fix this — the locale for Chinese is apparently supposed to be `zh-hant_TW` instead of `zh_TW` or `zh_Han...
[04:52:09] Toolforge (Toolforge iteration 03), Toolforge Build Service: Toolforge Build Service: add the locale buildpack - https://phabricator.wikimedia.org/T354128 (Legoktm) Why are we not installing all the locales by default? Given that internationalization and language support are first-class priorities of the...
[04:57:03] Tools: [dplbot] uncategorized articles not working - https://phabricator.wikimedia.org/T355014 (Legoktm)
[04:58:21] RECOVERY - HTTPS-wikitech-static on wikitech-static.wikimedia.org is OK: SSL OK - Certificate status.wikimedia.org valid until 2024-03-06 18:33:38 +0000 (expires in 51 days) https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:59:21] Tools: [dplbot] uncategorized articles not working - https://phabricator.wikimedia.org/T355014 (Legoktm) Unfortunately @Bearcat, if the tool maintainers aren't active, there's no one who is really going to respond to this Phabricator task either. You may be better off finding someone to re-implement the func...
[05:00:17] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26898 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:03:04] (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[05:04:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[05:09:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[06:08:16] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[06:49:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[06:54:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[07:37:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-84 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:42:23] (ToolforgeKubernetesNodeNotReady) firing: Kubernetes node tools-k8s-worker-84 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[08:03:04] (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[08:14:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:19:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:27:45] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:32:45] (ProbeDown) resolved: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:11:40] Toolforge: There are some files that I cannot view, delete, or do anything to - https://phabricator.wikimedia.org/T355022 (fnegri) I tried running these commands: ` become gergesbot2 cd WikipediaBot composer update ` I do get an error but I don't think it's related to permissions: ` In ArrayLoader.php lin...
[09:14:45] Toolforge (Toolforge iteration 03), Toolforge Build Service: Toolforge Build Service: add the locale buildpack - https://phabricator.wikimedia.org/T354128 (dcaro) >>! In T354128#9458772, @Legoktm wrote: > Why are we not installing all the locales by default? Given that internationalization and language s...
[09:17:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-84 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:17:23] (ToolforgeKubernetesNodeNotReady) resolved: Kubernetes node tools-k8s-worker-84 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[10:08:16] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:03:04] (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:12:05] cloud-services-team, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2023/2024-Q3): Karma UI shows duplicate alerts - https://phabricator.wikimedia.org/T353457 (fgiunchedi)
[11:13:59] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), Infrastructure-Foundations: [wmcs-cookbooks] Downtime alerts from cloudcumins - https://phabricator.wikimedia.org/T347490 (fgiunchedi)
[11:14:10] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), Infrastructure-Foundations, Observability-Alerting: [wmcs-cookbooks] Downtime alerts from cloudcumins - https://phabricator.wikimedia.org/T347490 (fgiunchedi)
[11:14:47] Toolforge: There are some files that I cannot view, delete, or do anything to - https://phabricator.wikimedia.org/T355022 (GergesSH) I tried updating composer and git files and it worked without problems, But I tried to delete an old folder I have, but no response appears by command ` become gergesbot2 cd Wi...
[11:16:52] Cloud-VPS, SRE, observability, Patch-For-Review, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (fgiunchedi)
[11:18:00] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:37:30] Toolforge (Toolforge iteration 03), Toolforge Build Service: Toolforge Build Service: add the locale buildpack - https://phabricator.wikimedia.org/T354128 (dcaro) Update on this, I have done a series of improvements over the upstream buildpack (I have to send them upstream still): * Support for scripts...
[11:53:09] (CephSlowOps) firing: Ceph cluster in eqiad has 3 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[11:53:16] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352570 (phaultfinder)
[11:58:09] (CephSlowOps) resolved: Ceph cluster in eqiad has 3 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[12:26:49] (CR) CI reject: [V: -1] Localisation updates from https://translatewiki.net. [labs/tools/watch-translations] - https://gerrit.wikimedia.org/r/990660 (owner: L10n-bot)
[12:26:51] (CR) CI reject: [V: -1] Localisation updates from https://translatewiki.net. [labs/tools/Isa] - https://gerrit.wikimedia.org/r/990658 (owner: L10n-bot)
[12:37:00] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[12:41:28] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[12:43:04] Cloud-VPS: wmcs.openstack.restart_openstack deals badly with down cloudvirts - https://phabricator.wikimedia.org/T355058 (taavi)
[12:43:15] Cloud-VPS, cloud-services-team: wmcs.openstack.restart_openstack deals badly with down cloudvirts - https://phabricator.wikimedia.org/T355058 (taavi)
[12:45:56] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:47:49] Grid-Engine-to-K8s-Migration, Wiki-Loves-Monuments-Database: Migrate heritage from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319787 (Lokal_Profil) The blocker here seems to be (@JeanFred please chime in if I misrepresented something): * We have a long running job (...
[12:50:56] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:12:55] (MaxConnTrack) firing: Max conntrack at 90% on cloudvirt1060:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConnTrack
[13:13:00] cloud-services-team: MaxConnTrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1060:9100 - https://phabricator.wikimedia.org/T355061 (phaultfinder)
[13:20:12] (PS1) Majavah: openstack: restart_openstack: rename --all to --all-services [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990690
[13:20:14] (PS1) Majavah: openstack: restart_openstack: add support for node role filtering [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990691
[13:21:30] (PS2) Majavah: openstack: restart_openstack: rename --all to --all-services [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990690
[13:21:32] (PS2) Majavah: openstack: restart_openstack: add support for node role filtering [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990691
[13:25:10] (CR) CI reject: [V: -1] openstack: restart_openstack: add support for node role filtering [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990691 (owner: Majavah)
[13:28:50] (PS3) Majavah: openstack: restart_openstack: add support for node role filtering [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990691
[13:53:59] (CR) FNegri: [C: +1] "LGTM" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990690 (owner: Majavah)
[13:54:49] (CR) FNegri: [C: +1] "Nice one" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990691 (owner: Majavah)
[13:57:27] (CR) Majavah: [C: +2] openstack: restart_openstack: rename --all to --all-services [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990690 (owner: Majavah)
[13:57:29] (CR) Majavah: [C: +2] openstack: restart_openstack: add support for node role filtering [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990691 (owner: Majavah)
[14:01:32] (Merged) jenkins-bot: openstack: restart_openstack: rename --all to --all-services [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990690 (owner: Majavah)
[14:02:14] (Merged) jenkins-bot: openstack: restart_openstack: add support for node role filtering [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990691 (owner: Majavah)
[14:03:04] (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:13:01] Grid-Engine-to-K8s-Migration, Wiki-Loves-Monuments-Database: Migrate heritage from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319787 (taavi) What does "pausing" mean here in practice? How is that implemented on the grid?
[14:26:14] cloud-services-team: MaxConnTrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1060:9100 - https://phabricator.wikimedia.org/T355061 (taavi) a:taavi This is a new alert, but seems like an issue regardless.
[14:27:58] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot (T355061)
[14:28:05] T355061: MaxConnTrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1060:9100 - https://phabricator.wikimedia.org/T355061
[14:32:27] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) (T355061)
[14:32:54] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot (T355061)
[14:37:07] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) (T355061)
[14:37:13] T355061: MaxConnTrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1060:9100 - https://phabricator.wikimedia.org/T355061
[14:37:28] Cloud-VPS, cloud-services-team: wmcs-drain-hypervisor is broken - https://phabricator.wikimedia.org/T355067 (taavi)
[14:37:38] Cloud-VPS, cloud-services-team: MaxConnTrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1060:9100 - https://phabricator.wikimedia.org/T355061 (taavi)
[14:41:34] Cloud-VPS, cloud-services-team: wmcs-drain-hypervisor is broken - https://phabricator.wikimedia.org/T355067 (taavi) The migrations "failed" it seems. `lang=shell-session taavi@cloudcontrol1006 ~ $ os server migration list --changes-since 2024-01-15T00:00:00Z +-------+-------------------------------------...
[14:41:53] Cloud-VPS, cloud-services-team: wmcs-drain-hypervisor is broken - https://phabricator.wikimedia.org/T355067 (taavi) Seems related to the recent PKI changes: ` 2024-01-15 14:36:58.102 236191 ERROR nova.virt.libvirt.driver [None req-31287e76-435f-4780-838f-87bcad67a60d novaadmin admin - - default default]...
[14:51:21] Cloud-VPS, cloud-services-team: wmcs-drain-hypervisor is broken - https://phabricator.wikimedia.org/T355067 (taavi) And there are matching entries in the `cloudvirt1046` logs: ` Jan 15 14:28:54 cloudvirt1046 libvirtd[2662926]: Unable to verify TLS peer: No certificate was found. Jan 15 14:28:54 cloudvir...
[14:52:01] Cloud-VPS, cloud-services-team: wmcs-drain-hypervisor is broken - https://phabricator.wikimedia.org/T355067 (taavi) >>! In T355067#9460003, @taavi wrote: > Where does nova specify the client certificate to use? Here, probably: `name=/etc/nova/nova-compute.conf live_migration_uri=qemu://%s.eqiad.wmnet/sy...
[14:52:10] (CR) Nikerabbit: [V: +2] Localisation updates from https://translatewiki.net. [labs/tools/watch-translations] - https://gerrit.wikimedia.org/r/990660 (owner: L10n-bot)
[14:52:35] (CR) Nikerabbit: [V: +2] Localisation updates from https://translatewiki.net. [labs/tools/Isa] - https://gerrit.wikimedia.org/r/990658 (owner: L10n-bot)
[14:54:57] Cloud-VPS, cloud-services-team: wmcs-drain-hypervisor is broken - https://phabricator.wikimedia.org/T355067 (taavi) This can also be reproduced via the `virsh` CLI: `lang=shell-session taavi@cloudvirt1060 ~ $ sudo virsh --connect qemu://cloudvirt1046.eqiad.wmnet/system?pkipath=/var/lib/nova 2024-01-15 14...
[15:01:41] Tools: [dplbot] uncategorized articles not working - https://phabricator.wikimedia.org/T355014 (russblau) a:russblau Sorry for the delay in addressing this, but I've been busy with higher priority tasks (such as making sure the management didn't shut down this set of tools entirely). As of now, https://d...
[15:03:55] (MaxConnTrack) firing: Max conntrack at 80% on cloudvirt1060:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConnTrack
[15:04:16] Tools: [dplbot] uncategorized articles not working - https://phabricator.wikimedia.org/T355014 (russblau) Also, I'll note to @Bearcat that none of the dplbot maintainers were originally tagged on this task, which made it kind of hard for us to address it. Maintainers are listed at https://toolsadmin.wikimedi...
[15:07:55] (MaxConnTrack) resolved: Max conntrack at 90% on cloudvirt1060:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConnTrack
[15:08:55] (MaxConnTrack) resolved: Max conntrack at 80% on cloudvirt1060:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConnTrack
[15:17:25] (PS1) Brouberol: Index all gitlab projects under repos/data-engineering [labs/codesearch] - https://gerrit.wikimedia.org/r/990709 (https://phabricator.wikimedia.org/T355069)
[15:18:16] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[15:18:29] (CR) CI reject: [V: -1] Index all gitlab projects under repos/data-engineering [labs/codesearch] - https://gerrit.wikimedia.org/r/990709 (https://phabricator.wikimedia.org/T355069) (owner: Brouberol)
[15:20:21] (PS2) Brouberol: Index all gitlab projects under repos/data-engineering [labs/codesearch] - https://gerrit.wikimedia.org/r/990709 (https://phabricator.wikimedia.org/T355069)
[15:43:01] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[16:02:45] (PS3) Brouberol: Index all gitlab projects under repos/data-engineering [labs/codesearch] - https://gerrit.wikimedia.org/r/990709 (https://phabricator.wikimedia.org/T355069)
[17:03:04] (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[18:00:47] Toolforge (Toolforge iteration 03), Toolforge Build Service: Toolforge Build Service: add the locale buildpack - https://phabricator.wikimedia.org/T354128 (Earwig) I can confirm it's working now. Thanks so much for fixing this quickly!
[18:33:01] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[18:46:46] VPS-project-Codesearch, Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Add all Data Engineering gitlab repositories to codesearch - https://phabricator.wikimedia.org/T355069 (Peachey88)
[19:01:45] Toolforge: There are some files that I cannot view, delete, or do anything to - https://phabricator.wikimedia.org/T355022 (Andrew) Hello! Based on your comments on IRC (which I didn't see until much later, sorry) I suspect you are trying to run commands in the **gergesbot2** tool while signed in as a differ...
[19:08:00] Toolforge: There are some files that I cannot view, delete, or do anything to - https://phabricator.wikimedia.org/T355022 (GergesSH) ` gergesshamon@tools-sgebastion-10:~$ become gergesbot2 tools.gergesbot2@tools-sgebastion-10:~$ whoami tools.gergesbot2 tools.gergesbot2@tools-sgebastion-10:~$ cd WikipediaBot...
[19:52:18] (PS4) Brouberol: Index all gitlab projects under repos/data-engineering [labs/codesearch] - https://gerrit.wikimedia.org/r/990709 (https://phabricator.wikimedia.org/T355069)
[20:00:54] (PS5) Brouberol: Index all gitlab projects under repos/data-engineering [labs/codesearch] - https://gerrit.wikimedia.org/r/990709 (https://phabricator.wikimedia.org/T355069)
[20:08:04] (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[22:55:34] (PS1) Eugene233: Backslashes in some ISA messages [labs/tools/Isa] (m2c) - https://gerrit.wikimedia.org/r/990674 (https://phabricator.wikimedia.org/T299863)
[23:00:48] (PS1) Eugene233: Fix capitalization in some ISA messages [labs/tools/Isa] (m2c) - https://gerrit.wikimedia.org/r/990675 (https://phabricator.wikimedia.org/T354920)
[23:08:04] (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-bastion-6 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[23:35:13] Tool-Pageviews, Data-Engineering, Pageviews-API: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (MusikAnimal) >>! In T219857#9423357, @TheDJ wrote: > @MusikAnimal is this still an issue ? Since there hasn't happened anything in this ticket for...
[23:51:58] tool-fast-ec: fast-ec discrepancies with xtools (total edit count) - https://phabricator.wikimedia.org/T325492 (Enterprisey) so it took me a while to notice this, but xtools actually was matching fast-ec if you go by the sum of the "yearly bar graph" chart. unfortunately, that number turns out to be wrong du...
[23:58:14] tool-fast-ec: fast-ec discrepancies with xtools (total edit count) - https://phabricator.wikimedia.org/T325492 (MusikAnimal) > see T325492 for the ongoing investigation (thanks musik for actually doing the work) You mean {T355027} :) And no problem. Thank you for discovering this discrepancy!