[00:02:21] 10PAWS: New upstream release for OpenRefine - https://phabricator.wikimedia.org/T365539#9819416 (10LibUp-bot) [00:53:55] FIRING: MaxConntrack: Max conntrack at 81.01% on cloudvirt1042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [00:55:56] FIRING: MaxConntrack: Max conntrack at 90.34% on cloudvirt1042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [00:55:59] 06cloud-services-team: MaxConntrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1042:9100 - https://phabricator.wikimedia.org/T365540 (10phaultfinder) 03NEW [00:58:55] RESOLVED: MaxConntrack: Max conntrack at 89.54% on cloudvirt1042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [01:00:55] RESOLVED: MaxConntrack: Max conntrack at 90.34% on cloudvirt1042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [02:30:39] 10Tool-bub2: Add persistance to queue page on refresh - https://phabricator.wikimedia.org/T357236#9819554 (10theprotonade) a:03theprotonade [02:46:57] FIRING: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [04:24:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes 
(http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [04:29:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [04:40:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM [04:44:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [04:45:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM [04:49:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [04:54:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [05:00:11] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [05:14:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [05:19:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [05:20:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [05:25:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [05:30:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [05:31:56] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [05:36:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [05:41:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [05:46:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [05:51:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [05:56:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [05:57:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [06:02:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [06:03:26] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [06:08:26] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [06:18:26] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [06:19:26] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [06:24:26] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [06:29:26] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [06:39:26] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes 
(http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [06:46:57] FIRING: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [06:59:26] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:04:26] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:09:26] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:10:56] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:15:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:25:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:30:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:31:11] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:31:41] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:35:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:36:41] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:55:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:56:41] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:58:43] (03PS1) 10Muehlenhoff: Add stub secrets for mpic_next [labs/private] - 10https://gerrit.wikimedia.org/r/1034842 (https://phabricator.wikimedia.org/T361341) [08:01:41] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:05:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:06:41] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:07:44] (03CR) 10Slyngshede: [C:03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/1034842 (https://phabricator.wikimedia.org/T361341) (owner: 10Muehlenhoff) [08:10:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:15:17] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] 
Add stub secrets for mpic_next [labs/private] - https://gerrit.wikimedia.org/r/1034842 (https://phabricator.wikimedia.org/T361341) (owner: Muehlenhoff)
[08:15:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:16:41] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:17:12] Cloud-VPS, SRE: Depleted connection tracking table on labvirt1010 - https://phabricator.wikimedia.org/T139598#9820055 (taavi)
[08:20:31] Toolforge (Toolforge iteration 09): [maintain-kubeusers] Increment default services quota - https://phabricator.wikimedia.org/T362520#9820057 (dcaro) >>! In T362520#9807885, @taavi wrote: > We need to have some quota in place to prevent a misbehaving tool from taking kube-apiserver down by creating hundreds...
[08:20:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:21:00] (update) sstefanova: [lima-kilo] enable toolforge-weld installation [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/130 (owner: raymond-ndibe)
[08:21:41] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:22:17] cloud-services-team, Cloud-VPS: MaxConntrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1042:9100 - https://phabricator.wikimedia.org/T365540#9820066 (taavi) This is effectively a repeat of {T355222} as diffscan has at some point migrated to cloudvirt1042. On http...
[08:22:40] 10Toolforge, 10GitLab (Infrastructure): Whitelist Toolforge hosts in Gitlab shared runners - https://phabricator.wikimedia.org/T365561 (10Sportzpikachu) 03NEW [08:22:56] 10Toolforge, 10GitLab (Infrastructure): Whitelist Toolforge hosts in Gitlab shared runners - https://phabricator.wikimedia.org/T365561#9820083 (10Sportzpikachu) [08:23:11] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:27:26] 06cloud-services-team, 10Cloud-VPS: MaxConntrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1042:9100 - https://phabricator.wikimedia.org/T365540#9820094 (10aborrero) I think doubling the size of the conntrack table should be just fine. [08:28:11] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:28:44] (03update) 10sstefanova: Draft: prefix endpoints with /tool/{toolname}/ [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/93 [08:30:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:34:07] 06cloud-services-team, 10Toolforge: toolforge: admin tool /healthz returns 503 from time to time - https://phabricator.wikimedia.org/T365562 (10aborrero) 03NEW [08:35:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:36:22] 06cloud-services-team, 10Toolforge: toolforge: admin tool /healthz returns 503 from time to time - https://phabricator.wikimedia.org/T365562#9820128 (10aborrero) @taavi found one of the pods in the admin tool is secretly misbehaving: {P62847} [08:38:11] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:40:10] (03update) 10sstefanova: Draft: prefix endpoints with /tool/{toolname}/ [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/93 [08:40:48] 06cloud-services-team: MetricsinfraAlertmanagerDown - https://phabricator.wikimedia.org/T365462#9820134 (10dcaro) This was caused by the OVS switch during the maintenance window: {T364459} [08:40:53] 06cloud-services-team: MetricsinfraAlertmanagerDown - https://phabricator.wikimedia.org/T365462#9820136 (10dcaro) 05Open→03Resolved a:03dcaro [08:41:34] (03update) 10sstefanova: Draft: prefix endpoints with /tool/{toolname}/ [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/93 [08:43:54] 10Toolforge: [components-api] add one-off, scheduled and continuous jobs support to the yaml + api - https://phabricator.wikimedia.org/T362075#9820145 (10dcaro) Maybe just `use: `? 
[08:53:30] cloud-services-team, Toolforge: toolforge: admin tool /healthz returns 503 from time to time - https://phabricator.wikimedia.org/T365562#9820192 (taavi) Open→Resolved a:taavi
[09:11:48] cloud-services-team, Toolforge, Documentation, Kubernetes: Figure out and document how to call the Kubernetes API as your tool user from inside a pod - https://phabricator.wikimedia.org/T321919#9820257 (dcaro) Open→In progress p:Triage→High
[09:12:03] cloud-services-team, Toolforge, Documentation, Kubernetes: Figure out and document how to call the Kubernetes API as your tool user from inside a pod - https://phabricator.wikimedia.org/T321919#9820259 (dcaro) In progress→Open
[09:13:15] (update) sstefanova: Draft: prefix endpoints with /tool/{toolname}/ [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/93
[09:21:09] Toolforge: [components-api] add one-off, scheduled and continuous jobs support to the yaml + api - https://phabricator.wikimedia.org/T362075#9820275 (fnegri) I like `use-image`, `image-from` or `reuse-image`. Basically, anything with the word `image` in it to clarify that the image is "reused", not the config.
[09:21:51] (approved) dcaro: dev: oapi-codegen updates [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/91 (owner: sstefanova)
[09:21:56] (update) dcaro: dev: oapi-codegen updates [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/91 (owner: sstefanova)
[09:22:53] (approved) dcaro: py3.11-bookworm-tox*: point to the right builder versions [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/39 (owner: aborrero)
[09:22:55] (update) dcaro: py3.11-bookworm-tox*: point to the right builder versions [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/39 (owner: aborrero)
[09:30:36] (merge) aborrero: py3.11-bookworm-tox*: point to the right builder versions [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/39
[09:32:40] Toolforge: [components-api] add one-off, scheduled and continuous jobs support to the yaml + api - https://phabricator.wikimedia.org/T362075#9820293 (dcaro) > I like use-image, image-from or reuse-image. Basically, anything with the word image in it to clarify that the image is "reused", not the config. I s...
[09:33:57] (update) sstefanova: dev: oapi-codegen updates [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/91
[09:38:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-idp-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[09:48:40] Toolforge: [components-api] add one-off, scheduled and continuous jobs support to the yaml + api - https://phabricator.wikimedia.org/T362075#9820322 (Slst2020) Hmm, we might not be the best folks to decide which term is more descriptive/unambiguous because we're too deep in the soup. 🙈 Could we somehow run a...
[09:52:24] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 09), Patch-For-Review: [toolforge] Redis refusing connections - https://phabricator.wikimedia.org/T363709#9820325 (fnegri) Today I randomly found the task {T318479} which makes me slightly worried that setting a timeout could cau...
[09:55:09] (03update) 10aborrero: maintain_kubeusers: introduce resource abstraction [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/23 (https://phabricator.wikimedia.org/T364312) [10:02:24] (03merge) 10sstefanova: dev: oapi-codegen updates [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/91 [10:02:42] (03update) 10sstefanova: Draft: prefix endpoints with /tool/{toolname}/ [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/93 [10:09:41] (03update) 10project_1317_bot_df3177307bed93c3f34e421e26c86e38: builds-api: bump to 0.0.144-20240521144209-4947025a [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/283 [10:13:10] (03update) 10sstefanova: Draft: prefix endpoints with /tool/{toolname}/ [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/93 [10:28:28] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-idp-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [10:46:57] FIRING: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [11:12:47] (03update) 10aborrero: maintain_kubeusers: introduce resource abstraction [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/23 (https://phabricator.wikimedia.org/T364312) [11:23:45] (03update) 10sstefanova: Draft: prefix endpoints with /tool/{toolname}/ 
[repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/93 [11:33:47] 06cloud-services-team, 10Toolforge (Quota-requests): Request increased quota for video-answer-tool Toolforge tool - https://phabricator.wikimedia.org/T365536#9820614 (10taavi) 05Open→03Resolved a:03taavi I increased the quota to 2G. [11:44:00] (03CR) 10D3r1ck01: [C:03+2] Move Wikifunctions services into Services [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1031054 (owner: 10Jforrester) [11:45:16] (03Merged) 10jenkins-bot: Move Wikifunctions services into Services [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1031054 (owner: 10Jforrester) [12:18:11] 10Toolforge: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066#9820783 (10Raymond_Ndibe) >>! In T341066#9692931, @dcaro wrote: >>>! In T341066#9691067, @Raymond_Ndibe wrote: >> how will this affect the current `3 continuous jobs` limit? does 2 replicas o... [12:19:01] (03update) 10aborrero: maintain_kubeusers: introduce resource abstraction [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/23 (https://phabricator.wikimedia.org/T364312) [12:22:53] 10Toolforge (Toolforge iteration 09), 13Patch-For-Review: [jobs-api] Split the API, business, and k8s models - https://phabricator.wikimedia.org/T359808#9820795 (10Raymond_Ndibe) this is done right @dcaro? 
we should mark it as resolved if so [12:32:48] 10Toolforge: [jobs-api] Save business models in a DB - https://phabricator.wikimedia.org/T359650#9820816 (10Raymond_Ndibe) a:03Raymond_Ndibe [12:34:42] (03update) 10aborrero: maintain_kubeusers: introduce resource abstraction [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/23 (https://phabricator.wikimedia.org/T364312) [12:36:50] (03update) 10sstefanova: Draft: prefix endpoints with /tool/{toolname}/ [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/93 [12:47:23] 10Toolforge (Toolforge iteration 09): [toolforge] Investigate authentication - https://phabricator.wikimedia.org/T363983#9820873 (10dcaro) a:03dcaro [12:48:00] (03update) 10sstefanova: Draft: prefix endpoints with /tool/{toolname}/ [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/93 [12:54:12] 10Toolforge (Toolforge iteration 09): I can't connect to Toolforge DB replicas from my PC using MySQL Workbench - https://phabricator.wikimedia.org/T360839#9820912 (10dcaro) @MBH hi! Just wondering if you had been able to setup the tunnel finally? 
[13:18:50] 10PAWS: move prometheus later in deploy - https://phabricator.wikimedia.org/T365590 (10rook) 03NEW [13:23:13] 06cloud-services-team, 10Toolforge (Toolforge iteration 09): toolforge: Refresh certs that are not controlled by kubeadm (mid 2024 edition) - https://phabricator.wikimedia.org/T309782#9821042 (10taavi) [13:35:56] 06cloud-services-team, 10Toolforge (Toolforge iteration 09): [maintain-kubeusers,infra,k8s]: introduce some logic to backfill maintain-kubeuser resources (like per-tool kyverno policies) - https://phabricator.wikimedia.org/T364312#9821141 (10dcaro) [13:36:30] 06cloud-services-team, 10Toolforge (Toolforge iteration 09): [maintain-kubeusers,infra,k8s]: introduce some logic to backfill maintain-kubeuser resources (like per-tool kyverno policies) - https://phabricator.wikimedia.org/T364312#9821130 (10aborrero) 05Open→03In progress [13:38:27] 10Toolforge (Toolforge iteration 09), 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project: [maintain-harbor,docs] Document current setup and admin procedures - https://phabricator.wikimedia.org/T329176#9821164 (10dcaro) 05Stalled→03Open [13:44:41] 10Toolforge: remove "File log:" column from toolforge jobs list -o long output - https://phabricator.wikimedia.org/T361896#9821188 (10dcaro) a:05Raymond_Ndibe→03None [13:45:32] 10Toolforge: [maintain-harbor] Have maintain-harbor use a robot account - https://phabricator.wikimedia.org/T361698#9821190 (10dcaro) a:05Slst2020→03None [13:46:37] 10Toolforge: [jobs-cli,jobs-api] quota shows different units for limit and usage - https://phabricator.wikimedia.org/T361120#9821215 (10dcaro) a:05Raymond_Ndibe→03None [13:47:00] 06cloud-services-team, 10Cloud-VPS, 06DC-Ops, 10ops-eqiad, 06SRE: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#9821211 (10aborrero) [13:47:20] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE, 13Patch-For-Review: Move WMCS servers to 1 single NIC - 
https://phabricator.wikimedia.org/T319184#9821206 (aborrero) Open→Stalled blocking until {T364984} is fixed, so we don't risk having another cloudvirt offline.
[13:49:02] Toolforge (Toolforge iteration 09): [jobs-api,builds-api,envvars-api] consolidate api paths - https://phabricator.wikimedia.org/T365014#9821244 (dcaro)
[13:49:06] Toolforge (Toolforge iteration 09): [jobs-api] Prefix all endpoints with `/tool/` - https://phabricator.wikimedia.org/T363346#9821246 (dcaro)
[13:49:29] Toolforge (Toolforge iteration 09): [envvars-api] Prefix all endpoints with `/tool/` - https://phabricator.wikimedia.org/T363809#9821248 (dcaro)
[13:50:17] Toolforge (Toolforge iteration 09): [jobs-api,builds-api,envvars-api] consolidate api paths - https://phabricator.wikimedia.org/T365014#9821250 (dcaro)
[13:50:46] Toolforge (Toolforge iteration 09): [jobs-api] Prefix all endpoints with `/tool/` - https://phabricator.wikimedia.org/T363346#9821255 (dcaro)
[13:50:52] Toolforge (Toolforge iteration 09): [envvars-api] Prefix all endpoints with `/tool/` - https://phabricator.wikimedia.org/T363809#9821253 (dcaro)
[13:51:32] cloud-services-team, Toolforge (Toolforge iteration 09): [api-gateway] add alert for uptime - https://phabricator.wikimedia.org/T348633#9821257 (dcaro)
[13:53:14] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9821300 (Jclark-ctr) @Andrew @dcaro once we find out racking information we will be able to rack and image these fairly quickly these have arrived
[14:18:57] PAWS: move prometheus later in deploy - https://phabricator.wikimedia.org/T365590#9821504 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/414
[14:19:00] vivian-rook opened https://github.com/toolforge/paws/pull/414
[14:20:48] PAWS: move prometheus later in deploy - https://phabricator.wikimedia.org/T365590#9821530
(github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/414
[14:20:57] PAWS: move prometheus later in deploy - https://phabricator.wikimedia.org/T365590#9821531 (rook) Open→Resolved
[14:20:59] vivian-rook closed https://github.com/toolforge/paws/pull/414
[14:46:57] FIRING: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:44:51] Cloud-VPS (Debian Buster Deprecation), The-Wikipedia-Library, Moderator-Tools-Team (Kanban): Replace deprecated Buster VMs in Cloud VPS - https://phabricator.wikimedia.org/T364399#9822128 (Kgraessle) a:Kgraessle→None
[16:02:21] cloud-services-team, Toolforge: Intermittent redis connection timeouts in Toolforge - https://phabricator.wikimedia.org/T318479#9822205 (Soda) My solution with Redis+Celery is to add a health check script that pings my worker(s), if the worker does not respond, Kubernetes kills the trainwreck and restart...
[16:04:03] cloud-services-team, Toolforge: Intermittent redis connection timeouts in Toolforge - https://phabricator.wikimedia.org/T318479#9822215 (Soda) https://gitlab.wikimedia.org/tooloforge-repos/matchandsplit uses the configuration and appears to be working mostly fine.
[16:07:41] (approved) bd808: run tests on wmcs runners [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/45 (https://phabricator.wikimedia.org/T362401) (owner: jelto)
[16:07:58] (merge) bd808: run tests on wmcs runners [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/45 (https://phabricator.wikimedia.org/T362401) (owner: jelto)
[16:27:32] PAWS: update openrefine - https://phabricator.wikimedia.org/T363732#9822340 (rook)
[16:28:02] PAWS: New upstream release for OpenRefine - https://phabricator.wikimedia.org/T365539#9822342 (rook) →Duplicate dup:T363732
[16:28:03] PAWS: update openrefine - https://phabricator.wikimedia.org/T363732#9822344 (rook)
[16:36:37] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-9
[16:36:52] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-9
[16:40:44] Toolforge (Toolforge iteration 10), Patch-For-Review, Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417#9822433 (dcaro)
[16:40:53] Toolforge (Toolforge iteration 10): [toolforge] simplify calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377#9822438 (dcaro)
[16:41:22] Toolforge (Toolforge iteration 10), Upstream: [builds-builder] golang based images get infinite nested loops for procfile entries - https://phabricator.wikimedia.org/T363417#9822436 (dcaro)
[16:41:31] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 10), Goal, Patch-For-Review: [infra] Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664#9822442 (dcaro)
[16:42:12] Toolforge (Toolforge iteration 10), Upstream: [builds-builder,jobs-api,upstream] Calling nontrivial Procfile
commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016#9822440 (dcaro)
[16:43:02] Toolforge (Toolforge iteration 10): I can't connect to Toolforge DB replicas from my PC using MySQL Workbench - https://phabricator.wikimedia.org/T360839#9822446 (dcaro)
[16:43:05] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 10), Patch-For-Review: [toolforge] Redis refusing connections - https://phabricator.wikimedia.org/T363709#9822431 (dcaro)
[16:43:29] cloud-services-team, Toolforge (Toolforge iteration 10), Patch-For-Review: Toolforge: Replace all bastion with grid-less bookworm based bastion hosts - https://phabricator.wikimedia.org/T314665#9822444 (dcaro)
[16:43:33] Toolforge (Toolforge iteration 10), Patch-For-Review: [jobs-api,jobs-cli] Support services in jobs - https://phabricator.wikimedia.org/T348758#9822455 (dcaro)
[16:44:05] Toolforge (Toolforge iteration 10): [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14) - https://phabricator.wikimedia.org/T364822#9822457 (dcaro)
[16:44:36] cloud-services-team, Toolforge (Toolforge iteration 10): [maintain-kubeusers,infra,k8s]: introduce some logic to backfill maintain-kubeuser resources (like per-tool kyverno policies) - https://phabricator.wikimedia.org/T364312#9822451 (dcaro)
[16:44:37] Toolforge (Toolforge iteration 10): [builds-api] Prefix all endpoints with `/tool/` - https://phabricator.wikimedia.org/T363808#9822453 (dcaro)
[16:44:44] Toolforge (Toolforge iteration 10), Patch-For-Review: [builds-api,jobs-api,envvars-api,api-gateway] Figure out and document how to do non-backwards compatible changes - https://phabricator.wikimedia.org/T356974#9822459 (dcaro)
[16:44:51] Toolforge (Toolforge iteration 10), Patch-For-Review: [jobs-api] Split the API, business, and k8s models - https://phabricator.wikimedia.org/T359808#9822464 (dcaro)
[16:45:47]
Toolforge (Toolforge iteration 10): [builds-api, envvars-api] add oapi-codegen installation to makefile - https://phabricator.wikimedia.org/T362290#9822470 (dcaro)
[16:45:50] Toolforge (Toolforge iteration 10): [jobs-api,builds-api,envvars-api] consolidate api paths - https://phabricator.wikimedia.org/T365014#9822472 (dcaro)
[16:45:52] Toolforge (Toolforge iteration 10): [envvars-api] Prefix all endpoints with `/tool/` - https://phabricator.wikimedia.org/T363809#9822473 (dcaro)
[16:45:53] Toolforge (Toolforge iteration 10): [toolforge-cli,jobs-cli,builds-cli,envvars-cli] Explore OpenAPI SDK tooling for client consolidation - https://phabricator.wikimedia.org/T356261#9822468 (dcaro)
[16:45:55] Toolforge (Toolforge iteration 10), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#9822462 (dcaro)
[16:45:57] Toolforge (Toolforge iteration 10): [jobs-api] Prefix all endpoints with `/tool/` - https://phabricator.wikimedia.org/T363346#9822474 (dcaro)
[16:45:59] cloud-services-team, Toolforge (Toolforge iteration 10): [api-gateway] add alert for uptime - https://phabricator.wikimedia.org/T348633#9822475 (dcaro)
[16:46:03] cloud-services-team, Toolforge (Toolforge iteration 10): toolforge: Refresh certs that are not controlled by kubeadm (mid 2024 edition) - https://phabricator.wikimedia.org/T309782#9822476 (dcaro)
[16:46:12] Toolforge (Toolforge iteration 10), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [maintain-harbor,docs] Document current setup and admin procedures - https://phabricator.wikimedia.org/T329176#9822480 (dcaro)
[16:46:16] Toolforge (Toolforge iteration 10), Epic: [jobs-cli,builds-cli,toolforge-cli,webservice] Consolidate the Toolforge CLIs - https://phabricator.wikimedia.org/T356262#9822479 (dcaro)
[16:46:20] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge
iteration 10): [toolforge] webservice logs crashes with some unicode chars - https://phabricator.wikimedia.org/T364609#9822481 (dcaro)
[16:46:24] Toolforge (Toolforge iteration 10): [builds-cli,builds-api] `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701#9822482 (dcaro)
[16:46:28] Toolforge (Toolforge iteration 10): [toolforge] Investigate authentication - https://phabricator.wikimedia.org/T363983#9822483 (dcaro)
[16:46:36] cloud-services-team, Toolforge (Toolforge iteration 10), Patch-For-Review: toolforge: review pod templates for PSP replacement - https://phabricator.wikimedia.org/T362050#9822466 (dcaro)
[16:46:40] Toolforge (Toolforge iteration 10): [maintain-kubeusers] Increment default services quota - https://phabricator.wikimedia.org/T362520#9822485 (dcaro)
[16:46:44] Toolforge (Toolforge iteration 10): [builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972#9822486 (dcaro)
[16:46:48] cloud-services-team, Toolforge (Toolforge iteration 10): lima-kilo: container image caching - https://phabricator.wikimedia.org/T362967#9822484 (dcaro)
[16:46:52] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 10): [docs] Create a tutorial on how to deploy a Node.js app using Build Service - https://phabricator.wikimedia.org/T353313#9822487 (dcaro)
[16:46:56] Toolforge (Toolforge iteration 10), Documentation: [harbor,docs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092#9822488 (dcaro)
[16:47:53] Toolforge (Toolforge iteration 10): [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14) - https://phabricator.wikimedia.org/T364822#9822516 (dcaro) Just restarted the node and took it back into the pool, will try to debug more the next time it happens.
[16:48:01] Toolforge (Toolforge iteration 10), Patch-For-Review: [jobs-api] Split the API, business, and k8s models - https://phabricator.wikimedia.org/T359808#9822518 (dcaro) Open→Resolved
[16:48:23] Toolforge (Toolforge iteration 10), Patch-For-Review: [jobs-api] Split the API, business, and k8s models - https://phabricator.wikimedia.org/T359808#9822524 (dcaro) Resolved→In progress
[16:48:35] Toolforge (Toolforge iteration 10): [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14) - https://phabricator.wikimedia.org/T364822#9822527 (dcaro) In progress→Resolved
[16:55:37] Toolforge: Toolforge Aptfile not producing working copy of `ffmpeg` - https://phabricator.wikimedia.org/T365633 (derenrich) NEW
[17:54:48] FIRING: PuppetFailure: Puppet has failed on cloudbackup2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:54:59] cloud-services-team: PuppetFailure Puppet failure on cloudbackup2003:9100 - https://phabricator.wikimedia.org/T365638 (phaultfinder) NEW
[17:56:45] FIRING: WidespreadPuppetFailure: Puppet has failed on wmcs cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wmcs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:59:48] FIRING: [3x] PuppetFailure: Puppet has failed on cloudbackup2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:59:56] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T365640 (phaultfinder) NEW
[18:04:48] FIRING: [4x] PuppetFailure: Puppet has failed on cloudbackup1002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed -
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:04:58] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T365640#9822897 (phaultfinder)
[18:09:48] FIRING: [6x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:09:57] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T365640#9822909 (phaultfinder)
[18:14:48] FIRING: [7x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:14:55] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T365640#9822918 (phaultfinder)
[18:24:48] FIRING: [8x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:24:58] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T365640#9822929 (phaultfinder)
[18:34:19] Cloud-VPS (Debian Buster Deprecation), The-Wikipedia-Library, Moderator-Tools-Team (Kanban): Replace deprecated Buster VMs in Cloud VPS - https://phabricator.wikimedia.org/T364399#9822954 (Kgraessle)
[18:46:57] FIRING: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[18:47:29] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T365640#9822974 (taavi) This should be
fixed with https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/4b77207a3c759e4d42b3e8d635d47a3ff7302bdb (cc @dcaro)
[18:49:48] FIRING: [8x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:50:57] (open) dancy: README.md: Clarify what command this repo implements [repos/cloud/toolforge/builds-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/70
[18:50:58] (update) dancy: README.md: Clarify what command this repo implements [repos/cloud/toolforge/builds-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/70
[18:54:48] FIRING: [8x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:55:13] (update) raymond-ndibe: [jobs-api] support services in jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/71 (https://phabricator.wikimedia.org/T348758)
[19:04:48] FIRING: [8x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:05:00] (open) dancy: webservice logs -f: Don't spam user w/ stack trace on control-C [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/39
[19:05:01] (update) dancy: webservice logs -f: Don't spam user w/ stack trace on control-C [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/39
[19:05:54]
(update) dancy: webservice logs -f: Don't spam user w/ stack trace on control-C [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/39
[19:05:57] (update) dancy: webservice logs -f: Don't spam user w/ stack trace on control-C [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/39
[19:06:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed on wmcs cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wmcs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:07:11] Cloud-VPS (Debian Buster Deprecation), The-Wikipedia-Library, Moderator-Tools-Team (Kanban): Replace deprecated Buster VMs in Cloud VPS - https://phabricator.wikimedia.org/T364399#9823012 (Scardenasmolinar) Open→In progress
[19:09:32] (update) dancy: cli: webservice logs -f: Don't spam user w/ stack trace on control-C [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/39
[19:09:35] (update) dancy: cli: webservice logs -f: Don't spam user w/ stack trace on control-C [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/39
[19:09:48] FIRING: [6x] PuppetFailure: Puppet has failed on cloudbackup1002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:11:55] (update) dancy: cli: webservice logs -f: Don't spam user w/ stack trace on control-C [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/39 (https://phabricator.wikimedia.org/T361437)
[19:11:57]
(update) dancy: cli: webservice logs -f: Don't spam user w/ stack trace on control-C [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/39 (https://phabricator.wikimedia.org/T361437)
[19:12:41] (update) dancy: cli: webservice logs -f: Don't spam user w/ stack trace on control-C [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/39 (https://phabricator.wikimedia.org/T361437)
[19:14:48] RESOLVED: [5x] PuppetFailure: Puppet has failed on cloudbackup1002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:20:16] (open) raymond-ndibe: [maintain-kubeusers] increment default services quota [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/25 (https://phabricator.wikimedia.org/T362520)
[19:20:17] Cloud-VPS (Debian Buster Deprecation), The-Wikipedia-Library, Moderator-Tools-Team (Kanban): Replace deprecated Buster VMs in Cloud VPS - https://phabricator.wikimedia.org/T364399#9823032 (Kgraessle) a:Kgraessle
[19:24:53] (update) raymond-ndibe: [maintain-kubeusers] increment default services quota [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/25 (https://phabricator.wikimedia.org/T362520)
[19:36:38] Cloud-VPS (Quota-requests), Wikispore: Floating IP for Wikispore - https://phabricator.wikimedia.org/T365641 (Tgr) NEW
[20:39:42] (update) raymond-ndibe: [jobs-api] add messages to all responses [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/85 (https://phabricator.wikimedia.org/T356974)
[20:39:50] Cloud-VPS (Debian Buster
Deprecation), The-Wikipedia-Library, Moderator-Tools-Team (Kanban): Replace deprecated Buster VMs in Cloud VPS - https://phabricator.wikimedia.org/T364399#9823353 (Kgraessle)
[20:40:28] (update) raymond-ndibe: [jobs-api] add messages to all responses [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/85 (https://phabricator.wikimedia.org/T356974)
[20:40:32] (update) raymond-ndibe: [jobs-api] add messages to all responses [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/85 (https://phabricator.wikimedia.org/T356974)
[20:45:43] (update) raymond-ndibe: [jobs-api] add messages to all responses [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/85 (https://phabricator.wikimedia.org/T356974)
[20:47:53] (update) raymond-ndibe: [jobs-api] add messages to all responses [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/85 (https://phabricator.wikimedia.org/T356974)
[20:55:43] Tools: bldrwnsch update is broken – Unknown column 'pagelinks.pl_title' - https://phabricator.wikimedia.org/T365497#9823375 (JJMC89)
[22:09:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[22:14:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[22:46:57] FIRING: [3x] CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[23:09:56] (update)
raymond-ndibe: [jobs-api] add messages to all responses [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/85 (https://phabricator.wikimedia.org/T356974)