[00:06:01] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:09:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:43:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[01:53:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[01:53:49] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:43:41] (CloudVPSDesignateLeaks) firing: (2) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:53:41] (CloudVPSDesignateLeaks) firing: (2) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:58:42] (CloudVPSDesignateLeaks) resolved: (2) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:09:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:12:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:22:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:43:41] (CloudVPSDesignateLeaks) firing: (2) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:53:41] (CloudVPSDesignateLeaks) resolved: (2) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:06:01] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[05:29:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:49:56] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:53:49] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:09:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[08:06:01] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[08:44:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:59:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:09:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:53:49] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:15:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:25:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:10:34] (DiskSpace) firing: Disk space cloudbackup1004:9100:/ 5.691% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:30:34] (DiskSpace) resolved: Disk space cloudbackup1004:9100:/ 5.917% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:35:29] !log dcaro@urcuchillay codesearch START - Cookbook wmcs.openstack.cloudvirt.vm_console
[11:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Codesearch/SAL
[11:40:15] !log dcaro@urcuchillay codesearch END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[11:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Codesearch/SAL
[11:40:58] !log dcaro@urcuchillay codesearch START - Cookbook wmcs.openstack.cloudvirt.vm_console
[11:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Codesearch/SAL
[11:51:16] !log dcaro@urcuchillay codesearch END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[11:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Codesearch/SAL
[11:58:56] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:03:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:06:01] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[12:09:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[12:46:56] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:51:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:32:56] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:37:56] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:42:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:53:49] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:10:56] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:15:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:05:56] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:09:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[15:10:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:29:26] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:34:26] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:06:56] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:11:56] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:13:19] (PS10) Arturo Borrero Gonzalez: kubernetes: refactor static pod restart logic [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1006529 (https://phabricator.wikimedia.org/T358476)
[16:15:59] (CR) David Caro: kubernetes: refactor static pod restart logic (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1006529 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:16:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:17:50] (CR) Arturo Borrero Gonzalez: kubernetes: refactor static pod restart logic (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1006529 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:18:13] (CR) Arturo Borrero Gonzalez: kubernetes: refactor static pod restart logic (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1006529 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:21:11] (CR) CI reject: [V: -1] kubernetes: refactor static pod restart logic [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1006529 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:21:53] (PS11) Majavah: kubernetes: refactor static pod restart logic [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1006529 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:21:55] (PS1) Majavah: kubernetes: Add missing overload for `missing_ok: bool` [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1007942
[16:25:43] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM." [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1007942 (owner: Majavah)
[16:25:46] (CR) Majavah: [C: +2] kubernetes: Add missing overload for `missing_ok: bool` [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1007942 (owner: Majavah)
[16:27:20] (CR) Majavah: "One thing inline, otherwise seems fine." [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1006529 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:29:39] (Merged) jenkins-bot: kubernetes: Add missing overload for `missing_ok: bool` [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1007942 (owner: Majavah)
[16:31:14] (PS12) Arturo Borrero Gonzalez: kubernetes: refactor static pod restart logic [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1006529 (https://phabricator.wikimedia.org/T358476)
[16:31:16] (PS5) Arturo Borrero Gonzalez: toolforge: add restart-static-pods cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1007604 (https://phabricator.wikimedia.org/T358476)
[16:31:28] (CR) Arturo Borrero Gonzalez: kubernetes: refactor static pod restart logic (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1006529 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:32:23] (CR) Majavah: [C: +1] "Thanks!" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1006529 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:32:35] (PS6) Arturo Borrero Gonzalez: toolforge: add restart-static-pods cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1007604 (https://phabricator.wikimedia.org/T358476)
[16:33:57] (CR) Arturo Borrero Gonzalez: toolforge: add restart-static-pods cookbook (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1007604 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:34:48] (PS7) Arturo Borrero Gonzalez: toolforge: add restart-static-pods cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1007604 (https://phabricator.wikimedia.org/T358476)
[16:35:12] (CR) Majavah: [C: +1] toolforge: add restart-static-pods cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1007604 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:38:56] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:39:18] (CR) Arturo Borrero Gonzalez: [C: +2] kubernetes: refactor static pod restart logic [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1006529 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:39:26] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: add restart-static-pods cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1007604 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:42:30] (Merged) jenkins-bot: kubernetes: refactor static pod restart logic [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1006529 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:42:54] (Merged) jenkins-bot: toolforge: add restart-static-pods cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1007604 (https://phabricator.wikimedia.org/T358476) (owner: Arturo Borrero Gonzalez)
[16:43:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:57:26] !log fnegri@cloudcumin1001 loggerdiscordbot START - Cookbook wmcs.vps.create_project for project loggerdiscordbot in eqiad1 (T358337)
[16:57:27] fnegri@cloudcumin1001: Unknown project "loggerdiscordbot"
[16:57:28] T358337: Request creation of logger-discord-bot VPS project - https://phabricator.wikimedia.org/T358337
[16:58:03] !log fnegri@cloudcumin1001 loggerdiscordbot END (PASS) - Cookbook wmcs.vps.create_project (exit_code=0) for project loggerdiscordbot in eqiad1 (T358337)
[16:58:03] fnegri@cloudcumin1001: Unknown project "loggerdiscordbot"
[17:03:40] !log fnegri@cloudcumin1001 loggerdiscordbot START - Cookbook wmcs.vps.add_user_to_project for user 'dbeef' in role 'reader' (T358337)
[17:03:44] T358337: Request creation of logger-discord-bot VPS project - https://phabricator.wikimedia.org/T358337
[17:04:20] !log fnegri@cloudcumin1001 loggerdiscordbot END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'dbeef' in role 'reader' (T358337)
[17:11:13] !log fnegri@cloudcumin1001 mdwikioffline START - Cookbook wmcs.vps.create_project for project mdwikioffline in eqiad1 (T358023)
[17:11:14] fnegri@cloudcumin1001: Unknown project "mdwikioffline"
[17:11:14] T358023: Request creation of mdwiki-offline VPS project - https://phabricator.wikimedia.org/T358023
[17:16:50] !log fnegri@cloudcumin1001 mdwikioffline END (PASS) - Cookbook wmcs.vps.create_project (exit_code=0) for project mdwikioffline in eqiad1 (T358023)
[17:16:54] T358023: Request creation of mdwiki-offline VPS project - https://phabricator.wikimedia.org/T358023
[17:18:26] !log fnegri@cloudcumin1001 mdwikioffline START - Cookbook wmcs.vps.add_user_to_project for user 'harej' in role 'reader' (T358023)
[17:18:30] !log fnegri@cloudcumin1001 mdwikioffline END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'harej' in role 'reader' (T358023)
[17:18:48] !log fnegri@cloudcumin1001 mdwikioffline START - Cookbook wmcs.vps.add_user_to_project for user 'harej' in role 'member' (T358023)
[17:18:54] !log fnegri@cloudcumin1001 mdwikioffline END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'harej' in role 'member' (T358023)
[17:19:01] !log fnegri@cloudcumin1001 mdwikioffline START - Cookbook wmcs.vps.add_user_to_project for user 'timmoody' in role 'member' (T358023)
[17:19:07] !log fnegri@cloudcumin1001 mdwikioffline END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'timmoody' in role 'member' (T358023)
[17:19:14] !log fnegri@cloudcumin1001 loggerdiscordbot START - Cookbook wmcs.vps.add_user_to_project for user 'dbeef' in role 'member' (T358337)
[17:19:17] T358337: Request creation of logger-discord-bot VPS project - https://phabricator.wikimedia.org/T358337
[17:19:19] !log fnegri@cloudcumin1001 loggerdiscordbot END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'dbeef' in role 'member' (T358337)
[17:21:57] cloud-services-team, Wikimedia-Medicine, Project-requests: Request creation of mdwiki-offline VPS project - https://phabricator.wikimedia.org/T358023#9591318 (fnegri) Open→Resolved a:fnegri Dashes in the project name are [discouraged](https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/A...
[17:42:56] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:43:20] Wikibugs, User-bd808: GitLab CI tests fail for MRs from forks because of missing secrets - https://phabricator.wikimedia.org/T358775#9591452 (valhallasw) In theory we should be able to have the best of both worlds using https://docs.gitlab.com/ee/ci/pipelines/merge_request_pipelines.html#run-pipelines-in...
[17:52:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:53:49] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:09:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[18:13:23] Grid-Engine-to-K8s-Migration, User-dcaro: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905#9591582 (Dvorapa) Not sure what exactly you want me to do. If I start webservice with the probe, the /healthz endpoint is created automatically somehow...
[18:21:37] Grid-Engine-to-K8s-Migration, User-dcaro: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905#9591609 (dcaro) Is the code available anywhere for us to inspect? How did you implement the `/healthz` endpoint in your code? If you run it locally,...
[18:22:33] Grid-Engine-to-K8s-Migration, User-dcaro: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905#9591615 (Dvorapa) Also, shouldn't /healthz endpoint be generated automatically using lighttpd, same as mod_status? Or maybe is it a good strategy to us...
[18:24:37] Grid-Engine-to-K8s-Migration, User-dcaro: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905#9591628 (dcaro) >>! In T356905#9591615, @Dvorapa wrote: > Also, shouldn't /healthz endpoint be generated automatically by lighttpd, same as mod_status?...
[18:26:58] Grid-Engine-to-K8s-Migration, User-dcaro: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905#9591649 (dcaro) You can also just create the `healthz` file in the `public_html` to generate a silly endpoint (that would check that NFS is working btw...
[18:27:24] Wikibugs, User-bd808: bd808's big pile of refactoring ideas - https://phabricator.wikimedia.org/T357851#9591654 (bd808)
[18:39:07] PAWS: Add wikibase-cli to paws - https://phabricator.wikimedia.org/T358649#9591744 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/381
[18:39:18] vivian-rook opened https://github.com/toolforge/paws/pull/381
[18:44:29] Toolforge-standards-committee: Adoption request for Muninnbot - https://phabricator.wikimedia.org/T358897 (Frostly)
[18:45:10] Toolforge-standards-committee: Adoption request for Muninnbot - https://phabricator.wikimedia.org/T358897#9591796 (Frostly) Open→Stalled will be unstalled on March 15
[19:02:11] (PS1) Majavah: kubernetes: Cleanup namespace handling [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1007955
[19:07:56] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:12:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:15:07] Grid-Engine-to-K8s-Migration, User-dcaro: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905#9591943 (Dvorapa) Sorry for the confusion, of course I've had a typo in my code. That's why the /healthz endpoint was working just for several seconds....
[19:16:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:25:22] Grid-Engine-to-K8s-Migration, User-dcaro: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905#9591985 (Dvorapa) `If not passed, a simple TCP check will be used instead.` Does this mean the webservice restarts itself even if no --health-check-pa...
[19:26:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:33:14] Grid-Engine-to-K8s-Migration, User-dcaro: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905#9591993 (dcaro) With the health check path, it is already restarting when the health path returns !=200, so yes, that feature is already there :) For...
[19:46:47] Grid-Engine-to-K8s-Migration, User-dcaro: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905#9592014 (Dvorapa) I see, thank you for the explanation. So if the /healthz endpoint is set correctly and probe is pointed to that endpoint using the pa...
[20:26:56] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:36:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:54:58] Wikibugs, User-bd808: GitLab CI tests fail for MRs from forks because of missing secrets - https://phabricator.wikimedia.org/T358775#9592144 (bd808) >>! In T358775#9591452, @valhallasw wrote: > In theory we should be able to have the best of both worlds using https://docs.gitlab.com/ee/ci/pipelines/merge...
[20:55:09] (CephSlowOps) firing: Ceph cluster in eqiad has 354 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[20:55:13] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T358907 (phaultfinder)
[20:55:51] (ProbeDown) firing: (3) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:59:09] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[21:00:09] (CephSlowOps) resolved: Ceph cluster in eqiad has 354 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[21:00:51] (ProbeDown) resolved: (3) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[21:06:56] (SystemdUnitDown) firing: The service unit ceph-osd@132.service is in failed status on host cloudcephosd1017. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1017 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:13:52] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api
[21:14:03] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api
[21:14:10] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[21:14:18] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api
[21:14:28] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api
[21:14:49] (TfInfraTestApplyFailed) firing: Terraform failed
to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [21:15:53] Toolforge: Alert when toolforge-deploy changes are not deployed - https://phabricator.wikimedia.org/T358908 (taavi) [21:17:57] Toolforge: Alert when admin managed pods are having issues - https://phabricator.wikimedia.org/T358909 (taavi) [21:27:17] (PS1) Jforrester: Add browserslist-config-wikimedia [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/1007978 [21:27:56] (CR) CI reject: [V: -1] Add browserslist-config-wikimedia [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/1007978 (owner: Jforrester) [21:29:00] (PS2) Jforrester: Add browserslist-config-wikimedia [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/1007978 [21:35:56] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:40:56] (ProbeDown) resolved: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:53:49] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:32:30] (OpenstackAPIResponse) resolved: Openstack API average response
time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [22:44:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:59:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [23:01:56] (SystemdUnitDown) firing: The systemd unit ceph-osd@132.service on node cloudcephosd1017 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1017 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:02:53] cloud-services-team: SystemdUnitDown Unit ceph-osd@132.service on node cloudcephosd1017 has been down for long. - https://phabricator.wikimedia.org/T358925 (phaultfinder) [23:04:56] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [23:09:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown