[00:00:04] (PS3) Ssingh: cp402[2468], cp403[0246]: decommission hosts for ulsfo hardware refresh [puppet] - https://gerrit.wikimedia.org/r/849141 (https://phabricator.wikimedia.org/T317244)
[00:02:30] (CR) Ssingh: [C: +2] cp402[2468], cp403[0246]: decommission hosts for ulsfo hardware refresh [puppet] - https://gerrit.wikimedia.org/r/849141 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh)
[00:05:41] SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (ssingh)
[00:06:16] RECOVERY - PyBal IPVS diff check on lvs4006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[00:07:50] RECOVERY - PyBal IPVS diff check on lvs4007 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[00:08:52] RECOVERY - PyBal IPVS diff check on lvs4005 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[00:23:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:27:21] (ConfdResourceFailed) firing: (48) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:51:42] (PS1) Ssingh: cp403[35]: decommission hosts as part of ulsfo refresh [puppet] - https://gerrit.wikimedia.org/r/849200 (https://phabricator.wikimedia.org/T317244)
[01:01:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly -
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[01:15:31] (CR) Ssingh: [C: +2] cp403[35]: decommission hosts as part of ulsfo refresh [puppet] - https://gerrit.wikimedia.org/r/849200 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh)
[01:16:35] SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (ssingh)
[01:19:31] (PS1) Ssingh: cp4038: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/849202 (https://phabricator.wikimedia.org/T319067)
[01:22:06] (ConfdResourceFailed) firing: (48) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:25:40] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (RobH)
[01:28:12] (CR) Ssingh: [C: +2] cp4038: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/849202 (https://phabricator.wikimedia.org/T319067) (owner: Ssingh)
[01:29:45] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS buster
[01:33:28] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (RobH)
[01:34:56] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:38:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:41:45] (JobUnavailable) firing: (13) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:47:51] (PS1) Ssingh: cp4046: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/849207 (https://phabricator.wikimedia.org/T317244)
[01:50:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:50:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:51:45] (JobUnavailable) firing: (14) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:55:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:56:32] !log
sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage
[02:00:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage
[02:00:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:03] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:11:45] (JobUnavailable) firing: (10) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:47] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4038.ulsfo.wmnet with OS buster
[02:29:04] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp4038.ulsfo.wmnet with reason: failing Icinga check after commissioning; will debug tomorrow
[02:29:08] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp4038.ulsfo.wmnet with reason: failing Icinga check after commissioning; will debug tomorrow
[02:34:56] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:51:42] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200)
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:53:24] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:09:00] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:47:08] PROBLEM - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:23:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:28:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:48:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate -
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:04:48] (PS1) Giuseppe Lavagetto: httpd-fcgi: fix typo [docker-images/production-images] - https://gerrit.wikimedia.org/r/849376
[05:05:08] (CR) Giuseppe Lavagetto: [V: +2 C: +2] httpd-fcgi: fix typo [docker-images/production-images] - https://gerrit.wikimedia.org/r/849376 (owner: Giuseppe Lavagetto)
[05:08:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:11:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:17:26] PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:22:21] (ConfdResourceFailed) firing: (40) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:38:16] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate -
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:39:00] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[05:42:43] (PS1) Giuseppe Lavagetto: httpd-fcgi: further fixes to the logic [docker-images/production-images] - https://gerrit.wikimedia.org/r/849385
[05:42:45] (PS1) Giuseppe Lavagetto: httpd-fcgi: add more tests [docker-images/production-images] - https://gerrit.wikimedia.org/r/849386
[05:43:47] (CR) Giuseppe Lavagetto: [V: +2 C: +2] httpd-fcgi: further fixes to the logic [docker-images/production-images] - https://gerrit.wikimedia.org/r/849385 (owner: Giuseppe Lavagetto)
[05:44:13] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:45:02] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[05:49:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:54:22] (CR) Giuseppe Lavagetto: Add cookbook to restart pybal (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto)
[05:54:43] (PS8) Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] -
https://gerrit.wikimedia.org/r/848949
[05:54:52] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:55:48] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:59:50] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:00:13] (CR) Giuseppe Lavagetto: [C: +2] shellbox: add new env variables from the httpd-fcgi image [deployment-charts] - https://gerrit.wikimedia.org/r/848263 (https://phabricator.wikimedia.org/T301757) (owner: Giuseppe Lavagetto)
[06:00:52] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:02:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:05:10] (Merged) jenkins-bot: shellbox: add new env variables from the httpd-fcgi image [deployment-charts] - https://gerrit.wikimedia.org/r/848263 (https://phabricator.wikimedia.org/T301757) (owner: Giuseppe Lavagetto)
[06:07:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:08:26] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply
[06:10:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1029 es1030 es1031', diff saved to https://phabricator.wikimedia.org/P36369 and previous config saved to /var/cache/conftool/dbconfig/20221026-061044-root.json
[06:11:32] (PS1) Marostegui: es1029, es1030, es1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/849388
[06:12:00] (JobUnavailable) firing: (5) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:14:38] (CR) Marostegui: [C: +2] es1029, es1030, es1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/849388 (owner: Marostegui)
[06:18:16] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:18:29] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[06:19:24] (PS1) Urbanecm: [Growth] Enable structured mentor list everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/849419 (https://phabricator.wikimedia.org/T310905)
[06:20:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36370 and previous config saved to
/var/cache/conftool/dbconfig/20221026-062056-root.json
[06:21:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:21:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36371 and previous config saved to /var/cache/conftool/dbconfig/20221026-062103-root.json
[06:21:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36372 and previous config saved to /var/cache/conftool/dbconfig/20221026-062108-root.json
[06:21:10] (CR) Giuseppe Lavagetto: "This patch broke building of all images in production minus the spark ones.
I would be grateful if next time before merging such a change " [puppet] - https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[06:21:23] (PS1) Marostegui: Revert "es1029, es1030, es1031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/849153
[06:26:56] (PS1) Giuseppe Lavagetto: role::builder: re-add the default uid mappings for docker-pkg [puppet] - https://gerrit.wikimedia.org/r/849461
[06:28:30] (CR) Giuseppe Lavagetto: [C: +2] role::builder: re-add the default uid mappings for docker-pkg [puppet] - https://gerrit.wikimedia.org/r/849461 (owner: Giuseppe Lavagetto)
[06:31:07] (CR) Marostegui: [C: +2] Revert "es1029, es1030, es1031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/849153 (owner: Marostegui)
[06:33:17] <_joe_> !log build2001:~# docker-registryctl delete-tags docker-registry.discovery.wmnet/httpd-fcgi:2.4.38-7 (to fix the uid issues)
[06:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:35] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply
[06:35:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2029 es2030 es2031', diff saved to https://phabricator.wikimedia.org/P36373 and previous config saved to /var/cache/conftool/dbconfig/20221026-063524-root.json
[06:36:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36374 and previous config saved to /var/cache/conftool/dbconfig/20221026-063601-root.json
[06:36:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36375 and previous config saved to /var/cache/conftool/dbconfig/20221026-063608-root.json
[06:36:13] (PS1) Marostegui: es2029,es2030,es2031: Disable notifications [puppet] -
https://gerrit.wikimedia.org/r/849462
[06:36:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36376 and previous config saved to /var/cache/conftool/dbconfig/20221026-063613-root.json
[06:37:09] (CR) Marostegui: [C: +2] es2029,es2030,es2031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/849462 (owner: Marostegui)
[06:38:00] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[06:39:27] PROBLEM - Host es2030 is DOWN: PING CRITICAL - Packet loss = 100%
[06:39:37] mmm I downtimed it I think
[06:39:46] Ah no, I didn't
[06:40:55] RECOVERY - Host es2030 is UP: PING OK - Packet loss = 0%, RTA = 31.89 ms
[06:43:18] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 63949
[06:44:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 63949
[06:45:08] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 16509
[06:45:51] (PS1) Marostegui: Revert "es2029,es2030,es2031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/849160
[06:46:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:46:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36377 and previous config saved to /var/cache/conftool/dbconfig/20221026-064607-root.json
[06:46:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 1%: After upgrade', diff
saved to https://phabricator.wikimedia.org/P36378 and previous config saved to /var/cache/conftool/dbconfig/20221026-064614-root.json
[06:46:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36379 and previous config saved to /var/cache/conftool/dbconfig/20221026-064622-root.json
[06:47:01] (CR) Marostegui: [C: +2] Revert "es2029,es2030,es2031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/849160 (owner: Marostegui)
[06:48:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 16509
[06:51:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36380 and previous config saved to /var/cache/conftool/dbconfig/20221026-065106-root.json
[06:51:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36381 and previous config saved to /var/cache/conftool/dbconfig/20221026-065113-root.json
[06:51:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36382 and previous config saved to /var/cache/conftool/dbconfig/20221026-065118-root.json
[06:52:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:00:04] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221026T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:01:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36383 and previous config saved to /var/cache/conftool/dbconfig/20221026-070112-root.json
[07:01:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36384 and previous config saved to /var/cache/conftool/dbconfig/20221026-070119-root.json
[07:01:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36385 and previous config saved to /var/cache/conftool/dbconfig/20221026-070127-root.json
[07:04:29] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 25091
[07:04:36] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'email' for AS: 25091
[07:04:47] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 25091
[07:05:27] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 25091
[07:06:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36386 and previous config saved to /var/cache/conftool/dbconfig/20221026-070611-root.json
[07:06:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36387 and previous config saved to /var/cache/conftool/dbconfig/20221026-070618-root.json
[07:06:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36388 and previous config saved to
/var/cache/conftool/dbconfig/20221026-070623-root.json
[07:08:49] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[07:09:18] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[07:09:38] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[07:10:07] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[07:10:27] (CR) Elukey: "Left some comments :)" [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[07:16:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36389 and previous config saved to /var/cache/conftool/dbconfig/20221026-071617-root.json
[07:16:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36390 and previous config saved to /var/cache/conftool/dbconfig/20221026-071624-root.json
[07:16:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36391 and previous config saved to /var/cache/conftool/dbconfig/20221026-071632-root.json
[07:21:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36392 and previous config saved to /var/cache/conftool/dbconfig/20221026-072116-root.json
[07:21:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36393 and previous config saved to /var/cache/conftool/dbconfig/20221026-072123-root.json
[07:21:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36394 and previous config saved to
/var/cache/conftool/dbconfig/20221026-072128-root.json
[07:31:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36395 and previous config saved to /var/cache/conftool/dbconfig/20221026-073122-root.json
[07:31:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36396 and previous config saved to /var/cache/conftool/dbconfig/20221026-073129-root.json
[07:31:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36397 and previous config saved to /var/cache/conftool/dbconfig/20221026-073137-root.json
[07:32:09] (CR) JMeybohm: [C: -1] "Needs a bump to the chart version, apart from that it LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/849117 (owner: Hnowlan)
[07:36:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36398 and previous config saved to /var/cache/conftool/dbconfig/20221026-073621-root.json
[07:36:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36399 and previous config saved to /var/cache/conftool/dbconfig/20221026-073628-root.json
[07:36:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36400 and previous config saved to /var/cache/conftool/dbconfig/20221026-073633-root.json
[07:44:57] (CR) Ayounsi: [C: +1] "Is next step Dynamic load balancing?
😊" [homer/public] - https://gerrit.wikimedia.org/r/793428 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[07:46:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36401 and previous config saved to /var/cache/conftool/dbconfig/20221026-074627-root.json
[07:46:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36402 and previous config saved to /var/cache/conftool/dbconfig/20221026-074634-root.json
[07:46:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36403 and previous config saved to /var/cache/conftool/dbconfig/20221026-074642-root.json
[07:48:47] (PS1) Slyngshede: setup.py add missing attrs package. Require to build on MacOS [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/849464
[07:50:45] (CR) CI reject: [V: -1] setup.py add missing attrs package.
Require to build on MacOS [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/849464 (owner: Slyngshede)
[07:51:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36404 and previous config saved to /var/cache/conftool/dbconfig/20221026-075126-root.json
[07:51:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36405 and previous config saved to /var/cache/conftool/dbconfig/20221026-075133-root.json
[07:51:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36406 and previous config saved to /var/cache/conftool/dbconfig/20221026-075138-root.json
[07:52:51] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[07:53:04] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[07:53:52] (CR) Giuseppe Lavagetto: "hi, there is already a patch doing this, see https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/822734" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/849464 (owner: Slyngshede)
[07:54:39] <_joe_> slyngs: I am going to merge the other version of that patch
[07:54:43] <_joe_> sorry for not doing it sooner
[07:55:13] (CR) Ayounsi: [C: +1] Add section for PIC config of QFX5120-48Y port block speeds [homer/public] - https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney)
[07:55:28] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[07:55:55] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[07:56:18] !log oblivian@deploy1002 helmfile [eqiad] START
helmfile.d/services/shellbox-syntaxhighlight: apply [07:56:44] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [07:57:04] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [07:57:21] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [07:58:00] (03Abandoned) 10Slyngshede: setup.py add missing attrs package. Require to build on MacOS [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/849464 (owner: 10Slyngshede) [07:58:40] _joe_: No problem, fixed is fixed :-) [07:58:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add missing attrs dependency [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/822734 (owner: 10BryanDavis) [08:00:04] jnuche and hashar: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221026T0800). 
[08:00:09] !log upload python3.9 packages for buster (component python39) [08:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:19] oh [08:00:27] yes its for that task [08:00:29] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [08:00:31] jnuche: so interestingly I thought Dan was running the train this week :] [08:00:38] I mixed it up sorry [08:00:56] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [08:01:03] (03Merged) 10jenkins-bot: Add missing attrs dependency [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/822734 (owner: 10BryanDavis) [08:01:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36407 and previous config saved to /var/cache/conftool/dbconfig/20221026-080132-root.json [08:01:35] hashar: ah, that's no problem :) [08:01:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36408 and previous config saved to /var/cache/conftool/dbconfig/20221026-080139-root.json [08:01:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36409 and previous config saved to /var/cache/conftool/dbconfig/20221026-080147-root.json [08:02:08] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [08:02:31] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [08:03:28] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849467 (https://phabricator.wikimedia.org/T320512) [08:03:30] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849467 
(https://phabricator.wikimedia.org/T320512) (owner: 10TrainBranchBot) [08:04:12] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849467 (https://phabricator.wikimedia.org/T320512) (owner: 10TrainBranchBot) [08:06:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36410 and previous config saved to /var/cache/conftool/dbconfig/20221026-080631-root.json [08:06:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36411 and previous config saved to /var/cache/conftool/dbconfig/20221026-080638-root.json [08:06:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36412 and previous config saved to /var/cache/conftool/dbconfig/20221026-080643-root.json [08:08:28] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.7 refs T320512 [08:08:33] T320512: 1.40.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T320512 [08:10:44] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [08:11:00] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [08:11:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:11:45] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [08:12:12] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [08:12:14] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.7 refs T320512 (duration: 03m 46s) [08:12:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:12:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [08:14:56] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [08:15:23] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [08:16:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:16:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36413 and previous config saved to /var/cache/conftool/dbconfig/20221026-081637-root.json [08:16:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36414 and previous config saved to /var/cache/conftool/dbconfig/20221026-081644-root.json [08:16:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36415 and previous config saved to /var/cache/conftool/dbconfig/20221026-081652-root.json [08:17:53] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [08:18:04] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [08:18:21] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [08:18:50] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [08:19:44] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [08:19:56] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:20:16] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [08:23:24] jnuche: so from last week, I did try to file task and add associated Kibana filters for each of them but I might have missed some 
[08:23:40] but overall last week has been a quiet train [08:23:52] if you wanna pair on triaging I am around :-] [08:25:43] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1015.eqiad.wmnet with reason: Remove from cluster for eventual reimage [08:25:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1015.eqiad.wmnet with reason: Remove from cluster for eventual reimage [08:26:35] (03PS1) 10Giuseppe Lavagetto: Add known uid mappings to the configuration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/849472 [08:26:41] (03CR) 10Hashar: [C: 03+1] Provide current $PATH to the verify script [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/692995 (owner: 10Ppchelko) [08:26:50] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpd-fcgi: add more tests [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/849386 (owner: 10Giuseppe Lavagetto) [08:26:55] ACKNOWLEDGEMENT - Confd vcl based reload on cp4037 is CRITICAL: reload-vcl failed to run since 7h, 10 minutes. Valentin Gutierrez T317247 - The acknowledgement expires at: 2022-10-27 08:26:37. https://wikitech.wikimedia.org/wiki/Varnish [08:26:55] ACKNOWLEDGEMENT - Confd vcl based reload on cp4045 is CRITICAL: reload-vcl failed to run since 7h, 10 minutes. Valentin Gutierrez T317247 - The acknowledgement expires at: 2022-10-27 08:26:37. https://wikitech.wikimedia.org/wiki/Varnish [08:26:55] ACKNOWLEDGEMENT - Confd vcl based reload on cp4047 is CRITICAL: reload-vcl failed to run since 7h, 9 minutes. Valentin Gutierrez T317247 - The acknowledgement expires at: 2022-10-27 08:26:37. https://wikitech.wikimedia.org/wiki/Varnish [08:26:55] ACKNOWLEDGEMENT - Confd vcl based reload on cp4049 is CRITICAL: reload-vcl failed to run since 7h, 9 minutes. Valentin Gutierrez T317247 - The acknowledgement expires at: 2022-10-27 08:26:37. 
https://wikitech.wikimedia.org/wiki/Varnish [08:26:55] ACKNOWLEDGEMENT - PyBal backends health check on lvs4005 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp4035.ulsfo.wmnet are marked down but pooled: testlb_80: Servers cp4035.ulsfo.wmnet are marked down but pooled: testlb_443: Servers cp4036.ulsfo.wmnet, cp4035.ulsfo.wmnet, cp4032.ulsfo.wmnet are marked down but pooled: testlb6_80: Servers cp4035.ulsfo.wmnet are marked down but pooled: textlb_80: Servers cp4035.ulsfo.wmnet [08:26:55] ked down but pooled: textlb_443: Servers cp4036.ulsfo.wmnet, cp4035.ulsfo.wmnet, cp4032.ulsfo.wmnet are marked down but pooled: testlb6_443: Servers cp4036.ulsfo.wmnet, cp4035.ulsfo.wmnet, cp4032.ulsfo.wmnet are marked down but pooled: textlb6_443: Servers cp4036.ulsfo.wmnet, cp4035.ulsfo.wmnet, cp4032.ulsfo.wmnet are marked down but pooled Valentin Gutierrez T317247 - The acknowledgement expires at: 2022-10-27 08:26:37. https://wikitech. [08:26:56] a.org/wiki/PyBal [08:26:56] ACKNOWLEDGEMENT - PyBal backends health check on lvs4006 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_80: Servers cp4033.ulsfo.wmnet are marked down but pooled: uploadlb6_443: Servers cp4034.ulsfo.wmnet, cp4033.ulsfo.wmnet, cp4026.ulsfo.wmnet, cp4024.ulsfo.wmnet are marked down but pooled: uploadlb_80: Servers cp4033.ulsfo.wmnet are marked down but pooled: uploadlb_443: Servers cp4034.ulsfo.wmnet, cp4033.ulsfo.wmnet, cp4026.ulsfo.wm [08:26:57] 024.ulsfo.wmnet are marked down but pooled Valentin Gutierrez T317247 - The acknowledgement expires at: 2022-10-27 08:26:37. 
https://wikitech.wikimedia.org/wiki/PyBal [08:26:57] ACKNOWLEDGEMENT - PyBal backends health check on lvs4007 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp4035.ulsfo.wmnet are marked down but pooled: uploadlb_80: Servers cp4033.ulsfo.wmnet are marked down but pooled: testlb_80: Servers cp4035.ulsfo.wmnet are marked down but pooled: testlb_443: Servers cp4036.ulsfo.wmnet, cp4035.ulsfo.wmnet, cp4032.ulsfo.wmnet are marked down but pooled: textlb_80: Servers cp4035.ulsfo.wmne [08:26:58] rked down but pooled: testlb6_80: Servers cp4035.ulsfo.wmnet are marked down but pooled: uploadlb6_80: Servers cp4033.ulsfo.wmnet are marked down but pooled: uploadlb_443: Servers cp4034.ulsfo.wmnet, cp4033.ulsfo.wmnet, cp4026.ulsfo.wmnet, cp4024.ulsfo.wmnet are marked down but pooled: uploadlb6_443: Servers cp4034.ulsfo.wmnet, cp4033.ulsfo.wmnet, cp4026.ulsfo.wmnet, cp4024.ulsfo.wmnet are marked down but pooled: textlb_443: Servers cp403 [08:26:58] wmnet, cp4035.ulsfo.wmnet, cp4032.ulsfo.wmnet are marked down but pooled: testlb6_443: Servers cp4036.ulsfo.wmnet, cp4035.ulsfo.wmnet, cp4032.ulsfo.wmnet are marked down but pooled: textlb6_443 Valentin Gutierrez T317247 - The acknowledgement expires at: 2022-10-27 08:26:37. https://wikitech.wikimedia.org/wiki/PyBal [08:27:12] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add known uid mappings to the configuration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/849472 (owner: 10Giuseppe Lavagetto) [08:29:18] (03PS1) 10David Caro: global: replace labsproject by wmcs_project [puppet] - 10https://gerrit.wikimedia.org/r/849473 [08:29:48] hashar: thanks for the effort to clean up the board, things look quiet so far [08:30:01] amazing! 
[08:30:43] I logged a small bug in the special protected titles page, but that's about it so far [08:31:18] (03PS2) 10Giuseppe Lavagetto: shellbox: skip system logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/848264 (https://phabricator.wikimedia.org/T301757) [08:31:20] (03PS1) 10Giuseppe Lavagetto: shellbox: correct env variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/849474 [08:31:22] (03PS1) 10Giuseppe Lavagetto: shellbox: switch to ecs logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/849475 [08:31:31] (03CR) 10David Caro: global: replace labsproject by wmcs_project (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849473 (owner: 10David Caro) [08:31:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36416 and previous config saved to /var/cache/conftool/dbconfig/20221026-083142-root.json [08:31:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36417 and previous config saved to /var/cache/conftool/dbconfig/20221026-083149-root.json [08:31:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36418 and previous config saved to /var/cache/conftool/dbconfig/20221026-083157-root.json [08:35:29] (03CR) 10Filippo Giunchedi: "Partial PCC run https://puppet-compiler.wmflabs.org/pcc-worker1002/37722/" [puppet] - 10https://gerrit.wikimedia.org/r/849088 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [08:36:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] shellbox: skip system logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/848264 (https://phabricator.wikimedia.org/T301757) (owner: 10Giuseppe Lavagetto) [08:40:13] (03Merged) 10jenkins-bot: shellbox: skip system logs [deployment-charts] - 
10https://gerrit.wikimedia.org/r/848264 (https://phabricator.wikimedia.org/T301757) (owner: 10Giuseppe Lavagetto) [08:40:16] PROBLEM - MariaDB Replica IO: s3 on clouddb1013 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3313 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:40:34] PROBLEM - MariaDB Replica IO: s1 on clouddb1013 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3311 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:40:52] PROBLEM - MariaDB Replica Lag: s1 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 68553.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:41:04] PROBLEM - MariaDB Replica Lag: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 68554.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:41:18] (03PS5) 10Jbond: wmflib::ansi: add new ansi formatting function [puppet] - 10https://gerrit.wikimedia.org/r/842496 [08:41:20] (03PS6) 10Jbond: motd::script: update define to all interpreted strings [puppet] - 10https://gerrit.wikimedia.org/r/842497 (https://phabricator.wikimedia.org/T320696) [08:41:44] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 68605.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:41:54] PROBLEM - MariaDB Replica Lag: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 68604.98 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:41:56] PROBLEM - MariaDB Replica IO: s3 on clouddb1017 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3313 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:42:12] PROBLEM - MariaDB Replica IO: s1 on clouddb1017 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3311 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:42:44] '12 [08:42:46] uff [08:44:42] (03PS8) 10Jbond: P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) [08:46:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:46:29] (03CR) 10CI reject: [V: 04-1] P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [08:46:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:46:39] (03PS9) 10Jbond: P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) [08:47:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2115.codfw.wmnet with reason: Maintenance [08:47:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db2115.codfw.wmnet with reason: Maintenance [08:47:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2115 (T321312)', diff saved to https://phabricator.wikimedia.org/P36420 and previous config saved to /var/cache/conftool/dbconfig/20221026-084741-ladsgroup.json [08:48:23] (03CR) 10CI reject: [V: 04-1] P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [08:48:37] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Test adding wikifunctions.org in acmechief-test1001 [puppet] - 10https://gerrit.wikimedia.org/r/849111 (https://phabricator.wikimedia.org/T313227) (owner: 10Vgutierrez) [08:49:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P36421 and previous config saved to /var/cache/conftool/dbconfig/20221026-084922-ladsgroup.json [08:49:24] (03PS10) 10Jbond: P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) [08:49:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Not sure how long these changes will live (we're thinking of revisiting how we do all this stuff, see https://phabricator.wikimedia.org/T2" [deployment-charts] - 10https://gerrit.wikimedia.org/r/845680 (owner: 10Stef Dunlap) [08:49:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: Maint finished', diff saved to https://phabricator.wikimedia.org/P36422 and previous config saved to /var/cache/conftool/dbconfig/20221026-084954-ladsgroup.json [08:50:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [08:50:07] (03CR) 10CI reject: [V: 04-1] P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [08:50:09] (03CR) 10Jbond: [V: 03+1] 
"PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37755/console" [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [08:50:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [08:50:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T321312)', diff saved to https://phabricator.wikimedia.org/P36423 and previous config saved to /var/cache/conftool/dbconfig/20221026-085022-ladsgroup.json [08:52:17] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add known_uid_mapping support to the production-images for spark (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/844445 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [08:52:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1015.eqiad.wmnet with OS bullseye [08:52:42] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1015.eqiad.wmnet with OS bullseye [08:52:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115 (T321312)', diff saved to https://phabricator.wikimedia.org/P36424 and previous config saved to /var/cache/conftool/dbconfig/20221026-085257-ladsgroup.json [08:53:10] (03CR) 10Giuseppe Lavagetto: [C: 03+1] kubernetes mediawiki config: Cleanup nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844991 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [08:53:47] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] kubernetes mediawiki config: Cleanup nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844991 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [08:55:37] !log oblivian@deploy1002 helmfile [staging] START 
helmfile.d/services/shellbox-syntaxhighlight: apply [08:55:50] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [08:56:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T321312)', diff saved to https://phabricator.wikimedia.org/P36425 and previous config saved to /var/cache/conftool/dbconfig/20221026-085634-ladsgroup.json [08:57:00] (03PS1) 10David Caro: p::wmcs:nfs: Fix typo in the defaults specificatio [puppet] - 10https://gerrit.wikimedia.org/r/849483 [08:59:25] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37756/console" [puppet] - 10https://gerrit.wikimedia.org/r/849483 (owner: 10David Caro) [09:00:38] (03PS2) 10David Caro: p::wmcs:nfs: Fix typo in the defaults specification [puppet] - 10https://gerrit.wikimedia.org/r/849483 [09:00:52] (03PS1) 10Jbond: hieradata pcc: Update deployment-puppetmaster04 public key [puppet] - 10https://gerrit.wikimedia.org/r/849484 [09:01:09] (03PS3) 10David Caro: p::wmcs:nfs: Fix typo in the hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/849483 [09:03:04] (03CR) 10Jbond: [C: 03+2] hieradata pcc: Update deployment-puppetmaster04 public key [puppet] - 10https://gerrit.wikimedia.org/r/849484 (owner: 10Jbond) [09:04:33] (03PS1) 10Elukey: ml-services: update revscoring's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/849485 (https://phabricator.wikimedia.org/T320374) [09:04:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Maint finished', diff saved to https://phabricator.wikimedia.org/P36426 and previous config saved to /var/cache/conftool/dbconfig/20221026-090459-ladsgroup.json [09:05:44] (03PS1) 10Vgutierrez: acme-chief: Add wikifunctions.org to the unified cert [puppet] - 10https://gerrit.wikimedia.org/r/849486 (https://phabricator.wikimedia.org/T313227) [09:06:46] !log jmm@cumin2002 START - 
Cookbook sre.hosts.downtime for 2:00:00 on ganeti1015.eqiad.wmnet with reason: host reimage [09:07:00] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37758/console" [puppet] - 10https://gerrit.wikimedia.org/r/849473 (owner: 10David Caro) [09:07:49] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37759/console" [puppet] - 10https://gerrit.wikimedia.org/r/849486 (https://phabricator.wikimedia.org/T313227) (owner: 10Vgutierrez) [09:08:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115', diff saved to https://phabricator.wikimedia.org/P36427 and previous config saved to /var/cache/conftool/dbconfig/20221026-090803-ladsgroup.json [09:08:20] (03PS1) 10JMeybohm: Add role to kubemaster entry [puppet] - 10https://gerrit.wikimedia.org/r/849487 [09:10:09] (03PS1) 10Jbond: P:sretest: create a test to see if the escap code actully makes through [puppet] - 10https://gerrit.wikimedia.org/r/849488 [09:10:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1015.eqiad.wmnet with reason: host reimage [09:10:56] (03CR) 10Jbond: [C: 03+2] wmflib::ansi: add new ansi formatting function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842496 (owner: 10Jbond) [09:11:03] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:sretest: create a test to see if the escap code actully makes through [puppet] - 10https://gerrit.wikimedia.org/r/849488 (owner: 10Jbond) [09:11:05] (03CR) 10Filippo Giunchedi: [C: 03+1] Add role to kubemaster entry [puppet] - 10https://gerrit.wikimedia.org/r/849487 (owner: 10JMeybohm) [09:11:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P36428 and previous config saved to /var/cache/conftool/dbconfig/20221026-091141-ladsgroup.json 
[09:12:35] (03PS1) 10Jbond: Revert "P:sretest: create a test to see if the escap code actull..." [puppet] - 10https://gerrit.wikimedia.org/r/849165 [09:12:39] (03PS1) 10Jbond: Revert "wmflib::ansi: add new ansi formatting function" [puppet] - 10https://gerrit.wikimedia.org/r/849506 [09:12:41] (03PS2) 10Muehlenhoff: idp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842762 (https://phabricator.wikimedia.org/T308013) [09:12:51] (03CR) 10AikoChou: [C: 03+1] ml-services: update revscoring's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/849485 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey) [09:13:14] (03PS11) 10Jbond: P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) [09:13:25] (03CR) 10Elukey: [C: 03+2] ml-services: update revscoring's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/849485 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey) [09:15:32] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/841924 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [09:15:34] (03CR) 10Jbond: [C: 03+2] Revert "P:sretest: create a test to see if the escap code actull..." 
[puppet] - 10https://gerrit.wikimedia.org/r/849165 (owner: 10Jbond) [09:16:00] (03Abandoned) 10Jbond: Revert "wmflib::ansi: add new ansi formatting function" [puppet] - 10https://gerrit.wikimedia.org/r/849506 (owner: 10Jbond) [09:16:23] (03CR) 10Jbond: [C: 03+2] P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [09:16:40] (03CR) 10Muehlenhoff: [C: 03+2] idp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842762 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:17:21] (03PS1) 10Jbond: Revert "P:netbox::host: create a motd for the status" [puppet] - 10https://gerrit.wikimedia.org/r/849507 [09:17:29] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:netbox::host: create a motd for the status" [puppet] - 10https://gerrit.wikimedia.org/r/849507 (owner: 10Jbond) [09:17:39] jbond: I'll leave merging to you, can you sync my patch along? [09:17:50] !log add netbx yes will do thanks [09:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:38] (03PS1) 10Jbond: P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/849508 (https://phabricator.wikimedia.org/T320696) [09:18:52] moritzm: thanks done [09:19:19] ack, thx [09:19:58] (03PS2) 10Muehlenhoff: statistics : Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842763 (https://phabricator.wikimedia.org/T308013) [09:20:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Maint finished', diff saved to https://phabricator.wikimedia.org/P36429 and previous config saved to /var/cache/conftool/dbconfig/20221026-092004-ladsgroup.json [09:20:08] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[09:21:48] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] acme-chief: Add wikifunctions.org to the unified cert [puppet] - 10https://gerrit.wikimedia.org/r/849486 (https://phabricator.wikimedia.org/T313227) (owner: 10Vgutierrez) [09:22:21] (ConfdResourceFailed) firing: (40) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:22:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:23:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115', diff saved to https://phabricator.wikimedia.org/P36430 and previous config saved to /var/cache/conftool/dbconfig/20221026-092310-ladsgroup.json [09:25:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1015.eqiad.wmnet with OS bullseye [09:26:04] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1015.eqiad.wmnet with OS bullseye completed: - ganeti1015 (**PASS**) - Downtimed on... 
[09:26:27] (03PS1) 10Majavah: openstack: modernize puppetleaks script [puppet] - 10https://gerrit.wikimedia.org/r/849494 (https://phabricator.wikimedia.org/T274666) [09:26:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet [09:26:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P36431 and previous config saved to /var/cache/conftool/dbconfig/20221026-092647-ladsgroup.json [09:27:05] (03CR) 10CI reject: [V: 04-1] openstack: modernize puppetleaks script [puppet] - 10https://gerrit.wikimedia.org/r/849494 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [09:27:53] (03PS2) 10Majavah: openstack: modernize puppetleaks script [puppet] - 10https://gerrit.wikimedia.org/r/849494 (https://phabricator.wikimedia.org/T274666) [09:27:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:29:03] 10SRE, 10Infrastructure-Foundations: Create an initial IDM/LDAP image for tests and CI - https://phabricator.wikimedia.org/T320430 (10SLyngshede-WMF) a:03SLyngshede-WMF [09:29:04] (03PS1) 10Filippo Giunchedi: dns: generate HOST.mgmt records in all statuses [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/849495 (https://phabricator.wikimedia.org/T320721) [09:29:09] 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), and 2 others: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10Vgutierrez) acme-chief will deploy the unified cert shipping `wikif... 
[09:29:36] 10SRE, 10Infrastructure-Foundations: Initial Django project setup - https://phabricator.wikimedia.org/T319410 (10SLyngshede-WMF) a:03SLyngshede-WMF [09:29:47] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (10TheresNoTime) If someone could purge `https://upload.wikimedia.beta.wmflabs.org/wikipedia/en/thumb/1/13/Bert_Self-portrait2.... [09:31:30] (03PS1) 10Jbond: R:system::role: fix linting and docs [puppet] - 10https://gerrit.wikimedia.org/r/849496 [09:31:42] (03CR) 10Btullis: [C: 03+2] Open up the postrges service to the analytics vlans [puppet] - 10https://gerrit.wikimedia.org/r/849122 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [09:32:17] (03CR) 10Volans: [C: 03+1] "LGTM, let's carefully check the quite long diff that will be generated and possibly get a non-negative feedback from DCOps :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/849495 (https://phabricator.wikimedia.org/T320721) (owner: 10Filippo Giunchedi) [09:32:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37760/console" [puppet] - 10https://gerrit.wikimedia.org/r/849496 (owner: 10Jbond) [09:33:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:33:07] (03CR) 10Jbond: [V: 03+1 C: 03+2] R:system::role: fix linting and docs [puppet] - 10https://gerrit.wikimedia.org/r/849496 (owner: 10Jbond) [09:34:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet [09:35:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Maint finished', diff saved to https://phabricator.wikimedia.org/P36432 and previous config saved to 
/var/cache/conftool/dbconfig/20221026-093509-ladsgroup.json [09:35:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 8.048 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:35:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "This change is scary and could be potentially dangerous. There could be hidden references to the old variable in unexpected ways." [puppet] - 10https://gerrit.wikimedia.org/r/849473 (owner: 10David Caro) [09:38:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115 (T321312)', diff saved to https://phabricator.wikimedia.org/P36433 and previous config saved to /var/cache/conftool/dbconfig/20221026-093816-ladsgroup.json [09:38:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: Maintenance [09:38:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: Maintenance [09:38:39] (03PS2) 10Hnowlan: kask: make TLS configuration a secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/849117 [09:38:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2131 (T321312)', diff saved to https://phabricator.wikimedia.org/P36434 and previous config saved to /var/cache/conftool/dbconfig/20221026-093842-ladsgroup.json [09:41:29] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (10Vgutierrez) I've followed the steps mentioned by @TheresNoTime but sadly it didn't help at all. Please consider that varnish... 
[09:41:37] (03PS1) 10Jbond: R:system::role: colour system role based on its name [puppet] - 10https://gerrit.wikimedia.org/r/849497 (https://phabricator.wikimedia.org/T320696) [09:41:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T321312)', diff saved to https://phabricator.wikimedia.org/P36435 and previous config saved to /var/cache/conftool/dbconfig/20221026-094154-ladsgroup.json [09:42:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:42:04] (03CR) 10CI reject: [V: 04-1] R:system::role: colour system role based on its name [puppet] - 10https://gerrit.wikimedia.org/r/849497 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [09:42:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:42:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:42:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:42:20] 10SRE, 10serviceops, 10Performance-Team (Radar): Phase out nutcracker from mediawiki servers - https://phabricator.wikimedia.org/T277183 (10Clement_Goubert) [09:42:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T321312)', diff saved to https://phabricator.wikimedia.org/P36436 and previous config saved to /var/cache/conftool/dbconfig/20221026-094226-ladsgroup.json [09:42:28] 10SRE, 10serviceops, 10Performance-Team (Radar): Remove nutcracker from mediawiki chart - https://phabricator.wikimedia.org/T321042 (10Clement_Goubert) 05Open→03Resolved [09:42:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37761/console" [puppet] - 10https://gerrit.wikimedia.org/r/849497 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [09:42:48] (03CR) 10Volans: "I just did a quick pass on the python side, left a comment, not a blocker." [puppet] - 10https://gerrit.wikimedia.org/r/849114 (owner: 10Ayounsi) [09:43:58] (03CR) 10JMeybohm: [C: 03+2] Add role to kubemaster entry [puppet] - 10https://gerrit.wikimedia.org/r/849487 (owner: 10JMeybohm) [09:44:13] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:45:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:46:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T321312)', diff saved to https://phabricator.wikimedia.org/P36437 and previous config saved to /var/cache/conftool/dbconfig/20221026-094619-ladsgroup.json [09:46:49] 10SRE, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (10Vgutierrez) Removing Traffic since haproxy/varnish/ATS isn't at fault here. 
[09:48:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T321312)', diff saved to https://phabricator.wikimedia.org/P36438 and previous config saved to /var/cache/conftool/dbconfig/20221026-094841-ladsgroup.json [09:49:12] (03PS12) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) [09:49:47] (03PS3) 10Hnowlan: api-gateway: create fine-grained liftwing API definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326) [09:50:33] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:51:44] 10SRE, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (10TheresNoTime) Thanks @Vgutierrez, I'll take a closer look in a moment, but just noting from `deployment-ms-be05`: ` Oct 26 09:49:16 deplo... [09:52:05] (03CR) 10CI reject: [V: 04-1] api-gateway: create fine-grained liftwing API definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan) [09:53:06] 10SRE, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (10Vgutierrez) @TheresNoTime ats-be in deployment-cache-upload07 forwards the requests to deployment-ms-fe03.deployment-prep.eqiad.wmflabs... 
[09:55:19] (03CR) 10Vgutierrez: [C: 03+1] "looking good, PCC shows a NOOP on both production and deployment-prep environments: https://puppet-compiler.wmflabs.org/pcc-worker1003/377" [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [09:57:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 59605 [09:57:57] (03CR) 10Filippo Giunchedi: [C: 03+1] Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [09:58:22] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 59605 [09:59:11] RECOVERY - Check systemd state on wcqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:29] 10SRE, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (10AlexisJazz) "If someone could purge.. that should fix it" Even if it worked, I'd rather not do that whenever I want to see a thumbnail o... 
[09:59:44] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9038 [09:59:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9038 [10:00:04] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 (owner: 10Giuseppe Lavagetto) [10:00:51] ^ sending peering requests to the phone providers where I'm going on vacations :) [10:01:20] 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), and 2 others: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10Vgutierrez) [10:01:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P36439 and previous config saved to /var/cache/conftool/dbconfig/20221026-100125-ladsgroup.json [10:01:41] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:02:00] (03CR) 10Vgutierrez: [C: 03+1] Remove confd_experiment_fqdn support [puppet] - 10https://gerrit.wikimedia.org/r/845713 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [10:02:09] 10SRE, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (10RhinosF1) >>! In T321654#8344934, @AlexisJazz wrote: > "If someone could purge.. that should fix it" > > Even if it worked, I'd rather n... 
[10:02:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:03:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [10:03:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance [10:03:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [10:03:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance [10:03:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36440 and previous config saved to /var/cache/conftool/dbconfig/20221026-100319-ladsgroup.json [10:03:28] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [10:03:34] 10SRE, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (10TheresNoTime) ` samtar@deployment-ms-fe03:~$ swift list wikipedia-en-local-thumb.13 1/13/Bert_Self-portrait2.png/150px-Bert_Self-portrait... 
[10:03:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2111.codfw.wmnet with reason: Maintenance [10:03:38] (03CR) 10Filippo Giunchedi: dns: generate HOST.mgmt records in all statuses (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/849495 (https://phabricator.wikimedia.org/T320721) (owner: 10Filippo Giunchedi) [10:03:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2111.codfw.wmnet with reason: Maintenance [10:03:42] (03CR) 10Vgutierrez: single_backend mode for production varnishes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [10:03:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T318950)', diff saved to https://phabricator.wikimedia.org/P36441 and previous config saved to /var/cache/conftool/dbconfig/20221026-100344-ladsgroup.json [10:03:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P36442 and previous config saved to /var/cache/conftool/dbconfig/20221026-100354-ladsgroup.json [10:04:15] (03CR) 10Vgutierrez: [C: 03+1] "looks good, please add Bug: T288106 to the commit message before merging" [puppet] - 10https://gerrit.wikimedia.org/r/845651 (owner: 10BBlack) [10:05:07] PROBLEM - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36443 and previous config saved to /var/cache/conftool/dbconfig/20221026-100532-ladsgroup.json [10:06:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T318950)', diff saved to 
https://phabricator.wikimedia.org/P36444 and previous config saved to /var/cache/conftool/dbconfig/20221026-100610-ladsgroup.json [10:07:52] (03CR) 10Jbond: [C: 04-1] "looks fine to me but there are a couple of issues which will cause an error" [puppet] - 10https://gerrit.wikimedia.org/r/849114 (owner: 10Ayounsi) [10:09:57] (03CR) 10Aqu: "👍" [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [10:10:52] (03CR) 10Ladsgroup: [C: 03+2] Add add_el_to_domain_index_T318605.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/849065 (https://phabricator.wikimedia.org/T318605) (owner: 10Ladsgroup) [10:11:16] (03Merged) 10jenkins-bot: Add add_el_to_domain_index_T318605.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/849065 (https://phabricator.wikimedia.org/T318605) (owner: 10Ladsgroup) [10:11:25] (03PS1) 10Jelto: gitlab_runner: enable restrict_firewall for Shared Runners [puppet] - 10https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) [10:12:47] (03PS1) 10Btullis: Add a postgres user with an IPv6 RFC 4193 host match [puppet] - 10https://gerrit.wikimedia.org/r/849500 (https://phabricator.wikimedia.org/T319440) [10:13:54] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-jbond: puppet-catalog-compiler: compilation result randomly places servers in the wrong section - https://phabricator.wikimedia.org/T224977 (10jbond) [10:14:23] (03PS1) 10Clément Goubert: mediawiki: Create new mw-debug deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/849501 (https://phabricator.wikimedia.org/T321201) [10:16:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P36445 and previous config saved to /var/cache/conftool/dbconfig/20221026-101631-ladsgroup.json [10:17:03] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:18:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:19:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P36446 and previous config saved to /var/cache/conftool/dbconfig/20221026-101901-ladsgroup.json [10:19:19] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37769/console" [puppet] - 10https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) (owner: 10Jelto) [10:20:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:toolforge: use puppetdb for grid hba data [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah) [10:20:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P36447 and previous config saved to /var/cache/conftool/dbconfig/20221026-102039-ladsgroup.json [10:21:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P36448 and previous config saved to /var/cache/conftool/dbconfig/20221026-102116-ladsgroup.json [10:22:27] (03PS1) 10Clément Goubert: kubernetes: Rename mwdebug to mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) [10:23:37] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:24:44] (03PS2) 10Clément Goubert: kubernetes: Rename mwdebug to mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) [10:28:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I 
wonder how this went unnoticed. I would have expected the hiera lookups to fail, or the services to be misconfigured, or both." [puppet] - 10https://gerrit.wikimedia.org/r/849483 (owner: 10David Caro) [10:28:16] (03PS1) 10Ladsgroup: beta: Start doing write both of externallinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849503 (https://phabricator.wikimedia.org/T321662) [10:28:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack::monitor::networktests: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/837121 (owner: 10Muehlenhoff) [10:29:05] RECOVERY - Check systemd state on wcqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T321312)', diff saved to https://phabricator.wikimedia.org/P36449 and previous config saved to /var/cache/conftool/dbconfig/20221026-103138-ladsgroup.json [10:31:40] jouncebot: nowandnext [10:31:40] No deployments scheduled for the next 2 hour(s) and 28 minute(s) [10:31:40] In 2 hour(s) and 28 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221026T1300) [10:31:45] (03CR) 10Ladsgroup: [C: 03+2] beta: Start doing write both of externallinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849503 (https://phabricator.wikimedia.org/T321662) (owner: 10Ladsgroup) [10:32:28] (03Merged) 10jenkins-bot: beta: Start doing write both of externallinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849503 (https://phabricator.wikimedia.org/T321662) (owner: 10Ladsgroup) [10:34:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T321312)', diff saved to https://phabricator.wikimedia.org/P36450 and previous config saved to /var/cache/conftool/dbconfig/20221026-103407-ladsgroup.json [10:34:12] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [10:34:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [10:34:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T321312)', diff saved to https://phabricator.wikimedia.org/P36451 and previous config saved to /var/cache/conftool/dbconfig/20221026-103432-ladsgroup.json [10:35:03] PROBLEM - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:23] (03CR) 10Cparle: [C: 03+2] Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [10:35:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P36452 and previous config saved to /var/cache/conftool/dbconfig/20221026-103545-ladsgroup.json [10:36:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P36453 and previous config saved to /var/cache/conftool/dbconfig/20221026-103623-ladsgroup.json [10:36:50] (03Merged) 10jenkins-bot: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [10:38:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:40:23] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:40:49] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host 
https://wikitech.wikimedia.org/wiki/HTTPS [10:40:49] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/HTTPS [10:40:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T321312)', diff saved to https://phabricator.wikimedia.org/P36454 and previous config saved to /var/cache/conftool/dbconfig/20221026-104050-ladsgroup.json [10:40:51] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/HTTPS [10:42:05] an issue on SPG, or maintenance ^? [10:42:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:42:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:45:29] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:46:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1023.eqiad.wmnet to cluster eqiad and group A [10:46:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:47:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 9.378 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:47:28] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1023.eqiad.wmnet to cluster eqiad and group A [10:48:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.539 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:49:19] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) >>! In T308677#8342204, @MatthewVernon wrote: > `swift-drive-audit` is... 
[10:50:14] !log restarting blazegraph on wdqs1007 (BlazegraphFreeAllocatorsDecreasingRapidly) [10:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:23] (03PS2) 10Btullis: Add a postgres user with an IPv6 RFC 4193 host match [puppet] - 10https://gerrit.wikimedia.org/r/849500 (https://phabricator.wikimedia.org/T319440) [10:50:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36455 and previous config saved to /var/cache/conftool/dbconfig/20221026-105052-ladsgroup.json [10:50:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [10:50:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [10:50:57] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [10:51:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T318950)', diff saved to https://phabricator.wikimedia.org/P36456 and previous config saved to /var/cache/conftool/dbconfig/20221026-105102-ladsgroup.json [10:51:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T318950)', diff saved to https://phabricator.wikimedia.org/P36457 and previous config saved to /var/cache/conftool/dbconfig/20221026-105129-ladsgroup.json [10:51:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [10:51:32] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37772/console" [puppet] - 10https://gerrit.wikimedia.org/r/849500 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [10:51:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 
on db2123.codfw.wmnet with reason: Maintenance [10:51:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T318950)', diff saved to https://phabricator.wikimedia.org/P36458 and previous config saved to /var/cache/conftool/dbconfig/20221026-105140-ladsgroup.json [10:52:46] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [10:53:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T318950)', diff saved to https://phabricator.wikimedia.org/P36459 and previous config saved to /var/cache/conftool/dbconfig/20221026-105315-ladsgroup.json [10:54:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T318950)', diff saved to https://phabricator.wikimedia.org/P36460 and previous config saved to /var/cache/conftool/dbconfig/20221026-105406-ladsgroup.json [10:55:10] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: fix typo in systemd::sysuser parameter name [puppet] - 10https://gerrit.wikimedia.org/r/849505 [10:55:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P36461 and previous config saved to /var/cache/conftool/dbconfig/20221026-105556-ladsgroup.json [10:56:34] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add a postgres user with an IPv6 RFC 4193 host match [puppet] - 10https://gerrit.wikimedia.org/r/849500 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [10:56:46] (03CR) 10Muehlenhoff: [C: 03+1] "Oops, sorry for that :-)" [puppet] - 10https://gerrit.wikimedia.org/r/849505 (owner: 10Arturo Borrero Gonzalez) [10:57:31] 
(BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [10:59:50] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:01:30] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/37774/" [puppet] - 10https://gerrit.wikimedia.org/r/849505 (owner: 10Arturo Borrero Gonzalez) [11:01:55] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openstack: networktests: fix typo in systemd::sysuser parameter name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849505 (owner: 10Arturo Borrero Gonzalez) [11:04:19] (03PS2) 10Ssingh: cp4046: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/849207 (https://phabricator.wikimedia.org/T317244) [11:06:21] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ssingh) [11:07:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:08:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P36462 and previous config saved to /var/cache/conftool/dbconfig/20221026-110822-ladsgroup.json [11:09:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P36463 and previous config saved to 
/var/cache/conftool/dbconfig/20221026-110912-ladsgroup.json [11:11:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P36464 and previous config saved to /var/cache/conftool/dbconfig/20221026-111103-ladsgroup.json [11:11:08] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS buster [11:11:17] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4046.ulsfo.wmnet with OS buster [11:12:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1015.eqiad.wmnet [11:16:30] (03PS4) 10Hnowlan: api-gateway: create fine-grained liftwing API definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326) [11:17:56] (03PS1) 10Giuseppe Lavagetto: httpd-fcgi: move log suppression to the admin vhost [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/849530 [11:20:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1015.eqiad.wmnet [11:22:01] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpd-fcgi: move log suppression to the admin vhost [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/849530 (owner: 10Giuseppe Lavagetto) [11:22:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1023.eqiad.wmnet to cluster eqiad and group B [11:22:37] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1023.eqiad.wmnet to cluster eqiad and group B [11:22:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1015.eqiad.wmnet to cluster eqiad and group B [11:23:26] PROBLEM - Check systemd state on dse-k8s-worker1008 is CRITICAL: CRITICAL - 
degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P36465 and previous config saved to /var/cache/conftool/dbconfig/20221026-112328-ladsgroup.json [11:23:34] PROBLEM - Check whether ferm is active by checking the default input chain on dse-k8s-worker1008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:23:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1015.eqiad.wmnet to cluster eqiad and group B [11:24:00] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4046.ulsfo.wmnet with OS buster [11:24:07] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4046.ulsfo.wmnet with OS buster executed with errors: - cp4046 (**FA... 
[11:24:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P36466 and previous config saved to /var/cache/conftool/dbconfig/20221026-112419-ladsgroup.json [11:25:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:26:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T321312)', diff saved to https://phabricator.wikimedia.org/P36467 and previous config saved to /var/cache/conftool/dbconfig/20221026-112609-ladsgroup.json [11:26:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:26:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:26:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T321312)', diff saved to https://phabricator.wikimedia.org/P36468 and previous config saved to /var/cache/conftool/dbconfig/20221026-112634-ladsgroup.json [11:28:51] SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (jbond) >>! In T308677#8345166, @MatthewVernon wrote: > I think `swift::mount_... 
[11:29:06] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4046 [11:29:08] !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4046 [11:30:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:32:40] PROBLEM - Check systemd state on ms-backup1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:11] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS buster [11:33:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T321312)', diff saved to https://phabricator.wikimedia.org/P36469 and previous config saved to /var/cache/conftool/dbconfig/20221026-113333-ladsgroup.json [11:33:50] (PS2) Giuseppe Lavagetto: shellbox: correct env variable [deployment-charts] - https://gerrit.wikimedia.org/r/849474 [11:33:52] (PS2) Giuseppe Lavagetto: shellbox: switch to ecs logging [deployment-charts] - https://gerrit.wikimedia.org/r/849475 [11:33:54] (PS1) Giuseppe Lavagetto: shellbox: bump httpd-fcgi [deployment-charts] - https://gerrit.wikimedia.org/r/849533 [11:33:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:36:19] (CR) Volans: "did a quick pass of the init, I guess the rest might get changed based on the comments so I'll leave the rest for later" [cookbooks] - https://gerrit.wikimedia.org/r/849130 (owner: Jbond) [11:37:14] 
PROBLEM - Check whether ferm is active by checking the default input chain on ms-backup1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:37:31] (PS8) Hnowlan: helmfile.d: add thumbor configuration [deployment-charts] - https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) [11:38:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T318950)', diff saved to https://phabricator.wikimedia.org/P36470 and previous config saved to /var/cache/conftool/dbconfig/20221026-113835-ladsgroup.json [11:38:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [11:38:41] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [11:38:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [11:38:50] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (phaultfinder) [11:38:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36471 and previous config saved to /var/cache/conftool/dbconfig/20221026-113856-ladsgroup.json [11:39:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T318950)', diff saved to https://phabricator.wikimedia.org/P36472 and previous config saved to /var/cache/conftool/dbconfig/20221026-113925-ladsgroup.json [11:39:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2128.codfw.wmnet with reason: Maintenance [11:39:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2128.codfw.wmnet with reason: Maintenance [11:39:32] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [11:39:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [11:39:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T318950)', diff saved to https://phabricator.wikimedia.org/P36473 and previous config saved to /var/cache/conftool/dbconfig/20221026-113941-ladsgroup.json [11:40:03] SRE, Growth-Team, Notifications, Wikimedia-production-error: Failed to fetch API response from {wiki}. Error code {code} - https://phabricator.wikimedia.org/T321409 (kostajh) > I don't see any outage tracked on https://wikitech.wikimedia.org/wiki/Incident_status so I'm filing this (and tagging SR... [11:41:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36474 and previous config saved to /var/cache/conftool/dbconfig/20221026-114109-ladsgroup.json [11:42:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T318950)', diff saved to https://phabricator.wikimedia.org/P36475 and previous config saved to /var/cache/conftool/dbconfig/20221026-114207-ladsgroup.json [11:44:44] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff) [11:45:57] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4046.ulsfo.wmnet with OS buster [11:45:58] PROBLEM - Host netflow1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:46:09] (CR) David Caro: p::wmcs:nfs: Fix typo in the hiera lookup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849483 (owner: David Caro) [11:46:22] !log sudo ipmitool -I lanplus -H "cp4046.mgmt.ulsfo.wmnet" -U root -E chassis power cycle [11:46:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:58] (PS1) Ssingh: Revert "cp4046: update site.pp and related configs for cp role" [puppet] - https://gerrit.wikimedia.org/r/849510 [11:48:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P36476 and previous config saved to /var/cache/conftool/dbconfig/20221026-114840-ladsgroup.json [11:48:55] (CR) Giuseppe Lavagetto: [C: +2] shellbox: correct env variable [deployment-charts] - https://gerrit.wikimedia.org/r/849474 (owner: Giuseppe Lavagetto) [11:49:10] RECOVERY - Check systemd state on dse-k8s-worker1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:49] (CR) Ssingh: [C: +2] Revert "cp4046: update site.pp and related configs for cp role" [puppet] - https://gerrit.wikimedia.org/r/849510 (owner: Ssingh) [11:52:33] (Merged) jenkins-bot: shellbox: correct env variable [deployment-charts] - https://gerrit.wikimedia.org/r/849474 (owner: Giuseppe Lavagetto) [11:53:58] (PS1) Ssingh: cp4039: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/849534 (https://phabricator.wikimedia.org/T317244) [11:54:36] RECOVERY - Check whether ferm is active by checking the default input chain on dse-k8s-worker1008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:56:07] (CR) David Caro: [V: +1] global: replace labsproject by wmcs_project (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849473 (owner: David Caro) [11:56:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P36477 and previous config saved to /var/cache/conftool/dbconfig/20221026-115615-ladsgroup.json [11:57:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P36478 and previous config saved to /var/cache/conftool/dbconfig/20221026-115714-ladsgroup.json [11:57:16] PROBLEM - Check systemd state on ms-be2044 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:31] (CR) Giuseppe Lavagetto: [C: +2] shellbox: bump httpd-fcgi [deployment-charts] - https://gerrit.wikimedia.org/r/849533 (owner: Giuseppe Lavagetto) [11:58:40] RECOVERY - Check systemd state on wcqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:42] (CR) Muehlenhoff: "Looks good, one bug and a few nits left." [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [11:59:41] RECOVERY - Check systemd state on ms-backup1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [12:01:13] (Merged) jenkins-bot: shellbox: bump httpd-fcgi [deployment-charts] - https://gerrit.wikimedia.org/r/849533 (owner: Giuseppe Lavagetto) [12:01:39] SRE, serviceops-radar, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Clean up stale/old confd errors automatically - https://phabricator.wikimedia.org/T321678 (fgiunchedi) [12:01:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:02:05] PROBLEM - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:57] !log draining ganeti1009 for eventual reimage T311687 [12:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:03] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [12:03:18] SRE, Infrastructure-Foundations: Create an initial IDM/LDAP image for tests and CI - https://phabricator.wikimedia.org/T320430 (SLyngshede-WMF) Merge-Request: https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/21 [12:03:20] (PS1) Marostegui: clouddb1013, clouddb1017: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/849537 [12:03:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P36479 and previous config saved to /var/cache/conftool/dbconfig/20221026-120346-ladsgroup.json [12:03:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:04:56] (CR) Ssingh: [C: +2] cp4039: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/849534 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh) [12:05:41] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1068 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:06:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:06:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4039.ulsfo.wmnet with OS buster [12:06:53] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 
1.773 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:06:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.806 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:07:17] RECOVERY - Check whether ferm is active by checking the default input chain on ms-backup1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:07:22] PROBLEM - Check systemd state on ms-be1068 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:13] (PS1) Filippo Giunchedi: confd: cleanup stale errors [puppet] - https://gerrit.wikimedia.org/r/849539 (https://phabricator.wikimedia.org/T321678) [12:11:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P36480 and previous config saved to /var/cache/conftool/dbconfig/20221026-121122-ladsgroup.json [12:12:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P36481 and previous config saved to /var/cache/conftool/dbconfig/20221026-121220-ladsgroup.json [12:13:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1003.eqiad.wmnet to drbd [12:13:58] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:18:25] SRE, Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (fgiunchedi) >>! In T321547#8343951, @BCornwall wrote: > Perhaps this is because the severity is set to warning rather than critical? 
For the IRC notifications to -traffic you are correct re: the severity, h... [12:18:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T321312)', diff saved to https://phabricator.wikimedia.org/P36482 and previous config saved to /var/cache/conftool/dbconfig/20221026-121853-ladsgroup.json [12:18:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:18:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [12:19:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [12:19:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T321312)', diff saved to https://phabricator.wikimedia.org/P36483 and previous config saved to /var/cache/conftool/dbconfig/20221026-121928-ladsgroup.json [12:21:31] (PS1) Hokwelum: Update Academic Compuer Club mirror details [puppet] - https://gerrit.wikimedia.org/r/849542 [12:22:01] RECOVERY - Check systemd state on ms-be2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:06] (CR) JMeybohm: [C: +1] helmfile.d: add thumbor configuration [deployment-charts] - https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [12:22:20] (PS1) JMeybohm: kubernetes: Actually use the master_fqdn instead of the cert name [puppet] - https://gerrit.wikimedia.org/r/849543 (https://phabricator.wikimedia.org/T307943) [12:23:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag 
[12:23:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1003.eqiad.wmnet to drbd [12:24:05] PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:24:19] RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 2.54 ms [12:25:13] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T321312)', diff saved to https://phabricator.wikimedia.org/P36484 and previous config saved to /var/cache/conftool/dbconfig/20221026-122550-ladsgroup.json [12:25:59] RECOVERY - Check systemd state on ms-be1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:26] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37777/console" [puppet] - https://gerrit.wikimedia.org/r/849543 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [12:26:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2034 es20233 es2032', diff saved to https://phabricator.wikimedia.org/P36485 and previous config saved to /var/cache/conftool/dbconfig/20221026-122632-root.json [12:26:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36486 and previous config saved to /var/cache/conftool/dbconfig/20221026-122641-ladsgroup.json [12:26:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [12:26:46] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [12:26:47] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [12:26:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T318950)', diff saved to https://phabricator.wikimedia.org/P36487 and previous config saved to /var/cache/conftool/dbconfig/20221026-122652-ladsgroup.json [12:27:19] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:27:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:27:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T318950)', diff saved to https://phabricator.wikimedia.org/P36488 and previous config saved to /var/cache/conftool/dbconfig/20221026-122726-ladsgroup.json [12:27:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance [12:27:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance [12:27:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36489 and previous config saved to /var/cache/conftool/dbconfig/20221026-122748-ladsgroup.json [12:27:53] (PS2) JMeybohm: kubernetes: Actually use the master_fqdn instead of the cert name [puppet] - https://gerrit.wikimedia.org/r/849543 (https://phabricator.wikimedia.org/T307943) [12:29:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T318950)', diff saved to https://phabricator.wikimedia.org/P36490 and previous config saved to 
/var/cache/conftool/dbconfig/20221026-122905-ladsgroup.json [12:30:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:30:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36491 and previous config saved to /var/cache/conftool/dbconfig/20221026-123014-ladsgroup.json [12:30:46] (PS5) Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - https://gerrit.wikimedia.org/r/849130 [12:30:48] (PS16) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093 [12:30:50] (PS4) Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - https://gerrit.wikimedia.org/r/849135 [12:31:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.794 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:32:21] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4039.ulsfo.wmnet with OS buster [12:32:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36492 and previous config saved to /var/cache/conftool/dbconfig/20221026-123243-root.json [12:32:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36493 and previous config saved to /var/cache/conftool/dbconfig/20221026-123248-root.json [12:32:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 1%: After 
upgrade', diff saved to https://phabricator.wikimedia.org/P36494 and previous config saved to /var/cache/conftool/dbconfig/20221026-123252-root.json [12:33:01] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.773 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:35:18] (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:35:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1034 es1033 es1032', diff saved to https://phabricator.wikimedia.org/P36495 and previous config saved to /var/cache/conftool/dbconfig/20221026-123545-root.json [12:36:11] (PS6) Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - https://gerrit.wikimedia.org/r/849130 [12:36:39] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1068 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:37:03] (PS2) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) [12:38:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1003.eqiad.wmnet to plain [12:38:21] jouncebot: nowandnext [12:38:21] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [12:38:21] In 0 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221026T1300) [12:38:37] (PS19) Slyngshede: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [12:38:43] (CR) Slyngshede: role::idm Basic deployment of IDM (4 comments) 
[puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [12:38:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1003.eqiad.wmnet to plain [12:38:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:39:54] (PS7) Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - https://gerrit.wikimedia.org/r/849130 [12:39:56] (PS17) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093 [12:39:58] (PS5) Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - https://gerrit.wikimedia.org/r/849135 [12:40:00] (PS1) Jbond: sre.__init__.py: update minor formating nits [cookbooks] - https://gerrit.wikimedia.org/r/849544 [12:40:18] (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:45] (CR) Herron: [C: +1] "Very nice!" 
[puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi) [12:40:50] (CR) Herron: [C: +1] alerting_host: include dispatch profile [puppet] - https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi) [12:40:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P36496 and previous config saved to /var/cache/conftool/dbconfig/20221026-124057-ladsgroup.json [12:41:03] (CR) Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/849130 (owner: Jbond) [12:41:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36497 and previous config saved to /var/cache/conftool/dbconfig/20221026-124116-root.json [12:41:17] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:41:21] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:41:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36498 and previous config saved to /var/cache/conftool/dbconfig/20221026-124121-root.json [12:41:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36499 and previous config saved to /var/cache/conftool/dbconfig/20221026-124125-root.json [12:41:47] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:42:17] RECOVERY - SSH 
on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:42:48] (PS2) Hokwelum: Update Academic Computer Club mirror details [puppet] - https://gerrit.wikimedia.org/r/849542 [12:43:39] (CR) Marostegui: [C: +2] clouddb1013, clouddb1017: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/849537 (owner: Marostegui) [12:44:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P36500 and previous config saved to /var/cache/conftool/dbconfig/20221026-124411-ladsgroup.json [12:44:14] (PS20) Slyngshede: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [12:44:49] (CR) CI reject: [V: -1] role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [12:45:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P36501 and previous config saved to /var/cache/conftool/dbconfig/20221026-124521-ladsgroup.json [12:45:37] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:47:14] !log installing isc-dhcp security updates [12:47:14] (PS21) Slyngshede: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [12:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36502 and previous config saved to /var/cache/conftool/dbconfig/20221026-124748-root.json [12:47:51] (CR) CI reject: [V: -1] role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [12:47:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36503 and previous config saved to /var/cache/conftool/dbconfig/20221026-124753-root.json [12:47:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36504 and previous config saved to /var/cache/conftool/dbconfig/20221026-124757-root.json [12:49:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 4.041 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:49:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.162 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:50:39] (PS1) Kosta Harlan: GrowthExperiments: Enable link recommendation frontend for 5th round [mediawiki-config] - https://gerrit.wikimedia.org/r/849546 (https://phabricator.wikimedia.org/T304549) [12:55:10] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/849546 (https://phabricator.wikimedia.org/T304549) 
(owner: 10Kosta Harlan) [12:55:53] (03PS22) 10Slyngshede: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [12:56:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P36505 and previous config saved to /var/cache/conftool/dbconfig/20221026-125604-ladsgroup.json [12:56:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36506 and previous config saved to /var/cache/conftool/dbconfig/20221026-125621-root.json [12:56:23] PROBLEM - Host cloudgw2003-dev is DOWN: PING CRITICAL - Packet loss = 100% [12:56:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36507 and previous config saved to /var/cache/conftool/dbconfig/20221026-125626-root.json [12:56:28] (03CR) 10CI reject: [V: 04-1] role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [12:56:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36508 and previous config saved to /var/cache/conftool/dbconfig/20221026-125630-root.json [12:56:46] (03PS1) 10Btullis: Revert "Add a postgres user with an IPv6 RFC 4193 host match" [puppet] - 10https://gerrit.wikimedia.org/r/849514 [12:56:59] RECOVERY - Host cloudgw2003-dev is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [12:57:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:58:10] (03CR) 10Ottomata: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [12:59:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P36509 and previous config saved to /var/cache/conftool/dbconfig/20221026-125918-ladsgroup.json [13:00:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P36510 and previous config saved to /var/cache/conftool/dbconfig/20221026-130027-ladsgroup.json [13:00:44] (03PS8) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 [13:01:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:01:15] jouncebot: now [13:01:15] For the next 0 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221026T1300) [13:01:25] oh, it's window already! [13:01:28] i can deploy today [13:01:36] I can deploy too, and I also have a patch scheduled ^ [13:01:38] * ^^ [13:01:43] Lucas_WMDE: do you want to self-service? :) [13:01:46] idk what happened to jouncebot [13:01:47] yeah :) [13:01:52] go ahead then [13:02:00] hi [13:02:03] hi kostajh! 
[13:02:27] ah, I didn’t see that you also had a patch urbanecm ^^ [13:02:29] I’ll ping you when done [13:02:34] yeah yeah :) [13:02:34] unless you want me to deploy that too [13:02:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36511 and previous config saved to /var/cache/conftool/dbconfig/20221026-130253-root.json [13:02:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36512 and previous config saved to /var/cache/conftool/dbconfig/20221026-130258-root.json [13:03:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36513 and previous config saved to /var/cache/conftool/dbconfig/20221026-130302-root.json [13:03:14] i prefer self-serving, it's a bit tricky patch (needs a script to run on the affected wikis). 
[13:03:26] ok [13:03:36] (03CR) 10Lucas Werkmeister (WMDE): Add config for redirect badges on wikidatawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827968 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große) [13:03:44] (03PS5) 10Lucas Werkmeister (WMDE): Add config for redirect badges on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827968 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große) [13:03:54] (03PS1) 10JMeybohm: pontoon::lb: Use check-ssl instead of ssl-hello-chk [puppet] - 10https://gerrit.wikimedia.org/r/849547 [13:04:23] 10SRE, 10Traffic: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10fgiunchedi) [13:04:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827968 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große) [13:05:54] (03Merged) 10jenkins-bot: Add config for redirect badges on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827968 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große) [13:06:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4039.ulsfo.wmnet with OS buster [13:06:16] (03CR) 10ArielGlenn: [C: 03+2] Update Academic Computer Club mirror details [puppet] - 10https://gerrit.wikimedia.org/r/849542 (owner: 10Hokwelum) [13:06:25] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:827968|Add config for redirect badges on wikidatawiki (T316637)]] [13:06:49] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and migr: Backport for [[gerrit:827968|Add config for redirect badges on wikidatawiki (T316637)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:06:58] testing [13:07:12] works [13:07:17] continuing [13:07:50] 
(03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/849547 (owner: 10JMeybohm) [13:08:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:09:03] (03PS2) 10David Caro: create_instance_with_prefix: fix prefix guess [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/848222 [13:09:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:09:29] (03PS3) 10David Caro: create_instance_with_prefix: fix prefix guess [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/848222 [13:09:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:10:17] (03PS2) 10JMeybohm: pontoon::lb: Use check-ssl instead of ssl-hello-chk [puppet] - 10https://gerrit.wikimedia.org/r/849547 [13:10:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:11:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T321312)', diff saved to https://phabricator.wikimedia.org/P36514 and previous config saved to /var/cache/conftool/dbconfig/20221026-131110-ladsgroup.json [13:11:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [13:11:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36515 and previous config saved to /var/cache/conftool/dbconfig/20221026-131126-root.json [13:11:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [13:11:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36516 and previous config saved to /var/cache/conftool/dbconfig/20221026-131132-root.json [13:11:34] (03CR) 10JMeybohm: 
pontoon::lb: Use check-ssl instead of ssl-hello-chk (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849547 (owner: 10JMeybohm) [13:11:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36517 and previous config saved to /var/cache/conftool/dbconfig/20221026-131135-root.json [13:11:42] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:827968|Add config for redirect badges on wikidatawiki (T316637)]] (duration: 05m 17s) [13:11:47] urbanecm: I’m done [13:11:52] okay, starting with mine [13:12:00] (03PS2) 10Urbanecm: [Growth] Enable structured mentor list everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849419 (https://phabricator.wikimedia.org/T310905) [13:12:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849419 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [13:12:58] (03Merged) 10jenkins-bot: [Growth] Enable structured mentor list everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849419 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [13:13:21] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:849419|[Growth] Enable structured mentor list everywhere (T310905)]] [13:13:36] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/migrateWikitextMentorList.php # T310905 [13:13:44] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:849419|[Growth] Enable structured mentor list everywhere (T310905)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:13:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "Ship it! 
Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/849547 (owner: 10JMeybohm) [13:14:23] (03CR) 10JMeybohm: [C: 03+2] pontoon::lb: Use check-ssl instead of ssl-hello-chk [puppet] - 10https://gerrit.wikimedia.org/r/849547 (owner: 10JMeybohm) [13:14:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T318950)', diff saved to https://phabricator.wikimedia.org/P36518 and previous config saved to /var/cache/conftool/dbconfig/20221026-131424-ladsgroup.json [13:14:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [13:14:31] testing [13:14:39] (03PS3) 10JMeybohm: pontoon::lb: Use check-ssl instead of ssl-hello-chk [puppet] - 10https://gerrit.wikimedia.org/r/849547 [13:14:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [13:14:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36519 and previous config saved to /var/cache/conftool/dbconfig/20221026-131446-ladsgroup.json [13:15:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:15:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36520 and previous config saved to /var/cache/conftool/dbconfig/20221026-131534-ladsgroup.json [13:15:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance [13:15:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance [13:15:40] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [13:15:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling 
db2157 (T318950)', diff saved to https://phabricator.wikimedia.org/P36521 and previous config saved to /var/cache/conftool/dbconfig/20221026-131544-ladsgroup.json [13:17:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36522 and previous config saved to /var/cache/conftool/dbconfig/20221026-131659-ladsgroup.json [13:17:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T321312)', diff saved to https://phabricator.wikimedia.org/P36523 and previous config saved to /var/cache/conftool/dbconfig/20221026-131752-ladsgroup.json [13:17:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36524 and previous config saved to /var/cache/conftool/dbconfig/20221026-131758-root.json [13:18:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36525 and previous config saved to /var/cache/conftool/dbconfig/20221026-131803-root.json [13:18:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36526 and previous config saved to /var/cache/conftool/dbconfig/20221026-131807-root.json [13:18:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T318950)', diff saved to https://phabricator.wikimedia.org/P36527 and previous config saved to /var/cache/conftool/dbconfig/20221026-131810-ladsgroup.json [13:19:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:19:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:19:44] looks good, syncing [13:19:54] (03PS2) 10Urbanecm: GrowthExperiments: Enable link recommendation frontend for 5th round [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/849546 (https://phabricator.wikimedia.org/T304549) (owner: 10Kosta Harlan) [13:19:56] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Enable link recommendation frontend for 5th round [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849546 (https://phabricator.wikimedia.org/T304549) (owner: 10Kosta Harlan) [13:20:19] <_joe_> jouncebot: now [13:20:19] For the next 0 hour(s) and 39 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221026T1300) [13:20:42] (03Merged) 10jenkins-bot: GrowthExperiments: Enable link recommendation frontend for 5th round [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849546 (https://phabricator.wikimedia.org/T304549) (owner: 10Kosta Harlan) [13:21:08] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [13:21:21] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [13:22:21] (ConfdResourceFailed) firing: (40) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:22:38] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4039.ulsfo.wmnet with OS buster [13:23:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:23:49] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:849419|[Growth] Enable structured mentor list everywhere (T310905)]] (duration: 10m 28s) [13:23:54] T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905 [13:24:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849546 
(https://phabricator.wikimedia.org/T304549) (owner: 10Kosta Harlan) [13:24:45] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:849546|GrowthExperiments: Enable link recommendation frontend for 5th round (T304549)]] [13:24:50] T304549: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 [13:25:09] !log urbanecm@deploy1002 urbanecm and kharlan: Backport for [[gerrit:849546|GrowthExperiments: Enable link recommendation frontend for 5th round (T304549)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:25:12] kostajh: your patch's at mwdebug1001, can you test please? [13:25:26] urbanecm: looking [13:26:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36528 and previous config saved to /var/cache/conftool/dbconfig/20221026-132631-root.json [13:26:37] urbanecm: lgtm! [13:26:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36529 and previous config saved to /var/cache/conftool/dbconfig/20221026-132637-root.json [13:26:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36530 and previous config saved to /var/cache/conftool/dbconfig/20221026-132640-root.json [13:26:43] syncing! 
[13:27:43] (03CR) 10Elukey: [C: 03+1] kubernetes: Actually use the master_fqdn instead of the cert name [puppet] - 10https://gerrit.wikimedia.org/r/849543 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [13:28:23] !log installing curl security updates on buster [13:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:29:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:29:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:29:57] !log sudo ipmitool -I lanplus -H "cp4039.mgmt.ulsfo.wmnet" -U root -E chassis power cycle [13:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:30:38] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:849546|GrowthExperiments: Enable link recommendation frontend for 5th round (T304549)]] (duration: 05m 52s) [13:30:43] T304549: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 [13:30:44] and live [13:30:46] anything else, anyone? [13:30:50] thanks urbanecm! [13:30:59] no problem [13:32:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P36531 and previous config saved to /var/cache/conftool/dbconfig/20221026-133206-ladsgroup.json [13:32:46] <_joe_> urbanecm: is the backport window over? 
[13:32:50] _joe_: yes [13:32:59] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [13:32:59] !log UTC afternoon B&C window done [13:33:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P36532 and previous config saved to /var/cache/conftool/dbconfig/20221026-133259-ladsgroup.json [13:33:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36533 and previous config saved to /var/cache/conftool/dbconfig/20221026-133303-root.json [13:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36534 and previous config saved to /var/cache/conftool/dbconfig/20221026-133308-root.json [13:33:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36535 and previous config saved to /var/cache/conftool/dbconfig/20221026-133312-root.json [13:33:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P36536 and previous config saved to /var/cache/conftool/dbconfig/20221026-133317-ladsgroup.json [13:33:28] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [13:34:16] (03PS1) 10Marostegui: mariadb: Switch x1 to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/849548 (https://phabricator.wikimedia.org/T318518) [13:34:40] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [13:35:09] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [13:37:01] 
(CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [13:37:26] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [13:37:41] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [13:38:05] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [13:38:11] (03CR) 10Elukey: [C: 03+2] coredns: support up to upstream version 1.8.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [13:38:19] PROBLEM - Host db1154 is DOWN: PING CRITICAL - Packet loss = 100% [13:38:29] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [13:38:41] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 4.409 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:38:45] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [13:39:12] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [13:39:49] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:40:27] (03CR) 10Ladsgroup: "A tiny request: Can we wait for a couple of hours so I can reboot them in eqiad? I will start it soon." 
[puppet] - 10https://gerrit.wikimedia.org/r/849548 (https://phabricator.wikimedia.org/T318518) (owner: 10Marostegui) [13:41:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:41:27] (03CR) 10Marostegui: "Yes, this change is a noop, as it requires switching it live in mysql too." [puppet] - 10https://gerrit.wikimedia.org/r/849548 (https://phabricator.wikimedia.org/T318518) (owner: 10Marostegui) [13:41:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36537 and previous config saved to /var/cache/conftool/dbconfig/20221026-134136-root.json [13:41:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36538 and previous config saved to /var/cache/conftool/dbconfig/20221026-134141-root.json [13:41:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36539 and previous config saved to /var/cache/conftool/dbconfig/20221026-134145-root.json [13:41:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:43:15] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [13:43:28] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [13:44:22] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [13:44:26] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:44:39] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:44:47] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [13:45:14] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [13:45:37] (03PS1) 10Jbond: pcc: move cloud key to correct realm [puppet] - 10https://gerrit.wikimedia.org/r/849555 [13:45:45] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [13:46:03] (03CR) 10Andrew Bogott: openstack: make domain-aware (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/845004 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [13:46:08] (03CR) 10Ladsgroup: mariadb: Switch x1 to STATEMENT (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849548 (https://phabricator.wikimedia.org/T318518) (owner: 10Marostegui) [13:46:10] (03CR) 10Jbond: [V: 03+2 C: 03+2] pcc: move cloud key to correct realm [puppet] - 10https://gerrit.wikimedia.org/r/849555 (owner: 10Jbond) [13:46:29] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:46:37] (03CR) 10Marostegui: mariadb: Switch x1 to STATEMENT (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849548 (https://phabricator.wikimedia.org/T318518) (owner: 10Marostegui) [13:47:05] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: upload_puppet_facts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P36540 and previous config saved to /var/cache/conftool/dbconfig/20221026-134712-ladsgroup.json [13:48:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P36541 and previous config saved to 
/var/cache/conftool/dbconfig/20221026-134806-ladsgroup.json [13:48:09] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [13:48:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36542 and previous config saved to /var/cache/conftool/dbconfig/20221026-134813-root.json [13:48:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36543 and previous config saved to /var/cache/conftool/dbconfig/20221026-134815-root.json [13:48:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36544 and previous config saved to /var/cache/conftool/dbconfig/20221026-134817-root.json [13:48:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P36545 and previous config saved to /var/cache/conftool/dbconfig/20221026-134824-ladsgroup.json [13:49:50] !log restarting apache2 on lists.wikimedia.org to pick up curl security update [13:50:29] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[13:51:48] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [13:52:01] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [13:53:10] (03PS1) 10Btullis: Add dummy deployment users and tokens for spark-operator and spark [labs/private] - 10https://gerrit.wikimedia.org/r/849558 (https://phabricator.wikimedia.org/T321686) [13:53:32] (03PS18) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [13:53:56] (03CR) 10Jbond: "has been tested" [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [13:54:05] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [13:54:32] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [13:54:37] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:54:40] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4039.ulsfo.wmnet with OS buster [13:55:38] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (10Andrew) I'm not too worried about this particular host, but does it reflect an upcoming issue with all other cloudvirts, or at least all other cloudvirts on that rack? 
[13:56:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36546 and previous config saved to /var/cache/conftool/dbconfig/20221026-135641-root.json [13:56:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36547 and previous config saved to /var/cache/conftool/dbconfig/20221026-135646-root.json [13:56:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36548 and previous config saved to /var/cache/conftool/dbconfig/20221026-135650-root.json [13:56:58] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [13:57:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 1.813 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:57:21] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [13:57:35] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [13:57:47] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.505 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:57:50] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [13:58:03] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [13:58:31] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [13:58:55] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [13:59:20] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [13:59:28] (03CR) 10Marostegui: [C: 03+2] mariadb: Switch x1 to STATEMENT 
[puppet] - 10https://gerrit.wikimedia.org/r/849548 (https://phabricator.wikimedia.org/T318518) (owner: 10Marostegui) [13:59:48] (03PS2) 10Filippo Giunchedi: confd: cleanup stale errors [puppet] - 10https://gerrit.wikimedia.org/r/849539 (https://phabricator.wikimedia.org/T321678) [14:02:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36549 and previous config saved to /var/cache/conftool/dbconfig/20221026-140219-ladsgroup.json [14:02:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:02:26] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:02:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:02:38] (03CR) 10Vgutierrez: [C: 03+1] Add cookbook to restart pybal [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 (owner: 10Giuseppe Lavagetto) [14:02:41] (03PS1) 10Muehlenhoff: lists: Enable profile::auto_restarts::service for Apache [puppet] - 10https://gerrit.wikimedia.org/r/849561 (https://phabricator.wikimedia.org/T135991) [14:02:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:02:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:02:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T318950)', diff saved to https://phabricator.wikimedia.org/P36550 and previous config saved to /var/cache/conftool/dbconfig/20221026-140250-ladsgroup.json [14:03:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T321312)', diff saved to https://phabricator.wikimedia.org/P36551 and previous config saved to 
/var/cache/conftool/dbconfig/20221026-140312-ladsgroup.json [14:03:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36552 and previous config saved to /var/cache/conftool/dbconfig/20221026-140318-root.json [14:03:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36553 and previous config saved to /var/cache/conftool/dbconfig/20221026-140320-root.json [14:03:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36554 and previous config saved to /var/cache/conftool/dbconfig/20221026-140328-root.json [14:03:31] (03PS1) 10Elukey: coredns: add release and app name labels to Pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/849562 (https://phabricator.wikimedia.org/T321159) [14:03:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [14:03:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [14:03:50] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder) [14:03:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36555 and previous config saved to /var/cache/conftool/dbconfig/20221026-140351-ladsgroup.json [14:04:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1103.eqiad.wmnet with reason: Maintenance [14:05:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T318950)', diff saved to https://phabricator.wikimedia.org/P36556 and previous config saved to /var/cache/conftool/dbconfig/20221026-140503-ladsgroup.json [14:05:04] 
!log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1103.eqiad.wmnet with reason: Maintenance [14:05:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1103 (T321312)', diff saved to https://phabricator.wikimedia.org/P36557 and previous config saved to /var/cache/conftool/dbconfig/20221026-140510-ladsgroup.json [14:06:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36558 and previous config saved to /var/cache/conftool/dbconfig/20221026-140618-ladsgroup.json [14:07:01] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:07:09] (03PS2) 10Elukey: coredns: add standard labels to resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/849562 (https://phabricator.wikimedia.org/T321159) [14:08:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:08:46] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:09:17] 
10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) The other big issue is rebooting reliably - as currently set u... [14:11:46] !log disable interface et-1/0/2 on cr1-eqiad to bounce fpc 1 pic0 [14:11:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36559 and previous config saved to /var/cache/conftool/dbconfig/20221026-141146-root.json [14:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36560 and previous config saved to /var/cache/conftool/dbconfig/20221026-141151-root.json [14:11:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36561 and previous config saved to /var/cache/conftool/dbconfig/20221026-141155-root.json [14:12:44] (03CR) 10Hokwelum: [C: 03+1] "Ariel and I looked at this but we didn’t test it; rather, we just read it and also didn't use PCC to check!" 
[puppet] - 10https://gerrit.wikimedia.org/r/849192 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [14:13:15] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64810/IPv6: Idle - evpn_switches_eqiad, AS64810/IPv4: Idle - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1103 (T321312)', diff saved to https://phabricator.wikimedia.org/P36562 and previous config saved to /var/cache/conftool/dbconfig/20221026-141328-ladsgroup.json [14:13:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:13:53] (03CR) 10Muehlenhoff: role::idm Basic deployment of IDM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [14:14:24] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4039.ulsfo.wmnet with OS buster [14:16:25] (03PS6) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 [14:16:32] (03CR) 10Jbond: "tested" [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 (owner: 10Jbond) [14:16:41] (03CR) 10Xcollazo: "Ok, I think we are ready here. Please merge if you agree." 
[puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [14:17:31] (03PS3) 10Elukey: coredns: add standard labels to resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/849562 (https://phabricator.wikimedia.org/T321159) [14:17:55] (03PS4) 10Elukey: coredns: add standard labels to resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/849562 (https://phabricator.wikimedia.org/T321159) [14:18:05] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 243, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:18:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36563 and previous config saved to /var/cache/conftool/dbconfig/20221026-141823-root.json [14:18:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36564 and previous config saved to /var/cache/conftool/dbconfig/20221026-141824-root.json [14:18:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36565 and previous config saved to /var/cache/conftool/dbconfig/20221026-141833-root.json [14:19:31] (03CR) 10Hnowlan: [C: 03+2] helmfile.d: add thumbor configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:20:07] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:20:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to 
https://phabricator.wikimedia.org/P36566 and previous config saved to /var/cache/conftool/dbconfig/20221026-142010-ladsgroup.json [14:21:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P36567 and previous config saved to /var/cache/conftool/dbconfig/20221026-142125-ladsgroup.json [14:23:05] (03Merged) 10jenkins-bot: helmfile.d: add thumbor configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:25:25] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36568 and previous config saved to /var/cache/conftool/dbconfig/20221026-142651-root.json [14:26:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36569 and previous config saved to /var/cache/conftool/dbconfig/20221026-142656-root.json [14:26:59] (03PS1) 10Jbond: C:puppetmaster::scripts: add documentation [puppet] - 10https://gerrit.wikimedia.org/r/849586 [14:27:01] (03PS1) 10Jbond: C:puppetmaster: Add realm override parameter [puppet] - 10https://gerrit.wikimedia.org/r/849587 [14:27:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36570 and previous config saved to /var/cache/conftool/dbconfig/20221026-142700-root.json [14:27:06] (ConfdResourceFailed) firing: (46) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - 
https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:27:38] (03CR) 10ArielGlenn: rsync-via-primary.sh: replace labstore with clouddumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849193 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [14:27:56] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (10Papaul) Possible all cloudvirts in that rack. i think your guys were in the process of moving those nodes in dedicated cloud racks is it still doable? [14:28:13] (03CR) 10CI reject: [V: 04-1] C:puppetmaster: Add realm override parameter [puppet] - 10https://gerrit.wikimedia.org/r/849587 (owner: 10Jbond) [14:28:31] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:28:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1103', diff saved to https://phabricator.wikimedia.org/P36571 and previous config saved to /var/cache/conftool/dbconfig/20221026-142834-ladsgroup.json [14:28:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2003.codfw.wmnet [14:28:55] RECOVERY - Check systemd state on wcqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:11] (03PS1) 10Urbanecm: kswiki: Switch to wikitext mentor provider back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849588 (https://phabricator.wikimedia.org/T310905) [14:29:24] jouncebot: nowandnext [14:29:24] No deployments scheduled for the next 3 hour(s) and 30 minute(s) [14:29:25] In 3 hour(s) and 30 minute(s): Train log triage with 
CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221026T1800) [14:29:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4039.ulsfo.wmnet with OS buster [14:30:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849588 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [14:30:07] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37784/console" [puppet] - 10https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [14:30:11] volans: XioNoX: why is of all the report just the network report in Netbox is very slow [14:30:50] (03Merged) 10jenkins-bot: kswiki: Switch to wikitext mentor provider back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849588 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [14:31:07] papaul: there is some netbox slowness that arzhel and john were investigating, not smoking gun yet [14:31:12] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:849588|kswiki: Switch to wikitext mentor provider back (T310905)]] [14:31:18] T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905 [14:31:36] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:849588|kswiki: Switch to wikitext mentor provider back (T310905)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:32:06] (03PS2) 10Jbond: C:puppetmaster: Add realm override parameter [puppet] - 10https://gerrit.wikimedia.org/r/849587 [14:32:22] (03CR) 10Giuseppe Lavagetto: [C: 03+1] confd: cleanup stale errors [puppet] - 10https://gerrit.wikimedia.org/r/849539 (https://phabricator.wikimedia.org/T321678) (owner: 10Filippo Giunchedi) [14:32:34] !log jmm@cumin2002 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2003.codfw.wmnet [14:33:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2004.codfw.wmnet [14:33:51] (03CR) 10Jbond: [V: 03+2 C: 03+2] pcc: move cloud key to correct realm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849555 (owner: 10Jbond) [14:34:09] (03CR) 10Filippo Giunchedi: [C: 03+2] confd: cleanup stale errors [puppet] - 10https://gerrit.wikimedia.org/r/849539 (https://phabricator.wikimedia.org/T321678) (owner: 10Filippo Giunchedi) [14:34:13] (03PS3) 10Jbond: C:puppetmaster: Add realm override parameter [puppet] - 10https://gerrit.wikimedia.org/r/849587 [14:34:57] PROBLEM - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:13] (03CR) 10CI reject: [V: 04-1] C:puppetmaster: Add realm override parameter [puppet] - 10https://gerrit.wikimedia.org/r/849587 (owner: 10Jbond) [14:35:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P36573 and previous config saved to /var/cache/conftool/dbconfig/20221026-143516-ladsgroup.json [14:35:59] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:849588|kswiki: Switch to wikitext mentor provider back (T310905)]] (duration: 04m 47s) [14:36:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:36:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P36574 and previous config saved to /var/cache/conftool/dbconfig/20221026-143631-ladsgroup.json [14:37:18] (03PS4) 10Jbond: C:puppetmaster: Add realm override parameter [puppet] - 10https://gerrit.wikimedia.org/r/849587 [14:37:21] (03PS1) 10Filippo Giunchedi: confd: fixup tidy invocation [puppet] - 
10https://gerrit.wikimedia.org/r/849589 [14:37:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:37:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:37:36] ACKNOWLEDGEMENT - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service Brian_King incomplete data transfer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2004.codfw.wmnet [14:38:07] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [14:38:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:38:42] (03CR) 10CI reject: [V: 04-1] confd: fixup tidy invocation [puppet] - 10https://gerrit.wikimedia.org/r/849589 (owner: 10Filippo Giunchedi) [14:40:33] (03CR) 10Hokwelum: [C: 03+1] "Ariel and I looked at this and it looks good! Thank you, Andrew!" 
[puppet] - 10https://gerrit.wikimedia.org/r/849193 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [14:40:51] (03PS2) 10Filippo Giunchedi: confd: fixup tidy invocation [puppet] - 10https://gerrit.wikimedia.org/r/849589 [14:41:23] (03PS1) 10Hnowlan: kubernetes: add deployment_services entry for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/849591 (https://phabricator.wikimedia.org/T233196) [14:41:42] (03CR) 10CI reject: [V: 04-1] confd: fixup tidy invocation [puppet] - 10https://gerrit.wikimedia.org/r/849589 (owner: 10Filippo Giunchedi) [14:42:34] (03PS3) 10Filippo Giunchedi: confd: fixup tidy invocation [puppet] - 10https://gerrit.wikimedia.org/r/849589 [14:43:04] (03PS2) 10Hnowlan: kubernetes: add deployment_services entry for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/849591 (https://phabricator.wikimedia.org/T233196) [14:43:09] (03PS2) 10Stef Dunlap: Fixup development tooling for wider compatibility [deployment-charts] - 10https://gerrit.wikimedia.org/r/845680 [14:43:21] (03CR) 10Jbond: [C: 03+2] C:puppetmaster::scripts: add documentation [puppet] - 10https://gerrit.wikimedia.org/r/849586 (owner: 10Jbond) [14:43:23] (03CR) 10Jbond: [C: 03+2] C:puppetmaster: Add realm override parameter [puppet] - 10https://gerrit.wikimedia.org/r/849587 (owner: 10Jbond) [14:43:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1103', diff saved to https://phabricator.wikimedia.org/P36575 and previous config saved to /var/cache/conftool/dbconfig/20221026-144341-ladsgroup.json [14:43:56] (03CR) 10Ladsgroup: [C: 03+1] "LGTM if we are sure restart of apache doesn't lead to mailman services (specially mailman-web) going crazy." 
[puppet] - 10https://gerrit.wikimedia.org/r/849561 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:44:15] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01545 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:44:32] (03CR) 10Filippo Giunchedi: [C: 03+2] confd: fixup tidy invocation [puppet] - 10https://gerrit.wikimedia.org/r/849589 (owner: 10Filippo Giunchedi) [14:45:09] (03PS3) 10Stef Dunlap: Fixup development tooling for wider compatibility [deployment-charts] - 10https://gerrit.wikimedia.org/r/845680 [14:45:51] (03PS4) 10Stef Dunlap: Fixup development tooling for wider compatibility [deployment-charts] - 10https://gerrit.wikimedia.org/r/845680 [14:46:18] (03CR) 10Muehlenhoff: lists: Enable profile::auto_restarts::service for Apache (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849561 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:47:02] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/849063 (owner: 10Muehlenhoff) [14:47:29] (03CR) 10Ladsgroup: [C: 03+1] lists: Enable profile::auto_restarts::service for Apache (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849561 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:47:31] (03CR) 10Stef Dunlap: Fixup development tooling for wider compatibility (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/845680 (owner: 10Stef Dunlap) [14:47:45] (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:47:59] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T318950)', diff saved to https://phabricator.wikimedia.org/P36576 and previous config saved to /var/cache/conftool/dbconfig/20221026-145023-ladsgroup.json [14:50:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1185.eqiad.wmnet with reason: Maintenance [14:50:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1185.eqiad.wmnet with reason: Maintenance [14:50:29] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:50:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T318950)', diff saved to https://phabricator.wikimedia.org/P36577 and previous config saved to /var/cache/conftool/dbconfig/20221026-145033-ladsgroup.json [14:51:16] (03PS2) 10Volans: CORE_DATACENTERS: use the wmflib constant [cookbooks] - 10https://gerrit.wikimedia.org/r/849031 [14:51:31] (03CR) 10BBlack: Split confd file definitions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [14:51:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T318950)', diff saved to https://phabricator.wikimedia.org/P36578 and previous config saved to /var/cache/conftool/dbconfig/20221026-145138-ladsgroup.json [14:51:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2178.codfw.wmnet with reason: Maintenance [14:51:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2178.codfw.wmnet with reason: Maintenance [14:51:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T318950)', diff saved to https://phabricator.wikimedia.org/P36579 and previous config saved to 
/var/cache/conftool/dbconfig/20221026-145148-ladsgroup.json [14:51:58] (03CR) 10Jbond: [C: 03+2] hieradata pcc: add devtools puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/849556 (owner: 10Jelto) [14:52:06] (03PS2) 10Jbond: hieradata pcc: add devtools puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/849556 (owner: 10Jelto) [14:52:09] (03CR) 10Jbond: [V: 03+2] hieradata pcc: add devtools puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/849556 (owner: 10Jelto) [14:52:37] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 242, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:52:38] (03CR) 10Clément Goubert: [C: 03+1] kubernetes: add deployment_services entry for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/849591 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:52:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T318950)', diff saved to https://phabricator.wikimedia.org/P36580 and previous config saved to /var/cache/conftool/dbconfig/20221026-145246-ladsgroup.json [14:53:03] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:53:08] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy deployment users and tokens for spark-operator and spark [labs/private] - 10https://gerrit.wikimedia.org/r/849558 (https://phabricator.wikimedia.org/T321686) (owner: 10Btullis) [14:53:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T318950)', diff saved to https://phabricator.wikimedia.org/P36581 and previous config saved to /var/cache/conftool/dbconfig/20221026-145314-ladsgroup.json [14:53:25] (03CR) 10BBlack: single_backend mode for production varnishes (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [14:54:36] (03PS4) 10BBlack: Split confd file definitions [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) [14:54:38] (03PS2) 10BBlack: Remove confd_experiment_fqdn support [puppet] - 10https://gerrit.wikimedia.org/r/845713 (https://phabricator.wikimedia.org/T288106) [14:54:40] (03PS6) 10BBlack: single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) [14:54:42] (03PS6) 10BBlack: cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 [14:56:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4039.ulsfo.wmnet with reason: host reimage [14:56:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:57:12] (03CR) 10Jbond: [V: 03+2 C: 03+2] "i happened to be looking at this for something else so i have also added the following to the puppetmaster in the horizon project and manu" [puppet] - 10https://gerrit.wikimedia.org/r/849556 (owner: 10Jelto) [14:58:07] (03PS7) 10BBlack: cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 (https://phabricator.wikimedia.org/T288106) [14:58:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:58:42] (03CR) 10BBlack: cp4045: enable single_backend mode (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845651 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [14:58:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1103 (T321312)', diff saved to https://phabricator.wikimedia.org/P36584 and previous config saved to 
/var/cache/conftool/dbconfig/20221026-145848-ladsgroup.json [14:58:53] (03CR) 10Hnowlan: [C: 03+2] kubernetes: add deployment_services entry for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/849591 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:58:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1137.eqiad.wmnet with reason: Maintenance [14:59:13] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [14:59:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1137.eqiad.wmnet with reason: Maintenance [14:59:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1137 (T321312)', diff saved to https://phabricator.wikimedia.org/P36585 and previous config saved to /var/cache/conftool/dbconfig/20221026-145924-ladsgroup.json [14:59:26] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4039.ulsfo.wmnet with reason: host reimage [14:59:26] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [15:00:14] (03CR) 10Volans: [C: 03+2] CORE_DATACENTERS: use the wmflib constant [cookbooks] - 10https://gerrit.wikimedia.org/r/849031 (owner: 10Volans) [15:00:29] (03PS2) 10Volans: tox.ini: explain why there are old Python versions [cookbooks] - 10https://gerrit.wikimedia.org/r/849034 (https://phabricator.wikimedia.org/T289222) [15:00:40] (03CR) 10Volans: [C: 03+2] "just a comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/849034 (https://phabricator.wikimedia.org/T289222) (owner: 10Volans) [15:01:28] (03PS1) 10Jbond: R:swift::label_filesystem: jst check that any lable is on the disk [puppet] - 10https://gerrit.wikimedia.org/r/849595 (https://phabricator.wikimedia.org/T308677) [15:02:38] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37787/console" [puppet] - 
10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [15:02:57] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [15:03:25] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] Split confd file definitions [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [15:03:27] (03CR) 10Jelto: hieradata pcc: add devtools puppetmaster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849556 (owner: 10Jelto) [15:04:28] (03CR) 10Vgutierrez: [C: 03+1] single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [15:04:42] (03Merged) 10jenkins-bot: CORE_DATACENTERS: use the wmflib constant [cookbooks] - 10https://gerrit.wikimedia.org/r/849031 (owner: 10Volans) [15:04:44] (03Merged) 10jenkins-bot: tox.ini: explain why there are old Python versions [cookbooks] - 10https://gerrit.wikimedia.org/r/849034 (https://phabricator.wikimedia.org/T289222) (owner: 10Volans) [15:04:52] (03CR) 10Vgutierrez: [C: 03+1] cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [15:04:58] (03CR) 10BBlack: [C: 03+2] Split confd file definitions [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [15:05:08] (03CR) 10BBlack: [C: 03+2] Remove confd_experiment_fqdn support [puppet] - 10https://gerrit.wikimedia.org/r/845713 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [15:05:12] (03CR) 10BBlack: [C: 03+2] single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [15:05:45] (03CR) 10BBlack: [C: 03+2] cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 
(https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [15:07:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P36586 and previous config saved to /var/cache/conftool/dbconfig/20221026-150752-ladsgroup.json [15:08:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P36587 and previous config saved to /var/cache/conftool/dbconfig/20221026-150821-ladsgroup.json [15:10:55] (03CR) 10Volans: "replies inline (I didn't do a new pass)" [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [15:11:26] (03PS1) 10Volans: sre.hosts.provision: adjust for Dell R450 [cookbooks] - 10https://gerrit.wikimedia.org/r/849597 [15:12:12] (03PS1) 10BBlack: ulsfo: single_backend for all new cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/849598 (https://phabricator.wikimedia.org/T317244) [15:12:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T321312)', diff saved to https://phabricator.wikimedia.org/P36588 and previous config saved to /var/cache/conftool/dbconfig/20221026-151216-ladsgroup.json [15:13:17] (03CR) 10BBlack: [C: 03+2] ulsfo: single_backend for all new cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/849598 (https://phabricator.wikimedia.org/T317244) (owner: 10BBlack) [15:15:12] (03CR) 10CI reject: [V: 04-1] sre.hosts.provision: adjust for Dell R450 [cookbooks] - 10https://gerrit.wikimedia.org/r/849597 (owner: 10Volans) [15:15:34] (03PS1) 10Filippo Giunchedi: confd: brown paperbag fix for tidy [puppet] - 10https://gerrit.wikimedia.org/r/849599 (https://phabricator.wikimedia.org/T321678) [15:16:49] (03CR) 10BBlack: [C: 03+1] "LGTM! [other than the minor nit about a comment format that CI is bailing on]. Thank you!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/849597 (owner: 10Volans) [15:16:52] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37791/console" [puppet] - 10https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) (owner: 10Jelto) [15:16:54] I'm seeking a kind soul to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/849599 [15:17:02] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) > as currently set up, puppet is unhappy if the SSDs come up as anythi... [15:18:56] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005982 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:19:19] (03PS2) 10Volans: sre.hosts.provision: adjust for Dell R450 [cookbooks] - 10https://gerrit.wikimedia.org/r/849597 [15:20:32] (03CR) 10BBlack: [C: 03+1] "Looks right to me!" [puppet] - 10https://gerrit.wikimedia.org/r/849599 (https://phabricator.wikimedia.org/T321678) (owner: 10Filippo Giunchedi) [15:20:46] (03CR) 10Ssingh: [C: 03+1] "Looks fine, thanks a lot for working on this!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/849597 (owner: 10Volans) [15:21:06] thank you bblack, appreciate it [15:21:09] (03CR) 10Filippo Giunchedi: [C: 03+2] confd: brown paperbag fix for tidy [puppet] - 10https://gerrit.wikimedia.org/r/849599 (https://phabricator.wikimedia.org/T321678) (owner: 10Filippo Giunchedi) [15:22:42] (03CR) 10jenkins-bot: sre.hosts.provision: adjust for Dell R450 [cookbooks] - 10https://gerrit.wikimedia.org/r/849597 (owner: 10Volans) [15:22:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P36589 and previous config saved to /var/cache/conftool/dbconfig/20221026-152259-ladsgroup.json [15:23:19] 10SRE, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney) Thanks, looking at the config, and reading some docs, we have it set up so it should not have any impact: ` c... [15:23:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P36590 and previous config saved to /var/cache/conftool/dbconfig/20221026-152327-ladsgroup.json [15:23:59] (03PS3) 10Volans: sre.hosts.provision: adjust for Dell R450 [cookbooks] - 10https://gerrit.wikimedia.org/r/849597 [15:24:20] (03CR) 10Volans: "doh, failed to commit locally sent PS2 without changes... 
fixed in PS3" [cookbooks] - 10https://gerrit.wikimedia.org/r/849597 (owner: 10Volans) [15:25:06] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 5.192 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:26:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:26:39] (03PS1) 10Filippo Giunchedi: confd: don't backup tidied files [puppet] - 10https://gerrit.wikimedia.org/r/849600 (https://phabricator.wikimedia.org/T321678) [15:27:06] (ConfdResourceFailed) firing: (46) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:27:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P36591 and previous config saved to /var/cache/conftool/dbconfig/20221026-152724-ladsgroup.json [15:28:08] (03CR) 10Ottomata: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [15:28:19] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: adjust for Dell R450 [cookbooks] - 10https://gerrit.wikimedia.org/r/849597 (owner: 10Volans) [15:28:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4039.ulsfo.wmnet with OS buster [15:28:47] (03CR) 10Ottomata: [C: 03+1] Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. 
[puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [15:29:35] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [15:31:25] (03CR) 10Muehlenhoff: lists: Enable profile::auto_restarts::service for Apache (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849561 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:32:04] (03Merged) 10jenkins-bot: sre.hosts.provision: adjust for Dell R450 [cookbooks] - 10https://gerrit.wikimedia.org/r/849597 (owner: 10Volans) [15:32:06] (ConfdResourceFailed) firing: (46) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:32:45] <_joe_> uhm what's going on? [15:33:22] 10SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (10Urbanecm) [15:34:12] 10SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (10Ladsgroup) It's starting to swap, I'm increasing the memory. We need to restart the VM. [15:35:49] (03CR) 10BBlack: "Just noticed this patch now, sorry! 
I have already gotten rid of this variable in an extremely similar patch as part of a series working t" [puppet] - 10https://gerrit.wikimedia.org/r/817298 (https://phabricator.wikimedia.org/T288106) (owner: 10Jbond) [15:36:02] (03Abandoned) 10BBlack: P:cache::varnish::frontend: Drop confd_experiment_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/817298 (https://phabricator.wikimedia.org/T288106) (owner: 10Jbond) [15:37:01] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:37:09] !log ladsgroup@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM lists1001.wikimedia.org [15:37:46] !log emergency reboot of lists1001 (T321703) [15:38:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T318950)', diff saved to https://phabricator.wikimedia.org/P36592 and previous config saved to /var/cache/conftool/dbconfig/20221026-153805-ladsgroup.json [15:38:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1200.eqiad.wmnet with reason: Maintenance [15:38:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1200.eqiad.wmnet with reason: Maintenance [15:38:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T318950)', diff saved to https://phabricator.wikimedia.org/P36593 and previous config saved to /var/cache/conftool/dbconfig/20221026-153816-ladsgroup.json [15:38:19] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:38:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T318950)', diff saved to https://phabricator.wikimedia.org/P36594 and previous config saved to /var/cache/conftool/dbconfig/20221026-153834-ladsgroup.json [15:39:15] (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:40:00] (03PS1) 10Ssingh: cp4040: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/849602 (https://phabricator.wikimedia.org/T317244) [15:40:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T318950)', diff saved to https://phabricator.wikimedia.org/P36595 and previous config saved to /var/cache/conftool/dbconfig/20221026-154029-ladsgroup.json [15:40:56] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=netbox,name=codfw [15:41:13] (03CR) 10Ssingh: [C: 03+2] cp4040: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/849602 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [15:41:58] (03CR) 10Jbond: role::idm Basic deployment of IDM (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [15:42:06] (ConfdResourceFailed) firing: (46) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:42:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P36596 and 
previous config saved to /var/cache/conftool/dbconfig/20221026-154231-ladsgroup.json [15:43:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4040.ulsfo.wmnet with OS buster [15:43:12] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4040.ulsfo.wmnet with OS buster [15:43:19] (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:44:41] (03CR) 10Jbond: "when running this cookbook there were errors with icinga. this asked if i wanted to exit which i did, which caused the cookbook to skip" [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 (owner: 10Jbond) [15:45:04] (03CR) 10ArielGlenn: "Ran PCC on snapshot1008, which handles all the "misc" dump jobs, to pick up the rest of the changes: https://puppet-compiler.wmflabs.org/p" [puppet] - 10https://gerrit.wikimedia.org/r/849088 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [15:46:33] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cp4046.mgmt.ulsfo.wmnet with reboot policy FORCED [15:47:05] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37795/console" [puppet] - 10https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [15:47:06] (ConfdResourceFailed) resolved: (6) confd resource _var_lib_gdnsd_discovery-netbox.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - 
https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:47:08] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4046.mgmt.ulsfo.wmnet with reboot policy FORCED [15:47:27] (03CR) 10Muehlenhoff: role::idm Basic deployment of IDM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [15:47:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1003.eqiad.wmnet [15:50:04] 10SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (10Ladsgroup) That basically meant the VM is gone for possibly ten minutes. sigh. [15:51:54] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cp4041.mgmt.ulsfo.wmnet with reboot policy FORCED [15:52:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1003.eqiad.wmnet [15:54:09] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4040.ulsfo.wmnet with OS buster [15:54:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet [15:54:22] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4040.ulsfo.wmnet with OS buster executed with errors: - cp4040 (**FA... 
[15:55:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P36597 and previous config saved to /var/cache/conftool/dbconfig/20221026-155536-ladsgroup.json [15:56:31] (03PS1) 10Hnowlan: thumbor: disable TLS for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/849605 (https://phabricator.wikimedia.org/T233196) [15:57:28] (03CR) 10Clément Goubert: [C: 03+1] thumbor: disable TLS for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/849605 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [15:57:30] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED [15:57:33] PROBLEM - Host lists1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:57:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T321312)', diff saved to https://phabricator.wikimedia.org/P36598 and previous config saved to /var/cache/conftool/dbconfig/20221026-155738-ladsgroup.json [15:58:05] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED [15:58:07] (03PS2) 10Hnowlan: thumbor: disable TLS for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/849605 (https://phabricator.wikimedia.org/T233196) [15:58:44] (03PS1) 10Urbanecm: Revert "kswiki: Switch to wikitext mentor provider back" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849626 (https://phabricator.wikimedia.org/T310905) [15:58:51] jouncebot: nowandnext [15:58:51] No deployments scheduled for the next 2 hour(s) and 1 minute(s) [15:58:51] In 2 hour(s) and 1 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221026T1800) [15:58:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849626 
(https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [15:58:59] (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:59:37] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4041.mgmt.ulsfo.wmnet with reboot policy FORCED [15:59:43] (03Merged) 10jenkins-bot: Revert "kswiki: Switch to wikitext mentor provider back" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849626 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [16:00:09] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:849626|Revert "kswiki: Switch to wikitext mentor provider back" (T310905)]] [16:00:10] (03CR) 10Clément Goubert: [C: 03+1] thumbor: disable TLS for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/849605 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:00:10] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [16:00:14] T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905 [16:00:33] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:849626|Revert "kswiki: Switch to wikitext mentor provider back" (T310905)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [16:00:49] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4040 [16:00:51] !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4040 [16:01:21] PROBLEM - Check systemd state on kubernetes2022 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:58] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10Appledora) Hey @Dzahn , for now I have updated my email with the contractor email. Hope this helps! [16:02:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4040.ulsfo.wmnet with OS buster [16:03:01] !log ladsgroup@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM lists1001.wikimedia.org [16:03:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet [16:03:29] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [16:03:59] (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:04:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:04:26] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:849626|Revert "kswiki: Switch to wikitext mentor provider back" (T310905)]] (duration: 04m 16s) [16:04:27] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) >>! In T308677#8339622, @jbond wrote: >> luckily puppet doesn'... 
[16:04:33] RECOVERY - Check systemd state on wcqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:05:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:05:52] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) ...so I think your change might at least get us back to system... [16:06:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:07:07] (03CR) 10MVernon: [C: 03+1] "I think this might well help with the unreliable booting issue; I don't know if checking the label looks at least vaguely plausible (say `" [puppet] - 10https://gerrit.wikimedia.org/r/849595 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [16:07:21] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37797/console" [puppet] - 10https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [16:07:34] (03CR) 10Hnowlan: [C: 03+2] thumbor: disable TLS for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/849605 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:08:59] (KubernetesAPILatency) firing: (16) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:10:36] (03PS1) 10AikoChou: ml-services: add revert-risk-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/849627 (https://phabricator.wikimedia.org/T321594) [16:10:41] (03PS2) 10Jdlrobson: WIP: Fix 
remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849175 (https://phabricator.wikimedia.org/T319223) [16:10:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P36599 and previous config saved to /var/cache/conftool/dbconfig/20221026-161042-ladsgroup.json [16:10:55] (03CR) 10CI reject: [V: 04-1] WIP: Fix remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849175 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [16:11:02] (03Merged) 10jenkins-bot: thumbor: disable TLS for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/849605 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:11:04] !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM lists1001.wikimedia.org [16:11:05] !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM lists1001.wikimedia.org [16:11:13] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4040.ulsfo.wmnet with OS buster [16:11:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:13:12] (03CR) 10CI reject: [V: 04-1] ml-services: add revert-risk-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/849627 (https://phabricator.wikimedia.org/T321594) (owner: 10AikoChou) [16:14:01] RECOVERY - Host lists1001 is UP: PING OK - Packet loss = 0%, RTA = 5.29 ms [16:15:23] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:15:39] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:16:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:17:38] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10BBlack) a:03BBlack Updates! Since this ticket was last active, there's been progress on various fronts with th... [16:17:57] 10SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (10Ladsgroup) It's back online but quite slow. It can be something with its databases. 
[16:22:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [16:22:59] (03PS2) 10AikoChou: ml-services: add revert-risk-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/849627 (https://phabricator.wikimedia.org/T321594) [16:23:09] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [16:23:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T321312)', diff saved to https://phabricator.wikimedia.org/P36600 and previous config saved to /var/cache/conftool/dbconfig/20221026-162316-ladsgroup.json [16:24:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:24:37] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:24:55] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:25:13] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2022 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:25:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T318950)', diff saved to https://phabricator.wikimedia.org/P36601 and previous config saved to /var/cache/conftool/dbconfig/20221026-162549-ladsgroup.json [16:25:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:25:55] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [16:26:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:27:37] RECOVERY - Check systemd state on kubernetes2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:37] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:29:19] (03CR) 10Btullis: [C: 03+2] Revert "Add a postgres user with an IPv6 RFC 4193 host match" [puppet] - 10https://gerrit.wikimedia.org/r/849514 (owner: 10Btullis) [16:29:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T321312)', diff saved to https://phabricator.wikimedia.org/P36602 and previous config saved to /var/cache/conftool/dbconfig/20221026-162948-ladsgroup.json [16:30:33] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.872 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:30:53] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:32:08] (03CR) 10BCornwall: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall) [16:32:46] 10SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (10Ladsgroup) ` [Wed Oct 26 16:16:13.798383 2022] [mpm_event:error] [pid 507:tid 140022112863360] AH10159: server is within MinSpareThreads of MaxRequestWorkers, cons... [16:33:52] 10SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (10Urbanecm) >>! In T321703#8346547, @Ladsgroup wrote: > It's back online but quite slow. It can be something with its databases. It's slow, but still somewhat faste... [16:43:19] 10SRE, 10Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (10BCornwall) Seems reasonable to me. It looks like the alert fired as you demonstrated. [16:43:52] (03CR) 10David Caro: wmcs.create_instance_with_prefix: Add a sec group default (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 (owner: 10David Caro) [16:44:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P36603 and previous config saved to /var/cache/conftool/dbconfig/20221026-164455-ladsgroup.json [16:45:39] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T321572 (10wiki_willy) a:03Cmjohnson [16:47:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10wiki_willy) a:03Cmjohnson [16:54:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4040.ulsfo.wmnet with OS buster [16:55:43] (03PS1) 10Andrew Bogott: Cinder policy: allow volume:get_all_transfers for all users [puppet] - 10https://gerrit.wikimedia.org/r/849629 [16:56:15] RECOVERY - Check whether ferm is active by checking 
the default input chain on kubernetes2022 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:56:24] (03CR) 10CI reject: [V: 04-1] Cinder policy: allow volume:get_all_transfers for all users [puppet] - 10https://gerrit.wikimedia.org/r/849629 (owner: 10Andrew Bogott) [17:00:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P36604 and previous config saved to /var/cache/conftool/dbconfig/20221026-170001-ladsgroup.json [17:01:57] !log hashar@deploy1002 Started deploy [releng/phatality@d8dfa72]: Update Phatality on codfw for OpenSearch Dashboard 2.2.0 # T304440 [17:02:02] T304440: Test and upgrade OpenSearch to 2.2.0 - https://phabricator.wikimedia.org/T304440 [17:02:24] !log hashar@deploy1002 Finished deploy [releng/phatality@d8dfa72]: Update Phatality on codfw for OpenSearch Dashboard 2.2.0 # T304440 (duration: 00m 27s) [17:05:05] (03PS1) 10Andrew Bogott: Cinder policy: allow volume:get_all_transfers for all users [puppet] - 10https://gerrit.wikimedia.org/r/849630 [17:14:24] (03PS1) 10Hashar: opensearch: make upgrade-phatality.sh stricter [puppet] - 10https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) [17:15:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T321312)', diff saved to https://phabricator.wikimedia.org/P36605 and previous config saved to /var/cache/conftool/dbconfig/20221026-171508-ladsgroup.json [17:15:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [17:15:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [17:15:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T321312)', diff saved to https://phabricator.wikimedia.org/P36606 and previous config 
saved to /var/cache/conftool/dbconfig/20221026-171534-ladsgroup.json [17:15:39] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [17:15:40] (03CR) 10Hashar: opensearch: make upgrade-phatality.sh stricter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: 10Hashar) [17:20:16] (03PS5) 10Ottomata: Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129) [17:20:44] (03CR) 10Ahmon Dancy: [C: 03+1] opensearch: make upgrade-phatality.sh stricter [puppet] - 10https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: 10Hashar) [17:20:47] (03CR) 10CI reject: [V: 04-1] Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [17:20:49] (03Abandoned) 10Andrew Bogott: Cinder policy: allow volume:get_all_transfers for all users [puppet] - 10https://gerrit.wikimedia.org/r/849629 (owner: 10Andrew Bogott) [17:20:55] (03CR) 10Andrew Bogott: [C: 03+2] Cinder policy: allow volume:get_all_transfers for all users [puppet] - 10https://gerrit.wikimedia.org/r/849630 (owner: 10Andrew Bogott) [17:21:31] (03CR) 10Andrew Bogott: [C: 03+2] rsync-via-primary.sh: replace labstore with clouddumps [puppet] - 10https://gerrit.wikimedia.org/r/849193 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [17:21:38] (03CR) 10Andrew Bogott: [C: 03+2] Dumps: remove a bunch of references to labstore1006 and 
labstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/849192 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [17:21:47] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [17:21:58] (03PS2) 10Andrew Bogott: rsync-via-primary.sh: replace labstore with clouddumps [puppet] - 10https://gerrit.wikimedia.org/r/849193 (https://phabricator.wikimedia.org/T309346) [17:21:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T321312)', diff saved to https://phabricator.wikimedia.org/P36607 and previous config saved to /var/cache/conftool/dbconfig/20221026-172159-ladsgroup.json [17:28:39] (03PS1) 10BBlack: varnish: remove unused definition of $fe_mem_gb [puppet] - 10https://gerrit.wikimedia.org/r/849632 [17:28:41] (03PS1) 10BBlack: [WIP] varnish: increase fe cache memory utilization [puppet] - 10https://gerrit.wikimedia.org/r/849633 [17:29:55] !log hashar@deploy1002 Started deploy [releng/phatality@d8dfa72]: (no justification provided) [17:30:07] !log hashar@deploy1002 Finished deploy [releng/phatality@d8dfa72]: (no justification provided) (duration: 00m 12s) [17:31:16] !log hashar@deploy1002 Started deploy [releng/phatality@d8dfa72]: (no justification provided) [17:31:30] !log hashar@deploy1002 Finished deploy [releng/phatality@d8dfa72]: (no justification provided) (duration: 00m 13s) [17:31:49] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4040.ulsfo.wmnet with OS buster [17:32:01] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox network report failing - timeout error getting connected_endpoint prefix - 
https://phabricator.wikimedia.org/T321704 (10cmooney) a:03cmooney [17:32:08] (03CR) 10CI reject: [V: 04-1] Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [17:35:07] !log hashar@deploy1002 Started deploy [releng/phatality@d8dfa72]: (no justification provided) [17:35:20] !log hashar@deploy1002 Finished deploy [releng/phatality@d8dfa72]: (no justification provided) (duration: 00m 12s) [17:37:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P36608 and previous config saved to /var/cache/conftool/dbconfig/20221026-173705-ladsgroup.json [17:39:36] (03CR) 10BBlack: [C: 04-1] "Looks like it's on a good track, couple of technical fixups in comments" [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall) [17:42:29] (03PS1) 10Ssingh: Revert "cp4040: update site.pp and related configs for cp role" [puppet] - 10https://gerrit.wikimedia.org/r/849611 [17:42:47] (03CR) 10BBlack: [C: 04-1] varnish: Conditionally set WMF-Last-Access cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall) [17:43:38] (03CR) 10Ssingh: [C: 03+2] Revert "cp4040: update site.pp and related configs for cp role" [puppet] - 10https://gerrit.wikimedia.org/r/849611 (owner: 10Ssingh) [17:48:06] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox network report failing - timeout error getting connected_endpoint prefix - https://phabricator.wikimedia.org/T321704 (10cmooney) Hmm... so I went back a moment ago to look at this when I got some time, and of course the report has re-run and completed ok.... 
[17:48:52] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10phaultfinder) [17:49:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:50:21] (03PS1) 10Ssingh: cp4046: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/849634 (https://phabricator.wikimedia.org/T317244) [17:51:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:51:33] (03CR) 10Hokwelum: [C: 03+1] "We also ran PCC on more dumpsdata hosts (https://puppet-compiler.wmflabs.org/pcc-worker1002/37798/), After looking at all PCC changes rela" [puppet] - 10https://gerrit.wikimedia.org/r/849088 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [17:52:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P36609 and previous config saved to /var/cache/conftool/dbconfig/20221026-175212-ladsgroup.json [17:52:14] (03CR) 10Ssingh: [C: 03+2] cp4046: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/849634 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [17:54:40] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS buster [18:00:42] !log dbmaint on s1@eqiad (T321562) [18:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:48] T321562: db1154 is not coming back after restart - https://phabricator.wikimedia.org/T321562 [18:00:56] !log dbmaint on s3@eqiad (T321562) [18:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:02] !log dbmaint on s5@eqiad (T321562) [18:01:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:10] !log dbmaint on s8@eqiad (T321562) [18:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:27] 10SRE, 10ops-eqiad, 10DBA: db1154 is not coming back after restart - https://phabricator.wikimedia.org/T321562 (10Ladsgroup) To make it show up in https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance [18:02:55] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:03:51] PROBLEM - Check systemd state on ms-be1048 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:04:05] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:07:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T321312)', diff saved to https://phabricator.wikimedia.org/P36610 and previous config saved to /var/cache/conftool/dbconfig/20221026-180718-ladsgroup.json [18:07:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [18:07:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [18:07:43] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Depooling db1166 (T321312)', diff saved to https://phabricator.wikimedia.org/P36611 and previous config saved to /var/cache/conftool/dbconfig/20221026-180742-ladsgroup.json [18:09:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:09:59] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4046.ulsfo.wmnet with OS buster [18:14:09] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1048 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:20:11] (03PS2) 10Andrew Bogott: Dumps: remove a bunch of references to labstore1006 and labstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/849192 (https://phabricator.wikimedia.org/T309346) [18:28:29] PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:28:59] (KubernetesAPILatency) firing: (17) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:30:15] RECOVERY - Check systemd state on ms-be1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS buster [18:32:01] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - 
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:32:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:33:36] (03PS2) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844073 (https://phabricator.wikimedia.org/T314318) [18:33:44] (03PS3) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844073 (https://phabricator.wikimedia.org/T314318) [18:41:51] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4046.ulsfo.wmnet with OS buster [18:42:16] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:42:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:45:09] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1048 is OK: OK ferm input default 
policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:45:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS buster [18:49:03] (03PS1) 10Ssingh: aptrepo: add trafficserver9 to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/849640 (https://phabricator.wikimedia.org/T321309) [19:00:14] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4046.ulsfo.wmnet with OS buster [19:07:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321312)', diff saved to https://phabricator.wikimedia.org/P36612 and previous config saved to /var/cache/conftool/dbconfig/20221026-190758-ladsgroup.json [19:08:31] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS buster [19:08:38] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4046.ulsfo.wmnet with OS buster [19:10:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS buster [19:10:43] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4046.ulsfo.wmnet with OS buster [19:13:12] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS buster [19:16:36] (03PS1) 10Ssingh: Release 6.0.10-1wm2 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/849644 (https://phabricator.wikimedia.org/T321309) [19:16:53] (03PS2) 10Kosta Harlan: [labs] GrowthExperiments: Beta wikis to use NewImpact module by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844046 (https://phabricator.wikimedia.org/T311299) [19:17:00] (03PS3) 10Kosta Harlan: [labs] GrowthExperiments: Beta wikis to use NewImpact module by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844046 (https://phabricator.wikimedia.org/T311299) [19:23:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1166', diff saved to https://phabricator.wikimedia.org/P36613 and previous config saved to /var/cache/conftool/dbconfig/20221026-192305-ladsgroup.json [19:31:46] (03PS1) 10Ssingh: Release 9.1.3-1wm3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/849646 (https://phabricator.wikimedia.org/T321309) [19:38:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P36614 and previous config saved to /var/cache/conftool/dbconfig/20221026-193811-ladsgroup.json [19:39:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4046.ulsfo.wmnet with reason: host reimage [19:39:09] (03CR) 10CI reject: [V: 04-1] Release 6.0.10-1wm2 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/849644 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:42:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4046.ulsfo.wmnet with reason: host reimage [19:47:15] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:50:08] 10SRE, 10ops-eqiad, 10DBA: db1154 is not coming back after restart - https://phabricator.wikimedia.org/T321562 (10wiki_willy) a:03Jclark-ctr [19:50:55] 10SRE, 10ops-eqiad, 10DBA: db1154 is not coming back after restart - https://phabricator.wikimedia.org/T321562 (10Jclark-ctr) Will stop in tonight to take a look at this server [19:53:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321312)', diff saved to https://phabricator.wikimedia.org/P36615 and previous config saved to /var/cache/conftool/dbconfig/20221026-195318-ladsgroup.json [19:53:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:53:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:53:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T321312)', diff saved to https://phabricator.wikimedia.org/P36616 and previous config saved to /var/cache/conftool/dbconfig/20221026-195342-ladsgroup.json [19:57:43] (03CR) 10CI reject: [V: 04-1] Release 9.1.3-1wm3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/849646 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [20:00:05] RoanKattouw, Urbanecm, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221026T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:00:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T321312)', diff saved to https://phabricator.wikimedia.org/P36617 and previous config saved to /var/cache/conftool/dbconfig/20221026-200009-ladsgroup.json [20:00:10] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [20:05:32] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:05:46] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox network report failing - timeout error getting connected_endpoint prefix - https://phabricator.wikimedia.org/T321704 (10Volans) This seems related to the Netbox slowness that we've seen recently that @ayounsi was looking at, but no smoking gun was found s... 
[20:05:47] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4046.ulsfo.wmnet with OS buster [20:08:12] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:15:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P36618 and previous config saved to /var/cache/conftool/dbconfig/20221026-201516-ladsgroup.json [20:24:18] (03CR) 10Volans: openstack: make domain-aware (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/845004 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [20:30:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P36619 and previous config saved to /var/cache/conftool/dbconfig/20221026-203022-ladsgroup.json [20:30:24] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:31:02] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:31:23] (03PS3) 10Jdlrobson: WIP: Fix remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849175 (https://phabricator.wikimedia.org/T319223) [20:34:22] (03CR) 10RLazarus: [C: 03+1] "Thanks!" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/843574 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [20:39:09] jouncebot: nowandnext [20:39:10] For the next 0 hour(s) and 20 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221026T2000) [20:39:10] In 9 hour(s) and 20 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T0600) [20:39:26] since there is nothing happening in B&C, I'll ship a secpatch [20:42:22] !log Deploying security patch for T321733 [20:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T321312)', diff saved to https://phabricator.wikimedia.org/P36620 and previous config saved to /var/cache/conftool/dbconfig/20221026-204529-ladsgroup.json [20:45:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:45:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [20:45:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [20:45:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T321312)', diff saved to https://phabricator.wikimedia.org/P36621 and previous config saved to /var/cache/conftool/dbconfig/20221026-204553-ladsgroup.json [20:46:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:46:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:46:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:50:28] * urbanecm done with the security deployment [20:52:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 
(T321312)', diff saved to https://phabricator.wikimedia.org/P36622 and previous config saved to /var/cache/conftool/dbconfig/20221026-205218-ladsgroup.json [20:56:11] (03CR) 10RLazarus: "I like the new separated structure of slo_definitions.json, but can we go the opposite way with the UI -- that is, keep one dashboard per " [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/849131 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [21:00:49] (03CR) 10Volans: [C: 04-1] "Changing my vote to -1 not because of issues with the code but because with the counter-proposal in https://phabricator.wikimedia.org/T320" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/849495 (https://phabricator.wikimedia.org/T320721) (owner: 10Filippo Giunchedi) [21:07:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P36623 and previous config saved to /var/cache/conftool/dbconfig/20221026-210724-ladsgroup.json [21:08:13] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [21:15:52] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:16:15] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [21:22:01] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:22:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P36624 and previous config saved to /var/cache/conftool/dbconfig/20221026-212230-ladsgroup.json [21:23:36] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [21:27:54] PROBLEM - Host db1154.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:59] 10SRE, 10User-MoritzMuehlenhoff: Investigate use of hp-asrd on HPE servers - https://phabricator.wikimedia.org/T221939 (10bd808) [21:28:42] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:28:53] 10SRE, 
10cloud-services-team: labpuppetmaster logs 'cannot collect exported resources without storeconfigs being set' - https://phabricator.wikimedia.org/T221115 (10bd808) 05Open→03Invalid [21:37:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T321312)', diff saved to https://phabricator.wikimedia.org/P36625 and previous config saved to /var/cache/conftool/dbconfig/20221026-213737-ladsgroup.json [21:37:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [21:37:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [21:38:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T321312)', diff saved to https://phabricator.wikimedia.org/P36626 and previous config saved to /var/cache/conftool/dbconfig/20221026-213801-ladsgroup.json [21:44:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T321312)', diff saved to https://phabricator.wikimedia.org/P36627 and previous config saved to /var/cache/conftool/dbconfig/20221026-214412-ladsgroup.json [21:46:20] RECOVERY - Host db1154.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [21:49:12] (03PS22) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [21:49:37] (03CR) 10CI reject: [V: 04-1] Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [21:57:44] RECOVERY - Host db1154 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [21:57:52] PROBLEM - mysqld processes on db1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [21:57:52] PROBLEM - MariaDB Replica 
Lag: s3 on db1154 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:57:52] PROBLEM - MariaDB Replica IO: s3 on db1154 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:57:52] PROBLEM - MariaDB read only s3 on db1154 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [21:57:52] PROBLEM - MariaDB read only s5 on db1154 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [21:58:00] PROBLEM - MariaDB Replica Lag: s8 on db1154 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:58:00] PROBLEM - MariaDB Replica SQL: s8 on db1154 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:58:00] PROBLEM - MariaDB Replica IO: s5 on db1154 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:58:00] PROBLEM - MariaDB Replica IO: s8 on db1154 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:58:12] PROBLEM - MariaDB Replica SQL: s1 on db1154 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:58:36] PROBLEM - MariaDB Replica Lag: s1 on db1154 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:58:50] PROBLEM - MariaDB Replica SQL: s3 on db1154 is CRITICAL: CRITICAL slave_sql_state could not connect 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:59:00] PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:59:00] PROBLEM - Check systemd state on db1154 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:02] PROBLEM - puppet last run on db1154 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:59:06] PROBLEM - MariaDB read only s1 on db1154 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [21:59:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P36628 and previous config saved to /var/cache/conftool/dbconfig/20221026-215919-ladsgroup.json [21:59:24] PROBLEM - MariaDB Replica Lag: s5 on db1154 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:59:40] PROBLEM - MariaDB Replica IO: s1 on db1154 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:59:42] PROBLEM - MariaDB read only s8 on db1154 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [22:00:51] 10SRE, 10ops-eqiad, 10DBA: db1154 is not coming back after restart - https://phabricator.wikimedia.org/T321562 (10Jclark-ctr) Server was in a bootloop pulled server down to minimum configuration. After added hardware back preformed hardware Test passed. 
No errors present at this time [22:01:02] RECOVERY - Check systemd state on db1154 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:05:04] RECOVERY - puppet last run on db1154 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:07:41] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10Dzahn) @Appledora Thank you! I'll make sure to get this resolved soon. [22:14:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P36629 and previous config saved to /var/cache/conftool/dbconfig/20221026-221426-ladsgroup.json [22:14:35] (03PS23) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [22:14:45] (03CR) 10CI reject: [V: 04-1] Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [22:15:51] (03CR) 10Dzahn: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) (owner: 10Jelto) [22:17:23] (03CR) 10Dzahn: [C: 03+1] "impressed you could mix both cloud and prod hosts in the same compiler run, I had been under the impression (at some point) that did not w" [puppet] - 10https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) (owner: 10Jelto) [22:29:14] (KubernetesAPILatency) firing: (17) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:29:33] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T321312)', diff saved to https://phabricator.wikimedia.org/P36630 and previous config saved to /var/cache/conftool/dbconfig/20221026-222932-ladsgroup.json [22:29:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [22:29:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [22:29:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T321312)', diff saved to https://phabricator.wikimedia.org/P36631 and previous config saved to /var/cache/conftool/dbconfig/20221026-222956-ladsgroup.json [22:32:36] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [22:34:38] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [22:35:05] (03PS1) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) [22:35:15] (03CR) 10CI reject: [V: 04-1] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [22:35:24] (03PS2) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) [22:36:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T321312)', diff saved to https://phabricator.wikimedia.org/P36632 and previous config saved to /var/cache/conftool/dbconfig/20221026-223617-ladsgroup.json [22:37:37] (03CR) 10CI reject: [V: 04-1] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [22:37:45] (03PS1) 10Ssingh: cp4048: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/849702 (https://phabricator.wikimedia.org/T317244) [22:38:45] (03CR) 10Ssingh: [C: 03+2] cp4048: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/849702 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [22:39:15] andrewbogott: OK to merge your change? 
[22:41:00] (03CR) 10Dzahn: "ok, so then let's try the interpolation function (lookup in Hiera itself)" [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [22:45:08] (03PS3) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) [22:45:18] (03CR) 10CI reject: [V: 04-1] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [22:46:44] RECOVERY - MariaDB Replica IO: s1 on db1154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:46:46] RECOVERY - MariaDB read only s8 on db1154 is OK: Version 10.4.26-MariaDB-log, Uptime 16s, read_only: True, event_scheduler: True, 11.74 QPS, connection latency: 0.004502s, query latency: 0.000447s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [22:46:56] RECOVERY - mysqld processes on db1154 is OK: PROCS OK: 4 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [22:46:56] RECOVERY - MariaDB Replica IO: s3 on db1154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:46:58] RECOVERY - MariaDB read only s5 on db1154 is OK: Version 10.4.26-MariaDB-log, Uptime 32s, read_only: True, event_scheduler: True, 2654.82 QPS, connection latency: 0.005773s, query latency: 0.000582s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [22:46:58] RECOVERY - MariaDB read only s3 on db1154 is OK: Version 10.4.26-MariaDB-log, Uptime 37s, read_only: True, event_scheduler: True, 2286.01 QPS, connection latency: 0.004550s, query latency: 0.000331s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [22:47:04] RECOVERY - MariaDB Replica IO: s5 on clouddb1020 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:47:04] RECOVERY - MariaDB Replica SQL: s8 on db1154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:47:04] RECOVERY - MariaDB Replica IO: s5 on db1154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:47:04] RECOVERY - MariaDB Replica IO: s8 on db1154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:47:16] RECOVERY - MariaDB Replica SQL: s1 on db1154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:47:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [22:47:26] RECOVERY - MariaDB Replica IO: s5 on clouddb1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:47:42] RECOVERY - MariaDB Replica IO: s8 on clouddb1020 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:47:52] RECOVERY - MariaDB Replica SQL: s3 on db1154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:47:56] RECOVERY - 
MariaDB Replica IO: s8 on clouddb1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:48:04] RECOVERY - MariaDB Replica SQL: s5 on db1154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:48:14] RECOVERY - MariaDB read only s1 on db1154 is OK: Version 10.4.26-MariaDB-log, Uptime 116s, read_only: True, event_scheduler: True, 2677.80 QPS, connection latency: 0.004043s, query latency: 0.000258s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [22:49:01] (03CR) 10Dzahn: "wow, I did not even know there was automatic sync now. Was thinking we still do that manually. this is cool" [puppet] - 10https://gerrit.wikimedia.org/r/849556 (owner: 10Jelto) [22:49:05] 10SRE, 10ops-eqiad, 10DBA: db1154 is not coming back after restart - https://phabricator.wikimedia.org/T321562 (10Ladsgroup) 05Open→03Resolved Thanks. I started mysql in instances and started replication. Replication is flowing but it takes a bit to get all the updates (a couple of hours). This is resolv... 
[22:49:24] (03PS4) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) [22:49:34] (03CR) 10CI reject: [V: 04-1] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [22:49:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4048.ulsfo.wmnet with OS buster [22:50:08] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4048.ulsfo.wmnet with OS buster [22:51:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P36633 and previous config saved to /var/cache/conftool/dbconfig/20221026-225123-ladsgroup.json [22:51:56] (03CR) 10Dzahn: "@dancy so the "lookup in Hiera" here would by suggestion. 
right now the rebase issue is just weird though" [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [22:52:23] (03CR) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [22:52:30] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [22:52:41] (03PS5) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) [22:52:51] (03CR) 10CI reject: [V: 04-1] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [22:52:56] PROBLEM - SSH on mw1332.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:53:10] sukhe: sorry, es [22:53:13] *yes [22:53:21] I checked it was old and merged it, so my sorry too :) [22:53:37] works for me! 
[22:54:26] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [22:55:08] (03PS24) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [22:55:18] (03CR) 10CI reject: [V: 04-1] Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [22:56:02] (03PS1) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849711 (https://phabricator.wikimedia.org/T321629) [22:56:12] (03CR) 10CI reject: [V: 04-1] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849711 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:00:00] (03CR) 10Dzahn: "currently I don't get why I get these "please rebase/cross dependency"-issues, when at the same time it does not even offer to rebase in U" [puppet] - 10https://gerrit.wikimedia.org/r/849711 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:01:03] (03CR) 10Ahmon Dancy: doc: add parameters for gitlab_runner and contint hosts, allow them (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849711 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:02:09] (03CR) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849711 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:03:02] (03CR) 10Ahmon Dancy: doc: add parameters for gitlab_runner and contint hosts, allow them (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849711 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:06:12] (03CR) 10Dzahn: "I see https://phabricator.wikimedia.org/T308943 . gotcha! 
thank you, no rush" [puppet] - 10https://gerrit.wikimedia.org/r/849711 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:06:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P36634 and previous config saved to /var/cache/conftool/dbconfig/20221026-230630-ladsgroup.json [23:08:03] (03CR) 10Dzahn: "@Raymond the rebase issue you see is not your fault. currently there is a bug investigated at https://phabricator.wikimedia.org/T308943 gi" [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [23:08:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:09:58] !sal Restarted Zuul CI server due to stall ssh connections which went against the max per user connection limit in Gerrit #T308943 [23:09:58] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. 
[23:09:59] T308943: CI fails with 'This change or one of its cross-repo dependencies was unable to be automatically merged' for a lot of repos - https://phabricator.wikimedia.org/T308943 [23:12:49] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/849711 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:13:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:13:59] (KubernetesAPILatency) firing: (18) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:14:54] (03PS2) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849711 (https://phabricator.wikimedia.org/T321629) [23:14:58] (03CR) 10jenkins-bot: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849711 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:16:39] (03CR) 10Ahmon Dancy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:17:18] (03CR) 10CI reject: [V: 04-1] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849711 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:17:22] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [23:18:12] (03CR) 10Dzahn: "yes, thanks. 
this is a different issue now that I can handle: parameter 'gitlab_runner_hosts' expects an Array value, got String" [puppet] - 10https://gerrit.wikimedia.org/r/849711 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:18:51] (03CR) 10CI reject: [V: 04-1] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:19:32] (03CR) 10Dzahn: "yes, thanks! it works again. this is different issue: parameter 'gitlab_runner_hosts' expects an Array value, got String" [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:19:59] (03Abandoned) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849711 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:20:30] (03CR) 10Dzahn: "@Raymond it works again, dancy fixed it" [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [23:21:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T321312)', diff saved to https://phabricator.wikimedia.org/P36635 and previous config saved to /var/cache/conftool/dbconfig/20221026-232136-ladsgroup.json [23:22:53] (03PS6) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) [23:23:40] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4048.ulsfo.wmnet with OS buster [23:23:50] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4048.ulsfo.wmnet with OS buster executed with errors: - cp4048 
(**FA... [23:24:11] (03CR) 10Dzahn: "@JBond Is there a way to use the "interpolation function"/lookup in Hiera but still use the correct data type? I want "array of hosts" but" [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:25:08] (03CR) 10CI reject: [V: 04-1] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:27:22] RECOVERY - MariaDB Replica Lag: s5 on db1154 is OK: OK slave_sql_lag Replication lag: 0.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:27:24] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:31:01] (03PS1) 10Ssingh: Revert "cp4048: update site.pp and related configs for cp role" [puppet] - 10https://gerrit.wikimedia.org/r/849669 [23:32:49] (03CR) 10Ssingh: [C: 03+2] Revert "cp4048: update site.pp and related configs for cp role" [puppet] - 10https://gerrit.wikimedia.org/r/849669 (owner: 10Ssingh) [23:35:38] win 28 [23:37:17] (03PS7) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) [23:37:48] (03PS1) 10Ssingh: cp4044: update site.pp and related configs for cp (text) role [puppet] - 10https://gerrit.wikimedia.org/r/849713 (https://phabricator.wikimedia.org/T317244) [23:38:59] (KubernetesAPILatency) firing: (18) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:39:07] (03CR) 10Ssingh: [C: 03+2] cp4044: update site.pp and related configs for cp (text) role [puppet] - 10https://gerrit.wikimedia.org/r/849713 
(https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [23:39:28] (03CR) 10CI reject: [V: 04-1] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [23:40:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4044.ulsfo.wmnet with OS buster [23:50:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:14] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:53:34] PROBLEM - Check systemd state on kubernetes1011 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:48] RECOVERY - SSH on mw1332.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:56:04] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [23:58:02] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:58:59] (KubernetesAPILatency) firing: (17) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency