[00:39:22] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/921315 [00:39:28] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/921315 (owner: TrainBranchBot) [00:46:01] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:56:35] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/921315 (owner: TrainBranchBot) [00:58:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:33] PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [02:11:32] (JobUnavailable) firing: (3) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:05] RECOVERY - Recursive DNS on 103.102.166.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [02:18:19] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get 
structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [02:19:55] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:20:21] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:21:55] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:26:32] (JobUnavailable) firing: (3) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:17] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:28:13] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:29:37] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.147 second response time https://wikitech.wikimedia.org/wiki/Swift [02:36:32] (JobUnavailable) resolved: (3) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:15] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - 
CRITICAL - swift-https_443: Servers moss-fe2001.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:39:09] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:39:41] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:40:33] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:47] (JobUnavailable) resolved: (2) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:35] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.201 second response time https://wikitech.wikimedia.org/wiki/Swift [02:45:01] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.147 second response time https://wikitech.wikimedia.org/wiki/Swift [02:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:50:23] PROBLEM - Swift https frontend on moss-fe2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.325 second response time 
https://wikitech.wikimedia.org/wiki/Swift [02:51:37] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:51:49] RECOVERY - Swift https frontend on moss-fe2001 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Swift [02:52:15] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:53:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:54:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:59:23] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:01:37] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:02:23] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [03:03:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:04:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:04:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:08:15] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.207 second response time https://wikitech.wikimedia.org/wiki/Swift [03:08:22] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:09:25] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2014.codfw.wmnet, moss-fe2001.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:09:43] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Swift [03:10:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:10:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:10:21] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2014.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled: swift_80: Servers ms-fe2014.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:11:53] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:13:27] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:14:49] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.149 second response time https://wikitech.wikimedia.org/wiki/Swift [03:16:21] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [03:16:33] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2014.codfw.wmnet are marked down but pooled: swift_80: Servers ms-fe2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:17:11] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:19:22] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:19:39] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:19:41] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:20:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:21:03] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Swift [03:21:49] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:23:49] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.367 second response time https://wikitech.wikimedia.org/wiki/Swift [03:24:17] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:24:19] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:25:15] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.137 second response time 
https://wikitech.wikimedia.org/wiki/Swift [03:27:11] PROBLEM - Check systemd state on arclamp2001 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:27:13] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Swift [03:28:01] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers moss-fe2001.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:28:57] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:35:07] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:35:47] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:36:21] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.401 second response time https://wikitech.wikimedia.org/wiki/Swift [03:37:49] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Swift [03:39:49] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:41:13] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.150 second response time 
https://wikitech.wikimedia.org/wiki/Swift [03:42:57] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:45:35] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.375 second response time https://wikitech.wikimedia.org/wiki/Swift [03:47:03] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Swift [03:47:21] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.363 second response time https://wikitech.wikimedia.org/wiki/Swift [03:48:15] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2013.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:48:47] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.149 second response time https://wikitech.wikimedia.org/wiki/Swift [03:56:57] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:59:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:03:47] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal 
[04:09:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:12:25] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:17:37] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:18:17] RECOVERY - Check systemd state on arclamp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:21:37] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:23:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:23:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:25:23] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2012.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [04:28:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:28:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:30:55] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:34:01] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:34:39] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:38:41] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:38:41] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:40:05] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Swift [04:40:09] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:42:23] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:43:07] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Swift [04:43:19] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:48:35] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:49:11] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.271 second response time https://wikitech.wikimedia.org/wiki/Swift [04:50:35] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Swift [04:53:39] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.286 second response time https://wikitech.wikimedia.org/wiki/Swift [04:56:21] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:56:39] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Swift [04:58:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:58:49] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:59:27] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:02:29] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:02:35] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:03:31] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:04:45] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.374 second response time https://wikitech.wikimedia.org/wiki/Swift [05:06:09] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Swift [05:06:37] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:06:39] PROBLEM - Swift https frontend on ms-fe2011 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:09:37] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Swift [05:13:23] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:03] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2014.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:15:59] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:17:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Failover es1, es2 and es3 masters for kernel reboots', diff saved to https://phabricator.wikimedia.org/P48405 and previous config saved to /var/cache/conftool/dbconfig/20230522-051723-marostegui.json [05:18:22] (PS1) Marostegui: es1029,es1030,es1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/921771 [05:18:37] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.242 second response time https://wikitech.wikimedia.org/wiki/Swift [05:19:04] (CR) Marostegui: [C: +2] es1029,es1030,es1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/921771 (owner: Marostegui) [05:19:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1029, es1030, es1031 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P48406 and previous config saved to /var/cache/conftool/dbconfig/20230522-051957-root.json [05:21:39] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.138 second 
response time https://wikitech.wikimedia.org/wiki/Swift [05:21:55] PROBLEM - Check systemd state on arclamp2001 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:25:23] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:27:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48407 and previous config saved to /var/cache/conftool/dbconfig/20230522-052746-root.json [05:27:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48408 and previous config saved to /var/cache/conftool/dbconfig/20230522-052753-root.json [05:28:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48409 and previous config saved to /var/cache/conftool/dbconfig/20230522-052800-root.json [05:28:10] (PS1) Marostegui: Revert "es1029,es1030,es1031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/921556 [05:28:46] (CR) Marostegui: [C: +2] Revert "es1029,es1030,es1031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/921556 (owner: Marostegui) [05:29:08] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (ayounsi) I also like option 5 (hard-coding the conditional in Jinja to not configure RA if the device name starts... 
[05:29:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T337203 [05:29:25] T337203: Switchover es4 master (es2021 -> es2020) - https://phabricator.wikimedia.org/T337203 [05:29:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T337203 [05:29:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es2020 with weight 0 T337203', diff saved to https://phabricator.wikimedia.org/P48410 and previous config saved to /var/cache/conftool/dbconfig/20230522-052938-root.json [05:31:50] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:32:03] (PS1) Marostegui: mariadb: Promote es2020 to es4 master [puppet] - https://gerrit.wikimedia.org/r/921772 (https://phabricator.wikimedia.org/T337203) [05:33:54] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:34:15] (PS2) Marostegui: mariadb: Promote es2020 to es4 master [puppet] - https://gerrit.wikimedia.org/r/921772 (https://phabricator.wikimedia.org/T337203) [05:34:21] !log Starting es4 codfw failover from es2021 to es2020 - T337203 [05:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:27] T337203: Switchover es4 master (es2021 -> es2020) - https://phabricator.wikimedia.org/T337203 [05:34:53] (CR) Marostegui: [C: +2] mariadb: Promote es2020 to es4 master [puppet] - https://gerrit.wikimedia.org/r/921772 (https://phabricator.wikimedia.org/T337203) (owner: Marostegui) [05:35:50] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers 
ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:35:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2020 to es4 codfw primaryT337203', diff saved to https://phabricator.wikimedia.org/P48411 and previous config saved to /var/cache/conftool/dbconfig/20230522-053554-marostegui.json [05:37:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2021 T337203', diff saved to https://phabricator.wikimedia.org/P48412 and previous config saved to /var/cache/conftool/dbconfig/20230522-053705-marostegui.json [05:37:56] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:42:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48413 and previous config saved to /var/cache/conftool/dbconfig/20230522-054251-root.json [05:42:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48414 and previous config saved to /var/cache/conftool/dbconfig/20230522-054258-root.json [05:43:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48415 and previous config saved to /var/cache/conftool/dbconfig/20230522-054304-root.json [05:49:46] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2014.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:51:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48416 and previous config saved 
to /var/cache/conftool/dbconfig/20230522-055120-root.json [05:51:24] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2013.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:52:47] (03PS2) 10KartikMistry: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921049 (https://phabricator.wikimedia.org/T327868) [05:57:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48417 and previous config saved to /var/cache/conftool/dbconfig/20230522-055756-root.json [05:58:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48418 and previous config saved to /var/cache/conftool/dbconfig/20230522-055803-root.json [05:58:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48419 and previous config saved to /var/cache/conftool/dbconfig/20230522-055809-root.json [06:01:53] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:02:07] (ProbeDown) firing: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:02:49] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Swift [06:02:53] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - swift_80: Servers ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:07:07] (ProbeDown) resolved: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:07:13] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:08:27] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:09:25] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [06:09:47] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:10:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2021', diff saved to https://phabricator.wikimedia.org/P48420 and previous config saved to /var/cache/conftool/dbconfig/20230522-061033-marostegui.json [06:10:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48421 and previous config saved to /var/cache/conftool/dbconfig/20230522-061040-root.json [06:13:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48422 and previous config saved to /var/cache/conftool/dbconfig/20230522-061300-root.json [06:13:08] !log marostegui@cumin1001 dbctl commit (dc=all): 
'es1030 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48423 and previous config saved to /var/cache/conftool/dbconfig/20230522-061307-root.json [06:13:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48424 and previous config saved to /var/cache/conftool/dbconfig/20230522-061314-root.json [06:15:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es5 T337204 [06:15:13] T337204: Switchover es5 codfw master (es2023 -> es2024) - https://phabricator.wikimedia.org/T337204 [06:15:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es2024 with weight 0 T337204', diff saved to https://phabricator.wikimedia.org/P48425 and previous config saved to /var/cache/conftool/dbconfig/20230522-061524-root.json [06:15:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es5 T337204 [06:15:41] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:16:31] (03PS1) 10Marostegui: mariadb: Promote es2024 to es5 master [puppet] - 10https://gerrit.wikimedia.org/r/921877 (https://phabricator.wikimedia.org/T337204) [06:17:22] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote es2024 to es5 master [puppet] - 10https://gerrit.wikimedia.org/r/921877 (https://phabricator.wikimedia.org/T337204) (owner: 10Marostegui) [06:17:47] RECOVERY - Check systemd state on arclamp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:17:52] !log Starting es5 codfw failover from es2023 to es2024 - T337204 [06:17:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:39] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet, moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:19:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2023 T337204', diff saved to https://phabricator.wikimedia.org/P48426 and previous config saved to /var/cache/conftool/dbconfig/20230522-061925-root.json [06:19:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give weight to es2024', diff saved to https://phabricator.wikimedia.org/P48427 and previous config saved to /var/cache/conftool/dbconfig/20230522-061947-marostegui.json [06:20:59] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1003.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service,rsync-doc-doc2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:22:21] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:25:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48428 and previous config saved to /var/cache/conftool/dbconfig/20230522-062545-root.json [06:28:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48429 and previous config saved to /var/cache/conftool/dbconfig/20230522-062805-root.json [06:28:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48430 and previous config saved to 
/var/cache/conftool/dbconfig/20230522-062812-root.json [06:28:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48431 and previous config saved to /var/cache/conftool/dbconfig/20230522-062818-root.json [06:29:24] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:29:30] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.222 second response time https://wikitech.wikimedia.org/wiki/Swift [06:30:22] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Swift [06:31:28] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:31:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48432 and previous config saved to /var/cache/conftool/dbconfig/20230522-063151-root.json [06:33:17] (03PS1) 10Marostegui: mariadb: Decommission db1121 [puppet] - 10https://gerrit.wikimedia.org/r/921881 (https://phabricator.wikimedia.org/T336725) [06:33:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1121.eqiad.wmnet [06:37:08] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:37:50] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast2002 [06:38:39] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [06:38:52] 
PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:39:44] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1121 [puppet] - 10https://gerrit.wikimedia.org/r/921881 (https://phabricator.wikimedia.org/T336725) (owner: 10Marostegui) [06:40:03] 10SRE, 10ops-codfw, 10decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (10MoritzMuehlenhoff) >>! In T336995#8865388, @Dzahn wrote: > This host is still in Icinga.. so not removed from puppet db or something... We have a rare race in the decom cookbook,... [06:40:33] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1121.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [06:40:39] (03PS1) 10Stevemunene: Create the jupyter notebook config folder [puppet] - 10https://gerrit.wikimedia.org/r/921885 (https://phabricator.wikimedia.org/T336036) [06:40:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48433 and previous config saved to /var/cache/conftool/dbconfig/20230522-064050-root.json [06:41:16] (03CR) 10CI reject: [V: 04-1] Create the jupyter notebook config folder [puppet] - 10https://gerrit.wikimedia.org/r/921885 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [06:41:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1121.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [06:41:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:41:52] !log marostegui@cumin1001 END (PASS) - 
Cookbook sre.hosts.decommission (exit_code=0) for hosts db1121.eqiad.wmnet [06:42:22] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:42:43] (03CR) 10Muehlenhoff: [C: 03+1] "FYI; there was a new security update for imagemagick, we either need to bump to +deb10u5 or remove the version-specific annotation:" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/921255 (https://phabricator.wikimedia.org/T334863) (owner: 10Hnowlan) [06:42:49] 10ops-eqiad, 10decommission-hardware: decommission db1121.eqiad.wmnet - https://phabricator.wikimedia.org/T336725 (10Marostegui) [06:43:09] (03PS2) 10Stevemunene: Create the jupyter notebook config folder [puppet] - 10https://gerrit.wikimedia.org/r/921885 (https://phabricator.wikimedia.org/T336036) [06:43:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48434 and previous config saved to /var/cache/conftool/dbconfig/20230522-064310-root.json [06:43:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48435 and previous config saved to /var/cache/conftool/dbconfig/20230522-064317-root.json [06:43:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48436 and previous config saved to /var/cache/conftool/dbconfig/20230522-064323-root.json [06:43:38] 10ops-eqiad, 10decommission-hardware: decommission db1121.eqiad.wmnet - https://phabricator.wikimedia.org/T336725 (10Marostegui) a:05Marostegui→03Jclark-ctr [06:43:57] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [06:44:02] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 
Bad Gateway - 309 bytes in 7.257 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:04] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:44:42] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:45:06] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:45:07] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts bast2002 [06:45:12] 10SRE, 10ops-codfw, 10decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast2002` - bast2002 (**FAIL**) - //Unable to find/resolve the mgmt DNS record, using th... 
[06:45:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 T337206', diff saved to https://phabricator.wikimedia.org/P48437 and previous config saved to /var/cache/conftool/dbconfig/20230522-064541-root.json [06:45:46] T337206: decommission db1119.eqiad.wmnet - https://phabricator.wikimedia.org/T337206 [06:46:05] (03CR) 10Stevemunene: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41254/console" [puppet] - 10https://gerrit.wikimedia.org/r/919826 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [06:46:13] (03PS1) 10Marostegui: db1119: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/921984 (https://phabricator.wikimedia.org/T337206) [06:46:30] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:46:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48438 and previous config saved to /var/cache/conftool/dbconfig/20230522-064656-root.json [06:47:07] (03CR) 10Marostegui: [C: 03+2] db1119: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/921984 (https://phabricator.wikimedia.org/T337206) (owner: 10Marostegui) [06:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:53:20] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.360 second response time https://wikitech.wikimedia.org/wiki/Swift [06:54:36] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 245 
bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Swift [06:55:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48439 and previous config saved to /var/cache/conftool/dbconfig/20230522-065555-root.json [06:58:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48440 and previous config saved to /var/cache/conftool/dbconfig/20230522-065815-root.json [06:58:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48441 and previous config saved to /var/cache/conftool/dbconfig/20230522-065822-root.json [06:58:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48442 and previous config saved to /var/cache/conftool/dbconfig/20230522-065828-root.json [06:59:33] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10MoritzMuehlenhoff) >>! In T330884#8865295, @ayounsi wrote: >> I copied over samplicator from bullseye-wikimedia to bookworm-wikimedia (the only dependency is glibc itself)... 
[06:59:52] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:00:50] 10SRE, 10Infrastructure-Foundations: Import/create samplicator source package - https://phabricator.wikimedia.org/T337208 (10MoritzMuehlenhoff) [07:00:58] 10SRE, 10Infrastructure-Foundations: Import/create samplicator source package - https://phabricator.wikimedia.org/T337208 (10MoritzMuehlenhoff) p:05Triage→03Low [07:02:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48443 and previous config saved to /var/cache/conftool/dbconfig/20230522-070200-root.json [07:06:06] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.427 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:08] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Swift [07:11:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48444 and previous config saved to /var/cache/conftool/dbconfig/20230522-071059-root.json [07:13:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48445 and previous config saved to /var/cache/conftool/dbconfig/20230522-071319-root.json [07:13:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48446 and previous config saved to /var/cache/conftool/dbconfig/20230522-071326-root.json [07:13:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 
(re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48447 and previous config saved to /var/cache/conftool/dbconfig/20230522-071333-root.json [07:15:58] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:17:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48448 and previous config saved to /var/cache/conftool/dbconfig/20230522-071705-root.json [07:18:20] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:26:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48449 and previous config saved to /var/cache/conftool/dbconfig/20230522-072604-root.json [07:28:34] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [07:28:38] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:32:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48450 and previous config saved to /var/cache/conftool/dbconfig/20230522-073210-root.json [07:32:22] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.206 second response time https://wikitech.wikimedia.org/wiki/Swift [07:32:38] !log mvernon@cumin1001 END 
(PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [07:32:52] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:32:58] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [07:33:42] (03CR) 10Muehlenhoff: [C: 03+2] Switch kadmin server back to krb1001 [puppet] - 10https://gerrit.wikimedia.org/r/921242 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [07:34:20] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Swift [07:35:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [alerts] - 10https://gerrit.wikimedia.org/r/921047 (https://phabricator.wikimedia.org/T293970) (owner: 10Filippo Giunchedi) [07:36:07] (03PS2) 10Muehlenhoff: Add insetup variant for undefined ownership [puppet] - 10https://gerrit.wikimedia.org/r/869777 [07:37:55] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [07:39:26] (03PS4) 10Slyngshede: sre.ganeti.makevm call reimage after VM creation [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) [07:41:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48451 and previous config saved to /var/cache/conftool/dbconfig/20230522-074109-root.json [07:42:27] (03CR) 10Slyngshede: 
sre.ganeti.makevm call reimage after VM creation (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [07:44:33] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:44:46] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:47:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48452 and previous config saved to /var/cache/conftool/dbconfig/20230522-074715-root.json [07:48:50] (03CR) 10Muehlenhoff: [C: 03+2] Retire sre.aqs.roll-restart [cookbooks] - 10https://gerrit.wikimedia.org/r/920704 (https://phabricator.wikimedia.org/T330889) (owner: 10Muehlenhoff) [07:49:04] (03CR) 10Vgutierrez: [C: 03+1] varnish: fix call to cluster_fe_ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/921617 (https://phabricator.wikimedia.org/T337142) (owner: 10Volans) [07:51:18] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:52:05] (03CR) 10Muehlenhoff: [C: 03+2] "Removed the pyc files per" [cookbooks] - 10https://gerrit.wikimedia.org/r/920704 (https://phabricator.wikimedia.org/T330889) (owner: 10Muehlenhoff) [07:54:33] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:55:52] PROBLEM - Kerberos KAdmin daemon on krb2002 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/sbin/kadmind https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [07:56:10] PROBLEM - Kerberos Kpropd daemon on krb1001 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/sbin/kpropd https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [07:56:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48453 and previous config saved to /var/cache/conftool/dbconfig/20230522-075613-root.json [07:58:31] (03CR) 10Filippo Giunchedi: [C: 03+2] cadvisor: disable percpu and cpuLoad metric classes [puppet] - 10https://gerrit.wikimedia.org/r/920661 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [08:01:46] krb2002/1001 are expected, part of the failover [08:02:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48454 and previous config saved to /var/cache/conftool/dbconfig/20230522-080219-root.json [08:02:46] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] profile: rollout cadvisor to PoPs [puppet] - 10https://gerrit.wikimedia.org/r/920991 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [08:04:33] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:09:01] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.8 with 
snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:10:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [08:13:57] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:16:39] (03PS1) 10Filippo Giunchedi: profile: fix cadvisor deployment to PoPs [puppet] - 10https://gerrit.wikimedia.org/r/922057 (https://phabricator.wikimedia.org/T108027) [08:17:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48455 and previous config saved to /var/cache/conftool/dbconfig/20230522-081724-root.json [08:18:14] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41255/console" [puppet] - 10https://gerrit.wikimedia.org/r/922057 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [08:19:24] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] profile: fix cadvisor deployment to PoPs [puppet] - 10https://gerrit.wikimedia.org/r/922057 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [08:19:33] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:19:48] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Retire sre.aqs.roll-restart cookbook - https://phabricator.wikimedia.org/T330889 (10MoritzMuehlenhoff) [08:20:28] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Retire 
sre.aqs.roll-restart cookbook - https://phabricator.wikimedia.org/T330889 (10MoritzMuehlenhoff) 05Open→03Resolved The old cookbook has been removed and the docs were updated. [08:22:11] !log installing systemd security updates [08:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:45] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) [08:28:00] !log drain Arelion link between cr1-codfw and cr3-eqsin to mitigate packet loss eqiad <-> eqsin [08:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:02] (03PS1) 10Muehlenhoff: Remove krb2001 from list of active KDCs [puppet] - 10https://gerrit.wikimedia.org/r/922060 [08:38:07] (03CR) 10Cparle: [C: 03+2] [WikibaseMediaInfo] Add 'main subject of' property [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921701 (owner: 10Matthias Mullie) [08:38:55] (03Merged) 10jenkins-bot: [WikibaseMediaInfo] Add 'main subject of' property [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921701 (owner: 10Matthias Mullie) [08:43:59] (03CR) 10Btullis: [C: 03+1] Create the jupyter notebook config folder [puppet] - 10https://gerrit.wikimedia.org/r/921885 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [08:46:12] !log Stop mysql on db2160 (haproxy irc alerts will be generated) [08:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:26] 10SRE, 10ops-codfw, 10decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (10MoritzMuehlenhoff) >>! In T336995#8869914, @MoritzMuehlenhoff wrote: >>>! In T336995#8865388, @Dzahn wrote: >> This host is still in Icinga.. so not removed from puppet db or someth... 
[08:54:27] (03CR) 10David Caro: [C: 03+2] "Tested on toolsbeta, LGTM ❤️" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/921620 (https://phabricator.wikimedia.org/T337182) (owner: 10Lucas Werkmeister) [08:55:32] (03Merged) 10jenkins-bot: Restart Kubernetes webservices more cleanly [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/921620 (https://phabricator.wikimedia.org/T337182) (owner: 10Lucas Werkmeister) [08:55:58] 10SRE, 10ops-codfw, 10Data-Persistence-Backup: Degraded RAID on backup2010 - https://phabricator.wikimedia.org/T337174 (10jcrespo) I thought this was a mere software automation temporary failures, but looking at the RAID log, a disk was temporarily removed from the RAID and rebuilt (outside of the setup peri... [08:58:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:01:48] (03CR) 10Muehlenhoff: [C: 03+2] Remove krb2001 from list of active KDCs [puppet] - 10https://gerrit.wikimedia.org/r/922060 (owner: 10Muehlenhoff) [09:01:56] 10SRE, 10ops-codfw, 10Data-Persistence-Backup: Degraded RAID on backup2010 - https://phabricator.wikimedia.org/T337174 (10jcrespo) @Jhancock.wm @wiki_willy I am unsure how to proceed here, if this was a host that is in production for some time, I would just ignore it and move on; but this is a host that is b... 
[09:06:48] (03CR) 10Slyngshede: [C: 03+2] k8s upgrade cluster: use sre.hosts.reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/920192 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [09:08:32] (03PS1) 10KartikMistry: cxserver: Remove Flores MT service [deployment-charts] - 10https://gerrit.wikimedia.org/r/922064 (https://phabricator.wikimedia.org/T331505) [09:08:34] (03PS3) 10Jbond: profile::auto_restarts::service: add some spec tests [puppet] - 10https://gerrit.wikimedia.org/r/920654 (https://phabricator.wikimedia.org/T336845) (owner: 10Arturo Borrero Gonzalez) [09:08:36] (03CR) 10Jbond: profile::auto_restarts::service: add some spec tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920654 (https://phabricator.wikimedia.org/T336845) (owner: 10Arturo Borrero Gonzalez) [09:09:07] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/920654 (https://phabricator.wikimedia.org/T336845) (owner: 10Arturo Borrero Gonzalez) [09:09:36] Is there a deployment schedule for this week yet? [09:10:04] (03PS1) 10Slyngshede: sre.ganeti.reimage: Remove specialised cookbook. [cookbooks] - 10https://gerrit.wikimedia.org/r/922065 (https://phabricator.wikimedia.org/T336491) [09:11:15] We just accidentally merged a patch in config (the patch is fine, we just didn’t intend for it to be merged without ability to deploy immediately) [09:12:00] Should we revert, or can I deploy it now-ish? Or just leave it merged-but-undeployed for now? 
[09:12:03] (03CR) 10Jbond: [C: 03+1] cloud: wmf-auto-restart: exclude NFS filesystems [puppet] - 10https://gerrit.wikimedia.org/r/920644 (https://phabricator.wikimedia.org/T316544) (owner: 10Arturo Borrero Gonzalez) [09:14:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, please make sure to follow https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Renaming/Deleting_a_cookbook" [cookbooks] - 10https://gerrit.wikimedia.org/r/922065 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [09:18:08] matthiasmullie: you absolutely should not leave undeployed patches merged. deploying should be fine since there's nothing else going on [09:18:51] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (10MoritzMuehlenhoff) I've updated https://wikitech.wikimedia.org/wiki/Ganeti to point to the new cookbook [09:18:59] note that many deployers are travelling back from the Hackathon today. I'm just about to board my flight for example [09:22:45] matthiasmullie: either deploy it or revert in Gerrit then pull the revert on the deployment server [09:22:48] I think that will do it [09:23:31] though we have a monitoring probe comparing mediawiki-config has the same commit on deployment server and on mediawiki ones in which case that will need a dummy deploy [09:24:08] Perfect; I’m close to airport, will deploy once I have a stable connection there [09:24:31] that is for the change [WikibaseMediaInfo] Add 'main subject of' property ? 
[09:25:02] I can revert it in Gerrit and do the dummy deploy [09:25:06] Correct [09:25:23] then the patch can be deployed later when you are really back rather than in transports :] [09:25:33] Sounds good, thanks [09:25:39] doing [09:26:12] (03PS1) 10Hashar: Revert "[WikibaseMediaInfo] Add 'main subject of' property" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921558 [09:26:27] matthiasmullie: if you wanna +1 the revert, I will then process it :] [09:27:21] (03PS1) 10Muehlenhoff: Remove KDC role from krb2001 [puppet] - 10https://gerrit.wikimedia.org/r/922068 (https://phabricator.wikimedia.org/T331695) [09:27:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921558 (owner: 10Hashar) [09:29:15] (03Merged) 10jenkins-bot: Revert "[WikibaseMediaInfo] Add 'main subject of' property" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921558 (owner: 10Hashar) [09:29:49] !log hashar@deploy1002 Started scap: Backport for [[gerrit:921558|Revert "[WikibaseMediaInfo] Add 'main subject of' property"]] [09:30:06] (03CR) 10Matthias Mullie: [C: 03+1] "LGTM, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921558 (owner: 10Hashar) [09:30:38] +1 done, thanks hashar! 
[09:31:07] (03CR) 10Jbond: "-1: see inline" [puppet] - 10https://gerrit.wikimedia.org/r/920648 (https://phabricator.wikimedia.org/T316544) (owner: 10Arturo Borrero Gonzalez) [09:39:11] !log hashar@deploy1002 hashar: Backport for [[gerrit:921558|Revert "[WikibaseMediaInfo] Add 'main subject of' property"]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [09:43:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host debmonitor1003.eqiad.wmnet [09:43:12] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:48:39] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM debmonitor1003.eqiad.wmnet - jmm@cumin2002" [09:49:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM debmonitor1003.eqiad.wmnet - jmm@cumin2002" [09:49:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:49:44] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache debmonitor1003.eqiad.wmnet on all recursors [09:49:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) debmonitor1003.eqiad.wmnet on all recursors [09:50:14] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM debmonitor1003.eqiad.wmnet - jmm@cumin2002" [09:51:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM debmonitor1003.eqiad.wmnet - jmm@cumin2002" [09:51:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host debmonitor1003.eqiad.wmnet [09:51:21] 10SRE-swift-storage, 10serviceops-collab: Investigate object storage for Gitlab 
- https://phabricator.wikimedia.org/T336234 (10fgiunchedi) >>! In T336234#8864792, @MatthewVernon wrote: > I think here we are talking about using the S3 protocol? That is currently only enabled on the thanos cluster (MOSS is a may... [10:00:32] doing the backport now (I forgot to confirm in the terminal) [10:01:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host debmonitor2003.codfw.wmnet [10:01:33] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:02:11] (03PS1) 10Slyngshede: Wikimedia signup terms: Update signup information. [software/bitu] - 10https://gerrit.wikimedia.org/r/922072 [10:02:54] !log installing updated usb.ids packages for Bullseye [10:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:40] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM debmonitor2003.codfw.wmnet - jmm@cumin2002" [10:04:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM debmonitor2003.codfw.wmnet - jmm@cumin2002" [10:04:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:04:42] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache debmonitor2003.codfw.wmnet on all recursors [10:04:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) debmonitor2003.codfw.wmnet on all recursors [10:05:16] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM debmonitor2003.codfw.wmnet - jmm@cumin2002" [10:06:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM debmonitor2003.codfw.wmnet - jmm@cumin2002" [10:06:17] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.ganeti.makevm (exit_code=0) for new host debmonitor2003.codfw.wmnet [10:06:30] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [10:06:49] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:921558|Revert "[WikibaseMediaInfo] Add 'main subject of' property"]] (duration: 37m 00s) [10:07:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/922072 (owner: 10Slyngshede) [10:12:00] (03CR) 10Slyngshede: [C: 03+2] Wikimedia signup terms: Update signup information. [software/bitu] - 10https://gerrit.wikimedia.org/r/922072 (owner: 10Slyngshede) [10:12:02] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Wikimedia signup terms: Update signup information. [software/bitu] - 10https://gerrit.wikimedia.org/r/922072 (owner: 10Slyngshede) [10:14:52] (03PS1) 10Elukey: helmfile.d: add Lift Wing's revert risk model server to api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/922073 (https://phabricator.wikimedia.org/T333124) [10:17:36] !log Un-draining transport circuit from eqsin to codfw, moving traffic back to default path [10:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:56] !log Un-draining transport circuit from eqsin to codfw, moving traffic back to default path T337220 [10:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:49] (03PS2) 10Elukey: helmfile.d: add Lift Wing's revert risk model server to api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/922073 (https://phabricator.wikimedia.org/T333124) [10:27:03] (03CR) 10Vgutierrez: [C: 03+1] varnishkafka: remove absented logster integration [puppet] - 10https://gerrit.wikimedia.org/r/919802 (owner: 10Majavah) [10:29:00] (03CR) 10Cathal Mooney: [C: 03+2] templates: convert 172.20.5.0/24 to Nebox [dns] - 10https://gerrit.wikimedia.org/r/914751 
(https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez) [10:43:32] (03PS1) 10Muehlenhoff: Add debmonitor[12]003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/922079 (https://phabricator.wikimedia.org/T241049) [10:43:35] (03CR) 10Jbond: [C: 04-1] "thanks for the work general idea an implementation looks good see inline for comments (-1s explicitly labelled)" [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [10:49:17] (03CR) 10Muehlenhoff: [C: 03+2] Add debmonitor[12]003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/922079 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [10:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:53:31] (03PS1) 10Btullis: Downgrade the version of spark to 3.1.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/922082 (https://phabricator.wikimedia.org/T332765) [10:55:21] (03PS2) 10Btullis: Downgrade the version of spark to 3.1.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/922082 (https://phabricator.wikimedia.org/T332765) [10:58:40] (03PS3) 10Btullis: Downgrade the version of spark to 3.1.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/922082 (https://phabricator.wikimedia.org/T332765) [11:08:20] (03CR) 10Jbond: query_service: Permit python2 on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [11:12:17] (03PS1) 10Slyngshede: Cleanup wording of various messages [software/bitu] - 10https://gerrit.wikimedia.org/r/922087 [11:13:43] (03PS1) 10Stevemunene: Grant stat1009 access to cloud dumps [puppet] - 10https://gerrit.wikimedia.org/r/922091 
(https://phabricator.wikimedia.org/T336036) [11:17:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/922087 (owner: 10Slyngshede) [11:17:30] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Cleanup wording of various messages [software/bitu] - 10https://gerrit.wikimedia.org/r/922087 (owner: 10Slyngshede) [11:26:20] (03PS1) 10Matthias Mullie: [WikibaseMediaInfo] Add 'main subject of' property [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921561 [11:26:44] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: refator to set up routes for the cloud realm independently from keepalived [puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T336963) [11:26:46] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: refactor vlan interfaces to use interface::tagged [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T336963) [11:26:48] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: add cloud-private subnet support [puppet] - 10https://gerrit.wikimedia.org/r/922106 (https://phabricator.wikimedia.org/T336963) [11:26:57] (03CR) 10Matthias Mullie: [C: 03+1] "Per earlier CR, this is good to go once I'm around to see deployment through" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921561 (owner: 10Matthias Mullie) [11:27:30] (03CR) 10CI reject: [V: 04-1] cloudgw: refactor vlan interfaces to use interface::tagged [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [11:28:10] (03CR) 10CI reject: [V: 04-1] cloudgw: add cloud-private subnet support [puppet] - 10https://gerrit.wikimedia.org/r/922106 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [11:29:24] (03CR) 10CI reject: [V: 04-1] cloudgw: refator to set up routes for the cloud realm independently from keepalived [puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo 
Borrero Gonzalez) [11:35:36] (03CR) 10Jbond: "lgtm but see comment and follow up patch for improvments" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [11:35:43] (03PS1) 10Jbond: profile::gerrit: make dependency on gerrit class explicit [puppet] - 10https://gerrit.wikimedia.org/r/922107 [11:36:09] (03CR) 10CI reject: [V: 04-1] profile::gerrit: make dependency on gerrit class explicit [puppet] - 10https://gerrit.wikimedia.org/r/922107 (owner: 10Jbond) [11:38:08] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: refator to set up routes for the cloud realm independently from keepalived [puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T336963) [11:38:10] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: refactor vlan interfaces to use interface::tagged [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T336963) [11:38:12] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: add cloud-private subnet support [puppet] - 10https://gerrit.wikimedia.org/r/922106 (https://phabricator.wikimedia.org/T336963) [11:38:33] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:38:43] (03CR) 10CI reject: [V: 04-1] cloudgw: refactor vlan interfaces to use interface::tagged [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [11:39:15] (03CR) 10Btullis: [V: 03+2 C: 03+2] Downgrade the version of spark to 3.1.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/922082 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [11:39:20] (03CR) 10CI reject: [V: 04-1] cloudgw: add cloud-private subnet support [puppet] - 
10https://gerrit.wikimedia.org/r/922106 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [11:39:54] (03PS2) 10Jbond: profile::gerrit: make dependency on gerrit class explicit [puppet] - 10https://gerrit.wikimedia.org/r/922107 [11:40:37] (03CR) 10CI reject: [V: 04-1] cloudgw: refator to set up routes for the cloud realm independently from keepalived [puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [11:41:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41257/console" [puppet] - 10https://gerrit.wikimedia.org/r/922107 (owner: 10Jbond) [11:42:10] (03CR) 10CI reject: [V: 04-1] profile::gerrit: make dependency on gerrit class explicit [puppet] - 10https://gerrit.wikimedia.org/r/922107 (owner: 10Jbond) [11:43:53] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] wmf-update-known-hosts-production: Automatically download DNS (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 (owner: 10Lucas Werkmeister (WMDE)) [11:44:11] (03PS1) 10Slyngshede: Signup: Custom Captcha function. [software/bitu] - 10https://gerrit.wikimedia.org/r/922108 [11:45:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host debmonitor1003.eqiad.wmnet with OS bookworm [11:46:12] Can anyone update deployment calendar? https://wikitech.wikimedia.org/wiki/Deployments [11:46:49] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: Device does not support ifTable - try without -I option https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:47:38] (03CR) 10Cathal Mooney: [C: 04-1] "I'm wary of us rushing this in without a more rounded plan. 
We need to work out what the exact config will be like and then update the pu" [puppet] - 10https://gerrit.wikimedia.org/r/922106 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [11:48:13] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:48:33] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:51:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T336792) (owner: 10BCornwall) [11:52:18] (03PS1) 10Slyngshede: Signup: Tweak layout on success page. [software/bitu] - 10https://gerrit.wikimedia.org/r/922109 [11:57:14] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on debmonitor1003.eqiad.wmnet with reason: host reimage [11:57:17] (03PS1) 10Slyngshede: C:idm enable custom captcha generator. [puppet] - 10https://gerrit.wikimedia.org/r/922111 [11:59:13] (03CR) 10Muehlenhoff: Signup: Tweak layout on success page. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/922109 (owner: 10Slyngshede) [11:59:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2124', diff saved to https://phabricator.wikimedia.org/P48456 and previous config saved to /var/cache/conftool/dbconfig/20230522-115936-root.json [12:00:40] (03CR) 10Muehlenhoff: Signup: Custom Captcha function. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/922108 (owner: 10Slyngshede) [12:02:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on debmonitor1003.eqiad.wmnet with reason: host reimage [12:03:07] (03PS2) 10Slyngshede: Signup: Tweak layout on success page. [software/bitu] - 10https://gerrit.wikimedia.org/r/922109 [12:04:20] (03PS2) 10Slyngshede: Signup: Custom Captcha function. [software/bitu] - 10https://gerrit.wikimedia.org/r/922108 [12:05:08] (03CR) 10Slyngshede: Signup: Custom Captcha function. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/922108 (owner: 10Slyngshede) [12:05:14] (03CR) 10Slyngshede: Signup: Tweak layout on success page. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/922109 (owner: 10Slyngshede) [12:05:43] (03PS2) 10Slyngshede: C:idm enable custom captcha generator. [puppet] - 10https://gerrit.wikimedia.org/r/922111 [12:05:52] (03CR) 10Jbond: [C: 04-1] gitlab: use sshkey for git-ssh public keys (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [12:08:56] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/922109 (owner: 10Slyngshede) [12:11:33] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:13:38] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Signup: Tweak layout on success page. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/922109 (owner: 10Slyngshede) [12:14:07] (03CR) 10Slyngshede: "Do not merge before: https://gerrit.wikimedia.org/r/c/operations/software/bitu/+/922108" [puppet] - 10https://gerrit.wikimedia.org/r/922111 (owner: 10Slyngshede) [12:14:49] (03PS1) 10Marostegui: db2124: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/922112 [12:15:00] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:15:15] (03CR) 10Marostegui: [C: 03+2] db2124: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/922112 (owner: 10Marostegui) [12:15:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/922108 (owner: 10Slyngshede) [12:15:58] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Signup: Custom Captcha function. [software/bitu] - 10https://gerrit.wikimedia.org/r/922108 (owner: 10Slyngshede) [12:16:45] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review: Deal with archival of Stretch on Debian mirrors - https://phabricator.wikimedia.org/T335282 (10hashar) [12:17:34] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:18:12] (03CR) 10Muehlenhoff: C:idm enable custom captcha generator. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922111 (owner: 10Slyngshede) [12:18:48] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:18:50] (03CR) 10Slyngshede: [C: 03+2] C:idm enable custom captcha generator. 
[puppet] - 10https://gerrit.wikimedia.org/r/922111 (owner: 10Slyngshede) [12:19:28] PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [12:19:45] (03PS3) 10Slyngshede: C:idm enable custom captcha generator. [puppet] - 10https://gerrit.wikimedia.org/r/922111 [12:19:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host debmonitor1003.eqiad.wmnet with OS bookworm [12:20:21] (03CR) 10Slyngshede: [C: 03+2] C:idm enable custom captcha generator. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922111 (owner: 10Slyngshede) [12:20:32] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqsin.wikimedia.org, port=443): Read timed out. (read timeout=15)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [12:20:48] RECOVERY - Recursive DNS on 103.102.166.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [12:20:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host debmonitor2003.codfw.wmnet with OS bookworm [12:21:32] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:21:53] (03CR) 10Hashar: [C: 03+1] "Feel free to deploy anytime. 
The sole reason I reverted it previously was that because the original commit got merged in but not deployed " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921561 (owner: 10Matthias Mullie) [12:22:26] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:24:24] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:25:16] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:25:39] (03CR) 10DCausse: [C: 03+2] search: Add alert based on age of titlesuggest indices [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [12:27:46] (03Merged) 10jenkins-bot: search: Add alert based on age of titlesuggest indices [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [12:28:58] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:30:32] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:32:12] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:32:14] (03PS1) 10Stevemunene: Create 
stat user home directory [puppet] - 10https://gerrit.wikimedia.org/r/922115 (https://phabricator.wikimedia.org/T336036) [12:33:52] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:35:04] PROBLEM - NTP peers on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [12:36:28] RECOVERY - NTP peers on dns5004 is OK: NTP OK: Offset 4.6e-05 secs https://wikitech.wikimedia.org/wiki/NTP [12:36:42] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41258/console" [puppet] - 10https://gerrit.wikimedia.org/r/922115 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [12:36:54] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:37:02] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:39:31] (03CR) 10Muehlenhoff: Create stat user home directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922115 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [12:39:52] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:40:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on debmonitor2003.codfw.wmnet with reason: host reimage [12:41:33] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:44:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on debmonitor2003.codfw.wmnet with reason: host reimage [12:47:25] (03CR) 10Ottomata: "One comment but LGTM, feel free to merge after that is fixed." [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [12:47:32] (03CR) 10Ottomata: "Or ping me for merge." [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [12:55:32] 10SRE, 10SRE-tools, 10Discovery-Search, 10Infrastructure-Foundations, 10Spicerack: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (10Gehel) [12:57:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host debmonitor2003.codfw.wmnet with OS bookworm [12:58:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:00:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10jbond) @kzimmerman Have they been re-hired as a contractor or full time employee. If the former can you confirm the contract expiry data and would you be... 
[13:10:00] 10SRE, 10LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (10jbond) @KFrancis are you able to confirm the appropriate NDA has been signed @Reedy Are you able to approve this request [13:11:33] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10jbond) @KFrancis are you able to confirm or arrange an NDA for @Nux thanks [13:14:49] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 3 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10jbond) 05In progress→03Resolved > Now that each user has been supplied with the credentials, I wi... [13:15:05] 10SRE, 10LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (10jbond) 05Open→03In progress p:05Triage→03Medium [13:15:25] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10jbond) 05Open→03In progress p:05Triage→03Medium [13:20:35] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:25] (03CR) 10Elukey: [C: 03+2] varnishkafka: remove absented logster integration [puppet] - 10https://gerrit.wikimedia.org/r/919802 (owner: 10Majavah) [13:31:01] (03CR) 10Ottomata: mw-page-content-change-enrich: enable HA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [13:31:20] (03CR) 10JMeybohm: [V: 03+1] Make kubernetes::clusters the central place for k8s config (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:33:11] (03PS1) 10JMeybohm: miscweb: Fix multiple istio gateways in one namespace 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/922120 [13:33:13] (03PS1) 10JMeybohm: modules.ingress: Introduce istio_1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922121 [13:33:15] (03PS1) 10JMeybohm: modules.ingress: Change default gateway host [deployment-charts] - 10https://gerrit.wikimedia.org/r/922122 [13:33:17] (03PS1) 10JMeybohm: miscweb: Update ingress.istio module to 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922123 [13:33:19] (03PS1) 10JMeybohm: miscweb: Fix multiple istio gateways in one namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/922124 [13:41:51] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (10cmooney) >>! In T337057#8869828, @ayounsi wrote: > I also like option 5 (hard-coding the conditional in Jinja to n... [13:45:03] (03CR) 10Klausman: [C: 03+1] helmfile.d: add Lift Wing's revert risk model server to api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/922073 (https://phabricator.wikimedia.org/T333124) (owner: 10Elukey) [13:45:05] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:14] (03CR) 10JMeybohm: "Please take a look at the following changes in the chain as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/922120 (owner: 10JMeybohm) [13:47:10] 10Puppet, 10Horizon: Allow providing a commit message for hieradata changes - https://phabricator.wikimedia.org/T250623 (10joanna_borun) [13:47:19] 10Puppet, 10Wikimedia Meet: Puppetize the jitsi instance - https://phabricator.wikimedia.org/T251040 (10joanna_borun) [13:47:38] 10Puppet, 10Horizon: Preserve formatting etc. 
in horizon hiera editor - https://phabricator.wikimedia.org/T250622 (10joanna_borun) [13:47:49] 10Puppet, 10Beta-Cluster-Infrastructure: puppetmaster config in deployment-prep may be inadvertently breaking store,logstash reports? - https://phabricator.wikimedia.org/T218175 (10joanna_borun) [13:48:07] 10Puppet, 10Cloud-VPS, 10MediaWiki-Vagrant: Vagrant -> mwvagrant alias in role::labs::mediawiki_vagrant is brittle - https://phabricator.wikimedia.org/T195592 (10joanna_borun) [13:50:17] 10Puppet, 10Beta-Cluster-Infrastructure, 10Product-Infrastructure-Team-Backlog-Deprecated, 10VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deploy... - https://phabricator.wikimedia.org/T259812 [13:51:05] 10SRE, 10Data-Services, 10Infrastructure-Foundations, 10Puppet-Core, 10cloud-services-team: clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10joanna_borun) [13:51:22] (03PS2) 10JMeybohm: miscweb: Fix multiple istio gateways in one namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/922124 [13:51:35] 10Puppet, 10Cloud-VPS, 10Data-Persistence, 10Thumbor, and 2 others: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10joanna_borun) [13:52:02] 10Puppet, 10Cloud-VPS, 10cloud-services-team: Remove prod-specific bits from cloud puppetmasters - https://phabricator.wikimedia.org/T309281 (10joanna_borun) [13:52:18] (03PS11) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [13:53:08] (03CR) 10CI reject: [V: 04-1] mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [13:53:27] 
(03PS12) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [13:54:31] (03CR) 10CI reject: [V: 04-1] mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [13:56:29] (03PS1) 10Muehlenhoff: Add debmonitor[12]003 as additional scap targets [puppet] - 10https://gerrit.wikimedia.org/r/922126 (https://phabricator.wikimedia.org/T241049) [14:03:10] 10SRE, 10ops-codfw, 10Data-Persistence-Backup: Degraded RAID on backup2010 - https://phabricator.wikimedia.org/T337174 (10Jhancock.wm) @jcrespo I found three errors in the lifecycle logs. (they've all cleared currently) 2023-05-21 07:05:13 PDR5 Disk 2 in Backplane 2 of Integrated RAID Controller 1 is r... [14:05:50] (03PS1) 10Slyngshede: Signup: Better wording in confirmation flow. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/922128 [14:06:11] (03CR) 10Bking: [C: 03+2] flink-session-cluster: fix prom reporter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/920781 (https://phabricator.wikimedia.org/T336872) (owner: 10DCausse) [14:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:55] (03Merged) 10jenkins-bot: flink-session-cluster: fix prom reporter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/920781 (https://phabricator.wikimedia.org/T336872) (owner: 10DCausse) [14:10:09] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:10:11] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:14:02] (03PS1) 10Bking: flink-session-cluster: Increment chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/922133 (https://phabricator.wikimedia.org/T336872) [14:14:44] (03CR) 10DCausse: [C: 03+1] flink-session-cluster: Increment chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/922133 (https://phabricator.wikimedia.org/T336872) (owner: 10Bking) [14:15:02] thcipriani: FYI, https://wikitech.wikimedia.org/wiki/Special:Diff/2079237 — added this week's deployment cal entries after figuring out why `make-deployments-calendar` was failing (afaics its because it couldn't get the primary for T330216 ?) 
[14:15:03] T330216: 1.41.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T330216 [14:15:50] (03CR) 10Bking: [C: 03+2] flink-session-cluster: Increment chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/922133 (https://phabricator.wikimedia.org/T336872) (owner: 10Bking) [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:50] (03Merged) 10jenkins-bot: flink-session-cluster: Increment chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/922133 (https://phabricator.wikimedia.org/T336872) (owner: 10Bking) [14:19:16] (03PS13) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [14:19:17] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T337247 (10phaultfinder) [14:31:11] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:32:37] (03PS1) 10Ottomata: Undeploy flink-operator and uncreate service namespace in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922138 (https://phabricator.wikimedia.org/T333464) [14:32:44] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:35:41] 10SRE, 10Infrastructure-Foundations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10jbond) [14:36:28] (03PS2) 10Herron: arclamp: switch redis server to arclamp1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920298 (https://phabricator.wikimedia.org/T327277) [14:40:22] 10SRE, 10ops-codfw, 10Data-Persistence-Backup: Degraded RAID on backup2010 - 
https://phabricator.wikimedia.org/T337174 (10jcrespo) > Is it safe to do that now? For context, the host is pending to be put into production yet, there is no production data or software there other than the base os install, so you... [14:46:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:46:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:47:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.314 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:48:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49992 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:48:43] 10SRE, 10Data-Engineering, 10Shared-Data-Infrastructure, 10serviceops: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (10Ottomata) [14:50:02] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:51:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:23] 10SRE, 10Data-Engineering, 10Shared-Data-Infrastructure, 10serviceops: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (10elukey) Some info - the kafka mirror maker's cergen TLS cert has `kafka_mirror_maker` as CN, that is used as "username" in Kafka ACLs: `... [15:04:01] (03CR) 10Hashar: "In last patch I have added a test to raise coverage." [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/920708 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [15:09:39] (03CR) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [15:10:32] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "New debmonitor VMs - jmm@cumin2002 - T241049" [15:12:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "New debmonitor VMs - jmm@cumin2002 - T241049" [15:12:28] (03PS1) 10Kamila Součková: benthos: create image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/922142 (https://phabricator.wikimedia.org/T336658) [15:14:13] (03CR) 10Btullis: Grant stat1009 access to cloud dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922091 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [15:15:56] (03PS14) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [15:19:39] (03PS1) 10Filippo Giunchedi: thanos: use idle_timeout instead of route timeout for long-running requests [puppet] - 
10https://gerrit.wikimedia.org/r/922144 (https://phabricator.wikimedia.org/T337251) [15:23:51] (03CR) 10Ayounsi: [C: 03+2] Validators: improve device name, add interface/outlet (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [15:24:56] (03Merged) 10jenkins-bot: Validators: improve device name, add interface/outlet [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [15:25:06] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox-canary [15:25:51] 10Puppet, 10Beta-Cluster-Infrastructure, 10Product-Infrastructure-Team-Backlog-Deprecated, 10VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deploy... - https://phabricator.wikimedia.org/T259812 [15:25:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox-canary [15:26:14] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox [15:30:04] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230522T1530). 
[15:31:34] (03CR) 10Herron: [C: 03+1] "good idea LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/922144 (https://phabricator.wikimedia.org/T337251) (owner: 10Filippo Giunchedi) [15:32:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox [15:33:18] (03CR) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [15:33:20] (03CR) 10Ayounsi: [C: 03+2] Netbox prod: add poweroutlet validator [puppet] - 10https://gerrit.wikimedia.org/r/920691 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [15:34:54] (03PS15) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [15:36:31] (03PS1) 10Muehlenhoff: debmonitor::server: Add bookworm support [puppet] - 10https://gerrit.wikimedia.org/r/922145 [15:38:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:39:13] (03CR) 10MVernon: [C: 03+1] "Seems like a sensible idea." 
[puppet] - 10https://gerrit.wikimedia.org/r/922144 (https://phabricator.wikimedia.org/T337251) (owner: 10Filippo Giunchedi) [15:43:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:44:45] (03PS3) 10Ayounsi: Configure mgmt_junos on L2 switches [homer/public] - 10https://gerrit.wikimedia.org/r/920311 (https://phabricator.wikimedia.org/T320244) [15:45:55] (03PS2) 10Stevemunene: Grant stat1009 access to cloud dumps [puppet] - 10https://gerrit.wikimedia.org/r/922091 (https://phabricator.wikimedia.org/T336036) [15:46:25] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T337247 (10Jhancock.wm) server found with a rapidly blinking amber idrac and no connection on the mgmt port. according to the blink code the system needs a reboot. I attempted to reboot just the idrac but it did not fix the issue. 
[15:47:52] (03PS3) 10Cathal Mooney: Disable IPv6 RA generation on spine layer switches [homer/public] - 10https://gerrit.wikimedia.org/r/921400 (https://phabricator.wikimedia.org/T337057) [15:47:55] 10SRE, 10Infrastructure Security, 10User-MoritzMuehlenhoff: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (10jbond) [15:49:49] (03CR) 10Stevemunene: Grant stat1009 access to cloud dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922091 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [15:49:52] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Traffic: allow non-roots to pool/depool certain DNS Discovery services - https://phabricator.wikimedia.org/T250557 (10jbond) [15:53:25] (03CR) 10Vgutierrez: Create cookbook to upgrade Apache Traffic Server (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [15:55:58] (03CR) 10Eevans: [C: 03+1] thanos: use idle_timeout instead of route timeout for long-running requests [puppet] - 10https://gerrit.wikimedia.org/r/922144 (https://phabricator.wikimedia.org/T337251) (owner: 10Filippo Giunchedi) [15:56:21] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs2009.codfw.wmnet [15:57:50] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs2009.codfw.wmnet [15:59:22] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [16:02:24] 10SRE, 10ops-codfw, 10Data-Persistence-Backup: Degraded RAID on backup2010 - https://phabricator.wikimedia.org/T337174 (10Jhancock.wm) @jcrespo thanks, that is reassuring. Trying to be careful and diligent. I checked the SATA cables and they appear fine. I updated the raid controller firmware and ran a har... 
[16:04:05] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/922126 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [16:05:11] (03PS1) 10Jbond: x509-bundle: skip popping first if we have an empty list [puppet] - 10https://gerrit.wikimedia.org/r/922147 (https://phabricator.wikimedia.org/T283001) [16:05:22] 10SRE, 10SRE-Unowned, 10Patch-For-Review: x509-bundle as used by envoy::tlsproxy fails on single certificate file - https://phabricator.wikimedia.org/T283001 (10jbond) [16:07:42] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for Manuel - https://phabricator.wikimedia.org/T336841 (10Manuel) Thank you, @CDanis! It's working well! :) [16:08:07] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/920311 (https://phabricator.wikimedia.org/T320244) (owner: 10Ayounsi) [16:11:33] (03CR) 10Slyngshede: debmonitor::server: Add bookworm support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922145 (owner: 10Muehlenhoff) [16:11:39] (03CR) 10Slyngshede: [C: 04-1] debmonitor::server: Add bookworm support [puppet] - 10https://gerrit.wikimedia.org/r/922145 (owner: 10Muehlenhoff) [16:17:51] PROBLEM - Host wdqs2009 is DOWN: PING CRITICAL - Packet loss = 100% [16:18:52] ACKNOWLEDGEMENT - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brian_King Firmware update reboot T331297 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:18:52] ACKNOWLEDGEMENT - SSH on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brian_King Firmware update reboot T331297 https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:18:52] ACKNOWLEDGEMENT - Host wdqs2009 is DOWN: PING CRITICAL - Packet loss = 100% Brian_King Firmware update reboot T331297 [16:20:41] (03PS2) 10Stevemunene: Create stat user home directory [puppet] - 10https://gerrit.wikimedia.org/r/922115 (https://phabricator.wikimedia.org/T336036) [16:21:04] (03CR) 
10CI reject: [V: 04-1] Create stat user home directory [puppet] - 10https://gerrit.wikimedia.org/r/922115 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [16:21:34] 10SRE-swift-storage, 10Commons: Some or all of the undeletion failed: The file "mwstore://local-multiwrite/local-public/d/d7/Elizabeth_Sombart,_February,_2023.jpg" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T331800 (10MarcoAurelio) Feels like T244567. [16:22:00] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: "Error undeleting file: A non-identical file already exists at "mwstore://local-swift-eqiad/local-public/..." while restoring a file on Commons - https://phabricator.wikimedia.org/T258938 (10MarcoAurelio) Seems T244567. [16:23:42] (03PS3) 10Stevemunene: Create stat user home directory [puppet] - 10https://gerrit.wikimedia.org/r/922115 (https://phabricator.wikimedia.org/T336036) [16:26:42] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41259/console" [puppet] - 10https://gerrit.wikimedia.org/r/922115 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [16:30:43] (03CR) 10Stevemunene: [V: 03+1] Create stat user home directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922115 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [16:31:27] 10SRE-swift-storage, 10Commons: Renaming doesn't work - https://phabricator.wikimedia.org/T337231 (10MarcoAurelio) Seems T244567. FileBackendMultiWrite::doOperationsInternal: failed sync check: ["mwstore://local-multiwrite/local-public/3/3a/August\u00edi_Julbe.jpg","mwstore://local-multiwrite/local-public/b... 
[16:32:51] RECOVERY - Host wdqs2009 is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms [16:35:51] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2009.codfw.wmnet [16:35:53] !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs2009.codfw.wmnet [16:37:54] 10SRE, 10Discovery-Search, 10Datacenter-Switchover: Warn when CirrusSearch is not configured to use local DC for an extended time - https://phabricator.wikimedia.org/T204135 (10jbond) > Re-opening tasks and removing from team workboard per IRC feedback given yesterday and discussion with MPham. Are you able... [16:39:06] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/919365 (https://phabricator.wikimedia.org/T216380) (owner: 10Dzahn) [16:43:09] (03CR) 10Jbond: debmonitor::server: Add bookworm support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922145 (owner: 10Muehlenhoff) [16:46:30] 10SRE, 10Data-Engineering, 10Shared-Data-Infrastructure, 10serviceops: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (10jbond) > maybe @jbond can chime in with some suggestions Happy to, but I may need some more context :) Specifically, what is the endpoint...
[16:53:18] (03CR) 10Ayounsi: [C: 03+2] Configure mgmt_junos on L2 switches [homer/public] - 10https://gerrit.wikimedia.org/r/920311 (https://phabricator.wikimedia.org/T320244) (owner: 10Ayounsi) [16:53:53] (03Merged) 10jenkins-bot: Configure mgmt_junos on L2 switches [homer/public] - 10https://gerrit.wikimedia.org/r/920311 (https://phabricator.wikimedia.org/T320244) (owner: 10Ayounsi) [16:58:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:58:44] !log push mgmt_junos to all L2 switches [16:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:56] ^ it's basically a bug that bast2002 is alerting [16:59:04] it should have been decom'ed by cookbook [16:59:20] but for some reason wasn't removed from alerting [17:00:08] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230522T1700) [17:00:08] ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230522T1700).
[17:03:54] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@5ee7a62]: (no justification provided) [17:04:11] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@5ee7a62]: (no justification provided) (duration: 00m 17s) [17:06:03] (03PS2) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [17:08:39] (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [17:09:19] (03PS3) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [17:09:53] 10SRE, 10Data-Engineering, 10Shared-Data-Infrastructure, 10serviceops: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (10Ottomata) > specifically what is the endpoint that theses certs authenticate to Kafka brokers > is that allready managed by pki yup! Th... 
[17:11:58] (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [17:14:06] (03PS4) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [17:17:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/922128 (owner: 10Slyngshede) [17:18:29] (03CR) 10BCornwall: Create cookbook to upgrade Apache Traffic Server (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [17:18:59] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/922115 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [17:20:43] PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [17:20:48] that's me ^ [17:21:01] ack [17:24:10] (03CR) 10BCornwall: Create cookbook to upgrade Apache Traffic Server (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [17:29:57] RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 260.27 ms [17:51:55] PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:52:08] me ^ [17:53:03] RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.96 ms [18:02:48] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Signup: Better wording in confirmation flow. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/922128 (owner: 10Slyngshede) [18:17:37] 10SRE-swift-storage, 10Commons: Renaming doesn't work - https://phabricator.wikimedia.org/T337231 (10Umherirrender) [18:17:43] (03PS1) 10Stevemunene: Set mariadb-client to pull the right version [puppet] - 10https://gerrit.wikimedia.org/r/922158 (https://phabricator.wikimedia.org/T336036) [18:23:09] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41260/console" [puppet] - 10https://gerrit.wikimedia.org/r/922158 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [18:43:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/922158 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [18:47:27] (03CR) 10Gmodena: [C: 03+1] "@ottomata: your patchset LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [18:50:02] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:52:32] 10SRE, 10Infrastructure-Foundations, 10netops: Junos: use mgmt_junos for syslog and ntp - https://phabricator.wikimedia.org/T320244 (10ayounsi) 05Open→03Resolved a:03ayounsi All done where possible. [18:52:42] 10SRE, 10Infrastructure-Foundations, 10netops: Junos: resolve DNS through mgmt_junos - https://phabricator.wikimedia.org/T317175 (10ayounsi) All done where possible. 
[18:53:20] (03PS1) 10Ayounsi: Introduce mgmt_junos variable [homer/public] - 10https://gerrit.wikimedia.org/r/922161 (https://phabricator.wikimedia.org/T327862) [18:58:07] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Use mgmt_junos on all network devices - https://phabricator.wikimedia.org/T327862 (10ayounsi) the fasw and asw1-eqsin switches didn't create the `mgmt_junos` routing instance as they should have. https://gerrit.wikimedia.org/r/922161 works... [18:59:45] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 (10ayounsi) [19:00:33] (03CR) 10Ayounsi: "[infra wide diff in progress from my laptop, will update once done]" [homer/public] - 10https://gerrit.wikimedia.org/r/922161 (https://phabricator.wikimedia.org/T327862) (owner: 10Ayounsi) [19:02:19] (03CR) 10Ayounsi: [C: 03+1] Disable IPv6 RA generation on spine layer switches [homer/public] - 10https://gerrit.wikimedia.org/r/921400 (https://phabricator.wikimedia.org/T337057) (owner: 10Cathal Mooney) [19:17:56] 10SRE, 10LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (10KFrancis) Hi all, I will need the volunteer's full name, mailing address, and email to process the NDA. Please send the following information to: kfrancis@wikimedia.org. Thank you! 
[19:18:14] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:18:19] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:18:43] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:18:44] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:18:47] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10KFrancis) Hi all, I will need the volunteer's full name, mailing address, and email to process the NDA. Please send the following information to: kfrancis@wikimedia.org. Thank you! [19:20:10] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:20:12] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:22:11] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:22:13] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:31:34] (03CR) 10Btullis: [C: 03+1] Set mariadb-client to pull the right version [puppet] - 10https://gerrit.wikimedia.org/r/922158 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [19:42:59] (03CR) 10Ayounsi: "Changes for 3 devices: ['cr2-eqdfw.wikimedia.org', 'cr2-eqord.wikimedia.org', 'cr3-knams.wikimedia.org']" [homer/public] - 10https://gerrit.wikimedia.org/r/922161 (https://phabricator.wikimedia.org/T327862) (owner: 10Ayounsi) [19:44:42] (03CR) 10Raymond Ndibe: maintain-dbusers: use click for cli definition (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [19:46:46] (03CR) 10Raymond Ndibe: [C: 03+1] maintain-dbusers: use click for cli definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [19:50:10] (03CR) 10Ayounsi: [C: 03+2] "Going to self merge it to not keep outstanding diffs for too long, feel free to do a post merge review if needed." [homer/public] - 10https://gerrit.wikimedia.org/r/922161 (https://phabricator.wikimedia.org/T327862) (owner: 10Ayounsi) [19:50:45] (03Merged) 10jenkins-bot: Introduce mgmt_junos variable [homer/public] - 10https://gerrit.wikimedia.org/r/922161 (https://phabricator.wikimedia.org/T327862) (owner: 10Ayounsi) [19:53:10] (03PS3) 10Samtar: [ruwiki] Add 'abusefilter log/view private' flags to ArbCom [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921764 (https://phabricator.wikimedia.org/T336625) (owner: 10Superpes15) [19:58:08] (03PS1) 10Ayounsi: mgmt_junos: create default route if no mgmt_junos configured [homer/public] - 10https://gerrit.wikimedia.org/r/922164 (https://phabricator.wikimedia.org/T327862) [19:59:03] (03CR) 10Ayounsi: [C: 03+2] mgmt_junos: create default route if no mgmt_junos configured [homer/public] - 10https://gerrit.wikimedia.org/r/922164 (https://phabricator.wikimedia.org/T327862) (owner: 10Ayounsi) [19:59:37] (03Merged) 10jenkins-bot: mgmt_junos: create default route if no mgmt_junos configured [homer/public] - 10https://gerrit.wikimedia.org/r/922164 (https://phabricator.wikimedia.org/T327862) (owner: 10Ayounsi) [19:59:53] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 294 probes of 706 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, 
kindrobot, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230522T2000). [20:00:05] Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] * TheresNoTime can deploy [20:00:35] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 128 probes of 791 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:01:52] TheresNoTime, Superpes: have them rights ever been granted to other groups before [20:02:34] RhinosF1: ref T336625 ? [20:02:34] T336625: Add abusefilter-view/log-private rights for arbcom user group on ruwiki - https://phabricator.wikimedia.org/T336625 [20:02:46] TheresNoTime: yes [20:04:11] RhinosF1: seems so, T256572 [20:04:12] T256572: Creation of a new user group on plwikipedia - https://phabricator.wikimedia.org/T256572 [20:04:33] Superpes: waiting for you, ping me when you're available [20:04:40] Hi TheresNoTime I'm here :) [20:05:21] TheresNoTime: seems fine yes [20:05:27] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 43 probes of 706 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:05:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921764 (https://phabricator.wikimedia.org/T336625) (owner: 10Superpes15) [20:06:01] Yep I saw previous patches in which this right was added! 
Btw ArbCom shouldn't have problem in having private rights [20:06:01] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 791 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:06:18] (03Merged) 10jenkins-bot: [ruwiki] Add 'abusefilter log/view private' flags to ArbCom [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921764 (https://phabricator.wikimedia.org/T336625) (owner: 10Superpes15) [20:06:35] !log samtar@deploy1002 Started scap: Backport for [[gerrit:921764|[ruwiki] Add 'abusefilter log/view private' flags to ArbCom (T336625)]] [20:07:27] Superpes: doesn't seem to be any particular rules around 'private' rights that aren't CU/OS and for the ones that do exist like deleted content it's an RFA-like process. Arbcom is an elected role and elections I assume are properly done so I can't see any issue. [20:08:04] !log samtar@deploy1002 superpes and samtar: Backport for [[gerrit:921764|[ruwiki] Add 'abusefilter log/view private' flags to ArbCom (T336625)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:08:09] T336625: Add abusefilter-view/log-private rights for arbcom user group on ruwiki - https://phabricator.wikimedia.org/T336625 [20:08:20] Yep agree [20:08:21] Superpes: live on mwdebug, can you confirm the group change? 
[20:08:24] Testing [20:08:37] * RhinosF1 was just crossing his t's and dotting his i's [20:09:03] it's appreciated ^^ [20:09:04] TheresNoTime Yep it's fine [20:09:13] syncin' [20:09:55] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs[2010-2011].codfw.wmnet [20:10:03] (03PS2) 10Samtar: [kaawiki] Enable SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921765 (https://phabricator.wikimedia.org/T336648) (owner: 10Superpes15) [20:11:19] !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs[2010-2011].codfw.wmnet [20:14:50] (03PS1) 10TChin: Fix python logging in flink-app [deployment-charts] - 10https://gerrit.wikimedia.org/r/922165 [20:14:57] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:921764|[ruwiki] Add 'abusefilter log/view private' flags to ArbCom (T336625)]] (duration: 08m 22s) [20:15:02] T336625: Add abusefilter-view/log-private rights for arbcom user group on ruwiki - https://phabricator.wikimedia.org/T336625 [20:15:12] Superpes: that one is live, moving to 921765 [20:15:25] TheresNoTime Yep thanks :) [20:15:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921765 (https://phabricator.wikimedia.org/T336648) (owner: 10Superpes15) [20:15:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:16:12] (03Merged) 10jenkins-bot: [kaawiki] Enable SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921765 (https://phabricator.wikimedia.org/T336648) (owner: 10Superpes15) [20:16:31] !log samtar@deploy1002 Started scap: Backport for [[gerrit:921765|[kaawiki] Enable SandboxLink 
extension (T336648)]] [20:16:36] T336648: Enable Sandbox extension for Karakalpak Wikipedia (kaa.wikipedia.org) - https://phabricator.wikimedia.org/T336648 [20:17:48] !log samtar@deploy1002 samtar and superpes: Backport for [[gerrit:921765|[kaawiki] Enable SandboxLink extension (T336648)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:18:07] Superpes: ^ [20:18:10] Yep it works too TheresNoTime :) [20:18:39] (03PS16) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [20:18:54] and still seeing a lot of `Data for ftr_namespace and ftr_title must be non-empty` — guessing its still not been fixed? [20:19:18] (03CR) 10CI reject: [V: 04-1] mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [20:19:42] ah yup T336980 [20:19:42] T336980: InvalidArgumentException: Data for ftr_namespace and ftr_title must be non-empty - https://phabricator.wikimedia.org/T336980 [20:20:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:22:51] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:24:19] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:921765|[kaawiki] Enable SandboxLink extension (T336648)]] (duration: 07m 47s) [20:24:24] T336648: Enable Sandbox extension for Karakalpak Wikipedia (kaa.wikipedia.org) - 
https://phabricator.wikimedia.org/T336648 [20:24:28] Superpes: and that's live too [20:24:44] 10SRE, 10Infrastructure-Foundations, 10netops: Use mgmt_junos on all network devices - https://phabricator.wikimedia.org/T327862 (10ayounsi) 05Open→03Resolved a:03ayounsi Going to close this task as this is as far as we can go due to the fasw switches not being easily upgraded. [20:25:03] TheresNoTime Many thanks :D [20:25:11] ^^ [20:25:33] 10SRE, 10Infrastructure-Foundations, 10netops: Junos: resolve DNS through mgmt_junos - https://phabricator.wikimedia.org/T317175 (10ayounsi) 05Open→03Resolved a:03ayounsi [20:26:14] (03PS17) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [20:27:06] (03CR) 10CI reject: [V: 04-1] mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [20:27:56] !log close UTC late backport window [20:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:35] (03CR) 10Ottomata: "really! Cool. Chart change requires Chart.yaml version bump." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/922165 (owner: 10TChin) [20:30:11] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:33:04] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts labstore1005.eqiad.wmnet [20:33:58] (03PS2) 10TChin: Fix python logging in flink-app [deployment-charts] - 10https://gerrit.wikimedia.org/r/922165 [20:36:56] (03PS1) 10Andrew Bogott: Remove references to labstore100[45] [puppet] - 10https://gerrit.wikimedia.org/r/922168 (https://phabricator.wikimedia.org/T337269) [20:39:43] (03CR) 10Andrew Bogott: [C: 03+2] Remove references to labstore100[45] [puppet] - 10https://gerrit.wikimedia.org/r/922168 (https://phabricator.wikimedia.org/T337269) (owner: 10Andrew Bogott) [20:40:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) Configured raid on an-worker11(49,51,52,53,54,55,56) Unable to login to management on 50. 
need to verify psu1 on 49 [20:40:25] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [20:40:42] (03CR) 10Ottomata: [C: 03+2] Fix python logging in flink-app [deployment-charts] - 10https://gerrit.wikimedia.org/r/922165 (owner: 10TChin) [20:40:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) [20:41:24] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] query_service: Permit python2 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [20:41:32] (03Merged) 10jenkins-bot: Fix python logging in flink-app [deployment-charts] - 10https://gerrit.wikimedia.org/r/922165 (owner: 10TChin) [20:41:48] (03PS18) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [20:42:23] (03CR) 10CI reject: [V: 04-1] mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [20:42:52] (03PS19) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [20:43:02] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: labstore1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:43:33] (03CR) 10CI reject: [V: 04-1] mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [20:44:12] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: labstore1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:44:12] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:44:13] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labstore1005.eqiad.wmnet [20:45:27] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts labstore1004.eqiad.wmnet [20:48:54] 10ops-eqiad, 10cloud-services-team, 10decommission-hardware, 10Patch-For-Review: decommission labstore100[45].eqiad.wmne - https://phabricator.wikimedia.org/T337269 (10Andrew) a:03Jclark-ctr [20:49:39] 10ops-eqiad, 10cloud-services-team, 10decommission-hardware, 10Patch-For-Review: decommission labstore100[45].eqiad.wmne - https://phabricator.wikimedia.org/T337269 (10Andrew) Two things: 1) The decom script said this during both runs: ` Traceback (most recent call last): File "/srv/deployment/spicera... 
[20:51:01] (03PS20) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [20:51:15] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [20:53:44] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: labstore1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:54:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: labstore1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:54:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:54:57] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labstore1004.eqiad.wmnet [20:55:07] 10ops-eqiad, 10cloud-services-team, 10decommission-hardware, 10Patch-For-Review: decommission labstore100[45].eqiad.wmne - https://phabricator.wikimedia.org/T337269 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `labstore1004.eqiad.wmnet` - labstore1004.eqia... [20:58:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:00:05] Reedy, sbassett, Maryum, and manfredi: OwO what's this, a deployment window?? Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230522T2100). 
nyaa~ [21:01:26] (03PS1) 10Dzahn: httpbb: add tests for planet.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/922170 (https://phabricator.wikimedia.org/T326891) [21:02:20] (03CR) 10Dzahn: [C: 03+2] httpbb: add tests for planet.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/922170 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [21:03:40] (03CR) 10CI reject: [V: 04-1] httpbb: add tests for planet.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/922170 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [21:06:33] (03PS2) 10Dzahn: httpbb: add tests for planet.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/922170 (https://phabricator.wikimedia.org/T326891) [21:08:57] (03CR) 10Dzahn: [C: 03+2] httpbb: add tests for planet.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/922170 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [21:11:31] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10Nux) Sent an e-mail signed with my PGP, fingerprint: `86C84A9B865FDD51FCFB12D2EE3F8013A0DD3792`. 
[21:12:02] (03CR) 10Dzahn: [C: 03+2] "Now this tests all language version and the redirect on all backends:" [puppet] - 10https://gerrit.wikimedia.org/r/922170 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [21:14:01] (03CR) 10Btullis: [C: 03+1] Grant stat1009 access to cloud dumps [puppet] - 10https://gerrit.wikimedia.org/r/922091 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [21:17:28] (03PS3) 10Dzahn: planet: add wikimediastatus.net to English feeds [puppet] - 10https://gerrit.wikimedia.org/r/921105 (https://phabricator.wikimedia.org/T336701) [21:18:26] (03CR) 10Dzahn: [C: 03+2] logstash_checker.py: remove trusty-specific hacks [puppet] - 10https://gerrit.wikimedia.org/r/919365 (https://phabricator.wikimedia.org/T216380) (owner: 10Dzahn) [21:19:31] jouncebot: next [21:19:31] In 4 hour(s) and 40 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T0200) [21:19:34] jouncebot: now [21:19:34] For the next 1 hour(s) and 40 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230522T2100) [21:19:55] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:23:49] (03PS21) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [21:23:57] (03PS22) 10Ottomata: mw-page-content-change-enrich: enable HA and checkpointing [deployment-charts] - 
10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [21:24:09] Hey all - would like to deploy two quick updates to PrivateSettings.php for today’s security deployment window. Looks like /private/ProfilerSettings.php is untracked. [21:26:21] sbassett: would like to know if everything works normally during deployment because I merged a change to a script that is run by scan on deployment server [21:26:29] scap [21:29:53] mutante: ok :) About to deploy... [21:30:19] alright [21:30:56] applies the change on deploy2002 as well [21:33:40] applied on deploy* and mwmaint*. good to go. all yours. [21:34:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:36:43] mutante: er, ok, I started the scap a couple of minutes ago [21:36:55] It’s just for PS.php [21:37:49] it's ok [21:38:14] !log Deployed security mitigations for T333140 and T336027 [21:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:19:22] (03CR) 10Btullis: [C: 03+1] "I think that this is ready to merge now, but I'm going to wait until tomorrow before I test any further. Giving myself a +1 until then." 
[puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [22:24:47] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:24:51] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:49:00] (03CR) 10Dzahn: [C: 03+1] Move doc-gitlab rsync endpoint to doc1002 (primary) [puppet] - 10https://gerrit.wikimedia.org/r/921429 (owner: 10EoghanGaffney) [22:50:02] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:54:19] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [22:55:45] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [23:00:29] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:00:31] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:08:15] jouncebot: nowandnext [23:08:16] No deployments scheduled for the next 2 hour(s) and 51 minute(s) 
[23:08:16] In 2 hour(s) and 51 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230523T0200) [23:08:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921614 (owner: 10Zabe) [23:09:26] (03Merged) 10jenkins-bot: Enable VE on new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921614 (owner: 10Zabe) [23:09:42] !log zabe@deploy1002 Started scap: Backport for [[gerrit:921614|Enable VE on new wikis]] [23:09:49] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:09:49] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:11:03] !log zabe@deploy1002 zabe: Backport for [[gerrit:921614|Enable VE on new wikis]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [23:16:40] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:921614|Enable VE on new wikis]] (duration: 06m 58s) [23:28:21] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:28:23] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:48:27] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 
185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:48:27] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:49:59] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:50:01] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down