[00:00:34] (ProbeDown) firing: (6) Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:00:58] RECOVERY - Check systemd state on centrallog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:59] (SwaggerProbeHasFailures) firing: (4) Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:01:09] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 35.26% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:01:15] (ProbeDown) firing: (2) Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:01:20] RECOVERY - Check systemd state on centrallog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:44] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:02:34] (KubernetesCalicoDown) firing: (66) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:02:55] (CR) Cwhite: Filter errors originating in external tools (1 comment) [puppet] - https://gerrit.wikimedia.org/r/975836 (https://phabricator.wikimedia.org/T349935) (owner: Jdlrobson) [00:03:07] (ProbeDown) firing: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:04:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - recommendation-api_4632: Servers kubernetes2046.codfw.wmnet, kubernetes2025.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2032.codfw.wmnet, kubernetes2052.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes2028.codfw.wmnet, kubernetes2059.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2049.codfw.wmnet, kubernetes2 [00:04:12] w.wmnet, kubernetes2019.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2044.codfw.wmnet, kubernetes2041.codfw.wmnet, kubernetes2042.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2031.codfw.wmnet are marked down but pooled: push-notifications_4104: Servers kubernetes2046.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2058.codfw.wmnet, kubernetes2025.codfw.wmnet, kubernetes2033.codfw.wmnet, kubernete [00:04:12] dfw.wmnet, kubernetes2034.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2038.codfw.wmnet, kubern https://wikitech.wikimedia.org/wiki/PyBal [00:04:13] (ProbeDown) 
firing: (10) Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:04:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:04:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:05:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:05:19] (ProbeDown) firing: (10) Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:05:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [00:06:00] (SwaggerProbeHasFailures) firing: (6) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:06:15] (ProbeDown) 
firing: (16) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:06:22] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:06:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:07:09] (MediaWikiLatencyExceeded) firing: Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:07:14] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:07:36] (KubernetesCalicoDown) firing: (61) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:07:38] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-jobrunner_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [00:08:07] (ProbeDown) firing: (2) 
Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:09:11] (ProbeDown) resolved: (10) Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:10:00] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:10:12] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:10:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:10:59] (SwaggerProbeHasFailures) firing: (6) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:11:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 40.89% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:11:15] (ProbeDown) resolved: (14) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:12:09] (MediaWikiLatencyExceeded) 
resolved: Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:12:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [00:13:00] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:13:14] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:13:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:14:08] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:14:11] (ProbeDown) firing: (10) Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:14:42] PROBLEM - PyBal 
backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers kubernetes2058.codfw.wmnet, kubernetes2053.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2032.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2038.codfw.wmnet, kubernetes2024.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2033.codfw.wmnet, kubernetes2014.codf [00:14:42] kubernetes2049.codfw.wmnet, kubernetes2029.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2023.codfw.wmnet, kubernetes2031.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2055.codfw.wmnet, kubernetes2044.codfw.wmnet, kubernetes2051.codfw.wmnet, kubernetes2057.codfw.wmnet are marked down but pooled: mw-api-int_4446: Servers kubernetes2010.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2043.codfw.wmne [00:14:42] netes2025.codfw.wmnet, kubernetes2050.codfw.wmnet, kubernetes2029.codfw.wmnet, kubernetes2040.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2054.codfw.w https://wikitech.wikimedia.org/wiki/PyBal [00:15:19] (ProbeDown) firing: (10) Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:15:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers kubernetes2007.codfw.wmnet, kubernetes2058.codfw.wmnet, kubernetes2053.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2054.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2026.codfw.wmnet, kubernetes2042.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2036.codf [00:15:20] kubernetes2040.codfw.wmnet, kubernetes2005.codfw.wmnet, 
kubernetes2057.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2033.codfw.wmnet, kubernetes2027.codfw.wmnet, kubernetes2031.codfw.wmnet are marked down but pooled: mw-web_4450: Servers kubernetes2046.codfw.wmnet, kubernetes2056.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2038.codfw.wmnet, kubernetes2011.codfw.wmnet, k [00:15:20] s2026.codfw.wmnet, kubernetes2052.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2042.codfw.wmnet, kubernetes2059.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2047.codfw.wmnet https://wikitech.wikimedia.org/wiki/PyBal [00:16:00] (SwaggerProbeHasFailures) firing: (4) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:16:15] (ProbeDown) firing: (20) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:16:46] (MediaWikiHighErrorRate) firing: (3) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:16:54] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-jobrunner_hourly.service,httpbb_kubernetes_mw-web_hourly.service,httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:17:09] (MediaWikiLatencyExceeded) firing: Average latency high: codfw mw-api-ext (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:17:35] (KubernetesCalicoDown) firing: (62) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:18:07] (ProbeDown) resolved: (7) Service eventgate-analytics:4592 has failed probes (http_eventgate-analytics_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:18:18] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:18:26] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 203, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:19:09] (PHPFPMTooBusy) resolved: (3) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 4.808% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:19:10] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:19:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:19:16] (ProbeDown) resolved: (10) Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:19:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [00:19:50] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:20:59] (SwaggerProbeHasFailures) firing: (7) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:21:10] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:21:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:15] (ProbeDown) resolved: (19) Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:21:46] (MediaWikiHighErrorRate) resolved: (3) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:22:09] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw mw-api-ext (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:22:20] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:22:32] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:22:35] (KubernetesCalicoDown) resolved: (62) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:23:32] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:23:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:23:42] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:24:30] PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:24:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [00:25:02] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:25:14] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:25:32] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:25:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [00:25:59] (SwaggerProbeHasFailures) resolved: (6) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:30:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:37:07] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/981444 [00:38:37] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] 
(wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/981444 (owner: TrainBranchBot) [00:43:02] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:58:46] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/981444 (owner: TrainBranchBot) [01:04:26] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:05:24] (PS1) DDesouza: Partially undeploy Reader Demographics 2 survey [mediawiki-config] - https://gerrit.wikimedia.org/r/982178 (https://phabricator.wikimedia.org/T344393) [01:08:49] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T353215 (phaultfinder) [01:09:38] RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:13:30] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:13:36] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:14:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:14:44] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:16:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:17:26] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:17:30] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:17:52] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:19:10] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:21:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:22:06] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:47:32] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:49:02] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:39:10] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T0300) [03:01:08] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:02:28] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:02:42] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:05:32] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:07:13] (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.9 [core] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/982186 (https://phabricator.wikimedia.org/T350085) [03:07:19] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.42.0-wmf.9 [core] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/982186 (https://phabricator.wikimedia.org/T350085) (owner: TrainBranchBot) [03:09:10] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:25:36] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:27:06] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:30:54] (Merged) jenkins-bot: Branch commit for wmf/1.42.0-wmf.9 [core] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/982186 (https://phabricator.wikimedia.org/T350085) (owner: TrainBranchBot) [03:34:32] (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [03:56:54] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:59:56] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T0400) [04:01:32] (PS1) TrainBranchBot: testwikis wikis to 1.42.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/982181 (https://phabricator.wikimedia.org/T350085) [04:01:34] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.42.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/982181 (https://phabricator.wikimedia.org/T350085) (owner: TrainBranchBot) [04:02:18] (Merged) jenkins-bot: testwikis wikis to 1.42.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/982181 (https://phabricator.wikimedia.org/T350085) (owner: TrainBranchBot) [04:02:47] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.9 refs T350085 [04:02:51] T350085: 1.42.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T350085 [04:05:52] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:07:20] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:37:07] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:55:51] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.9 refs T350085 (duration: 53m 03s) [04:55:55] T350085: 1.42.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T350085 [04:58:10] !log mwpresync@deploy2002 Pruned MediaWiki: 1.42.0-wmf.5 (duration: 02m 17s) [05:06:45] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T353215 (10Papaul) 05Open→03Resolved a:03Papaul [05:47:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: pc1 master switch T351787 [05:47:10] (03PS1) 10Marostegui: ProductionServices.php: Promote pc2014 as master of pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982206 (https://phabricator.wikimedia.org/T351787) [05:47:13] T351787: Upgrade pc1 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351787 [05:47:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: pc1 master switch T351787 [05:48:36] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc2014 as master of pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982206 (https://phabricator.wikimedia.org/T351787) (owner: 10Marostegui) [05:49:12] (03PS1) 10Marostegui: pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/982207 (https://phabricator.wikimedia.org/T351787) [05:49:23] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc2014 as master of pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982206 
(https://phabricator.wikimedia.org/T351787) (owner: 10Marostegui) [05:49:49] (03CR) 10Marostegui: [C: 03+2] pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/982207 (https://phabricator.wikimedia.org/T351787) (owner: 10Marostegui) [05:50:59] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:982206|ProductionServices.php: Promote pc2014 as master of pc1 (T351787)]] [05:52:07] (03PS1) 10Marostegui: pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/982208 [05:52:24] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:982206|ProductionServices.php: Promote pc2014 as master of pc1 (T351787)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [05:52:28] T351787: Upgrade pc1 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351787 [05:52:29] !log marostegui@deploy2002 marostegui: Continuing with sync [05:53:01] (03CR) 10Marostegui: [C: 03+2] pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/982208 (owner: 10Marostegui) [05:59:34] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:982206|ProductionServices.php: Promote pc2014 as master of pc1 (T351787)]] (duration: 08m 35s) [05:59:38] T351787: Upgrade pc1 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351787 [06:00:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2011.codfw.wmnet with OS bookworm [06:09:01] PROBLEM - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [250.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 [06:15:18] (03PS1) 10Marostegui: Revert "pc2014: Enable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/981744 [06:15:30] (03PS1) 10Marostegui: Revert "pc2011: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/981745 [06:15:40] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 as master of pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982226 [06:15:51] RECOVERY - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [100.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 [06:18:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2011.codfw.wmnet with reason: host reimage [06:21:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2011.codfw.wmnet with reason: host reimage [06:22:07] (03CR) 10Marostegui: [C: 03+2] Revert "pc2014: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/981744 (owner: 10Marostegui) [06:31:30] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:35:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2011.codfw.wmnet with OS bookworm [06:35:39] (03CR) 10Marostegui: [C: 03+2] Revert "pc2011: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/981745 (owner: 10Marostegui) [06:36:03] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc2014 as master of pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982226 (owner: 10Marostegui) [06:36:45] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc2014 as master of pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982226 (owner: 
10Marostegui) [06:37:15] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:982226|Revert "ProductionServices.php: Promote pc2014 as master of pc1"]] [06:38:39] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:982226|Revert "ProductionServices.php: Promote pc2014 as master of pc1"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [06:38:43] !log marostegui@deploy2002 marostegui: Continuing with sync [06:46:15] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:982226|Revert "ProductionServices.php: Promote pc2014 as master of pc1"]] (duration: 09m 00s) [07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T0700) [07:00:05] kormat, marostegui, and Amir1: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T0700). [07:01:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:03:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:05:32] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:16:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 4800 [07:17:22] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 4800 [07:25:14] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:25:52] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:26:08] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:29:05] (03PS3) 10Ayounsi: Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) [07:29:56] (03CR) 10Ayounsi: "Thx. Not sure if the CI error is legit or not." 
[software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) (owner: 10Ayounsi) [07:31:09] (03CR) 10CI reject: [V: 04-1] Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) (owner: 10Ayounsi) [07:38:28] (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:45:12] (03CR) 10Elukey: [C: 03+2] changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [07:46:22] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Updated Java security policy in OpenJDK 11.0.18 - https://phabricator.wikimedia.org/T328331 (10MoritzMuehlenhoff) p:05Triage→03Low [07:49:59] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync [07:50:14] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [07:52:10] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync [07:52:25] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync [07:57:26] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:57:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:59:52] marostegui Amir1: I'd like to drop the ipoid database, in preparation of re-running a full import. Is sometime in the next hour an OK time to do it? [08:00:06] Amir1 and Urbanecm: May I have your attention please! 
UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T0800) [08:00:06] No Gerrit patches in the queue for this window AFAICS. [08:01:39] kostajh: Sorry I'm on a train and my connection is spotty, Maybe Manuel can take care of it [08:01:39] kostajh: Go for it [08:01:57] awesome [08:01:58] Yep, no problem [08:01:59] I am going to downtime m5 for a bit, to make sure nothing pages [08:02:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2135,2160].codfw.wmnet,db[1176,1217].eqiad.wmnet with reason: m5 ipoid maintenance [08:02:31] marostegui: thanks [08:02:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2135,2160].codfw.wmnet,db[1176,1217].eqiad.wmnet with reason: m5 ipoid maintenance [08:04:35] marostegui: ok to go ahead? [08:05:04] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:05:06] kostajh: yep, go for it [08:05:33] marostegui: ok, done [08:06:05] marostegui: should have the import running in the next couple of hours, I need to merge a patch, deploy, then run a script. I can ping you again before that kicks off. [08:08:30] kostajh: yeah, ping me please so I can monitor and we can see if there's lag there [08:08:35] So we know for future runs [08:10:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:12:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51007 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:12:16] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:29:44] (03PS14) 10Brouberol: Define the spark-history chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) [08:32:48] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: test kserve batcher for revertrisk-la in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981646 (https://phabricator.wikimedia.org/T348536) (owner: 10AikoChou) [08:35:05] (03PS9) 10Brouberol: An an option to configure the event log storage location for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/980859 (https://phabricator.wikimedia.org/T352849) [08:36:28] (03CR) 10Ayounsi: [C: 03+2] Expose Netbox's BGP servers to Homer [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [08:37:08] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:37:14] (03CR) 10Muehlenhoff: [C: 03+2] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [08:39:47] (03CR) 10Muehlenhoff: [C: 03+2] Initial checkin of community_civicrm module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [08:41:40] (03CR) 10Arnaudb: [C: 03+2] mariadb: add db1247 to 
instances [puppet] - 10https://gerrit.wikimedia.org/r/981443 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [08:45:56] (03PS5) 10Ayounsi: Get the server's BGP peer info from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/979381 (https://phabricator.wikimedia.org/T306649) [08:46:50] (03Abandoned) 10Brouberol: Explicitly link the apt_repo.yaml hiera file to the modules/profile specs [puppet] - 10https://gerrit.wikimedia.org/r/979119 (owner: 10Brouberol) [08:47:19] (03CR) 10Ayounsi: [C: 03+2] Get the server's BGP peer info from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/979381 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [08:47:53] (03Merged) 10jenkins-bot: Get the server's BGP peer info from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/979381 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [08:48:46] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: server BGP in netbox plugin - ayounsi@cumin1001 [08:48:50] (03CR) 10AikoChou: [C: 03+2] ml-services: test kserve batcher for revertrisk-la in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981646 (https://phabricator.wikimedia.org/T348536) (owner: 10AikoChou) [08:49:40] (03Merged) 10jenkins-bot: ml-services: test kserve batcher for revertrisk-la in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981646 (https://phabricator.wikimedia.org/T348536) (owner: 10AikoChou) [08:50:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: server BGP in netbox plugin - ayounsi@cumin1001 [09:04:44] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: provisionning db1247.eqiad.wmnet - T344036 [09:04:49] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [09:04:58] 
10SRE-swift-storage, 10Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (10MoritzMuehlenhoff) I'll prepare the respective OpenSSL 1.1 forward ports. I'm optimistic I'll have something ready before the holiday break. Given haproxy's importance for our DDoS resiliency this seem... [09:04:59] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: provisionning db1247.eqiad.wmnet - T344036 [09:05:02] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: provisionning db1247.eqiad.wmnet - T344036 [09:05:30] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: provisionning db1247.eqiad.wmnet - T344036 [09:05:46] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:06:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db1147 in db1247 for T344036', diff saved to https://phabricator.wikimedia.org/P54333 and previous config saved to /var/cache/conftool/dbconfig/20231212-090652-arnaudb.json [09:08:54] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1147.eqiad.wmnet onto db1247.eqiad.wmnet [09:15:15] (03CR) 10Vgutierrez: [C: 03+1] webrequest varnishkafka - Add to X-Analytics the Sec-Purpose HTTP header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [09:18:02] (03CR) 10Filippo Giunchedi: [C: 04-1] grafana: add dashboard datasource usage (graphite) exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [09:19:35] (03CR) 10Jelto: [C: 04-1] "typo in-line" [puppet] - 10https://gerrit.wikimedia.org/r/982119 (https://phabricator.wikimedia.org/T347004) (owner: 10EoghanGaffney) [09:23:27] 
(03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/875/con" [puppet] - 10https://gerrit.wikimedia.org/r/982103 (https://phabricator.wikimedia.org/T353060) (owner: 10Filippo Giunchedi) [09:23:48] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] alertmanager: add sink notifications capability [puppet] - 10https://gerrit.wikimedia.org/r/982103 (https://phabricator.wikimedia.org/T353060) (owner: 10Filippo Giunchedi) [09:30:43] 10SRE-swift-storage, 10Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (10Vgutierrez) >>! In T352744#9398828, @MoritzMuehlenhoff wrote: > I'm wondering though if we reproduced this with the pilot bookworm cp installation? The pilot cp bookworm installation on cp4052 (upload@... [09:36:40] 10SRE-swift-storage, 10Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (10Vgutierrez) HAProxy 2.9 has been released, introducing AWS-LC support and with some interesting mention to OpenSSL [[ https://www.mail-archive.com/haproxy@formilux.org/msg44400.html | on its release no... 
[09:42:28] (03CR) 10Volans: [C: 03+1] "LGTM, although this will hide the actual issue we are having ;)" [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) (owner: 10Ayounsi) [09:43:46] (03PS1) 10Arnaudb: mariadb: db1128 → db1228 [puppet] - 10https://gerrit.wikimedia.org/r/982187 (https://phabricator.wikimedia.org/T344036) [09:43:46] !log installing ca-certificates-java updates from Bookworm point release [09:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:40] (03CR) 10Marostegui: [C: 03+1] mariadb: db1128 → db1228 [puppet] - 10https://gerrit.wikimedia.org/r/982187 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:45:54] (03CR) 10Arnaudb: [C: 03+2] mariadb: db1128 → db1228 [puppet] - 10https://gerrit.wikimedia.org/r/982187 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:48:57] (03PS4) 10Ayounsi: Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) [09:51:11] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [09:51:30] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: provisionning db1228.eqiad.wmnet - T344036 [09:51:34] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [09:51:39] (03CR) 10CI reject: [V: 04-1] Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) (owner: 10Ayounsi) [09:51:45] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: provisionning db1228.eqiad.wmnet - T344036 [09:51:48] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1228.eqiad.wmnet with reason: provisionning 
db1228.eqiad.wmnet - T344036 [09:52:01] (03PS5) 10Ayounsi: Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) [09:52:02] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1228.eqiad.wmnet with reason: provisionning db1228.eqiad.wmnet - T344036 [09:53:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1228 clone from db1128 ', diff saved to https://phabricator.wikimedia.org/P54334 and previous config saved to /var/cache/conftool/dbconfig/20231212-095352-arnaudb.json [09:55:06] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) (owner: 10Ayounsi) [09:55:12] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10ayounsi) 05Stalled→03Resolved Automation is up and running. Doc updated: https://wikitech.wikimedia.org/w/in... 
[09:56:58] (03CR) 10Ayounsi: [C: 03+2] Add retry logic to Netbox API (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) (owner: 10Ayounsi) [09:57:55] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1128.eqiad.wmnet onto db1228.eqiad.wmnet [09:59:01] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982364 (https://phabricator.wikimedia.org/T344941) [09:59:21] (03Merged) 10jenkins-bot: Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) (owner: 10Ayounsi) [10:00:09] (03PS2) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982364 (https://phabricator.wikimedia.org/T344941) [10:00:21] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982364 (https://phabricator.wikimedia.org/T344941) (owner: 10Kosta Harlan) [10:01:19] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982364 (https://phabricator.wikimedia.org/T344941) (owner: 10Kosta Harlan) [10:04:09] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [10:04:32] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [10:04:41] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:05:18] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:06:52] (03PS10) 10Brouberol: An an option to configure the event log storage location for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/980859 (https://phabricator.wikimedia.org/T352849) [10:06:54] (03PS1) 10Kosta Harlan: ipoid: Fix toggling of initial-import flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/982365 [10:07:02] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Fix toggling of 
initial-import flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/982365 (owner: 10Kosta Harlan) [10:07:58] (03Merged) 10jenkins-bot: ipoid: Fix toggling of initial-import flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/982365 (owner: 10Kosta Harlan) [10:09:25] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:09:28] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:11:20] (03PS1) 10Kosta Harlan: ipoid: Enable initial-import for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/982366 [10:11:30] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Enable initial-import for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/982366 (owner: 10Kosta Harlan) [10:12:21] (03Merged) 10jenkins-bot: ipoid: Enable initial-import for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/982366 (owner: 10Kosta Harlan) [10:13:35] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:13:52] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:15:37] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [10:16:46] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[10:30:00] (03PS1) 10Arnaudb: mariadb: db1129 → db1229 [puppet] - 10https://gerrit.wikimedia.org/r/982188 (https://phabricator.wikimedia.org/T344036) [10:30:06] !log installing nghttp2 security updates [10:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:46] (03PS2) 10Samtar: testwiki: Enable the Edit Recovery feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981423 (https://phabricator.wikimedia.org/T353041) (owner: 10Samwilson) [10:36:29] jouncebot: nowandnext [10:36:29] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [10:36:29] In 0 hour(s) and 23 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T1100) [10:37:00] (03CR) 10Marostegui: [C: 03+1] mariadb: db1129 → db1229 [puppet] - 10https://gerrit.wikimedia.org/r/982188 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:37:14] (03CR) 10Arnaudb: [C: 03+2] mariadb: db1129 → db1229 [puppet] - 10https://gerrit.wikimedia.org/r/982188 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:39:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981423 (https://phabricator.wikimedia.org/T353041) (owner: 10Samwilson) [10:39:36] (03CR) 10Btullis: [C: 03+1] An an option to configure the event log storage location for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/980859 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol) [10:39:45] (03Merged) 10jenkins-bot: testwiki: Enable the Edit Recovery feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981423 (https://phabricator.wikimedia.org/T353041) (owner: 10Samwilson) [10:40:19] !log samtar@deploy2002 Started scap: Backport for [[gerrit:981423|testwiki: Enable the Edit Recovery feature (T353041)]] [10:40:23] T353041: Enable Edit Recovery on testwiki - 
https://phabricator.wikimedia.org/T353041 [10:41:43] !log samtar@deploy2002 samtar and samwilson: Backport for [[gerrit:981423|testwiki: Enable the Edit Recovery feature (T353041)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:41:45] * TheresNoTime testing.. [10:41:59] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: provisionning db1229.eqiad.wmnet - T344036 [10:42:03] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [10:42:14] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: provisionning db1229.eqiad.wmnet - T344036 [10:42:17] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: provisionning db1229.eqiad.wmnet - T344036 [10:42:43] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: provisionning db1229.eqiad.wmnet - T344036 [10:43:32] !log samtar@deploy2002 samtar and samwilson: Continuing with sync [10:44:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db1129 in db1229 for T344036', diff saved to https://phabricator.wikimedia.org/P54335 and previous config saved to /var/cache/conftool/dbconfig/20231212-104404-arnaudb.json [10:47:18] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1129.eqiad.wmnet onto db1229.eqiad.wmnet [10:50:11] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:981423|testwiki: Enable the Edit Recovery feature (T353041)]] (duration: 09m 51s) [10:50:16] T353041: Enable Edit Recovery on testwiki - https://phabricator.wikimedia.org/T353041 [11:00:02] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T1100) [11:06:56] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:06:57] (03CR) 10Muehlenhoff: [C: 03+2] parsoid::testing: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/981546 (owner: 10Muehlenhoff) [11:08:44] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:03] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:15:25] (03PS6) 10Slyngshede: Move Debmonitor client code to separate repository. [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 [11:16:36] (03CR) 10CI reject: [V: 04-1] Move Debmonitor client code to separate repository. 
[software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede) [11:17:06] (03PS2) 10EoghanGaffney: [apt-staging] Deploy gitlab-package-puller script [puppet] - 10https://gerrit.wikimedia.org/r/982119 (https://phabricator.wikimedia.org/T347004) [11:17:46] (03PS7) 10Slyngshede: Move Debmonitor client code to separate repository. [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 [11:23:31] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:45] (PuppetFailure) resolved: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:23:45] (DiskSpace) resolved: Disk space relforge1003:9100:/ 2.449% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [11:23:50] (SystemdUnitFailed) resolved: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:26:17] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:42] (SystemdUnitFailed) firing: man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:27:59] (PuppetFailure) firing: Puppet has failed on 
lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:28:40] !log installing postgresql-11 security updates
[11:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:18] (DiskSpace) firing: Disk space relforge1003:9100:/ 1.907% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:35:45] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/982119 (https://phabricator.wikimedia.org/T347004) (owner: EoghanGaffney)
[11:36:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[11:40:41] (PS1) Kosta Harlan: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/982375 (https://phabricator.wikimedia.org/T344941)
[11:41:11] (CR) Kosta Harlan: [C: +2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/982375 (https://phabricator.wikimedia.org/T344941) (owner: Kosta Harlan)
[11:42:05] (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/982375 (https://phabricator.wikimedia.org/T344941) (owner: Kosta Harlan)
[11:43:23] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[11:43:49] !log kharlan@deploy2002
helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[11:51:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[11:51:30] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[11:51:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[11:55:41] PROBLEM - SSH on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:56:57] (ProbeDown) firing: (2) Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:58:11] PROBLEM - Check systemd state on moss-be2002 is CRITICAL: CRITICAL - degraded: The following units failed:
confd_prometheus_metrics.service,export_smart_data_dump.service,prometheus-debian-version-textfile.service,prometheus-dpkg-success-textfile.service,prometheus-nic-firmware-textfile.service,prometheus-node-exporter-apt.service,prometheus-puppet-agent-stats.service,prometheus_intel_microcode.service https://wikitech.wikimedia.org/wiki/
[11:58:11] ng/check_systemd_state
[11:58:44] (JobUnavailable) firing: (3) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:58:45] PROBLEM - MD RAID on moss-be2002 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[12:03:29] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host backup2010.codfw.wmnet
[12:05:14] (CR) Slyngshede: Move Debmonitor client code to separate repository.
(27 comments) [software/debmonitor-client] - https://gerrit.wikimedia.org/r/981463 (owner: Slyngshede)
[12:05:18] (Abandoned) Hnowlan: jobrunner: Standard mediawiki webserver configuration [puppet] - https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) (owner: Hnowlan)
[12:06:57] (Abandoned) Hnowlan: jobrunner: add simple HTTP check [puppet] - https://gerrit.wikimedia.org/r/576301 (https://phabricator.wikimedia.org/T243096) (owner: Hnowlan)
[12:06:57] (JobUnavailable) resolved: (3) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:07:33] (PS1) Muehlenhoff: Switch backup2010 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/982376 (https://phabricator.wikimedia.org/T349619)
[12:10:34] (CR) Muehlenhoff: [C: +2] Switch backup2010 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/982376 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[12:11:57] (JobUnavailable) firing: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:13:45] (JobUnavailable) firing: (3) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:15:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host backup2010.codfw.wmnet
[12:18:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate -
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:18:39] (CR) Brouberol: [C: +2] An an option to configure the event log storage location for all spark jobs [puppet] - https://gerrit.wikimedia.org/r/980859 (https://phabricator.wikimedia.org/T352849) (owner: Brouberol)
[12:23:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:24:44] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host backup2011.codfw.wmnet
[12:25:50] (CR) Clément Goubert: [C: +2] wikikube: add kubernetes10[59-62] to LVS [puppet] - https://gerrit.wikimedia.org/r/982072 (https://phabricator.wikimedia.org/T353135) (owner: Clément Goubert)
[12:27:13] (PS1) Muehlenhoff: Switch backup2011 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/982379 (https://phabricator.wikimedia.org/T349619)
[12:28:28] (CR) Muehlenhoff: [C: +2] Switch backup2011 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/982379 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[12:29:56] (CR) Btullis: [C: +1] "This change also looks good to me."
[puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: Ottomata)
[12:31:42] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:33:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:33:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host backup2011.codfw.wmnet
[12:37:18] !log Pooling kubernetes10[59-62].eqiad.wmnet - T353135
[12:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:32] T353135: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135
[12:38:53] !log Uncordoning kubernetes10[59-62].eqiad.wmnet - T353135
[12:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:10] (PS1) Muehlenhoff: Switch backup1010 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/982384 (https://phabricator.wikimedia.org/T349619)
[12:42:14] (CR) Muehlenhoff: [C: +2] Switch backup1010 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/982384 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[12:43:08] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (Clement_Goubert)
[12:43:30] SRE, serviceops: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (Clement_Goubert) Open→Resolved Nodes are in
production.
[12:45:27] !log increasing memory of ganeti instance kubemaster2001.codfw.wmnet from 4G to 12G (requires reboot) - T353233
[12:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:37] T353233: Outage of wikikube codfw apiservers - https://phabricator.wikimedia.org/T353233
[12:45:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host backup1010.eqiad.wmnet
[12:46:25] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host backup1011.eqiad.wmnet
[12:47:17] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:47:45] (PS1) Muehlenhoff: Switch backup1011 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/982385 (https://phabricator.wikimedia.org/T349619)
[12:47:50] (PS2) Phuedx: Add stream config for Android article instruments [mediawiki-config] - https://gerrit.wikimedia.org/r/980963 (https://phabricator.wikimedia.org/T351292) (owner: Clare Ming)
[12:48:45] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:49:21] (PS1) Kosta Harlan: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/982386 (https://phabricator.wikimedia.org/T344941)
[12:49:32] (CR) Kosta Harlan: [C: +2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/982386 (https://phabricator.wikimedia.org/T344941) (owner: Kosta Harlan)
[12:50:21] (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/982386 (https://phabricator.wikimedia.org/T344941) (owner: Kosta Harlan)
[12:51:09] (CR) Jelto: [C: -1] "What about thanos?
It clones operations/alerts as well in:" [puppet] - https://gerrit.wikimedia.org/r/982086 (https://phabricator.wikimedia.org/T349626) (owner: LSobanski)
[12:51:28] !log jayme@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM kubemaster2001.codfw.wmnet
[12:51:43] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:52:00] (CR) Muehlenhoff: [C: +2] Switch backup1011 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/982385 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[12:52:38] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[12:53:03] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[12:54:17] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:54:57] RECOVERY - Check systemd state on kubemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:25] !log jayme@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM kubemaster1001.eqiad.wmnet
[12:56:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host backup1011.eqiad.wmnet
[12:56:58] (CR) Cathal Mooney: "LGTM overall. I'll leave you and Riccardo to hammer out the finer points but logic and function makes sense to me."
[software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[12:57:54] !log jayme@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster2001.codfw.wmnet
[12:58:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T1300)
[13:00:33] !log jayme@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster1001.eqiad.wmnet
[13:02:08] (CR) Volans: Move Debmonitor client code to separate repository. (1 comment) [software/debmonitor-client] - https://gerrit.wikimedia.org/r/981463 (owner: Slyngshede)
[13:06:39] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1147.eqiad.wmnet onto db1247.eqiad.wmnet
[13:07:15] RECOVERY - SSH on titan1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:08:44] (ProbeDown) resolved: (2) Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:08:44] (JobUnavailable) resolved: (3) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:08:49] (CR) Volans: "follow up from IRC chat" [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[13:09:07] !log jayme@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM kubemaster2002.codfw.wmnet
[13:09:31] !log jayme@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM kubemaster1002.eqiad.wmnet
[13:11:08] (PS8) ArielGlenn: use virtual db domain for CentralAuth and GlobalBlocking [mediawiki-config] - https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486)
[13:12:19] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:13:55] RECOVERY - Check systemd state on kubemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:15:31] (PS1) Arnaudb: mariadb: add db1226 [puppet] - https://gerrit.wikimedia.org/r/982189 (https://phabricator.wikimedia.org/T344036)
[13:15:52] (PS9) ArielGlenn: use virtual db domain for CentralAuth and GlobalBlocking [mediawiki-config] - https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486)
[13:16:01] !log jayme@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster1002.eqiad.wmnet
[13:16:30] !log jayme@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster2002.codfw.wmnet
[13:16:42] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:19:13] (PS1) Brouberol: dse-k8s: increase the general contauner max memory limit [deployment-charts] -
https://gerrit.wikimedia.org/r/982389
[13:19:46] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1129.eqiad.wmnet onto db1229.eqiad.wmnet
[13:20:31] (PS1) Slyngshede: Debian packaging configuration [software/debmonitor-client] (debian) - https://gerrit.wikimedia.org/r/982391
[13:20:42] * Lucas_WMDE will not be around for the backport window today btw
[13:20:54] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1128.eqiad.wmnet onto db1228.eqiad.wmnet
[13:25:50] (CR) Muehlenhoff: [C: +2] defs_requestctl_nftables.tpl: Fix range selection [puppet] - https://gerrit.wikimedia.org/r/981465 (https://phabricator.wikimedia.org/T348734) (owner: Muehlenhoff)
[13:28:18] (PS1) Majavah: P:openldap: convert to firewall::service [puppet] - https://gerrit.wikimedia.org/r/982394
[13:28:20] (PS1) Majavah: O:openldap::rw: don't allow queries from Cloud VPS [puppet] - https://gerrit.wikimedia.org/r/982395
[13:28:23] (CR) JMeybohm: admin_ng: Add namespace and ClusterRole for Job sidecar controller (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/981703 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[13:29:01] (PS2) Majavah: O:openldap::rw: don't allow queries from Cloud VPS [puppet] - https://gerrit.wikimedia.org/r/982395 (https://phabricator.wikimedia.org/T317184)
[13:32:58] (PS2) Majavah: P:openldap: convert to firewall::service [puppet] - https://gerrit.wikimedia.org/r/982394
[13:33:00] (PS3) Majavah: O:openldap::rw: don't allow queries from Cloud VPS [puppet] - https://gerrit.wikimedia.org/r/982395 (https://phabricator.wikimedia.org/T317184)
[13:33:25] (PS2) Clément Goubert: prometheus-php-fpm-exporter: Bullseye update and fix build script [docker-images/production-images] - https://gerrit.wikimedia.org/r/982089
[13:34:04] (CR) Ayounsi: Netbox module: add get/set for primary IPs and access vlan (1 comment)
[software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[13:34:07] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/982395 (https://phabricator.wikimedia.org/T317184) (owner: Majavah)
[13:37:03] (CR) Volans: [C: +1] "LGTM as it's minimally invasive compared to the existing code. It would be possible to generalize some bits between the two if/elif blocks" [software/netbox] - https://gerrit.wikimedia.org/r/980908 (https://phabricator.wikimedia.org/T310717) (owner: Ayounsi)
[13:40:24] (CR) Giuseppe Lavagetto: [C: +1] "Minor comment, lgtm otherwise" [docker-images/production-images] - https://gerrit.wikimedia.org/r/982089 (owner: Clément Goubert)
[13:40:43] (CR) Btullis: [C: +1] dse-k8s: increase the general contauner max memory limit [deployment-charts] - https://gerrit.wikimedia.org/r/982389 (owner: Brouberol)
[13:42:21] (CR) Majavah: [C: -1] prometheus-php-fpm-exporter: Bullseye update and fix build script (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/982089 (owner: Clément Goubert)
[13:43:05] (CR) Clément Goubert: prometheus-php-fpm-exporter: Bullseye update and fix build script (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/982089 (owner: Clément Goubert)
[13:43:47] (CR) Brouberol: [C: +2] dse-k8s: increase the general contauner max memory limit [deployment-charts] - https://gerrit.wikimedia.org/r/982389 (owner: Brouberol)
[13:45:04] (CR) Clément Goubert: prometheus-php-fpm-exporter: Bullseye update and fix build script (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/982089 (owner: Clément Goubert)
[13:45:33] !log increasing max container memory requests in dse-k8s
from 3GB to 8GB - T351722
[13:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:38] T351722: Create a helm chart for the spark-history service - https://phabricator.wikimedia.org/T351722
[13:45:47] SRE, Observability-Alerting: Probe for ipv6 host reachability - https://phabricator.wikimedia.org/T163996 (fgiunchedi) Open→Resolved a:fgiunchedi Thank you @taavi ! The check is working as expected now, and uncovered {T353254} ! I'm resolving, though feel free to reopen
[13:46:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:46:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:47:26] (CR) Giuseppe Lavagetto: [C: +1] prometheus-php-fpm-exporter: Bullseye update and fix build script (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/982089 (owner: Clément Goubert)
[13:50:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[13:51:46] (PS1) JMeybohm: kubestagemaster: Add http probe [puppet] - https://gerrit.wikimedia.org/r/982403 (https://phabricator.wikimedia.org/T353233)
[13:54:09] ops-codfw, Infrastructure-Foundations, netops: cr2-codfw:xe-1/0/1:1 down - https://phabricator.wikimedia.org/T353256 (ayounsi) p:Triage→High
[13:54:48] (PS3) Clément Goubert: prometheus-php-fpm-exporter: Bullseye update and fix build script [docker-images/production-images] - https://gerrit.wikimedia.org/r/982089
[13:56:17] (CR) Clément Goubert: prometheus-php-fpm-exporter: Bullseye update and fix build script (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/982089 (owner: Clément Goubert)
[13:56:19] (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/879/con" [puppet] - https://gerrit.wikimedia.org/r/982403
(https://phabricator.wikimedia.org/T353233) (owner: JMeybohm)
[13:56:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[13:59:45] (CR) Marostegui: [C: +1] mariadb: add db1226 [puppet] - https://gerrit.wikimedia.org/r/982189 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T1400).
[14:00:05] danisztls and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:35] o/
[14:00:37] o/
[14:01:13] (DiskSpace) resolved: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[14:03:39] (PS15) Brouberol: Define the spark-history chart [deployment-charts] - https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722)
[14:06:15] (CR) Btullis: Define the spark-history chart (5 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol)
[14:11:32] (PS3) Kamila Součková: kube-state-metrics: DRY network policy [deployment-charts] - https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625)
[14:12:01] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000).
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:12:52] (CR) Brouberol: Define the spark-history chart (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol)
[14:13:31] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:13:57] (PS4) Kamila Součková: kube-state-metrics: DRY network policy [deployment-charts] - https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625)
[14:14:15] (PS8) Slyngshede: Move Debmonitor client code to separate repository. [software/debmonitor-client] - https://gerrit.wikimedia.org/r/981463
[14:14:57] RECOVERY - Disk space on relforge1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=relforge1003&var-datasource=eqiad+prometheus/ops
[14:15:35] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (VRiley-WMF) sessionstore1004 Rack: A3 U: 23 CableID: 1865 Port: 21 sessionstore1005 Rack: C5 U:29 CableID: 1957 Port: 30 sessionstore1006 Rack: D6 U: 40 CableID: 5...
[14:15:40] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (VRiley-WMF)
[14:16:02] (PS5) Kamila Součková: kube-state-metrics: DRY network policy [deployment-charts] - https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625)
[14:16:42] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:18:13] danisztls: Since no deployers are about, I'm able to deploy but I haven't done it in a while. Can your change be tested?
[14:18:42] phuedx: thanks! It doesn't need to.
[14:19:23] (PS2) DDesouza: Partially undeploy Reader Demographics 2 survey [mediawiki-config] - https://gerrit.wikimedia.org/r/982178 (https://phabricator.wikimedia.org/T344393)
[14:19:30] (PS9) Slyngshede: Move Debmonitor client code to separate repository. [software/debmonitor-client] - https://gerrit.wikimedia.org/r/981463
[14:19:39] just rebased it
[14:20:52] (CR) Slyngshede: Move Debmonitor client code to separate repository. (1 comment) [software/debmonitor-client] - https://gerrit.wikimedia.org/r/981463 (owner: Slyngshede)
[14:21:01] Alright. I'm familiar with the QuickSurveys config.
The change LGTM
[14:21:18] (CR) TrainBranchBot: [C: +2] "Approved by phuedx@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/982178 (https://phabricator.wikimedia.org/T344393) (owner: DDesouza)
[14:22:01] (Merged) jenkins-bot: Partially undeploy Reader Demographics 2 survey [mediawiki-config] - https://gerrit.wikimedia.org/r/982178 (https://phabricator.wikimedia.org/T344393) (owner: DDesouza)
[14:22:25] !log phuedx@deploy2002 Started scap: Backport for [[gerrit:982178|Partially undeploy Reader Demographics 2 survey (T344393)]]
[14:22:34] (CR) Btullis: Define the spark-history chart (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol)
[14:22:36] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393
[14:24:17] (PS1) Muehlenhoff: defs_requestctl_nftables.tp: Fix query [puppet] - https://gerrit.wikimedia.org/r/982408 (https://phabricator.wikimedia.org/T348734)
[14:24:34] !log phuedx@deploy2002 phuedx and dani: Backport for [[gerrit:982178|Partially undeploy Reader Demographics 2 survey (T344393)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:26:41] (CR) Muehlenhoff: [C: +1] "LGTM" [software/debmonitor-client] - https://gerrit.wikimedia.org/r/981463 (owner: Slyngshede)
[14:28:17] (CR) Arnaudb: [C: +2] mariadb: add db1226 [puppet] - https://gerrit.wikimedia.org/r/982189 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[14:28:33] RECOVERY - cassandra-b service on restbase2031 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:29:05] RECOVERY - cassandra-b SSL 10.192.32.227:7000 on restbase2031 is OK: SSL OK - Certificate restbase2031-b valid until 2025-12-07 21:03:18 +0000 (expires in 726 days)
https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:29:59] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/982408 (https://phabricator.wikimedia.org/T348734) (owner: Muehlenhoff)
[14:30:45] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: provisionning db1226.eqiad.wmnet - T344036
[14:30:52] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036
[14:31:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: provisionning db1226.eqiad.wmnet - T344036
[14:31:14] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: provisionning db1226.eqiad.wmnet - T344036
[14:31:23] (CR) LSobanski: [C: +1] "Let's go ahead with this approach until the migration to K8s as discussed." [puppet] - https://gerrit.wikimedia.org/r/981591 (https://phabricator.wikimedia.org/T347355) (owner: Dzahn)
[14:31:30] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: provisionning db1226.eqiad.wmnet - T344036
[14:32:09] danisztls: Sorry for the delay.
I was just trying to check the QS config on the test surveys [14:32:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db1211 in db1226 for T344036', diff saved to https://phabricator.wikimedia.org/P54336 and previous config saved to /var/cache/conftool/dbconfig/20231212-143233-arnaudb.json [14:32:36] (03PS6) 10Kamila Součková: kube-state-metrics: DRY network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625) [14:32:53] (03CR) 10Kamila Součková: kube-state-metrics: DRY network policy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [14:34:45] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on netbox-dev2002.codfw.wmnet with reason: Restoring DB from backup on netbox-dev2002 [14:35:11] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on netbox-dev2002.codfw.wmnet with reason: Restoring DB from backup on netbox-dev2002 [14:35:22] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1211.eqiad.wmnet onto db1226.eqiad.wmnet [14:36:48] danisztls: I realise that I couldn't confirm it because I had to be logged out ^^ OK to proceed? [14:36:57] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:40] (03CR) 10Muehlenhoff: [C: 03+2] defs_requestctl_nftables.tp: Fix query [puppet] - 10https://gerrit.wikimedia.org/r/982408 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [14:38:12] phuedx: sry. Yes. All good. [14:39:27] phuedx: I tested on test server. The surveys are disabled as intended. 
[14:39:35] Thanks [14:39:36] !log phuedx@deploy2002 phuedx and dani: Continuing with sync [14:41:58] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Jclark-ctr) [14:44:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [14:44:08] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1143.eqiad.wmnet - https://phabricator.wikimedia.org/T353156 (10VRiley-WMF) [14:44:53] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1143.eqiad.wmnet - https://phabricator.wikimedia.org/T353156 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF [14:45:59] (03PS1) 10Arnaudb: mariadb db1137 → db1237 [puppet] - 10https://gerrit.wikimedia.org/r/982190 (https://phabricator.wikimedia.org/T344036) [14:46:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10VRiley-WMF) [14:46:45] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Jclark-ctr) @Jhancock.wm I finished imaging these if you want to verify anything before closing out ticket [14:46:58] !log phuedx@deploy2002 Finished scap: Backport for [[gerrit:982178|Partially undeploy Reader Demographics 2 survey (T344393)]] (duration: 24m 33s) [14:47:03] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [14:47:11] (03CR) 10Marostegui: [C: 03+1] mariadb db1137 → db1237 [puppet] - 10https://gerrit.wikimedia.org/r/982190 (https://phabricator.wikimedia.org/T344036) (owner: 
10Arnaudb) [14:47:20] (03CR) 10Majavah: P:wmcs::metricsinfra: add meta monitoring app skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [14:47:52] jouncebot: refresh [14:47:53] I refreshed my knowledge about deployments. [14:47:54] (03CR) 10Arnaudb: [C: 03+2] mariadb db1137 → db1237 [puppet] - 10https://gerrit.wikimedia.org/r/982190 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [14:48:05] danisztls: That should be live now [14:48:13] (03PS1) 10Volans: defs_requestctl_nftables.tpl: simplify template [puppet] - 10https://gerrit.wikimedia.org/r/982409 (https://phabricator.wikimedia.org/T348734) [14:48:29] I've rescheduled my patch for the next deployment window as I have a meeting at the top of the hour [14:48:56] !log UTC afternoon backport window done [14:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [14:49:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks great, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/982409 (https://phabricator.wikimedia.org/T348734) (owner: 10Volans) [14:50:04] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1137.eqiad.wmnet with reason: provisionning db1237.eqiad.wmnet - T344036 [14:50:10] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [14:50:20] !log restarting blazegraph on wdqs1012 (BlazegraphFreeAllocatorsDecreasingRapidly) [14:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:30] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1137.eqiad.wmnet with reason: provisionning db1237.eqiad.wmnet - T344036 [14:50:33] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1237.eqiad.wmnet with reason: provisionning db1237.eqiad.wmnet - T344036 [14:50:46] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982409 (https://phabricator.wikimedia.org/T348734) (owner: 10Volans) [14:50:55] (03CR) 10CI reject: [V: 04-1] defs_requestctl_nftables.tpl: simplify template [puppet] - 10https://gerrit.wikimedia.org/r/982409 (https://phabricator.wikimedia.org/T348734) (owner: 10Volans) [14:50:59] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1237.eqiad.wmnet with reason: provisionning db1237.eqiad.wmnet - T344036 [14:51:31] (03PS2) 10Volans: defs_requestctl_nftables.tpl: simplify template [puppet] - 10https://gerrit.wikimedia.org/r/982409 (https://phabricator.wikimedia.org/T348734) [14:52:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db1137 in db1237 for T344036', diff saved to https://phabricator.wikimedia.org/P54339 and previous config saved to /var/cache/conftool/dbconfig/20231212-145205-arnaudb.json [14:53:44] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:02] (03PS1) 10Jclark-ctr: add sessionstore100[4-6] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/982411 (https://phabricator.wikimedia.org/T349875) [14:54:03] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1137.eqiad.wmnet onto db1237.eqiad.wmnet [14:54:50] (03CR) 10Jclark-ctr: [C: 03+2] add sessionstore100[4-6] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/982411 (https://phabricator.wikimedia.org/T349875) (owner: 10Jclark-ctr) [14:55:16] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982409 (https://phabricator.wikimedia.org/T348734) (owner: 10Volans) [14:56:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10Jclark-ctr) [14:58:00] (03CR) 10Giuseppe Lavagetto: [C: 03+1] kubestagemaster: Add http probe [puppet] - 10https://gerrit.wikimedia.org/r/982403 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [14:58:39] (03CR) 10Volans: [C: 03+2] defs_requestctl_nftables.tpl: simplify template [puppet] - 10https://gerrit.wikimedia.org/r/982409 (https://phabricator.wikimedia.org/T348734) (owner: 10Volans) [14:58:41] (03CR) 10Filippo Giunchedi: [C: 03+1] kubestagemaster: Add http probe [puppet] - 10https://gerrit.wikimedia.org/r/982403 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [14:58:51] phuedx: thanks again and sorry for delaying the deployment of your patch [14:59:14] Not at all [15:00:31] (03CR) 10Andrew Bogott: [C: 03+1] openstack: spreadcheck: remove in favour of server groups [puppet] - 10https://gerrit.wikimedia.org/r/979056 (https://phabricator.wikimedia.org/T247213) (owner: 10Majavah) [15:00:40] (03CR) 10Majavah: [V: 03+1 C: 03+2] openstack: 
spreadcheck: remove in favour of server groups [puppet] - 10https://gerrit.wikimedia.org/r/979056 (https://phabricator.wikimedia.org/T247213) (owner: 10Majavah) [15:03:49] (03PS1) 10Bartosz Dziewoński: RunSingleJob.php: Remove overly complicated error handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982414 (https://phabricator.wikimedia.org/T353262) [15:03:51] (03PS1) 10Bartosz Dziewoński: RunSingleJob.php: Stop writing to $wgCommandLineMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982415 (https://phabricator.wikimedia.org/T353262) [15:03:53] (03PS1) 10Bartosz Dziewoński: RunSingleJob.php: Fix use of MWExceptionHandler before it's defined [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982416 (https://phabricator.wikimedia.org/T352265) [15:04:05] (03PS1) 10Majavah: team-sre: puppet-agent: Don't trigger ConstantChange when agent is disabled [alerts] - 10https://gerrit.wikimedia.org/r/982417 [15:05:13] (03CR) 10Jforrester: [C: 03+1] RunSingleJob.php: Fix use of MWExceptionHandler before it's defined [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982416 (https://phabricator.wikimedia.org/T352265) (owner: 10Bartosz Dziewoński) [15:06:31] (03PS2) 10Bartosz Dziewoński: Update expected RunSingleJob.php status code [puppet] - 10https://gerrit.wikimedia.org/r/982236 (https://phabricator.wikimedia.org/T352265) [15:08:44] (03CR) 10Bartosz Dziewoński: "I'd appreciate a +1 from you (or anyone else) before I schedule this for deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982414 (https://phabricator.wikimedia.org/T353262) (owner: 10Bartosz Dziewoński) [15:08:47] (03CR) 10Bartosz Dziewoński: "I'd appreciate a +1 from you (or anyone else) before I schedule this for deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/982415 (https://phabricator.wikimedia.org/T353262) (owner: 10Bartosz Dziewoński) [15:12:57] (03PS1) 10Muehlenhoff: testreduce: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/982419 [15:13:13] (03CR) 10CI reject: [V: 04-1] testreduce: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/982419 (owner: 10Muehlenhoff) [15:14:22] (03PS2) 10Muehlenhoff: testreduce: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/982419 [15:15:01] (03CR) 10Clément Goubert: [C: 03+2] BGPPeers: add codfw racks A1 to B8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980925 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [15:15:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] prometheus-php-fpm-exporter: Bullseye update and fix build script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982089 (owner: 10Clément Goubert) [15:17:11] (03PS1) 10Majavah: get_config: Respect .mailmap for git authors [puppet] - 10https://gerrit.wikimedia.org/r/982420 [15:17:13] (03PS1) 10Majavah: merge_cli: Respect .mailmap when formatting change list [puppet] - 10https://gerrit.wikimedia.org/r/982421 [15:17:41] (03Merged) 10jenkins-bot: BGPPeers: add codfw racks A1 to B8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980925 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [15:18:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982419 (owner: 10Muehlenhoff) [15:20:58] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nicely done!" 
[alerts] - 10https://gerrit.wikimedia.org/r/982417 (owner: 10Majavah) [15:21:15] (03CR) 10Majavah: [C: 03+2] team-sre: puppet-agent: Don't trigger ConstantChange when agent is disabled [alerts] - 10https://gerrit.wikimedia.org/r/982417 (owner: 10Majavah) [15:21:50] !log Deploying new calico BGPPeers for codfw rows a/b - T352893 [15:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:54] T352893: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 [15:21:59] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:22:12] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Jhancock.wm) 05Open→03Resolved @Jclark-ctr looks good ty for your help! @Eevans all yours [15:22:15] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:22:32] (03Merged) 10jenkins-bot: team-sre: puppet-agent: Don't trigger ConstantChange when agent is disabled [alerts] - 10https://gerrit.wikimedia.org/r/982417 (owner: 10Majavah) [15:22:53] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:23:29] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:24:36] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [15:25:02] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:25:44] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:25:56] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:25:57] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:26:56] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:27:23] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:27:38] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [15:27:55] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [15:27:59] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:28:25] !log cgoubert@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:28:52] !log cgoubert@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:29:32] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:30:09] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:30:11] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:30:22] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:30:40] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:31:54] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1142.eqiad.wmnet - https://phabricator.wikimedia.org/T353154 (10VRiley-WMF) a:03VRiley-WMF [15:31:58] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1142.eqiad.wmnet - https://phabricator.wikimedia.org/T353154 (10VRiley-WMF) 05Open→03Resolved [15:33:11] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:35:10] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for mcastro-wmf - https://phabricator.wikimedia.org/T353273 (10Mcastro) [15:35:17] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:37:11] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:40:43] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1141.eqiad.wmnet - https://phabricator.wikimedia.org/T353152 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF [15:43:13] 10SRE, 10Infrastructure-Foundations, 10SRE Observability: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10fgiunchedi) 05Open→03Resolved Nowadays we have `ManagementSSHDown` alert that opens dcops tasks, optimistically calling this one resolved [15:44:47] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1137.eqiad.wmnet onto db1237.eqiad.wmnet [15:47:11] (03CR) 10Xcollazo: [C: 03+1] "All right let's deploy this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/980923 (https://phabricator.wikimedia.org/T349121) (owner: 10Btullis) [15:51:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1119.eqiad.wmnet - https://phabricator.wikimedia.org/T337206 (10VRiley-WMF) 05Open→03Resolved a:05Jclark-ctr→03VRiley-WMF [15:51:59] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:55:22] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] Fix some Build-Depends [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982090 (owner: 10Clément Goubert) [15:55:41] (03PS2) 10Clément Goubert: Fix some Build-Depends [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982090 [15:55:45] (03CR) 10Clément Goubert: [V: 03+2] Fix some Build-Depends [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982090 (owner: 10Clément Goubert) [15:56:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host testhost2001.codfw.wmnet with OS bullseye [15:56:52] (03PS1) 10Kosta Harlan: ipoid: Re-enable daily updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/982427 (https://phabricator.wikimedia.org/T339284) [15:57:23] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Re-enable daily updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/982427 (https://phabricator.wikimedia.org/T339284) (owner: 10Kosta Harlan) [15:58:03] (03PS1) 10Muehlenhoff: tlsproxy::envoy: Only pass an srange if not an empty array [puppet] - 10https://gerrit.wikimedia.org/r/982428 [15:58:23] (03PS2) 10Muehlenhoff: tlsproxy::envoy: Only pass an srange if not an empty array [puppet] - 10https://gerrit.wikimedia.org/r/982428 [15:58:27] (03Merged) 10jenkins-bot: ipoid: Re-enable daily updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/982427 
(https://phabricator.wikimedia.org/T339284) (owner: 10Kosta Harlan) [15:58:29] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] Fix some Build-Depends (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982090 (owner: 10Clément Goubert) [15:59:44] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [16:00:01] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [16:00:05] eoghan, jelto, and arnoldokoth: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T1600). [16:03:02] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1004.eqiad.wmnet with reason: Phabricator deploys [16:03:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982428 (owner: 10Muehlenhoff) [16:03:17] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: Phabricator deploys [16:03:38] !log brennen@deploy2002 Started deploy [phabricator/deployment@c243cc2]: test deploy to phab2002 for T353274 [16:03:43] T353274: Deploy Phabricator/Phorge 2023-12-12 - https://phabricator.wikimedia.org/T353274 [16:04:11] !log brennen@deploy2002 Finished deploy [phabricator/deployment@c243cc2]: test deploy to phab2002 for T353274 (duration: 00m 32s) [16:04:46] !log brennen@deploy2002 Started deploy [phabricator/deployment@c243cc2]: deploy to phab1004 for T353274 [16:05:34] !log brennen@deploy2002 Finished deploy [phabricator/deployment@c243cc2]: deploy to phab1004 for T353274 (duration: 00m 48s) [16:06:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:08:00] (03CR) 10Volans: Netbox module: add get/set for primary IPs and access vlan (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [16:08:30] (03PS3) 10Muehlenhoff: tlsproxy::envoy: Only pass an srange if not an empty array [puppet] - 10https://gerrit.wikimedia.org/r/982428 [16:15:41] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2001.codfw.wmnet [16:17:16] (03PS1) 10Clément Goubert: mw-debug: update php-fpm-exporter version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982430 [16:17:18] (03PS1) 10Clément Goubert: mw-on-k8s: update php-fpm-exporter version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982431 [16:17:20] (03PS1) 10Clément Goubert: shellbox: update php-fpm-exporter version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982432 [16:18:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982428 (owner: 10Muehlenhoff) [16:19:36] (03CR) 10Brouberol: Define the spark-history chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [16:19:51] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1060'] [16:20:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes1060'] [16:22:03] 10SRE, 10ops-eqiad: Degraded RAID on kubernetes1060 - https://phabricator.wikimedia.org/T353165 (10Jclark-ctr) i have downloaded the tsr report I do not see any failed drives. unsure if this is a mistake? 
[16:22:07] (03PS16) 10Brouberol: Define the spark-history chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) [16:24:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2001.codfw.wmnet [16:24:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Jclark-ctr) 05Open→03Resolved [16:29:16] (03CR) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [16:30:07] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:33:51] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ml-staging2001.codfw.wmnet with reason: Waiting for hardware install [16:34:06] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ml-staging2001.codfw.wmnet with reason: Waiting for hardware install [16:34:07] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:34:19] (03CR) 10Bartosz Dziewoński: RunSingleJob.php: Stop writing to $wgCommandLineMode (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982415 (https://phabricator.wikimedia.org/T353262) (owner: 10Bartosz Dziewoński) [16:34:26] (03PS2) 10Bartosz Dziewoński: RunSingleJob.php: Stop writing to $wgCommandLineMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982415 (https://phabricator.wikimedia.org/T353262) [16:34:32] (03PS2) 10Bartosz Dziewoński: 
RunSingleJob.php: Fix use of MWExceptionHandler before it's defined [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982416 (https://phabricator.wikimedia.org/T352265) [16:36:20] (03CR) 10Ottomata: [C: 03+2] varnishkafka::instance - Add ensure param [puppet] - 10https://gerrit.wikimedia.org/r/982163 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [16:37:35] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:39:02] (03Abandoned) 10Ottomata: eventgate-analytics-external - upgrade to debian bookworm and nodejs 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/968334 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [16:40:08] (03CR) 10Ottomata: [C: 03+1] "Ready to go when you are!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/980359 (https://phabricator.wikimedia.org/T345806) (owner: 10Gmodena) [16:42:22] (03PS4) 10Ottomata: Bump MW Page content change app version [deployment-charts] - 10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) (owner: 10Aqu) [16:45:04] (03CR) 10Ottomata: [C: 03+2] mediawiki-page-content-change-enrichment - allow egress to api-ro (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904520 (owner: 10Ottomata) [16:47:46] (03PS1) 10DCausse: flink-app: include mesh.networkpolicy.ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/982434 (https://phabricator.wikimedia.org/T353224) [16:52:13] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) P54340 [16:56:59] (03PS3) 10RLazarus: admin_ng: Add namespace and ClusterRole for Job sidecar controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/981703 
(https://phabricator.wikimedia.org/T348284) [16:57:01] (03PS3) 10RLazarus: admin_ng: Switch on enableJobSidecarController for mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/981704 (https://phabricator.wikimedia.org/T348284) [16:57:24] (03CR) 10RLazarus: admin_ng: Add namespace and ClusterRole for Job sidecar controller (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/981703 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [17:00:05] jhathaway and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:03:51] (03CR) 10MVernon: "I've done some bugfixing on this, and it's nearly there, but there's a redfish problem I could do with some help with, please:" [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [17:04:16] (03CR) 10DCausse: [C: 03+2] flink-app: include mesh.networkpolicy.ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/982434 (https://phabricator.wikimedia.org/T353224) (owner: 10DCausse) [17:04:40] (03CR) 10Bking: [C: 03+1] flink-app: include mesh.networkpolicy.ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/982434 (https://phabricator.wikimedia.org/T353224) (owner: 10DCausse) [17:04:57] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:05:30] (03Merged) 10jenkins-bot: flink-app: include mesh.networkpolicy.ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/982434 (https://phabricator.wikimedia.org/T353224) (owner: 10DCausse) [17:06:27] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:08:39] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:13:11] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:13:12] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:13:21] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:14:22] (03PS1) 10Bartosz Dziewoński: Remove references to refreshMessageBlobs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982441 (https://phabricator.wikimedia.org/T314947) [17:16:16] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:16:31] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:16:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on phab2002.codfw.wmnet with reason: reimage [17:16:47] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:16:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on phab2002.codfw.wmnet with reason: reimage [17:19:10] (03CR) 10Herron: [V: 03+1] grafana: add dashboard datasource usage (graphite) exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980048 
(https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [17:19:36] (03PS1) 10Subramanya Sastry: ParserOutput::getText(): do not clone ParserOutput when invoking pipeline [core] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982237 (https://phabricator.wikimedia.org/T353257) [17:22:23] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:23:31] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 2.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:24:41] (03CR) 10Subramanya Sastry: [C: 03+1] ParserOutput::getText(): do not clone ParserOutput when invoking pipeline [core] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982237 (https://phabricator.wikimedia.org/T353257) (owner: 10Subramanya Sastry) [17:25:12] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) Here's a new diff. This compares outputs from Nov 17 with today. The < is from the 17th, the < is today.... 
[17:29:29] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:29:42] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubestagemaster: Add http probe [puppet] - 10https://gerrit.wikimedia.org/r/982403 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [17:30:32] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [17:30:36] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testhost2001.codfw.wmnet with OS bullseye [17:32:29] (03CR) 10JMeybohm: [C: 03+1] "🚢" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981703 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [17:32:35] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt sessionstore - jclark@cumin1001" [17:33:25] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt sessionstore - jclark@cumin1001" [17:33:26] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:33:59] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:34:28] (03CR) 10JHathaway: [C: 03+1] "looks good, can you add a comment to the script for posterity?" [puppet] - 10https://gerrit.wikimedia.org/r/982420 (owner: 10Majavah) [17:35:53] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:36:14] (03CR) 10JHathaway: [C: 03+1] "looks good, I think a comment here as well would be nice" [puppet] - 10https://gerrit.wikimedia.org/r/982421 (owner: 10Majavah) [17:38:21] (03CR) 10JMeybohm: kube-state-metrics: DRY network policy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [17:38:53] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:40:07] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) >>! In T348643#9397062, @Jclark-ctr wrote: > @Andrew Dell is requesting smartctl output showing what dr... [17:41:10] (03CR) 10JMeybohm: [C: 03+1] "We should probably have a common_images helper for this" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982432 (owner: 10Clément Goubert) [17:41:47] (03CR) 10JMeybohm: [C: 03+1] mw-on-k8s: update php-fpm-exporter version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982431 (owner: 10Clément Goubert) [17:41:56] (03CR) 10JMeybohm: [C: 03+1] mw-debug: update php-fpm-exporter version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982430 (owner: 10Clément Goubert) [17:46:21] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:48:53] (03CR) 10Jdlrobson: Filter errors originating in external tools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975836 (https://phabricator.wikimedia.org/T349935) (owner: 10Jdlrobson) [17:52:00] (03PS2) 10Majavah: get_config: Respect .mailmap for git authors [puppet] - 10https://gerrit.wikimedia.org/r/982420 [17:52:02] (03PS2) 10Majavah: merge_cli: Respect .mailmap when formatting change list [puppet] - 10https://gerrit.wikimedia.org/r/982421 [17:52:07] (CertAlmostExpired) firing: Certificate for service kubestagemaster:6443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#kubestagemaster:6443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:52:15] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:59:19] (03PS6) 10Jdlrobson: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia) [17:59:26] (03CR) 10Jdlrobson: [C: 03+1] Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia) [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T1800) [18:00:46] (03CR) 10Majavah: [C: 03+2] get_config: Respect .mailmap for git authors [puppet] - 10https://gerrit.wikimedia.org/r/982420 (owner: 10Majavah) [18:00:51] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:00:55] (03CR) 10Majavah: [C: 03+2] merge_cli: Respect .mailmap when formatting change list
[puppet] - 10https://gerrit.wikimedia.org/r/982421 (owner: 10Majavah) [18:02:55] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:06:59] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:08:29] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:09:01] (03PS1) 10Bking: Accept document as script.source in addition to script.params.source (deprecated) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/982444 (https://phabricator.wikimedia.org/T353270) [18:10:02] !log reimaging phab2002 (stand-by phorge server with bullseye - T327068 [18:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:06] (03PS2) 10Bking: Accept document as script.source in addition to script.params.source (deprecated) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/982444 (https://phabricator.wikimedia.org/T353270) [18:10:25] T327068: Bullseye upgrade for remaining Collab hosts - https://phabricator.wikimedia.org/T327068 [18:11:53] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:12:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host phab2002.codfw.wmnet with OS bullseye [18:16:58] (SystemdUnitFailed) firing: man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:18:21] (03CR) 10Peter Fischer: [C: 03+1] "LGTM, thanks!" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/982444 (https://phabricator.wikimedia.org/T353270) (owner: 10Bking) [18:18:45] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:19:01] (03CR) 10Bking: [C: 03+2] Accept document as script.source in addition to script.params.source (deprecated) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/982444 (https://phabricator.wikimedia.org/T353270) (owner: 10Bking) [18:19:21] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:19:56] (03CR) 10Bking: [V: 04-1 C: 03+2] Accept document as script.source in addition to script.params.source (deprecated) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/982444 (https://phabricator.wikimedia.org/T353270) (owner: 10Bking) [18:20:01] (03CR) 10Bking: [C: 04-1] Accept document as script.source in addition to script.params.source (deprecated) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/982444 (https://phabricator.wikimedia.org/T353270) (owner: 10Bking) [18:21:41] (03CR) 10RLazarus: [C: 03+2] admin_ng: Add namespace and ClusterRole for Job sidecar controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/981703 
(https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [18:23:29] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:24:07] (03CR) 10Bking: [C: 03+2] Accept document as script.source in addition to script.params.source (deprecated) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/982444 (https://phabricator.wikimedia.org/T353270) (owner: 10Bking) [18:24:33] (03Merged) 10jenkins-bot: admin_ng: Add namespace and ClusterRole for Job sidecar controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/981703 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [18:26:31] PROBLEM - Check systemd state on moss-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service,confd_prometheus_metrics.service,export_smart_data_dump.service,prometheus-debian-version-textfile.service,prometheus-dpkg-success-textfile.service,prometheus-ipmi-exporter.service,prometheus-nic-firmware-textfile.service,prometheus-node-exporter-apt.service,prometheus-puppet-agent-stats.service,prometheus_intel_micr [18:26:31] rvice,wmf_auto_restart_nagios-nrpe-server.service,wmf_auto_restart_prometheus-node-exporter.service,wmf_auto_restart_rsyslog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on phab2002.codfw.wmnet with reason: host reimage [18:28:51] !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [18:29:58] !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [18:29:59] (03CR) 10Jforrester: [C: 03+1] "Good to deploy whenever." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/982441 (https://phabricator.wikimedia.org/T314947) (owner: 10Bartosz Dziewoński) [18:31:19] !log rzl@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [18:31:42] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:32:53] !log rzl@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [18:32:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab2002.codfw.wmnet with reason: host reimage [18:33:31] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 581.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:34:15] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:38:33] (03PS3) 10Majavah: P:openstack: nova: add script to run console commands [puppet] - 10https://gerrit.wikimedia.org/r/967219 (https://phabricator.wikimedia.org/T347683) [18:41:25] (03PS2) 10Krinkle: RunSingleJob.php: Remove overly complicated error handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982414 (https://phabricator.wikimedia.org/T353262) (owner: 10Bartosz Dziewoński) [18:41:29] (03CR) 10Majavah: [C: 03+2] P:openstack: nova: add script to run console commands [puppet] - 10https://gerrit.wikimedia.org/r/967219 (https://phabricator.wikimedia.org/T347683) (owner: 10Majavah) [18:42:00] (03CR) 10Dzahn: [C: 03+2] phabricator: add dedicated blackbox check for collab team and severity task [puppet] - 10https://gerrit.wikimedia.org/r/981951 (https://phabricator.wikimedia.org/T343517) (owner: 10Jelto) [18:45:44] (03PS1) 10Majavah: P:openstack: nova: fix file path [puppet] - 10https://gerrit.wikimedia.org/r/982451 [18:45:55] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for mcastro-wmf - https://phabricator.wikimedia.org/T353273 (10Aklapper) @Mcastro: Hi, the Phabricator account https://phabricator.wikimedia.org/p/Mcastro/ is linked to a self-created personal MediaWiki.org acco... 
[18:45:56] (03CR) 10Majavah: [V: 03+2 C: 03+2] P:openstack: nova: fix file path [puppet] - 10https://gerrit.wikimedia.org/r/982451 (owner: 10Majavah) [18:47:55] (03CR) 10Krinkle: RunSingleJob.php: Remove overly complicated error handling (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982414 (https://phabricator.wikimedia.org/T353262) (owner: 10Bartosz Dziewoński) [18:49:01] PROBLEM - MD RAID on moss-be2001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:55:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host phab2002.codfw.wmnet with OS bullseye [18:55:26] (03CR) 10Krinkle: RunSingleJob.php: Remove overly complicated error handling (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982414 (https://phabricator.wikimedia.org/T353262) (owner: 10Bartosz Dziewoński) [18:57:41] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:58:15] (03CR) 10Krinkle: [C: 03+1] "I changed my mind based on T352265. I guess the HTML response is fine-ish? Might be worth double checking with Hnolan that this really is " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982414 (https://phabricator.wikimedia.org/T353262) (owner: 10Bartosz Dziewoński) [18:59:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:00:09] brennen and hashar: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T1900). 
[19:01:04] o/ [19:01:39] (03PS3) 10Krinkle: mediawiki: Extend /portals max-age from 24h to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/817409 (https://phabricator.wikimedia.org/T313881) [19:03:10] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: enable new wmf-elasticsearch-search-plugins - bking@cumin2002 - T353270 [19:03:26] T353270: Update relforge elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 [19:03:50] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: enable new wmf-elasticsearch-search-plugins - bking@cumin2002 - T353270 [19:04:21] RECOVERY - cassandra-b CQL 10.192.32.227:9042 on restbase2031 is OK: TCP OK - 0.091 second response time on 10.192.32.227 port 9042 https://phabricator.wikimedia.org/T93886 [19:04:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982237 (https://phabricator.wikimedia.org/T353257) (owner: 10Subramanya Sastry) [19:06:33] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 8 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 8, active_shards: 8, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 8, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, nu [19:06:33] in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:06:49] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 214 threshold =0.15 breach: 
cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 218, active_shards: 218, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 214, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number [19:06:49] light_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.46296296296296 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:07:03] ^^ Relforge issues are known, will silence [19:08:42] ACKNOWLEDGEMENT - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The following units failed: man-db.service Brian_King T353270 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:42] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 214 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 218, active_shards: 218, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 214, delayed_unassigned_shards: 0, number_of_pending_tasks: 0 [19:08:42] _of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.46296296296296 Brian_King T353270 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:08:42] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 8 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 8, active_shards: 8, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 8, delayed_unassigned_shards: 0, number_of_pending_task [19:08:42] mber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 Brian_King T353270 
https://wikitech.wikimedia.org/wiki/Search%23Administration [19:08:43] ACKNOWLEDGEMENT - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_7@relforge-eqiad-small-alpha.service,elasticsearch_7@relforge-eqiad.service Brian_King T353270 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:43] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f62a07a3280: Failed to establish a new connection: [Errno 111] Connection refused)) Brian_Ki [19:08:44] 70 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:08:44] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f7d93da2280: Failed to establish a new connection: [Errno 111] Connection refused)) Brian_Ki [19:08:45] 70 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:08:56] !log 1.42.0-wmf.9 (T350085) status: deploying a fix for T353257 and then will proceed to group0. 
[19:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:06] T350085: 1.42.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T350085 [19:09:06] T353257: OutputTransformPipeline changes have broken ?useparsoid=1 with DiscussionTools - https://phabricator.wikimedia.org/T353257 [19:10:07] RECOVERY - cassandra-c service on restbase2031 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:10:09] RECOVERY - cassandra-c SSL 10.192.32.228:7000 on restbase2031 is OK: SSL OK - Certificate restbase2031-c valid until 2025-12-07 21:03:20 +0000 (expires in 726 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [19:18:35] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: enable new wmf-elasticsearch-search-plugins - bking@cumin2002 - T353270 [19:18:40] T353270: Update relforge elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 [19:23:26] (03Merged) 10jenkins-bot: ParserOutput::getText(): do not clone ParserOutput when invoking pipeline [core] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982237 (https://phabricator.wikimedia.org/T353257) (owner: 10Subramanya Sastry) [19:23:47] !log brennen@deploy2002 Started scap: Backport for [[gerrit:982237|ParserOutput::getText(): do not clone ParserOutput when invoking pipeline (T353257)]] [19:23:56] T353257: OutputTransformPipeline changes have broken ?useparsoid=1 with DiscussionTools - https://phabricator.wikimedia.org/T353257 [19:25:13] !log brennen@deploy2002 brennen and ssastry: Backport for [[gerrit:982237|ParserOutput::getText(): do not clone ParserOutput when invoking pipeline (T353257)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:26:05] !log brennen@deploy2002 brennen and ssastry: Continuing with sync [19:27:59] (PuppetFailure) 
firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:33:29] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:982237|ParserOutput::getText(): do not clone ParserOutput when invoking pipeline (T353257)]] (duration: 09m 41s) [19:33:40] T353257: OutputTransformPipeline changes have broken ?useparsoid=1 with DiscussionTools - https://phabricator.wikimedia.org/T353257 [19:34:14] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982457 (https://phabricator.wikimedia.org/T350085) [19:34:16] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982457 (https://phabricator.wikimedia.org/T350085) (owner: 10TrainBranchBot) [19:35:23] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982457 (https://phabricator.wikimedia.org/T350085) (owner: 10TrainBranchBot) [19:43:19] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.9 refs T350085 [19:43:23] T350085: 1.42.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T350085 [19:46:19] PROBLEM - Disk space on krb1001 is CRITICAL: DISK CRITICAL - free space: / 1147 MB (2% inode=97%): /tmp 1147 MB (2% inode=97%): /var/tmp 1147 MB (2% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops [19:46:25] !log ryankemper@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:46:29] !log ryankemper@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:53:53] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication 
lag: 6.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:56:34] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [19:57:08] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [19:59:19] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: enable new wmf-elasticsearch-search-plugins - bking@cumin2002 - T353270 [19:59:24] T353270: Update relforge elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 [20:00:21] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 9, active_shards: 16, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [20:00:21] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:00:35] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 233, active_shards: 436, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max [20:00:35] _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:01:23] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10jhathaway) @thcipriani just a reminder to approve 
this access request [20:02:28] (03PS3) 10Thcipriani: Add a banner for the 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/974166 (https://phabricator.wikimedia.org/T351109) (owner: 10Hashar) [20:03:46] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10thcipriani) >>! In T351431#9401293, @jhathaway wrote: > @thcipriani just a reminder to approve this access request Ah, thank you! Approved! [20:04:09] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [20:05:31] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [20:06:22] (03CR) 10RLazarus: [C: 03+2] admin_ng: Switch on enableJobSidecarController for mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/981704 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [20:09:21] (03Merged) 10jenkins-bot: admin_ng: Switch on enableJobSidecarController for mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/981704 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [20:11:32] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the production Swift cluster - https://phabricator.wikimedia.org/T350020 (10jhathaway) This is currently on the clinic duty workboard, but outside of clinic duties normal access requests. @jcre... [20:11:43] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 656.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:11:45] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10jhathaway) >>! In T351431#9401301, @thcipriani wrote: >>>! In T351431#9401293, @jhathaway wrote: >> @thcipriani just a reminder to approve this access request > > Ah, thank you! Approved! gr... 
[20:11:49] SRE, SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (jhathaway)
[20:13:44] (ProbeDown) firing: Service planet2003:443 has failed probes (http_en_planet_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#planet2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:14:14] SRE, SRE-Access-Requests, Structured-Data-Backlog, UploadWizard: Access request to deleted image files in the production Swift cluster - https://phabricator.wikimedia.org/T350020 (jcrespo) There is ongoing conversations with legal, which doesn't write here. Don't worry- deployment of this should...
[20:17:53] (PS1) Ryan Kemper: wdqs: prepare new public & internal hosts [puppet] - https://gerrit.wikimedia.org/r/982463 (https://phabricator.wikimedia.org/T982172)
[20:17:54] !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[20:19:05] (PS4) Thcipriani: Add a banner for the 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/974166 (https://phabricator.wikimedia.org/T351109) (owner: Hashar)
[20:28:56] !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[20:30:34] !log rzl@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[20:32:55] SRE-Access-Requests, Data-Persistence, Structured-Data-Backlog, UploadWizard: Access request to deleted image files in the production Swift cluster - https://phabricator.wikimedia.org/T350020 (Ladsgroup)
[20:33:33] !log rzl@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[20:37:52] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[20:38:24] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[20:40:55] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[20:41:20] (PS1) Santiago Faci: Remove partial migration of EditAttemptStep instrument [mediawiki-config] - https://gerrit.wikimedia.org/r/982467 (https://phabricator.wikimedia.org/T351335)
[20:42:07] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[20:43:56] (PS1) JHathaway: deployment group: add sfaci [puppet] - https://gerrit.wikimedia.org/r/982468 (https://phabricator.wikimedia.org/T351431)
[20:47:45] RECOVERY - Disk space on krb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops
[20:48:01] (CR) Bartosz Dziewoński: RunSingleJob.php: Remove overly complicated error handling (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/982414 (https://phabricator.wikimedia.org/T353262) (owner: Bartosz Dziewoński)
[20:48:24] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (jhathaway) @darthmon_wmde do you still need this access?
[20:50:21] (PS1) Andrew Bogott: Revert "Horizon: allow image uploading via horizon for users with glance admin" [puppet] - https://gerrit.wikimedia.org/r/982470 (https://phabricator.wikimedia.org/T326818)
[20:50:23] (PS1) Andrew Bogott: Horizon: update build version in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/982471 (https://phabricator.wikimedia.org/T326818)
[20:50:25] (PS1) Andrew Bogott: Horizon: update build version in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/982472 (https://phabricator.wikimedia.org/T326818)
[21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231212T2100)
[21:00:05] phuedx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:01:30] o/
[21:04:14] * TheresNoTime can deploy
[21:04:42] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/980963 (https://phabricator.wikimedia.org/T351292) (owner: Clare Ming)
[21:05:38] (Merged) jenkins-bot: Add stream config for Android article instruments [mediawiki-config] - https://gerrit.wikimedia.org/r/980963 (https://phabricator.wikimedia.org/T351292) (owner: Clare Ming)
[21:06:02] !log samtar@deploy2002 Started scap: Backport for [[gerrit:980963|Add stream config for Android article instruments (T351292)]]
[21:06:18] T351292: Deploy the latest version of the Java Metrics Platform client library - https://phabricator.wikimedia.org/T351292
[21:07:26] !log samtar@deploy2002 cjming and samtar: Backport for [[gerrit:980963|Add stream config for Android article instruments (T351292)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:07:29] phuedx: ready for testing on mwdebug
[21:10:09] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:10:29] TheresNoTime: Tested on mwdebug by checking that the streams appeared in the streamconfigs MediaWiki Action API: https://meta.wikimedia.org/w/api.php?format=json&action=streamconfigs
[21:10:44] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/982463 (https://phabricator.wikimedia.org/T982172) (owner: Ryan Kemper)
[21:10:46] syncing :)
[21:10:48] !log samtar@deploy2002 cjming and samtar: Continuing with sync
[21:13:37] (CR) Andrew Bogott: [C: +2] Revert "Horizon: allow image uploading via horizon for users with glance admin" [puppet] - https://gerrit.wikimedia.org/r/982470 (https://phabricator.wikimedia.org/T326818) (owner: Andrew Bogott)
[21:13:44] (CR) Andrew Bogott: [C: +2] Horizon: update build version in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/982471 (https://phabricator.wikimedia.org/T326818) (owner: Andrew Bogott)
[21:18:02] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:980963|Add stream config for Android article instruments (T351292)]] (duration: 11m 59s)
[21:18:06] T351292: Deploy the latest version of the Java Metrics Platform client library - https://phabricator.wikimedia.org/T351292
[21:18:16] phuedx: live on prod :)
[21:18:47] TheresNoTime ty!
[21:19:24] (CR) Gergő Tisza: [C: +1] use virtual db domain for CentralAuth and GlobalBlocking (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: ArielGlenn)
[21:29:10] (CR) Thcipriani: [C: +1] "Tested locally and lgtm, thank you for making this!" [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/974166 (https://phabricator.wikimedia.org/T351109) (owner: Hashar)
[21:29:14] (PS5) Cwhite: Filter errors originating in external tools [puppet] - https://gerrit.wikimedia.org/r/975836 (https://phabricator.wikimedia.org/T349935) (owner: Jdlrobson)
[21:32:31] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on planet2003.codfw.wmnet with reason: reimage
[21:32:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on planet2003.codfw.wmnet with reason: debugging
[21:32:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on planet2003.codfw.wmnet with reason: debugging
[21:33:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on planet1003.eqiad.wmnet with reason: debugging
[21:33:30] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on planet1003.eqiad.wmnet with reason: debugging
[21:33:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on planet1003.eqiad.wmnet with reason: debugging
[21:34:44] (CR) Cwhite: "Change tests ok against production. I'm ready to deploy when you are." [puppet] - https://gerrit.wikimedia.org/r/975836 (https://phabricator.wikimedia.org/T349935) (owner: Jdlrobson)
[21:35:40] (CR) Jforrester: [C: +1] RunSingleJob.php: Stop writing to $wgCommandLineMode [mediawiki-config] - https://gerrit.wikimedia.org/r/982415 (https://phabricator.wikimedia.org/T353262) (owner: Bartosz Dziewoński)
[21:41:28] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/982468 (https://phabricator.wikimedia.org/T351431) (owner: JHathaway)
[21:42:41] (CR) JHathaway: [C: +2] deployment group: add sfaci [puppet] - https://gerrit.wikimedia.org/r/982468 (https://phabricator.wikimedia.org/T351431) (owner: JHathaway)
[21:43:41] (CR) Volans: [C: +1] "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[21:44:22] (CR) Muehlenhoff: "(the change for idp-test1002.wikimedia.org is some PCC noise (the role isn't switched to nftables yet) and prometheus1006.eqiad.wmnet unre" [puppet] - https://gerrit.wikimedia.org/r/982428 (owner: Muehlenhoff)
[21:48:47] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:49:06] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (jhathaway) Open→Resolved enjoy!
[21:50:23] (PS9) Cwhite: Enable $wgStatsTarget for requests to mwdebug [mediawiki-config] - https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685)
[21:53:13] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:53:26] SRE, SRE-Access-Requests: Requesting access to Releng for sandeeps - https://phabricator.wikimedia.org/T353186 (jhathaway) @Sandeeps would you kindly submit a gerrit patch with your ssh key for verification, "SSH public key has to be submitted via gerrit patchset by user, or by some confirmed (non-email)...
[21:54:13] (PS1) Dzahn: planet: remove 2 broken feed URLs from cs language version [puppet] - https://gerrit.wikimedia.org/r/982481
[21:56:06] (CR) Dzahn: "Aklapper, do you see a working feed URL on these blogspot ones? Can you read the Czech?" [puppet] - https://gerrit.wikimedia.org/r/982481 (owner: Dzahn)
[21:57:17] SRE, SRE-Access-Requests: Requesting access to Releng for sandeeps - https://phabricator.wikimedia.org/T353186 (jhathaway)
[22:00:01] RECOVERY - Check systemd state on planet1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:00:53] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:02:13] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:04:37] SRE, SRE-Access-Requests, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for mcastro-wmf - https://phabricator.wikimedia.org/T353273 (jhathaway) @Mcastro would you kindly update this request using the access template, https://phabricator.wikimedia.org/maniphest/task/edit/form/8/
[22:07:31] (PS1) Dzahn: planet: enable feed updates on both new VMs [puppet] - https://gerrit.wikimedia.org/r/982484 (https://phabricator.wikimedia.org/T348392)
[22:08:17] (CR) Dzahn: [C: +2] planet: remove 2 broken feed URLs from cs language version [puppet] - https://gerrit.wikimedia.org/r/982481 (owner: Dzahn)
[22:16:58] (SystemdUnitFailed) firing: man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:18:02] (CR) Dzahn: [C: +2] Switch planet to bookworm VM backends [dns] - https://gerrit.wikimedia.org/r/982156 (https://phabricator.wikimedia.org/T348392) (owner: Dzahn)
[22:27:40] (CR) Dzahn: [C: +2] planet: enable feed updates on both new VMs [puppet] - https://gerrit.wikimedia.org/r/982484 (https://phabricator.wikimedia.org/T348392) (owner: Dzahn)
[22:29:13] (PS1) JHathaway: wmde: grant nda & wmde access for ArthurTaylor [puppet] - https://gerrit.wikimedia.org/r/982485 (https://phabricator.wikimedia.org/T352653)
[22:30:45] (CR) Volans: [C: +1] "Looks good, make sure to test it." [cookbooks] - https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[22:30:50] (CR) Dzahn: [C: +1] "lgtm, has approval from WMDE-leszek and confirmation from KFrancis" [puppet] - https://gerrit.wikimedia.org/r/982485 (https://phabricator.wikimedia.org/T352653) (owner: JHathaway)
[22:34:32] (CR) JHathaway: [C: +2] wmde: grant nda & wmde access for ArthurTaylor [puppet] - https://gerrit.wikimedia.org/r/982485 (https://phabricator.wikimedia.org/T352653) (owner: JHathaway)
[22:37:39] SRE, LDAP-Access-Requests, Patch-For-Review: Request to be added to the ldap/wmde group for ArthurTaylor - https://phabricator.wikimedia.org/T352653 (jhathaway) Open→Resolved a:jhathaway enjoy!
[22:43:52] !log planet2003 -manually upgrade rawdog package to 3.0.2 T348392
[22:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:56] T348392: Migrate planet servers to bullseye or bookworm - https://phabricator.wikimedia.org/T348392
[22:44:39] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:49:36] (PS2) Andrew Bogott: Horizon: update build version in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/982472 (https://phabricator.wikimedia.org/T326818)
[22:49:38] (PS1) Andrew Bogott: Update horizon docker version for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/982488
[22:50:38] (PS1) Dzahn: planet: switch to new eqiad backend [dns] - https://gerrit.wikimedia.org/r/982489 (https://phabricator.wikimedia.org/T348392)
[22:50:53] (CR) Andrew Bogott: [C: +2] Update horizon docker version for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/982488 (owner: Andrew Bogott)
[22:52:59] (CR) Dzahn: [C: +2] planet: switch to new eqiad backend [dns] - https://gerrit.wikimedia.org/r/982489 (https://phabricator.wikimedia.org/T348392) (owner: Dzahn)
[22:57:09] !log planet - switched to eqiad and bookworm backend (T348392 T345617) - https://meta.wikimedia.org/wiki/Planet_Wikimedia
[22:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:15] T348392: Migrate planet servers to bullseye or bookworm - https://phabricator.wikimedia.org/T348392
[22:57:15] T345617: Switchover planet.wikimedia.org - September 2023 - https://phabricator.wikimedia.org/T345617
[22:59:27] (CR) Dzahn: [C: +2] "contact https://meta.wikimedia.org/wiki/User:Okino about this?" [puppet] - https://gerrit.wikimedia.org/r/982481 (owner: Dzahn)
[23:01:05] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 830.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:05:48] !log removing 2 files for legal compliance
[23:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:51] RECOVERY - Check systemd state on planet2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:15] RECOVERY - cassandra-c CQL 10.192.32.228:9042 on restbase2031 is OK: TCP OK - 0.099 second response time on 10.192.32.228 port 9042 https://phabricator.wikimedia.org/T93886
[23:26:15] !log removing 2 files for legal compliance
[23:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:15] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure