[00:02:56] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P34419 and previous config saved to /var/cache/conftool/dbconfig/20220912-000356-ladsgroup.json
[00:04:00] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[00:10:08] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:13:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:16:18] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:17:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:19:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P34420 and previous config saved to /var/cache/conftool/dbconfig/20220912-001902-ladsgroup.json
[00:25:06] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:31:46] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:18] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:34:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P34421 and previous config saved to /var/cache/conftool/dbconfig/20220912-003409-ladsgroup.json
[00:34:38] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:38:50] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P34422 and previous config saved to /var/cache/conftool/dbconfig/20220912-004915-ladsgroup.json
[00:49:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance
[00:49:20] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[00:49:24] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance
[00:49:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[00:49:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[00:49:50] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:49:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T312863)', diff saved to https://phabricator.wikimedia.org/P34423 and previous config saved to /var/cache/conftool/dbconfig/20220912-004952-ladsgroup.json
[00:51:28] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:56:38] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:57:04] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:58:40] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:02:54] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:08] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:13:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:18:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:21:10] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:21:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T314041)', diff saved to https://phabricator.wikimedia.org/P34424 and previous config saved to /var/cache/conftool/dbconfig/20220912-012118-ladsgroup.json
[01:21:22] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[01:23:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:25:12] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:31:50] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:32:24] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:36:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P34425 and previous config saved to /var/cache/conftool/dbconfig/20220912-013625-ladsgroup.json
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:37:12] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:39:02] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:34] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:58] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:51:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P34426 and previous config saved to /var/cache/conftool/dbconfig/20220912-015131-ladsgroup.json
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:24] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:53:16] <icinga-wm>	 PROBLEM - SSH on mw1311.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:56:50] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T314041)', diff saved to https://phabricator.wikimedia.org/P34427 and previous config saved to /var/cache/conftool/dbconfig/20220912-020638-ladsgroup.json
[02:06:42] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:04] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:32:02] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:33:20] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:39:14] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:43:46] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[02:45:24] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:49:46] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:50:58] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:53:22] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 6 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:57:02] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:05:14] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:04] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:12:28] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:18] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:32:16] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:39:30] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:50:04] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:55:58] <icinga-wm>	 RECOVERY - SSH on mw1311.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:57:18] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:05:30] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:42] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:43] <wikibugs>	 (PS1) Stang: Revert "kowiki: Change logo for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831211
[04:24:55] <wikibugs>	 (PS1) Stang: Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831212
[04:25:11] <wikibugs>	 (PS2) Stang: Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831212
[04:26:20] <wikibugs>	 (PS2) Stang: Revert "kowiki: Change logo for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831211
[04:26:31] <wikibugs>	 (PS3) Stang: Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831212
[04:32:30] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T314041)', diff saved to https://phabricator.wikimedia.org/P34428 and previous config saved to /var/cache/conftool/dbconfig/20220912-043624-ladsgroup.json
[04:36:29] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[04:39:42] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:44:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:49:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:50:16] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:51:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P34429 and previous config saved to /var/cache/conftool/dbconfig/20220912-045130-ladsgroup.json
[04:53:04] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:55:30] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:57:32] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:06:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P34430 and previous config saved to /var/cache/conftool/dbconfig/20220912-050636-ladsgroup.json
[05:14:48] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:16:32] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:19:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2020 for upgrade T317507', diff saved to https://phabricator.wikimedia.org/P34431 and previous config saved to /var/cache/conftool/dbconfig/20220912-051906-root.json
[05:19:10] <stashbot>	 T317507: Switchover es4 codfw master (es2021 -> es2020) - https://phabricator.wikimedia.org/T317507
[05:19:16] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:19:38] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:19:44] <wikibugs>	 (PS1) Marostegui: es2020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/831369
[05:20:51] <wikibugs>	 (CR) Marostegui: [C: +2] es2020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/831369 (owner: Marostegui)
[05:21:21] <marostegui>	 !log dbmaint Reboot es2020 for kernel upgrade T317507
[05:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T314041)', diff saved to https://phabricator.wikimedia.org/P34432 and previous config saved to /var/cache/conftool/dbconfig/20220912-052143-ladsgroup.json
[05:21:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[05:21:46] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[05:21:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[05:22:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T317507
[05:23:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T317507
[05:23:44] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:26:28] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:32:48] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:35:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 5%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34433 and previous config saved to /var/cache/conftool/dbconfig/20220912-053504-root.json
[05:35:48] <wikibugs>	 (PS1) Marostegui: Revert "es2020: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/831213
[05:36:26] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:36:28] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2020: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/831213 (owner: Marostegui)
[05:40:02] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:40:36] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:46:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:47:50] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:50:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 10%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34434 and previous config saved to /var/cache/conftool/dbconfig/20220912-055008-root.json
[05:51:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2024 for upgrade', diff saved to https://phabricator.wikimedia.org/P34435 and previous config saved to /var/cache/conftool/dbconfig/20220912-055101-root.json
[05:58:12] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:03:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34436 and previous config saved to /var/cache/conftool/dbconfig/20220912-060305-root.json
[06:05:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 25%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34437 and previous config saved to /var/cache/conftool/dbconfig/20220912-060513-root.json
[06:05:33] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:09:17] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:18:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34438 and previous config saved to /var/cache/conftool/dbconfig/20220912-061810-root.json
[06:19:07] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:20:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 50%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34439 and previous config saved to /var/cache/conftool/dbconfig/20220912-062018-root.json
[06:25:09] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:29:09] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:31:48] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:32:56] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34440 and previous config saved to /var/cache/conftool/dbconfig/20220912-063314-root.json
[06:35:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 75%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34441 and previous config saved to /var/cache/conftool/dbconfig/20220912-063523-root.json
[06:36:24] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:36] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:37:42] <wikibugs>	 (PS1) Ladsgroup: Stop writing to the old templatelinks fields everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/831374 (https://phabricator.wikimedia.org/T312865)
[06:43:46] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[06:43:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T312863)', diff saved to https://phabricator.wikimedia.org/P34442 and previous config saved to /var/cache/conftool/dbconfig/20220912-064350-ladsgroup.json
[06:43:54] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[06:47:37] <moritzm>	 !log installing 5.10.136 updates on buster systems running 5.10
[06:47:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34443 and previous config saved to /var/cache/conftool/dbconfig/20220912-064819-root.json
[06:50:24] <wikibugs>	 (CR) Slyngshede: [C: +2] C:raid::perccli handle case with no virtual devices. [puppet] - https://gerrit.wikimedia.org/r/831057 (https://phabricator.wikimedia.org/T317344) (owner: Slyngshede)
[06:50:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 100%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34444 and previous config saved to /var/cache/conftool/dbconfig/20220912-065028-root.json
[06:53:48] <wikibugs>	 (CR) Elukey: [C: +1] Class-based cookbooks: use parent argument_parser [cookbooks] - https://gerrit.wikimedia.org/r/830786 (owner: Volans)
[06:55:44] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:58:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P34445 and previous config saved to /var/cache/conftool/dbconfig/20220912-065856-ladsgroup.json
[07:01:25] <Amir1>	 jouncebot: nowandnext
[07:01:26] <jouncebot>	 No deployments scheduled for the foreseeable future!
[07:01:26] <jouncebot>	 No deployments scheduled for the foreseeable future!
[07:01:43] <Amir1>	 aaah, the calendar is not added, then I just deploy something
[07:02:10] <wikibugs>	 (CR) Ladsgroup: [C: +2] Stop writing to the old templatelinks fields everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/831374 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup)
[07:02:56] <wikibugs>	 (Merged) jenkins-bot: Stop writing to the old templatelinks fields everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/831374 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup)
[07:03:00] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:03:12] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:03:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34446 and previous config saved to /var/cache/conftool/dbconfig/20220912-070324-root.json
[07:03:53] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/831374 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup)
[07:04:08] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:831374|Stop writing to the old templatelinks fields everywhere (T312865)]]
[07:04:10] <stashbot>	 T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865
[07:04:32] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:831374|Stop writing to the old templatelinks fields everywhere (T312865)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[07:06:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:06:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:06:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:07:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:10:22] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:11:05] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:831374|Stop writing to the old templatelinks fields everywhere (T312865)]] (duration: 06m 57s)
[07:11:09] <stashbot>	 T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865
[07:14:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P34447 and previous config saved to /var/cache/conftool/dbconfig/20220912-071403-ladsgroup.json
[07:16:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[07:16:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[07:17:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34448 and previous config saved to /var/cache/conftool/dbconfig/20220912-071700-ladsgroup.json
[07:17:03] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[07:18:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34449 and previous config saved to /var/cache/conftool/dbconfig/20220912-071829-root.json
[07:23:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:26:55] <wikibugs>	 SRE-OnFire, Observability-Alerting: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (fgiunchedi) Thank you for taking the time to look into this @cdanis! Overall LGTM on the fixes you are suggesting
[07:27:04] <wikibugs>	 SRE, Machine-Learning-Team, Service-deployment-requests, artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (Aklapper) Adding #Machine-Learning-Team per my last question
[07:27:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:27:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:29:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T312863)', diff saved to https://phabricator.wikimedia.org/P34450 and previous config saved to /var/cache/conftool/dbconfig/20220912-072909-ladsgroup.json
[07:29:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[07:29:13] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[07:29:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[07:29:25] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-General, Thumbor: File:Keep_tidy_ask.svg 404 on Commons - https://phabricator.wikimedia.org/T314712 (Aklapper) `Original file` link works; is there more to do in this ticket or can this be `resolved`?
[07:29:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T312863)', diff saved to https://phabricator.wikimedia.org/P34452 and previous config saved to /var/cache/conftool/dbconfig/20220912-072931-ladsgroup.json
[07:31:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:31:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[07:31:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[07:33:18] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T317507
[07:33:21] <stashbot>	 T317507: Switchover es4 codfw master (es2021 -> es2020) - https://phabricator.wikimedia.org/T317507
[07:33:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T317507
[07:34:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es2020 with weight 0 T317507', diff saved to https://phabricator.wikimedia.org/P34453 and previous config saved to /var/cache/conftool/dbconfig/20220912-073408-root.json
[07:37:28] <wikibugs>	 (PS1) Marostegui: mariadb: Promote es2020 to es4 master [puppet] - https://gerrit.wikimedia.org/r/831464 (https://phabricator.wikimedia.org/T317507)
[07:38:46] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote es2020 to es4 master [puppet] - https://gerrit.wikimedia.org/r/831464 (https://phabricator.wikimedia.org/T317507) (owner: Marostegui)
[07:39:21] <marostegui>	 !log Starting es4 codfw failover from es2021 to es2020 - T317507
[07:39:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:25] <stashbot>	 T317507: Switchover es4 codfw master (es2021 -> es2020) - https://phabricator.wikimedia.org/T317507
[07:39:28] <wikibugs>	 (PS1) Jforrester: Restore compatibility with overrides for IndexPager::makeLink() [core] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/831215 (https://phabricator.wikimedia.org/T317477)
[07:41:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2020 to es4 primary and set section read-write T317507', diff saved to https://phabricator.wikimedia.org/P34454 and previous config saved to /var/cache/conftool/dbconfig/20220912-074100-root.json
[07:42:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2021 T317507', diff saved to https://phabricator.wikimedia.org/P34455 and previous config saved to /var/cache/conftool/dbconfig/20220912-074258-root.json
[07:43:12] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: Remove obsolete recording rules [puppet] - https://gerrit.wikimedia.org/r/830644 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[07:43:45] <wikibugs>	 (PS3) Aklapper: Revert "kowiki: Change logo for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831211 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[07:43:46] <hashar>	 Good "morning", I am upgrading the Jenkins instances this morning
[07:43:53] <wikibugs>	 (PS4) Aklapper: Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831212 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[07:45:16] <wikibugs>	 (PS1) Marostegui: es2021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/831478
[07:45:59] <wikibugs>	 (CR) Marostegui: [C: +2] es2021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/831478 (owner: Marostegui)
[07:47:16] <hashar>	 !log Upgraded Jenkins instances from  2.346.1 to 2.346.3 # T317418
[07:47:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:19] <stashbot>	 T317418: Upgrade Jenkins to latest LTS 2.361.1 - https://phabricator.wikimedia.org/T317418
[07:47:42] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, see also inline" [puppet] - https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) (owner: Cwhite)
[07:48:41] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Very nice! I _think_ you can also nuke the resources (without the ensure => absent dance) in this case" [puppet] - https://gerrit.wikimedia.org/r/830643 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[07:49:01] <wikibugs>	 (PS2) David Caro: bootstrap_and_add: added preflight checks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021)
[07:49:33] <wikibugs>	 (CR) David Caro: [V: +1 C: +2] opensatck: remove some not needed absented resources [puppet] - https://gerrit.wikimedia.org/r/831089 (owner: David Caro)
[07:49:44] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:44] <wikibugs>	 (PS1) Cathal Mooney: Depool esams for cr3-esams core router upgrade. [dns] - https://gerrit.wikimedia.org/r/831479 (https://phabricator.wikimedia.org/T295690)
[07:51:51] <wikibugs>	 (PS3) Volans: Class-based cookbooks: use parent argument_parser [cookbooks] - https://gerrit.wikimedia.org/r/830786
[07:53:15] <wikibugs>	 (PS3) Filippo Giunchedi: sre: followup on Kafka partition replication alerts [alerts] - https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010)
[07:53:17] <wikibugs>	 (PS7) Filippo Giunchedi: sre: port Zookeeper alerts [alerts] - https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847)
[07:53:59] <wikibugs>	 (CR) Filippo Giunchedi: sre: followup on Kafka partition replication alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) (owner: Filippo Giunchedi)
[07:54:29] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/831479 (https://phabricator.wikimedia.org/T295690) (owner: Cathal Mooney)
[07:55:13] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Depool esams for cr3-esams core router upgrade. [dns] - https://gerrit.wikimedia.org/r/831479 (https://phabricator.wikimedia.org/T295690) (owner: Cathal Mooney)
[07:55:14] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:56:16] <wikibugs>	 SRE, Cloud-Services, Infrastructure-Foundations, netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (cmooney) Ok cool well we can close this in that case I think.  Cheers.
[07:56:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es5 T317508
[07:56:43] <stashbot>	 T317508: Switchover es5 codfw master (es2023 -> es2024) - https://phabricator.wikimedia.org/T317508
[07:56:56] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es5 T317508
[07:57:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es2024 with weight 0 T317508', diff saved to https://phabricator.wikimedia.org/P34456 and previous config saved to /var/cache/conftool/dbconfig/20220912-075739-root.json
[07:57:45] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on kafka-logging2001.codfw.wmnet with reason: Kafka PKI upgrade
[07:58:10] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on kafka-logging2001.codfw.wmnet with reason: Kafka PKI upgrade
[07:58:44] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] Move kafka on kafka-logging2001 to PKI TLS certificates [puppet] - https://gerrit.wikimedia.org/r/831096 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[08:00:38] <topranks>	 !log de-pooling esams in advance of upgrade to core router cr3-esams T295690
[08:00:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:41] <stashbot>	 T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690
[08:00:56] <hashar>	 !log Restarting CI Jenkins for upgrade T317418
[08:00:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:59] <stashbot>	 T317418: Upgrade Jenkins to latest LTS 2.361.1 - https://phabricator.wikimedia.org/T317418
[08:01:25] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr3-esams with reason: router upgrade
[08:01:31] <elukey>	 !log restart kafka on kafka2001 to pick up new PKI settings
[08:01:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:33] <wikibugs>	 (PS1) Marostegui: mariadb: Promote es2024 to es5 codfw master [puppet] - https://gerrit.wikimedia.org/r/831480 (https://phabricator.wikimedia.org/T317508)
[08:01:39] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr3-esams with reason: router upgrade
[08:01:47] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1e573369-5fdd-4621-8ae7-786b5a67de04) set by cmooney@cumin1001 for 2:00:00 on 1 host(s) and th...
[08:02:32] <wikibugs>	 (PS1) Jelto: gitlab_runner: allow Trusted Runners to access wikimedia docker-registry [puppet] - https://gerrit.wikimedia.org/r/831481 (https://phabricator.wikimedia.org/T308271)
[08:02:38] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote es2024 to es5 codfw master [puppet] - https://gerrit.wikimedia.org/r/831480 (https://phabricator.wikimedia.org/T317508) (owner: Marostegui)
[08:02:55] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:01] <marostegui>	 !log Starting es5 codfw failover from es2023 to es2024 - T317508
[08:03:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:05] <stashbot>	 T317508: Switchover es5 codfw master (es2023 -> es2024) - https://phabricator.wikimedia.org/T317508
[08:04:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2024 to es5 codfw primary T317508', diff saved to https://phabricator.wikimedia.org/P34457 and previous config saved to /var/cache/conftool/dbconfig/20220912-080400-root.json
[08:05:56] <hashar>	 I might have broken the CI Jenkins :-(
[08:06:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2023 T317508', diff saved to https://phabricator.wikimedia.org/P34458 and previous config saved to /var/cache/conftool/dbconfig/20220912-080602-root.json
[08:06:07] <icinga-wm>	 PROBLEM - DPKG on contint2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:06:11] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:06:47] <icinga-wm>	 PROBLEM - SSH on analytics1073.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:07:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34459 and previous config saved to /var/cache/conftool/dbconfig/20220912-080719-root.json
[08:07:23] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:07:24] <wikibugs>	 (CR) Volans: [C: +2] Class-based cookbooks: use parent argument_parser [cookbooks] - https://gerrit.wikimedia.org/r/830786 (owner: Volans)
[08:07:39] <wikibugs>	 (PS1) Marostegui: Revert "es2021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/831216
[08:07:59] <icinga-wm>	 PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:08:33] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/831216 (owner: Marostegui)
[08:08:59] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.cf
[08:09:00] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[08:13:20] <wikibugs>	 SRE, Observability-Logging, Patch-For-Review, SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (elukey) kafka-logging2001 migrated to PKI, all good from what I can see in metrics!  Next steps: - wait a couple of days wi...
[08:13:24] <wikibugs>	 (Merged) jenkins-bot: Class-based cookbooks: use parent argument_parser [cookbooks] - https://gerrit.wikimedia.org/r/830786 (owner: Volans)
[08:15:42] <wikibugs>	 (CR) Muehlenhoff: "Another pass of comments, this is going into the right direction." [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: Slyngshede)
[08:17:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 5%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34460 and previous config saved to /var/cache/conftool/dbconfig/20220912-081754-root.json
[08:17:56] <moritzm>	 !log imported jenkins 2.361.1 to thirdparty/ci T317418
[08:17:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:00] <stashbot>	 T317418: Upgrade Jenkins to latest LTS 2.361.1 - https://phabricator.wikimedia.org/T317418
[08:19:27] <wikibugs>	 (PS1) Marostegui: es1032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/831482
[08:19:33] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1032', diff saved to https://phabricator.wikimedia.org/P34461 and previous config saved to /var/cache/conftool/dbconfig/20220912-081936-root.json
[08:20:15] <wikibugs>	 (CR) Marostegui: [C: +2] es1032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/831482 (owner: Marostegui)
[08:21:05] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:22:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34462 and previous config saved to /var/cache/conftool/dbconfig/20220912-082224-root.json
[08:22:26] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37221/console" [puppet] - https://gerrit.wikimedia.org/r/831481 (https://phabricator.wikimedia.org/T308271) (owner: Jelto)
[08:23:01] <wikibugs>	 (PS1) Volans: doc: add TOX_SKIP_ENV example for development [software/spicerack] - https://gerrit.wikimedia.org/r/831483
[08:25:21] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:19] <wikibugs>	 (CR) Muehlenhoff: smart: restore get_fact and deprecate get_raid_drivers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) (owner: Cwhite)
[08:28:20] <wikibugs>	 (PS1) Volans: sre.hosts.decommission: test IPMI connection [cookbooks] - https://gerrit.wikimedia.org/r/831484
[08:29:28] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] gitlab_runner: allow Trusted Runners to access wikimedia docker-registry [puppet] - https://gerrit.wikimedia.org/r/831481 (https://phabricator.wikimedia.org/T308271) (owner: Jelto)
[08:32:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 10%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34463 and previous config saved to /var/cache/conftool/dbconfig/20220912-083258-root.json
[08:33:00] <icinga-wm>	 PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:33:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34464 and previous config saved to /var/cache/conftool/dbconfig/20220912-083308-root.json
[08:36:01] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr3-esams,cr3-esams IPv6,cr3-esams.mgmt with reason: router upgrade
[08:36:02] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr3-esams,cr3-esams IPv6,cr3-esams.mgmt with reason: router upgrade
[08:36:18] <icinga-wm>	 RECOVERY - DPKG on contint2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:36:20] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:37:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34465 and previous config saved to /var/cache/conftool/dbconfig/20220912-083729-root.json
[08:38:01] <wikibugs>	 (PS1) Marostegui: Revert "es1032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/831218
[08:38:24] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1096.eqiad.wmnet
[08:39:21] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es1032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/831218 (owner: Marostegui)
[08:39:34] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt with reason: router upgrade
[08:39:50] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt with reason: router upgrade
[08:39:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:39:58] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=57f0ae1d-0fa1-4b98-9454-bea638ac3971) set by cmooney@cumin1001 for 2:00:00 on 3 host(s) and th...
[08:42:52] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:45:28] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1096.eqiad.wmnet
[08:45:57] <wikibugs>	 SRE, Infrastructure-Foundations, LDAP: Add slapd audit logs to backup - https://phabricator.wikimedia.org/T317516 (MoritzMuehlenhoff)
[08:47:05] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1097.eqiad.wmnet
[08:47:46] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:48:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 25%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34466 and previous config saved to /var/cache/conftool/dbconfig/20220912-084803-root.json
[08:48:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34467 and previous config saved to /var/cache/conftool/dbconfig/20220912-084812-root.json
[08:52:17] <wikibugs>	 (CR) Btullis: [C: +2] Add configuration for k8s-dse in prometheus [puppet] - https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179) (owner: Btullis)
[08:52:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34468 and previous config saved to /var/cache/conftool/dbconfig/20220912-085234-root.json
[08:53:51] <wikibugs>	 (CR) Clément Goubert: [C: +1] sre.hosts.decommission: test IPMI connection [cookbooks] - https://gerrit.wikimedia.org/r/831484 (owner: Volans)
[08:54:29] <wikibugs>	 (CR) Jbond: "lgtm, see nits" [puppet] - https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) (owner: Cwhite)
[08:55:17] <wikibugs>	 (PS1) Muehlenhoff: openldap: Include slapd-audit.log to backup [puppet] - https://gerrit.wikimedia.org/r/831490 (https://phabricator.wikimedia.org/T317516)
[08:56:24] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:56:43] <wikibugs>	 (CR) Muehlenhoff: smart: restore get_fact and deprecate get_raid_drivers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) (owner: Cwhite)
[08:56:50] <dcausse>	 jouncebot: next
[08:56:50] <jouncebot>	 No deployments scheduled for the foreseeable future!
[08:56:54] <wikibugs>	 (CR) Vgutierrez: [C: -1] "see inline comments, looking good overall" [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall)
[08:57:02] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1097.eqiad.wmnet
[09:00:33] <wikibugs>	 (PS3) JMeybohm: kubernetes: Remove obsolete monitoring::check_prometheus resources [puppet] - https://gerrit.wikimedia.org/r/830643 (https://phabricator.wikimedia.org/T311251)
[09:00:35] <wikibugs>	 (PS3) JMeybohm: prometheus: Remove obsolete recording rules [puppet] - https://gerrit.wikimedia.org/r/830644 (https://phabricator.wikimedia.org/T311251)
[09:02:16] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:02:22] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 88, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:02:44] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:03:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 50%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34469 and previous config saved to /var/cache/conftool/dbconfig/20220912-090308-root.json
[09:03:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34470 and previous config saved to /var/cache/conftool/dbconfig/20220912-090317-root.json
[09:05:22] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (MoritzMuehlenhoff)
[09:05:43] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[09:06:24] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:07:02] <icinga-wm>	 RECOVERY - SSH on analytics1073.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:07:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34471 and previous config saved to /var/cache/conftool/dbconfig/20220912-090739-root.json
[09:08:02] <wikibugs>	 (CR) JMeybohm: [C: +2] kubernetes: Remove obsolete monitoring::check_prometheus resources [puppet] - https://gerrit.wikimedia.org/r/830643 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[09:08:14] <icinga-wm>	 RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:09:29] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1098.eqiad.wmnet
[09:11:12] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:15:02] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:15:12] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:18:03] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1098.eqiad.wmnet
[09:18:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 75%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34472 and previous config saved to /var/cache/conftool/dbconfig/20220912-091813-root.json
[09:18:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34473 and previous config saved to /var/cache/conftool/dbconfig/20220912-091822-root.json
[09:18:38] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[09:19:46] <wikibugs>	 (PS1) JMeybohm: prometheus: Keep envoy connection metrics [puppet] - https://gerrit.wikimedia.org/r/831492 (https://phabricator.wikimedia.org/T317430)
[09:22:06] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:22:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34474 and previous config saved to /var/cache/conftool/dbconfig/20220912-092244-root.json
[09:27:24] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:27:42] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:31:35] <moritzm>	 !log updated buster install image for 10.13 release T317413
[09:31:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:39] <stashbot>	 T317413: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413
[09:32:09] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (MoritzMuehlenhoff)
[09:32:58] <icinga-wm>	 RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:33:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 100%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34475 and previous config saved to /var/cache/conftool/dbconfig/20220912-093318-root.json
[09:33:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34476 and previous config saved to /var/cache/conftool/dbconfig/20220912-093327-root.json
[09:35:19] * Emperor really isn't still on clinic duty :p
[09:35:37] <jynus>	 who is?
[09:35:47] <jynus>	 (I only changed what I knew)
[09:36:07] <marostegui>	 jynus: looks like brett 
[09:36:16] <marostegui>	 per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Schedule
[09:36:28] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:36:30] * Emperor had just got there but marostegui is quicker :)
[09:36:50] <jynus>	 also I didn't change that because I thought CD change happened at SRE meeting time
[09:37:24] * Emperor has usually done it from the start of their Monday working day
[09:37:48] <Emperor>	 (so I'd expect brett to pick it up later, but I'm not expecting to do any clinic stuff today, IYSWIM)
[09:38:29] <moritzm>	 yeah, Monday morning for whatever time is morning for the person is the current standard practice
[09:38:46] <moritzm>	 especially given that we don't have weekly SRE meetings for some time now :-)
[09:40:24] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:41:18] <wikibugs>	 (CR) Jcrespo: [C: +1] "Please check it worked tomorrow, or ping me for me to check." [puppet] - https://gerrit.wikimedia.org/r/831490 (https://phabricator.wikimedia.org/T317516) (owner: Muehlenhoff)
[09:41:56] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:45:54] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt with reason: router upgrade
[09:45:58] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt with reason: router upgrade
[09:46:05] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=39465e0b-b93d-45ba-b1d8-0c49dacc39fb) set by cmooney@cumin1001 for 2:00:00 on 3 host(s) and th...
[09:46:30] <wikibugs>	 (PS1) Jbond: prepare: rename hiera files indicateing new wmcs realm name [software/puppet-compiler] - https://gerrit.wikimedia.org/r/831494
[09:47:30] <wikibugs>	 (CR) CI reject: [V: -1] prepare: rename hiera files indicateing new wmcs realm name [software/puppet-compiler] - https://gerrit.wikimedia.org/r/831494 (owner: Jbond)
[09:48:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1033', diff saved to https://phabricator.wikimedia.org/P34477 and previous config saved to /var/cache/conftool/dbconfig/20220912-094818-root.json
[09:48:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34478 and previous config saved to /var/cache/conftool/dbconfig/20220912-094832-root.json
[09:51:03] <wikibugs>	 (CR) JMeybohm: [C: +2] prometheus: Remove obsolete recording rules [puppet] - https://gerrit.wikimedia.org/r/830644 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[09:54:22] <wikibugs>	 (PS2) Jbond: prepare: rename hiera files indicating new wmcs realm name [software/puppet-compiler] - https://gerrit.wikimedia.org/r/831494
[09:55:38] <Emperor>	 !log rebalance thanos rings T311690
[09:55:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:41] <stashbot>	 T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690
[09:57:12] <wikibugs>	 (CR) Jbond: [C: +2] prepare: rename hiera files indicating new wmcs realm name [software/puppet-compiler] - https://gerrit.wikimedia.org/r/831494 (owner: Jbond)
[09:59:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 5%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34479 and previous config saved to /var/cache/conftool/dbconfig/20220912-095918-root.json
[09:59:54] <wikibugs>	 (PS1) Jbond: puppet_compiler: bump version number [puppet] - https://gerrit.wikimedia.org/r/831495
[10:02:17] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:03:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34480 and previous config saved to /var/cache/conftool/dbconfig/20220912-100337-root.json
[10:03:50] <wikibugs>	 (CR) Jbond: [C: +2] puppet_compiler: bump version number [puppet] - https://gerrit.wikimedia.org/r/831495 (owner: Jbond)
[10:06:33] <wikibugs>	 (PS6) Jbond: puppetmaster: remove puppet-merge from wmcs instances [puppet] - https://gerrit.wikimedia.org/r/831228 (https://phabricator.wikimedia.org/T309281) (owner: Majavah)
[10:08:01] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37224/console" [puppet] - https://gerrit.wikimedia.org/r/831228 (https://phabricator.wikimedia.org/T309281) (owner: Majavah)
[10:08:21] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:33] <wikibugs>	 (PS3) Majavah: Remove support for overriding LDAP client stack [puppet] - https://gerrit.wikimedia.org/r/826536
[10:13:41] <wikibugs>	 (CR) Jbond: [C: +2] "lgtm will merge" [puppet] - https://gerrit.wikimedia.org/r/831228 (https://phabricator.wikimedia.org/T309281) (owner: Majavah)
[10:14:16] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37225/console" [puppet] - https://gerrit.wikimedia.org/r/826536 (owner: Majavah)
[10:14:18] <wikibugs>	 Puppet, Cloud-VPS, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Remove prod-specific bits from cloud puppetmasters - https://phabricator.wikimedia.org/T309281 (taavi)
[10:14:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 10%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34481 and previous config saved to /var/cache/conftool/dbconfig/20220912-101423-root.json
[10:14:51] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: Keep envoy connection metrics [puppet] - https://gerrit.wikimedia.org/r/831492 (https://phabricator.wikimedia.org/T317430) (owner: JMeybohm)
[10:16:48] <wikibugs>	 (PS2) Jbond: puppetmaster: explicitely specifify hiera config [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[10:17:10] <wikibugs>	 (PS3) Jbond: puppetmaster: explicitely specifify hiera config [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[10:18:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34483 and previous config saved to /var/cache/conftool/dbconfig/20220912-101842-root.json
[10:19:31] <icinga-wm>	 PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1315 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:20:17] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:51] <icinga-wm>	 RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 551 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:22:45] <wikibugs>	 SRE, Data Engineering Planning, Data Pipelines, Shared-Data-Infrastructure: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (EChetty)
[10:22:47] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[10:23:25] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 37s)
[10:24:09] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:24:15] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:25:09] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 88, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:25:47] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, serviceops: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert) Thanks, I was able to complete the servers' powerdown through the management interface by using the asset tag FQDN. `wtp[1029-1033].eqiad.wmnet` n...
[10:25:50] <jynus>	 I don't see any maintenance related to that
[10:26:28] <wikibugs>	 (CR) Muehlenhoff: [C: +2] openldap: Include slapd-audit.log to backup [puppet] - https://gerrit.wikimedia.org/r/831490 (https://phabricator.wikimedia.org/T317516) (owner: Muehlenhoff)
[10:26:32] <jynus>	 ^ XioNoX: possibly a link eqiad-drmrs down?
[10:26:55] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1099.eqiad.wmnet
[10:26:56] <wikibugs>	 (PS4) Jbond: puppetmaster: explicitely specifify hiera config [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[10:26:58] <wikibugs>	 (PS1) Jbond: hiera: Add renamed labs hiera file so pcc works [puppet] - https://gerrit.wikimedia.org/r/831497
[10:27:00] <wikibugs>	 (CR) Muehlenhoff: [C: +2] openldap: Include slapd-audit.log to backup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831490 (https://phabricator.wikimedia.org/T317516) (owner: Muehlenhoff)
[10:27:10] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] hiera: Add renamed labs hiera file so pcc works [puppet] - https://gerrit.wikimedia.org/r/831497 (owner: Jbond)
[10:27:21] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:27:26] <XioNoX>	 jynus: maintenance on cr3-esams (cc topranks)
[10:27:31] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:27:36] <jynus>	 ah, ok, sorry
[10:27:38] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, serviceops: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert)
[10:28:06] <topranks>	 XioNoX: thanks
[10:28:39] <topranks>	 jynus: sry for the noise please ignore, done in about 20 mins
[10:28:51] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:28:54] <jynus>	 no issue, you logged it, it is my fault
[10:28:57] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:29:10] <jynus>	 it is just it is hard for me to notice it with so many messages
[10:29:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34484 and previous config saved to /var/cache/conftool/dbconfig/20220912-102928-root.json
[10:30:11] <wikibugs>	 (CR) Jbond: [C: -1] "lgtm but see comment about hiera key" [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[10:30:21] <wikibugs>	 (CR) Jbond: [C: -1] puppetmaster: explicitely specifify hiera config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[10:33:04] <wikibugs>	 (PS5) Majavah: puppetmaster: explicitely specifify hiera config [puppet] - https://gerrit.wikimedia.org/r/831230
[10:33:41] <wikibugs>	 (CR) Majavah: puppetmaster: explicitely specifify hiera config (3 comments) [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[10:33:48] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1099.eqiad.wmnet
[10:34:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1034', diff saved to https://phabricator.wikimedia.org/P34485 and previous config saved to /var/cache/conftool/dbconfig/20220912-103428-root.json
[10:35:58] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37229/console" [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[10:36:27] <wikibugs>	 (PS6) Jbond: puppetmaster: explicitely specifify hiera config [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[10:36:29] <wikibugs>	 (PS1) Majavah: puppetmaster: drop support for locale_servers [puppet] - https://gerrit.wikimedia.org/r/831500
[10:38:10] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37230/console" [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[10:38:32] <taavi>	 jbond: role::puppetmaster::standalone doesn't use profile::puppetmaster::common :/
[10:39:59] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] "lgtm thanks will merge" [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[10:40:52] <wikibugs>	 (CR) Hnowlan: [C: +2] Fix online tests [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830903 (owner: Hnowlan)
[10:40:58] <wikibugs>	 (CR) Majavah: puppetmaster: explicitely specifify hiera config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[10:41:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34486 and previous config saved to /var/cache/conftool/dbconfig/20220912-104120-root.json
[10:43:10] <wikibugs>	 (PS1) Jbond: prepare: drop old hiera file location [software/puppet-compiler] - https://gerrit.wikimedia.org/r/831502
[10:43:46] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[10:44:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 50%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34487 and previous config saved to /var/cache/conftool/dbconfig/20220912-104432-root.json
[10:44:55] <wikibugs>	 (PS2) Jbond: puppetmaster: drop support for locale_servers [puppet] - https://gerrit.wikimedia.org/r/831500 (owner: Majavah)
[10:45:04] <wikibugs>	 (CR) Jbond: [C: +2] "lgtm will merge" [puppet] - https://gerrit.wikimedia.org/r/831500 (owner: Majavah)
[10:47:13] <wikibugs>	 (PS1) Majavah: O:puppetmaster::standalone: fix hiera_config [puppet] - https://gerrit.wikimedia.org/r/831503
[10:47:15] <wikibugs>	 (CR) Hnowlan: "lgtm - we could also replace the math.floor calls with `//` if we wanted to but this is fine for now." [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) (owner: Vlad.shapik)
[10:47:22] <wikibugs>	 (CR) Hnowlan: [C: +1] Remove division operation hack related to Python2 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) (owner: Vlad.shapik)
[10:47:51] <wikibugs>	 (CR) CI reject: [V: -1] O:puppetmaster::standalone: fix hiera_config [puppet] - https://gerrit.wikimedia.org/r/831503 (owner: Majavah)
[10:49:12] <wikibugs>	 (PS1) Jbond: O:puppetmaster::standalone: add correct hiere config default [puppet] - https://gerrit.wikimedia.org/r/831504
[10:50:04] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] puppetmaster: explicitely specifify hiera config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[10:50:43] <wikibugs>	 (PS1) Cathal Mooney: Repool esams after cr3-esams core router upgrade. [dns] - https://gerrit.wikimedia.org/r/831506 (https://phabricator.wikimedia.org/T295690)
[10:50:44] <jbond>	 taavi: can you give ^^ a check 
[10:50:49] <jbond>	 https://gerrit.wikimedia.org/r/831230
[10:50:54] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1100.eqiad.wmnet
[10:51:08] <wikibugs>	 (CR) Majavah: [C: -1] "see also: https://gerrit.wikimedia.org/r/c/operations/puppet/+/831503/" [puppet] - https://gerrit.wikimedia.org/r/831504 (owner: Jbond)
[10:51:50] <wikibugs>	 (CR) Majavah: puppetmaster: explicitely specifify hiera config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[10:51:59] <wikibugs>	 (CR) Ayounsi: [C: +1] Repool esams after cr3-esams core router upgrade. [dns] - https://gerrit.wikimedia.org/r/831506 (https://phabricator.wikimedia.org/T295690) (owner: Cathal Mooney)
[10:53:36] <wikibugs>	 (Merged) jenkins-bot: Fix online tests [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830903 (owner: Hnowlan)
[10:54:33] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Repool esams after cr3-esams core router upgrade. [dns] - https://gerrit.wikimedia.org/r/831506 (https://phabricator.wikimedia.org/T295690) (owner: Cathal Mooney)
[10:55:10] <topranks>	 !log re-pooling esams after successful upgrade of core router cr3-esams T295690
[10:55:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:13] <stashbot>	 T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690
[10:55:48] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] "ill override the CI for now to get things working and send a follow up patch" [puppet] - https://gerrit.wikimedia.org/r/831503 (owner: Majavah)
[10:56:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34488 and previous config saved to /var/cache/conftool/dbconfig/20220912-105625-root.json
[10:58:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[10:58:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[10:58:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T314041)', diff saved to https://phabricator.wikimedia.org/P34489 and previous config saved to /var/cache/conftool/dbconfig/20220912-105841-ladsgroup.json
[10:58:45] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[10:59:30] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1100.eqiad.wmnet
[10:59:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 75%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34490 and previous config saved to /var/cache/conftool/dbconfig/20220912-105937-root.json
[10:59:52] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1101.eqiad.wmnet
[11:02:23] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:04:55] <moritzm>	 !log updated bullseye install image for 11.5 release T317416
[11:04:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:58] <stashbot>	 T317416: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416
[11:06:18] <wikibugs>	 (PS1) Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common [puppet] - https://gerrit.wikimedia.org/r/831507
[11:06:33] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] puppetmaster: explicitely specifify hiera config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah)
[11:08:21] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1101.eqiad.wmnet
[11:09:02] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1142.eqiad.wmnet
[11:09:31] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:19] <marostegui>	 jouncebot: next
[11:10:19] <jouncebot>	 In 1 hour(s) and 49 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220912T1300)
[11:10:43] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1142.eqiad.wmnet
[11:11:00] <wikibugs>	 (PS2) Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common [puppet] - https://gerrit.wikimedia.org/r/831507
[11:11:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34491 and previous config saved to /var/cache/conftool/dbconfig/20220912-111130-root.json
[11:11:36] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-etcd1001.eqiad.wmnet
[11:11:44] <wikibugs>	 (PS1) Marostegui: db-production.php: Disable writes in es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/831511 (https://phabricator.wikimedia.org/T317522)
[11:12:11] <wikibugs>	 (CR) Ladsgroup: [C: +1] db-production.php: Disable writes in es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/831511 (https://phabricator.wikimedia.org/T317522) (owner: Marostegui)
[11:12:21] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1143-1148].eqiad.wmnet
[11:13:05] <wikibugs>	 (CR) Marostegui: [C: +2] db-production.php: Disable writes in es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/831511 (https://phabricator.wikimedia.org/T317522) (owner: Marostegui)
[11:13:13] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T317522
[11:13:16] <stashbot>	 T317522: Switchover es4 master (es1021 -> es1020) - https://phabricator.wikimedia.org/T317522
[11:13:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T317522
[11:13:48] <wikibugs>	 (Merged) jenkins-bot: db-production.php: Disable writes in es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/831511 (https://phabricator.wikimedia.org/T317522) (owner: Marostegui)
[11:13:58] <wikibugs>	 (CR) Majavah: "seems to fail in pcc: https://puppet-compiler.wmflabs.org/pcc-worker1003/37233/" [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond)
[11:14:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1020 with weight 0 T317522', diff saved to https://phabricator.wikimedia.org/P34492 and previous config saved to /var/cache/conftool/dbconfig/20220912-111424-root.json
[11:14:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34493 and previous config saved to /var/cache/conftool/dbconfig/20220912-111442-root.json
[11:15:28] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-etcd1001.eqiad.wmnet
[11:16:22] <wikibugs>	 (PS1) Marostegui: mariadb: Promote es1020 to es4 master [puppet] - https://gerrit.wikimedia.org/r/831513 (https://phabricator.wikimedia.org/T317522)
[11:16:26] <wikibugs>	 (CR) JMeybohm: [C: +2] prometheus: Keep envoy connection metrics [puppet] - https://gerrit.wikimedia.org/r/831492 (https://phabricator.wikimedia.org/T317430) (owner: JMeybohm)
[11:16:31] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1143-1148].eqiad.wmnet
[11:17:19] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote es1020 to es4 master [puppet] - https://gerrit.wikimedia.org/r/831513 (https://phabricator.wikimedia.org/T317522) (owner: Marostegui)
[11:17:35] <jayme>	 taavi: ok to merge O:puppetmaster::standalone: fix hiera_config (9b5ff0a721)?
[11:18:13] <taavi>	 jayme: yes, thanks, cc jbond who merged the gerrit config
[11:18:15] <logmsgbot>	 !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Disable writes on es4 T317522 (duration: 04m 10s)
[11:18:18] <stashbot>	 T317522: Switchover es4 master (es1021 -> es1020) - https://phabricator.wikimedia.org/T317522
[11:18:22] <marostegui>	 jayme: are you still merging puppet? 
[11:18:35] <jayme>	 was waiting on the ok - merged
[11:18:39] <marostegui>	 ah ok
[11:18:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[11:19:29] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:20:01] <marostegui>	 !log Starting es4 eqiad failover from es1021 to es1020 - T317522
[11:20:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1020 to es4 primary T317522', diff saved to https://phabricator.wikimedia.org/P34494 and previous config saved to /var/cache/conftool/dbconfig/20220912-112039-root.json
[11:21:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[11:21:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[11:21:27] <wikibugs>	 (PS1) Marostegui: Revert "db-production.php: Disable writes in es4" [mediawiki-config] - https://gerrit.wikimedia.org/r/831224
[11:22:15] <wikibugs>	 (PS1) Marostegui: wmnet: Update es4-master CNAME [dns] - https://gerrit.wikimedia.org/r/831524 (https://phabricator.wikimedia.org/T317522)
[11:23:12] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1146.eqiad.wmnet
[11:23:15] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Update es4-master CNAME [dns] - https://gerrit.wikimedia.org/r/831524 (https://phabricator.wikimedia.org/T317522) (owner: Marostegui)
[11:23:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1021 T317522', diff saved to https://phabricator.wikimedia.org/P34495 and previous config saved to /var/cache/conftool/dbconfig/20220912-112343-root.json
[11:23:46] <stashbot>	 T317522: Switchover es4 master (es1021 -> es1020) - https://phabricator.wikimedia.org/T317522
[11:24:08] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db-production.php: Disable writes in es4" [mediawiki-config] - https://gerrit.wikimedia.org/r/831224 (owner: Marostegui)
[11:24:54] <wikibugs>	 (Merged) jenkins-bot: Revert "db-production.php: Disable writes in es4" [mediawiki-config] - https://gerrit.wikimedia.org/r/831224 (owner: Marostegui)
[11:25:14] <wikibugs>	 (Abandoned) Hashar: Boilerplate for automatic MediaWiki deployment [puppet] - https://gerrit.wikimedia.org/r/807972 (https://phabricator.wikimedia.org/T310395) (owner: Hashar)
[11:25:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[11:26:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34496 and previous config saved to /var/cache/conftool/dbconfig/20220912-112635-root.json
[11:26:37] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:27:42] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[11:27:51] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s)
[11:28:53] <logmsgbot>	 !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Enable writes on es4 T317522 (duration: 03m 36s)
[11:28:55] <stashbot>	 T317522: Switchover es4 master (es1021 -> es1020) - https://phabricator.wikimedia.org/T317522
[11:30:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[11:31:13] <wikibugs>	 (PS1) Vgutierrez: mtail::varnishsli: Consider req.body read|writer errors as good requests [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051)
[11:32:52] <wikibugs>	 (PS3) Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common [puppet] - https://gerrit.wikimedia.org/r/831507
[11:33:03] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:15] <wikibugs>	 (CR) Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common (2 comments) [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond)
[11:33:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[11:33:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[11:34:18] <wikibugs>	 (PS2) Vgutierrez: mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051)
[11:35:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[11:36:23] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:37] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:42] <wikibugs>	 (PS3) Vgutierrez: mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051)
[11:36:53] <wikibugs>	 (PS1) Marostegui: es1021: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/831529 (https://phabricator.wikimedia.org/T317522)
[11:37:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34497 and previous config saved to /var/cache/conftool/dbconfig/20220912-113702-ladsgroup.json
[11:37:05] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[11:37:46] <wikibugs>	 (CR) Marostegui: [C: +2] es1021: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/831529 (https://phabricator.wikimedia.org/T317522) (owner: Marostegui)
[11:38:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34498 and previous config saved to /var/cache/conftool/dbconfig/20220912-113808-root.json
[11:40:39] <wikibugs>	 (CR) CI reject: [V: -1] mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: Vgutierrez)
[11:41:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34499 and previous config saved to /var/cache/conftool/dbconfig/20220912-114140-root.json
[11:42:14] <wikibugs>	 (CR) Vgutierrez: "recheck" [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: Vgutierrez)
[11:43:33] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm, see one nit about whitespaces" [deployment-charts] - https://gerrit.wikimedia.org/r/826268 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[11:46:29] <wikibugs>	 (CR) Majavah: "This is failing PCC profile::puppetmaster::common::base_config is still missing. Looks like the standalone role duplicates the entire logi" [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond)
[11:48:57] <wikibugs>	 (PS4) Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common [puppet] - https://gerrit.wikimedia.org/r/831507
[11:50:07] <wikibugs>	 (PS4) Vgutierrez: mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051)
[11:52:02] <wikibugs>	 (PS5) Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common [puppet] - https://gerrit.wikimedia.org/r/831507
[11:52:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P34500 and previous config saved to /var/cache/conftool/dbconfig/20220912-115208-ladsgroup.json
[11:53:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34501 and previous config saved to /var/cache/conftool/dbconfig/20220912-115313-root.json
[11:54:06] <wikibugs>	 (CR) CI reject: [V: -1] mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: Vgutierrez)
[11:54:40] <wikibugs>	 (CR) Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common (2 comments) [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond)
[11:56:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34502 and previous config saved to /var/cache/conftool/dbconfig/20220912-115645-root.json
[11:59:14] <wikibugs>	 SRE, Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (TAndic) Thank you @jhathaway -- crossing fingers it works!
[11:59:24] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37238/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond)
[12:01:34] <wikibugs>	 (PS3) Majavah: puppetmaster: drop support for locale_servers [puppet] - https://gerrit.wikimedia.org/r/831500
[12:04:18] <wikibugs>	 (CR) Majavah: [V: +1 C: +1] "Seems to work fine on my tests." [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond)
[12:07:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P34503 and previous config saved to /var/cache/conftool/dbconfig/20220912-120715-ladsgroup.json
[12:07:58] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1146.eqiad.wmnet
[12:08:09] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1096.eqiad.wmnet
[12:08:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34504 and previous config saved to /var/cache/conftool/dbconfig/20220912-120818-root.json
[12:11:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34505 and previous config saved to /var/cache/conftool/dbconfig/20220912-121150-root.json
[12:12:50] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm. I've done a helm template before and after and differences are:" [deployment-charts] - https://gerrit.wikimedia.org/r/826269 (owner: JMeybohm)
[12:16:30] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1096.eqiad.wmnet
[12:18:17] <wikibugs>	 (PS1) Btullis: Add the locations of the new hadoop nodes [puppet] - https://gerrit.wikimedia.org/r/831532 (https://phabricator.wikimedia.org/T275767)
[12:18:49] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:22:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34506 and previous config saved to /var/cache/conftool/dbconfig/20220912-122221-ladsgroup.json
[12:22:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[12:22:25] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[12:22:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[12:22:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34507 and previous config saved to /var/cache/conftool/dbconfig/20220912-122242-ladsgroup.json
[12:23:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34508 and previous config saved to /var/cache/conftool/dbconfig/20220912-122323-root.json
[12:25:41] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:26:27] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1097.eqiad.wmnet
[12:26:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34509 and previous config saved to /var/cache/conftool/dbconfig/20220912-122654-root.json
[12:30:14] <wikibugs>	 (PS4) Hashar: jenkins: use upstream systemd definition [puppet] - https://gerrit.wikimedia.org/r/808900 (https://phabricator.wikimedia.org/T308637)
[12:30:16] <wikibugs>	 (PS1) Hashar: systemd: allow changing override filename [puppet] - https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637)
[12:33:11] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:34:53] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1097.eqiad.wmnet
[12:36:28] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:38:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34510 and previous config saved to /var/cache/conftool/dbconfig/20220912-123828-root.json
[12:40:43] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:41:43] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:46:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:49:01] <wikibugs>	 (CR) Hashar: "That is to be used by the child change https://gerrit.wikimedia.org/r/c/operations/puppet/+/808900/ . My aim is to replace our own systemd" [puppet] - https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637) (owner: Hashar)
[12:49:11] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:51:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:53:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34511 and previous config saved to /var/cache/conftool/dbconfig/20220912-125333-root.json
[12:54:40] <wikibugs>	 (CR) Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common (2 comments) [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond)
[12:54:45] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:51] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/831483 (owner: Volans)
[12:57:14] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/831484 (owner: Volans)
[12:58:20] <wikibugs>	 (CR) Jbond: [C: +2] prepare: drop old hiera file location [software/puppet-compiler] - https://gerrit.wikimedia.org/r/831502 (owner: Jbond)
[12:58:59] <wikibugs>	 (Abandoned) Jbond: O:puppetmaster::standalone: add correct hiere config default [puppet] - https://gerrit.wikimedia.org/r/831504 (owner: Jbond)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220912T1300).
[13:00:05] <jouncebot>	 koi and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <wikibugs>	 (CR) Majavah: [V: +1 C: +1] O:puppetmaster::standalone: move to useing P:puppetmaster::common (2 comments) [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond)
[13:00:11] <koi>	 o/
[13:00:20] <Lucas_WMDE>	 o/
[13:00:23] <Lucas_WMDE>	 I can deploy!
[13:02:45] <wikibugs>	 (PS4) Lucas Werkmeister (WMDE): Revert "kowiki: Change logo for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831211 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[13:02:56] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Revert "kowiki: Change logo for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831211 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[13:03:47] <wikibugs>	 (Merged) jenkins-bot: Revert "kowiki: Change logo for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831211 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[13:04:39] <Lucas_WMDE>	 koi: I’ve pulled the first change to mwdebug1001, can you test it?
[13:05:00] <koi>	 looking
[13:05:14] <Lucas_WMDE>	 (looks good on my end, I think)
[13:05:37] <koi>	 yeah, also looks good from my side
[13:05:40] <Lucas_WMDE>	 ok!
[13:05:53] <Lucas_WMDE>	 the files can probably be synced in any order
[13:06:12] <Lucas_WMDE>	 I think I’ll do yaml, logos.php, then IS.php
[13:06:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:06:41] <Lucas_WMDE>	 syncing
[13:07:25] <wikibugs>	 (CR) Volans: [C: +2] doc: add TOX_SKIP_ENV example for development [software/spicerack] - https://gerrit.wikimedia.org/r/831483 (owner: Volans)
[13:08:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34512 and previous config saved to /var/cache/conftool/dbconfig/20220912-130838-root.json
[13:09:11] <wikibugs>	 SRE, SRE-swift-storage, Data Engineering Planning, Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (bking) Update: we're [[ https://thanos.wikimedia.org/graph?g0.deduplicate=1&g0.expr=swift_account_stats_byt...
[13:09:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:09:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:09:31] <wikibugs>	 SRE-swift-storage: swift_ring_manager should be able to rebalance rings without making other changes - https://phabricator.wikimedia.org/T317409 (MatthewVernon) Open→Resolved Fix with https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/6
[13:09:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maint
[13:09:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maint
[13:10:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:10:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:831211|Revert "kowiki: Change logo for 600k articles" (T315127)]] (1/3) (duration: 03m 53s)
[13:10:34] <stashbot>	 T315127: Requesting temporary logo change for ko.wikipedia.org - https://phabricator.wikimedia.org/T315127
[13:12:49] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[13:12:59] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s)
[13:13:59] <wikibugs>	 (Merged) jenkins-bot: doc: add TOX_SKIP_ENV example for development [software/spicerack] - https://gerrit.wikimedia.org/r/831483 (owner: Volans)
[13:14:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:831211|Revert "kowiki: Change logo for 600k articles" (T315127)]] (2/3) (duration: 03m 39s)
[13:14:55] <Lucas_WMDE>	 are there any known issues with the mwdebug logstash dashboard?
[13:15:10] <Lucas_WMDE>	 it looks empty for me, and usually there’s at least a few messages there during a backport window, e.g. from scap pull IIRC
[13:17:06] <wikibugs>	 (PS5) Lucas Werkmeister (WMDE): Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831212 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[13:17:44] <Lucas_WMDE>	 I’m also wondering whether to purge the kowiki-600k files from the HTTP cache after the deployment is done, or not
[13:18:12] <Lucas_WMDE>	 I feel like that would be a good idea – if anything still accesses those files, we want that to be a noticeable error now, not a total mystery a year later when the cache finally expires
[13:18:20] <Lucas_WMDE>	 koi: any thoughts on that? :)
[13:18:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:831211|Revert "kowiki: Change logo for 600k articles" (T315127)]] (3/3) (duration: 03m 53s)
[13:18:34] <stashbot>	 T315127: Requesting temporary logo change for ko.wikipedia.org - https://phabricator.wikimedia.org/T315127
[13:18:50] <koi>	 IIRC someone said purging files is only needed if you rename a file
[13:18:51] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831212 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[13:19:34] <wikibugs>	 (Merged) jenkins-bot: Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831212 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[13:19:44] <koi>	 but that idea makes sense at least
[13:19:47] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:20:12] <wikibugs>	 (PS3) Clément Goubert: sre: add paging alert for etcdmirror down [alerts] - https://gerrit.wikimedia.org/r/831103 (https://phabricator.wikimedia.org/T317402) (owner: Giuseppe Lavagetto)
[13:20:29] <Lucas_WMDE>	 koi: the second change is on mwdebug1001, anything to test?
[13:20:36] <Lucas_WMDE>	 and yeah, I don’t think it’s exactly necessary, just an extra cleanup
[13:20:58] <koi>	 looking
[13:20:59] <wikibugs>	 (PS1) Jbond: C:jenkins: remove migrate file [puppet] - https://gerrit.wikimedia.org/r/831541
[13:21:21] <Lucas_WMDE>	 https://en.wikipedia.org/static/images/project-logos/kowiki-600k-2x.png is a 404 on mwdebug1001, so that looks good
[13:21:46] <koi>	 I got a "Page not found" notice for /static/images/project-logos/kowiki-600k.png , so LGTM
[13:21:52] <Lucas_WMDE>	 ok, syncing
[13:23:07] <wikibugs>	 (PS3) Volans: Improve documentation of the CookbookBase classes usage [software/spicerack] - https://gerrit.wikimedia.org/r/830866 (owner: Giuseppe Lavagetto)
[13:23:19] <wikibugs>	 (PS2) Volans: sre.hosts.decommission: test IPMI connection [cookbooks] - https://gerrit.wikimedia.org/r/831484
[13:23:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34513 and previous config saved to /var/cache/conftool/dbconfig/20220912-132343-root.json
[13:24:57] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:25:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:26:23] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/project-logos/: Config: [[gerrit:831212|Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" (T315127)]] (1/2; deleted files require syncing whole directory) (duration: 03m 50s)
[13:26:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:26:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:26:26] <stashbot>	 T315127: Requesting temporary logo change for ko.wikipedia.org - https://phabricator.wikimedia.org/T315127
[13:26:41] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:27:12] <Lucas_WMDE>	 logstash doesn’t seem to have any messages for host:mwdebug*
[13:28:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T312863)', diff saved to https://phabricator.wikimedia.org/P34514 and previous config saved to /var/cache/conftool/dbconfig/20220912-132846-ladsgroup.json
[13:28:50] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[13:28:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:29:25] <wikibugs>	 (PS1) Gergő Tisza: Add growthexperiments_user_impact to $private_tables [puppet] - https://gerrit.wikimedia.org/r/831542 (https://phabricator.wikimedia.org/T317534)
[13:29:29] <wikibugs>	 (CR) Clément Goubert: [C: +2] sre: add paging alert for etcdmirror down [alerts] - https://gerrit.wikimedia.org/r/831103 (https://phabricator.wikimedia.org/T317402) (owner: Giuseppe Lavagetto)
[13:30:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/mobile/copyright/: Config: [[gerrit:831212|Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" (T315127)]] (2/2; deleted file requires syncing whole directory) (duration: 03m 44s)
[13:31:36] <wikibugs>	 (CR) Volans: [C: +2] sre.hosts.decommission: test IPMI connection [cookbooks] - https://gerrit.wikimedia.org/r/831484 (owner: Volans)
[13:31:43] <wikibugs>	 (CR) Volans: [C: +2] Improve documentation of the CookbookBase classes usage [software/spicerack] - https://gerrit.wikimedia.org/r/830866 (owner: Giuseppe Lavagetto)
[13:32:03] <wikibugs>	 (Merged) jenkins-bot: sre: add paging alert for etcdmirror down [alerts] - https://gerrit.wikimedia.org/r/831103 (https://phabricator.wikimedia.org/T317402) (owner: Giuseppe Lavagetto)
[13:33:19] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/%s\n' {mobile/copyright/wikipedia-ko-600k.svg,project-logos/kowiki-600k{,-1.5x,-2x}.png} | mwscript purgeList.php # T315127
[13:33:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:22] <stashbot>	 T315127: Requesting temporary logo change for ko.wikipedia.org - https://phabricator.wikimedia.org/T315127
[13:34:10] <Lucas_WMDE>	 alright, now I’ll test if T317520 would affect production as well if the train rolls forward
[13:34:10] <stashbot>	 T317520:  Score: Call to a member function getExpensiveParserFunctionLimit() on null - https://phabricator.wikimedia.org/T317520
[13:35:20] <wikibugs>	 (Abandoned) Jbond: C:jenkins: remove migrate file [puppet] - https://gerrit.wikimedia.org/r/831541 (owner: Jbond)
[13:35:22] <Lucas_WMDE>	 !log manually applying [[gerrit:830691]] on mwdebug1001 to test if T317520 affects production (expected to cause getExpensiveParserFunctionLimit-related logstash errors)
[13:35:22] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.decommission: test IPMI connection [cookbooks] - https://gerrit.wikimedia.org/r/831484 (owner: Volans)
[13:35:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:56] <wikibugs>	 (CR) Jbond: [C: -1] systemd: allow changing override filename (3 comments) [puppet] - https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637) (owner: Hashar)
[13:36:08] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/808900 (https://phabricator.wikimedia.org/T308637) (owner: Hashar)
[13:38:43] <wikibugs>	 (Merged) jenkins-bot: Improve documentation of the CookbookBase classes usage [software/spicerack] - https://gerrit.wikimedia.org/r/830866 (owner: Giuseppe Lavagetto)
[13:38:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34515 and previous config saved to /var/cache/conftool/dbconfig/20220912-133848-root.json
[13:39:04] <Lucas_WMDE>	 yup, there’s an internal error
[13:39:20] <Lucas_WMDE>	 aha, and it’s in logstash as well
[13:39:36] <Lucas_WMDE>	 so host:mwdebug* messages still make it to logstash – I suppose scap pull just doesn’t produce any logs anymore?
[13:39:47] <Lucas_WMDE>	 but it’s the same error, so this is indeed a train blocker
[13:40:16] <Lucas_WMDE>	 !log scap pull on mwdebug1001 to restore good code (confirmed that T317520 affects production)
[13:40:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:19] <stashbot>	 T317520:  Score: Call to a member function getExpensiveParserFunctionLimit() on null - https://phabricator.wikimedia.org/T317520
[13:41:20] <wikibugs>	 (CR) JMeybohm: [C: +1] sre: check apache_up too as part of AppserversUnreachable [alerts] - https://gerrit.wikimedia.org/r/830582 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi)
[13:41:51] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:43:43] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:43:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P34516 and previous config saved to /var/cache/conftool/dbconfig/20220912-134353-ladsgroup.json
[13:49:51] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[13:50:00] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s)
[13:50:12] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney) Upgrade of cr3-esams went well earlier.  Firmware upgrade works as per docs.  I will put up more info on that later for our own reference.
[13:51:08] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney)
[13:51:23] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney)
[13:53:07] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp[1028-1030]
[13:57:18] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1098.eqiad.wmnet
[13:58:51] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:58:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P34517 and previous config saved to /var/cache/conftool/dbconfig/20220912-135859-ladsgroup.json
[14:01:14] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[14:01:52] <wikibugs>	 (PS9) JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663)
[14:02:24] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:02:25] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wtp[1028-1030]
[14:02:25] <wikibugs>	 (PS5) Vgutierrez: mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051)
[14:02:30] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, serviceops: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: `wtp[1028-1030]` - wtp1028 (**FAIL**)   - //No DNS record found for th...
[14:05:45] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1098.eqiad.wmnet
[14:06:11] <wikibugs>	 SRE-OnFire, serviceops, Wikimedia-Incident: Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535 (Clement_Goubert)
[14:06:29] <wikibugs>	 SRE-OnFire, serviceops, Wikimedia-Incident: Update Etcd/Main cluster#Replication documentation with safe restart conditions and information - https://phabricator.wikimedia.org/T317537 (Clement_Goubert)
[14:07:31] <wikibugs>	 (PS6) Vgutierrez: mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051)
[14:14:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T312863)', diff saved to https://phabricator.wikimedia.org/P34518 and previous config saved to /var/cache/conftool/dbconfig/20220912-141405-ladsgroup.json
[14:14:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[14:14:09] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[14:14:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[14:14:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T312863)', diff saved to https://phabricator.wikimedia.org/P34519 and previous config saved to /var/cache/conftool/dbconfig/20220912-141427-ladsgroup.json
[14:18:08] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] sre: check apache_up too as part of AppserversUnreachable [alerts] - https://gerrit.wikimedia.org/r/830582 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi)
[14:18:13] <wikibugs>	 (PS2) Filippo Giunchedi: sre: check apache_up too as part of AppserversUnreachable [alerts] - https://gerrit.wikimedia.org/r/830582 (https://phabricator.wikimedia.org/T314118)
[14:43:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T314041)', diff saved to https://phabricator.wikimedia.org/P34520 and previous config saved to /var/cache/conftool/dbconfig/20220912-144339-ladsgroup.json
[14:43:43] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[14:43:46] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[14:46:20] <jynus>	 should we worry about wikidata?
[14:48:11] <wikibugs>	 (PS1) Elukey: Move kafka on kafka-logging2002 to a PKI-based TLS cert [puppet] - https://gerrit.wikimedia.org/r/831588 (https://phabricator.wikimedia.org/T300130)
[14:48:56] <wikibugs>	 (PS5) Vlad.shapik: Remove division operation hack related to Python2 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393)
[14:50:13] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:57:11] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P34521 and previous config saved to /var/cache/conftool/dbconfig/20220912-145845-ladsgroup.json
[14:58:57] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Move kafka on kafka-logging2002 to a PKI-based TLS cert [puppet] - https://gerrit.wikimedia.org/r/831588 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[15:02:25] <wikibugs>	 SRE, Data-Persistence (Consultation), MediaWiki-extensions-Phonos, serviceops, Community-Tech (CommTech-Sprint-33): SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (JMcLeod_WMF)
[15:04:16] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (Jgreen)
[15:13:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P34522 and previous config saved to /var/cache/conftool/dbconfig/20220912-151352-ladsgroup.json
[15:15:49] <icinga-wm>	 PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:16:07] <icinga-wm>	 PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:17:56] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.18.0" for 561 hosts
[15:18:13] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.18.0" completed for 561 hosts
[15:26:01] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293 (colewhite) @MoritzMuehlenhoff @jbond Facter does not appear to be detecting the raid on some hosts.  Not sure how widespread the issue is.   current fact (direct c...
[15:26:38] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests, Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (bking) @pfischer After I asked for your public key, it looks like someone updated the original request with the key. Thus...
[15:28:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T314041)', diff saved to https://phabricator.wikimedia.org/P34523 and previous config saved to /var/cache/conftool/dbconfig/20220912-152858-ladsgroup.json
[15:29:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[15:29:03] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[15:29:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[15:29:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2118 (T314041)', diff saved to https://phabricator.wikimedia.org/P34524 and previous config saved to /var/cache/conftool/dbconfig/20220912-152920-ladsgroup.json
[15:30:05] <jouncebot>	 jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220912T1530).
[15:32:41] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:36:17] <wikibugs>	 (PS2) Cwhite: smart: restore get_fact and deprecate get_raid_drivers [puppet] - https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293)
[15:39:00] <wikibugs>	 (CR) Cwhite: smart: restore get_fact and deprecate get_raid_drivers (4 comments) [puppet] - https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) (owner: Cwhite)
[15:39:39] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:39:58] <wikibugs>	 SRE, ops-codfw, Observability-Logging: Degraded RAID on logstash2027 - https://phabricator.wikimedia.org/T316996 (Papaul) Open→Resolved @colewhite disk replaced
[15:43:46] <wikibugs>	 (PS10) JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663)
[15:44:15] <wikibugs>	 (CR) JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm)
[15:46:44] <wikibugs>	 SRE-OnFire, Discovery-Search, Sustainability (Incident Followup): Better test environments for Elastic - https://phabricator.wikimedia.org/T317420 (Gehel) Open→Invalid This is too broad as it is. We'll revisit this if we have a better defined need.
[15:46:56] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, one final nit you couldn't foresee inline" [cookbooks] - https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm)
[15:49:44] <wikibugs>	 (CR) Herron: [C: +1] Move kafka on kafka-logging2002 to a PKI-based TLS cert [puppet] - https://gerrit.wikimedia.org/r/831588 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[15:51:59] <wikibugs>	 (CR) Muehlenhoff: "This is expected, see the sysusers.d manpage: https://manpages.debian.org/unstable/systemd/sysusers.d.5.en.html" [puppet] - https://gerrit.wikimedia.org/r/830284 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[15:54:01] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[15:54:10] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s)
[15:55:35] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) (owner: Cwhite)
[15:55:54] <wikibugs>	 (PS15) Ayounsi: sre.network.peering: initial commit [cookbooks] - https://gerrit.wikimedia.org/r/816730
[16:00:29] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/816730 (owner: Ayounsi)
[16:02:33] <wikibugs>	 (PS11) JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663)
[16:02:49] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:11] <wikibugs>	 (CR) JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm)
[16:05:00] <wikibugs>	 SRE, Infrastructure-Foundations: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725 (CDanis) As a note, such sites also include "everything on WMCS / toolserver" and it would probably be good to extend NEL to that as well.
[16:09:47] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:10:41] <icinga-wm>	 PROBLEM - Check systemd state on logstash2027 is CRITICAL: CRITICAL - degraded: The following units failed: srv.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:11:44] <wikibugs>	 (PS6) BCornwall: varnish/tests: Remove extraneous test checks [puppet] - https://gerrit.wikimedia.org/r/826367
[16:12:03] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests, Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (Dzahn) It seems this comment was about T316090
[16:12:16] <wikibugs>	 (CR) BCornwall: varnish/tests: Remove extraneous test checks (3 comments) [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall)
[16:13:18] <wikibugs>	 SRE, SRE-Access-Requests, Data Engineering Planning, Discovery-Search (Current work), Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Dzahn) @pfischer Hi, please also see this comment over here: T316922#8229340 . If you could try to ssh into an...
[16:13:59] <wikibugs>	 (PS1) Ebernhardson: Re-enable track_total_hits for elastic7 [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/831548 (https://phabricator.wikimedia.org/T317374)
[16:15:29] <wikibugs>	 (PS1) Ebernhardson: Set track_total_hits to true [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/831549
[16:17:23] <icinga-wm>	 RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:24:29] <wikibugs>	 (CR) RLazarus: "Hmm, this is a really interesting case! If I understand right, we're talking about situations where there was e.g. a network failure somew" [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: Vgutierrez)
[16:28:18] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests, Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (Dzahn) Hi @pfischer You are in the requested wmf LDAP group and the WMF-NDA group in Phabricator meanwhile.  If you could...
[16:32:06] <wikibugs>	 SRE-Access-Requests, Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (Milimetric)
[16:33:05] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:39:12] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm)
[16:40:03] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:43:21] <icinga-wm>	 RECOVERY - Check systemd state on logstash2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:46:06] <wikibugs>	 (CR) Vgutierrez: varnish/tests: Remove extraneous test checks (2 comments) [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall)
[16:54:29] <wikibugs>	 (CR) Vgutierrez: "1" [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: Vgutierrez)
[16:57:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34527 and previous config saved to /var/cache/conftool/dbconfig/20220912-165720-ladsgroup.json
[16:57:24] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[17:00:04] <jouncebot>	 ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220912T1700).
[17:03:14] <wikibugs>	 (PS6) Vlad.shapik: Remove division operation hack related to Python2 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393)
[17:03:21] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:14] <wikibugs>	 (CR) RLazarus: [C: +1] mtail::varnishsli: Consider req.body read|write errors as good requests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: Vgutierrez)
[17:07:37] <wikibugs>	 (CR) Vlad.shapik: Remove division operation hack related to Python2 (1 comment) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) (owner: Vlad.shapik)
[17:07:41] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash2027 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, active_primary_shards: 562, active_shards: 1281, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:08:53] <cwhite>	 !log rebuilt raid on logstash2027 T316996
[17:08:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:56] <stashbot>	 T316996: Degraded RAID on logstash2027 - https://phabricator.wikimedia.org/T316996
[17:10:21] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:12:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P34528 and previous config saved to /var/cache/conftool/dbconfig/20220912-171227-ladsgroup.json
[17:14:00] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (odimitrijevic) Approved
[17:21:00] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[17:21:09] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s)
[17:27:01] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (Jgreen)
[17:27:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P34529 and previous config saved to /var/cache/conftool/dbconfig/20220912-172733-ladsgroup.json
[17:30:57] <wikibugs>	 (CR) Cwhite: [C: +1] Move kafka on kafka-logging2002 to a PKI-based TLS cert [puppet] - https://gerrit.wikimedia.org/r/831588 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[17:37:46] <wikibugs>	 (PS6) Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261)
[17:39:05] <icinga-wm>	 RECOVERY - MD RAID on logstash2027 is OK: OK: Active: 24, Working: 24, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[17:42:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34531 and previous config saved to /var/cache/conftool/dbconfig/20220912-174239-ladsgroup.json
[17:42:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[17:42:43] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[17:42:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[17:43:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T314041)', diff saved to https://phabricator.wikimedia.org/P34532 and previous config saved to /var/cache/conftool/dbconfig/20220912-174301-ladsgroup.json
[17:57:51] <inflatador>	 !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.116`. Pre-deploy tests passing on canary `wdqs1003`
[17:57:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:57] <inflatador>	 !log [WDQS Deploy] Tests passing following deploy of `0.3.116` on canary `wdqs1003`; proceeding to rest of fleet
[18:01:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:46] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@e012d14]: 0.3.116
[18:05:37] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:08:23] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@e012d14]: 0.3.116 (duration: 05m 37s)
[18:10:34] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall) p:Triage→Medium a:BCornwall
[18:12:35] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:13:30] <wikibugs>	 SRE, Traffic, Patch-For-Review: Varnish SLI is impacted by external components performance|behavior - https://phabricator.wikimedia.org/T317051 (Vgutierrez)
[18:13:45] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.16.0" for 561 hosts
[18:14:02] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.16.0" completed for 561 hosts
[18:14:34] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@e012d14]: 0.3.116
[18:14:53] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on labstore1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:17:35] <wikibugs>	 (CR) BCornwall: [C: +2] varnish/tests: Remove extraneous test checks (2 comments) [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall)
[18:19:23] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:22:06] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@e012d14]: 0.3.116 (duration: 07m 31s)
[18:24:10] <wikibugs>	 (CR) Vgutierrez: [C: +1] varnish/tests: Remove extraneous test checks (2 comments) [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall)
[18:24:28] <wikibugs>	 (CR) Dduvall: "Just a friendly ping. Should I refactor `SETENV` to some alternative or is this good to go?" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[18:26:23] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:26:24] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (cmooney) @Jgreen I believe I've done what's required now (not all that familiar with this workflow however).  Both ports that are labelled for frdata100...
[18:37:35] <inflatador>	 !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[18:37:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:42] <inflatador>	 !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[18:37:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:50] <inflatador>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[18:37:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:48] <wikibugs>	 (CR) Dzahn: [C: +1] gitlab_runner: allow Trusted Runners to access wikimedia docker-registry [puppet] - https://gerrit.wikimedia.org/r/831481 (https://phabricator.wikimedia.org/T308271) (owner: Jelto)
[18:42:55] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (Jgreen) @cmooney Both interfaces show no-carrier, can you confirm that the switch ports are enabled?
[18:43:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T314041)', diff saved to https://phabricator.wikimedia.org/P34535 and previous config saved to /var/cache/conftool/dbconfig/20220912-184317-ladsgroup.json
[18:43:21] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[18:43:46] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[18:48:10] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@e012d14] (wcqs): Deploy 0.3.116 to WCQS
[18:49:13] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:54:33] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:56:11] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@e012d14] (wcqs): Deploy 0.3.116 to WCQS (duration: 08m 01s)
[18:56:11] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:58:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P34536 and previous config saved to /var/cache/conftool/dbconfig/20220912-185823-ladsgroup.json
[18:58:33] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:00:24] <inflatador>	 !log [WCQS Deploy] Test query passed on commons-query.wikimedia.org; WCQS deploy complete
[19:00:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:56] <ryankemper>	 !log [WCQS] Depooled `wcqs100[1,2]` while they catch up on ~1.5 days worth of lag (https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wcqs&viewPanel=8&from=1662910789183&to=1663068616559)
[19:04:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:30] <inflatador>	 !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there are no relevant criticals in Icinga, and Grafana looks good
[19:08:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118 (T314041)', diff saved to https://phabricator.wikimedia.org/P34537 and previous config saved to /var/cache/conftool/dbconfig/20220912-191000-ladsgroup.json
[19:10:03] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[19:10:22] <wikibugs>	 (PS3) DDesouza: Deploy Research Incentive Survey to idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466)
[19:12:09] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.18.0" for 561 hosts
[19:12:27] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.18.0" completed for 561 hosts
[19:13:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P34538 and previous config saved to /var/cache/conftool/dbconfig/20220912-191330-ladsgroup.json
[19:14:39] <wikibugs>	 (PS4) Cwhite: logstash: add support for rsyslog-namespaced fields [puppet] - https://gerrit.wikimedia.org/r/824314 (https://phabricator.wikimedia.org/T315500)
[19:15:33] <sbassett>	 jouncebot: now
[19:15:33] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 44 minute(s)
[19:17:27] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on labstore1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:19:40] <wikibugs>	 SRE, Data Engineering Planning, serviceops, Event-Platform Value Stream (Sprint 01), Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (gmodena) Hi - what is the status of the linked CR?  >>! In T303543#7768019, @gerritbot wrote: > Chang...
[19:20:03] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.19.0" for 561 hosts
[19:20:20] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.19.0" completed for 561 hosts
[19:24:24] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: add support for rsyslog-namespaced fields [puppet] - https://gerrit.wikimedia.org/r/824314 (https://phabricator.wikimedia.org/T315500) (owner: Cwhite)
[19:25:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118', diff saved to https://phabricator.wikimedia.org/P34539 and previous config saved to /var/cache/conftool/dbconfig/20220912-192506-ladsgroup.json
[19:26:17] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@e012d14]: 0.3.116
[19:28:22] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@e012d14]: 0.3.116 (duration: 02m 04s)
[19:28:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T314041)', diff saved to https://phabricator.wikimedia.org/P34540 and previous config saved to /var/cache/conftool/dbconfig/20220912-192837-ladsgroup.json
[19:28:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[19:28:40] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[19:28:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[19:28:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T314041)', diff saved to https://phabricator.wikimedia.org/P34541 and previous config saved to /var/cache/conftool/dbconfig/20220912-192858-ladsgroup.json
[19:31:10] <sbassett>	 Hey all - mstyles and I would like to try to deploy a couple of security patches right now, if there are no objections.
[19:39:55] <icinga-wm>	 PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:40:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118', diff saved to https://phabricator.wikimedia.org/P34542 and previous config saved to /var/cache/conftool/dbconfig/20220912-194013-ladsgroup.json
[19:48:45] <wikibugs>	 (PS7) BCornwall: varnish/tests: Remove extraneous test checks [puppet] - https://gerrit.wikimedia.org/r/826367
[19:48:57] <wikibugs>	 (PS2) Jdlrobson: Enable Nearby on Hebrew and French Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/831117 (https://phabricator.wikimedia.org/T246493)
[19:50:46] <wikibugs>	 (CR) BCornwall: "I've updated the patch set to include a little more formatting and an explicit change to bash since we're using bashisms now in the script" [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall)
[19:53:24] <sbassett>	 !log Deployed security patch for T311337
[19:53:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118 (T314041)', diff saved to https://phabricator.wikimedia.org/P34543 and previous config saved to /var/cache/conftool/dbconfig/20220912-195519-ladsgroup.json
[19:55:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[19:55:23] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[19:55:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[19:55:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T314041)', diff saved to https://phabricator.wikimedia.org/P34544 and previous config saved to /var/cache/conftool/dbconfig/20220912-195540-ladsgroup.json
[19:56:22] <wikibugs>	 (PS8) BCornwall: varnish/tests: Remove extraneous test checks [puppet] - https://gerrit.wikimedia.org/r/826367
[19:58:22] <logmsgbot>	 !log mstyles@deploy1002 Synchronized php-1.39.0-wmf.28/extensions/PageTriage/includes/Api/ApiPageTriageAction.php: (no justification provided) (duration: 03m 42s)
[19:59:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.decommission for hosts theemin.codfw.wmnet
[19:59:39] <maryum>	 !log deployed security patch for T314245
[19:59:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220912T2000).
[20:00:05] <jouncebot>	 ebernhardson, zabe, Aishik, danisztls, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:11] <zabe>	 o/
[20:00:16] <danisztls>	 o/
[20:00:59] <maryum>	 o/
[20:01:17] <TheresNoTime>	 Evening all, I can deploy :)
[20:01:34] <sbassett>	 (end security patch deployments - both of which seem to have gone out ok!)
[20:02:12] <ebernhardson>	 \o
[20:02:20] <TheresNoTime>	 ah good hi ebernhardson, you're up first :)
[20:03:09] <TheresNoTime>	 Going to start with 831548
[20:03:47] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/831548 (https://phabricator.wikimedia.org/T317374) (owner: Ebernhardson)
[20:04:32] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[20:05:55] <wikibugs>	 (PS2) Samtar: Mark spcomwiki and searchcomwiki as closed [mediawiki-config] - https://gerrit.wikimedia.org/r/831167 (https://phabricator.wikimedia.org/T285685) (owner: Zabe)
[20:06:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:06:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts theemin.codfw.wmnet
[20:06:57] <wikibugs>	 SRE, ops-codfw, DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by pt1979@cumin2002 for hosts: `theemin.codfw.wmnet` - theemin.codfw.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanage...
[20:07:19] <TheresNoTime>	 ebernhardson: zabe: I'm going to get 831167 deployed while ^ merges
[20:07:22] <logmsgbot>	 !log samtar@deploy1002 backport aborted:  (duration: 03m 46s)
[20:07:23] <Jdlrobson>	 (I'm lurking)
[20:07:55] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/831167 (https://phabricator.wikimedia.org/T285685) (owner: Zabe)
[20:08:38] <wikibugs>	 (Merged) jenkins-bot: Mark spcomwiki and searchcomwiki as closed [mediawiki-config] - https://gerrit.wikimedia.org/r/831167 (https://phabricator.wikimedia.org/T285685) (owner: Zabe)
[20:08:54] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:831167|Mark spcomwiki and searchcomwiki as closed (T285685)]]
[20:08:57] <stashbot>	 T285685: Mark searchcom and spcom wikis as closed on Special:SiteMatrix - https://phabricator.wikimedia.org/T285685
[20:09:13] <wikibugs>	 SRE, ops-codfw, DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (Papaul)
[20:09:16] <logmsgbot>	 !log samtar@deploy1002 samtar and zabe: Backport for [[gerrit:831167|Mark spcomwiki and searchcomwiki as closed (T285685)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:09:44] <TheresNoTime>	 zabe: can you test on mwdebug1001?
[20:09:57] <zabe>	 lemme see
[20:10:24] <zabe>	 TheresNoTime, lgtm, listed as closed now
[20:10:32] <TheresNoTime>	 syncing :)
[20:10:34] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to wmf, analytics-privatedata-users for TTaylor - https://phabricator.wikimedia.org/T292299 (BCornwall) Open→Resolved It looks like this ticket has been resolved. I'm going to close it but please do re-open if there is any unfinished business.  Thank you!
[20:11:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[20:11:52] <wikibugs>	 (PS4) Samtar: Deploy Research Incentive Survey to idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466) (owner: DDesouza)
[20:12:46] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Operations: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (BCornwall) Stalled→Resolved I'm going to mark this as resolved since no verification has occurred. If there's any unfin...
[20:12:55] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.ganeti.makevm for new host dispatch-be1001.eqiad.wmnet
[20:12:56] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.dns.netbox
[20:12:57] <TheresNoTime>	 Hi Aishik :) you're up next if you're available?
[20:13:23] <wikibugs>	 (PS5) Samtar: Create six more namespaces (three content namespaces and their corresponding three discussion namespaces) on the bn.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/830982 (https://phabricator.wikimedia.org/T317424) (owner: Aishik Rehman)
[20:13:24] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:13:36] <Aishik>	 I am here!
[20:13:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:14:34] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:831167|Mark spcomwiki and searchcomwiki as closed (T285685)]] (duration: 05m 40s)
[20:14:37] <stashbot>	 T285685: Mark searchcom and spcom wikis as closed on Special:SiteMatrix - https://phabricator.wikimedia.org/T285685
[20:14:57] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:14:57] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.dns.wipe-cache dispatch-be1001.eqiad.wmnet on all recursors
[20:15:01] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dispatch-be1001.eqiad.wmnet on all recursors
[20:15:14] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/830982 (https://phabricator.wikimedia.org/T317424) (owner: Aishik Rehman)
[20:16:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T312863)', diff saved to https://phabricator.wikimedia.org/P34545 and previous config saved to /var/cache/conftool/dbconfig/20220912-201604-ladsgroup.json
[20:16:07] <wikibugs>	 (Merged) jenkins-bot: Create six more namespaces (three content namespaces and their corresponding three discussion namespaces) on the bn.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/830982 (https://phabricator.wikimedia.org/T317424) (owner: Aishik Rehman)
[20:16:07] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[20:16:21] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:830982|Create six more namespaces (three content namespaces and their corresponding three discussion namespaces) on the bn.wiktionary (T317424)]]
[20:16:24] <stashbot>	 T317424: Create six more namespaces on the Bengali Wiktionary - https://phabricator.wikimedia.org/T317424
[20:16:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:16:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:16:34] <TheresNoTime>	 Aishik: Can you test this on mwdebug1001?
[20:16:41] <logmsgbot>	 !log samtar@deploy1002 samtar and aishik: Backport for [[gerrit:830982|Create six more namespaces (three content namespaces and their corresponding three discussion namespaces) on the bn.wiktionary (T317424)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[20:17:17] <wikibugs>	 SRE, ops-codfw, DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (Papaul)  The only thing left on this task is to unrack the server and remove all the disks.
[20:17:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:18:29] <zabe>	 Aishik, do you know what mwdebug1001 means?
[20:18:41] <wikibugs>	 SRE, SRE-swift-storage, Data Engineering Planning, Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (BCornwall) @dcausse Are these action items filed into appropriate places such that this ticket, which seems "finished", can be closed?
[20:19:44] * TheresNoTime should have asked, apologies :) https://wikitech.wikimedia.org/wiki/WikimediaDebug 
[20:20:28] <Aishik>	 Yep! It's working
[20:20:36] <TheresNoTime>	 Great! Will sync :)
[20:21:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:21:53] <Aishik>	 Thanks! Do I need to do anything else?
[20:22:15] <wikibugs>	 (Merged) jenkins-bot: Re-enable track_total_hits for elastic7 [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/831548 (https://phabricator.wikimedia.org/T317374) (owner: Ebernhardson)
[20:22:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:22:52] <TheresNoTime>	 Aishik: Test again on production proper in about ~4 minutes, I'll ping you :)
[20:23:04] <TheresNoTime>	 (er, more like 2 minutes)
[20:23:18] <TheresNoTime>	 ebernhardson: will loop back to 831548 next, are you available to test? I will note that's a lot of files to be backported
[20:23:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:23:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:23:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[20:23:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[20:24:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T314041)', diff saved to https://phabricator.wikimedia.org/P34546 and previous config saved to /var/cache/conftool/dbconfig/20220912-202359-ladsgroup.json
[20:24:03] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[20:24:14] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, LDAP-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall)
[20:24:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:24:36] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:830982|Create six more namespaces (three content namespaces and their corresponding three discussion namespaces) on the bn.wiktionary (T317424)]] (duration: 08m 14s)
[20:24:39] <stashbot>	 T317424: Create six more namespaces on the Bengali Wiktionary - https://phabricator.wikimedia.org/T317424
[20:24:59] <TheresNoTime>	 Aishik: sync'd fully :) just test if you don't mind, this time not using mwdebug
[20:26:00] <wikibugs>	 SRE, SRE-swift-storage, Data Engineering Planning, Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (dcausse) Open→Resolved a:dcausse @BCornwall yes, this ticket can be closed, remaining work is tracked here: - complete the cleanup:...
[20:26:19] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/831548 (https://phabricator.wikimedia.org/T317374) (owner: Ebernhardson)
[20:26:33] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:831548|Re-enable track_total_hits for elastic7 (T317374)]]
[20:26:37] <stashbot>	 T317374: Search now caps total results count at 10k because of elasticsearch 7 upgrade - https://phabricator.wikimedia.org/T317374
[20:26:55] <logmsgbot>	 !log samtar@deploy1002 samtar and ebernhardson: Backport for [[gerrit:831548|Re-enable track_total_hits for elastic7 (T317374)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:27:01] <wikibugs>	 SRE, Observability-Metrics, observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (lmata)
[20:27:17] <wikibugs>	 SRE, Observability-Alerting, observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (lmata)
[20:27:46] <TheresNoTime>	 ebernhardson: please test on mwdebug1001
[20:28:22] <ebernhardson>	 TheresNoTime: works as expected
[20:28:27] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, LDAP-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (Dzahn) > SSH: configured to access all our servers, including an-launcher1002  We can't  be sure what the definition of "all our servers" is.  In gener...
[20:28:31] <TheresNoTime>	 Syncing
[20:28:43] <Aishik>	 It's totally ok! (this 🙂 emoji is my favourite too)
[20:28:54] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (cmooney) @Jgreen my bad yeah they were both still part of the disabled group.  Both up/up now, hopefully looks better your side too. ` cmooney@fasw-c-eq...
[20:28:57] <Aishik>	 P
[20:29:15] <TheresNoTime>	 ^^
[20:29:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:29:40] <wikibugs>	 (PS2) Cwhite: rsyslog: add rsyslog-namespaced fields to syslog_cee [puppet] - https://gerrit.wikimedia.org/r/824315 (https://phabricator.wikimedia.org/T315500)
[20:30:41] <wikibugs>	 SRE, Observability-Metrics, SRE Observability (FY2022/2023-Q1): librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (lmata)
[20:30:57] <wikibugs>	 (PS5) Samtar: Deploy Research Incentive Survey to idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466) (owner: DDesouza)
[20:31:03] <icinga-wm>	 PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:31:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P34547 and previous config saved to /var/cache/conftool/dbconfig/20220912-203110-ladsgroup.json
[20:31:26] <TheresNoTime>	 danisztls: going to do 830917 next, are you available to test?
[20:32:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:32:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:32:19] <danisztls>	 TheresNoTime: yes
[20:32:45] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:831548|Re-enable track_total_hits for elastic7 (T317374)]] (duration: 06m 12s)
[20:32:48] <stashbot>	 T317374: Search now caps total results count at 10k because of elasticsearch 7 upgrade - https://phabricator.wikimedia.org/T317374
[20:32:58] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466) (owner: DDesouza)
[20:33:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:33:45] <wikibugs>	 (Merged) jenkins-bot: Deploy Research Incentive Survey to idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466) (owner: DDesouza)
[20:34:00] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:830917|Deploy Research Incentive Survey to idwiki (T316466)]]
[20:34:03] <stashbot>	 T316466: Deploy Research Incentive Survey on Indonesian Wikipedia - https://phabricator.wikimedia.org/T316466
[20:34:19] <logmsgbot>	 !log samtar@deploy1002 samtar and dani: Backport for [[gerrit:830917|Deploy Research Incentive Survey to idwiki (T316466)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[20:34:20] <wikibugs>	 (CR) Samtar: [C: +2] "Deploy, set this merging as it takes a while.." [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/831549 (owner: Ebernhardson)
[20:34:24] <wikibugs>	 SRE, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2022/2023-Q1): icinga raid monitoring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (lmata)
[20:34:42] <TheresNoTime>	 danisztls: Live on mwdebug1001, please test :)
[20:36:06] <danisztls>	 TheresNoTime: looks good
[20:36:30] <TheresNoTime>	 danisztls: okay, syncing
[20:37:26] <danisztls>	 TheresNoTime: thanks
[20:37:55] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler={proxy:unix:/run/php/fpm-www-7.2.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[20:37:59] <wikibugs>	 (PS3) Samtar: Enable Nearby on Hebrew and French Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/831117 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson)
[20:38:03] <Jdlrobson>	 o/
[20:38:07] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dispatch-be1001.eqiad.wmnet
[20:38:15] <TheresNoTime>	 Jdlrobson: will be doing 831117 next
[20:38:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:39:18] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, LDAP-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall)
[20:39:35] <jhathaway>	 !log testing exim config change on mx1001.wikimedia.org
[20:39:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:58] <wikibugs>	 (CR) Dzahn: "I am wondering how many SCAP env variables there are. If it's just a few it seems nicer to list them explicitly and use "env_keep"." [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[20:40:26] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:830917|Deploy Research Incentive Survey to idwiki (T316466)]] (duration: 06m 25s)
[20:40:28] <stashbot>	 T316466: Deploy Research Incentive Survey on Indonesian Wikipedia - https://phabricator.wikimedia.org/T316466
[20:40:46] <TheresNoTime>	 danisztls: sync'd, could you give it another test to be sure? :)
[20:40:47] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[20:41:12] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/831117 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson)
[20:41:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:41:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:42:06] <wikibugs>	 (Merged) jenkins-bot: Enable Nearby on Hebrew and French Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/831117 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson)
[20:42:08] <wikibugs>	 (CR) Cwhite: [C: +2] rsyslog: add rsyslog-namespaced fields to syslog_cee [puppet] - https://gerrit.wikimedia.org/r/824315 (https://phabricator.wikimedia.org/T315500) (owner: Cwhite)
[20:42:10] <wikibugs>	 (CR) Dzahn: "Is it only "$SCAP_FINAL_PATH and $SCAP_REV_PATH" in scap3?" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[20:42:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:42:19] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:831117|Enable Nearby on Hebrew and French Wikipedia (T246493)]]
[20:42:21] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, LDAP-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall) Thanks for the clarification, @Dzahn! Unless there's dissent, I'll just add them to the analytics-admins group as was suggested.  @Milimetri...
[20:42:22] <stashbot>	 T246493: [EPIC] Deploy NearbyPages everywhere - https://phabricator.wikimedia.org/T246493
[20:42:38] <logmsgbot>	 !log samtar@deploy1002 samtar and jdlrobson: Backport for [[gerrit:831117|Enable Nearby on Hebrew and French Wikipedia (T246493)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:42:58] <TheresNoTime>	 Jdlrobson: Live on mwdebug1001, could you test please? :)
[20:43:04] <Jdlrobson>	 looking
[20:43:20] <danisztls>	 TheresNoTime: yes, not working now
[20:43:41] <TheresNoTime>	 danisztls: your patch is not working in production?
[20:44:08] <danisztls>	 TheresNoTime: only on debug
[20:44:15] <danisztls>	 not working on production
[20:45:05] <TheresNoTime>	 hm, okay, one moment
[20:45:05] <Jdlrobson>	 TheresNoTime: please sync!
[20:45:31] <TheresNoTime>	 danisztls: Going to sync Jdlrobson's patch and then come back to look at that..
[20:45:57] <danisztls>	 TheresNoTime: pc issue, working on another device, sorry
[20:46:05] <TheresNoTime>	 phew!
[20:46:15] <TheresNoTime>	 best kind of bug! :D
[20:46:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P34548 and previous config saved to /var/cache/conftool/dbconfig/20220912-204617-ladsgroup.json
[20:47:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:48:03] <TheresNoTime>	 ebernhardson: once this patch is merged, I'll move onto 831549 - it's almost merged :)
[20:48:09] <ebernhardson>	 kk
[20:48:14] <TheresNoTime>	 s/merged/sync'd
[20:48:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:48:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:49:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:49:46] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:831117|Enable Nearby on Hebrew and French Wikipedia (T246493)]] (duration: 07m 27s)
[20:49:50] <stashbot>	 T246493: [EPIC] Deploy NearbyPages everywhere - https://phabricator.wikimedia.org/T246493
[20:49:55] <TheresNoTime>	 Jdlrobson: Sync'd ^ :)
[20:50:47] <wikibugs>	 (PS1) BCornwall: admin: Add Hannah Okwelum to analytics-admins [puppet] - https://gerrit.wikimedia.org/r/831622 (https://phabricator.wikimedia.org/T317545)
[20:50:54] <Jdlrobson>	 Thanks TheresNoTime 
[20:50:55] <wikibugs>	 (Merged) jenkins-bot: Set track_total_hits to true [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/831549 (owner: Ebernhardson)
[20:50:59] <Jdlrobson>	 ill keep an eye on the logs
[20:51:15] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/831549 (owner: Ebernhardson)
[20:51:27] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:831549|Set track_total_hits to true]]
[20:51:42] <wikibugs>	 SRE-OnFire, Observability-Alerting, SRE Observability (FY2022/2023-Q2): Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (lmata)
[20:51:46] <logmsgbot>	 !log samtar@deploy1002 samtar and ebernhardson: Backport for [[gerrit:831549|Set track_total_hits to true]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:51:52] <ebernhardson>	 TheresNoTime: this one isn't properly testable, none of the changes here are run in response to an http request. Should be fine to sync out
[20:52:09] <TheresNoTime>	 ebernhardson: ack, syncing :)
[20:53:59] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall)
[20:54:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:55:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:55:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:56:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:56:27] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:831549|Set track_total_hits to true]] (duration: 05m 00s)
[20:56:51] <TheresNoTime>	 everything sync'd
[20:57:26] <TheresNoTime>	 !log closing UTC late backport window
[20:57:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:46] <ebernhardson>	 TheresNoTime: thanks!
[20:57:53] <TheresNoTime>	 No worries!
[20:58:50] <wikibugs>	 SRE, Infrastructure-Foundations: Hide the client IP address in the SMTP Received header for authenticated relay clients - https://phabricator.wikimedia.org/T317574 (jhathaway)
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: I, the Bot under the Fountain, call upon thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220912T2100).
[21:01:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T312863)', diff saved to https://phabricator.wikimedia.org/P34549 and previous config saved to /var/cache/conftool/dbconfig/20220912-210123-ladsgroup.json
[21:01:24] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[21:01:27] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[21:04:04] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (Jgreen) >>! In T317539#8230385, @cmooney wrote: > @Jgreen my bad yeah they were both still part of the disabled group. >  > Both up/up now, hopefully lo...
[21:04:36] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (Jgreen) Open→Resolved a:Jgreen
[21:07:22] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[21:07:30] <wikibugs>	 (CR) Dzahn: "from a glance at hieradata this groups includes a LOT of things and the access request was for "all the things". that's all I know." [puppet] - https://gerrit.wikimedia.org/r/831622 (https://phabricator.wikimedia.org/T317545) (owner: BCornwall)
[21:07:42] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 19s)
[21:12:32] <wikibugs>	 (PS1) BCornwall: prometheus: Remove quotes from ATS config gauge [puppet] - https://gerrit.wikimedia.org/r/831624 (https://phabricator.wikimedia.org/T292815)
[21:18:09] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall) From the CR which is currently not approved:  > from a glance at hieradata this groups includes a LOT of things and the access request was for "...
[21:20:10] <wikibugs>	 SRE, Infrastructure-Foundations: Hide the client IP address in the SMTP Received header for authenticated relay clients - https://phabricator.wikimedia.org/T317574 (jhathaway)
[21:21:18] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[21:23:10] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[21:23:14] <icinga-wm>	 RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:24:12] <wikibugs>	 (PS1) JHathaway: mail::mx: Modify the Received header [puppet] - https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574)
[21:24:50] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:25:20] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: JHathaway)
[21:25:37] <wikibugs>	 (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: JHathaway)
[21:32:12] <icinga-wm>	 RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:32:15] <wikibugs>	 (CR) Dduvall: phabricator: Allow deploy user to preserve environment when sudoing (2 comments) [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[21:35:23] <wikibugs>	 (PS1) Cwhite: logstash: heavily sample k8s proxy/httpd logs [puppet] - https://gerrit.wikimedia.org/r/831626 (https://phabricator.wikimedia.org/T313099)
[21:36:51] <wikibugs>	 SRE, Observability-Logging, Patch-For-Review, SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (colewhite) >>! In T300130#8228079, @elukey wrote: > @colewhite does it sound good?  SGTM!  Thanks!
[21:51:07] <wikibugs>	 (CR) Dzahn: phabricator: Allow deploy user to preserve environment when sudoing (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[21:51:42] <wikibugs>	 (CR) Dzahn: phabricator: Allow deploy user to preserve environment when sudoing (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[21:54:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T314041)', diff saved to https://phabricator.wikimedia.org/P34550 and previous config saved to /var/cache/conftool/dbconfig/20220912-215407-ladsgroup.json
[21:54:11] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[21:56:41] <wikibugs>	 (PS1) Dzahn: disable git-ssh.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/831627 (https://phabricator.wikimedia.org/T296022)
[21:57:01] <wikibugs>	 (PS2) Dzahn: disable git-ssh.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/831627 (https://phabricator.wikimedia.org/T296022)
[21:58:08] <wikibugs>	 (CR) Dzahn: "though.. if we do this we will get a lot of monitoring alerts... hrmmm. First removing it as a service from LVS/pybal is not as easy and c" [dns] - https://gerrit.wikimedia.org/r/831627 (https://phabricator.wikimedia.org/T296022) (owner: Dzahn)
[22:02:21] <wikibugs>	 (CR) Thcipriani: [C: +1] "🎉" [dns] - https://gerrit.wikimedia.org/r/831627 (https://phabricator.wikimedia.org/T296022) (owner: Dzahn)
[22:03:55] <wikibugs>	 SRE, serviceops: mediawiki::api: net.ipv4.local_port_range sysctl config does not exist - https://phabricator.wikimedia.org/T317454 (Dzahn) thanks @paladox  confirmed. it's `ip_local_port_range` under `/ipv4/`.  https://tldp.org/LDP/solrhe/Securing-Optimizing-Linux-RH-Edition-v1.3/chap6sec70.html
[22:06:58] <wikibugs>	 (PS1) Dzahn: mediawiki::api: fix kernel parameter name ip_local_port_range [puppet] - https://gerrit.wikimedia.org/r/831629 (https://phabricator.wikimedia.org/T317454)
[22:07:08] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:08:14] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:08:40] <wikibugs>	 (CR) Dzahn: "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/831629" [puppet] - https://gerrit.wikimedia.org/r/401714 (https://phabricator.wikimedia.org/T182568) (owner: Giuseppe Lavagetto)
[22:09:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P34551 and previous config saved to /var/cache/conftool/dbconfig/20220912-220914-ladsgroup.json
[22:11:27] <wikibugs>	 (CR) JHathaway: [C: +1] "Looks good to me!" [dns] - https://gerrit.wikimedia.org/r/831104 (https://phabricator.wikimedia.org/T211401) (owner: Jgreen)
[22:12:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:13:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:13:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:14:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:20:18] <mutante>	 !log phabricator - disabling repository "tool-ranker" 
[22:20:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:11] <mutante>	 !log phabricator - disabling repositories: tool-xh-bot, tool-editor-contribution-dashboard, tool-ranker, tool-editor-contribution, tool-mikasa-bot-1, tool-maintun, tool-add-text, tool-wikibookassamese-book.php (none of them had commits) T296022 - T315706
[22:23:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:16] <stashbot>	 T315706: Migrate existing Striker created Diffusion repos to GitLab - https://phabricator.wikimedia.org/T315706
[22:23:17] <stashbot>	 T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022
[22:24:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P34552 and previous config saved to /var/cache/conftool/dbconfig/20220912-222420-ladsgroup.json
[22:27:07] <wikibugs>	 (PS3) Dduvall: scap: Allow deploy user to keep scap3 environment variables with sudo [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259)
[22:27:43] <wikibugs>	 (CR) CI reject: [V: -1] scap: Allow deploy user to keep scap3 environment variables with sudo [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[22:29:18] <wikibugs>	 (PS4) Dduvall: scap: Allow deploy user to keep scap3 environment variables with sudo [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259)
[22:30:59] <wikibugs>	 (CR) Dduvall: "Note this is now a change to `scap::target` and will effect all cases where `scap::target` is used with the `sudo_rules` parameter. Howeve" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[22:39:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T314041)', diff saved to https://phabricator.wikimedia.org/P34553 and previous config saved to /var/cache/conftool/dbconfig/20220912-223927-ladsgroup.json
[22:39:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[22:39:31] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[22:39:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[22:39:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[22:40:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[22:40:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T314041)', diff saved to https://phabricator.wikimedia.org/P34554 and previous config saved to /var/cache/conftool/dbconfig/20220912-224006-ladsgroup.json
[22:43:44] <icinga-wm>	 RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:43:46] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[22:53:29] <mutante>	 !log phabricator - disabling MediaWiki extension repositories in Diffusion that have 0 commits - T296022 - T315706
[22:53:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:34] <stashbot>	 T315706: Migrate existing Striker created Diffusion repos to GitLab - https://phabricator.wikimedia.org/T315706
[22:53:34] <stashbot>	 T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022
[23:05:49] <wikibugs>	 (PS1) Dzahn: phabricator: Allow deploy user to keep scap3 environment variables with sudo [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259)
[23:06:25] <wikibugs>	 (CR) CI reject: [V: -1] phabricator: Allow deploy user to keep scap3 environment variables with sudo [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:08:54] <wikibugs>	 (PS2) Dzahn: phabricator: Allow deploy user to keep scap3 environment variables with sudo [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259)
[23:12:02] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/37240/phab1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:13:42] <wikibugs>	 (CR) Dduvall: [C: +1] "Seems like a good initial approach! Thanks for doing the legwork, Daniel." [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:14:06] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: Allow deploy user to keep scap3 environment variables with sudo [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:16:34] <wikibugs>	 (CR) Dzahn: [C: +2] "nope, that would have been not complex enough yet:/" [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:18:20] <wikibugs>	 (CR) Dzahn: [C: +2] "I am glad we did not do this in scap::target :) puppet is broken. disabled on phab1001" [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:18:55] <wikibugs>	 (CR) Dzahn: [C: +2] "but I totally CAN manually run that command that failed in puppet" [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:19:30] <wikibugs>	 (CR) Dzahn: [C: +2] "..because that file with the new rules does not exist anymore now." [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:24:15] <wikibugs>	 (CR) Dzahn: [C: +2] ">>> /etc/sudoers.d/scap_sudo_rules_phab-deploy_phabricator_deployment: syntax error near line 3 <<<" [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:30:11] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM! Thanks!" [alerts] - https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) (owner: Filippo Giunchedi)
[23:31:00] <wikibugs>	 (CR) Dzahn: "an issue here is that sudo::user always starts a line with the user name, so this ends up becoming:" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:31:53] <wikibugs>	 (CR) Dzahn: "I'll try to come up with a fix for that tomorrow. Maybe we can just turn the entire sudo file into a template for this case." [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:32:56] <wikibugs>	 (CR) Dzahn: "..or we can add a new class lets us add generic sudo config lines that don't need to start with the user name" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:33:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T314041)', diff saved to https://phabricator.wikimedia.org/P34555 and previous config saved to /var/cache/conftool/dbconfig/20220912-233327-ladsgroup.json
[23:33:31] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[23:33:37] <wikibugs>	 (CR) Dzahn: "tested at https://gerrit.wikimedia.org/r/c/operations/puppet/+/831634  and WIP" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:34:03] <wikibugs>	 (PS1) Dzahn: Revert "phabricator: Allow deploy user to keep scap3 environment variables with sudo" [puppet] - https://gerrit.wikimedia.org/r/831554
[23:36:01] <wikibugs>	 (CR) Dzahn: "wait, "phab-deploy env_keep+=SCAP_* ALL=(root) NOPASSWD: /usr/local/sbin/phab_deploy_config_deploy" could also do it I guess" [puppet] - https://gerrit.wikimedia.org/r/831554 (owner: Dzahn)
[23:48:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P34556 and previous config saved to /var/cache/conftool/dbconfig/20220912-234833-ladsgroup.json
[23:50:49] <wikibugs>	 (CR) Dduvall: "Darn! How about we add an additional parameter to sudo::user for defaults?" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:51:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:53:04] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:54:39] <wikibugs>	 (CR) Dzahn: "Yea, either that or maybe we use the restricted_env_file or env_file. We could define all the SCAP env variables there and give them value" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:57:29] <wikibugs>	 (CR) Dzahn: "fwiw, toolforge just does it like this, with a plain file dropped into sudoers.d:" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)