[00:02:56] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P34419 and previous config saved to /var/cache/conftool/dbconfig/20220912-000356-ladsgroup.json [00:04:00] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [00:10:08] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:12:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:13:54] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:16:18] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:17:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:19:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P34420 and previous config saved to /var/cache/conftool/dbconfig/20220912-001902-ladsgroup.json [00:25:06] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:31:46] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:18] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:34:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P34421 and previous config saved to /var/cache/conftool/dbconfig/20220912-003409-ladsgroup.json [00:34:38] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:38:50] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P34422 and previous config saved to /var/cache/conftool/dbconfig/20220912-004915-ladsgroup.json [00:49:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [00:49:20] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [00:49:24] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [00:49:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [00:49:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [00:49:50] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:49:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T312863)', diff saved to https://phabricator.wikimedia.org/P34423 and previous config saved to /var/cache/conftool/dbconfig/20220912-004952-ladsgroup.json [00:51:28] PROBLEM - k8s API server requests latencies 
on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:56:38] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:57:04] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:58:40] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:02:54] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:10:08] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:13:54] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:18:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:21:10] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:21:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T314041)', diff saved to https://phabricator.wikimedia.org/P34424 and previous config saved to /var/cache/conftool/dbconfig/20220912-012118-ladsgroup.json [01:21:22] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [01:23:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:25:12] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:31:50] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:32:24] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:36:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P34425 and previous config saved to /var/cache/conftool/dbconfig/20220912-013625-ladsgroup.json [01:36:45] (JobUnavailable) firing: (2) Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:37:12] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:39:02] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:49:34] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:49:58] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:51:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P34426 and previous config saved to /var/cache/conftool/dbconfig/20220912-015131-ladsgroup.json [01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:24] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:53:16] PROBLEM - SSH on mw1311.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:56:50] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T314041)', diff saved to https://phabricator.wikimedia.org/P34427 and previous config saved to /var/cache/conftool/dbconfig/20220912-020638-ladsgroup.json [02:06:42] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [02:06:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:04] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:32:02] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:33:20] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:39:14] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:43:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [02:45:24] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:49:46] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:50:58] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:53:22] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 6 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:57:02] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:05:14] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:07:04] PROBLEM - k8s API server requests latencies on kubemaster1002 
is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:12:28] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:14:18] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:32:16] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:39:30] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:50:04] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:55:58] RECOVERY - SSH on mw1311.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:57:18] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:05:30] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:12:42] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:24:43] (PS1) Stang: Revert "kowiki: Change logo for 600k 
articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831211 [04:24:55] (PS1) Stang: Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831212 [04:25:11] (PS2) Stang: Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831212 [04:26:20] (PS2) Stang: Revert "kowiki: Change logo for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831211 [04:26:31] (PS3) Stang: Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831212 [04:32:30] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:36:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T314041)', diff saved to https://phabricator.wikimedia.org/P34428 and previous config saved to /var/cache/conftool/dbconfig/20220912-043624-ladsgroup.json [04:36:29] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [04:39:42] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:44:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:49:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:50:16] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:51:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P34429 and previous config saved to /var/cache/conftool/dbconfig/20220912-045130-ladsgroup.json [04:53:04] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:55:30] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:57:32] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P34430 and previous config saved to /var/cache/conftool/dbconfig/20220912-050636-ladsgroup.json [05:14:48] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:16:32] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 
[05:19:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2020 for upgrade T317507', diff saved to https://phabricator.wikimedia.org/P34431 and previous config saved to /var/cache/conftool/dbconfig/20220912-051906-root.json [05:19:10] T317507: Switchover es4 codfw master (es2021 -> es2020) - https://phabricator.wikimedia.org/T317507 [05:19:16] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:19:38] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:19:44] (PS1) Marostegui: es2020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/831369 [05:20:51] (CR) Marostegui: [C: +2] es2020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/831369 (owner: Marostegui) [05:21:21] !log dbmaint Reboot es2020 for kernel upgrade T317507 [05:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T314041)', diff saved to https://phabricator.wikimedia.org/P34432 and previous config saved to /var/cache/conftool/dbconfig/20220912-052143-ladsgroup.json [05:21:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [05:21:46] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [05:21:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [05:22:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T317507 [05:23:05] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T317507 [05:23:44] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:26:28] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:32:48] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:35:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 5%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34433 and previous config saved to /var/cache/conftool/dbconfig/20220912-053504-root.json [05:35:48] (PS1) Marostegui: Revert "es2020: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/831213 [05:36:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:36:28] (CR) Marostegui: [C: +2] Revert "es2020: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/831213 (owner: Marostegui) [05:40:02] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:40:36] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:46:04] RECOVERY - Check systemd 
state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:47:50] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:50:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 10%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34434 and previous config saved to /var/cache/conftool/dbconfig/20220912-055008-root.json [05:51:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2024 for upgrade', diff saved to https://phabricator.wikimedia.org/P34435 and previous config saved to /var/cache/conftool/dbconfig/20220912-055101-root.json [05:58:12] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:03:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34436 and previous config saved to /var/cache/conftool/dbconfig/20220912-060305-root.json [06:05:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 25%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34437 and previous config saved to /var/cache/conftool/dbconfig/20220912-060513-root.json [06:05:33] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:09:17] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:18:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34438 and previous config saved to /var/cache/conftool/dbconfig/20220912-061810-root.json [06:19:07] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:20:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 50%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34439 and previous config saved to /var/cache/conftool/dbconfig/20220912-062018-root.json [06:25:09] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:29:09] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:31:48] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:32:56] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34440 and previous config saved to /var/cache/conftool/dbconfig/20220912-063314-root.json [06:35:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 75%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34441 and previous config saved to /var/cache/conftool/dbconfig/20220912-063523-root.json [06:36:24] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:37:36] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:37:42] (PS1) Ladsgroup: Stop writing to the old templatelinks fields everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/831374 (https://phabricator.wikimedia.org/T312865) [06:43:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [06:43:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T312863)', diff saved to https://phabricator.wikimedia.org/P34442 and previous config saved to /var/cache/conftool/dbconfig/20220912-064350-ladsgroup.json [06:43:54] 
T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [06:47:37] !log installing 5.10.136 updates on buster systems running 5.10 [06:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34443 and previous config saved to /var/cache/conftool/dbconfig/20220912-064819-root.json [06:50:24] (CR) Slyngshede: [C: +2] C:raid::perccli handle case with no virtual devices. [puppet] - https://gerrit.wikimedia.org/r/831057 (https://phabricator.wikimedia.org/T317344) (owner: Slyngshede) [06:50:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 100%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34444 and previous config saved to /var/cache/conftool/dbconfig/20220912-065028-root.json [06:53:48] (CR) Elukey: [C: +1] Class-based cookbooks: use parent argument_parser [cookbooks] - https://gerrit.wikimedia.org/r/830786 (owner: Volans) [06:55:44] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:58:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P34445 and previous config saved to /var/cache/conftool/dbconfig/20220912-065856-ladsgroup.json [07:01:25] jouncebot: nowandnext [07:01:26] No deployments scheduled for the forseeable future! [07:01:26] No deployments scheduled for the forseeable future! 
[07:01:43] aaah, the calendar is not added, then I just deploy something [07:02:10] (03CR) 10Ladsgroup: [C: 03+2] Stop writing to the old templatelinks fields everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831374 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [07:02:56] (03Merged) 10jenkins-bot: Stop writing to the old templatelinks fields everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831374 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [07:03:00] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:03:12] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:03:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34446 and previous config saved to /var/cache/conftool/dbconfig/20220912-070324-root.json [07:03:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831374 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [07:04:08] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:831374|Stop writing to the old templatelinks fields everywhere (T312865)]] [07:04:10] T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865 [07:04:32] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:831374|Stop writing to the old templatelinks fields everywhere (T312865)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [07:06:02] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:06:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:06:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:07:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:10:22] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:11:05] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:831374|Stop writing to the old templatelinks fields everywhere (T312865)]] (duration: 06m 57s) [07:11:09] T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865 [07:14:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P34447 and previous config saved to /var/cache/conftool/dbconfig/20220912-071403-ladsgroup.json [07:16:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [07:16:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [07:17:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34448 and previous config saved to /var/cache/conftool/dbconfig/20220912-071700-ladsgroup.json [07:17:03] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [07:18:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34449 and previous config saved to 
/var/cache/conftool/dbconfig/20220912-071829-root.json [07:23:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:26:55] 10SRE-OnFire, 10Observability-Alerting: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10fgiunchedi) Thank you for taking the time to look into this @cdanis! Overall LGTM on the fixes you are suggesting [07:27:04] 10SRE, 10Machine-Learning-Team, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Aklapper) Adding #Machine-Learning-Team per my last question [07:27:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:27:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:29:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T312863)', diff saved to https://phabricator.wikimedia.org/P34450 and previous config saved to /var/cache/conftool/dbconfig/20220912-072909-ladsgroup.json [07:29:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [07:29:13] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [07:29:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [07:29:25] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-General, 10Thumbor: File:Keep_tidy_ask.svg 404 on Commons - https://phabricator.wikimedia.org/T314712 (10Aklapper) `Original file` link works; is there more to do in this ticket or can this be `resolved`? 
[07:29:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T312863)', diff saved to https://phabricator.wikimedia.org/P34452 and previous config saved to /var/cache/conftool/dbconfig/20220912-072931-ladsgroup.json [07:31:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:31:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [07:31:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [07:33:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T317507 [07:33:21] T317507: Switchover es4 codfw master (es2021 -> es2020) - https://phabricator.wikimedia.org/T317507 [07:33:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T317507 [07:34:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es2020 with weight 0 T317507', diff saved to https://phabricator.wikimedia.org/P34453 and previous config saved to /var/cache/conftool/dbconfig/20220912-073408-root.json [07:37:28] (03PS1) 10Marostegui: mariadb: Promote es2020 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/831464 (https://phabricator.wikimedia.org/T317507) [07:38:46] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote es2020 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/831464 (https://phabricator.wikimedia.org/T317507) (owner: 10Marostegui) [07:39:21] !log Starting es4 codfw failover from es2021 to es2020 - T317507 [07:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:25] T317507: Switchover es4 codfw master (es2021 -> es2020) - https://phabricator.wikimedia.org/T317507 [07:39:28] (03PS1) 10Jforrester: Restore compatibility with overrides for 
IndexPager::makeLink() [core] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831215 (https://phabricator.wikimedia.org/T317477) [07:41:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2020 to es4 primary and set section read-write T317507', diff saved to https://phabricator.wikimedia.org/P34454 and previous config saved to /var/cache/conftool/dbconfig/20220912-074100-root.json [07:42:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2021 T317507', diff saved to https://phabricator.wikimedia.org/P34455 and previous config saved to /var/cache/conftool/dbconfig/20220912-074258-root.json [07:43:12] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Remove obsolete recording rules [puppet] - 10https://gerrit.wikimedia.org/r/830644 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [07:43:45] (03PS3) 10Aklapper: Revert "kowiki: Change logo for 600k articles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831211 (https://phabricator.wikimedia.org/T315127) (owner: 10Stang) [07:43:46] Good "morning", I am upgrading the Jenkins instances this morning [07:43:53] (03PS4) 10Aklapper: Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831212 (https://phabricator.wikimedia.org/T315127) (owner: 10Stang) [07:45:16] (03PS1) 10Marostegui: es2021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831478 [07:45:59] (03CR) 10Marostegui: [C: 03+2] es2021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831478 (owner: 10Marostegui) [07:47:16] !log Upgraded Jenkins instances from 2.346.1 to 2.346.3 # T317418 [07:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:19] T317418: Upgrade Jenkins to latest LTS 2.361.1 - https://phabricator.wikimedia.org/T317418 [07:47:42] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see also inline" [puppet] - 10https://gerrit.wikimedia.org/r/831114 
(https://phabricator.wikimedia.org/T251293) (owner: 10Cwhite) [07:48:41] (03CR) 10Filippo Giunchedi: [C: 03+1] "Very nice! I _think_ you can also nuke the resources (without the ensure => absent dance) in this case" [puppet] - 10https://gerrit.wikimedia.org/r/830643 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [07:49:01] (03PS2) 10David Caro: bootstrap_and_add: added preflight checks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) [07:49:33] (03CR) 10David Caro: [V: 03+1 C: 03+2] opensatck: remove some not needed absented resources [puppet] - 10https://gerrit.wikimedia.org/r/831089 (owner: 10David Caro) [07:49:44] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:50:44] (03PS1) 10Cathal Mooney: Depool esams for cr3-esams core router upgrade. [dns] - 10https://gerrit.wikimedia.org/r/831479 (https://phabricator.wikimedia.org/T295690) [07:51:51] (03PS3) 10Volans: Class-based cookbooks: use parent argument_parser [cookbooks] - 10https://gerrit.wikimedia.org/r/830786 [07:53:15] (03PS3) 10Filippo Giunchedi: sre: followup on Kafka partition replication alerts [alerts] - 10https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) [07:53:17] (03PS7) 10Filippo Giunchedi: sre: port Zookeeper alerts [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) [07:53:59] (03CR) 10Filippo Giunchedi: sre: followup on Kafka partition replication alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) (owner: 10Filippo Giunchedi) [07:54:29] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/831479 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [07:55:13] (03CR) 10Cathal Mooney: [C: 03+2] Depool esams for cr3-esams core router 
upgrade. [dns] - 10https://gerrit.wikimedia.org/r/831479 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [07:55:14] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:56:16] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10cmooney) Ok cool well we can close this in that case I think. Cheers. [07:56:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es5 T317508 [07:56:43] T317508: Switchover es5 codfw master (es2023 -> es2024) - https://phabricator.wikimedia.org/T317508 [07:56:56] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:56:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es5 T317508 [07:57:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es2024 with weight 0 T317508', diff saved to https://phabricator.wikimedia.org/P34456 and previous config saved to /var/cache/conftool/dbconfig/20220912-075739-root.json [07:57:45] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on kafka-logging2001.codfw.wmnet with reason: Kafka PKI upgrade [07:58:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on kafka-logging2001.codfw.wmnet with reason: Kafka PKI upgrade [07:58:44] (03CR) 10Elukey: [V: 03+1 C: 03+2] Move kafka on kafka-logging2001 to PKI TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/831096 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [08:00:38] !log de-pooling esams 
in advance of upgrade to core router cr3-esams T295690 [08:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:41] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [08:00:56] !log Restarting CI Jenkins for upgrade T317418 [08:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:59] T317418: Upgrade Jenkins to latest LTS 2.361.1 - https://phabricator.wikimedia.org/T317418 [08:01:25] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr3-esams with reason: router upgrade [08:01:31] !log restart kafka on kafka2001 to pick up new PKI settings [08:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:33] (03PS1) 10Marostegui: mariadb: Promote es2024 to es5 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/831480 (https://phabricator.wikimedia.org/T317508) [08:01:39] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr3-esams with reason: router upgrade [08:01:47] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1e573369-5fdd-4621-8ae7-786b5a67de04) set by cmooney@cumin1001 for 2:00:00 on 1 host(s) and th... 
[08:02:32] (03PS1) 10Jelto: gitlab_runner: allow Trusted Runners to access wikimedia docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/831481 (https://phabricator.wikimedia.org/T308271) [08:02:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote es2024 to es5 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/831480 (https://phabricator.wikimedia.org/T317508) (owner: 10Marostegui) [08:02:55] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:01] !log Starting es5 codfw failover from es2023 to es2024 - T317508 [08:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:05] T317508: Switchover es5 codfw master (es2023 -> es2024) - https://phabricator.wikimedia.org/T317508 [08:04:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2024 to es5 codfw primary T317508', diff saved to https://phabricator.wikimedia.org/P34457 and previous config saved to /var/cache/conftool/dbconfig/20220912-080400-root.json [08:05:56] I might have broken the CI Jenkins :-( [08:06:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2023 T317508', diff saved to https://phabricator.wikimedia.org/P34458 and previous config saved to /var/cache/conftool/dbconfig/20220912-080602-root.json [08:06:07] PROBLEM - DPKG on contint2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:06:11] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:47] PROBLEM - SSH on analytics1073.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:07:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 5%: Repooling 
after upgrade', diff saved to https://phabricator.wikimedia.org/P34459 and previous config saved to /var/cache/conftool/dbconfig/20220912-080719-root.json [08:07:23] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:07:24] (03CR) 10Volans: [C: 03+2] Class-based cookbooks: use parent argument_parser [cookbooks] - 10https://gerrit.wikimedia.org/r/830786 (owner: 10Volans) [08:07:39] (03PS1) 10Marostegui: Revert "es2021: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/831216 [08:07:59] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:08:33] (03CR) 10Marostegui: [C: 03+2] Revert "es2021: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/831216 (owner: 10Marostegui) [08:08:59] !log cmooney@cumin1001 START - Cookbook sre.network.cf [08:09:00] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [08:13:20] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) kafka-logging2001 migrated to PKI, all good from what I can see in metrics! Next steps: - wait a couple of days wi... [08:13:24] (03Merged) 10jenkins-bot: Class-based cookbooks: use parent argument_parser [cookbooks] - 10https://gerrit.wikimedia.org/r/830786 (owner: 10Volans) [08:15:42] (03CR) 10Muehlenhoff: "Another pass of comments, this is going into the right direction." 
[debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [08:17:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 5%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34460 and previous config saved to /var/cache/conftool/dbconfig/20220912-081754-root.json [08:17:56] !log imported jenkins 2.361.1 to thirdparty/ci T317418 [08:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:00] T317418: Upgrade Jenkins to latest LTS 2.361.1 - https://phabricator.wikimedia.org/T317418 [08:19:27] (03PS1) 10Marostegui: es1032: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831482 [08:19:33] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1032', diff saved to https://phabricator.wikimedia.org/P34461 and previous config saved to /var/cache/conftool/dbconfig/20220912-081936-root.json [08:20:15] (03CR) 10Marostegui: [C: 03+2] es1032: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831482 (owner: 10Marostegui) [08:21:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:22:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34462 and previous config saved to /var/cache/conftool/dbconfig/20220912-082224-root.json [08:22:26] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37221/console" [puppet] - 10https://gerrit.wikimedia.org/r/831481 (https://phabricator.wikimedia.org/T308271) (owner: 10Jelto) [08:23:01] 
(03PS1) 10Volans: doc: add TOX_SKIP_ENV example for development [software/spicerack] - 10https://gerrit.wikimedia.org/r/831483 [08:25:21] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:19] (03CR) 10Muehlenhoff: smart: restore get_fact and deprecate get_raid_drivers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) (owner: 10Cwhite) [08:28:20] (03PS1) 10Volans: sre.hosts.decommission: test IPMI connection [cookbooks] - 10https://gerrit.wikimedia.org/r/831484 [08:29:28] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: allow Trusted Runners to access wikimedia docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/831481 (https://phabricator.wikimedia.org/T308271) (owner: 10Jelto) [08:32:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 10%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34463 and previous config saved to /var/cache/conftool/dbconfig/20220912-083258-root.json [08:33:00] PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:33:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34464 and previous config saved to /var/cache/conftool/dbconfig/20220912-083308-root.json [08:36:01] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr3-esams,cr3-esams IPv6,cr3-esams.mgmt with reason: router upgrade [08:36:02] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr3-esams,cr3-esams IPv6,cr3-esams.mgmt with reason: router upgrade [08:36:18] RECOVERY - DPKG on contint2001 is OK: All packages OK 
https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:36:20] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:37:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34465 and previous config saved to /var/cache/conftool/dbconfig/20220912-083729-root.json [08:38:01] (03PS1) 10Marostegui: Revert "es1032: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/831218 [08:38:24] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1096.eqiad.wmnet [08:39:21] (03CR) 10Marostegui: [C: 03+2] Revert "es1032: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/831218 (owner: 10Marostegui) [08:39:34] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt with reason: router upgrade [08:39:50] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt with reason: router upgrade [08:39:54] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:39:58] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=57f0ae1d-0fa1-4b98-9454-bea638ac3971) set by cmooney@cumin1001 for 2:00:00 on 3 host(s) and th... 
[08:42:52] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:45:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1096.eqiad.wmnet [08:45:57] 10SRE, 10Infrastructure-Foundations, 10LDAP: Add slapd audit logs to backup - https://phabricator.wikimedia.org/T317516 (10MoritzMuehlenhoff) [08:47:05] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1097.eqiad.wmnet [08:47:46] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:48:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 25%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34466 and previous config saved to /var/cache/conftool/dbconfig/20220912-084803-root.json [08:48:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34467 and previous config saved to /var/cache/conftool/dbconfig/20220912-084812-root.json [08:52:17] (03CR) 10Btullis: [C: 03+2] Add configuration for k8s-dse in prometheus [puppet] - 10https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [08:52:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34468 and previous config saved to /var/cache/conftool/dbconfig/20220912-085234-root.json [08:53:51] (03CR) 10Clément Goubert: [C: 03+1] sre.hosts.decommission: test IPMI connection [cookbooks] - 10https://gerrit.wikimedia.org/r/831484 (owner: 10Volans) 
[08:54:29] (03CR) 10Jbond: "lgtm, see nits" [puppet] - 10https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) (owner: 10Cwhite) [08:55:17] (03PS1) 10Muehlenhoff: openldap: Include slapd-audit.log to backup [puppet] - 10https://gerrit.wikimedia.org/r/831490 (https://phabricator.wikimedia.org/T317516) [08:56:24] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:56:43] (03CR) 10Muehlenhoff: smart: restore get_fact and deprecate get_raid_drivers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) (owner: 10Cwhite) [08:56:50] jouncebot: next [08:56:50] No deployments scheduled for the forseeable future! [08:56:54] (03CR) 10Vgutierrez: [C: 04-1] "see inline comments, looking good overall" [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [08:57:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1097.eqiad.wmnet [09:00:33] (03PS3) 10JMeybohm: kubernetes: Remove obsolete monitoring::check_prometheus resources [puppet] - 10https://gerrit.wikimedia.org/r/830643 (https://phabricator.wikimedia.org/T311251) [09:00:35] (03PS3) 10JMeybohm: prometheus: Remove obsolete recording rules [puppet] - 10https://gerrit.wikimedia.org/r/830644 (https://phabricator.wikimedia.org/T311251) [09:02:16] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:02:22] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 88, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:02:44] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 50%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34469 and previous config saved to /var/cache/conftool/dbconfig/20220912-090308-root.json [09:03:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34470 and previous config saved to /var/cache/conftool/dbconfig/20220912-090317-root.json [09:05:22] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [09:05:43] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) [09:06:24] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:02] RECOVERY - SSH on analytics1073.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:07:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34471 and previous config saved to /var/cache/conftool/dbconfig/20220912-090739-root.json [09:08:02] (03CR) 10JMeybohm: [C: 03+2] kubernetes: Remove obsolete monitoring::check_prometheus resources [puppet] - 10https://gerrit.wikimedia.org/r/830643 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [09:08:14] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:09:29] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
an-worker1098.eqiad.wmnet [09:11:12] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:15:02] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:15:12] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:18:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1098.eqiad.wmnet [09:18:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 75%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34472 and previous config saved to /var/cache/conftool/dbconfig/20220912-091813-root.json [09:18:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34473 and previous config saved to /var/cache/conftool/dbconfig/20220912-091822-root.json [09:18:38] !log volans@cumin1001 START - Cookbook sre.dns.netbox [09:19:46] (03PS1) 10JMeybohm: prometheus: Keep envoy connection metrics [puppet] - 10https://gerrit.wikimedia.org/r/831492 (https://phabricator.wikimedia.org/T317430) [09:22:06] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:22:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2021 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34474 and previous config saved to /var/cache/conftool/dbconfig/20220912-092244-root.json [09:27:24] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes 
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:27:42] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:31:35] !log updated buster install image for 10.13 release T317413 [09:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:39] T317413: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 [09:32:09] SRE, Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (MoritzMuehlenhoff) [09:32:58] RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:33:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2023 (re)pooling @ 100%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34475 and previous config saved to /var/cache/conftool/dbconfig/20220912-093318-root.json [09:33:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34476 and previous config saved to /var/cache/conftool/dbconfig/20220912-093327-root.json [09:35:19] * Emperor really isn't still on clinic duty :p [09:35:37] who is? 
[09:35:47] (I only changed what I knew) [09:36:07] jynus: looks like brett [09:36:16] per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Schedule [09:36:28] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:36:30] * Emperor had just got there but marostegui is quicker :) [09:36:50] also I didn't change that because I thought CD change happened at SRE meeting time [09:37:24] * Emperor has usually done it from the start of their Monday working day [09:37:48] (so I'd expect brett to pick it up later, but I'm not expecting to do any clinic stuff today, IYSWIM) [09:38:29] yeah, Monday morning for whatever time is morning for the person is the current standard practice [09:38:46] especially given that we don't have weekly SRE meetings for some time now :-) [09:40:24] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:41:18] (CR) Jcrespo: [C: +1] "Please check it worked tomorrow, or ping me for me to check." 
[puppet] - https://gerrit.wikimedia.org/r/831490 (https://phabricator.wikimedia.org/T317516) (owner: Muehlenhoff) [09:41:56] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:45:54] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt with reason: router upgrade [09:45:58] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt with reason: router upgrade [09:46:05] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=39465e0b-b93d-45ba-b1d8-0c49dacc39fb) set by cmooney@cumin1001 for 2:00:00 on 3 host(s) and th... [09:46:30] (PS1) Jbond: prepare: rename hiera files indicateing new wmcs realm name [software/puppet-compiler] - https://gerrit.wikimedia.org/r/831494 [09:47:30] (CR) CI reject: [V: -1] prepare: rename hiera files indicateing new wmcs realm name [software/puppet-compiler] - https://gerrit.wikimedia.org/r/831494 (owner: Jbond) [09:48:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1033', diff saved to https://phabricator.wikimedia.org/P34477 and previous config saved to /var/cache/conftool/dbconfig/20220912-094818-root.json [09:48:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34478 and previous config saved to /var/cache/conftool/dbconfig/20220912-094832-root.json [09:51:03] (CR) JMeybohm: [C: +2] prometheus: Remove obsolete recording rules [puppet] - https://gerrit.wikimedia.org/r/830644 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm) [09:54:22] (PS2) Jbond: prepare: rename hiera files 
indicating new wmcs realm name [software/puppet-compiler] - https://gerrit.wikimedia.org/r/831494 [09:55:38] !log rebalance thanos rings T311690 [09:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:41] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 [09:57:12] (CR) Jbond: [C: +2] prepare: rename hiera files indicating new wmcs realm name [software/puppet-compiler] - https://gerrit.wikimedia.org/r/831494 (owner: Jbond) [09:59:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 5%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34479 and previous config saved to /var/cache/conftool/dbconfig/20220912-095918-root.json [09:59:54] (PS1) Jbond: puppet_compiler: bump version number [puppet] - https://gerrit.wikimedia.org/r/831495 [10:02:17] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34480 and previous config saved to /var/cache/conftool/dbconfig/20220912-100337-root.json [10:03:50] (CR) Jbond: [C: +2] puppet_compiler: bump version number [puppet] - https://gerrit.wikimedia.org/r/831495 (owner: Jbond) [10:06:33] (PS6) Jbond: puppetmaster: remove puppet-merge from wmcs instances [puppet] - https://gerrit.wikimedia.org/r/831228 (https://phabricator.wikimedia.org/T309281) (owner: Majavah) [10:08:01] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37224/console" [puppet] - https://gerrit.wikimedia.org/r/831228 (https://phabricator.wikimedia.org/T309281) (owner: Majavah) [10:08:21] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: 
The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:33] (PS3) Majavah: Remove support for overriding LDAP client stack [puppet] - https://gerrit.wikimedia.org/r/826536 [10:13:41] (CR) Jbond: [C: +2] "lgtm will merge" [puppet] - https://gerrit.wikimedia.org/r/831228 (https://phabricator.wikimedia.org/T309281) (owner: Majavah) [10:14:16] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37225/console" [puppet] - https://gerrit.wikimedia.org/r/826536 (owner: Majavah) [10:14:18] Puppet, Cloud-VPS, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Remove prod-specific bits from cloud puppetmasters - https://phabricator.wikimedia.org/T309281 (taavi) [10:14:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 10%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34481 and previous config saved to /var/cache/conftool/dbconfig/20220912-101423-root.json [10:14:51] (CR) Filippo Giunchedi: [C: +1] prometheus: Keep envoy connection metrics [puppet] - https://gerrit.wikimedia.org/r/831492 (https://phabricator.wikimedia.org/T317430) (owner: JMeybohm) [10:16:48] (PS2) Jbond: puppetmaster: explicitely specifify hiera config [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah) [10:17:10] (PS3) Jbond: puppetmaster: explicitely specifify hiera config [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah) [10:18:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34483 and previous config saved to /var/cache/conftool/dbconfig/20220912-101842-root.json [10:19:31] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1315 bytes in 
0.068 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:20:17] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:51] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 551 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:22:45] SRE, Data Engineering Planning, Data Pipelines, Shared-Data-Infrastructure: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (EChetty) [10:22:47] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided) [10:23:25] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 37s) [10:24:09] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:24:15] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:25:09] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 88, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:25:47] SRE, ops-eqiad, decommission-hardware, serviceops: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert) Thanks, I was able to complete the servers' powerdown through the management interface by using the asset tag FQDN. `wtp[1029-1033].eqiad.wmnet` n... 
[10:25:50] I don't see any maintenance related to that [10:26:28] (CR) Muehlenhoff: [C: +2] openldap: Include slapd-audit.log to backup [puppet] - https://gerrit.wikimedia.org/r/831490 (https://phabricator.wikimedia.org/T317516) (owner: Muehlenhoff) [10:26:32] ^ XioNoX: possibly a link eqiad-drmrs down? [10:26:55] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1099.eqiad.wmnet [10:26:56] (PS4) Jbond: puppetmaster: explicitely specifify hiera config [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah) [10:26:58] (PS1) Jbond: hiera: Add renamed labs hiera file so pcc works [puppet] - https://gerrit.wikimedia.org/r/831497 [10:27:00] (CR) Muehlenhoff: [C: +2] openldap: Include slapd-audit.log to backup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831490 (https://phabricator.wikimedia.org/T317516) (owner: Muehlenhoff) [10:27:10] (CR) Jbond: [V: +2 C: +2] hiera: Add renamed labs hiera file so pcc works [puppet] - https://gerrit.wikimedia.org/r/831497 (owner: Jbond) [10:27:21] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:26] jynus: maintenance on cr3-esams (cc topranks) [10:27:31] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:27:36] ah, ok, sorry [10:27:38] SRE, ops-eqiad, decommission-hardware, serviceops: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert) [10:28:06] XioNoX: thanks [10:28:39] jynus: sry for the noise please ignore, done in about 20 mins [10:28:51] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:28:54] no issue, you logged it, it is my fault [10:28:57] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:29:10] it is just it is hard for me to notice it with so many messages [10:29:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34484 and previous config saved to /var/cache/conftool/dbconfig/20220912-102928-root.json [10:30:11] (CR) Jbond: [C: -1] "lgtm but see comment about hiera key" [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah) [10:30:21] (CR) Jbond: [C: -1] puppetmaster: explicitely specifify hiera config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah) [10:33:04] (PS5) Majavah: puppetmaster: explicitely specifify hiera config [puppet] - https://gerrit.wikimedia.org/r/831230 [10:33:41] (CR) Majavah: puppetmaster: explicitely specifify hiera config (3 comments) [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah) [10:33:48] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1099.eqiad.wmnet [10:34:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1034', diff saved to https://phabricator.wikimedia.org/P34485 and previous config saved to /var/cache/conftool/dbconfig/20220912-103428-root.json [10:35:58] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37229/console" [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah) [10:36:27] (PS6) Jbond: puppetmaster: explicitely specifify hiera config [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: 
Majavah) [10:36:29] (PS1) Majavah: puppetmaster: drop support for locale_servers [puppet] - https://gerrit.wikimedia.org/r/831500 [10:38:10] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37230/console" [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah) [10:38:32] jbond: role::puppetmaster::standalone doesn't use profile::puppetmaster::common :/ [10:39:59] (CR) Jbond: [V: +1 C: +2] "lgtm thanks will merge" [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah) [10:40:52] (CR) Hnowlan: [C: +2] Fix online tests [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830903 (owner: Hnowlan) [10:40:58] (CR) Majavah: puppetmaster: explicitely specifify hiera config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah) [10:41:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34486 and previous config saved to /var/cache/conftool/dbconfig/20220912-104120-root.json [10:43:10] (PS1) Jbond: prepare: drop old hiera file location [software/puppet-compiler] - https://gerrit.wikimedia.org/r/831502 [10:43:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [10:44:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 50%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34487 and previous config saved to /var/cache/conftool/dbconfig/20220912-104432-root.json [10:44:55] (PS2) Jbond: puppetmaster: drop support for locale_servers [puppet] - https://gerrit.wikimedia.org/r/831500 (owner: Majavah) [10:45:04] (CR) Jbond: [C: 
+2] "lgtm will merge" [puppet] - https://gerrit.wikimedia.org/r/831500 (owner: Majavah) [10:47:13] (PS1) Majavah: O:puppetmaster::standalone: fix hiera_config [puppet] - https://gerrit.wikimedia.org/r/831503 [10:47:15] (CR) Hnowlan: "lgtm - we could also replace the math.floor calls with `//` if we wanted to but this is fine for now." [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) (owner: Vlad.shapik) [10:47:22] (CR) Hnowlan: [C: +1] Remove division operation hack related to Python2 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) (owner: Vlad.shapik) [10:47:51] (CR) CI reject: [V: -1] O:puppetmaster::standalone: fix hiera_config [puppet] - https://gerrit.wikimedia.org/r/831503 (owner: Majavah) [10:49:12] (PS1) Jbond: O:puppetmaster::standalone: add correct hiere config default [puppet] - https://gerrit.wikimedia.org/r/831504 [10:50:04] (CR) Jbond: [V: +1 C: +2] puppetmaster: explicitely specifify hiera config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah) [10:50:43] (PS1) Cathal Mooney: Repool esams after cr3-esams core router upgrade. [dns] - https://gerrit.wikimedia.org/r/831506 (https://phabricator.wikimedia.org/T295690) [10:50:44] taavi: can you give ^^ a check [10:50:49] https://gerrit.wikimedia.org/r/831230 [10:50:54] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1100.eqiad.wmnet [10:51:08] (CR) Majavah: [C: -1] "see also: https://gerrit.wikimedia.org/r/c/operations/puppet/+/831503/" [puppet] - https://gerrit.wikimedia.org/r/831504 (owner: Jbond) [10:51:50] (CR) Majavah: puppetmaster: explicitely specifify hiera config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah) [10:51:59] (CR) Ayounsi: [C: +1] Repool esams after cr3-esams core router upgrade. 
[dns] - https://gerrit.wikimedia.org/r/831506 (https://phabricator.wikimedia.org/T295690) (owner: Cathal Mooney) [10:53:36] (Merged) jenkins-bot: Fix online tests [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830903 (owner: Hnowlan) [10:54:33] (CR) Cathal Mooney: [C: +2] Repool esams after cr3-esams core router upgrade. [dns] - https://gerrit.wikimedia.org/r/831506 (https://phabricator.wikimedia.org/T295690) (owner: Cathal Mooney) [10:55:10] !log re-pooling esams after successful upgrade of core router cr3-esams T295690 [10:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:13] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [10:55:48] (CR) Jbond: [V: +2 C: +2] "I'll override the CI for now to get things working and send a follow up patch" [puppet] - https://gerrit.wikimedia.org/r/831503 (owner: Majavah) [10:56:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34488 and previous config saved to /var/cache/conftool/dbconfig/20220912-105625-root.json [10:58:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [10:58:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [10:58:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T314041)', diff saved to https://phabricator.wikimedia.org/P34489 and previous config saved to /var/cache/conftool/dbconfig/20220912-105841-ladsgroup.json [10:58:45] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:59:30] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1100.eqiad.wmnet [10:59:38] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'es1033 (re)pooling @ 75%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34490 and previous config saved to /var/cache/conftool/dbconfig/20220912-105937-root.json [10:59:52] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1101.eqiad.wmnet [11:02:23] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:55] !log updated bullseye install image for 11.5 release T317416 [11:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:58] T317416: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 [11:06:18] (PS1) Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common [puppet] - https://gerrit.wikimedia.org/r/831507 [11:06:33] (CR) Jbond: [V: +1 C: +2] puppetmaster: explicitely specifify hiera config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831230 (owner: Majavah) [11:08:21] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1101.eqiad.wmnet [11:09:02] !log btullis@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1142.eqiad.wmnet [11:09:31] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:19] jouncebot: next [11:10:19] In 1 hour(s) and 49 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220912T1300) [11:10:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1142.eqiad.wmnet [11:11:00] (PS2) Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common [puppet] - https://gerrit.wikimedia.org/r/831507 
[11:11:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34491 and previous config saved to /var/cache/conftool/dbconfig/20220912-111130-root.json [11:11:36] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-etcd1001.eqiad.wmnet [11:11:44] (PS1) Marostegui: db-production.php: Disable writes in es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/831511 (https://phabricator.wikimedia.org/T317522) [11:12:11] (CR) Ladsgroup: [C: +1] db-production.php: Disable writes in es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/831511 (https://phabricator.wikimedia.org/T317522) (owner: Marostegui) [11:12:21] !log btullis@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1143-1148].eqiad.wmnet [11:13:05] (CR) Marostegui: [C: +2] db-production.php: Disable writes in es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/831511 (https://phabricator.wikimedia.org/T317522) (owner: Marostegui) [11:13:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T317522 [11:13:16] T317522: Switchover es4 master (es1021 -> es1020) - https://phabricator.wikimedia.org/T317522 [11:13:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T317522 [11:13:48] (Merged) jenkins-bot: db-production.php: Disable writes in es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/831511 (https://phabricator.wikimedia.org/T317522) (owner: Marostegui) [11:13:58] (CR) Majavah: "seems to fail in pcc: https://puppet-compiler.wmflabs.org/pcc-worker1003/37233/" [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond) [11:14:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1020 with weight 0 T317522', diff saved to 
https://phabricator.wikimedia.org/P34492 and previous config saved to /var/cache/conftool/dbconfig/20220912-111424-root.json [11:14:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34493 and previous config saved to /var/cache/conftool/dbconfig/20220912-111442-root.json [11:15:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-etcd1001.eqiad.wmnet [11:16:22] (PS1) Marostegui: mariadb: Promote es1020 to es4 master [puppet] - https://gerrit.wikimedia.org/r/831513 (https://phabricator.wikimedia.org/T317522) [11:16:26] (CR) JMeybohm: [C: +2] prometheus: Keep envoy connection metrics [puppet] - https://gerrit.wikimedia.org/r/831492 (https://phabricator.wikimedia.org/T317430) (owner: JMeybohm) [11:16:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1143-1148].eqiad.wmnet [11:17:19] (CR) Marostegui: [C: +2] mariadb: Promote es1020 to es4 master [puppet] - https://gerrit.wikimedia.org/r/831513 (https://phabricator.wikimedia.org/T317522) (owner: Marostegui) [11:17:35] taavi: ok to merge O:puppetmaster::standalone: fix hiera_config (9b5ff0a721) ? [11:18:13] jayme: yes, thanks, cc jbond who merged the gerrit config [11:18:15] !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Disable writes on es4 T317522 (duration: 04m 10s) [11:18:18] T317522: Switchover es4 master (es1021 -> es1020) - https://phabricator.wikimedia.org/T317522 [11:18:22] jayme: are you still merging puppet? 
[11:18:35] was waiting on the ok - merged [11:18:39] ah ok [11:18:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:19:29] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:01] !log Starting es4 eqiad failover from es1021 to es1020 - T317522 [11:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1020 to es4 primary T317522', diff saved to https://phabricator.wikimedia.org/P34494 and previous config saved to /var/cache/conftool/dbconfig/20220912-112039-root.json [11:21:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:21:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:21:27] (PS1) Marostegui: Revert "db-production.php: Disable writes in es4" [mediawiki-config] - https://gerrit.wikimedia.org/r/831224 [11:22:15] (PS1) Marostegui: wmnet: Update es4-master CNAME [dns] - https://gerrit.wikimedia.org/r/831524 (https://phabricator.wikimedia.org/T317522) [11:23:12] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1146.eqiad.wmnet [11:23:15] (CR) Marostegui: [C: +2] wmnet: Update es4-master CNAME [dns] - https://gerrit.wikimedia.org/r/831524 (https://phabricator.wikimedia.org/T317522) (owner: Marostegui) [11:23:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1021 T317522', diff saved to https://phabricator.wikimedia.org/P34495 and previous config saved to /var/cache/conftool/dbconfig/20220912-112343-root.json [11:23:46] T317522: Switchover es4 master (es1021 -> es1020) - https://phabricator.wikimedia.org/T317522 [11:24:08] (CR) Marostegui: [C: +2] Revert "db-production.php: Disable writes in es4" [mediawiki-config] - 
https://gerrit.wikimedia.org/r/831224 (owner: Marostegui) [11:24:54] (Merged) jenkins-bot: Revert "db-production.php: Disable writes in es4" [mediawiki-config] - https://gerrit.wikimedia.org/r/831224 (owner: Marostegui) [11:25:14] (Abandoned) Hashar: Boilerplate for automatic MediaWiki deployment [puppet] - https://gerrit.wikimedia.org/r/807972 (https://phabricator.wikimedia.org/T310395) (owner: Hashar) [11:25:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:26:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34496 and previous config saved to /var/cache/conftool/dbconfig/20220912-112635-root.json [11:26:37] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:42] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided) [11:27:51] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s) [11:28:53] !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Enable writes on es4 T317522 (duration: 03m 36s) [11:28:55] T317522: Switchover es4 master (es1021 -> es1020) - https://phabricator.wikimedia.org/T317522 [11:30:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:31:13] (PS1) Vgutierrez: mtail::varnishsli: Consider req.body read|writer errors as good requests [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) [11:32:52] (PS3) Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common [puppet] - https://gerrit.wikimedia.org/r/831507 [11:33:03] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:15] (CR) Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common (2 comments) [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond) [11:33:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:33:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:34:18] (PS2) Vgutierrez: mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) [11:35:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:36:23] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:37] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:42] (PS3) Vgutierrez: mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) [11:36:53] (PS1) Marostegui: es1021: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/831529 (https://phabricator.wikimedia.org/T317522) [11:37:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34497 and previous config saved to /var/cache/conftool/dbconfig/20220912-113702-ladsgroup.json [11:37:05] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [11:37:46] (CR) Marostegui: [C: +2] es1021: Enable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/831529 (https://phabricator.wikimedia.org/T317522) (owner: 10Marostegui) [11:38:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34498 and previous config saved to /var/cache/conftool/dbconfig/20220912-113808-root.json [11:40:39] (03CR) 10CI reject: [V: 04-1] mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - 10https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: 10Vgutierrez) [11:41:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34499 and previous config saved to /var/cache/conftool/dbconfig/20220912-114140-root.json [11:42:14] (03CR) 10Vgutierrez: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: 10Vgutierrez) [11:43:33] (03CR) 10Jelto: [C: 03+1] "lgtm, see one nit about whitespaces" [deployment-charts] - 10https://gerrit.wikimedia.org/r/826268 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [11:46:29] (03CR) 10Majavah: "This is failing PCC profile::puppetmaster::common::base_config is still missing. 
Looks like the standalone role duplicates the entire logi" [puppet] - 10https://gerrit.wikimedia.org/r/831507 (owner: 10Jbond) [11:48:57] (03PS4) 10Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common [puppet] - 10https://gerrit.wikimedia.org/r/831507 [11:50:07] (03PS4) 10Vgutierrez: mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - 10https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) [11:52:02] (03PS5) 10Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common [puppet] - 10https://gerrit.wikimedia.org/r/831507 [11:52:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P34500 and previous config saved to /var/cache/conftool/dbconfig/20220912-115208-ladsgroup.json [11:53:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34501 and previous config saved to /var/cache/conftool/dbconfig/20220912-115313-root.json [11:54:06] (03CR) 10CI reject: [V: 04-1] mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - 10https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: 10Vgutierrez) [11:54:40] (03CR) 10Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/831507 (owner: 10Jbond) [11:56:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34502 and previous config saved to /var/cache/conftool/dbconfig/20220912-115645-root.json [11:59:14] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10TAndic) Thank you @jhathaway -- crossing fingers it 
works! [11:59:24] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37238/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831507 (owner: 10Jbond) [12:01:34] (03PS3) 10Majavah: puppetmaster: drop support for locale_servers [puppet] - 10https://gerrit.wikimedia.org/r/831500 [12:04:18] (03CR) 10Majavah: [V: 03+1 C: 03+1] "Seems to work fine on my tests." [puppet] - 10https://gerrit.wikimedia.org/r/831507 (owner: 10Jbond) [12:07:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P34503 and previous config saved to /var/cache/conftool/dbconfig/20220912-120715-ladsgroup.json [12:07:58] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1146.eqiad.wmnet [12:08:09] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1096.eqiad.wmnet [12:08:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34504 and previous config saved to /var/cache/conftool/dbconfig/20220912-120818-root.json [12:11:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34505 and previous config saved to /var/cache/conftool/dbconfig/20220912-121150-root.json [12:12:50] (03CR) 10Jelto: [C: 03+1] "lgtm. 
I've done a helm template before and after and differences are:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/826269 (owner: 10JMeybohm) [12:16:30] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1096.eqiad.wmnet [12:18:17] (03PS1) 10Btullis: Add the locations of the new hadoop nodes [puppet] - 10https://gerrit.wikimedia.org/r/831532 (https://phabricator.wikimedia.org/T275767) [12:18:49] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34506 and previous config saved to /var/cache/conftool/dbconfig/20220912-122221-ladsgroup.json [12:22:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [12:22:25] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [12:22:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [12:22:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34507 and previous config saved to /var/cache/conftool/dbconfig/20220912-122242-ladsgroup.json [12:23:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34508 and previous config saved to /var/cache/conftool/dbconfig/20220912-122323-root.json [12:25:41] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:27] !log 
btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1097.eqiad.wmnet [12:26:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34509 and previous config saved to /var/cache/conftool/dbconfig/20220912-122654-root.json [12:30:14] (03PS4) 10Hashar: jenkins: use upstream systemd definition [puppet] - 10https://gerrit.wikimedia.org/r/808900 (https://phabricator.wikimedia.org/T308637) [12:30:16] (03PS1) 10Hashar: systemd: allow changing override filename [puppet] - 10https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637) [12:33:11] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:53] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1097.eqiad.wmnet [12:36:28] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34510 and previous config saved to /var/cache/conftool/dbconfig/20220912-123828-root.json [12:40:43] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:41:43] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:49:01] (03CR) 10Hashar: "That is to be used by the child change https://gerrit.wikimedia.org/r/c/operations/puppet/+/808900/ . My aim is to replace our own systemd" [puppet] - 10https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar) [12:49:11] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:53:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34511 and previous config saved to /var/cache/conftool/dbconfig/20220912-125333-root.json [12:54:40] (03CR) 10Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/831507 (owner: 10Jbond) [12:54:45] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:51] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/831483 (owner: 10Volans) [12:57:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/831484 (owner: 10Volans) [12:58:20] (03CR) 10Jbond: [C: 03+2] prepare: drop old hiera file location [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/831502 (owner: 10Jbond) [12:58:59] (03Abandoned) 10Jbond: 
O:puppetmaster::standalone: add correct hiere config default [puppet] - https://gerrit.wikimedia.org/r/831504 (owner: Jbond) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220912T1300). [13:00:05] koi and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] (CR) Majavah: [V: +1 C: +1] O:puppetmaster::standalone: move to useing P:puppetmaster::common (2 comments) [puppet] - https://gerrit.wikimedia.org/r/831507 (owner: Jbond) [13:00:11] o/ [13:00:20] o/ [13:00:23] I can deploy! [13:02:45] (PS4) Lucas Werkmeister (WMDE): Revert "kowiki: Change logo for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831211 (https://phabricator.wikimedia.org/T315127) (owner: Stang) [13:02:56] (CR) Lucas Werkmeister (WMDE): [C: +2] Revert "kowiki: Change logo for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831211 (https://phabricator.wikimedia.org/T315127) (owner: Stang) [13:03:47] (Merged) jenkins-bot: Revert "kowiki: Change logo for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831211 (https://phabricator.wikimedia.org/T315127) (owner: Stang) [13:04:39] koi: I’ve pulled the first change to mwdebug1001, can you test it? [13:05:00] looking [13:05:14] (looks good on my end, I think) [13:05:37] yeah, also looks good from my side [13:05:40] ok!
[13:05:53] the files can probably be synced in any order [13:06:12] I think I’ll do yaml, logos.php, then IS.php [13:06:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:06:41] syncing [13:07:25] (03CR) 10Volans: [C: 03+2] doc: add TOX_SKIP_ENV example for development [software/spicerack] - 10https://gerrit.wikimedia.org/r/831483 (owner: 10Volans) [13:08:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34512 and previous config saved to /var/cache/conftool/dbconfig/20220912-130838-root.json [13:09:11] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (10bking) Update: we're [[ https://thanos.wikimedia.org/graph?g0.deduplicate=1&g0.expr=swift_account_stats_byt... [13:09:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:09:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:09:31] 10SRE-swift-storage: swift_ring_manager should be able to rebalance rings without making other changes - https://phabricator.wikimedia.org/T317409 (10MatthewVernon) 05Open→03Resolved Fix with https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/6 [13:09:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maint [13:09:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maint [13:10:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:10:31] !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:831211|Revert "kowiki: Change logo for 600k articles" (T315127)]] (1/3) 
(duration: 03m 53s) [13:10:34] T315127: Requesting temporary logo change for ko.wikipedia.org - https://phabricator.wikimedia.org/T315127 [13:12:49] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided) [13:12:59] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s) [13:13:59] (03Merged) 10jenkins-bot: doc: add TOX_SKIP_ENV example for development [software/spicerack] - 10https://gerrit.wikimedia.org/r/831483 (owner: 10Volans) [13:14:24] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:831211|Revert "kowiki: Change logo for 600k articles" (T315127)]] (2/3) (duration: 03m 39s) [13:14:55] are there any known issues with the mwdebug logstash dashboard? [13:15:10] it looks empty for me, and usually there’s at least a few messages there during a backport window, e.g. from scap pull IIRC [13:17:06] (03PS5) 10Lucas Werkmeister (WMDE): Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831212 (https://phabricator.wikimedia.org/T315127) (owner: 10Stang) [13:17:44] I’m also wondering whether to purge the kowiki-600k files from the HTTP cache after the deployment is done, or not [13:18:12] I feel like that would be a good idea – if anything still accesses those files, we want that to be a noticeable error now, not a total mystery a year later when the cache finally expires [13:18:20] koi: any thoughts on that? 
:) [13:18:31] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:831211|Revert "kowiki: Change logo for 600k articles" (T315127)]] (3/3) (duration: 03m 53s) [13:18:34] T315127: Requesting temporary logo change for ko.wikipedia.org - https://phabricator.wikimedia.org/T315127 [13:18:50] IIRC someone said purging files is only needed if you rename a file [13:18:51] (CR) Lucas Werkmeister (WMDE): [C: +2] Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831212 (https://phabricator.wikimedia.org/T315127) (owner: Stang) [13:19:34] (Merged) jenkins-bot: Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/831212 (https://phabricator.wikimedia.org/T315127) (owner: Stang) [13:19:44] but that idea makes sense at least [13:19:47] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:12] (PS3) Clément Goubert: sre: add paging alert for etcdmirror down [alerts] - https://gerrit.wikimedia.org/r/831103 (https://phabricator.wikimedia.org/T317402) (owner: Giuseppe Lavagetto) [13:20:29] koi: the second change is on mwdebug1001, anything to test?
[13:20:36] and yeah, I don’t think it’s exactly necessary, just an extra cleanup [13:20:58] looking [13:20:59] (03PS1) 10Jbond: C:jenkins: remove migrate file [puppet] - 10https://gerrit.wikimedia.org/r/831541 [13:21:21] https://en.wikipedia.org/static/images/project-logos/kowiki-600k-2x.png is a 404 on mwdebug1001, so that looks good [13:21:46] I got a "Page not found" notice for /static/images/project-logos/kowiki-600k.png , so LGTM [13:21:52] ok, syncing [13:23:07] (03PS3) 10Volans: Improve documentation of the CookbookBase classes usage [software/spicerack] - 10https://gerrit.wikimedia.org/r/830866 (owner: 10Giuseppe Lavagetto) [13:23:19] (03PS2) 10Volans: sre.hosts.decommission: test IPMI connection [cookbooks] - 10https://gerrit.wikimedia.org/r/831484 [13:23:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34513 and previous config saved to /var/cache/conftool/dbconfig/20220912-132343-root.json [13:24:57] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:25:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:26:23] !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/project-logos/: Config: [[gerrit:831212|Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" (T315127)]] (1/2; deleted files require syncing whole directory) (duration: 03m 50s) [13:26:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:26:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:26:26] T315127: Requesting temporary logo change for ko.wikipedia.org - https://phabricator.wikimedia.org/T315127 [13:26:41] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:12] logstash doesn’t seem to have any messages for host:mwdebug* [13:28:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T312863)', diff saved to https://phabricator.wikimedia.org/P34514 and previous config saved to /var/cache/conftool/dbconfig/20220912-132846-ladsgroup.json [13:28:50] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [13:28:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:29:25] (03PS1) 10Gergő Tisza: Add growthexperiments_user_impact to $private_tables [puppet] - 10https://gerrit.wikimedia.org/r/831542 (https://phabricator.wikimedia.org/T317534) [13:29:29] (03CR) 10Clément Goubert: [C: 03+2] sre: add paging alert for etcdmirror down [alerts] - 10https://gerrit.wikimedia.org/r/831103 (https://phabricator.wikimedia.org/T317402) (owner: 10Giuseppe Lavagetto) [13:30:33] !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/mobile/copyright/: Config: [[gerrit:831212|Revert "kowiki: Add logo (legacy vector and vector-2022) for 600k articles" (T315127)]] (2/2; deleted file requires syncing whole directory) (duration: 03m 44s) [13:31:36] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: test IPMI connection [cookbooks] - 10https://gerrit.wikimedia.org/r/831484 (owner: 10Volans) [13:31:43] (03CR) 10Volans: [C: 03+2] Improve documentation of the CookbookBase classes usage [software/spicerack] - 10https://gerrit.wikimedia.org/r/830866 (owner: 10Giuseppe Lavagetto) [13:32:03] (03Merged) 10jenkins-bot: sre: add paging alert for etcdmirror down [alerts] - 10https://gerrit.wikimedia.org/r/831103 (https://phabricator.wikimedia.org/T317402) (owner: 10Giuseppe Lavagetto) [13:33:19] !log lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/%s\n' 
{mobile/copyright/wikipedia-ko-600k.svg,project-logos/kowiki-600k{,-1.5x,-2x}.png} | mwscript purgeList.php # T315127 [13:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:22] T315127: Requesting temporary logo change for ko.wikipedia.org - https://phabricator.wikimedia.org/T315127 [13:34:10] alright, now I’ll test if T317520 would affect production as well if the train rolls forward [13:34:10] T317520: Score: Call to a member function getExpensiveParserFunctionLimit() on null - https://phabricator.wikimedia.org/T317520 [13:35:20] (03Abandoned) 10Jbond: C:jenkins: remove migrate file [puppet] - 10https://gerrit.wikimedia.org/r/831541 (owner: 10Jbond) [13:35:22] !log manually applying [[gerrit:830691]] on mwdebug1001 to test if T317520 affects production (expected to cause getExpensiveParserFunctionLimit-related logstash errors) [13:35:22] (03Merged) 10jenkins-bot: sre.hosts.decommission: test IPMI connection [cookbooks] - 10https://gerrit.wikimedia.org/r/831484 (owner: 10Volans) [13:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:56] (03CR) 10Jbond: [C: 04-1] systemd: allow changing override filename (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar) [13:36:08] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/808900 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar) [13:38:43] (03Merged) 10jenkins-bot: Improve documentation of the CookbookBase classes usage [software/spicerack] - 10https://gerrit.wikimedia.org/r/830866 (owner: 10Giuseppe Lavagetto) [13:38:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34515 and previous config saved to /var/cache/conftool/dbconfig/20220912-133848-root.json [13:39:04] yup, there’s an internal error [13:39:20] aha, and it’s in 
logstash as well [13:39:36] so host:mwdebug* messages still make it to logstash – I suppose scap pull just doesn’t produce any logs anymore? [13:39:47] but it’s the same error, so this is indeed a train blocker [13:40:16] !log scap pull on mwdebug1001 to restore good code (confirmed that T317520 affects production) [13:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:19] T317520: Score: Call to a member function getExpensiveParserFunctionLimit() on null - https://phabricator.wikimedia.org/T317520 [13:41:20] (CR) JMeybohm: [C: +1] sre: check apache_up too as part of AppserversUnreachable [alerts] - https://gerrit.wikimedia.org/r/830582 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi) [13:41:51] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:43:43] !log UTC afternoon backport+config window done [13:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P34516 and previous config saved to /var/cache/conftool/dbconfig/20220912-134353-ladsgroup.json [13:49:51] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided) [13:50:00] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s) [13:50:12] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney) Upgrade of cr3-esams went well earlier. Firmware upgrade works as per docs. I will put up more info on that later for our own reference.
[13:51:08] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney) [13:51:23] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney) [13:53:07] !log volans@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp[1028-1030] [13:57:18] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1098.eqiad.wmnet [13:58:51] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:58:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P34517 and previous config saved to /var/cache/conftool/dbconfig/20220912-135859-ladsgroup.json [14:01:14] !log volans@cumin1001 START - Cookbook sre.dns.netbox [14:01:52] (PS9) JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) [14:02:24] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:02:25] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wtp[1028-1030] [14:02:25] (PS5) Vgutierrez: mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) [14:02:30] SRE, ops-eqiad, decommission-hardware, serviceops: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (ops-monitoring-bot)
cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: `wtp[1028-1030]` - wtp1028 (**FAIL**) - //No DNS record found for th... [14:05:45] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1098.eqiad.wmnet [14:06:11] 10SRE-OnFire, 10serviceops, 10Wikimedia-Incident: Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535 (10Clement_Goubert) [14:06:29] 10SRE-OnFire, 10serviceops, 10Wikimedia-Incident: Update Etcd/Main cluster#Replication documentation with safe restart conditions and information - https://phabricator.wikimedia.org/T317537 (10Clement_Goubert) [14:07:31] (03PS6) 10Vgutierrez: mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - 10https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) [14:14:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T312863)', diff saved to https://phabricator.wikimedia.org/P34518 and previous config saved to /var/cache/conftool/dbconfig/20220912-141405-ladsgroup.json [14:14:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [14:14:09] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [14:14:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [14:14:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T312863)', diff saved to https://phabricator.wikimedia.org/P34519 and previous config saved to /var/cache/conftool/dbconfig/20220912-141427-ladsgroup.json [14:18:08] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: check apache_up too as part of AppserversUnreachable [alerts] - 10https://gerrit.wikimedia.org/r/830582 (https://phabricator.wikimedia.org/T314118) (owner: 
10Filippo Giunchedi) [14:18:13] (03PS2) 10Filippo Giunchedi: sre: check apache_up too as part of AppserversUnreachable [alerts] - 10https://gerrit.wikimedia.org/r/830582 (https://phabricator.wikimedia.org/T314118) [14:43:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T314041)', diff saved to https://phabricator.wikimedia.org/P34520 and previous config saved to /var/cache/conftool/dbconfig/20220912-144339-ladsgroup.json [14:43:43] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [14:43:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:46:20] should we worry about wikidata? [14:48:11] (03PS1) 10Elukey: Move kafka on kafka-logging2002 to a PKI-based TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/831588 (https://phabricator.wikimedia.org/T300130) [14:48:56] (03PS5) 10Vlad.shapik: Remove division operation hack related to Python2 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) [14:50:13] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:11] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P34521 and previous config saved to /var/cache/conftool/dbconfig/20220912-145845-ladsgroup.json [14:58:57] (03CR) 10Filippo Giunchedi: [C: 03+1] Move kafka on kafka-logging2002 to a PKI-based TLS cert [puppet] 
- https://gerrit.wikimedia.org/r/831588 (https://phabricator.wikimedia.org/T300130) (owner: Elukey) [15:02:25] SRE, Data-Persistence (Consultation), MediaWiki-extensions-Phonos, serviceops, Community-Tech (CommTech-Sprint-33): SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (JMcLeod_WMF) [15:04:16] SRE, ops-eqiad, Infrastructure-Foundations, netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (Jgreen) [15:13:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P34522 and previous config saved to /var/cache/conftool/dbconfig/20220912-151352-ladsgroup.json [15:15:49] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:16:07] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:17:56] !log dancy@deploy1002 Installing scap version "4.18.0" for 561 hosts [15:18:13] !log dancy@deploy1002 Installation of scap version "4.18.0" completed for 561 hosts [15:26:01] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293 (colewhite) @MoritzMuehlenhoff @jbond Facter does not appear to be detecting the raid on some hosts. Not sure how widespread the issue is. current fact (direct c... [15:26:38] SRE, LDAP-Access-Requests, WMF-NDA-Requests, Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (bking) @pfischer After I asked for your public key, it looks like someone updated the original request with the key. Thus...
[15:28:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T314041)', diff saved to https://phabricator.wikimedia.org/P34523 and previous config saved to /var/cache/conftool/dbconfig/20220912-152858-ladsgroup.json [15:29:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance [15:29:03] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [15:29:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance [15:29:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2118 (T314041)', diff saved to https://phabricator.wikimedia.org/P34524 and previous config saved to /var/cache/conftool/dbconfig/20220912-152920-ladsgroup.json [15:30:05] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220912T1530). 
[15:32:41] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:17] (03PS2) 10Cwhite: smart: restore get_fact and deprecate get_raid_drivers [puppet] - 10https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) [15:39:00] (03CR) 10Cwhite: smart: restore get_fact and deprecate get_raid_drivers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) (owner: 10Cwhite) [15:39:39] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:58] 10SRE, 10ops-codfw, 10Observability-Logging: Degraded RAID on logstash2027 - https://phabricator.wikimedia.org/T316996 (10Papaul) 05Open→03Resolved @colewhite disk replaced [15:43:46] (03PS10) 10JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) [15:44:15] (03CR) 10JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [15:46:44] 10SRE-OnFire, 10Discovery-Search, 10Sustainability (Incident Followup): Better test environments for Elastic - https://phabricator.wikimedia.org/T317420 (10Gehel) 05Open→03Invalid This is too broad as it is. We'll revisit this if we have a better defined need. 
[15:46:56] (03CR) 10Volans: [C: 03+1] "LGTM, one final nit you couldn't foresee inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [15:49:44] (03CR) 10Herron: [C: 03+1] Move kafka on kafka-logging2002 to a PKI-based TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/831588 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [15:51:59] (03CR) 10Muehlenhoff: "This is expected, see the sysusers.d manpage: https://manpages.debian.org/unstable/systemd/sysusers.d.5.en.html" [puppet] - 10https://gerrit.wikimedia.org/r/830284 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [15:54:01] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided) [15:54:10] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s) [15:55:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) (owner: 10Cwhite) [15:55:54] (03PS15) 10Ayounsi: sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 [16:00:29] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 (owner: 10Ayounsi) [16:02:33] (03PS11) 10JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) [16:02:49] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:11] (03CR) 10JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [16:05:00] 10SRE, 10Infrastructure-Foundations: Extend NEL headers to sites not fronted by CDN - 
https://phabricator.wikimedia.org/T303725 (10CDanis) As a note, such sites also include "everything on WMCS / toolserver" and it would probably be good to extend NEL to that as well. [16:09:47] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:41] PROBLEM - Check systemd state on logstash2027 is CRITICAL: CRITICAL - degraded: The following units failed: srv.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:44] (03PS6) 10BCornwall: varnish/tests: Remove extraneous test checks [puppet] - 10https://gerrit.wikimedia.org/r/826367 [16:12:03] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10Dzahn) It seems this comment was about T316090 [16:12:16] (03CR) 10BCornwall: varnish/tests: Remove extraneous test checks (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [16:13:18] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Discovery-Search (Current work), 10Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Dzahn) @pfischer Hi, please also see this comment over here: T316922#8229340 . If you could try to ssh into an... 
[16:13:59] (03PS1) 10Ebernhardson: Re-enable track_total_hits for elastic7 [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831548 (https://phabricator.wikimedia.org/T317374) [16:15:29] (03PS1) 10Ebernhardson: Set track_total_hits to true [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831549 [16:17:23] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:24:29] (03CR) 10RLazarus: "Hmm, this is a really interesting case! If I understand right, we're talking about situations where there was e.g. a network failure somew" [puppet] - 10https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: 10Vgutierrez) [16:28:18] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10Dzahn) Hi @pfischer You are in the requested wmf LDAP group and the WMF-NDA group in Phabricator meanwhile. If you could...
[16:32:06] 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Milimetric) [16:33:05] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:39:12] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [16:40:03] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:21] RECOVERY - Check systemd state on logstash2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:06] (03CR) 10Vgutierrez: varnish/tests: Remove extraneous test checks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [16:54:29] (03CR) 10Vgutierrez: "1" [puppet] - 10https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: 10Vgutierrez) [16:57:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34527 and previous config saved to /var/cache/conftool/dbconfig/20220912-165720-ladsgroup.json [16:57:24] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [17:00:04] ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220912T1700).
[17:03:14] (03PS6) 10Vlad.shapik: Remove division operation hack related to Python2 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) [17:03:21] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:14] (03CR) 10RLazarus: [C: 03+1] mtail::varnishsli: Consider req.body read|write errors as good requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: 10Vgutierrez) [17:07:37] (03CR) 10Vlad.shapik: Remove division operation hack related to Python2 (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) (owner: 10Vlad.shapik) [17:07:41] RECOVERY - OpenSearch health check for shards on 9200 on logstash2027 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, active_primary_shards: 562, active_shards: 1281, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:08:53] !log rebuilt raid on logstash2027 T316996 [17:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:56] T316996: Degraded RAID on logstash2027 - https://phabricator.wikimedia.org/T316996 [17:10:21] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1101:3317', diff saved to https://phabricator.wikimedia.org/P34528 and previous config saved to /var/cache/conftool/dbconfig/20220912-171227-ladsgroup.json [17:14:00] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10odimitrijevic) Approved [17:21:00] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided) [17:21:09] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s) [17:27:01] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (10Jgreen) [17:27:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P34529 and previous config saved to /var/cache/conftool/dbconfig/20220912-172733-ladsgroup.json [17:30:57] (03CR) 10Cwhite: [C: 03+1] Move kafka on kafka-logging2002 to a PKI-based TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/831588 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [17:37:46] (03PS6) 10Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261) [17:39:05] RECOVERY - MD RAID on logstash2027 is OK: OK: Active: 24, Working: 24, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:42:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34531 and previous config saved to /var/cache/conftool/dbconfig/20220912-174239-ladsgroup.json [17:42:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet 
with reason: Maintenance [17:42:43] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [17:42:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [17:43:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T314041)', diff saved to https://phabricator.wikimedia.org/P34532 and previous config saved to /var/cache/conftool/dbconfig/20220912-174301-ladsgroup.json [17:57:51] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.116`. Pre-deploy tests passing on canary `wdqs1003` [17:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:57] !log [WDQS Deploy] Tests passing following deploy of `wdqs1003` on canary `wdqs1003`; proceeding to rest of fleet [18:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:46] !log bking@deploy1002 Started deploy [wdqs/wdqs@e012d14]: 0.3.116 [18:05:37] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:08:23] !log bking@deploy1002 Finished deploy [wdqs/wdqs@e012d14]: 0.3.116 (duration: 05m 37s) [18:10:34] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) p:05Triage→03Medium a:03BCornwall [18:12:35] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:30] 10SRE, 10Traffic, 10Patch-For-Review: Varnish SLI is impacted by external components performance|behavior - https://phabricator.wikimedia.org/T317051 (10Vgutierrez) [18:13:45] !log dancy@deploy1002 Installing scap version "4.16.0" for 561 hosts [18:14:02] !log dancy@deploy1002 Installation of scap 
version "4.16.0" completed for 561 hosts [18:14:34] !log bking@deploy1002 Started deploy [wdqs/wdqs@e012d14]: 0.3.116 [18:14:53] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:17:35] (03CR) 10BCornwall: [C: 03+2] varnish/tests: Remove extraneous test checks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [18:19:23] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:06] !log bking@deploy1002 Finished deploy [wdqs/wdqs@e012d14]: 0.3.116 (duration: 07m 31s) [18:24:10] (03CR) 10Vgutierrez: [C: 03+1] varnish/tests: Remove extraneous test checks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [18:24:28] (03CR) 10Dduvall: "Just a friendly ping. Should I refactor `SETENV` to some alternative or is this good to go?" [puppet] - 10https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall) [18:26:23] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:24] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (10cmooney) @Jgreen I believe I've done what's required now (not all that familiar with this workflow however). Both ports that are labelled for frdata100... 
[18:37:35] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [18:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:42] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [18:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:50] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [18:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:48] (03CR) 10Dzahn: [C: 03+1] gitlab_runner: allow Trusted Runners to access wikimedia docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/831481 (https://phabricator.wikimedia.org/T308271) (owner: 10Jelto) [18:42:55] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (10Jgreen) @cmooney Both interfaces show no-carrier, can you confirm that the switch ports are enabled? 
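Note on the batching in the [WDQS Deploy] entries above: `-b 4` restarts `wdqs-updater` on a few hosts at a time, while the `-b 1` run wraps each restart in `depool && sleep 45 && ... && sleep 45 && pool` so that only one LVS-managed node is out of rotation at once. The following is a minimal Python sketch of that rolling pattern, for illustration only. It is not cumin's implementation, and it assumes simple fixed-size batches, which may differ from how cumin actually schedules hosts. The names `batches`, `rolling_run`, and the host names are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def batches(hosts: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts (assumed batching model)."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    group: List[str] = []
    for host in hosts:
        group.append(host)
        if len(group) == size:
            yield group
            group = []
    if group:
        # Last, possibly smaller, batch.
        yield group


def rolling_run(
    hosts: Iterable[str],
    size: int,
    steps: List[Callable[[str], None]],
) -> List[List[str]]:
    """Run every step, in order, on each host, one batch at a time.

    Returns the batches that completed. An exception raised by a step
    stops the rollout, mirroring how `&&` halts a shell command chain.
    """
    completed: List[List[str]] = []
    for group in batches(hosts, size):
        for host in group:
            for step in steps:
                step(host)
        completed.append(group)
    return completed


if __name__ == "__main__":
    # Hypothetical hosts; the steps only print, nothing is executed remotely.
    fleet = ["host-a", "host-b", "host-c"]
    plan = [
        lambda h: print(f"{h}: depool"),
        lambda h: print(f"{h}: systemctl restart wdqs-categories"),
        lambda h: print(f"{h}: pool"),
    ]
    rolling_run(fleet, 1, plan)
```

With a batch size of 1 and the depool/restart/pool steps, each host finishes its full cycle before the next one starts, which is the property the one-node-at-a-time restart relies on.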
[18:43:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T314041)', diff saved to https://phabricator.wikimedia.org/P34535 and previous config saved to /var/cache/conftool/dbconfig/20220912-184317-ladsgroup.json [18:43:21] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [18:43:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [18:48:10] !log bking@deploy1002 Started deploy [wdqs/wdqs@e012d14] (wcqs): Deploy 0.3.116 to WCQS [18:49:13] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:54:33] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:56:11] !log bking@deploy1002 Finished deploy [wdqs/wdqs@e012d14] (wcqs): Deploy 0.3.116 to WCQS (duration: 08m 01s) [18:56:11] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P34536 and previous config saved to /var/cache/conftool/dbconfig/20220912-185823-ladsgroup.json [18:58:33] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:24] !log [WCQS Deploy] Test query passed on commons-query.wikimedia.org; WCQS deploy complete [19:00:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:56] !log [WCQS] Depooled `wcqs100[1,2]` while they catch up on ~1.5 days worth of lag (https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wcqs&viewPanel=8&from=1662910789183&to=1663068616559) [19:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:30] !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good [19:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118 (T314041)', diff saved to https://phabricator.wikimedia.org/P34537 and previous config saved to /var/cache/conftool/dbconfig/20220912-191000-ladsgroup.json [19:10:03] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [19:10:22] (03PS3) 10DDesouza: Deploy Research Incentive Survey to idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466) [19:12:09] !log dancy@deploy1002 Installing scap version "4.18.0" for 561 hosts [19:12:27] !log dancy@deploy1002 Installation of scap version "4.18.0" completed for 561 hosts [19:13:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P34538 and previous config saved to /var/cache/conftool/dbconfig/20220912-191330-ladsgroup.json [19:14:39] (03PS4) 10Cwhite: logstash: add support for rsyslog-namespaced fields [puppet] - 10https://gerrit.wikimedia.org/r/824314 (https://phabricator.wikimedia.org/T315500) [19:15:33] jouncebot: now [19:15:33] No deployments scheduled for the next 0 hour(s) and 44 minute(s) [19:17:27] RECOVERY - Check whether ferm is active by checking the default input chain on labstore1006 is OK: OK ferm input default policy is 
set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:19:40] 10SRE, 10Data Engineering Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 01), 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10gmodena) Hi - what is the status of the linked CR? >>! In T303543#7768019, @gerritbot wrote: > Chang... [19:20:03] !log dancy@deploy1002 Installing scap version "4.19.0" for 561 hosts [19:20:20] !log dancy@deploy1002 Installation of scap version "4.19.0" completed for 561 hosts [19:24:24] (03CR) 10Cwhite: [C: 03+2] logstash: add support for rsyslog-namespaced fields [puppet] - 10https://gerrit.wikimedia.org/r/824314 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite) [19:25:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118', diff saved to https://phabricator.wikimedia.org/P34539 and previous config saved to /var/cache/conftool/dbconfig/20220912-192506-ladsgroup.json [19:26:17] !log bking@deploy1002 Started deploy [wdqs/wdqs@e012d14]: 0.3.116 [19:28:22] !log bking@deploy1002 Finished deploy [wdqs/wdqs@e012d14]: 0.3.116 (duration: 02m 04s) [19:28:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T314041)', diff saved to https://phabricator.wikimedia.org/P34540 and previous config saved to /var/cache/conftool/dbconfig/20220912-192837-ladsgroup.json [19:28:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [19:28:40] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [19:28:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [19:28:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T314041)', diff saved to https://phabricator.wikimedia.org/P34541 and previous config saved 
to /var/cache/conftool/dbconfig/20220912-192858-ladsgroup.json [19:31:10] Hey all - mstyles and I would like to try to deploy a couple of security patches right now, if there are no objections. [19:39:55] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118', diff saved to https://phabricator.wikimedia.org/P34542 and previous config saved to /var/cache/conftool/dbconfig/20220912-194013-ladsgroup.json [19:48:45] (03PS7) 10BCornwall: varnish/tests: Remove extraneous test checks [puppet] - 10https://gerrit.wikimedia.org/r/826367 [19:48:57] (03PS2) 10Jdlrobson: Enable Nearby on Hebrew and French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831117 (https://phabricator.wikimedia.org/T246493) [19:50:46] (03CR) 10BCornwall: "I've updated the patch set to include a little more formatting and an explicit change to bash since we're using bashisms now in the script" [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [19:53:24] !log Deployed security patch for T311337 [19:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118 (T314041)', diff saved to https://phabricator.wikimedia.org/P34543 and previous config saved to /var/cache/conftool/dbconfig/20220912-195519-ladsgroup.json [19:55:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [19:55:23] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [19:55:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [19:55:41] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Depooling db2120 (T314041)', diff saved to https://phabricator.wikimedia.org/P34544 and previous config saved to /var/cache/conftool/dbconfig/20220912-195540-ladsgroup.json [19:56:22] (03PS8) 10BCornwall: varnish/tests: Remove extraneous test checks [puppet] - 10https://gerrit.wikimedia.org/r/826367 [19:58:22] !log mstyles@deploy1002 Synchronized php-1.39.0-wmf.28/extensions/PageTriage/includes/Api/ApiPageTriageAction.php: (no justification provided) (duration: 03m 42s) [19:59:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.decommission for hosts theemin.codfw.wmnet [19:59:39] !log deployed security patch for T314245 [19:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] RoanKattouw, Urbanecm, cjming, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220912T2000). [20:00:05] ebernhardson, zabe, Aishik, danisztls, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] o/ [20:00:16] o/ [20:00:59] o/ [20:01:17] Evening all, I can deploy :) [20:01:34] (end security patch deployments - both of which seem to have gone out ok!) 
[20:02:12] \o [20:02:20] ah good hi ebernhardson, you're up first :) [20:03:09] Going to start with 831548 [20:03:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831548 (https://phabricator.wikimedia.org/T317374) (owner: 10Ebernhardson) [20:04:32] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [20:05:55] (03PS2) 10Samtar: Mark spcomwiki and searchcomwiki as closed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831167 (https://phabricator.wikimedia.org/T285685) (owner: 10Zabe) [20:06:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:06:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts theemin.codfw.wmnet [20:06:57] 10SRE, 10ops-codfw, 10DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by pt1979@cumin2002 for hosts: `theemin.codfw.wmnet` - theemin.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanage... 
[20:07:19] ebernhardson: zabe: I'm going to get 831167 deployed while ^ merges [20:07:22] !log samtar@deploy1002 backport aborted: (duration: 03m 46s) [20:07:23] (i'm lurking ) [20:07:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831167 (https://phabricator.wikimedia.org/T285685) (owner: 10Zabe) [20:08:38] (03Merged) 10jenkins-bot: Mark spcomwiki and searchcomwiki as closed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831167 (https://phabricator.wikimedia.org/T285685) (owner: 10Zabe) [20:08:54] !log samtar@deploy1002 Started scap: Backport for [[gerrit:831167|Mark spcomwiki and searchcomwiki as closed (T285685)]] [20:08:57] T285685: Mark searchcom and spcom wikis as closed on Special:SiteMatrix - https://phabricator.wikimedia.org/T285685 [20:09:13] 10SRE, 10ops-codfw, 10DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10Papaul) [20:09:16] !log samtar@deploy1002 samtar and zabe: Backport for [[gerrit:831167|Mark spcomwiki and searchcomwiki as closed (T285685)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:09:44] zabe: can you test on mwdebug1001? [20:09:57] lemme see [20:10:24] TheresNoTime, lgtm, listed as closed now [20:10:32] syncing :) [20:10:34] 10SRE, 10SRE-Access-Requests: Grant Access to wmf, analytics-privatedata-users for TTaylor - https://phabricator.wikimedia.org/T292299 (10BCornwall) 05Open→03Resolved It looks like this ticket has been resolved. I'm going to close it but please do re-open if there is any unfinished business. Thank you! 
[20:11:27] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [20:11:52] (03PS4) 10Samtar: Deploy Research Incentive Survey to idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466) (owner: 10DDesouza) [20:12:46] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (10BCornwall) 05Stalled→03Resolved I'm going to mark this as resolved since no verification has occurred. If there's any unfin... [20:12:55] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm for new host dispatch-be1001.eqiad.wmnet [20:12:56] !log herron@cumin1001 START - Cookbook sre.dns.netbox [20:12:57] Hi Aishik :) you're up next if you're available? [20:13:23] (03PS5) 10Samtar: Create six more namespaces (three content namespaces and their corresponding three discussion namespaces) on the bn.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830982 (https://phabricator.wikimedia.org/T317424) (owner: 10Aishik Rehman) [20:13:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:13:36] I am here! 
[20:13:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:14:34] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:831167|Mark spcomwiki and searchcomwiki as closed (T285685)]] (duration: 05m 40s) [20:14:37] T285685: Mark searchcom and spcom wikis as closed on Special:SiteMatrix - https://phabricator.wikimedia.org/T285685 [20:14:57] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:14:57] !log herron@cumin1001 START - Cookbook sre.dns.wipe-cache dispatch-be1001.eqiad.wmnet on all recursors [20:15:01] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dispatch-be1001.eqiad.wmnet on all recursors [20:15:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830982 (https://phabricator.wikimedia.org/T317424) (owner: 10Aishik Rehman) [20:16:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T312863)', diff saved to https://phabricator.wikimedia.org/P34545 and previous config saved to /var/cache/conftool/dbconfig/20220912-201604-ladsgroup.json [20:16:07] (03Merged) 10jenkins-bot: Create six more namespaces (three content namespaces and their corresponding three discussion namespaces) on the bn.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830982 (https://phabricator.wikimedia.org/T317424) (owner: 10Aishik Rehman) [20:16:07] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [20:16:21] !log samtar@deploy1002 Started scap: Backport for [[gerrit:830982|Create six more namespaces (three content namespaces and their corresponding three discussion namespaces) on the bn.wiktionary (T317424)]] [20:16:24] T317424: Create six more namespaces on the Bengali Wiktionary - https://phabricator.wikimedia.org/T317424 [20:16:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [20:16:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:16:34] Aishik: Can you test this on mwdebug1001? [20:16:41] !log samtar@deploy1002 samtar and aishik: Backport for [[gerrit:830982|Create six more namespaces (three content namespaces and their corresponding three discussion namespaces) on the bn.wiktionary (T317424)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:17:17] 10SRE, 10ops-codfw, 10DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10Papaul) The only thing left on this task is to unrack the server and remove all the disks. [20:17:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:18:29] Aishik, do you know what mwdebug1001 means? [20:18:41] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10BCornwall) @dcausse Are these action items filed into appropriate places such that this ticket, which seems "finished", can be closed? [20:19:44] * TheresNoTime should have asked, apologies :) https://wikitech.wikimedia.org/wiki/WikimediaDebug [20:20:28] Yeap! It's working [20:20:36] Great! Will sync :) [20:21:45] (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:21:53] Thanks! Do I need to do anything else?
[20:22:15] (03Merged) 10jenkins-bot: Re-enable track_total_hits for elastic7 [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831548 (https://phabricator.wikimedia.org/T317374) (owner: 10Ebernhardson) [20:22:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:22:52] Aishik: Test again on production proper in about ~4 minutes, I'll ping you :) [20:23:04] (er, more like 2 minutes) [20:23:18] ebernhardson: will loop back to 831548 next, are you available to test? I will note that's a lot of files to be backported [20:23:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:23:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:23:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [20:23:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [20:24:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T314041)', diff saved to https://phabricator.wikimedia.org/P34546 and previous config saved to /var/cache/conftool/dbconfig/20220912-202359-ladsgroup.json [20:24:03] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [20:24:14] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10LDAP-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) [20:24:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:24:36] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:830982|Create six more namespaces (three content namespaces and their corresponding three discussion namespaces) on the bn.wiktionary (T317424)]] (duration: 08m 14s) [20:24:39] T317424: Create six more namespaces on the 
Bengali Wiktionary - https://phabricator.wikimedia.org/T317424 [20:24:59] Aishik: sync'd fully :) just test if you don't mind, this time not using mwdebug [20:26:00] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10dcausse) 05Open→03Resolved a:03dcausse @BCornwall yes, this ticket can be closed, remaining work is tracked here: - complete the cleanup:... [20:26:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831548 (https://phabricator.wikimedia.org/T317374) (owner: 10Ebernhardson) [20:26:33] !log samtar@deploy1002 Started scap: Backport for [[gerrit:831548|Re-enable track_total_hits for elastic7 (T317374)]] [20:26:37] T317374: Search now caps total results count at 10k because of elasticsearch 7 upgrade - https://phabricator.wikimedia.org/T317374 [20:26:55] !log samtar@deploy1002 samtar and ebernhardson: Backport for [[gerrit:831548|Re-enable track_total_hits for elastic7 (T317374)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:27:01] 10SRE, 10Observability-Metrics, 10observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (10lmata) [20:27:17] 10SRE, 10Observability-Alerting, 10observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (10lmata) [20:27:46] ebernhardson: please test on mwdebug1001 [20:28:22] TheresNoTime: works as expected [20:28:27] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10LDAP-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Dzahn) > SSH: configured to access all our servers, including an-launcher1002 We can't be sure what the definition of "all our servers" is. In gener... 
[20:28:31] Syncing [20:28:43] It's totally ok! (this 🙂 emoji is my favourite too) [20:28:54] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (10cmooney) @Jgreen my bad yeah they were both still part of the disabled group. Both up/up now, hopefully looks better your side too. ` cmooney@fasw-c-eq... [20:28:57] P [20:29:15] ^^ [20:29:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:29:40] (03PS2) 10Cwhite: rsyslog: add rsyslog-namespaced fields to syslog_cee [puppet] - 10https://gerrit.wikimedia.org/r/824315 (https://phabricator.wikimedia.org/T315500) [20:30:41] 10SRE, 10Observability-Metrics, 10SRE Observability (FY2022/2023-Q1): librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (10lmata) [20:30:57] (03PS5) 10Samtar: Deploy Research Incentive Survey to idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466) (owner: 10DDesouza) [20:31:03] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:31:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P34547 and previous config saved to /var/cache/conftool/dbconfig/20220912-203110-ladsgroup.json [20:31:26] danisztls: going to do 830917 next, are you available to test? 
[20:32:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:32:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:32:19] TheresNoTime: yes [20:32:45] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:831548|Re-enable track_total_hits for elastic7 (T317374)]] (duration: 06m 12s) [20:32:48] T317374: Search now caps total results count at 10k because of elasticsearch 7 upgrade - https://phabricator.wikimedia.org/T317374 [20:32:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466) (owner: 10DDesouza) [20:33:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:33:45] (03Merged) 10jenkins-bot: Deploy Research Incentive Survey to idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466) (owner: 10DDesouza) [20:34:00] !log samtar@deploy1002 Started scap: Backport for [[gerrit:830917|Deploy Research Incentive Survey to idwiki (T316466)]] [20:34:03] T316466: Deploy Research Incentive Survey on Indonesian Wikipedia - https://phabricator.wikimedia.org/T316466 [20:34:19] !log samtar@deploy1002 samtar and dani: Backport for [[gerrit:830917|Deploy Research Incentive Survey to idwiki (T316466)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [20:34:20] (03CR) 10Samtar: [C: 03+2] "Deploy, set this merging as it takes a while.." 
[extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831549 (owner: 10Ebernhardson) [20:34:24] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q1): icinga raid monitoring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (10lmata) [20:34:42] danisztls: Live on mwdebug1001, please test :) [20:36:06] TheresNoTime: looks good [20:36:30] danisztls: okay, syncing [20:37:26] TheresNoTime: thanks [20:37:55] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler={proxy:unix:/run/php/fpm-www-7.2.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:37:59] (03PS3) 10Samtar: Enable Nearby on Hebrew and French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831117 (https://phabricator.wikimedia.org/T246493) (owner: 10Jdlrobson) [20:38:03] o/ [20:38:07] !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dispatch-be1001.eqiad.wmnet [20:38:15] Jdlrobson: will be doing 831117 next [20:38:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:39:18] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10LDAP-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) [20:39:35] !log testing exim config change on mx1001.wikimedia.org [20:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:58] (03CR) 10Dzahn: "I am wondering how many SCAP env variables there are. If it's just a few it seems nicer to list them explicitly and use "env_keep"." 
[puppet] - 10https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall) [20:40:26] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:830917|Deploy Research Incentive Survey to idwiki (T316466)]] (duration: 06m 25s) [20:40:28] T316466: Deploy Research Incentive Survey on Indonesian Wikipedia - https://phabricator.wikimedia.org/T316466 [20:40:46] danisztls: sync'd, could you give it another test to be sure? :) [20:40:47] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [20:41:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831117 (https://phabricator.wikimedia.org/T246493) (owner: 10Jdlrobson) [20:41:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:41:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:42:06] (03Merged) 10jenkins-bot: Enable Nearby on Hebrew and French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831117 (https://phabricator.wikimedia.org/T246493) (owner: 10Jdlrobson) [20:42:08] (03CR) 10Cwhite: [C: 03+2] rsyslog: add rsyslog-namespaced fields to syslog_cee [puppet] - 10https://gerrit.wikimedia.org/r/824315 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite) [20:42:10] (03CR) 10Dzahn: "Is it only "$SCAP_FINAL_PATH and $SCAP_REV_PATH" in scap3?" 
[puppet] - 10https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall) [20:42:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:42:19] !log samtar@deploy1002 Started scap: Backport for [[gerrit:831117|Enable Nearby on Hebrew and French Wikipedia (T246493)]] [20:42:21] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10LDAP-Access-Requests: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) Thanks for the clarification, @Dzahn! Unless there's dissent, I'll just add them to the analytics-admins group as was suggested. @Milimetri... [20:42:22] T246493: [EPIC] Deploy NearbyPages everywhere - https://phabricator.wikimedia.org/T246493 [20:42:38] !log samtar@deploy1002 samtar and jdlrobson: Backport for [[gerrit:831117|Enable Nearby on Hebrew and French Wikipedia (T246493)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:42:58] Jdlrobson: Live on mwdebug1001, could you test please? :) [20:43:04] looking [20:43:20] TheresNoTime: yes, not working now [20:43:41] danisztls: your patch is not working in production? [20:44:08] TheresNoTime: only on debug [20:44:15] not working on production [20:45:05] hm, okay, one moment [20:45:05] TheresNoTime: please sync! [20:45:31] danisztls: Going to sync Jdlrobson's patch and then come back to look at that.. [20:45:57] TheresNoTime: pc issue, working on another device, sorry [20:46:05] phew! [20:46:15] best kind of bug! 
:D [20:46:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P34548 and previous config saved to /var/cache/conftool/dbconfig/20220912-204617-ladsgroup.json [20:47:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:48:03] ebernhardson: once this patch is merged, I'll move onto 831549 - it's almost merged :) [20:48:09] kk [20:48:14] s/merged/sync'd [20:48:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:48:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:49:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:49:46] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:831117|Enable Nearby on Hebrew and French Wikipedia (T246493)]] (duration: 07m 27s) [20:49:50] T246493: [EPIC] Deploy NearbyPages everywhere - https://phabricator.wikimedia.org/T246493 [20:49:55] Jdlrobson: Sync'd ^ :) [20:50:47] (03PS1) 10BCornwall: admin: Add Hannah Okwelum to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/831622 (https://phabricator.wikimedia.org/T317545) [20:50:54] Thanks TheresNoTime [20:50:55] (03Merged) 10jenkins-bot: Set track_total_hits to true [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831549 (owner: 10Ebernhardson) [20:50:59] ill keep an eye on the logs [20:51:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831549 (owner: 10Ebernhardson) [20:51:27] !log samtar@deploy1002 Started scap: Backport for [[gerrit:831549|Set track_total_hits to true]] [20:51:42] 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q2): Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 
(10lmata) [20:51:46] !log samtar@deploy1002 samtar and ebernhardson: Backport for [[gerrit:831549|Set track_total_hits to true]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:51:52] TheresNoTime: this one isn't properly testable, none of the changes here are run in response to an http request. Should be fine to sync out [20:52:09] ebernhardson: ack, syncing :) [20:53:59] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) [20:54:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:55:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:55:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:56:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:56:27] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:831549|Set track_total_hits to true]] (duration: 05m 00s) [20:56:51] everything sync'd [20:57:26] !log closing UTC late backport window [20:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:46] TheresNoTime: thanks! [20:57:53] No worries! [20:58:50] 10SRE, 10Infrastructure-Foundations: Hide the client IP address in the SMTP Received header for authenticated relay clients - https://phabricator.wikimedia.org/T317574 (10jhathaway) [21:00:05] Reedy, sbassett, Maryum, and manfredi: I, the Bot under the Fountain, call upon thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220912T2100).
[21:01:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T312863)', diff saved to https://phabricator.wikimedia.org/P34549 and previous config saved to /var/cache/conftool/dbconfig/20220912-210123-ladsgroup.json [21:01:24] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:01:27] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [21:04:04] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (10Jgreen) >>! In T317539#8230385, @cmooney wrote: > @Jgreen my bad yeah they were both still part of the disabled group. > > Both up/up now, hopefully lo... [21:04:36] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Set frdata1001 switch ports to fundraising vlan - https://phabricator.wikimedia.org/T317539 (10Jgreen) 05Open→03Resolved a:03Jgreen [21:07:22] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided) [21:07:30] (03CR) 10Dzahn: "from a glance at hieradata this groups includes a LOT of things and the access request was for "all the things". that's all I know." 
[puppet] - 10https://gerrit.wikimedia.org/r/831622 (https://phabricator.wikimedia.org/T317545) (owner: 10BCornwall) [21:07:42] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 19s) [21:12:32] (03PS1) 10BCornwall: prometheus: Remove quotes from ATS config gauge [puppet] - 10https://gerrit.wikimedia.org/r/831624 (https://phabricator.wikimedia.org/T292815) [21:18:09] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) From the CR which is currently not approved: > from a glance at hieradata this groups includes a LOT of things and the access request was for "... [21:20:10] 10SRE, 10Infrastructure-Foundations: Hide the client IP address in the SMTP Received header for authenticated relay clients - https://phabricator.wikimedia.org/T317574 (10jhathaway) [21:21:18] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [21:23:10] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [21:23:14] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:24:12] (03PS1) 10JHathaway: mail::mx: Modify the Received header [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) [21:24:50] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold 
[1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [21:25:20] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway) [21:25:37] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway) [21:32:12] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:32:15] (03CR) 10Dduvall: phabricator: Allow deploy user to preserve environment when sudoing (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall) [21:35:23] (03PS1) 10Cwhite: logstash: heavily sample k8s proxy/httpd logs [puppet] - 10https://gerrit.wikimedia.org/r/831626 (https://phabricator.wikimedia.org/T313099) [21:36:51] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10colewhite) >>! In T300130#8228079, @elukey wrote: > @colewhite does it sound good? SGTM! Thanks! 
[21:51:07] (03CR) 10Dzahn: phabricator: Allow deploy user to preserve environment when sudoing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall) [21:51:42] (03CR) 10Dzahn: phabricator: Allow deploy user to preserve environment when sudoing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall) [21:54:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T314041)', diff saved to https://phabricator.wikimedia.org/P34550 and previous config saved to /var/cache/conftool/dbconfig/20220912-215407-ladsgroup.json [21:54:11] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [21:56:41] (03PS1) 10Dzahn: disable git-ssh.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/831627 (https://phabricator.wikimedia.org/T296022) [21:57:01] (03PS2) 10Dzahn: disable git-ssh.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/831627 (https://phabricator.wikimedia.org/T296022) [21:58:08] (03CR) 10Dzahn: "though.. if we do this we will get a lot of monitoring alerts... hrmmm. First removing it as a service from LVS/pybal is not as easy and c" [dns] - 10https://gerrit.wikimedia.org/r/831627 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [22:02:21] (03CR) 10Thcipriani: [C: 03+1] "🎉" [dns] - 10https://gerrit.wikimedia.org/r/831627 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [22:03:55] 10SRE, 10serviceops: mediawiki::api: net.ipv4.local_port_range sysctl config does not exist - https://phabricator.wikimedia.org/T317454 (10Dzahn) thanks @paladox confirmed. it's `ip_local_port_range` under `/ipv4/`. 
https://tldp.org/LDP/solrhe/Securing-Optimizing-Linux-RH-Edition-v1.3/chap6sec70.html [22:06:58] (03PS1) 10Dzahn: mediawiki::api: fix kernel parameter name ip_local_port_range [puppet] - 10https://gerrit.wikimedia.org/r/831629 (https://phabricator.wikimedia.org/T317454) [22:07:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:08:14] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:08:40] (03CR) 10Dzahn: "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/831629" [puppet] - 10https://gerrit.wikimedia.org/r/401714 (https://phabricator.wikimedia.org/T182568) (owner: 10Giuseppe Lavagetto) [22:09:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P34551 and previous config saved to /var/cache/conftool/dbconfig/20220912-220914-ladsgroup.json [22:11:27] (03CR) 10JHathaway: [C: 03+1] "Looks good to me!" 
[dns] - 10https://gerrit.wikimedia.org/r/831104 (https://phabricator.wikimedia.org/T211401) (owner: 10Jgreen) [22:12:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:13:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:13:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:14:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:20:18] !log phabricator - disabling repository "tool-ranker" [22:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:11] !log phabricator - disabling repositories: tool-xh-bot, tool-editor-contribution-dashboard, tool-ranker, tool-editor-contribution, tool-mikasa-bot-1, tool-maintun, tool-add-text, tool-wikibookassamese-book.php (none of them had commits) T296022 - T315706 [22:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:16] T315706: Migrate existing Striker created Diffusion repos to GitLab - https://phabricator.wikimedia.org/T315706 [22:23:17] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [22:24:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P34552 and previous config saved to /var/cache/conftool/dbconfig/20220912-222420-ladsgroup.json [22:27:07] (03PS3) 10Dduvall: scap: Allow deploy user to keep scap3 environment variables with sudo [puppet] - 10https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) [22:27:43] (03CR) 10CI reject: [V: 04-1] scap: Allow deploy user to keep scap3 environment variables with sudo [puppet] - 10https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall) [22:29:18] (03PS4) 10Dduvall: scap: Allow deploy user to keep scap3 environment 
variables with sudo [puppet] - 10https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) [22:30:59] (03CR) 10Dduvall: "Note this is now a change to `scap::target` and will affect all cases where `scap::target` is used with the `sudo_rules` parameter. Howeve" [puppet] - 10https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall) [22:39:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T314041)', diff saved to https://phabricator.wikimedia.org/P34553 and previous config saved to /var/cache/conftool/dbconfig/20220912-223927-ladsgroup.json [22:39:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [22:39:31] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [22:39:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [22:39:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:40:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:40:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T314041)', diff saved to https://phabricator.wikimedia.org/P34554 and previous config saved to /var/cache/conftool/dbconfig/20220912-224006-ladsgroup.json [22:43:44] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:43:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert -
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[22:53:29] !log phabricator - disabling MediaWiki extension repositories in Diffusion that have 0 commits - T296022 - T315706
[22:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:34] T315706: Migrate existing Striker created Diffusion repos to GitLab - https://phabricator.wikimedia.org/T315706
[22:53:34] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022
[23:05:49] (PS1) Dzahn: phabricator: Allow deploy user to keep scap3 environment variables with sudo [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259)
[23:06:25] (CR) CI reject: [V: -1] phabricator: Allow deploy user to keep scap3 environment variables with sudo [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:08:54] (PS2) Dzahn: phabricator: Allow deploy user to keep scap3 environment variables with sudo [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259)
[23:12:02] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/37240/phab1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:13:42] (CR) Dduvall: [C: +1] "Seems like a good initial approach! Thanks for doing the legwork, Daniel." [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:14:06] (CR) Dzahn: [C: +2] phabricator: Allow deploy user to keep scap3 environment variables with sudo [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:16:34] (CR) Dzahn: [C: +2] "nope, that would have been not complex enough yet:/" [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:18:20] (CR) Dzahn: [C: +2] "I am glad we did not do this in scap::target :) puppet is broken. disabled on phab1001" [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:18:55] (CR) Dzahn: [C: +2] "but I totally CAN manually run that command that failed in puppet" [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:19:30] (CR) Dzahn: [C: +2] "..because that file with the new rules does not exist anymore now." [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:24:15] (CR) Dzahn: [C: +2] ">>> /etc/sudoers.d/scap_sudo_rules_phab-deploy_phabricator_deployment: syntax error near line 3 <<<" [puppet] - https://gerrit.wikimedia.org/r/831634 (https://phabricator.wikimedia.org/T313259) (owner: Dzahn)
[23:30:11] (CR) Cwhite: [C: +1] "LGTM! Thanks!" [alerts] - https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) (owner: Filippo Giunchedi)
[23:31:00] (CR) Dzahn: "an issue here is that sudo::user always starts a line with the user name, so this ends up becoming:" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:31:53] (CR) Dzahn: "I'll try to come up with a fix for that tomorrow. Maybe we can just turn the entire sudo file into a template for this case." [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:32:56] (CR) Dzahn: "..or we can add a new class lets us add generic sudo config lines that don't need to start with the user name" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:33:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T314041)', diff saved to https://phabricator.wikimedia.org/P34555 and previous config saved to /var/cache/conftool/dbconfig/20220912-233327-ladsgroup.json
[23:33:31] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[23:33:37] (CR) Dzahn: "tested at https://gerrit.wikimedia.org/r/c/operations/puppet/+/831634 and WIP" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:34:03] (PS1) Dzahn: Revert "phabricator: Allow deploy user to keep scap3 environment variables with sudo" [puppet] - https://gerrit.wikimedia.org/r/831554
[23:36:01] (CR) Dzahn: "wait, "phab-deploy env_keep+=SCAP_* ALL=(root) NOPASSWD: /usr/local/sbin/phab_deploy_config_deploy" could also do it I guess" [puppet] - https://gerrit.wikimedia.org/r/831554 (owner: Dzahn)
[23:48:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P34556 and previous config saved to /var/cache/conftool/dbconfig/20220912-234833-ladsgroup.json
[23:50:49] (CR) Dduvall: "Darn! How about we add an additional parameter to sudo::user for defaults?" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:51:54] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:53:04] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:54:39] (CR) Dzahn: "Yea, either that or maybe we use the restricted_env_file or env_file. We could define all the SCAP env variables there and give them value" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:57:29] (CR) Dzahn: "fwiw, toolforge just does it like this, with a plain file dropped into sudoers.d:" [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)