[00:14:39] (PS2) Raymond Ndibe: prometheus: Add new scrape target [puppet] - https://gerrit.wikimedia.org/r/836310
[00:20:11] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-drop-eventlogging-legacy-raw-partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:07] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:31:55] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2037.codfw.wmnet with reason: host reimage
[00:46:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2037.codfw.wmnet with reason: host reimage
[01:01:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2037.codfw.wmnet with OS buster
[01:01:13] SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host logstash2037.codfw.wmnet with OS buster completed: - logstash2037 (**PASS**) -...
[01:04:53] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:12:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:17:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:33:59] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:35] SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (Papaul)
[01:49:08] SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (Papaul) Open→Resolved @herron all yours
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:45] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:43] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:35] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:29:43] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:50:47] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:07] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:03:39] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:11:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T314041)', diff saved to https://phabricator.wikimedia.org/P35104 and previous config saved to /var/cache/conftool/dbconfig/20220929-031127-ladsgroup.json
[03:11:32] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[03:20:57] PROBLEM - SSH on analytics1075.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:21:07] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[03:26:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P35105 and previous config saved to /var/cache/conftool/dbconfig/20220929-032634-ladsgroup.json
[03:30:31] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:40:22] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[03:40:32] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 10s)
[03:41:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P35106 and previous config saved to /var/cache/conftool/dbconfig/20220929-034140-ladsgroup.json
[03:46:59] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:33] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:56:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T314041)', diff saved to https://phabricator.wikimedia.org/P35107 and previous config saved to /var/cache/conftool/dbconfig/20220929-035647-ladsgroup.json
[03:56:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance
[03:56:52] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[03:57:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance
[03:57:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[03:57:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[03:57:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T314041)', diff saved to https://phabricator.wikimedia.org/P35108 and previous config saved to /var/cache/conftool/dbconfig/20220929-035724-ladsgroup.json
[04:00:57] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:04:54] (PS1) Andrew Bogott: nova-fullstack: specify per-region resolvers [puppet] - https://gerrit.wikimedia.org/r/836328
[04:05:43] (CR) CI reject: [V: -1] nova-fullstack: specify per-region resolvers [puppet] - https://gerrit.wikimedia.org/r/836328 (owner: Andrew Bogott)
[04:06:54] (PS2) Andrew Bogott: nova-fullstack: specify per-region resolvers [puppet] - https://gerrit.wikimedia.org/r/836328
[04:07:48] (CR) CI reject: [V: -1] nova-fullstack: specify per-region resolvers [puppet] - https://gerrit.wikimedia.org/r/836328 (owner: Andrew Bogott)
[04:09:44] (PS3) Andrew Bogott: nova-fullstack: specify per-region resolvers [puppet] - https://gerrit.wikimedia.org/r/836328
[04:12:48] (CR) Andrew Bogott: [C: +2] nova-fullstack: specify per-region resolvers [puppet] - https://gerrit.wikimedia.org/r/836328 (owner: Andrew Bogott)
[04:22:03] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:22:13] RECOVERY - SSH on analytics1075.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:29:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T314041)', diff saved to https://phabricator.wikimedia.org/P35109 and previous config saved to /var/cache/conftool/dbconfig/20220929-044224-ladsgroup.json
[04:42:29] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[04:48:05] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:51:27] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:57:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P35110 and previous config saved to /var/cache/conftool/dbconfig/20220929-045730-ladsgroup.json
[05:04:05] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui)
[05:06:18] (PS1) Marostegui: db2140: Remove old comment [puppet] - https://gerrit.wikimedia.org/r/836333
[05:07:55] (CR) Marostegui: [C: +2] db2140: Remove old comment [puppet] - https://gerrit.wikimedia.org/r/836333 (owner: Marostegui)
[05:10:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T318886
[05:10:58] T318886: Switchover s4 codfw master (db2110 -> db2140) - https://phabricator.wikimedia.org/T318886
[05:11:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2140 with weight 0 T318886', diff saved to https://phabricator.wikimedia.org/P35111 and previous config saved to /var/cache/conftool/dbconfig/20220929-051114-root.json
[05:11:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T318886
[05:12:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P35112 and previous config saved to /var/cache/conftool/dbconfig/20220929-051237-ladsgroup.json
[05:13:59] (PS1) Marostegui: mariadb: Promote db2140 to s4 codfw master [puppet] - https://gerrit.wikimedia.org/r/836449 (https://phabricator.wikimedia.org/T318886)
[05:14:59] (CR) Marostegui: [C: +2] mariadb: Promote db2140 to s4 codfw master [puppet] - https://gerrit.wikimedia.org/r/836449 (https://phabricator.wikimedia.org/T318886) (owner: Marostegui)
[05:17:13] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:27:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T314041)', diff saved to https://phabricator.wikimedia.org/P35113 and previous config saved to /var/cache/conftool/dbconfig/20220929-052743-ladsgroup.json
[05:27:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[05:27:48] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[05:27:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[05:28:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T314041)', diff saved to https://phabricator.wikimedia.org/P35114 and previous config saved to /var/cache/conftool/dbconfig/20220929-052805-ladsgroup.json
[05:32:18] !log Starting s4 codfw failover from db2110 to db2140 - T318886
[05:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:22] T318886: Switchover s4 codfw master (db2110 -> db2140) - https://phabricator.wikimedia.org/T318886
[05:33:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2140 to s4 primary and set section read-write T318886', diff saved to https://phabricator.wikimedia.org/P35115 and previous config saved to /var/cache/conftool/dbconfig/20220929-053302-root.json
[05:34:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2110 T318886', diff saved to https://phabricator.wikimedia.org/P35116 and previous config saved to /var/cache/conftool/dbconfig/20220929-053407-root.json
[05:38:55] PROBLEM - Check systemd state on an-worker1122 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:39:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2140 from API T318886', diff saved to https://phabricator.wikimedia.org/P35117 and previous config saved to /var/cache/conftool/dbconfig/20220929-053951-root.json
[05:39:56] T318886: Switchover s4 codfw master (db2110 -> db2140) - https://phabricator.wikimedia.org/T318886
[05:42:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35118 and previous config saved to /var/cache/conftool/dbconfig/20220929-054211-root.json
[05:43:09] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:44:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 30 hosts with reason: Primary switchover s7 T318888
[05:44:52] T318888: Switchover s7 codfw master (db2110 -> db2118) - https://phabricator.wikimedia.org/T318888
[05:45:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 30 hosts with reason: Primary switchover s7 T318888
[05:45:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2118 with weight 0 T318888', diff saved to https://phabricator.wikimedia.org/P35119 and previous config saved to /var/cache/conftool/dbconfig/20220929-054509-root.json
[05:45:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2118 from API T318888', diff saved to https://phabricator.wikimedia.org/P35120 and previous config saved to /var/cache/conftool/dbconfig/20220929-054542-root.json
[05:47:34] (PS1) Marostegui: mariadb: Promote db2118 to s7 codfw master [puppet] - https://gerrit.wikimedia.org/r/836602 (https://phabricator.wikimedia.org/T318888)
[05:50:37] (CR) Marostegui: [C: +2] mariadb: Promote db2118 to s7 codfw master [puppet] - https://gerrit.wikimedia.org/r/836602 (https://phabricator.wikimedia.org/T318888) (owner: Marostegui)
[05:57:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35121 and previous config saved to /var/cache/conftool/dbconfig/20220929-055716-root.json
[06:00:04] kormat, marostegui, and Amir1: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220929T0600). nyaa~
[06:02:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:03:31] !log Starting s7 codfw failover from db2121 to db2118 - T318888
[06:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:36] T318888: Switchover s7 codfw master (db2121 -> db2118) - https://phabricator.wikimedia.org/T318888
[06:04:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2118 to s7 primary and set section read-write T318888', diff saved to https://phabricator.wikimedia.org/P35122 and previous config saved to /var/cache/conftool/dbconfig/20220929-060425-root.json
[06:05:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2121 T318888', diff saved to https://phabricator.wikimedia.org/P35123 and previous config saved to /var/cache/conftool/dbconfig/20220929-060532-root.json
[06:06:24] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:06:42] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:07:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:12:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35124 and previous config saved to /var/cache/conftool/dbconfig/20220929-061221-root.json
[06:16:59] (PS1) Giuseppe Lavagetto: mwdebug: use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/836605
[06:17:01] (PS1) Giuseppe Lavagetto: mwdebug: remove php 7.2 [puppet] - https://gerrit.wikimedia.org/r/836606
[06:18:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35125 and previous config saved to /var/cache/conftool/dbconfig/20220929-061805-root.json
[06:24:04] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:27:19] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[06:27:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35126 and previous config saved to /var/cache/conftool/dbconfig/20220929-062726-root.json
[06:27:44] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[06:30:24] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:48] RECOVERY - Check systemd state on an-worker1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35127 and previous config saved to /var/cache/conftool/dbconfig/20220929-063310-root.json
[06:34:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
[06:34:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
[06:35:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1177', diff saved to https://phabricator.wikimedia.org/P35128 and previous config saved to /var/cache/conftool/dbconfig/20220929-063508-root.json
[06:42:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35129 and previous config saved to /var/cache/conftool/dbconfig/20220929-064222-root.json
[06:42:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35130 and previous config saved to /var/cache/conftool/dbconfig/20220929-064231-root.json
[06:48:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35131 and previous config saved to /var/cache/conftool/dbconfig/20220929-064815-root.json
[06:51:26] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:24] (PS1) Muehlenhoff: Extend access for Tumult Labs contractors [puppet] - https://gerrit.wikimedia.org/r/836690
[06:55:01] (CR) Muehlenhoff: [C: +2] Extend access for Tumult Labs contractors [puppet] - https://gerrit.wikimedia.org/r/836690 (owner: Muehlenhoff)
[06:55:42] (PS1) Marostegui: mariadb: Promote db2165 to s8 codfw master [puppet] - https://gerrit.wikimedia.org/r/836691 (https://phabricator.wikimedia.org/T318892)
[06:57:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35132 and previous config saved to /var/cache/conftool/dbconfig/20220929-065727-root.json
[06:57:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35133 and previous config saved to /var/cache/conftool/dbconfig/20220929-065736-root.json
[06:57:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:00:05] Amir1, apergos, and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220929T0700).
[07:00:23] morning! there are no trainees signed up today and no patches scheduled for deployment in the window.
[07:00:46] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:03:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35134 and previous config saved to /var/cache/conftool/dbconfig/20220929-070320-root.json
[07:03:42] (PS1) Elukey: istio: disable zipkin and tracing for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/836692 (https://phabricator.wikimedia.org/T318814)
[07:07:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:10:37] (CR) Elukey: [C: +2] istio: disable zipkin and tracing for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/836692 (https://phabricator.wikimedia.org/T318814) (owner: Elukey)
[07:12:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35135 and previous config saved to /var/cache/conftool/dbconfig/20220929-071232-root.json
[07:12:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35136 and previous config saved to /var/cache/conftool/dbconfig/20220929-071240-root.json
[07:13:58] (CR) Muehlenhoff: [C: +2] aptrepo: add docker packages to thirdparty/ci for bullseye [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall)
[07:18:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35137 and previous config saved to /var/cache/conftool/dbconfig/20220929-071825-root.json
[07:21:52] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:37] (CR) Giuseppe Lavagetto: [C: +2] mwdebug: use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/836605 (owner: Giuseppe Lavagetto)
[07:25:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:25:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:25:59] (PS1) Giuseppe Lavagetto: Revert "mwdebug: use php 7.4 by default" [puppet] - https://gerrit.wikimedia.org/r/836204
[07:26:07] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Revert "mwdebug: use php 7.4 by default" [puppet] - https://gerrit.wikimedia.org/r/836204 (owner: Giuseppe Lavagetto)
[07:27:32] (CR) Muehlenhoff: [C: +2] turnilo: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/831111 (owner: Muehlenhoff)
[07:27:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35138 and previous config saved to /var/cache/conftool/dbconfig/20220929-072737-root.json
[07:27:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35139 and previous config saved to /var/cache/conftool/dbconfig/20220929-072745-root.json
[07:31:14] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:32:32] (PS2) Giuseppe Lavagetto: mwdebug: remove php 7.2 [puppet] - https://gerrit.wikimedia.org/r/836606 (https://phabricator.wikimedia.org/T318894)
[07:32:34] (PS1) Giuseppe Lavagetto: mwdebug: use php7.4 by default [puppet] - https://gerrit.wikimedia.org/r/836693 (https://phabricator.wikimedia.org/T318894)
[07:32:52] (CR) Giuseppe Lavagetto: [C: +2] mwdebug: use php7.4 by default [puppet] - https://gerrit.wikimedia.org/r/836693 (https://phabricator.wikimedia.org/T318894) (owner: Giuseppe Lavagetto)
[07:33:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35140 and previous config saved to /var/cache/conftool/dbconfig/20220929-073330-root.json
[07:34:19] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 35280
[07:35:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:36:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 35280
[07:38:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 38040
[07:38:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 38040
[07:39:57] (PS3) Giuseppe Lavagetto: mwdebug: remove php 7.2 [puppet] - https://gerrit.wikimedia.org/r/836606 (https://phabricator.wikimedia.org/T318894)
[07:40:56] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 18106
[07:42:06] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 18106
[07:42:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35141 and previous config saved to /var/cache/conftool/dbconfig/20220929-074242-root.json
[07:45:49] !log installing expat security updates
[07:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35142 and previous config saved to /var/cache/conftool/dbconfig/20220929-074835-root.json
[07:49:10] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cr1-eqiad,cr1-eqiad IPv6,re0.cr1-eqiad.mgmt with reason: router upgrade
[07:49:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cr1-eqiad,cr1-eqiad IPv6,re0.cr1-eqiad.mgmt with reason: router upgrade
[07:49:33] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1ea26f52-695b-41ae-a3b4-28808d44161a) set by ayounsi@cumin1001 for 4:00:00 on 3 host(s) and th...
[07:52:20] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:52:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: remove php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/836606 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto) [07:57:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35143 and previous config saved to /var/cache/conftool/dbconfig/20220929-075747-root.json [07:57:58] !log drain traffic away from cr1-eqiad - T295690 [07:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:02] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [08:00:05] brennen and jnuche: (Dis)respected human, time to deploy MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220929T0800). Please do the needful. 
[08:03:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35144 and previous config saved to /var/cache/conftool/dbconfig/20220929-080340-root.json [08:07:23] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for FPM/LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/836697 (https://phabricator.wikimedia.org/T135991) [08:12:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35145 and previous config saved to /var/cache/conftool/dbconfig/20220929-081252-root.json [08:13:04] (03PS1) 10Elukey: admin_ng: raise resource quotas for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/836698 (https://phabricator.wikimedia.org/T318814) [08:13:08] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn, AS14907/IPv6: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:13:56] that's expected ^ [08:14:04] there should be a pybal one too soon [08:14:45] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable base::service_auto_restart for FPM/LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/836697 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:15:51] !log first cr1-eqiad RE switchover (for NVM firmware) - T295690 [08:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:55] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [08:16:41] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: 10Bking) [08:16:44] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:17:57] XioNoX: is that related to the maintenance? ^^^ [08:18:10] arturo: yep [08:18:14] ok, thanks [08:18:36] there should be no impact, let me know if any issues [08:18:54] basically disabling BGP on the router side to fail traffic over the other link [08:19:01] (03CR) 10Elukey: [C: 03+2] admin_ng: raise resource quotas for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/836698 (https://phabricator.wikimedia.org/T318814) (owner: 10Elukey) [08:19:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:19:54] 👍 [08:20:18] PROBLEM - Router interfaces on pfw3-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 57, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:20:24] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:20:34] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:21:44] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:21:49] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[08:22:00] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [08:22:12] RECOVERY - Router interfaces on pfw3-eqiad is OK: OK: host 208.80.154.219, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:22:16] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:22:28] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:24:52] (03PS1) 10Muehlenhoff: keyholder: Switch shebang to /usr/bin/python3 [puppet] - 10https://gerrit.wikimedia.org/r/836700 [08:25:02] (03PS2) 10Muehlenhoff: keyholder: Switch shebang to /usr/bin/python3 [puppet] - 10https://gerrit.wikimedia.org/r/836700 [08:26:08] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:26:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [08:26:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:26:31] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:27:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35146 and previous config saved to /var/cache/conftool/dbconfig/20220929-082757-root.json [08:31:56] Hello, https://phabricator.wikimedia.org/T318904 most likely a regression? [08:32:51] (03PS1) 10Jcrespo: dbbackups: Test mariadb 10.6 on a (currently passive) backup source [puppet] - 10https://gerrit.wikimedia.org/r/836701 (https://phabricator.wikimedia.org/T318062) [08:33:38] (03CR) 10Jcrespo: "Let me know if it is too soon- this is just a test and I don't intend to touch production ones yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/836701 (https://phabricator.wikimedia.org/T318062) (owner: 10Jcrespo) [08:37:30] (03CR) 10Marostegui: [C: 03+1] dbbackups: Test mariadb 10.6 on a (currently passive) backup source [puppet] - 10https://gerrit.wikimedia.org/r/836701 (https://phabricator.wikimedia.org/T318062) (owner: 10Jcrespo) [08:38:46] (03CR) 10Jbond: [C: 03+1] keyholder: Switch shebang to /usr/bin/python3 [puppet] - 10https://gerrit.wikimedia.org/r/836700 (owner: 10Muehlenhoff) [08:40:51] (03PS7) 10Jbond: P:lvs::configueration: move classification to hiera and add error checks [puppet] - 10https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) [08:43:35] !log second cr1-eqiad RE switchover - T295690 [08:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:40] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [08:44:11] (03PS1) 10Hashar: Review access change [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/836711 [08:44:27] I think I should be more verbose: https://phabricator.wikimedia.org/T318904 filed as a train blocker because it stops Stewards from doing our essential job (ie. granting OS to hide sensitive PII leak) but not setting priority (for you to deal with, I think) [08:44:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 4 DIFF 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37386/console" [puppet] - 10https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) (owner: 10Jbond) [08:45:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudnet: codfw1dev: switch to a single-NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/836240 (https://phabricator.wikimedia.org/T318824) (owner: 10Arturo Borrero Gonzalez) [08:46:36] (03CR) 10Hashar: "operations/puppet is the parent repository of most if not all SRE maintained repositories. 
It lacked the rights for SRE to push annotated" [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/836711 (owner: 10Hashar) [08:47:19] PROBLEM - BGP status on pfw3-eqiad is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:47:29] PROBLEM - Router interfaces on pfw3-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 57, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:47:35] PROBLEM - BFD status on cr1-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:47:35] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:47:37] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:48:03] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:48:03] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:48:04] (03CR) 10Clément Goubert: "We saw that there was no way to push annotated or signed tags to our repos, can you check this out?" 
[puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/836711 (owner: 10Hashar) [08:48:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:48:12] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:lvs::configueration: move classification to hiera and add error checks [puppet] - 10https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) (owner: 10Jbond) [08:48:25] PROBLEM - OSPF status on mr1-eqiad is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:48:41] expected, waiting for the linecards to upgrade and boot up [08:49:50] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:49:50] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:49:51] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:49:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:50:06] (03CR) 10DCausse: [C: 03+1] wmgCirrusSearchShardCount: Override prod settings for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836301 (https://phabricator.wikimedia.org/T316711) (owner: 10Ebernhardson) [08:50:14] RECOVERY - OSPF status on mr1-eqiad is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:50:49] RECOVERY - BGP status on pfw3-eqiad is OK: BGP 
OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:50:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:50:59] RECOVERY - Router interfaces on pfw3-eqiad is OK: OK: host 208.80.154.219, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:51:09] (03PS1) 10Filippo Giunchedi: netmon: add blackbox-exporter for mgmt probes [puppet] - 10https://gerrit.wikimedia.org/r/836704 (https://phabricator.wikimedia.org/T169860) [08:51:09] RECOVERY - BFD status on cr1-drmrs is OK: OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:51:11] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:52:46] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2098.codfw.wmnet with OS bullseye [08:52:55] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence, and 3 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin2002 for host db2098.codfw.wmnet with OS bullseye [08:54:11] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Test mariadb 10.6 on a (currently passive) backup source [puppet] - 10https://gerrit.wikimedia.org/r/836701 (https://phabricator.wikimedia.org/T318062) (owner: 10Jcrespo) [08:55:43] (03CR) 10Jbond: O:wikidough: drop wikidough abuse nets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822135 (https://phabricator.wikimedia.org/T313845) (owner: 10Jbond) [08:55:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:57:31] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10fgiunchedi) [08:58:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, one nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/836704 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [08:58:58] (03CR) 10Jbond: [C: 03+2] "thanks merging" [puppet] - 10https://gerrit.wikimedia.org/r/826390 (owner: 10Ryan Kemper) [09:00:11] (03PS2) 10Filippo Giunchedi: netmon: add blackbox-exporter for mgmt probes [puppet] - 10https://gerrit.wikimedia.org/r/836704 (https://phabricator.wikimedia.org/T169860) [09:00:13] (03CR) 10Filippo Giunchedi: "Thank you for the quick review!" [puppet] - 10https://gerrit.wikimedia.org/r/836704 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [09:00:41] (03CR) 10Jbond: [C: 03+1] Remove duplicate YAML hash from releases hieradata [puppet] - 10https://gerrit.wikimedia.org/r/830569 (owner: 10Btullis) [09:01:05] (03CR) 10Jbond: [C: 03+1] Add Junos image validation [cookbooks] - 10https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) (owner: 10Ayounsi) [09:03:10] <_joe_> s[_]: thanks for noticing, we're discussing the best course of action now [09:03:40] ACK [09:04:06] (03CR) 10Filippo Giunchedi: [C: 03+2] netmon: add blackbox-exporter for mgmt probes [puppet] - 10https://gerrit.wikimedia.org/r/836704 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [09:04:12] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2098.codfw.wmnet with reason: host reimage [09:07:06] (03CR) 10Jbond: "thanks but see inline" [puppet] - 10https://gerrit.wikimedia.org/r/834340 (https://phabricator.wikimedia.org/T318345) (owner: 10Bking) [09:07:28] (03PS4) 10Jbond: puppetmaster: drop support for locale_servers [puppet] - 10https://gerrit.wikimedia.org/r/831500 (owner: 10Majavah) [09:07:43] 
!log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2098.codfw.wmnet with reason: host reimage [09:07:46] (03CR) 10Jbond: "apparently i forgot to merge this, doing so now" [puppet] - 10https://gerrit.wikimedia.org/r/831500 (owner: 10Majavah) [09:07:48] (03PS10) 10Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) [09:07:50] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetmaster: drop support for locale_servers [puppet] - 10https://gerrit.wikimedia.org/r/831500 (owner: 10Majavah) [09:08:09] (03PS1) 10Ayounsi: Fully remove VRRP auth [homer/public] - 10https://gerrit.wikimedia.org/r/836727 (https://phabricator.wikimedia.org/T295690) [09:08:59] (03CR) 10Jbond: [C: 03+1] gitlab_runner: enable unprivileged_userns_clone in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/835162 (https://phabricator.wikimedia.org/T307810) (owner: 10Jelto) [09:09:15] (03CR) 10Cathal Mooney: [C: 03+1] "lgtm!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/836727 (https://phabricator.wikimedia.org/T295690) (owner: 10Ayounsi) [09:09:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/826559 (https://phabricator.wikimedia.org/T311218) (owner: 10Ayounsi) [09:10:20] (03Abandoned) 10Jbond: beaker: add a method to hack fixes specific to beaker [puppet] - 10https://gerrit.wikimedia.org/r/814866 (owner: 10Jbond) [09:10:45] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:16] (03CR) 10Ayounsi: [C: 03+2] Fully remove VRRP auth [homer/public] - 10https://gerrit.wikimedia.org/r/836727 (https://phabricator.wikimedia.org/T295690) (owner: 10Ayounsi) [09:11:17] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.40.0-wmf.3" [09:12:04] (03Merged) 10jenkins-bot: Fully remove VRRP auth [homer/public] - 10https://gerrit.wikimedia.org/r/836727 (https://phabricator.wikimedia.org/T295690) (owner: 10Ayounsi) [09:13:16] (03PS1) 10Jaime Nuche: Revert "group1 wikis to 1.40.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836728 (https://phabricator.wikimedia.org/T318904) [09:13:18] (03CR) 10Jaime Nuche: [C: 03+2] Revert "group1 wikis to 1.40.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836728 (https://phabricator.wikimedia.org/T318904) (owner: 10Jaime Nuche) [09:14:10] (03PS1) 10Giuseppe Lavagetto: Adapt sre.switchdc.mediawiki to active-active mediawiki [cookbooks] - 10https://gerrit.wikimedia.org/r/836729 [09:14:27] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.40.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836728 (https://phabricator.wikimedia.org/T318904) (owner: 10Jaime Nuche) [09:14:36] (03PS1) 10Clément Goubert: Release 3.0.3 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836730 [09:16:26] !log repool 
cr1-eqiad - T295690 [09:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:30] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [09:17:46] (03CR) 10CI reject: [V: 04-1] Adapt sre.switchdc.mediawiki to active-active mediawiki [cookbooks] - 10https://gerrit.wikimedia.org/r/836729 (owner: 10Giuseppe Lavagetto) [09:17:52] _joe_ and jnuche thanks for the fix :) [09:18:26] (03PS5) 10Jbond: P:etcd::tlsproxy: move to cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) [09:18:28] XioNoX: are you done with the maintenance? [09:18:36] (03CR) 10Muehlenhoff: [C: 04-2] "Please don't merge until we've completed the removal of stretch hosts from our infrastructure! This deprecation has been on our radar since" [puppet] - 10https://gerrit.wikimedia.org/r/834340 (https://phabricator.wikimedia.org/T318345) (owner: 10Bking) [09:18:46] marostegui: only half way, done with 1 router [09:19:00] ah excellent - no rush [09:19:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37389/console" [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [09:19:57] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:20:10] (03CR) 10Jbond: "i have re-based this, however ben has also done some work around this so not sure if this CR is still valid. 
@ben can you comment" [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [09:20:19] (03PS2) 10Giuseppe Lavagetto: Adapt sre.switchdc.mediawiki to active-active mediawiki [cookbooks] - 10https://gerrit.wikimedia.org/r/836729 [09:20:35] PROBLEM - Host cloudsw1-e4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [09:20:35] PROBLEM - Host cloudsw1-f4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [09:20:41] RECOVERY - Host cloudsw1-e4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [09:20:41] RECOVERY - Host cloudsw1-f4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [09:21:14] <_joe_> uhhh [09:21:21] <_joe_> XioNoX: expected? [09:21:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:21:57] it probably briefly lost connectivity when BGP re-converged [09:22:00] (03PS2) 10Jbond: wmflib::service::lvs_ipblock: remove unused function [puppet] - 10https://gerrit.wikimedia.org/r/834609 [09:22:19] <_joe_> XioNoX: can you pause? 
[09:22:27] <_joe_> I had a report of etcd failures [09:23:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T314041)', diff saved to https://phabricator.wikimedia.org/P35148 and previous config saved to /var/cache/conftool/dbconfig/20220929-092308-ladsgroup.json [09:23:12] _joe_: I'm done with the router, it's fully back in service [09:23:13] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [09:23:26] (03PS1) 10Jbond: wmf-update-known-hosts-production: bump version [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/836731 [09:23:30] but yep I'll wait for your green light [09:23:45] (03CR) 10CI reject: [V: 04-1] Adapt sre.switchdc.mediawiki to active-active mediawiki [cookbooks] - 10https://gerrit.wikimedia.org/r/836729 (owner: 10Giuseppe Lavagetto) [09:24:00] (03CR) 10Jbond: [V: 03+2 C: 03+2] update-known-hosts-production: Capture all fingerprints (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/834038 (https://phabricator.wikimedia.org/T318006) (owner: 10Jbond) [09:24:12] where was this report? 
[09:24:42] 10SRE, 10observability: certspotter failures on alert1001 - https://phabricator.wikimedia.org/T318911 (10fgiunchedi) [09:24:56] (03CR) 10Hashar: [C: 04-1] "I will build the wheel given Clément is unable to do so rapidly" [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836730 (owner: 10Clément Goubert) [09:25:09] <_joe_> XioNoX: things are ok, just a temp failure writing to etcd during the maintenance I guess [09:25:16] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron: l3_agent: don't sysctl base interface [puppet] - 10https://gerrit.wikimedia.org/r/836732 (https://phabricator.wikimedia.org/T318824) [09:25:25] (03CR) 10Filippo Giunchedi: "recheck" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [09:26:04] (03CR) 10Jbond: [C: 03+2] "thx" [puppet] - 10https://gerrit.wikimedia.org/r/816004 (owner: 10Jbond) [09:26:10] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2098.codfw.wmnet with OS bullseye [09:26:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:26:12] I guess because of the routing protocols re-convergence some packets got confused :) [09:26:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:26:19] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence, and 3 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host db2098.codfw.wmnet with OS bullseye completed: - db2098 (**WARN**) - Downtimed on Icinga/... [09:26:56] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/836733 (https://phabricator.wikimedia.org/T318387) [09:27:17] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/836732 (https://phabricator.wikimedia.org/T318824) (owner: 10Arturo Borrero Gonzalez) [09:27:18] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/836733 (https://phabricator.wikimedia.org/T318387) (owner: 10Kosta Harlan) [09:28:05] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:28:05] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:28:07] PROBLEM - SSH on analytics1075.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:28:45] (03CR) 10Jbond: [C: 03+2] C:ferm: update ferm to use restart-or-reload instead of restart (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823621 (https://phabricator.wikimedia.org/T315305) (owner: 10Jbond) [09:28:54] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cr2-eqiad,cr2-eqiad IPv6,re0.cr2-eqiad.mgmt with reason: router upgrade [09:28:57] (03PS3) 10Jbond: C:ferm: update ferm to use restart-or-reload instead of restart [puppet] - 10https://gerrit.wikimedia.org/r/823621 (https://phabricator.wikimedia.org/T315305) [09:29:06] (03PS3) 10Giuseppe Lavagetto: Adapt sre.switchdc.mediawiki to active-active mediawiki [cookbooks] - 10https://gerrit.wikimedia.org/r/836729 [09:29:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cr2-eqiad,cr2-eqiad IPv6,re0.cr2-eqiad.mgmt with reason: router upgrade [09:29:17] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence 
(ID=acbab0ff-4998-42b3-b0ad-a6be933dfff6) set by ayounsi@cumin1001 for 4:00:00 on 3 host(s) and th... [09:29:25] (03CR) 10Jbond: [C: 03+2] wmflib::service::lvs_ipblock: remove unused function [puppet] - 10https://gerrit.wikimedia.org/r/834609 (owner: 10Jbond) [09:29:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:30:44] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: neutron: l3_agent: don't sysctl base interface [puppet] - 10https://gerrit.wikimedia.org/r/836732 (https://phabricator.wikimedia.org/T318824) (owner: 10Arturo Borrero Gonzalez) [09:31:37] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/836733 (https://phabricator.wikimedia.org/T318387) (owner: 10Kosta Harlan) [09:32:35] _joe_: all good to proceed with the other router? all my checks for cr1 are good [09:32:43] (03CR) 10CI reject: [V: 04-1] Adapt sre.switchdc.mediawiki to active-active mediawiki [cookbooks] - 10https://gerrit.wikimedia.org/r/836729 (owner: 10Giuseppe Lavagetto) [09:32:49] <_joe_> XioNoX: oh yes, sorry [09:32:59] <_joe_> I thought it was clear from my previous message [09:33:03] cool! 
that's what I thought but wanted to be 100% sure :) [09:33:29] !log drain cr2-eqiad - T295690 [09:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:33] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [09:33:49] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [09:34:52] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [09:35:04] (03PS1) 10Elukey: istio: add option to disable dns queries for zipkin on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/836734 (https://phabricator.wikimedia.org/T318814) [09:36:36] (03CR) 10Klausman: [C: 03+1] istio: add option to disable dns queries for zipkin on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/836734 (https://phabricator.wikimedia.org/T318814) (owner: 10Elukey) [09:36:52] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [09:36:53] (03PS2) 10Elukey: istio: add option to disable dns queries for zipkin on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/836734 (https://phabricator.wikimedia.org/T318814) [09:38:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P35149 and previous config saved to /var/cache/conftool/dbconfig/20220929-093815-ladsgroup.json [09:38:55] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [09:41:14] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [09:41:42] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Connect - wmf_public_asn, AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:41:48] expected ^ [09:42:37] !log first cr2-eqiad RE switchover - T295690 [09:42:40] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:41] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [09:43:21] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [09:44:06] 10SRE, 10Analytics-Clusters, 10Data-Engineering, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff I'll be working on this as part of a larger effo... [09:44:11] 10SRE, 10Analytics-Clusters, 10Data-Engineering, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (10MoritzMuehlenhoff) [09:45:34] !log restarting superset to pick up expat security update [09:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:26] PROBLEM - BGP status on pfw3-eqiad is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:46:36] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:47:22] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:50:00] RECOVERY - BGP status on pfw3-eqiad is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:50:00] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:50:00] (03PS2) 10Muehlenhoff: mariadb: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809134 [09:50:00] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:50:21] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Ferm unloads all iptables rules when it hits a parsing error - https://phabricator.wikimedia.org/T315305 (10jbond) i think with the merged of 823621 this can be closed please re-open if you still see issues [09:50:29] 10SRE, 10conftool, 10Patch-For-Review: Add requestctl support to ferm - https://phabricator.wikimedia.org/T313825 (10jbond) [09:51:02] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Ferm unloads all iptables rules when it hits a parsing error - https://phabricator.wikimedia.org/T315305 (10jbond) 05Open→03Resolved a:03jbond [09:51:14] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P35150 and previous config saved to /var/cache/conftool/dbconfig/20220929-095321-ladsgroup.json [09:54:03] 10SRE, 10Analytics-Clusters, 10Data-Engineering-Radar, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (10BTullis) >>! In T258700#8271831, @MoritzMuehlenhoff wrote: > I'll be working on this... 
[09:54:42] (03PS1) 10Jbond: dns/generate_dns_snippet: removed unused type ignore [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/836740 [09:55:22] (03PS2) 10Hashar: Release 3.0.3 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836730 (owner: 10Clément Goubert) [09:55:42] PROBLEM - BGP status on lsw1-f1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn, AS14907/IPv6: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:58:40] (03PS5) 10Jbond: customscripts: export 'mgmt' entries from hiera_export [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [09:59:15] (03CR) 10Elukey: [C: 03+2] istio: add option to disable dns queries for zipkin on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/836734 (https://phabricator.wikimedia.org/T318814) (owner: 10Elukey) [09:59:21] (03PS6) 10Jbond: customscripts: export 'mgmt' entries from hiera_export [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [09:59:34] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:44] (03CR) 10Filippo Giunchedi: [C: 03+1] dns/generate_dns_snippet: removed unused type ignore [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/836740 (owner: 10Jbond) [10:00:04] mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220929T1000). 
[10:00:59] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: 10Bking) [10:03:08] (03CR) 10Hashar: [C: 03+1] "We have compared the existing wheels and they are bit to bit equals since there are whl published on Pypi for all those dependencies \o/" [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836730 (owner: 10Clément Goubert) [10:06:26] (03CR) 10Jbond: [C: 03+2] dns/generate_dns_snippet: removed unused type ignore [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/836740 (owner: 10Jbond) [10:07:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [10:07:44] !log second (and longest) cr2-eqiad RE switchover - T295690 [10:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:48] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [10:08:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T314041)', diff saved to https://phabricator.wikimedia.org/P35152 and previous config saved to /var/cache/conftool/dbconfig/20220929-100828-ladsgroup.json [10:08:30] (03PS1) 10Muehlenhoff: Update cloud* Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/836745 [10:08:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [10:08:32] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:08:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [10:08:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T314041)', diff saved to https://phabricator.wikimedia.org/P35153 and previous config saved to 
/var/cache/conftool/dbconfig/20220929-100849-ladsgroup.json [10:11:14] (03CR) 10Jbond: [C: 03+1] Update cloud* Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/836745 (owner: 10Muehlenhoff) [10:11:51] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809134 (owner: 10Muehlenhoff) [10:12:47] (03CR) 10Filippo Giunchedi: [C: 03+1] k8s: Limit envoy metrics scraped from k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: 10Bking) [10:13:01] PROBLEM - Router interfaces on pfw3-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 57, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:14:59] (03PS1) 10Muehlenhoff: Also add db-core-test to db-all Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/836747 [10:15:17] RECOVERY - Router interfaces on pfw3-eqiad is OK: OK: host 208.80.154.219, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:15:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:17:38] (03CR) 10JMeybohm: [C: 03+1] k8s: Limit envoy metrics scraped from k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: 10Bking) [10:18:23] PROBLEM - Check systemd state on db1183 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:33] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The 
following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:47] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:26:39] (03CR) 10Jbond: [C: 03+1] Also add db-core-test to db-all Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/836747 (owner: 10Muehlenhoff) [10:26:54] (03PS3) 10Muehlenhoff: New cookbook to roll-restart (or roll-reboot) the eventschemas cluster(s) [cookbooks] - 10https://gerrit.wikimedia.org/r/836181 [10:27:55] (03CR) 10Volans: "reply inline" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/834038 (https://phabricator.wikimedia.org/T318006) (owner: 10Jbond) [10:29:17] RECOVERY - SSH on analytics1075.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:29:43] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:53] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:36:13] !log installing poppler security updates [10:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:57] (03CR) 10Muehlenhoff: [C: 03+2] New cookbook to roll-restart (or roll-reboot) the eventschemas cluster(s) [cookbooks] - 
10https://gerrit.wikimedia.org/r/836181 (owner: 10Muehlenhoff) [10:39:57] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/836745 (owner: 10Muehlenhoff) [10:40:06] !log repool cr2-eqiad - T295690 [10:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:10] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [10:42:40] (03PS2) 10Muehlenhoff: Update cloud* Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/836745 [10:43:15] (03Abandoned) 10Muehlenhoff: cumin: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/734245 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [10:44:26] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) Hi! Thanks for looking at this. To answer your questions: Yep, the convert-ssds fix... 
[10:45:17] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:45:21] (03CR) 10Muehlenhoff: [C: 03+2] Update cloud* Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/836745 (owner: 10Muehlenhoff) [10:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:46:05] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:29] RECOVERY - BGP status on lsw1-f1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:48:31] RECOVERY - Check systemd state on db1183 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:08] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/836310 (owner: 10Raymond Ndibe) [10:50:02] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cr2-eqord,cr2-eqord IPv6 with reason: router upgrade [10:50:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 30 hosts with reason: Primary switchover s7 T318892 [10:50:08] T318892: Switchover s8 codfw master (db2161 -> db2165) - https://phabricator.wikimedia.org/T318892 [10:50:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr2-eqord,cr2-eqord IPv6 with reason: router upgrade [10:50:19] (03CR) 10David Caro: [C: 03+1] prometheus: Add new scrape target (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/836310 (owner: 
10Raymond Ndibe) [10:50:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=01f0d013-5101-4278-93a6-1ea49f9dea28) set by ayounsi@cumin1001 for 1:00:00 on 2 host(s) and th... [10:50:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 30 hosts with reason: Primary switchover s7 T318892 [10:50:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 30 hosts with reason: Primary switchover s8 T318892 [10:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:51:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 30 hosts with reason: Primary switchover s8 T318892 [10:52:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2165 with weight 0 T318892', diff saved to https://phabricator.wikimedia.org/P35154 and previous config saved to /var/cache/conftool/dbconfig/20220929-105206-root.json [10:52:57] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:53:31] !log drain cr2-eqord - T295690 [10:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:53:35] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [10:53:44] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2165 to s8 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/836691 (https://phabricator.wikimedia.org/T318892) (owner: 10Marostegui) [10:56:33] 10SRE, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10taavi) [10:56:47] (03PS1) 10Jbond: redfish: remove dell specific name from Redfish class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [10:57:30] 10SRE, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Lucas_Werkmeister_WMDE) [10:57:53] ^ I created the tracking task I asked about yesterday ^^ [10:57:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:59:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T314041)', diff saved to https://phabricator.wikimedia.org/P35155 and previous config saved to /var/cache/conftool/dbconfig/20220929-105912-ladsgroup.json [10:59:16] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:59:53] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:32] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-codfw [11:02:29] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Lucas_Werkmeister_WMDE) Adding 
#continuous-integration-infrastructure (or should it be #continuous-integration-config?) since the patched PHP is... [11:02:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-codfw [11:03:15] PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:39] (03CR) 10CI reject: [V: 04-1] redfish: remove dell specific name from Redfish class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [11:04:33] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad [11:05:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad [11:06:21] !log restart cr2-eqord for upgrade - T295690 [11:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:25] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [11:07:25] (03CR) 10Ladsgroup: [C: 03+1] Remove wmgEntityUsageModifierLimitsStatement on cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836227 (https://phabricator.wikimedia.org/T296384) (owner: 10Lucas Werkmeister (WMDE)) [11:08:54] (03PS1) 10Hokwelum: remove php7.2 from the snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/836751 [11:09:44] kart_ Nikerabbit: I'm planning to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/835589 Are you around for babysitting? 
[11:10:45] !log Starting s8 codfw failover from db2161 to db2165 - T318892 [11:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:49] T318892: Switchover s8 codfw master (db2161 -> db2165) - https://phabricator.wikimedia.org/T318892 [11:11:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2165 to s8 codfw primary T318892', diff saved to https://phabricator.wikimedia.org/P35156 and previous config saved to /var/cache/conftool/dbconfig/20220929-111127-root.json [11:12:05] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:12:05] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:12:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2161 T318892', diff saved to https://phabricator.wikimedia.org/P35157 and previous config saved to /var/cache/conftool/dbconfig/20220929-111217-root.json [11:14:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P35158 and previous config saved to /var/cache/conftool/dbconfig/20220929-111418-ladsgroup.json [11:14:50] (03PS2) 10Hokwelum: remove php7.2 from the snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/836751 [11:15:50] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Apache on mirror [puppet] - 10https://gerrit.wikimedia.org/r/836756 (https://phabricator.wikimedia.org/T135991) [11:16:00] (03PS2) 10Jbond: redfish: remove dell specific name from Redfish class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [11:16:02] (03PS1) 10Jbond: redfish: store all manager info for later use 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [11:16:06] !log re-pool cr2-eqord - T295690 [11:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:10] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [11:16:49] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:16:49] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:17:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:20:38] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ayounsi) eqiad and eqord went extremely well. 
Thanks @cmooney for the [[ https://wikitech.wikimedia.org/wiki/Juniper_RE_i40e_firmware | firmware instructions ]] [11:21:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ayounsi) [11:22:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:22:33] (03CR) 10CI reject: [V: 04-1] redfish: store all manager info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [11:23:21] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:00] (03PS1) 10Muehlenhoff: Add monitoring for mirrors [puppet] - 10https://gerrit.wikimedia.org/r/836775 [11:24:39] (03CR) 10CI reject: [V: 04-1] redfish: remove dell specific name from Redfish class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [11:24:47] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (10ayounsi) [11:25:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [11:25:21] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ayounsi) 05Open→03Resolved [11:29:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1135', diff saved to https://phabricator.wikimedia.org/P35159 and previous config saved to /var/cache/conftool/dbconfig/20220929-112925-ladsgroup.json [11:29:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35160 and previous config saved to /var/cache/conftool/dbconfig/20220929-112933-root.json [11:32:41] Amir1: I would be around now [11:33:02] ok [11:33:20] (03CR) 10Ladsgroup: [C: 03+2] Update Translate job names [deployment-charts] - 10https://gerrit.wikimedia.org/r/835589 (https://phabricator.wikimedia.org/T318484) (owner: 10Nikerabbit) [11:33:27] (03PS4) 10Ayounsi: Add Junos image validation [cookbooks] - 10https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) [11:33:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/836751 (owner: 10Hokwelum) [11:35:29] (03PS1) 10Jbond: P:lvs::configueration: Dont alert for missing lvs definitions [puppet] - 10https://gerrit.wikimedia.org/r/836776 [11:36:18] (03PS2) 10Jbond: P:lvs::configuration: Dont alert for missing lvs definitions [puppet] - 10https://gerrit.wikimedia.org/r/836776 (https://phabricator.wikimedia.org/T264132) [11:36:50] (03Merged) 10jenkins-bot: Update Translate job names [deployment-charts] - 10https://gerrit.wikimedia.org/r/835589 (https://phabricator.wikimedia.org/T318484) (owner: 10Nikerabbit) [11:37:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 62955 [11:37:24] (03CR) 10Hokwelum: "Hello Moritz, we updated php7.2-fpm to php7.4-fpm but we aren’t quite sure if profile::debdeploy::client::filter_services is still needed " [puppet] - 10https://gerrit.wikimedia.org/r/836751 (owner: 10Hokwelum) [11:37:35] (03CR) 10Jbond: [V: 03+2 C: 03+2] wmf-update-known-hosts-production: bump version [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/836731 (owner: 10Jbond) [11:38:23] !log 
ladsgroup@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:38:53] !log ladsgroup@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:38:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 62955 [11:39:01] !log ladsgroup@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:39:46] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 42 [11:40:12] !log ladsgroup@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:40:48] (03CR) 10Giuseppe Lavagetto: "LGTM, but I would tweak the rspec tests a bit" [puppet] - 10https://gerrit.wikimedia.org/r/836776 (https://phabricator.wikimedia.org/T264132) (owner: 10Jbond) [11:41:39] !log ladsgroup@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:41:43] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 42 [11:41:48] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 3856 [11:41:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37390/console" [puppet] - 10https://gerrit.wikimedia.org/r/836776 (https://phabricator.wikimedia.org/T264132) (owner: 10Jbond) [11:43:45] (03PS3) 10Jbond: P:lvs::configuration: Dont alert for missing lvs definitions [puppet] - 10https://gerrit.wikimedia.org/r/836776 (https://phabricator.wikimedia.org/T264132) [11:44:13] (03CR) 10Jbond: [C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/836776 (https://phabricator.wikimedia.org/T264132) (owner: 10Jbond) [11:44:22] Nikerabbit: deployed [11:44:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T314041)', diff saved to https://phabricator.wikimedia.org/P35161 and 
previous config saved to /var/cache/conftool/dbconfig/20220929-114431-ladsgroup.json [11:44:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:44:36] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [11:44:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35162 and previous config saved to /var/cache/conftool/dbconfig/20220929-114438-root.json [11:44:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:44:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 3856 [11:45:04] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 42 [11:45:04] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'configure' for AS: 42 [11:45:19] Amir1: ack. 
monitoring via logstash and grafana [11:46:05] (03CR) 10Jbond: [C: 03+2] P:lvs::configuration: Dont alert for missing lvs definitions [puppet] - 10https://gerrit.wikimedia.org/r/836776 (https://phabricator.wikimedia.org/T264132) (owner: 10Jbond) [11:46:18] (03PS3) 10Hokwelum: remove php7.2 from the snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/836751 (https://phabricator.wikimedia.org/T318894) [11:48:28] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15695 [11:50:15] Amir1: everything I've seen so far indicates that the fix is working [11:50:35] awesome [11:51:16] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15695 [11:51:55] !log ladsgroup@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:51:58] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 209453 [11:52:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 209453 [11:54:59] (03CR) 10Ayounsi: [C: 03+2] Add Junos image validation [cookbooks] - 10https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) (owner: 10Ayounsi) [11:56:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1178', diff saved to https://phabricator.wikimedia.org/P35163 and previous config saved to /var/cache/conftool/dbconfig/20220929-115612-root.json [11:56:21] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 199524 [11:58:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 199524 [11:58:19] (03Merged) 10jenkins-bot: Add Junos image validation [cookbooks] - 10https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) (owner: 10Ayounsi) [11:59:27] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35164 and previous config saved to /var/cache/conftool/dbconfig/20220929-115943-root.json [12:00:01] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: allow removing a php version from a running system [puppet] - 10https://gerrit.wikimedia.org/r/836783 (https://phabricator.wikimedia.org/T318894) [12:00:03] (03PS1) 10Giuseppe Lavagetto: mwdebug: remove php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/836784 (https://phabricator.wikimedia.org/T318894) [12:02:19] (03CR) 10CI reject: [V: 04-1] mediawiki::php: allow removing a php version from a running system [puppet] - 10https://gerrit.wikimedia.org/r/836783 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto) [12:03:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35165 and previous config saved to /var/cache/conftool/dbconfig/20220929-120309-root.json [12:04:00] Amir1: hmm I'm seeing some indication that not all of the jobs are being shifted to the new runner: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?from=now-6h&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-job=RenderTranslationPageJob (expecting orange line to go zero, replaced by green) [12:04:10] not sure if this will fix itself automatically over time [12:04:19] !log ladsgroup@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [12:04:45] Nikerabbit: the deploy failed and got rolled back [12:05:00] Amir1: oh, that could explain [12:05:12] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 3292 [12:05:50] Error: UPGRADE FAILED: release production failed, and has been rolled back due to atomic being set: timed out waiting for the condition 
[12:06:12] I think the pods are being too slow to die [12:06:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 3292 [12:08:03] Going to try again, if it fails again, gonna ping people [12:10:04] !log ladsgroup@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [12:10:46] Nikerabbit: it seems it got deployed now [12:11:17] (03CR) 10Muehlenhoff: remove php7.2 from the snapshot hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/836751 (https://phabricator.wikimedia.org/T318894) (owner: 10Hokwelum) [12:12:16] Amir1: ack, will keep monitoring [12:14:03] (03PS1) 10Ladsgroup: Revert "rdbms: improve LoadBalancer connection pool reuse" [core] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/836713 (https://phabricator.wikimedia.org/T318904) [12:14:09] (03CR) 10Ladsgroup: [C: 03+2] Revert "rdbms: improve LoadBalancer connection pool reuse" [core] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/836713 (https://phabricator.wikimedia.org/T318904) (owner: 10Ladsgroup) [12:14:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35166 and previous config saved to /var/cache/conftool/dbconfig/20220929-121448-root.json [12:18:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35167 and previous config saved to /var/cache/conftool/dbconfig/20220929-121814-root.json [12:18:29] (03CR) 10Ayounsi: [C: 03+1] "I had a look at the cookbook side and I now see how it all works!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:20:00] Saw a (temporary) spike "MessageIndexRebuildJob [MediaWiki]: MessageIndex: unable to acquire lock" on mediawikiwiki [12:22:43] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:25:17] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:29:39] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35168 and previous config saved to /var/cache/conftool/dbconfig/20220929-122953-root.json [12:32:06] (03PS1) 10Muehlenhoff: Add cookbook to perform rolling restart of maps [cookbooks] - 10https://gerrit.wikimedia.org/r/836790 [12:32:26] (03CR) 10CI reject: [V: 04-1] Revert "rdbms: improve LoadBalancer connection pool reuse" [core] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/836713 (https://phabricator.wikimedia.org/T318904) (owner: 10Ladsgroup) [12:32:45] (03Merged) 10jenkins-bot: Revert "rdbms: improve LoadBalancer connection pool reuse" [core] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/836713 (https://phabricator.wikimedia.org/T318904) (owner: 10Ladsgroup) [12:33:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35169 and previous config saved to /var/cache/conftool/dbconfig/20220929-123319-root.json [12:34:09] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:34:14] (03PS1) 10Muehlenhoff: Extend maps Cumin alias with site-specific equivalents [puppet] - 10https://gerrit.wikimedia.org/r/836792 [12:34:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/836713 (https://phabricator.wikimedia.org/T318904) (owner: 10Ladsgroup) [12:34:55] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:836713|Revert "rdbms: improve LoadBalancer connection pool reuse" (T318904)]] [12:34:59] T318904: Special:UserRights not allowing cross-wiki user rights change - https://phabricator.wikimedia.org/T318904 [12:35:27] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:836713|Revert "rdbms: improve LoadBalancer connection pool reuse" (T318904)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [12:36:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:37:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:37:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:38:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:40:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thanks all for the reviews! 
I'll wait for Riccardo's thoughts in case he has any other comments" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:42:34] (03PS1) 10Muehlenhoff: Also apply labweb->cloudweb rename for the Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/836795 [12:44:00] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:836713|Revert "rdbms: improve LoadBalancer connection pool reuse" (T318904)]] (duration: 09m 05s) [12:44:04] T318904: Special:UserRights not allowing cross-wiki user rights change - https://phabricator.wikimedia.org/T318904 [12:44:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35171 and previous config saved to /var/cache/conftool/dbconfig/20220929-124458-root.json [12:46:24] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10akosiaris) [12:48:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35172 and previous config saved to /var/cache/conftool/dbconfig/20220929-124824-root.json [12:52:40] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836799 (https://phabricator.wikimedia.org/T314192) [12:52:42] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836799 (https://phabricator.wikimedia.org/T314192) (owner: 10TrainBranchBot) [12:53:47] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836799 (https://phabricator.wikimedia.org/T314192) (owner: 10TrainBranchBot) [12:57:45] !log jnuche@deploy1002 rebuilt and synchronized wikiversions 
files: group1 wikis to 1.40.0-wmf.3 refs T314192 [12:57:49] T314192: 1.40.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T314192 [12:59:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:00:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35173 and previous config saved to /var/cache/conftool/dbconfig/20220929-130003-root.json [13:00:04] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220929T1300) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220929T1300). [13:00:04] koi and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:00:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:00:15] I can deploy! [13:00:21] if jnuche is done with the train for now [13:00:31] o/ [13:00:40] Lucas_WMDE: not yet, please hold for a few minutes [13:00:45] ok [13:01:49] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.3 refs T314192 (duration: 04m 04s) [13:02:12] Lucas_WMDE: I realize I dropped my patch into yesterday's window on the wiki page -- can I sneak into this one? [13:02:32] (03PS2) 10Muehlenhoff: Also add db-core-test to db-all Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/836747 [13:02:41] Kemayo: sure [13:03:24] Lucas_WMDE: all done now, you can go ahead with the patches [13:03:25] Awesome, thanks. I'll update the page. 
[13:03:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35174 and previous config saved to /var/cache/conftool/dbconfig/20220929-130329-root.json [13:03:30] jnuche: ok, thanks! [13:03:41] let’s start with koi’s votewiki change [13:03:47] (03PS2) 10Lucas Werkmeister (WMDE): votewiki: Change wgLanguageCode to zh for Sep 2022 admins election [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835291 (https://phabricator.wikimedia.org/T318147) (owner: 10Stang) [13:04:01] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] votewiki: Change wgLanguageCode to zh for Sep 2022 admins election [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835291 (https://phabricator.wikimedia.org/T318147) (owner: 10Stang) [13:04:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:04:49] (03Merged) 10jenkins-bot: votewiki: Change wgLanguageCode to zh for Sep 2022 admins election [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835291 (https://phabricator.wikimedia.org/T318147) (owner: 10Stang) [13:06:00] oops, forgot that I wanted to use scap backport [13:06:04] I’ll use it for my change instead [13:06:04] (03PS1) 10Hoo man: Wikibase: Set UnconnectedPage page prop format for test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836803 [13:06:13] koi: the change should be on mwdebug1001, can you test it? [13:06:19] * Lucas_WMDE waves at hoo [13:06:44] Lucas_WMDE: site language changed to zh, LGTM [13:06:48] (03CR) 10Awight: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/836804 (https://phabricator.wikimedia.org/T315972) (owner: 10Awight) [13:06:50] ok, thanks [13:06:55] (03CR) 10Muehlenhoff: [C: 03+2] Also add db-core-test to db-all Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/836747 (owner: 10Muehlenhoff) [13:07:13] hey Lucas_WMDE :) [13:07:22] (koi: I guess it’s actually the Oct 2022 admins election now?) [13:07:44] Care to have a look at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/836803? [13:07:48] sure [13:07:49] (03CR) 10Bking: k8s: Limit envoy metrics scraped from k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: 10Bking) [13:07:55] Given it is test only, I think we can just squeeze it into this window [13:08:05] PROBLEM - Check systemd state on mx1001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:25] yeah it actually happens at Oct, but we name it with the init date [13:08:35] (03CR) 10Bking: [C: 03+2] k8s: Limit envoy metrics scraped from k8s [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: 10Bking) [13:08:39] ok [13:09:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:09:17] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:09:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Looks good to me; can be synced in either order due to the isset(). 
(The relevant code is only in wmf.3, so we probably don’t want to roll" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836803 (owner: 10Hoo man) [13:10:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:10:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:10:51] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:835291|votewiki: Change wgLanguageCode to zh for Sep 2022 admins election (T318147)]] (duration: 03m 40s) [13:10:54] T318147: Carry out an admin election of zhwiki on votewiki (Sep 2022) - https://phabricator.wikimedia.org/T318147 [13:11:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836227 (https://phabricator.wikimedia.org/T296384) (owner: 10Lucas Werkmeister (WMDE)) [13:11:34] ok, trying my config change with the confusingly named scap backport command [13:11:42] !log rolling restart of apache2 in mw/codfw to pick up Expat security updates [13:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:56] o_O it can’t rebase the change on its own? 
[13:12:34] well, the gate-and-submit is still running, but `scap backport` exited already [13:13:14] strange, I thought mediawiki-config changes got auto-rebased during gate-and-submit [13:13:18] apparently this one didn’t [13:13:22] (03PS2) 10Lucas Werkmeister (WMDE): Remove wmgEntityUsageModifierLimitsStatement on cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836227 (https://phabricator.wikimedia.org/T296384) [13:13:46] (03CR) 10TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836227 (https://phabricator.wikimedia.org/T296384) (owner: 10Lucas Werkmeister (WMDE)) [13:13:51] let’s try again [13:13:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:14:45] (03Merged) 10jenkins-bot: Remove wmgEntityUsageModifierLimitsStatement on cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836227 (https://phabricator.wikimedia.org/T296384) (owner: 10Lucas Werkmeister (WMDE)) [13:15:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35175 and previous config saved to /var/cache/conftool/dbconfig/20220929-131507-root.json [13:15:09] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:836227|Remove wmgEntityUsageModifierLimitsStatement on cebwiki (T296384)]] [13:15:14] T296384: Enable statement usage tracking on cebwiki - https://phabricator.wikimedia.org/T296384 [13:15:33] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and lucaswerkmeister-wmde: Backport for [[gerrit:836227|Remove wmgEntityUsageModifierLimitsStatement on cebwiki (T296384)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:16:17] don’t think there’s any way to test this; cebwiki isn’t on fire, so lgtm [13:18:00] (03CR) 10Alexandros Kosiaris: "What's 
the use case for this? Any specific reason we need this at this repo/level?" [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/836711 (owner: 10Hashar) [13:18:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35176 and previous config saved to /var/cache/conftool/dbconfig/20220929-131834-root.json [13:19:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:20:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:20:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:20:18] (03CR) 10Alexandros Kosiaris: P:ci::docker: Install upstream docker packages for all CI agents (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [13:20:33] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:836227|Remove wmgEntityUsageModifierLimitsStatement on cebwiki (T296384)]] (duration: 05m 23s) [13:20:36] T296384: Enable statement usage tracking on cebwiki - https://phabricator.wikimedia.org/T296384 [13:21:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:21:18] (03PS1) 10Muehlenhoff: Add Cumin alias for mariadb objectstash [puppet] - 10https://gerrit.wikimedia.org/r/836805 [13:22:21] (03PS3) 10Lucas Werkmeister (WMDE): Stop mobile visual enhancements from rolling out to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836304 (https://phabricator.wikimedia.org/T318871) (owner: 10DLynch) [13:23:01] Lucas_WMDE: I don't have any frontend testing I can reasonably do for this one, so feel free to roll it straight out if there's no errors. 
[13:23:35] Kemayo: this needs a backport of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/836303 first, I think [13:23:45] that change isn’t in any train yet afaict [13:24:24] Idea is just to get the config ready for when the train does roll out. [13:24:44] then it shouldn’t have a Depends-On imho [13:24:50] (scap backport refuses to deploy the change for that reason) [13:25:10] That's a fair point. I can remove that. [13:25:19] (03PS4) 10DLynch: Stop mobile visual enhancements from rolling out to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836304 (https://phabricator.wikimedia.org/T318871) [13:25:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836304 (https://phabricator.wikimedia.org/T318871) (owner: 10DLynch) [13:25:45] ok [13:26:00] !log restarting Apache on lists [13:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:21] (03Merged) 10jenkins-bot: Stop mobile visual enhancements from rolling out to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836304 (https://phabricator.wikimedia.org/T318871) (owner: 10DLynch) [13:27:44] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:836304|Stop mobile visual enhancements from rolling out to jawiki (T318871)]] [13:27:48] T318871: Implement config that enables us to exclude ja.wiki from receiving mobile visual enhancements/usability improvemets - https://phabricator.wikimedia.org/T318871 [13:28:08] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and kemayo: Backport for [[gerrit:836304|Stop mobile visual enhancements from rolling out to jawiki (T318871)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:28:19] checking mwdebug [13:29:00] lgtm, continuing [13:29:45] 👍🏻 [13:30:17] RECOVERY - Check systemd
state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:32:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:32:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:32:31] (03PS1) 10Elukey: coredns: add rewrite actions to the config map [deployment-charts] - 10https://gerrit.wikimedia.org/r/836811 (https://phabricator.wikimedia.org/T318814) [13:32:44] (03PS2) 10Lucas Werkmeister (WMDE): Wikibase: Set UnconnectedPage page prop format for test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836803 (owner: 10Hoo man) [13:33:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:33:20] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:836304|Stop mobile visual enhancements from rolling out to jawiki (T318871)]] (duration: 05m 36s) [13:33:24] T318871: Implement config that enables us to exclude ja.wiki from receiving mobile visual enhancements/usability improvemets - https://phabricator.wikimedia.org/T318871 [13:33:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35178 and previous config saved to /var/cache/conftool/dbconfig/20220929-133339-root.json [13:33:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836803 (owner: 10Hoo man) [13:34:10] hoo: it should be possible to test this change, right? purge some unconnected page with links update and then look at its page props in the database? 
[13:34:19] (also the different Special:UnconnectedPages behavior I guess) [13:34:30] (03Merged) 10jenkins-bot: Wikibase: Set UnconnectedPage page prop format for test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836803 (owner: 10Hoo man) [13:34:33] Lucas_WMDE: Yes, I'm ready to do just that [13:34:39] \o/ [13:34:50] (scap is still doing its thing) [13:34:55] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:836803|Wikibase: Set UnconnectedPage page prop format for test wikis]] [13:35:03] * hoo waits [13:35:19] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and hoo: Backport for [[gerrit:836803|Wikibase: Set UnconnectedPage page prop format for test wikis]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:35:27] 10SRE, 10vm-requests: codfw: 1 VMs requested for puppetdb-test2001 - https://phabricator.wikimedia.org/T318931 (10MoritzMuehlenhoff) [13:35:29] ^ there you go [13:35:37] 10SRE, 10vm-requests: codfw: 1 VMs requested for puppetdb-test2001 - https://phabricator.wikimedia.org/T318931 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:36:57] Lucas_WMDE: Seems to work, null edit on test makes the value negative [13:37:02] \o/ [13:37:04] thanks! [13:37:11] continuing sync [13:37:30] (03PS1) 10Clément Goubert: doc: add README.md [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836816 [13:37:35] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8966 [13:38:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:38:34] The special page changes as expected as well... I'll run the maint. script on the test wikis later today so that it should be all good, then.
[13:38:39] nice [13:38:53] (03CR) 10Volans: [C: 04-1] "If you want those to be used with the cookbooks, they accept also a query, so you can totally pass A:foo and A:eqiad as query to the cookb" [puppet] - 10https://gerrit.wikimedia.org/r/836792 (owner: 10Muehlenhoff) [13:39:21] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wcqs-public [13:39:22] looks like we’ll have a bit of time left over in the window, if anyone else has something to deploy [13:39:35] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:39:43] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8966 [13:40:19] Lucas_WMDE: I'll leave it to you for the rest, if that's fine [13:40:23] * hoo will be back later today [13:40:29] ok [13:40:53] jouncebot: now [13:40:54] For the next 0 hour(s) and 19 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220929T1300) [13:40:54] For the next 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220929T1300) [13:41:09] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:836803|Wikibase: Set UnconnectedPage page prop format for test wikis]] (duration: 06m 13s) [13:41:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wcqs-public [13:41:32] !log UTC afternoon backport+config window done [13:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:14] Lucas_WMDE: I'm good to merge some memcache puppet changes? Should be noop but will follow proper procedure for cache config changes. 
[13:42:23] claime: go ahead, I’m done with the window [13:42:29] Lucas_WMDE: ack, thanks [13:42:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:42:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:42:40] (03PS2) 10Elukey: coredns: add rewrite actions to the config map [deployment-charts] - 10https://gerrit.wikimedia.org/r/836811 (https://phabricator.wikimedia.org/T318814) [13:43:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:44:51] (03CR) 10Clément Goubert: [C: 03+2] C:memcached Fix memcached bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/835585 (https://phabricator.wikimedia.org/T318697) (owner: 10Clément Goubert) [13:45:21] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 32934 [13:46:08] !log Disabling puppet for C:memcache hosts to merge [[gerrit:835585|C:memcached Fix memcached bootstrap]] [13:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:48:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35179 and previous config saved to /var/cache/conftool/dbconfig/20220929-134844-root.json [13:48:58] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10Jclark-ctr) cableid 2207506656 fpc7 - fpc5 cableid 2207506655 fpc2 - fpc6 cableid 2207506658 fpc7 - fpc3 Ayounsi did you... 
[13:49:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:49:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:50:14] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 32934 [13:50:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:54:35] !log Enabled puppet for C:memcache hosts following merge [[gerrit:835585|C:memcached Fix memcached bootstrap]] [13:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:51] All done :) [13:56:47] (03CR) 10Clément Goubert: "Little bit of doc can't hurt. Absolutely do tell me if I missed something (apart from what hashar and I haven't done yet)" [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836816 (owner: 10Clément Goubert) [13:57:10] (03CR) 10JHathaway: [C: 03+1] "looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/836756 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:01:36] (03PS2) 10Muehlenhoff: Extend maps Cumin alias with site-specific equivalents [puppet] - 10https://gerrit.wikimedia.org/r/836792 [14:02:01] (03CR) 10Muehlenhoff: Extend maps Cumin alias with site-specific equivalents (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/836792 (owner: 10Muehlenhoff) [14:02:04] (03CR) 10Volans: [C: 03+1] "LGTM, we now just need to update the cookbook accordingly:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [14:03:34] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/836792 (owner: 10Muehlenhoff) [14:04:07] RECOVERY - Check systemd state on mx1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:23] !log ayounsi@cumin1001 START - Cookbook 
sre.network.peering with action 'configure' for AS: 11164 [14:06:21] PROBLEM - Check systemd state on mwdebug2001 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:26] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 11164 [14:09:52] (03CR) 10Muehlenhoff: [C: 03+2] keyholder: Switch shebang to /usr/bin/python3 [puppet] - 10https://gerrit.wikimedia.org/r/836700 (owner: 10Muehlenhoff) [14:13:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10Jclark-ctr) kubernetes1023 c6 u42 port 42 cableid 23000039 reseated cable on 1024 it has light now [14:13:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [14:16:41] (03PS3) 10Elukey: coredns: add rewrite actions to the config map [deployment-charts] - 10https://gerrit.wikimedia.org/r/836811 (https://phabricator.wikimedia.org/T318814) [14:16:55] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:48] !log rolling restart of apache2 in mw/eqiad to pick up Expat security updates [14:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:33] (03PS2) 10Giuseppe Lavagetto: mediawiki::php: allow removing a php version from a running system [puppet] - 10https://gerrit.wikimedia.org/r/836783 (https://phabricator.wikimedia.org/T318894) [14:18:35] (03PS2) 10Giuseppe Lavagetto: mwdebug: remove php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/836784 (https://phabricator.wikimedia.org/T318894) [14:19:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): 
Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson Verified all of these they are all connected to management and have link [14:19:11] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:20] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Wire new event stream for maps interactions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836804 (https://phabricator.wikimedia.org/T315972) (owner: 10Awight) [14:25:49] (03CR) 10Jbond: redfish: store all manager info for later use (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [14:28:22] (03CR) 10Klausman: [C: 03+1] coredns: add rewrite actions to the config map [deployment-charts] - 10https://gerrit.wikimedia.org/r/836811 (https://phabricator.wikimedia.org/T318814) (owner: 10Elukey) [14:28:31] 10SRE-swift-storage: flip/flop mounting filesystems between systemd and swift-drive-audit - https://phabricator.wikimedia.org/T265450 (10MatthewVernon) I've offered upstream https://review.opendev.org/c/openstack/swift/+/859861 to fix the lack of reload. 
[14:29:29] !log uploaded glib2.0 2.50.3-2+deb9u3+wmf1 to apt.wikimedia.org/stretch-wikimedia [14:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:12] (03PS2) 10Jbond: redfish: store all manager info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [14:30:35] !log installing glib2.0 security updates [14:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:12] (03PS3) 10Muehlenhoff: Extend maps Cumin alias with site-specific equivalents [puppet] - 10https://gerrit.wikimedia.org/r/836792 [14:36:48] (03CR) 10CI reject: [V: 04-1] redfish: store all manager info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [14:40:04] (03CR) 10Volans: "LGTM once CI is happy" [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [14:40:37] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:40:54] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for Apache on mirror [puppet] - 10https://gerrit.wikimedia.org/r/836756 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:42:34] (03CR) 10Clément Goubert: [C: 03+1] "fpm::pool and systemd::unit get purged with the package purge. LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/836783 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto) [14:44:03] (03PS1) 10David Caro: maintain-dbusers: add missing collate to the account table [puppet] - 10https://gerrit.wikimedia.org/r/836849 (https://phabricator.wikimedia.org/T318047) [14:46:02] (03CR) 10Elukey: "Some extra context in https://github.com/istio/istio/issues/13710" [deployment-charts] - 10https://gerrit.wikimedia.org/r/836811 (https://phabricator.wikimedia.org/T318814) (owner: 10Elukey) [14:49:04] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10elukey) [14:49:43] 10SRE, 10observability: certspotter failures on alert1001 - https://phabricator.wikimedia.org/T318911 (10ssingh) Thanks very much for creating this task @fgiunchedi! We were recently discussing certspotter in the team as well and the various issues with it. On one hand, it's an important service that we need t... [14:53:58] (03PS1) 10Ssingh: certspotter: remove rate-limiting CT log [puppet] - 10https://gerrit.wikimedia.org/r/836851 (https://phabricator.wikimedia.org/T318911) [14:54:31] (03CR) 10CI reject: [V: 04-1] certspotter: remove rate-limiting CT log [puppet] - 10https://gerrit.wikimedia.org/r/836851 (https://phabricator.wikimedia.org/T318911) (owner: 10Ssingh) [14:54:53] (03CR) 10Muehlenhoff: [C: 03+2] java: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831037 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:55:17] (03PS2) 10Ssingh: certspotter: remove rate-limiting CT log [puppet] - 10https://gerrit.wikimedia.org/r/836851 (https://phabricator.wikimedia.org/T318911) [14:56:51] (03CR) 10CDanis: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [14:58:16] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37392/console" [puppet] - 10https://gerrit.wikimedia.org/r/836851 (https://phabricator.wikimedia.org/T318911) (owner: 10Ssingh) [14:58:49] (03CR) 10Ssingh: [V: 03+1 C: 03+2] certspotter: remove rate-limiting CT log [puppet] - 10https://gerrit.wikimedia.org/r/836851 (https://phabricator.wikimedia.org/T318911) (owner: 10Ssingh) [15:04:18] (03CR) 10Muehlenhoff: [C: 03+2] confd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831035 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:04:52] (03CR) 10Volans: "reply inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [15:06:32] (03PS1) 10Muehlenhoff: grub: Update includes [puppet] - 10https://gerrit.wikimedia.org/r/836855 [15:16:24] (03PS1) 10Lucas Werkmeister (WMDE): Configure `mul` Wikibase language code on Beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836858 [15:18:35] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Apache on piwik/matomo [puppet] - 10https://gerrit.wikimedia.org/r/836859 (https://phabricator.wikimedia.org/T135991) [15:19:03] PROBLEM - Check systemd state on elastic2055 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:07] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "There's a few nits I would fix about the patch, but most importantly the integration tests in test_reqconfig are placed in the wrong place" [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [15:22:47] 10SRE-swift-storage, 10ConfirmEdit (CAPTCHA extension), 10Wikimedia-production-error: FileBackendError: Iterator page I/O error. 
- https://phabricator.wikimedia.org/T318941 (10hashar) [15:23:43] RECOVERY - Check systemd state on elastic2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:28:59] jounebot now [15:29:01] jouncebot now [15:29:01] No deployments scheduled for the next 0 hour(s) and 30 minute(s) [15:35:17] !log dancy@deploy1002 Installing scap version "4.25.0" for 561 hosts [15:35:35] !log dancy@deploy1002 Installation of scap version "4.25.0" completed for 561 hosts [15:37:54] 10SRE-swift-storage, 10ConfirmEdit (CAPTCHA extension), 10Wikimedia-production-error: FileBackendError: Iterator page I/O error. - https://phabricator.wikimedia.org/T318941 (10thcipriani) Also happened for the Score extension around the same time: `name=exception.trace,lines=10 from /srv/mediawiki/php-1.40.... [15:40:29] jouncebot: now [15:40:29] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [15:40:47] I’d like to deploy a Beta-only config change if that’s okay (dancy, is the scap update done?) 
[15:40:57] Yep I'm done [15:41:01] ok thanks [15:42:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T314041)', diff saved to https://phabricator.wikimedia.org/P35184 and previous config saved to /var/cache/conftool/dbconfig/20220929-154231-ladsgroup.json [15:42:35] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [15:42:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "deploying (beta-only)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836858 (owner: 10Lucas Werkmeister (WMDE)) [15:43:32] Lucas_WMDE: You can use `scap backport 836858` and The Right Stuff (TM) will happen. [15:43:47] dancy: I wanted to do this one manually so I can add “(beta-only)” to the sync-file message :) [15:43:54] (03Merged) 10jenkins-bot: Configure `mul` Wikibase language code on Beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836858 (owner: 10Lucas Werkmeister (WMDE)) [15:43:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:44:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [15:44:50] Lucas_WMDE: If that's a feature you want, please file a phab ticket. That can be arranged. [15:45:04] We already have some beta-only-change logic that can be extended. [15:45:05] here [15:45:16] hey [15:45:20] Lucas_WMDE: did we just roll something out? 
[15:45:23] Btw, that beta-only-change logic skips the sync [15:45:40] cdanis: I only started the sync-file just now [15:45:44] only pulled to mwdebug before [15:45:47] did something happen? [15:46:14] (sync-file is currently waiting for canary traffic) [15:46:18] (now started sync-apaches) [15:46:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:47:09] oh, now I see the p.age above, sorry [15:47:10] some sort of traffic spike, plus serving a bunch of 5xx at the edge [15:47:22] if you could pause further deployments for a few minutes that'd be good [15:47:36] want me to Ctrl+C? currently in php-fpm-restart [15:47:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:47:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:47:40] I have nothing else to deploy otherwise [15:47:46] no, that's okay [15:47:49] ok [15:48:00] would rather not leave things in an inconsistent state, and things look to be recoverign [15:48:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:48:48] dancy: tbh I don’t like the sound of not syncing beta-only changes and leaving differences between the deployment host and everything else [15:48:54] but I guess I’d better get used to that… [15:48:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:49:03] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:836858|Configure `mul` Wikibase language code on Beta wikis]] (beta-only, prod noop) (duration: 03m 41s) [15:49:13] mwdebug-deploy@deploy1002: Failed to log message to wiki. 
Somebody should check the error logs. [15:49:23] hm [15:49:26] 503 here [15:49:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [15:49:45] uhmm [15:49:55] sukhe: eqiad cdn quite slow for you too? [15:50:36] cdanis: pretty OK for me, just checked [15:50:46] ok spoke too soon :D [15:50:47] yep [15:51:13] any theories so far? I haven't caught up on the backlog [15:51:15] but here now [15:51:18] (ProbeDown) firing: (4) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:51:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [15:52:04] (ProbeDown) firing: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#commons.wikimedia.org:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:11] I'm getting 503s for loading https://phabricator.wikimedia.org/ [15:52:32] Phab is ok for me. 
[15:52:33] bd808: yeah, see -security [15:52:58] sukhe: *nod* saw right after I whinged [15:53:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:54:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:54:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:55:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:56:18] (ProbeDown) resolved: (9) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:56:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [15:57:04] (ProbeDown) resolved: (12) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:57:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P35185 and previous config saved to /var/cache/conftool/dbconfig/20220929-155737-ladsgroup.json [16:00:05] jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220929T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:12:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P35186 and previous config saved to /var/cache/conftool/dbconfig/20220929-161244-ladsgroup.json [16:16:41] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/836859 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:27:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T314041)', diff saved to https://phabricator.wikimedia.org/P35187 and previous config saved to /var/cache/conftool/dbconfig/20220929-162750-ladsgroup.json [16:27:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [16:27:55] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [16:28:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [16:28:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T314041)', diff saved to https://phabricator.wikimedia.org/P35188 and previous config saved to /var/cache/conftool/dbconfig/20220929-162812-ladsgroup.json [16:32:46] (03PS4) 10Jdlrobson: Web team config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835246 (https://phabricator.wikimedia.org/T316568) [16:33:07] (03PS1) 10BryanDavis: developer-portal: Bump container to 2022-09-29-111815-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/836872 [16:47:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [16:48:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [16:49:03] 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 
on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10Markjgraham_hmb) I would very much appreciate the opportunity to speak with whoever is in charge of the SRE team. I am mark@archive.org (917) 697-0110 [16:49:19] (03CR) 10David Caro: [C: 03+1] "LGTM" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [16:53:41] (03PS1) 10Jdlrobson: Enable desktop improvements on nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836878 (https://phabricator.wikimedia.org/T318344) [16:55:05] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:57:23] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:59:13] (03PS1) 10Jdlrobson: Add Nepalese Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836880 (https://phabricator.wikimedia.org/T318737) [17:00:05] bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220929T1700). 
[17:00:18] o/ [17:00:48] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2022-09-29-111815-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/836872 (owner: 10BryanDavis) [17:02:55] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:04:18] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2022-09-29-111815-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/836872 (owner: 10BryanDavis) [17:06:11] (03PS5) 10Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [17:06:13] (03PS1) 10Jbond: test_syncer: genralise temp data context manager [software/conftool] - 10https://gerrit.wikimedia.org/r/836883 [17:06:55] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:07:54] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:08:22] (03CR) 10CI reject: [V: 04-1] reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [17:08:51] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:09:28] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:09:35] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:10:15] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:13:13] * bd808 is done with deploys for this slot [17:28:57] (03PS1) 10Ebernhardson: cirrus: Don't configure cloud clusters for private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836886 [17:43:01] 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production 
- https://phabricator.wikimedia.org/T318065 (10KOfori) Hello Mark, I will be in touch with you concerning this. [17:44:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10Cmjohnson) 05Open→03Resolved updated their status. [17:44:31] (03PS6) 10Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [17:44:51] 10SRE, 10User-MoritzMuehlenhoff: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (10sbassett) [17:45:28] 10SRE, 10Security-Team, 10Security: Deprecate use of ssh-rsa keys? - https://phabricator.wikimedia.org/T311368 (10sbassett) p:05Triage→03Medium [17:46:48] (03CR) 10CI reject: [V: 04-1] reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [17:56:20] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:58:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:00:04] brennen and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220929T1800). 
[18:01:47] (03PS7) 10Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [18:03:46] o/ [18:06:51] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:09:48] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:10:09] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host kafka-stretch1001.mgmt.eqiad.wmnet with reboot policy FORCED [18:10:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host kafka-stretch1002.mgmt.eqiad.wmnet with reboot policy FORCED [18:10:57] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836888 (https://phabricator.wikimedia.org/T314192) [18:10:59] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836888 (https://phabricator.wikimedia.org/T314192) (owner: 10TrainBranchBot) [18:11:03] (03CR) 10Jbond: reqconfig: add ip validation for ipblocks (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [18:12:23] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836888 (https://phabricator.wikimedia.org/T314192) (owner: 10TrainBranchBot) [18:15:38] <_joe_> jbond: let's pair a bit about that patch tomorrow, your comment is mostly correct, sorry for misleading you :) [18:16:44] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.3 refs T314192 [18:16:49] T314192: 1.40.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T314192 [18:17:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:18:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:18:06] !log mwdebug-deploy@deploy1002 helmfile 
[codfw] START helmfile.d/services/mwdebug: apply [18:18:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:20:29] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:26:36] (03PS8) 10Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [18:28:00] (03PS9) 10Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [18:29:15] (03CR) 10Jbond: reqconfig: add ip validation for ipblocks (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [18:30:31] (03CR) 10CI reject: [V: 04-1] reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [18:32:59] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:33:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-stretch1001.mgmt.eqiad.wmnet with reboot policy FORCED [18:34:32] (03CR) 10Jbond: "i created this with the intention to use it for future test but i was going down the wrong rabbit whole. 
i still think its could be a use" [software/conftool] - 10https://gerrit.wikimedia.org/r/836883 (owner: 10Jbond) [18:35:30] (03PS10) 10Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [18:36:39] 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10Markjgraham_hmb) I just spoke with Kwaku Addo Ofori via a video call. Thank you Kwaku and the entire SRE team for your care and attention! I am grateful for your efforts... [18:39:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1054.mgmt.eqiad.wmnet with reboot policy FORCED [18:40:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1055.mgmt.eqiad.wmnet with reboot policy FORCED [18:40:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1056.mgmt.eqiad.wmnet with reboot policy FORCED [18:41:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1057.mgmt.eqiad.wmnet with reboot policy FORCED [18:41:44] (03CR) 10CI reject: [V: 04-1] reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [18:41:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1058.mgmt.eqiad.wmnet with reboot policy FORCED [18:42:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1059.mgmt.eqiad.wmnet with reboot policy FORCED [18:42:53] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1060.mgmt.eqiad.wmnet with reboot policy FORCED [18:43:23] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1061.mgmt.eqiad.wmnet with reboot policy FORCED [18:45:23] !log cmjohnson@cumin1001 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host kafka-stretch1002.mgmt.eqiad.wmnet with reboot policy FORCED [18:46:27] 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10KOfori) Thanks for speaking to me, Mark. We can confirm the bit about the IP being added to the allow list next week but I believe the best way forward is for us to work to... [18:51:54] (03PS1) 10Gehel: elasticsearch: Increase number of master-eligible nodes to 5 [puppet] - 10https://gerrit.wikimedia.org/r/836890 (https://phabricator.wikimedia.org/T313431) [18:52:44] (03PS2) 10Gehel: elasticsearch: Increase number of master-eligible nodes to 5 [puppet] - 10https://gerrit.wikimedia.org/r/836890 (https://phabricator.wikimedia.org/T313431) [18:52:51] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (10RobH) [18:53:50] (03PS3) 10Gehel: elasticsearch: Increase number of master-eligible nodes to 5 [puppet] - 10https://gerrit.wikimedia.org/r/836890 (https://phabricator.wikimedia.org/T313431) [18:56:03] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1054.mgmt.eqiad.wmnet with reboot policy FORCED [18:56:40] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (10RobH) [18:59:51] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4021.ulsfo.wmnet [19:00:21] (03PS4) 10Gehel: elasticsearch: Increase number of master-eligible nodes to 5 for codfw [puppet] - 10https://gerrit.wikimedia.org/r/836890 (https://phabricator.wikimedia.org/T313431) [19:00:23] (03PS1) 10Gehel: elasticsearch: Increase number of master-eligible nodes to 5 for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/836908 (https://phabricator.wikimedia.org/T313431) [19:02:35] !log cmjohnson@cumin1001 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host cloudvirt1055.mgmt.eqiad.wmnet with reboot policy FORCED [19:02:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1056.mgmt.eqiad.wmnet with reboot policy FORCED [19:02:42] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1057.mgmt.eqiad.wmnet with reboot policy FORCED [19:02:45] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1058.mgmt.eqiad.wmnet with reboot policy FORCED [19:02:47] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1059.mgmt.eqiad.wmnet with reboot policy FORCED [19:03:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1061.mgmt.eqiad.wmnet with reboot policy FORCED [19:04:23] !log robh@cumin2002 START - Cookbook sre.dns.netbox [19:05:55] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1060.mgmt.eqiad.wmnet with reboot policy FORCED [19:06:10] (03PS1) 10Gehel: elasticsearch: Elasticsearch 7 does not need to specify number of masters [puppet] - 10https://gerrit.wikimedia.org/r/836912 (https://phabricator.wikimedia.org/T313431) [19:08:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Cmjohnson) [19:08:12] (03CR) 10Ryan Kemper: [C: 03+1] elasticsearch: Increase number of master-eligible nodes to 5 for codfw [puppet] - 10https://gerrit.wikimedia.org/r/836890 (https://phabricator.wikimedia.org/T313431) (owner: 10Gehel) [19:08:45] (03CR) 10Gehel: [C: 03+2] elasticsearch: Increase number of master-eligible nodes to 5 for codfw [puppet] - 10https://gerrit.wikimedia.org/r/836890 (https://phabricator.wikimedia.org/T313431) (owner: 10Gehel) [19:09:31] !log robh@cumin2002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [19:09:32] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4021.ulsfo.wmnet [19:11:21] (03PS2) 10Gehel: elasticsearch: Increase number of master-eligible nodes to 5 for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/836908 (https://phabricator.wikimedia.org/T313431) [19:11:22] (03PS2) 10Gehel: elasticsearch: Elasticsearch 7 does not need to specify number of masters [puppet] - 10https://gerrit.wikimedia.org/r/836912 (https://phabricator.wikimedia.org/T313431) [19:12:59] (03PS3) 10Gehel: elasticsearch: Increase number of master-eligible nodes to 5 for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/836908 (https://phabricator.wikimedia.org/T313431) [19:13:39] (03PS2) 10Ebernhardson: cirrus: Don't configure cloud clusters for private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836886 [19:24:37] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [19:25:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [19:29:02] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on 6 hosts with reason: T313431 [19:29:07] T313431: Increase Elastic master-eligible nodes from 3 to 5 - https://phabricator.wikimedia.org/T313431 [19:29:24] (03PS1) 10Cmjohnson: Adding site.pp and netboo for kafka-stretch [puppet] - 10https://gerrit.wikimedia.org/r/836914 (https://phabricator.wikimedia.org/T314156) [19:29:47] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 6 hosts with reason: T313431 
[19:32:40] !log T313431 Restarting elasticsearch_7* services on `elastic207[3,4]` to pick up new master-eligible status [19:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:47] (03CR) 10Cmjohnson: [C: 03+2] Adding site.pp and netboo for kafka-stretch [puppet] - 10https://gerrit.wikimedia.org/r/836914 (https://phabricator.wikimedia.org/T314156) (owner: 10Cmjohnson) [19:40:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch1001.eqiad.wmnet with OS bullseye [19:40:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, and 3 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye [19:41:56] PROBLEM - ElasticSearch setting check - 9600 on elastic2075 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300] does not match [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [19:42:19] ryankemper: ^ [19:43:09] gehel: ack, we'll have to update the cross cluster seeds with the new masters [19:43:28] .tiouc [19:43:41] heh, fingers off by one there. [19:47:33] brett: i have a clinic duty ask. it looks like icinga has gone awol for our fundraising passive checks. history behind this happening is in T196336 - https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=fundraising [19:47:34] T196336: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 [19:48:42] there have been no new hosts or changes on our end so we may need an nsca restart to help encourage it along. 
[19:55:49] !log T313431 Restarting elasticsearch_7* services on `elastic208[1,3]` to pick up new master-eligible status [19:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:53] T313431: Increase Elastic master-eligible nodes from 3 to 5 - https://phabricator.wikimedia.org/T313431 [20:00:04] brennen and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220929T2000). [20:00:04] Jdlrobson and ebernhardson: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] \o [20:00:54] present [20:01:33] hihi, I can deploy! [20:01:43] (03PS5) 10Jdlrobson: Web team config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835246 (https://phabricator.wikimedia.org/T316568) [20:01:48] (03PS2) 10Jdlrobson: Enable desktop improvements on nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836878 (https://phabricator.wikimedia.org/T318344) [20:01:51] (03PS2) 10Jdlrobson: Add Nepalese Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836880 (https://phabricator.wikimedia.org/T318737) [20:03:01] TheresNoTime: can we steal one of these patch sets to deploy for deployment-training purposes? [20:03:23] thcipriani: sure! [20:03:47] o/ eigyan :) [20:03:53] Greetings All [20:03:59] Jdlrobson's are all relation-chained up so they'll be fun :p [20:04:01] howdy eigyan [20:04:07] TheresNoTime: thanks :) [20:05:18] in fact thcipriani, brennen - if y'all are around and deploying, are you able to take over completely? 
[20:05:32] yup [20:05:33] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:05:43] appreciate it, thanks :) [20:06:18] TheresNoTime: all can be cherry picked to master if that's easier [20:06:58] Jdlrobson: brennen / thcipriani are going to be running the deploy fwiw :) (and I don't *think* it'll be easier, but worth seeing what they say) [20:08:16] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:11:07] Jdlrobson: seems like https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/835244 needs a backport here? [20:11:22] nope that should be in production now [20:11:27] i can drop the depends on [20:11:39] kk [20:14:24] (03PS6) 10Brennen Bearnes: Web team config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835246 (https://phabricator.wikimedia.org/T316568) (owner: 10Jdlrobson) [20:15:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835246 (https://phabricator.wikimedia.org/T316568) (owner: 10Jdlrobson) [20:15:43] Jdlrobson: edited commit message, going ahead [20:16:44] (03Merged) 10jenkins-bot: Web team config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835246 (https://phabricator.wikimedia.org/T316568) (owner: 10Jdlrobson) [20:16:59] !log brennen@deploy1002 Started scap: Backport for [[gerrit:835246|Web team config cleanup (T316568)]] [20:17:03] T316568: Clean up Vector config - https://phabricator.wikimedia.org/T316568 [20:17:18] !log brennen@deploy1002 brennen and jdlrobson: Backport for [[gerrit:835246|Web team config cleanup (T316568)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:17:25] !log T313431 
Restarting elasticsearch_7* services on `elastic2086` to pick up new master-eligible status [20:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:29] T313431: Increase Elastic master-eligible nodes from 3 to 5 - https://phabricator.wikimedia.org/T313431 [20:18:48] Jdlrobson: holler when good to continue. [20:19:00] looking [20:19:04] !log Ran foreachwikiindblist wikidataclient-test extensions/Wikibase/client/maintenance/PopulateUnexpectedUnconnectedPagePageProp.php [20:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:20:43] brennen: please sync looks good [20:20:54] ack, thx [20:21:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:21:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:22:32] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:24:50] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [20:25:04] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:835246|Web team config cleanup (T316568)]] (duration: 08m 05s) [20:25:13] T316568: Clean up Vector config - https://phabricator.wikimedia.org/T316568 [20:25:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:25:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836880 (https://phabricator.wikimedia.org/T318737) (owner: 10Jdlrobson) [20:26:36] 10SRE, 10Analytics-Radar, 10Domains, 
10Traffic-Icebox, 10WMF-General-or-Unknown: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (10Krinkle) >>! In T262996#8002643, @Nemo_bis wrote: > Is this related to "T... [20:26:44] (03PS3) 10Brennen Bearnes: Enable desktop improvements on nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836878 (https://phabricator.wikimedia.org/T318344) (owner: 10Jdlrobson) [20:26:56] 10SRE, 10Analytics-Radar, 10Domains, 10Traffic-Icebox, and 2 others: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (10Krinkle) [20:27:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836880 (https://phabricator.wikimedia.org/T318737) (owner: 10Jdlrobson) [20:28:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836878 (https://phabricator.wikimedia.org/T318344) (owner: 10Jdlrobson) [20:28:13] (don't mind me, just some gerrit flailing) [20:29:20] scap backport FTW! 
[20:29:31] ^ :D [20:30:03] (Merged) jenkins-bot: Enable desktop improvements on nowikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/836878 (https://phabricator.wikimedia.org/T318344) (owner: Jdlrobson) [20:30:19] !log brennen@deploy1002 Started scap: Backport for [[gerrit:836878|Enable desktop improvements on nowikimedia (T318344)]] [20:30:23] T318344: Add nowikimedia to desktop-improvements group - https://phabricator.wikimedia.org/T318344 [20:30:38] !log brennen@deploy1002 brennen and jdlrobson: Backport for [[gerrit:836878|Enable desktop improvements on nowikimedia (T318344)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:31:21] as discussed in training, we probably could have done a couple of these at the same time. still kind of feeling out the scap backport workflow. [20:32:13] (IcingaOverload) firing: Checks are taking long to execute on alert1001:9245 - https://wikitech.wikimedia.org/wiki/Icinga#IcingaOverload - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org/?q=alertname%3DIcingaOverload [20:32:47] Jdlrobson: good to continue? [20:33:23] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4027.ulsfo.wmnet [20:34:05] brennen: 1 more minute please! [20:34:12] sure thing [20:34:25] brennen: yep lgtm!
[20:35:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:35:29] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) for hosts cp4027.ulsfo.wmnet [20:35:40] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4027.ulsfo.wmnet [20:36:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:36:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:37:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:38:23] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:836878|Enable desktop improvements on nowikimedia (T318344)]] (duration: 08m 03s) [20:38:26] T318344: Add nowikimedia to desktop-improvements group - https://phabricator.wikimedia.org/T318344 [20:38:55] (03PS1) 10Jdlrobson: Web cleanup: Labs configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836921 (https://phabricator.wikimedia.org/T316568) [20:39:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836880 (https://phabricator.wikimedia.org/T318737) (owner: 10Jdlrobson) [20:39:42] (03PS3) 10Brennen Bearnes: Add Nepalese Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836880 (https://phabricator.wikimedia.org/T318737) (owner: 10Jdlrobson) [20:39:52] (03CR) 10TrainBranchBot: "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836880 (https://phabricator.wikimedia.org/T318737) (owner: 10Jdlrobson) [20:41:02] (03Merged) 10jenkins-bot: Add Nepalese Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836880 (https://phabricator.wikimedia.org/T318737) (owner: 10Jdlrobson) [20:41:14] !log brennen@deploy1002 Started scap: Backport for [[gerrit:836880|Add Nepalese Wikipedia tagline 
(T318737)]] [20:41:20] T318737: Add tagline for Nepali Wikipedia (vector-2022) - https://phabricator.wikimedia.org/T318737 [20:41:28] !log T313431 Restarting elasticsearch_7* services on `elastic2080` to pick up new master-eligible status [20:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:31] T313431: Increase Elastic master-eligible nodes from 3 to 5 - https://phabricator.wikimedia.org/T313431 [20:41:34] !log brennen@deploy1002 brennen and jdlrobson: Backport for [[gerrit:836880|Add Nepalese Wikipedia tagline (T318737)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [20:42:10] PROBLEM - Host cp4027 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:12] brennen: okay.. this one looks like a definite revert [20:42:13] (IcingaOverload) resolved: Checks are taking long to execute on alert1001:9245 - https://wikitech.wikimedia.org/wiki/Icinga#IcingaOverload - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org/?q=alertname%3DIcingaOverload [20:42:32] !log brennen@deploy1002 Sync cancelled. [20:42:57] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (10RobH) [20:43:01] I'll leave a note on ticket, but no need to proceed with this one in this window :-) [20:43:26] brennen: that's me done. 
I was wondering if someone could +2 this beta cluster only change though on my behalf: https://gerrit.wikimedia.org/r/836921 [20:43:32] (PS2) Jdlrobson: Web cleanup: Labs configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/836921 (https://phabricator.wikimedia.org/T316568) [20:43:39] (PS1) TrainBranchBot: Revert "Add Nepalese Wikipedia tagline" [mediawiki-config] - https://gerrit.wikimedia.org/r/836922 [20:43:41] (CR) TrainBranchBot: "brennen@deploy1002 created a revert of this change as I62d1e5f95ace105b2d02743329e12377bc0a80f6" [mediawiki-config] - https://gerrit.wikimedia.org/r/836880 (https://phabricator.wikimedia.org/T318737) (owner: Jdlrobson) [20:44:16] (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/836922 (owner: TrainBranchBot) [20:45:06] Jdlrobson: sure. [20:45:09] oooh, we have a special thing for scap backport for beta changes now, IIRC: a good chance to try it out [20:45:12] Thanks brennen for your help today [20:45:17] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-stretch1001.eqiad.wmnet with OS bullseye [20:45:18] any time [20:45:22] SRE, ops-eqiad, DC-Ops, Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye... [20:45:25] (Merged) jenkins-bot: Revert "Add Nepalese Wikipedia tagline" [mediawiki-config] - https://gerrit.wikimedia.org/r/836922 (owner: TrainBranchBot) [20:45:26] thcipriani: glad to provide testing opportunities :) [20:45:29] thcipriani: please tell me this new tool lets me self-serve?
:D [20:45:33] :D [20:45:39] !log brennen@deploy1002 Started scap: Backport for [[gerrit:836922|Revert "Add Nepalese Wikipedia tagline"]] [20:45:40] brett: any chance you can look at the request above? ^^^ it's emailing us ~50 times a minute with icinga alerts. [20:45:58] !log brennen@deploy1002 brennen and trainbranchbot: Backport for [[gerrit:836922|Revert "Add Nepalese Wikipedia tagline"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:46:04] !log brennen@deploy1002 Sync cancelled. [20:46:13] dwisehaupt: ack, looking now [20:46:18] thanks. :) [20:46:25] (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/836921 (https://phabricator.wikimedia.org/T316568) (owner: Jdlrobson) [20:46:34] Oh, did you tag me because I'm listed as clinic duty? [20:46:39] I'm not, but still happy to look! [20:46:41] Jdlrobson: it's for people in the deployment group currently. Someday. [20:47:19] oh, yeah. i saw your name in the topic for clinic. if there is someone else i should reach out to please let me know.
[20:47:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:47:48] (03Merged) 10jenkins-bot: Web cleanup: Labs configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836921 (https://phabricator.wikimedia.org/T316568) (owner: 10Jdlrobson) [20:47:56] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:48:08] right now it's bblack, so if a chanop could update the topic that'd be swell :) [20:48:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:48:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:48:28] (03PS2) 10Brennen Bearnes: wmgCirrusSearchShardCount: Override prod settings for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836301 (https://phabricator.wikimedia.org/T316711) (owner: 10Ebernhardson) [20:48:47] dwisehaupt: Specifically, you're wanting approval on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/836922/ right? [20:48:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836301 (https://phabricator.wikimedia.org/T316711) (owner: 10Ebernhardson) [20:49:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:49:27] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:49:32] brett: nope. around icinga reporting all of the fr-tech services as awol. [20:49:59] (03Merged) 10jenkins-bot: wmgCirrusSearchShardCount: Override prod settings for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836301 (https://phabricator.wikimedia.org/T316711) (owner: 10Ebernhardson) [20:50:09] from above: it looks like icinga has gone awol for our fundraising passive checks. 
history behind this happening is in T196336 - https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=fundraising [20:50:09] T196336: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 [20:50:12] Thanks thcipriani brennen i look forward to the day we have a web app for deploys :) [20:50:51] or just merged == deployed :D [20:50:53] hopefully a service restart of nsca would put it right. [20:50:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836886 (owner: 10Ebernhardson) [20:51:25] and a button to stop deploys instead :D [20:52:22] (03Merged) 10jenkins-bot: cirrus: Don't configure cloud clusters for private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836886 (owner: 10Ebernhardson) [20:52:34] !log brennen@deploy1002 Started scap: Backport for [[gerrit:836886|cirrus: Don't configure cloud clusters for private wikis]] [20:52:34] !log brennen@deploy1002 scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki=aawiki --force-version "1.40.0-wmf.3" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.gcoIZ0BTKW"' returned non-zero exit status 255. 
(duration: 00m 00s) [20:53:01] bblack: If you're available, would love to work on this with you [20:53:02] well that's new [20:54:00] PROBLEM - Host cp4027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:54:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:55:09] ebernhardson: we're getting a syntax error on this one - Parse error: syntax error, unexpected ')' in /srv/mediawiki-staging/wmf-config/CirrusSearch-production.php on line 34 [20:55:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:55:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:55:56] (03PS1) 10TrainBranchBot: Revert "cirrus: Don't configure cloud clusters for private wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836928 [20:55:58] (03CR) 10TrainBranchBot: "brennen@deploy1002 created a revert of this change as I0e7656a2e2eb2b3130117ff4033461c1747bf62d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836886 (owner: 10Ebernhardson) [20:56:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836928 (owner: 10TrainBranchBot) [20:56:18] brennen: hmm, ok sec. Thought i checked :S [20:56:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:56:45] putting a revert through here. [20:56:53] trailing comma in a function call: what does CI do!? 
:D [20:57:32] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:57:32] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4027.ulsfo.wmnet [20:57:45] (03Merged) 10jenkins-bot: Revert "cirrus: Don't configure cloud clusters for private wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836928 (owner: 10TrainBranchBot) [20:58:00] !log brennen@deploy1002 Started scap: Backport for [[gerrit:836928|Revert "cirrus: Don't configure cloud clusters for private wikis"]] [20:58:01] !log T313431 Updated cross-cluster seed conf with new masters; should resolve the settings check alerts [20:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:06] T313431: Increase Elastic master-eligible nodes from 3 to 5 - https://phabricator.wikimedia.org/T313431 [20:58:19] !log brennen@deploy1002 brennen and trainbranchbot: Backport for [[gerrit:836928|Revert "cirrus: Don't configure cloud clusters for private wikis"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [20:58:19] yea i'll have to check how CI didn't see the trailing comma [20:58:37] !log brennen@deploy1002 Sync cancelled. [20:59:21] !log T313431 Repooled `elastic[2073-2074,2080-2081,2083,2086].codfw.wmnet`. Codfw's all on 5 masters now and cluster is back to green. [20:59:23] i'm going to bet ... trailing commas in unset came in 7.3, and CI is running 7.4? [20:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:45] and i'm running 7.4 as well... [21:00:16] hrm [21:00:31] deployment server is running on 7.4.30 according to php --version [21:00:34] dwisehaupt: Did you try restarting icinga already? I see you in the ticket mentioning a previous time that restarting hadn't helped that time [21:01:32] version difference seems fairly plausible. 
[21:01:37] but....php7.2 is still installed [21:01:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:01:52] thcipriani: odd, testing with 3v4l.org verifies that's valid in 7.3, and a syntax error in 7.2: https://3v4l.org/IJC7f#v7.2.34 https://3v4l.org/IJC7f#v7.3.0 [21:01:55] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:02:01] ah, scap uses php7.2 [21:02:09] 7.4 is everywhere now thcipriani [21:02:14] Scap should probably use it [21:02:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:02:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:03:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:04:01] (PS1) Ebernhardson: cirrus: Don't configure cloud clusters for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/836719 [21:04:13] (PS2) Ebernhardson: cirrus: Don't configure cloud clusters for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/836719 [21:04:20] ^ updated patch without trailing comma [21:04:22] <3 [21:04:44] (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/836719 (owner: Ebernhardson) [21:04:47] brett: i don't have the rights to restart icinga. [21:05:19] being in fr-tech, i have limited prod access outside of our rig. [21:05:23] ...batphone?
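Note on the trailing-comma failure discussed between 20:55 and 21:04: the patch parsed under newer PHP but scap's mergeMessageFileList step ran it under php7.2, where a trailing comma in a call is a parse error. A minimal sketch of that kind of version gate follows. The runtime labels (`ci`, `deploy-cli`, `scap`) and the feature table are illustrative assumptions. Only "7.4.30", "7.2.34" and "scap uses php7.2" come from the conversation, and the CI version was a guess by the participants.

```python
from typing import Dict, List, Tuple

# Assumption for illustration: feature name -> first PHP (major, minor) accepting it.
# Trailing commas in function calls, including unset(), were added in PHP 7.3.
FEATURE_MIN_VERSION: Dict[str, Tuple[int, int]] = {
    "trailing_comma_in_calls": (7, 3),
}

def parse_version(text: str) -> Tuple[int, int]:
    """Turn a version string such as '7.2.34' or '7.4' into (major, minor)."""
    parts = text.strip().split(".")
    return int(parts[0]), int(parts[1])

def supports(feature: str, runtime_version: str) -> bool:
    """Return True if the given PHP runtime version accepts the feature."""
    return parse_version(runtime_version) >= FEATURE_MIN_VERSION[feature]

def runtimes_rejecting(feature: str, runtimes: Dict[str, str]) -> List[str]:
    """List the runtime labels (hypothetical names) that would hit a parse error."""
    return [name for name, version in runtimes.items() if not supports(feature, version)]

if __name__ == "__main__":
    # Hypothetical inventory of where the config file gets parsed.
    runtimes = {"ci": "7.4", "deploy-cli": "7.4.30", "scap": "7.2.34"}
    print(runtimes_rejecting("trailing_comma_in_calls", runtimes))  # ['scap']
```

The point of the sketch is that a syntax check is only meaningful for the oldest interpreter that will actually load the file, which here was the one scap invoked rather than the one `php --version` reported.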
[21:05:42] (Merged) jenkins-bot: cirrus: Don't configure cloud clusters for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/836719 (owner: Ebernhardson) [21:05:45] nananana [21:05:56] !log brennen@deploy1002 Started scap: Backport for [[gerrit:836719|cirrus: Don't configure cloud clusters for private wikis]] [21:06:04] boom...pow...smash [21:06:16] !log brennen@deploy1002 brennen and ebernhardson: Backport for [[gerrit:836719|cirrus: Don't configure cloud clusters for private wikis]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:06:34] ebernhardson: i await your signal. :) [21:07:03] brennen: do we have a fake-private instance? I could always do a search from shell.php on a real private wiki, but i try not to :) [21:07:23] i have _no_ idea. [21:07:46] I'd restart but I've never done it before. https://wikitech.wikimedia.org/wiki/Service_restarts#Icinga mentions re-arming, which is something I'm not familiar with :( [21:08:13] So I'm concerned I might make things worse without an adult :) [21:08:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:09:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:09:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:10:01] brennen: looks reasonable to me [21:10:06] brett: keyholder needing arming would likely only be if the server got restarted. If just icinga restarted, doubt that would be needed. Don't quote me on that though and please check with an actual SRE.
[21:10:09] thx, syncing [21:10:13] brett: FWIW in the history of https://phabricator.wikimedia.org/T196336 it's been restarted several times over the last few years and there hasn't been much mention of needing to rearm [21:10:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:10:58] brett: indeed I think what RhinosF1 said makes a lot of sense [21:11:33] ryankemper: I still wouldn't do anything unless you are confident. I'd rather not be blamed for blowing icinga up and paging all of SRE [21:12:28] yeah, no need for that. i like the folks i just met last week. :) [21:13:29] Yup, probably not any harm in waiting for someone from o11y (or whoever owns icinga) to be around [21:13:49] Yep that would be o11y [21:13:53] * RhinosF1 is off for the night [21:14:19] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:836719|cirrus: Don't configure cloud clusters for private wikis]] (duration: 08m 22s) [21:14:27] !log end of utc late backport and config window [21:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:37] !log robh@cumin2002 START - Cookbook sre.dns.netbox [21:17:46] RhinosF1: Do you know if beta cluster moved to 7.4 already? [21:18:01] dancy: no idea [21:18:11] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:19:01] dancy: special:version says yes [21:19:16] I kinda remember Jo.e might have done it for testing [21:19:31] dancy: If we hadn't it'd be on fire right now. [21:19:32] https://phabricator.wikimedia.org/T271736 [21:19:35] Great. Thanks! [21:19:39] James_F: when isn't beta? [21:19:40] (And yes, moved months ago.) [21:19:45] RhinosF1: Even more on fire. [21:19:56] https://phabricator.wikimedia.org/T306042 [21:20:24] Apparently two weeks feels like months ago now. 
[21:20:58] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (RobH) [21:21:03] Did that really happen on the day I started my job? [21:21:17] I feel like 7.4 has been going on ages [21:21:40] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4045.mgmt.ulsfo.wmnet with reboot policy FORCED [21:21:44] Well, the /original/ patch of mine we landed today was C-2'ed by R.eedy two years ago. [21:21:50] So "going on for ages" isn't wrong. :-) [21:22:49] (PS1) Ahmon Dancy: scap.cfg.erb: 7.2 -> 7.4 [puppet] - https://gerrit.wikimedia.org/r/836932 (https://phabricator.wikimedia.org/T271736) [21:26:00] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4045.mgmt.ulsfo.wmnet with reboot policy FORCED [21:36:22] SRE-Access-Requests: Please add to Restricted Group - https://phabricator.wikimedia.org/T318983 (eigyan) [21:37:13] (IcingaOverload) firing: Checks are taking long to execute on alert1001:9245 - https://wikitech.wikimedia.org/wiki/Icinga#IcingaOverload - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org/?q=alertname%3DIcingaOverload [21:38:36] ^ brett was pinged about this [21:38:52] it seems like a restart does fix the issue. if there are no concerns, we can go ahead and do it [21:39:44] probably a better decision for o11y but I am not sure if someone is around from that.
herron, cwhite: ^ if you are around [21:40:55] sukhe I’m on mobile but yes please do [21:41:08] thanks herron, doing [21:41:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:41:16] brett: ^ please go ahead if you want :) [21:42:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:42:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:43:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:43:42] !log alert1001: restart icinga [21:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:22] (IcingaOverload) resolved: Checks are taking long to execute on alert1001:9245 - https://wikitech.wikimedia.org/wiki/Icinga#IcingaOverload - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org/?q=alertname%3DIcingaOverload [21:49:36] sukhe: brett: thanks! that solved our issue for fr-tech checks. [21:49:57] good to know dwisehaupt! 
[21:53:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [21:53:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [21:53:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T314041)', diff saved to https://phabricator.wikimedia.org/P35189 and previous config saved to /var/cache/conftool/dbconfig/20220929-215333-ladsgroup.json [21:53:37] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [21:53:46] (03CR) 10Brennen Bearnes: [C: 03+1] scap.cfg.erb: 7.2 -> 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/836932 (https://phabricator.wikimedia.org/T271736) (owner: 10Ahmon Dancy) [21:56:59] thcipriani: "batphone" is the state outside of working hours, where no one is specifically on call so any page notifies all of us at once :) [21:57:16] RECOVERY - ElasticSearch setting check - 9600 on elastic2075 is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23Administration [21:57:18] working hours is defined wrt the time zone of whoever is actually on call for that week, so it shifts around [21:57:33] TIL :) [21:59:06] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (10RobH) [21:59:22] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (10RobH) a:05BBlack→03RobH [22:01:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T314041)', diff saved to https://phabricator.wikimedia.org/P35190 and previous config saved to /var/cache/conftool/dbconfig/20220929-220130-ladsgroup.json [22:01:35] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [22:10:30] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:14:04] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:16:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P35191 and previous config saved to /var/cache/conftool/dbconfig/20220929-221637-ladsgroup.json [22:31:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P35192 and previous config saved to /var/cache/conftool/dbconfig/20220929-223143-ladsgroup.json [22:35:44] (03PS1) 10Jforrester: deployment-prep: remove php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/836945 [22:46:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T314041)', diff saved to https://phabricator.wikimedia.org/P35193 and 
previous config saved to /var/cache/conftool/dbconfig/20220929-224649-ladsgroup.json [22:46:55] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [23:25:01] (03CR) 10Reedy: [C: 03+1] "cherry picked and works on beta 😊" [puppet] - 10https://gerrit.wikimedia.org/r/836932 (https://phabricator.wikimedia.org/T271736) (owner: 10Ahmon Dancy) [23:25:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:44:11] (03PS1) 10Samtar: swift: Add deployment-prep_hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) [23:58:06] PROBLEM - SSH on ms-be1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook