[00:03:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:05:11] looking [00:05:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T361627)', diff saved to https://phabricator.wikimedia.org/P62132 and previous config saved to /var/cache/conftool/dbconfig/20240509-000554-marostegui.json [00:06:04] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [00:07:04] marostegui: This related to the DB stuff? [00:07:31] brett: it's 2 AM for him, that's a script :) [00:07:35] oh [00:07:50] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1028934 (owner: TrainBranchBot) [00:07:51] s8 is showing an increase in errors [00:07:52] I see a traffic spike, looks like the impact has passed, just checking for any lingering effect [00:08:16] Perhaps that's just related to the script [00:08:39] why do you say that? [00:08:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:09:28] I thought it happened earlier than I remembered, indeed it did happen at 00:00 [00:09:36] *00:03 [00:10:04] I'm not seeing an increase in traffic to varnish via the main grafana dashboard...
[00:21:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P62133 and previous config saved to /var/cache/conftool/dbconfig/20240509-002105-marostegui.json [00:26:28] FIRING: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:27:57] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:36:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P62134 and previous config saved to /var/cache/conftool/dbconfig/20240509-003614-marostegui.json [00:43:01] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:51:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T361627)', diff saved to https://phabricator.wikimedia.org/P62135 and previous config saved to /var/cache/conftool/dbconfig/20240509-005122-marostegui.json [00:51:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1219.eqiad.wmnet with reason: Maintenance [00:51:28] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [00:51:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1219.eqiad.wmnet with reason: Maintenance [00:51:46] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Depooling db1219 (T361627)', diff saved to https://phabricator.wikimedia.org/P62136 and previous config saved to /var/cache/conftool/dbconfig/20240509-005146-marostegui.json [00:56:07] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:02:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T361627)', diff saved to https://phabricator.wikimedia.org/P62137 and previous config saved to /var/cache/conftool/dbconfig/20240509-010250-marostegui.json [01:02:54] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [01:17:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P62138 and previous config saved to /var/cache/conftool/dbconfig/20240509-011758-marostegui.json [01:28:19] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 108, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:33:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P62139 and previous config saved to /var/cache/conftool/dbconfig/20240509-013305-marostegui.json [01:48:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T361627)', diff saved to https://phabricator.wikimedia.org/P62140 and previous config saved to /var/cache/conftool/dbconfig/20240509-014814-marostegui.json [01:48:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1228.eqiad.wmnet with reason: Maintenance [01:48:18] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, 
cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [01:48:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1228.eqiad.wmnet with reason: Maintenance [01:48:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1228 (T361627)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-014836-marostegui.json [01:50:13] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:59:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T361627)', diff saved to https://phabricator.wikimedia.org/P62142 and previous config saved to /var/cache/conftool/dbconfig/20240509-015909-marostegui.json [01:59:13] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [01:59:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P62143 and previous config saved to /var/cache/conftool/dbconfig/20240509-015942-ladsgroup.json [01:59:48] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [02:14:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P62144 and previous config saved to /var/cache/conftool/dbconfig/20240509-021417-marostegui.json [02:14:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P62145 and previous config saved to 
/var/cache/conftool/dbconfig/20240509-021452-ladsgroup.json [02:29:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P62146 and previous config saved to /var/cache/conftool/dbconfig/20240509-022925-marostegui.json [02:30:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P62147 and previous config saved to /var/cache/conftool/dbconfig/20240509-023000-ladsgroup.json [02:44:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T361627)', diff saved to https://phabricator.wikimedia.org/P62148 and previous config saved to /var/cache/conftool/dbconfig/20240509-024432-marostegui.json [02:44:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1232.eqiad.wmnet with reason: Maintenance [02:44:36] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [02:44:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1232.eqiad.wmnet with reason: Maintenance [02:44:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T361627)', diff saved to https://phabricator.wikimedia.org/P62149 and previous config saved to /var/cache/conftool/dbconfig/20240509-024455-marostegui.json [02:45:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P62150 and previous config saved to /var/cache/conftool/dbconfig/20240509-024508-ladsgroup.json [02:45:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [02:45:12] T352010: Gradually drop old pagelinks columns - 
https://phabricator.wikimedia.org/T352010 [02:45:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [02:45:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T352010)', diff saved to https://phabricator.wikimedia.org/P62151 and previous config saved to /var/cache/conftool/dbconfig/20240509-024531-ladsgroup.json [02:55:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T361627)', diff saved to https://phabricator.wikimedia.org/P62152 and previous config saved to /var/cache/conftool/dbconfig/20240509-025537-marostegui.json [02:55:41] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [03:03:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:10:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P62153 and previous config saved to /var/cache/conftool/dbconfig/20240509-031045-marostegui.json [03:25:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-032552-marostegui.json [03:40:13] FIRING: [5x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:41:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T361627)', diff saved to https://phabricator.wikimedia.org/P62155 and previous config 
saved to /var/cache/conftool/dbconfig/20240509-034105-marostegui.json [03:41:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1234.eqiad.wmnet with reason: Maintenance [03:41:09] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [03:41:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1234.eqiad.wmnet with reason: Maintenance [03:41:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T361627)', diff saved to https://phabricator.wikimedia.org/P62156 and previous config saved to /var/cache/conftool/dbconfig/20240509-034128-marostegui.json [03:53:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T361627)', diff saved to https://phabricator.wikimedia.org/P62157 and previous config saved to /var/cache/conftool/dbconfig/20240509-035320-marostegui.json [03:53:28] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [04:08:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P62158 and previous config saved to /var/cache/conftool/dbconfig/20240509-040830-marostegui.json [04:20:29] RECOVERY - Router interfaces on cr1-magru is OK: OK: host 195.200.68.128, interfaces up: 48, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:23:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P62159 and previous config saved to /var/cache/conftool/dbconfig/20240509-042337-marostegui.json [04:26:28] FIRING: [4x] 
JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:38:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T361627)', diff saved to https://phabricator.wikimedia.org/P62160 and previous config saved to /var/cache/conftool/dbconfig/20240509-043845-marostegui.json [04:38:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1235.eqiad.wmnet with reason: Maintenance [04:38:49] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [04:39:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1235.eqiad.wmnet with reason: Maintenance [04:39:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T361627)', diff saved to https://phabricator.wikimedia.org/P62161 and previous config saved to /var/cache/conftool/dbconfig/20240509-043908-marostegui.json [04:39:35] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 62390360 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:40:37] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 59472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:43:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:43:31] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 
208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:51:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T364067 [04:51:43] T364067: Switchover s6 master (db1173 -> db1231) - https://phabricator.wikimedia.org/T364067 [04:51:54] (PS2) Gerrit maintenance bot: mariadb: Promote db1231 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1025916 (https://phabricator.wikimedia.org/T364067) [04:52:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T364067 [04:52:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1231 with weight 0 T364067', diff saved to https://phabricator.wikimedia.org/P62162 and previous config saved to /var/cache/conftool/dbconfig/20240509-045216-marostegui.json [04:55:02] (CR) Marostegui: [C:+2] mariadb: Promote db1231 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1025916 (https://phabricator.wikimedia.org/T364067) (owner: Gerrit maintenance bot) [04:58:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1235.eqiad.wmnet with reason: Maintenance [04:58:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1235.eqiad.wmnet with reason: Maintenance [05:06:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance [05:06:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance [05:07:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62163 and previous config saved to /var/cache/conftool/dbconfig/20240509-050752-root.json
[05:08:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:13:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [05:14:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [05:22:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62164 and previous config saved to /var/cache/conftool/dbconfig/20240509-052258-root.json [05:24:35] (PS1) Marostegui: Revert "mariadb: Promote db1231 to s6 master" [puppet] - https://gerrit.wikimedia.org/r/1029249 [05:25:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [05:25:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [05:26:40] (CR) Marostegui: [C:+2] Revert "mariadb: Promote db1231 to s6 master" [puppet] - https://gerrit.wikimedia.org/r/1029249 (owner: Marostegui) [05:29:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1231', diff saved to https://phabricator.wikimedia.org/P62165 and previous config saved to /var/cache/conftool/dbconfig/20240509-052912-root.json [05:31:29] (PS1) Marostegui: db1231: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029313 [05:32:27] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1180.eqiad.wmnet onto db1231.eqiad.wmnet [05:32:28] (CR) Marostegui: [C:+2] db1231: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029313 (owner: Marostegui) [05:34:31]
(PS1) Marostegui: db1172: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029315 [05:34:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1172 T363792', diff saved to https://phabricator.wikimedia.org/P62166 and previous config saved to /var/cache/conftool/dbconfig/20240509-053442-marostegui.json [05:34:47] T363792: Upgrade s8 to MariaDB 10.6 - https://phabricator.wikimedia.org/T363792 [05:35:34] (CR) Marostegui: [C:+2] db1172: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029315 (owner: Marostegui) [05:37:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1172.eqiad.wmnet with OS bookworm [05:38:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 10%: Repooling', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-053804-root.json [05:41:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1184.eqiad.wmnet with reason: Maintenance [05:41:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1184.eqiad.wmnet with reason: Maintenance [05:49:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1172.eqiad.wmnet with reason: host reimage [05:52:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1172.eqiad.wmnet with reason: host reimage [05:53:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62167 and previous config saved to /var/cache/conftool/dbconfig/20240509-055314-root.json [05:54:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance [05:54:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason:
Maintenance [05:54:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2123 (T364299)', diff saved to https://phabricator.wikimedia.org/P62168 and previous config saved to /var/cache/conftool/dbconfig/20240509-055429-marostegui.json [05:54:33] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [05:58:01] (PS1) Marostegui: Revert "db1172: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029250 [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T0600) [06:00:05] kormat, marostegui, Amir1, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T0600). [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:08:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62169 and previous config saved to /var/cache/conftool/dbconfig/20240509-060821-root.json [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:13:06] (CR) Marostegui: [C:+2] Revert "db1172: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029250 (owner: Marostegui) [06:14:19] !log marostegui@cumin1002 END (PASS) -
Cookbook sre.hosts.reimage (exit_code=0) for host db1172.eqiad.wmnet with OS bookworm [06:14:31] (PS1) Zabe: beta: Reenable encrypted Argon2 password hashing [mediawiki-config] - https://gerrit.wikimedia.org/r/1029433 [06:14:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T364451 [06:14:58] T364451: Switchover es4 codfw master (es2020 -> es2021) - https://phabricator.wikimedia.org/T364451 [06:15:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2021 with weight 0 T364451', diff saved to https://phabricator.wikimedia.org/P62170 and previous config saved to /var/cache/conftool/dbconfig/20240509-061500-root.json [06:15:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T364451 [06:17:05] (PS1) Marostegui: mariadb: Promote es2021 to es4 codfw master [puppet] - https://gerrit.wikimedia.org/r/1029434 (https://phabricator.wikimedia.org/T364451) [06:17:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62171 and previous config saved to /var/cache/conftool/dbconfig/20240509-061742-root.json [06:17:49] (CR) Zabe: [C:+2] beta: Reenable encrypted Argon2 password hashing [mediawiki-config] - https://gerrit.wikimedia.org/r/1029433 (owner: Zabe) [06:17:53] (CR) Marostegui: [C:+2] mariadb: Promote es2021 to es4 codfw master [puppet] - https://gerrit.wikimedia.org/r/1029434 (https://phabricator.wikimedia.org/T364451) (owner: Marostegui) [06:18:31] !log Starting es4 codfw failover from es2020 to es2021 - T364451 [06:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:36] (Merged) jenkins-bot: beta: Reenable encrypted Argon2 password hashing [mediawiki-config] - https://gerrit.wikimedia.org/r/1029433 (owner: Zabe) [06:19:05] !log
marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2021 to es4 primary and set section read-write T364451', diff saved to https://phabricator.wikimedia.org/P62172 and previous config saved to /var/cache/conftool/dbconfig/20240509-061904-marostegui.json [06:19:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2020 T364451', diff saved to https://phabricator.wikimedia.org/P62173 and previous config saved to /var/cache/conftool/dbconfig/20240509-061957-root.json [06:20:02] T364451: Switchover es4 codfw master (es2020 -> es2021) - https://phabricator.wikimedia.org/T364451 [06:20:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Give some weight to es4 codfw master', diff saved to https://phabricator.wikimedia.org/P62174 and previous config saved to /var/cache/conftool/dbconfig/20240509-062027-marostegui.json [06:23:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62175 and previous config saved to /var/cache/conftool/dbconfig/20240509-062327-root.json [06:24:13] (PS1) Marostegui: es2020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029435 [06:24:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2020.codfw.wmnet with OS bookworm [06:24:52] (CR) Marostegui: [C:+2] es2020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029435 (owner: Marostegui) [06:29:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T364299)', diff saved to https://phabricator.wikimedia.org/P62176 and previous config saved to /var/cache/conftool/dbconfig/20240509-062926-marostegui.json [06:29:31] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [06:32:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62177 and previous config saved to
/var/cache/conftool/dbconfig/20240509-063248-root.json [06:33:41] (PS1) Marostegui: Revert "db1231: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029251 [06:34:48] (CR) Marostegui: [C:+2] Revert "db1231: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029251 (owner: Marostegui) [06:35:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1180.eqiad.wmnet onto db1231.eqiad.wmnet [06:35:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62178 and previous config saved to /var/cache/conftool/dbconfig/20240509-063514-root.json [06:36:53] (PS1) Gerrit maintenance bot: mariadb: Promote db1231 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1028935 (https://phabricator.wikimedia.org/T364523) [06:36:57] (PS1) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1028936 (https://phabricator.wikimedia.org/T364523) [06:38:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62179 and previous config saved to /var/cache/conftool/dbconfig/20240509-063832-root.json [06:38:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62180 and previous config saved to /var/cache/conftool/dbconfig/20240509-063845-root.json [06:44:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P62181 and previous config saved to /var/cache/conftool/dbconfig/20240509-064434-marostegui.json [06:47:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2020.codfw.wmnet with reason: host reimage [06:47:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 10%:
Repooling', diff saved to https://phabricator.wikimedia.org/P62182 and previous config saved to /var/cache/conftool/dbconfig/20240509-064754-root.json [06:50:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62183 and previous config saved to /var/cache/conftool/dbconfig/20240509-065020-root.json [06:50:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2020.codfw.wmnet with reason: host reimage [06:54:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 5%: Repooling', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-065355-root.json [06:59:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P62185 and previous config saved to /var/cache/conftool/dbconfig/20240509-065941-marostegui.json [07:00:05] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T0700). [07:00:05] James_F and DreamRimmer: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:03:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62186 and previous config saved to /var/cache/conftool/dbconfig/20240509-070300-root.json [07:04:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:04:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:04:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 213, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:05:01] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:05:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:05:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62187 and previous config saved to /var/cache/conftool/dbconfig/20240509-070526-root.json [07:05:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51924 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:09:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62188 and previous config saved to /var/cache/conftool/dbconfig/20240509-070905-root.json [07:10:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:11:23] SRE-swift-storage, Commons: Commons: File:Gnome-edit-delete.svg not found - https://phabricator.wikimedia.org/T363995#9782546 (jcrespo) Open→Resolved a:jcrespo [07:14:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2020.codfw.wmnet with OS bookworm [07:14:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T364299)', diff saved to https://phabricator.wikimedia.org/P62189 and previous config saved to /var/cache/conftool/dbconfig/20240509-071449-marostegui.json [07:14:52] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:14:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance [07:14:57] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:15:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance [07:15:06] (PS1) Abijeet Patro: Fix error when marking a new page for translations [extensions/Translate] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029257 (https://phabricator.wikimedia.org/T364522) [07:15:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [07:15:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [07:15:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T364299)', diff saved to https://phabricator.wikimedia.org/P62190 and previous config saved to /var/cache/conftool/dbconfig/20240509-071527-marostegui.json [07:15:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51925 bytes in 9.869 
second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:16:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:16:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:17:08] (CR) Abijeet Patro: [C:+1] Fix error when marking a new page for translations [extensions/Translate] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029257 (https://phabricator.wikimedia.org/T364522) (owner: Abijeet Patro) [07:17:26] hello deployers, there is currently a UBN! (https://phabricator.wikimedia.org/T364522) that's blocking pages from being marked for translation. We have a patch that fixes the issue, but given CI times, it'll take a while to get merged: 1029257: Fix error when marking a new page for translations | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1029257 [07:18:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62191 and previous config saved to /var/cache/conftool/dbconfig/20240509-071805-root.json [07:18:26] (CR) Volans: [C:+1] "LGTM, optional nit inline" [puppet] - https://gerrit.wikimedia.org/r/1029296 (https://phabricator.wikimedia.org/T363924) (owner: Scott French) [07:18:27] We might miss the UTC morning backport window, but it would be nice to have this fix deployed given the severe impact of the issue. 
[07:19:40] jouncebot: nowandnext [07:19:41] For the next 0 hour(s) and 40 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T0700) [07:19:41] In 2 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1000) [07:20:23] abijeet: I can deploy it if you can test? [07:20:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62192 and previous config saved to /var/cache/conftool/dbconfig/20240509-072032-root.json [07:20:33] zabe, thanks. I'm around to test. [07:20:39] (CR) Zabe: [C:+2] Fix error when marking a new page for translations [extensions/Translate] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029257 (https://phabricator.wikimedia.org/T364522) (owner: Abijeet Patro) [07:22:56] (CR) Zabe: [C:+2] Move wgGroupsAddToSelf and wgGroupsRemoveFromSelf to core-Permissions [mediawiki-config] - https://gerrit.wikimedia.org/r/1029200 (owner: Zabe) [07:23:10] zabe, added to the backport window: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T0700 [07:23:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:23:46] (Merged) jenkins-bot: Move wgGroupsAddToSelf and wgGroupsRemoveFromSelf to core-Permissions [mediawiki-config] - https://gerrit.wikimedia.org/r/1029200 (owner: Zabe) [07:24:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Repooling', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-072411-root.json [07:24:36] !log zabe@deploy1002 Started scap: Backport for 
[[gerrit:1029200|Move wgGroupsAddToSelf and wgGroupsRemoveFromSelf to core-Permissions]] [07:24:59] thx [07:28:33] !log zabe@deploy1002 zabe: Backport for [[gerrit:1029200|Move wgGroupsAddToSelf and wgGroupsRemoveFromSelf to core-Permissions]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:29:11] !log zabe@deploy1002 zabe: Continuing with sync [07:33:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62194 and previous config saved to /var/cache/conftool/dbconfig/20240509-073311-root.json [07:33:55] PROBLEM - Check whether ferm is active by checking the default input chain on mw1381 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:34:03] PROBLEM - Check whether ferm is active by checking the default input chain on parse1022 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:34:15] PROBLEM - Check whether ferm is active by checking the default input chain on mw1469 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:35:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62195 and previous config saved to /var/cache/conftool/dbconfig/20240509-073537-root.json [07:37:11] PROBLEM - Check whether ferm is active by checking the default input chain on mw1435 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:37:11] PROBLEM - Check whether ferm is active by checking the default input chain on parse1008 is CRITICAL: ERROR 
ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:39:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62196 and previous config saved to /var/cache/conftool/dbconfig/20240509-073922-root.json [07:41:34] (Merged) jenkins-bot: Fix error when marking a new page for translations [extensions/Translate] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029257 (https://phabricator.wikimedia.org/T364522) (owner: Abijeet Patro) [07:41:37] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:42:14] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1029200|Move wgGroupsAddToSelf and wgGroupsRemoveFromSelf to core-Permissions]] (duration: 17m 37s) [07:42:30] (PS1) Marostegui: Revert "es2020: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029258 [07:43:03] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1029257|Fix error when marking a new page for translations (T364522)]] [07:43:06] T364522: Internal error when trying to mark a page for translation not yet in translation system - https://phabricator.wikimedia.org/T364522 [07:43:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Fully repool db1172', diff saved to https://phabricator.wikimedia.org/P62197 and previous config saved to /var/cache/conftool/dbconfig/20240509-074355-marostegui.json [07:44:01] (CR) Marostegui: [C:+2] Revert "es2020: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029258 (owner: Marostegui) [07:44:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 1%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P62198 and previous config saved to /var/cache/conftool/dbconfig/20240509-074408-root.json [07:45:42] !log zabe@deploy1002 zabe and abi: Backport for [[gerrit:1029257|Fix error when marking a new page for translations (T364522)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:45:57] abijeet: could you test?:) [07:47:07] Sure [07:49:27] zabe, tested. Looks good. [07:49:57] cool, syncing [07:50:01] !log zabe@deploy1002 zabe and abi: Continuing with sync [07:50:17] Argh, I had the deploy window in my calendar with the wrong hour, sorry! [07:50:41] Will deploy it later instead. [07:50:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62199 and previous config saved to /var/cache/conftool/dbconfig/20240509-075043-root.json [07:51:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T364299)', diff saved to https://phabricator.wikimedia.org/P62200 and previous config saved to /var/cache/conftool/dbconfig/20240509-075118-marostegui.json [07:51:23] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:54:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62201 and previous config saved to /var/cache/conftool/dbconfig/20240509-075429-root.json [07:59:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62202 and previous config saved to /var/cache/conftool/dbconfig/20240509-075914-root.json [08:02:32] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1029257|Fix error when marking a new page for translations (T364522)]] (duration: 19m 28s) [08:02:37] T364522: Internal error when trying to mark a page for translation not yet in translation system - 
https://phabricator.wikimedia.org/T364522 [08:03:35] abijeet: fix should be live [08:03:53] zabe, thanks! I just verified that it works as expected [08:03:55] RECOVERY - Check whether ferm is active by checking the default input chain on mw1381 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:04:03] RECOVERY - Check whether ferm is active by checking the default input chain on parse1022 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:04:03] cool, yw [08:04:15] RECOVERY - Check whether ferm is active by checking the default input chain on mw1469 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:05:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62203 and previous config saved to /var/cache/conftool/dbconfig/20240509-080549-root.json [08:06:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P62204 and previous config saved to /var/cache/conftool/dbconfig/20240509-080627-marostegui.json [08:07:11] RECOVERY - Check whether ferm is active by checking the default input chain on mw1435 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:07:11] RECOVERY - Check whether ferm is active by checking the default input chain on parse1008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:08:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:09:38] !log marostegui@cumin1002 dbctl commit (dc=all): 
'db1180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62205 and previous config saved to /var/cache/conftool/dbconfig/20240509-080936-root.json [08:13:23] !log set batphone oncall for May 9th - T350192 [08:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:26] T350192: On-call batphone escalation configuration holidays FY2023-24 - https://phabricator.wikimedia.org/T350192 [08:14:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62206 and previous config saved to /var/cache/conftool/dbconfig/20240509-081422-root.json [08:16:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:18:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:21:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P62207 and previous config saved to /var/cache/conftool/dbconfig/20240509-082135-marostegui.json [08:26:28] FIRING: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:29:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62208 and previous config saved to /var/cache/conftool/dbconfig/20240509-082927-root.json [08:30:48] 
!log set batphone oncall for May 9th only for EMEA, not Americas - T350192 [08:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:52] T350192: On-call batphone escalation configuration holidays FY2023-24 - https://phabricator.wikimedia.org/T350192 [08:36:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T364299)', diff saved to https://phabricator.wikimedia.org/P62209 and previous config saved to /var/cache/conftool/dbconfig/20240509-083643-marostegui.json [08:36:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance [08:36:46] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [08:36:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance [08:37:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T364299)', diff saved to https://phabricator.wikimedia.org/P62210 and previous config saved to /var/cache/conftool/dbconfig/20240509-083705-marostegui.json [08:44:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62211 and previous config saved to /var/cache/conftool/dbconfig/20240509-084433-root.json [08:53:41] !log deploy new grants for es6, es7 backups T363812 [08:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:45] T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812 [08:54:53] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [08:59:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62212 and previous config saved to /var/cache/conftool/dbconfig/20240509-085939-root.json 
[09:00:07] (PS4) Jcrespo: dbbackups: Add backups for es6 and es7 [puppet] - https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) [09:02:25] (PS1) Fabfur: cache:benthos: move processors in the pipeline section [puppet] - https://gerrit.wikimedia.org/r/1029480 (https://phabricator.wikimedia.org/T364379) [09:04:59] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [09:07:06] (CR) Jcrespo: [C:+2] dbbackups: Add backups for es6 and es7 [puppet] - https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo) [09:07:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T364299)', diff saved to https://phabricator.wikimedia.org/P62213 and previous config saved to /var/cache/conftool/dbconfig/20240509-090726-marostegui.json [09:07:30] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [09:08:23] SRE, Machine-Learning-Team, MW-on-K8s, serviceops, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9782652 (jijiki) [09:14:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T352010)', diff saved to https://phabricator.wikimedia.org/P62214 and previous config saved to /var/cache/conftool/dbconfig/20240509-091413-ladsgroup.json [09:14:18] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:14:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62215 and previous config saved to /var/cache/conftool/dbconfig/20240509-091445-root.json [09:16:26] (CR) Fabfur: [V:+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2357/console" [puppet] - https://gerrit.wikimedia.org/r/1029480 
(https://phabricator.wikimedia.org/T364379) (owner: Fabfur) [09:22:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P62216 and previous config saved to /var/cache/conftool/dbconfig/20240509-092234-marostegui.json [09:27:34] (PS1) Marostegui: db1167: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029482 [09:27:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1167', diff saved to https://phabricator.wikimedia.org/P62217 and previous config saved to /var/cache/conftool/dbconfig/20240509-092757-root.json [09:28:28] (CR) Marostegui: [C:+2] db1167: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029482 (owner: Marostegui) [09:29:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P62218 and previous config saved to /var/cache/conftool/dbconfig/20240509-092921-ladsgroup.json [09:29:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1167.eqiad.wmnet with OS bookworm [09:31:03] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1150.eqiad.wmnet with reason: upgrade to 10.6 [09:31:16] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: upgrade to 10.6 [09:31:33] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1171.eqiad.wmnet with reason: upgrade to 10.6 [09:31:47] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: upgrade to 10.6 [09:32:14] (CR) Dreamrimmer: [C:+1] ContentTranslation: Update publishing setting for cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1025300 (https://phabricator.wikimedia.org/T353049) (owner: KartikMistry) [09:33:06] (CR) Jcrespo: [C:+2] mariadb: Upgrade db1150, db1171 
and move s4, s7, s8 backups to 10.6 [puppet] - https://gerrit.wikimedia.org/r/1025710 (https://phabricator.wikimedia.org/T362509) (owner: Jcrespo) [09:33:14] (PS3) Jcrespo: mariadb: Upgrade db1150, db1171 and move s4, s7, s8 backups to 10.6 [puppet] - https://gerrit.wikimedia.org/r/1025710 (https://phabricator.wikimedia.org/T362509) [09:33:37] (PS2) Lucas Werkmeister (WMDE): Disable ParserMigration on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1027194 (https://phabricator.wikimedia.org/T364228) [09:34:10] (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1027194 (https://phabricator.wikimedia.org/T364228) (owner: Lucas Werkmeister (WMDE)) [09:35:41] (Merged) jenkins-bot: Disable ParserMigration on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1027194 (https://phabricator.wikimedia.org/T364228) (owner: Lucas Werkmeister (WMDE)) [09:36:08] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1027194|Disable ParserMigration on commonswiki (T364228)]] [09:36:11] T364228: Parsoid read views show empty SDC data - https://phabricator.wikimedia.org/T364228 [09:37:05] (CR) Jcrespo: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1025710 (https://phabricator.wikimedia.org/T362509) (owner: Jcrespo) [09:37:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P62219 and previous config saved to /var/cache/conftool/dbconfig/20240509-093742-marostegui.json [09:38:50] !log jforrester@deploy1002 lucaswerkmeister-wmde and jforrester: Backport for [[gerrit:1027194|Disable ParserMigration on commonswiki (T364228)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:39:12] !log jforrester@deploy1002 lucaswerkmeister-wmde and jforrester: Continuing with sync [09:40:03] (PS1) Marostegui: Revert 
"db1167: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029261 [09:43:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1167.eqiad.wmnet with reason: host reimage [09:43:48] PROBLEM - Check whether ferm is active by checking the default input chain on mw1375 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:43:58] PROBLEM - Check whether ferm is active by checking the default input chain on mw2395 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:44:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-094431-ladsgroup.json [09:45:40] PROBLEM - Check whether ferm is active by checking the default input chain on mw1470 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:45:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1167.eqiad.wmnet with reason: host reimage [09:48:03] SRE, SRE-swift-storage, Patch-For-Review: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9782732 (jijiki) @andrea.denisse please give me a headsup on IRC/slack to sync up, when you are planning on switching thanos-fe to cfssl, so we can kee... [09:52:25] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1027194|Disable ParserMigration on commonswiki (T364228)]] (duration: 16m 17s) [09:52:28] T364228: Parsoid read views show empty SDC data - https://phabricator.wikimedia.org/T364228 [09:52:31] Finally! 
[09:52:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T364299)', diff saved to https://phabricator.wikimedia.org/P62220 and previous config saved to /var/cache/conftool/dbconfig/20240509-095249-marostegui.json [09:52:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [09:52:54] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [09:53:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [09:53:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T364299)', diff saved to https://phabricator.wikimedia.org/P62221 and previous config saved to /var/cache/conftool/dbconfig/20240509-095313-marostegui.json [09:59:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T352010)', diff saved to https://phabricator.wikimedia.org/P62222 and previous config saved to /var/cache/conftool/dbconfig/20240509-095943-ladsgroup.json [09:59:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [09:59:48] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:59:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1000) [10:00:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T352010)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-100006-ladsgroup.json [10:01:59] (CR) EoghanGaffney: [V:+1 C:+2] apt-staging: Add timer for gitlab package puller job [puppet] - 
https://gerrit.wikimedia.org/r/1026699 (https://phabricator.wikimedia.org/T347004) (owner: EoghanGaffney) [10:03:39] (CR) Marostegui: [C:+2] Revert "db1167: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029261 (owner: Marostegui) [10:04:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62224 and previous config saved to /var/cache/conftool/dbconfig/20240509-100405-root.json [10:06:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1167.eqiad.wmnet with OS bookworm [10:12:53] (PS1) Marostegui: es2038: No longer in setup [puppet] - https://gerrit.wikimedia.org/r/1029488 [10:13:48] RECOVERY - Check whether ferm is active by checking the default input chain on mw1375 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:13:58] RECOVERY - Check whether ferm is active by checking the default input chain on mw2395 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:14:25] (CR) Marostegui: [C:+2] es2038: No longer in setup [puppet] - https://gerrit.wikimedia.org/r/1029488 (owner: Marostegui) [10:15:40] RECOVERY - Check whether ferm is active by checking the default input chain on mw1470 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:19:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62225 and previous config saved to /var/cache/conftool/dbconfig/20240509-101911-root.json [10:25:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T364299)', diff saved to https://phabricator.wikimedia.org/P62226 and previous config saved to /var/cache/conftool/dbconfig/20240509-102512-marostegui.json [10:25:17] T364299: Make 
rc_id a bigint - https://phabricator.wikimedia.org/T364299 [10:31:43] sre-alert-triage: Alert in need of triage: PybalBackendDown (instance elastic2090:0) - https://phabricator.wikimedia.org/T364528 (LSobanski) NEW [10:32:04] sre-alert-triage: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T364529 (LSobanski) NEW [10:32:30] (PS1) Santiago Faci: edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - https://gerrit.wikimedia.org/r/1029493 [10:34:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62227 and previous config saved to /var/cache/conftool/dbconfig/20240509-103417-root.json [10:35:26] (CR) Fabfur: [C:+2] fifo-log-demux: removed unused resources [puppet] - https://gerrit.wikimedia.org/r/1029191 (https://phabricator.wikimedia.org/T355905) (owner: Fabfur) [10:39:53] (PS1) Santiago Faci: Revert "mediawiki_history_reduced_snaphost automation: Updating editor-analytics" [deployment-charts] - https://gerrit.wikimedia.org/r/1029263 [10:40:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P62228 and previous config saved to /var/cache/conftool/dbconfig/20240509-104019-marostegui.json [10:41:33] (PS2) Btullis: Revert "mediawiki_history_reduced_snaphost automation: Updating editor-analytics" [deployment-charts] - https://gerrit.wikimedia.org/r/1029263 (https://phabricator.wikimedia.org/T355408) (owner: Santiago Faci) [10:46:01] (CR) Btullis: [C:+1] "Looks good to me." [deployment-charts] - https://gerrit.wikimedia.org/r/1029263 (https://phabricator.wikimedia.org/T355408) (owner: Santiago Faci) [10:47:20] (CR) Aklapper: [C:+1] "Thanks! 
This looks correct and I get the same results locally" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1022053 (https://phabricator.wikimedia.org/T318763) (owner: Pppery) [10:49:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62229 and previous config saved to /var/cache/conftool/dbconfig/20240509-104922-root.json [10:50:53] (CR) Filippo Giunchedi: [C:+1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/1029480 (https://phabricator.wikimedia.org/T364379) (owner: Fabfur) [10:52:01] (CR) Btullis: [C:+2] Drop the deprecated dumps fetcher that pulls from stat1007 [puppet] - https://gerrit.wikimedia.org/r/1029176 (https://phabricator.wikimedia.org/T353785) (owner: Btullis) [10:53:44] sre-alert-triage, SRE Observability: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T364529#9782931 (LSobanski) [10:55:09] sre-alert-triage, SRE Observability (FY2023/2024-Q4): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#9782948 (LSobanski) [10:55:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P62230 and previous config saved to /var/cache/conftool/dbconfig/20240509-105527-marostegui.json [10:55:31] sre-alert-triage, SRE Observability: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T364529#9782946 (LSobanski) →Duplicate dup:T354255 [10:56:32] sre-alert-triage, Data-Platform-SRE: Alert in need of triage: PybalBackendDown (instance elastic2090:0) - https://phabricator.wikimedia.org/T364528#9782956 (LSobanski) [10:57:07] (CR) Btullis: [C:+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol) [11:01:51] (03CR) 10Santiago Faci: [C:03+2] Revert "mediawiki_history_reduced_snaphost automation: Updating editor-analytics" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029263 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci) [11:02:55] (03Merged) 10jenkins-bot: Revert "mediawiki_history_reduced_snaphost automation: Updating editor-analytics" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029263 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci) [11:02:59] (03PS5) 10Btullis: hadoop: make analytics DB password available to analytics-product user [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol) [11:03:30] (03CR) 10Btullis: "I updated the commit message a bit to refer to the correct user/group." [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol) [11:04:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62231 and previous config saved to /var/cache/conftool/dbconfig/20240509-110430-root.json [11:05:30] (03PS2) 10Santiago Faci: edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493 [11:05:37] (03CR) 10CI reject: [V:04-1] edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493 (owner: 10Santiago Faci) [11:05:56] (03CR) 10CI reject: [V:04-1] hadoop: make analytics DB password available to analytics-product user [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol) [11:09:49] (03PS1) 10Majavah: site: Move cloudnet2007/8-dev back to insetup [puppet] - 
10https://gerrit.wikimedia.org/r/1029496 (https://phabricator.wikimedia.org/T358761) [11:09:51] (03PS1) 10Majavah: site: Move cloudnet2005-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029497 (https://phabricator.wikimedia.org/T358761) [11:09:53] (03PS1) 10Majavah: site: Move cloudnet2006-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761) [11:10:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T364299)', diff saved to https://phabricator.wikimedia.org/P62232 and previous config saved to /var/cache/conftool/dbconfig/20240509-111037-marostegui.json [11:10:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance [11:10:41] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [11:10:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance [11:11:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T364299)', diff saved to https://phabricator.wikimedia.org/P62233 and previous config saved to /var/cache/conftool/dbconfig/20240509-111100-marostegui.json [11:11:48] (03PS6) 10Btullis: hadoop: make analytics DB password available to analytics-product user [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol) [11:19:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62234 and previous config saved to /var/cache/conftool/dbconfig/20240509-111936-root.json [11:34:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62235 and previous config saved to /var/cache/conftool/dbconfig/20240509-113443-root.json [11:35:17] (03PS3) 10Santiago Faci: 
edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493 [11:35:24] (03CR) 10CI reject: [V:04-1] edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493 (owner: 10Santiago Faci) [11:35:42] (03PS4) 10Santiago Faci: edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493 [11:35:49] (03CR) 10CI reject: [V:04-1] edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493 (owner: 10Santiago Faci) [11:36:34] (03Abandoned) 10Santiago Faci: edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493 (owner: 10Santiago Faci) [11:39:33] (03PS1) 10Santiago Faci: edit*-analytics: Updating the mediawiki history reduced snaphost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029504 [11:41:06] (03CR) 10Btullis: [C:03+1] "Looks good. Hopefully this will be the very last time we have to do it." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1029504 (owner: 10Santiago Faci) [11:41:37] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:41:44] (03CR) 10Santiago Faci: [C:03+2] edit*-analytics: Updating the mediawiki history reduced snaphost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029504 (owner: 10Santiago Faci) [11:42:14] (03PS1) 10Jforrester: Don't define wmgUseListings, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029506 [11:42:44] (03Merged) 10jenkins-bot: edit*-analytics: Updating the mediawiki history reduced snaphost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029504 (owner: 10Santiago Faci) [11:43:45] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [11:44:17] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [11:44:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T364299)', diff saved to https://phabricator.wikimedia.org/P62236 and previous config saved to /var/cache/conftool/dbconfig/20240509-114417-marostegui.json [11:44:23] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [11:45:26] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [11:45:47] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [11:45:59] !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [11:46:36] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [11:47:18] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [11:48:21] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: 
apply [11:49:13] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [11:49:29] (03CR) 10Majavah: [C:03+2] site: Move cloudnet2007/8-dev back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1029496 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [11:49:36] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [11:50:02] !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [11:50:19] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [11:50:28] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudnet2007-dev.codfw.wmnet with OS bookworm [11:51:03] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudnet2008-dev.codfw.wmnet with OS bookworm [11:51:27] (03PS1) 10Fabfur: cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029508 (https://phabricator.wikimedia.org/T364379) [11:52:32] (03PS1) 10Btullis: Move snapshot1009 to insetup::data_engineering [puppet] - 10https://gerrit.wikimedia.org/r/1029509 (https://phabricator.wikimedia.org/T364456) [11:59:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P62237 and previous config saved to /var/cache/conftool/dbconfig/20240509-115925-marostegui.json [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1200) [12:09:40] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage [12:09:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1192', diff saved to https://phabricator.wikimedia.org/P62239 and previous config saved to /var/cache/conftool/dbconfig/20240509-120955-root.json [12:10:11] !log taavi@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on cloudnet2008-dev.codfw.wmnet with reason: host reimage [12:11:01] (03PS1) 10Marostegui: db1192: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1029516 [12:11:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1192.eqiad.wmnet with OS bookworm [12:11:39] (03CR) 10Marostegui: [C:03+2] db1192: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1029516 (owner: 10Marostegui) [12:12:34] (03PS1) 10Ladsgroup: Return array from LocalAuth::getCentralLists [extensions/SecurePoll] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029265 (https://phabricator.wikimedia.org/T364538) [12:12:57] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage [12:13:12] jouncebot: nowandnext [12:13:13] For the next 0 hour(s) and 46 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1200) [12:13:13] In 0 hour(s) and 46 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1300) [12:13:23] (03CR) 10Ladsgroup: [C:03+2] Return array from LocalAuth::getCentralLists [extensions/SecurePoll] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029265 (https://phabricator.wikimedia.org/T364538) (owner: 10Ladsgroup) [12:13:27] oh, securepoll being broken during an election [12:13:34] this never happened before [12:14:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P62240 and previous config saved to /var/cache/conftool/dbconfig/20240509-121433-marostegui.json [12:16:05] (03Merged) 10jenkins-bot: Return array from LocalAuth::getCentralLists [extensions/SecurePoll] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029265 (https://phabricator.wikimedia.org/T364538) (owner: 10Ladsgroup) [12:16:21] !log
taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2008-dev.codfw.wmnet with reason: host reimage [12:17:13] zabe: did we ever get to removing the labtestwikitech hack from there? [12:18:06] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1029265|Return array from LocalAuth::getCentralLists (T364538)]] [12:18:10] T364538: Voting in U4C election is not possible anymore - https://phabricator.wikimedia.org/T364538 [12:18:40] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:18:46] yeah I actually think it got removed a few months ago (but I wasn't involved) [12:19:21] ok [12:19:28] 6 weeks ago [12:19:29] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/854528 [12:19:52] and actually also in a rather hacky way [12:20:48] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1029265|Return array from LocalAuth::getCentralLists (T364538)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:21:47] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [12:22:57] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2362/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [12:24:09] (03PS2) 10Majavah: site: Move cloudnet2005-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029497 (https://phabricator.wikimedia.org/T358761) [12:24:09] (03PS2) 10Majavah: site: Move cloudnet2006-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761) [12:24:42] !log marostegui@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage [12:25:32] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2363/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029497 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [12:26:02] PROBLEM - Check whether ferm is active by checking the default input chain on parse1007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:26:07] (03PS1) 10Marostegui: Revert "db1192: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1029546 [12:26:28] FIRING: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:27:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage [12:27:51] (03CR) 10Filippo Giunchedi: [C:03+1] "Patch LGTM, PCC needs to run on titan hosts which now do show a diff: https://puppet-compiler.wmflabs.org/output/1028876/2362/" [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [12:28:39] PROBLEM - Check whether ferm is active by checking the default input chain on mw1419 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:28:39] PROBLEM - Check whether ferm is active by checking the default input chain on mw1384 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:28:43] PROBLEM - Check whether ferm is active by checking the default input chain on mw1453 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:29:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T364299)', diff saved to https://phabricator.wikimedia.org/P62241 and previous config saved to /var/cache/conftool/dbconfig/20240509-122941-marostegui.json [12:29:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2192.codfw.wmnet with reason: Maintenance [12:29:47] PROBLEM - Check whether ferm is active by checking the default input chain on mw1463 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:29:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2192.codfw.wmnet with reason: Maintenance [12:30:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T364299)', diff saved to https://phabricator.wikimedia.org/P62242 and previous config saved to /var/cache/conftool/dbconfig/20240509-123004-marostegui.json [12:30:50] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2007-dev.codfw.wmnet with OS bookworm [12:31:33] (03PS14) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [12:32:04] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [12:33:21] (03PS15) 10Paladox: Allow users to recheck tests in checkers 
[software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [12:33:45] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2008-dev.codfw.wmnet with OS bookworm [12:33:53] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [12:34:17] (03CR) 10Majavah: [V:03+1 C:03+2] site: Move cloudnet2005-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029497 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [12:34:48] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1029265|Return array from LocalAuth::getCentralLists (T364538)]] (duration: 16m 41s) [12:35:38] (03PS16) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [12:36:11] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [12:37:21] (03PS17) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [12:37:54] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [12:38:14] (03PS18) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [12:38:49] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers 
[software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [12:39:56] (03PS19) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [12:40:20] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Create cookbook to rebuild an MD RAID array upon disk replacement - https://phabricator.wikimedia.org/T364540 (10Volans) 03NEW p:05Triage→03Medium [12:40:33] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [12:44:22] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudnet2005-dev.codfw.wmnet with OS bookworm [12:44:47] (03CR) 10Marostegui: [C:03+2] Revert "db1192: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1029546 (owner: 10Marostegui) [12:44:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62243 and previous config saved to /var/cache/conftool/dbconfig/20240509-124449-root.json [12:45:37] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1192 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1028939 (https://phabricator.wikimedia.org/T364541) [12:45:41] (03PS1) 10Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/1028940 (https://phabricator.wikimedia.org/T364541) [12:48:23] (03PS2) 10Elukey: role::swift::proxy: move codfw envoys to PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412) [12:49:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1192.eqiad.wmnet with OS bookworm [12:50:33] (03CR) 10Elukey: 
[V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [12:50:55] !log depool/upgrade/repool ms-fe20[09-14] to upgrade envoy to TLS PKI certs [12:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:29] (03PS1) 10Fabfur: benthos: allow stdin/stdout/stderr in systemd template [puppet] - 10https://gerrit.wikimedia.org/r/1029529 [12:52:32] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2009.codfw.wmnet [12:52:38] (03CR) 10Elukey: [V:03+1 C:03+2] role::swift::proxy: move codfw envoys to PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [12:52:47] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:52:47] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:53:05] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:56:03] RECOVERY - Check whether ferm is active by checking the default input chain on parse1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:58:16] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2009.codfw.wmnet [12:58:39] RECOVERY - Check whether ferm is active by checking the default input chain on mw1419 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:58:39] RECOVERY - Check whether ferm is active by checking the default input chain on mw1384 is OK: OK ferm input default policy is set 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:58:43] RECOVERY - Check whether ferm is active by checking the default input chain on mw1453 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:58:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T364299)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-125843-marostegui.json [12:58:55] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:58:57] (03PS20) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [12:59:11] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2010.codfw.wmnet [12:59:12] (03PS2) 10Fabfur: benthos: allow stdin/stdout/stderr in systemd template [puppet] - 10https://gerrit.wikimedia.org/r/1029529 [12:59:31] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [12:59:48] RECOVERY - Check whether ferm is active by checking the default input chain on mw1463 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:59:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62244 and previous config saved to /var/cache/conftool/dbconfig/20240509-125955-root.json [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1300) [13:00:05] DreamRimmer: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:26] I am around [13:00:43] (03PS21) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [13:01:33] (03CR) 10Paladox: Allow users to recheck tests in checkers (036 comments) [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [13:03:25] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage [13:03:32] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2010.codfw.wmnet [13:04:01] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2011.codfw.wmnet [13:04:48] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:05:04] (03PS1) 10Elukey: Add fake TLS keystore password for Cassandra clusters [labs/private] - 10https://gerrit.wikimedia.org/r/1029538 (https://phabricator.wikimedia.org/T352647) [13:05:08] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:05:53] RECOVERY - BFD status on cr2-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:06:09] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage [13:07:46] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: 
name=ms-fe2011.codfw.wmnet [13:08:07] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2012.codfw.wmnet [13:12:02] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2012.codfw.wmnet [13:13:55] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2013.codfw.wmnet [13:13:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P62245 and previous config saved to /var/cache/conftool/dbconfig/20240509-131355-marostegui.json [13:15:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62246 and previous config saved to /var/cache/conftool/dbconfig/20240509-131501-root.json [13:16:21] who is the deployer today? [13:17:29] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2013.codfw.wmnet [13:17:45] (03CR) 10CDanis: [C:03+1] cache:benthos: move processors in the pipeline section [puppet] - 10https://gerrit.wikimedia.org/r/1029480 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [13:19:02] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2014.codfw.wmnet [13:20:09] DreamRimmer: I'm a WMF employee (but not a deployer), I'll see if I can raise somebody on slack. 
[13:23:04] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2014.codfw.wmnet [13:24:20] DreamRimmer: TheresNoTime will be here in 5 minutes [13:24:49] (o/ one moment) [13:24:59] jouncebot: nowandnext [13:24:59] For the next 0 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1300) [13:24:59] In 2 hour(s) and 35 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1600) [13:25:43] thanks [13:26:02] DreamRimmer: starting now [13:26:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029237 (https://phabricator.wikimedia.org/T355129) (owner: 10Dreamrimmer) [13:26:49] (03PS1) 10Btullis: Add the airflow profile to the statistics::explorer role [puppet] - 10https://gerrit.wikimedia.org/r/1029541 (https://phabricator.wikimedia.org/T364542) [13:26:58] (03Merged) 10jenkins-bot: quwiki: Set MetaNamespaceName to Wikipidiya [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029237 (https://phabricator.wikimedia.org/T355129) (owner: 10Dreamrimmer) [13:27:34] !log samtar@deploy1002 Started scap: Backport for [[gerrit:1029237|quwiki: Set MetaNamespaceName to Wikipidiya (T355129)]] [13:27:34] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2005-dev.codfw.wmnet with OS bookworm [13:27:37] T355129: Localised name for quwiki - https://phabricator.wikimedia.org/T355129 [13:29:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P62247 and previous config saved to /var/cache/conftool/dbconfig/20240509-132905-marostegui.json [13:30:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62248 and previous config saved 
to /var/cache/conftool/dbconfig/20240509-133009-root.json [13:30:13] !log samtar@deploy1002 dreamrimmer and samtar: Backport for [[gerrit:1029237|quwiki: Set MetaNamespaceName to Wikipidiya (T355129)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:30:18] DreamRimmer: patch is live on mwdebug, can you test? [13:30:26] doing [13:32:18] looks good [13:33:43] TheresNoTime: good to go [13:34:05] !log samtar@deploy1002 dreamrimmer and samtar: Continuing with sync [13:38:35] (03PS3) 10Fabfur: benthos: allow stdin/stdout/stderr in systemd template [puppet] - 10https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543) [13:38:47] PROBLEM - Check whether ferm is active by checking the default input chain on mw1485 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:38:56] (03CR) 10CI reject: [V:04-1] benthos: allow stdin/stdout/stderr in systemd template [puppet] - 10https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543) (owner: 10Fabfur) [13:39:38] (sync is a little slow..) 
[13:41:00] (03PS1) 10Elukey: services: move Swift config in staging to local envoy proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029544 (https://phabricator.wikimedia.org/T344324) [13:41:21] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2019 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:41:23] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2011 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:42:21] (03CR) 10Elukey: "I know that in the task Joe suggested otherwise, and for good reasons, but the ML team used the local proxy for recommendation-api-ng and " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029544 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:42:24] (03PS4) 10Fabfur: benthos: allow stdin/stdout/stderr in systemd template [puppet] - 10https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543) [13:44:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T364299)', diff saved to https://phabricator.wikimedia.org/P62249 and previous config saved to /var/cache/conftool/dbconfig/20240509-134412-marostegui.json [13:44:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2201.codfw.wmnet with reason: Maintenance [13:44:17] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [13:44:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2201.codfw.wmnet with reason: Maintenance [13:45:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62250 and previous config saved 
to /var/cache/conftool/dbconfig/20240509-134514-root.json [13:47:15] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1029237|quwiki: Set MetaNamespaceName to Wikipidiya (T355129)]] (duration: 19m 41s) [13:47:18] T355129: Localised name for quwiki - https://phabricator.wikimedia.org/T355129 [13:50:16] TheresNoTime: Thanks for your valuable time, I appreciate it:) [13:50:49] (03CR) 10Eevans: [C:03+2] Reimage aqs1013 w/o preserving data [puppet] - 10https://gerrit.wikimedia.org/r/1029206 (https://phabricator.wikimedia.org/T364422) (owner: 10Eevans) [13:51:14] DreamRimmer: just need to run the dedupe script (I think) [13:52:00] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543) (owner: 10Fabfur) [13:55:47] (03PS1) 10Elukey: Delete the Cassandra directory in secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1029567 (https://phabricator.wikimedia.org/T352647) [13:57:10] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host aqs1013.eqiad.wmnet with OS bullseye [13:57:20] 10ops-eqiad, 06SRE, 10Cassandra, 13Patch-For-Review: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9783352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye [13:57:51] PROBLEM - MariaDB Replica SQL: s3 on db1150 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1054, Errmsg: Error Unknown column pl_namespace in where clause on query. Default database: quwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:59:02] ^ checking that [13:59:14] jynus Amir1 ^ [13:59:30] quwiki [13:59:42] is it a missing schema change? [13:59:50] No, I know what it is [13:59:56] ? 
[14:00:19] ah yes [14:00:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62252 and previous config saved to /var/cache/conftool/dbconfig/20240509-140020-root.json [14:00:27] Sorry I misread the error message [14:00:42] sigh [14:00:43] hi, running namespaceDupes [14:00:47] on that wiki [14:00:52] It is part of https://phabricator.wikimedia.org/T352010 [14:00:53] is now stalled it seems? [14:00:56] TheresNoTime: do not do that [14:01:09] (03PS1) 10Ssingh: varnish: disable Chrome's private prefetch proxy [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) [14:01:13] which wiki is that [14:01:15] lots of hosts broken in s3 [14:01:19] possible same error [14:01:20] Amir1: shall I cancel the running, `quwiki` [14:01:22] quwiki.pagelinks [14:01:27] I fix this [14:01:29] yep, same error on all the broken hosts [14:01:53] (cancelled running) [14:02:07] I am going to start logging in on the status page [14:02:08] TheresNoTime: Yes, stop it for now [14:02:13] ack [14:02:29] jynus: Not sure if it is really needed, just 4 hosts affected [14:02:34] only 4? [14:02:35] (if useful, https://phabricator.wikimedia.org/T355129#9783354 is the result of the dry run of `mwscript namespaceDupes.php --wiki quwiki`) [14:02:40] jynus: yep [14:02:40] ok then standing by [14:02:49] Amir1: need help? [14:02:51] I thought it was a widespread breakage [14:03:11] jynus: because the schema change is running [14:03:12] in any case taking IC in case it is needed [14:03:46] marostegui: I'm doing db1157, let me give you a schema change to run on quwiki [14:03:51] downtimed the hosts to avoid paging [14:04:03] Amir1: if it is just adding the column, I can do that right now [14:04:29] yeah, pl_namespace and pl_title [14:04:34] ok doing [14:05:13] (03CR) 10Andrea Denisse: [V:03+1] "That makes sense, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [14:05:16] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] thanos: Update TLS certificate in Envoy config to match CFSSL provisioning [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [14:05:22] wiki edits looking good, so no apparent user impact [14:05:41] doing db1166 [14:05:42] done [14:05:44] doing db1175 now [14:05:53] but please anyone speak up if you see any weird wiki errors [14:05:59] !log ftr, did run `[samtar@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki quwiki --fix` for T355129, cancelled before complete due to outage [14:06:08] db1175 done, going for db1189 [14:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:13] T355129: Localised name for quwiki - https://phabricator.wikimedia.org/T355129 [14:06:20] ALTER TABLE pagelinks ADD pl_namespace INT DEFAULT 0 NOT NULL, ADD pl_title VARBINARY(255) DEFAULT '' NOT NULL; [14:06:23] TheresNoTime: I don't think it is that per se, but the interaction of that and something else [14:06:37] fixed db1189 [14:06:41] going for the backup source now [14:06:46] TheresNoTime: please stand by until dbas give the green light [14:06:54] all done, all hosts replicating now [14:06:54] ack :) [14:07:43] TheresNoTime: simply don't run it again, until the code is fixed to respect migration stage [14:07:43] Amir1: db1189 is running the optimize now, that host is depooled anyway [14:07:50] So all fixed [14:07:51] RECOVERY - MariaDB Replica SQL: s3 on db1150 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:07:56] I see db2201 s5 lagging, but I am guessing that is unrelated [14:07:56] Amir1: ack, will not run it again unless told otherwise [14:08:06] jynus: yeah, that's a different schema change [14:08:23] this is the like 
fourth time the namespaceDupes is breaking stuff exactly because it bypasses links table abstraction [14:08:24] let's make sure icinga looks all green before continuing [14:08:25] Amir1: you will need to re-add the columns in db1189 as the schema change is running there [14:08:29] dropping them :) [14:08:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2211.codfw.wmnet with reason: Maintenance [14:08:43] marostegui: I'll do it [14:08:47] RECOVERY - Check whether ferm is active by checking the default input chain on mw1485 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:08:50] icinga looks good to me now [14:08:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2211.codfw.wmnet with reason: Maintenance [14:08:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T364299)', diff saved to https://phabricator.wikimedia.org/P62253 and previous config saved to /var/cache/conftool/dbconfig/20240509-140858-marostegui.json [14:09:02] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [14:09:18] is there a task for "code is fixed to respect migration stage"? [14:09:18] !log Restarting envoyproxy on titan* hosts as part of the CFSSL migration - T360414 [14:09:20] Amir1: should we disable or do something to namespaceDupes to avoid breaking it again? [14:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:22] I still see lag on db1157 and db1189 [14:09:22] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [14:09:34] we should disable it again [14:09:37] jynus: orchestrator doesn't show it https://orchestrator.wikimedia.org/web/cluster/alias/s3 [14:09:42] only db1189 which is depooled [14:09:46] db1189 is expected [14:09:48] ok [14:09:51] it's running the schema change [14:10:04] so issue averted, any followup needed? 
[14:10:06] I want to drop the columns again [14:10:21] or maybe you can coordinate directly with TheresNoTime [14:10:55] e.g. telling him to reschedule his maintenance [14:11:21] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2019 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:11:23] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2011 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:11:26] that's a different thing [14:11:43] I dropped it on quwiki on db1150 again [14:11:58] so far works normally [14:12:26] db1157 as well [14:12:28] right [14:12:44] Amir1: so I guess db1189 will need to get them re-added, let the data go, and then drop them [14:12:54] because the maint script is not running, the normal writes shouldn't affect it [14:13:00] yeah, fun stuff [14:13:28] (03PS1) 10Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) [14:13:48] (03CR) 10BBlack: varnish: disable Chrome's private prefetch proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) (owner: 10Ssingh) [14:14:04] T364546 [14:14:07] (03CR) 10CI reject: [V:04-1] (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [14:14:10] T364546: namespaceDupes is not respecting links migration stage (again) - https://phabricator.wikimedia.org/T364546 [14:14:11] So should wiki maintenance be stopped for now? [14:14:21] what is the right approach? 
[14:14:28] (03CR) 10Ssingh: varnish: disable Chrome's private prefetch proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) (owner: 10Ssingh) [14:14:40] or maybe you can coordinate on that ticket? [14:14:41] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872#9783406 (10MatthewVernon) Can I very tentatively ask if you have thoughts about timescales for this, please? It seems likely to be a non-trivial bi... [14:14:43] to close the issue [14:15:07] ^ TheresNoTime Amir1 would that work? [14:15:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62254 and previous config saved to /var/cache/conftool/dbconfig/20240509-141526-root.json [14:15:41] jynus: I've commented on T355129 and do not intend to run that script until otherwise told :) [14:15:41] T355129: Localised name for quwiki - https://phabricator.wikimedia.org/T355129 [14:15:46] 06SRE, 10SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9783414 (10andrea.denisse) Hi @jijiki , I've abandoned the patch for the thanos-fe hosts ([[ https://gerrit.wikimedia.org/r/1028546 | #1028546 ]]) but feel free to restore it... 
[14:15:50] (03PS2) 10Ssingh: varnish: disable Chrome's private prefetch proxy [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) [14:16:04] thank you, sorry for the impact [14:16:27] !log eevans@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1013.eqiad.wmnet with OS bullseye [14:16:33] no problem for me, *I* was the one who broke things :D [14:16:45] 10ops-eqiad, 06SRE, 10Cassandra, 13Patch-For-Review: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9783440 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye executed with errors: - aqs1013 (**FAIL**) - Downt... [14:16:51] as far as I understood, you weren't, you only hit a bug [14:17:11] /j [14:17:22] but better be safe as this was not very impacting but quite scary if it got more widespread [14:17:35] thank you for your understanding [14:17:54] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9783442 (10andrea.denisse) 05In progress→03Resolved [14:18:03] I have said multiple times, this is the only place that writes to links tables bypassing the abstraction in place. 
I asked multiple times to actually use the abstraction and every time the response I got was that "it's too much work, we fix this breakage to unlock the work" [14:18:23] (03PS2) 10Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) [14:18:54] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host aqs1013.eqiad.wmnet with OS bullseye [14:19:09] 10ops-eqiad, 06SRE, 10Cassandra, 13Patch-For-Review: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9783447 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye [14:19:09] (03CR) 10CI reject: [V:04-1] (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [14:20:13] RESOLVED: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:23:20] (03PS1) 10Ladsgroup: Disable namespaceDupes again [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029547 (https://phabricator.wikimedia.org/T364546) [14:23:29] (03CR) 10Ladsgroup: [C:03+2] Disable namespaceDupes again [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029547 (https://phabricator.wikimedia.org/T364546) (owner: 10Ladsgroup) [14:24:04] (03PS1) 10Ladsgroup: Disable namespaceDupes again [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1029548 (https://phabricator.wikimedia.org/T364546) [14:24:10] (03CR) 10Ladsgroup: [C:03+2] Disable namespaceDupes again [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1029548 
(https://phabricator.wikimedia.org/T364546) (owner: 10Ladsgroup) [14:26:40] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:28:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kafka-main2006 to codfw - jhancock@cumin2002" [14:29:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kafka-main2006 to codfw - jhancock@cumin2002" [14:29:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:31:34] (03PS3) 10Ssingh: varnish: disable Chrome's private prefetch proxy [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) [14:32:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED [14:32:18] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED [14:32:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2009.mgmt.codfw.wmnet with reboot policy FORCED [14:32:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2008.mgmt.codfw.wmnet with reboot policy FORCED [14:32:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2008.mgmt.codfw.wmnet with reboot policy FORCED [14:32:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2007.mgmt.codfw.wmnet with reboot policy FORCED [14:32:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2006.mgmt.codfw.wmnet with reboot policy FORCED [14:32:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2009.mgmt.codfw.wmnet with reboot policy FORCED [14:33:29] 
(03CR) 10CDanis: [C:03+1] "thanks Scott and Riccardo" [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans) [14:33:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2008.mgmt.codfw.wmnet with reboot policy FORCED [14:33:48] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2008.mgmt.codfw.wmnet with reboot policy FORCED [14:33:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2009.mgmt.codfw.wmnet with reboot policy FORCED [14:33:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED [14:34:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED [14:34:08] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2009.mgmt.codfw.wmnet with reboot policy FORCED [14:36:50] (03CR) 10Fabfur: [V:03+1 C:03+2] cache:benthos: move processors in the pipeline section [puppet] - 10https://gerrit.wikimedia.org/r/1029480 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [14:37:25] (03CR) 10Scott French: [C:03+2] confd: prom exporter uses resource name to find state file [puppet] - 10https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [14:37:27] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9783503 (10Jclark-ctr) @Marostegui you can put server back in rotation even though i uploaded multiple photos yesterday to Dell. They replied this morning requesting part number to send correct part {F51422232} I attache... 
[14:37:41] (03CR) 10Volans: reqconfig: add command to search IP in ipblocks (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans) [14:39:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T364299)', diff saved to https://phabricator.wikimedia.org/P62255 and previous config saved to /var/cache/conftool/dbconfig/20240509-143938-marostegui.json [14:39:38] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9783539 (10Marostegui) Thanks John, I will create a subtask for us to work on the formatting, reimage and recloning. Will leave this open until you've finished your side. [14:39:43] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [14:41:24] marostegui: do you have the query that broke it handy? [14:41:36] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:42:09] (03PS4) 10Ssingh: varnish: disable Chrome's private prefetch proxy [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) [14:43:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:43:22] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:44:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:45:05] (03Merged) 10jenkins-bot: Disable namespaceDupes again [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029547 (https://phabricator.wikimedia.org/T364546) (owner: 10Ladsgroup) [14:45:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2008.mgmt.codfw.wmnet with reboot policy FORCED [14:45:52] (03CR) 10Elukey: "CI reports this:" [software/cumin] - 10https://gerrit.wikimedia.org/r/1029209 (owner: 10Volans) [14:46:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2009.mgmt.codfw.wmnet with reboot policy FORCED [14:46:07] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED [14:46:35] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED [14:48:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1029548 (https://phabricator.wikimedia.org/T364546) (owner: 10Ladsgroup) [14:48:52] (03CR) 10Ssingh: "# top TEST varnish/text/51-chrome-private-prefetch-proxy.vtc passed (7.040)" [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) (owner: 10Ssingh) [14:48:54] (03CR) 10Volans: "Yes I know, thanks. I'm looking for a fix that does make it build properly both in CI and in debian upstream" [software/cumin] - 10https://gerrit.wikimedia.org/r/1029209 (owner: 10Volans) [14:49:38] (03CR) 10BBlack: [C:03+1] varnish: disable Chrome's private prefetch proxy [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) (owner: 10Ssingh) [14:50:49] (03CR) 10Ssingh: varnish: disable Chrome's private prefetch proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) (owner: 10Ssingh) [14:51:40] (03Merged) 10jenkins-bot: Disable namespaceDupes again [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1029548 (https://phabricator.wikimedia.org/T364546) (owner: 10Ladsgroup) [14:51:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998#9783587 (10Andrew) 05Stalled→03Invalid I'm closing this as invalid since those hosts have come and gone :) [14:52:10] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1029547|Disable namespaceDupes again (T364546)]], [[gerrit:1029548|Disable namespaceDupes 
again (T364546)]] [14:52:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main2006.mgmt.codfw.wmnet with reboot policy FORCED [14:52:11] (03PS4) 10Volans: reqconfig: add command to search IP in ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) [14:52:13] T364546: namespaceDupes is not respecting links migration stage (again) - https://phabricator.wikimedia.org/T364546 [14:52:42] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:53:01] (03CR) 10Volans: "addressed comment" [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans) [14:53:10] (03CR) 10Ssingh: [C:03+2] varnish: disable Chrome's private prefetch proxy [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) (owner: 10Ssingh) [14:53:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2006.mgmt.codfw.wmnet with reboot policy FORCED [14:54:07] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1013.eqiad.wmnet with reason: host reimage [14:54:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:54:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main2006.mgmt.codfw.wmnet with reboot policy FORCED [14:54:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P62256 and previous config saved to /var/cache/conftool/dbconfig/20240509-145445-marostegui.json [14:54:49] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1029547|Disable namespaceDupes again (T364546)]], [[gerrit:1029548|Disable namespaceDupes again (T364546)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:55:02] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host kafka-main2007.mgmt.codfw.wmnet with reboot policy FORCED [14:55:02] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [14:55:16] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2006'] [14:55:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-main2006'] [14:56:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2007.mgmt.codfw.wmnet with reboot policy FORCED [14:56:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main2008.mgmt.codfw.wmnet with reboot policy FORCED [14:57:10] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1013.eqiad.wmnet with reason: host reimage [14:57:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main2009.mgmt.codfw.wmnet with reboot policy FORCED [14:58:17] (03PS5) 10Fabfur: benthos: allow stdin/stdout/stderr in systemd template [puppet] - 10https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543) [14:58:40] (03CR) 10Filippo Giunchedi: "I see where you are going with this, let me know what you think of these alternatives:" [puppet] - 10https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543) (owner: 10Fabfur) [14:59:33] PROBLEM - Check whether ferm is active by checking the default input chain on mw1361 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:59:37] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:59:41] PROBLEM - Check whether ferm is active by checking the default input chain on mw1496 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:00:17] !log sudo cumin 'A:cp' 'run-puppet-agent --enable "merging CR 1029570"' [15:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:33] PROBLEM - Check whether ferm is active by checking the default input chain on mw1422 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:00:39] PROBLEM - Check whether ferm is active by checking the default input chain on mw1384 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:00:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:01:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main2007.mgmt.codfw.wmnet with reboot policy FORCED [15:01:50] (03CR) 10Scott French: "Thanks, Riccardo!" [puppet] - 10https://gerrit.wikimedia.org/r/1029296 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [15:03:09] (03PS8) 10Fabfur: cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029508 (https://phabricator.wikimedia.org/T364379) [15:05:24] sukhe: I hope you added some batching, that's 112 hosts all running puppet together otherwise :-P [15:06:11] volans: yeah, I usually add it but didn't this time since I tested it before. but maybe I should have. 
[15:06:39] even something like -s10 could have been nice, yeah [15:08:12] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1029547|Disable namespaceDupes again (T364546)]], [[gerrit:1029548|Disable namespaceDupes again (T364546)]] (duration: 16m 02s) [15:08:16] T364546: namespaceDupes is not respecting links migration stage (again) - https://phabricator.wikimedia.org/T364546 [15:08:20] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2007'] [15:08:32] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host snapshot1010.eqiad.wmnet [15:08:45] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2009'] [15:09:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-main2007'] [15:09:31] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2008'] [15:09:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P62257 and previous config saved to /var/cache/conftool/dbconfig/20240509-150953-marostegui.json [15:11:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-main2009'] [15:11:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-main2008'] [15:11:18] oh you can surely do -b 30 and even more, we didn't test the max concurrency yet with the new puppetservers [15:11:36] but it seems that they survived the 64 parallel runs pretty fine [15:11:45] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:14:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye [15:14:01] !log jhancock@cumin2002 START - 
Cookbook sre.hosts.reimage for host kafka-main2007.codfw.wmnet with OS bullseye [15:14:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2008.codfw.wmnet with OS bullseye [15:14:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye [15:14:13] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783655 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye [15:14:13] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kafka-main2010 to codfw - jhancock@cumin2002" [15:14:14] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783656 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2007.codfw.wmnet with OS bullseye [15:14:15] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783658 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye [15:14:27] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783657 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2008.codfw.wmnet with OS bullseye [15:14:35] (03CR) 10Eevans: [C:03+1] Deprecate system::role for Cassandra services [puppet] - 10https://gerrit.wikimedia.org/r/1026940 (owner: 10Muehlenhoff) [15:15:02] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kafka-main2010 to codfw - jhancock@cumin2002" [15:15:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:15:30] sukhe: https://grafana.wikimedia.org/d/000000477/puppetdb?orgId=1&from=now-1h&to=now doesn't seem too bad and https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&from=now-1h&to=now too [15:15:36] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1010.eqiad.wmnet [15:15:50] so yeah potentially we might start to not care about batches for puppet runs (to be verified) [15:16:14] volans: also depends on the change, this one was fairly light at least in Puppet resources related stuf [15:16:17] f [15:16:44] in the past catalog compilation was the failing bit [15:17:33] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:18:26] (03CR) 10CDanis: [C:03+1] reqconfig: add command to search IP in ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans) [15:19:46] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kafka-main2010 to codfw - jhancock@cumin2002" [15:20:13] FIRING: [3x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kafka-main2010 to codfw - jhancock@cumin2002" [15:20:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:21:05] (03CR) 10Elukey: "I checked the deployment-prep config for 
deployment-ms-fe04.deployment-prep.eqiad1.wikimedia.cloud:" [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [15:21:30] !log dancy@deploy1002 Installing scap version "4.83.0" for 308 hosts [15:21:37] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:22:02] !log dancy@deploy1002 Installing scap version "4.83.0" for 307 hosts [15:22:42] !log dancy@deploy1002 Installation of scap version "4.83.0" completed for 307 hosts [15:23:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED [15:25:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T364299)', diff saved to https://phabricator.wikimedia.org/P62258 and previous config saved to /var/cache/conftool/dbconfig/20240509-152501-marostegui.json [15:25:05] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [15:27:11] (03CR) 10Scott French: [C:03+2] confd: clean up confd-lint-wrap after error file fixes [puppet] - 10https://gerrit.wikimedia.org/r/1029296 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [15:27:21] !log eevans@deploy1002 Started deploy [cassandra/logstash-logback-encoder@42653e6] (aqs): (no justification provided) [15:27:55] !log eevans@deploy1002 Finished deploy [cassandra/logstash-logback-encoder@42653e6] (aqs): (no justification provided) (duration: 00m 33s) [15:29:17] (03CR) 10Fabfur: "I think I'll go with alternative #2: I'll drop this CR and do all the work (socket unit, service unit override for StandardInput) in the o" [puppet] - 10https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543) (owner: 10Fabfur) [15:29:27] !log eevans@cumin1002 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host aqs1013.eqiad.wmnet with OS bullseye [15:29:33] RECOVERY - Check whether ferm is active by checking the default input chain on mw1361 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:29:38] 10ops-eqiad, 06SRE, 10Cassandra: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9783694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye completed: - aqs1013 (**PASS**) - Removed from Puppet and PuppetDB if pr... [15:29:41] RECOVERY - Check whether ferm is active by checking the default input chain on mw1496 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:30:33] RECOVERY - Check whether ferm is active by checking the default input chain on mw1422 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:30:39] RECOVERY - Check whether ferm is active by checking the default input chain on mw1384 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:31:04] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Bootstrapping — T364422 [15:31:08] T364422: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422 [15:31:18] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Bootstrapping — T364422 [15:31:27] 10ops-eqiad, 06SRE, 10Cassandra: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9783703 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e110d57c-bacd-48ee-8333-fae55b264d8c) set by eevans@cumin1002 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrappin... 
[15:31:40] (PS1) EoghanGaffney: apt: Update gitlab package [puppet] - https://gerrit.wikimedia.org/r/1029608 (https://phabricator.wikimedia.org/T364481) [15:34:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED [15:34:59] (PS2) Btullis: Add the airflow profile to the statistics::explorer role [puppet] - https://gerrit.wikimedia.org/r/1029541 (https://phabricator.wikimedia.org/T364542) [15:35:15] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2010'] [15:35:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-main2010'] [15:35:47] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2010'] [15:36:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-main2010'] [15:36:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye [15:36:52] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783718 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2010.codfw.wmnet with OS bullseye [15:37:03] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 5 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1029541 (https://phabricator.wikimedia.org/T364542) (owner: Btullis) [15:39:49] (Abandoned) Fabfur: benthos: allow stdin/stdout/stderr in systemd template [puppet] - https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543) (owner: Fabfur) [15:39:54] 
(03CR) 10Volans: [C:03+2] reqconfig: add command to search IP in ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans) [15:43:00] (03Merged) 10jenkins-bot: reqconfig: add command to search IP in ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans) [16:00:05] jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1600). [16:00:05] thedj: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:17] o/ [16:01:52] (03PS1) 10Ssingh: varnish: update handling Chrome disable private fetch response [puppet] - 10https://gerrit.wikimedia.org/r/1029614 (https://phabricator.wikimedia.org/T364126) [16:03:06] (03PS1) 10Fabfur: cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) [16:03:26] (03CR) 10CI reject: [V:04-1] cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [16:03:32] (03PS1) 10Andrew Bogott: Replace cloudcontrol2001-dev with cloudcontrol2006-dev [puppet] - 10https://gerrit.wikimedia.org/r/1029616 (https://phabricator.wikimedia.org/T354206) [16:05:13] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:07:11] (03CR) 10Eevans: [C:03+1] Add fake TLS keystore password for Cassandra clusters [labs/private] - 10https://gerrit.wikimedia.org/r/1029538 
(https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:07:24] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) - https://phabricator.wikimedia.org/T363514#9783825 (10Miriam) Hello, sorry for the delay! Approved on my end. Thank you! [16:08:59] (03PS1) 10Andrew Bogott: Move rabbitmq01.codfw1dev to cloudcontrol2006-dev [dns] - 10https://gerrit.wikimedia.org/r/1029618 [16:09:21] (03CR) 10Eevans: [C:03+1] Delete the Cassandra directory in secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1029567 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:10:44] (03CR) 10Andrew Bogott: [C:03+2] Replace cloudcontrol2001-dev with cloudcontrol2006-dev [puppet] - 10https://gerrit.wikimedia.org/r/1029616 (https://phabricator.wikimedia.org/T354206) (owner: 10Andrew Bogott) [16:13:12] (03PS1) 10Elukey: ml-services: Update Docker image for nllb-gpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029619 (https://phabricator.wikimedia.org/T362984) [16:13:58] (03CR) 10Elukey: [C:03+2] ml-services: Update Docker image for nllb-gpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029619 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [16:14:52] (03Merged) 10jenkins-bot: ml-services: Update Docker image for nllb-gpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029619 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [16:14:54] (03PS2) 10Ssingh: varnish: update handling Chrome disable private fetch response [puppet] - 10https://gerrit.wikimedia.org/r/1029614 (https://phabricator.wikimedia.org/T364126) [16:16:02] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9783851 (10andrea.denisse) [16:18:40] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:56] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [16:22:03] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872#9783853 (10cmooney) >>! In T354872#9529469, @MatthewVernon wrote: > Sorry, I think object stores are often not really written with renumbering in m... [16:22:33] (03PS2) 10Fabfur: cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) [16:23:54] 10ops-codfw, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: Create (or teach Andrew how to create) private connections+dns entries for new cloudcontrols - https://phabricator.wikimedia.org/T364559 (10Andrew) 03NEW [16:26:28] FIRING: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:29:16] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:31:40] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add entries for new codfw cloudcontrol nodes - cmooney@cumin1002" [16:32:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add entries for new codfw cloudcontrol nodes - cmooney@cumin1002" [16:32:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:32:55] !log jhancock@cumin2002 END (FAIL) - 
Cookbook sre.hosts.reimage (exit_code=93) for host kafka-main2007.codfw.wmnet with OS bullseye [16:32:59] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host kafka-main2006.codfw.wmnet with OS bullseye [16:32:59] (03PS3) 10Ssingh: varnish: update handling Chrome disable private fetch response [puppet] - 10https://gerrit.wikimedia.org/r/1029614 (https://phabricator.wikimedia.org/T364126) [16:33:55] 10ops-codfw, 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: Create (or teach Andrew how to create) private connections+dns entries for new cloudcontrols - https://phabricator.wikimedia.org/T364559#9783893 (10cmooney) Hey Andrew, Yeah this is on me, I'd not completed the work to ma... [16:34:07] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye [16:34:13] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783905 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed... [16:35:43] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache cloudcontrol2006-dev.private.codfw.wikimedia.cloud on all recursors [16:35:47] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2006-dev.private.codfw.wikimedia.cloud on all recursors [16:35:49] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2008.codfw.wmnet with OS bullseye [16:35:54] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2008.codfw.wmnet with OS bullseye executed... 
[16:40:25] (03PS1) 10Elukey: ml-services: add nllb-gpu to ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029625 (https://phabricator.wikimedia.org/T362984) [16:41:38] (03PS4) 10Ssingh: varnish: update handling Chrome disable private fetch response [puppet] - 10https://gerrit.wikimedia.org/r/1029614 (https://phabricator.wikimedia.org/T364126) [16:43:11] (03CR) 10Elukey: [C:03+2] ml-services: add nllb-gpu to ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029625 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [16:43:48] (03CR) 10Elukey: [V:03+2 C:03+2] ml-services: add nllb-gpu to ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029625 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [16:47:40] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:49:02] (03CR) 10BBlack: [C:03+1] varnish: update handling Chrome disable private fetch response [puppet] - 10https://gerrit.wikimedia.org/r/1029614 (https://phabricator.wikimedia.org/T364126) (owner: 10Ssingh) [16:49:11] (03CR) 10Ssingh: [C:03+2] varnish: update handling Chrome disable private fetch response [puppet] - 10https://gerrit.wikimedia.org/r/1029614 (https://phabricator.wikimedia.org/T364126) (owner: 10Ssingh) [16:49:29] !log sudo cumin 'A:cp' 'disable-puppet "merging CR 1029614"' [16:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:13] RESOLVED: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:51:08] (03PS1) 10Andrew Bogott: Revert "Replace cloudcontrol2001-dev with cloudcontrol2006-dev" [puppet] - 10https://gerrit.wikimedia.org/r/1029551 [16:51:35] (03CR) 10Andrew Bogott: [C:03+2] Revert "Replace 
cloudcontrol2001-dev with cloudcontrol2006-dev" [puppet] - 10https://gerrit.wikimedia.org/r/1029551 (owner: 10Andrew Bogott) [16:51:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T352010)', diff saved to https://phabricator.wikimedia.org/P62259 and previous config saved to /var/cache/conftool/dbconfig/20240509-165141-ladsgroup.json [16:51:48] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:53:51] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcontrol2006-dev.codfw.wmnet with OS bookworm [16:55:32] !log sudo cumin -b30 'A:cp' 'run-puppet-agent --enable "merging CR 1029614"' [16:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2010.codfw.wmnet with OS bullseye [17:00:05] bd808: Your horoscope predicts another Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1700) [17:06:30] PROBLEM - SSH on ncmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:06:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P62260 and previous config saved to /var/cache/conftool/dbconfig/20240509-170649-ladsgroup.json [17:09:20] RECOVERY - SSH on ncmonitor1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:13:19] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2006-dev.codfw.wmnet with reason: host reimage [17:16:37] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2006-dev.codfw.wmnet with reason: host reimage [17:18:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:21:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P62261 and previous config saved to /var/cache/conftool/dbconfig/20240509-172157-ladsgroup.json [17:23:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:27:21] (03PS1) 10Lucas 
Werkmeister: Skin: Fix UrlUtils calls [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029552 (https://phabricator.wikimedia.org/T364539) [17:30:55] (03CR) 10Lucas Werkmeister: "Deployment can be tested on Test Wikidata, because https://test.wikidata.org/wiki/MediaWiki:Recentchanges-url is a protocol-relative URL (" [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029552 (https://phabricator.wikimedia.org/T364539) (owner: 10Lucas Werkmeister) [17:33:52] (03PS1) 10Xcollazo: Dumps: Include wikis with underscores in the list of folders to be mirrored. [puppet] - 10https://gerrit.wikimedia.org/r/1029633 (https://phabricator.wikimedia.org/T354687) [17:34:03] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1029634 [17:34:04] (03PS1) 10Jforrester: Revert "Action APIs: Set most of our APIs to emit a cache header for 24 hours" [extensions/WikiLambda] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029556 (https://phabricator.wikimedia.org/T364567) [17:34:13] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2006-dev.codfw.wmnet with OS bookworm [17:34:39] jouncebot: nowandnext [17:34:39] For the next 0 hour(s) and 25 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1700) [17:34:40] For the next 0 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1700) [17:34:40] In 0 hour(s) and 25 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1800) [17:34:45] I'm going to do an emergency deploy to unbreak Wikifunctions editing. 
[17:34:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [extensions/WikiLambda] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029556 (https://phabricator.wikimedia.org/T364567) (owner: 10Jforrester) [17:35:42] (03PS1) 10Andrew Bogott: Revert "Revert "Replace cloudcontrol2001-dev with cloudcontrol2006-dev"" [puppet] - 10https://gerrit.wikimedia.org/r/1029557 [17:36:45] (03CR) 10Andrew Bogott: [C:03+2] Revert "Revert "Replace cloudcontrol2001-dev with cloudcontrol2006-dev"" [puppet] - 10https://gerrit.wikimedia.org/r/1029557 (owner: 10Andrew Bogott) [17:37:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T352010)', diff saved to https://phabricator.wikimedia.org/P62262 and previous config saved to /var/cache/conftool/dbconfig/20240509-173705-ladsgroup.json [17:37:08] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [17:37:12] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:37:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [17:37:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T352010)', diff saved to https://phabricator.wikimedia.org/P62263 and previous config saved to /var/cache/conftool/dbconfig/20240509-173728-ladsgroup.json [17:40:42] (03Merged) 10jenkins-bot: Revert "Action APIs: Set most of our APIs to emit a cache header for 24 hours" [extensions/WikiLambda] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029556 (https://phabricator.wikimedia.org/T364567) (owner: 10Jforrester) [17:40:50] Finally. 
[17:41:15] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1029556|Revert "Action APIs: Set most of our APIs to emit a cache header for 24 hours" (T364567)]] [17:41:20] T364567: Editing labels in Wikifunctions' Objects doesn't get reflected in the API response because it's cached - https://phabricator.wikimedia.org/T364567 [17:43:53] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1029556|Revert "Action APIs: Set most of our APIs to emit a cache header for 24 hours" (T364567)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:45:22] !log jforrester@deploy1002 jforrester: Continuing with sync [17:53:11] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:58:32] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1029556|Revert "Action APIs: Set most of our APIs to emit a cache header for 24 hours" (T364567)]] (duration: 17m 17s) [17:58:38] T364567: Editing labels in Wikifunctions' Objects doesn't get reflected in the API response because it's cached - https://phabricator.wikimedia.org/T364567 [18:00:05] jeena and jnuche: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1800). 
[18:10:24] (PS5) Herron: pyrra: onboard haproxy slo from grizzly [puppet] - https://gerrit.wikimedia.org/r/1029634 (https://phabricator.wikimedia.org/T302995) [18:20:18] (PS2) Scott French: benthos: adopt securityContext and base.external-services-networkpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T362978) [18:20:18] (PS2) Scott French: blubberoid: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1028911 (https://phabricator.wikimedia.org/T362978) [18:21:42] (CR) Andrew Bogott: [C:+2] Move rabbitmq01.codfw1dev to cloudcontrol2006-dev [dns] - https://gerrit.wikimedia.org/r/1029618 (owner: Andrew Bogott) [18:27:10] (PS3) Scott French: benthos: adopt securityContext and base.external-services-networkpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T359423) [18:27:13] (PS3) Scott French: blubberoid: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1028911 (https://phabricator.wikimedia.org/T362978) [18:28:50] (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1029654 [18:31:05] (CR) Scott French: "Ah, right! I think I've got it right based on the diffs. Please take a look when you have a chance." [deployment-charts] - https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T359423) (owner: Scott French) [18:38:44] SRE, SRE Observability: confd prom exporter cannot distinguish targets with a common base name - https://phabricator.wikimedia.org/T363924#9784309 (Scott_French) Open→Resolved The last two patches have been merged and subsequent confd checks commands show no issues. I believe there's nothing... 
[18:47:16] (PS1) Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) [18:47:54] (CR) CI reject: [V:-1] Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: Andrea Denisse) [18:49:29] (PS4) Herron: pyrra: varnish: add cluster [puppet] - https://gerrit.wikimedia.org/r/1029654 (https://phabricator.wikimedia.org/T302995) [18:51:38] ops-eqiad, SRE, Cassandra: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9784341 (Eevans) [18:52:55] (PS1) Eevans: Revert "Reimage aqs1013 w/o preserving data" [puppet] - https://gerrit.wikimedia.org/r/1029558 [18:53:38] (PS2) Eevans: Revert "Reimage aqs1013 w/o preserving data" [puppet] - https://gerrit.wikimedia.org/r/1029558 [18:53:50] (PS2) Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) [18:55:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:55:51] (CR) Andrea Denisse: "Please review my patch if you can, thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: 10Andrea Denisse) [19:00:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:03:48] (03PS1) 10JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - 10https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) [19:04:09] (03CR) 10CI reject: [V:04-1] postfix: chance acme chief cert order for Postfix [puppet] - 10https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [19:08:58] !log Reset failed `pyrra-filesystem-notify-thanos.path`, and `reset-failed thanos-rule-reload.service` units on titan1001 [19:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:01] (03PS2) 10JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - 10https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) [19:09:15] !log Restarting `pyrra-filesystem-notify-thanos.path`, and `reset-failed thanos-rule-reload.service` units on titan1001 [19:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:27] (03CR) 10CI reject: [V:04-1] postfix: chance acme chief cert order for Postfix [puppet] - 10https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [19:13:22] (03PS3) 10JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - 10https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) [19:17:03] (03PS1) 10Zabe: Revert "Migrate to IReadableDatabase::newSelectQueryBuilder" [extensions/Flow] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029562 (https://phabricator.wikimedia.org/T312418) [19:17:16] (03CR) 10Zabe: [C:03+2] 
Revert "Migrate to IReadableDatabase::newSelectQueryBuilder" [extensions/Flow] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029562 (https://phabricator.wikimedia.org/T312418) (owner: Zabe) [19:19:53] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudcontrol2001-dev.codfw.wmnet [19:20:55] zabe: hi, are you planning to deploy? i was just talking about deploying that change with jeena [19:21:48] i don't really care who deploys, jeena: feel free to do it :) [19:22:17] (PS1) Andrew Bogott: Remove references to cloudcontrol2001-dev [puppet] - https://gerrit.wikimedia.org/r/1029676 (https://phabricator.wikimedia.org/T364577) [19:22:20] I can deploy if you like, was just waiting for the changes to merge to master as well [19:23:10] alright [19:24:04] ops-codfw, SRE, cloud-services-team, Infrastructure-Foundations, netops: Create (or teach Andrew how to create) private connections+dns entries for new cloudcontrols - https://phabricator.wikimedia.org/T364559#9784420 (Andrew) Reimaging cloudcontrol2006-dev works now, thanks! Bonus points: I... 
[19:25:21] (03PS1) 10JHathaway: postfix: bump postfix version [puppet] - 10https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) [19:25:34] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [19:25:36] (03CR) 10Andrew Bogott: [C:03+2] Remove references to cloudcontrol2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/1029676 (https://phabricator.wikimedia.org/T364577) (owner: 10Andrew Bogott) [19:26:20] (03Merged) 10jenkins-bot: Revert "Migrate to IReadableDatabase::newSelectQueryBuilder" [extensions/Flow] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029562 (https://phabricator.wikimedia.org/T312418) (owner: 10Zabe) [19:26:25] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [19:28:28] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [19:29:02] (03PS2) 10JHathaway: postfix: bump postfix version [puppet] - 10https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) [19:29:35] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [19:29:35] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:29:36] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol2001-dev.codfw.wmnet [19:29:38] (03PS4) 10JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - 10https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) [19:29:55] (03CR) 10JHathaway: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [19:34:49] (03PS3) 10JHathaway: postfix: bump postfix version [puppet] - 10https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) [19:35:01] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [19:41:33] (03PS4) 10JHathaway: postfix: bump postfix version [puppet] - 10https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) [19:41:57] (03PS5) 10JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - 10https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) [19:42:03] !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:1029562|Revert "Migrate to IReadableDatabase::newSelectQueryBuilder" (T312418 T364499)]] [19:42:08] T312418: Migrate usage of Database::select to SelectQueryBuilder in Flow - https://phabricator.wikimedia.org/T312418 [19:42:09] T364499: Flow\Exception\WikitextException: Conversion from 'wikitext' to 'topic-title-wikitext' was requested, but this is not supported. - https://phabricator.wikimedia.org/T364499 [19:43:49] (03CR) 10CDanis: [C:03+1] cache:benthos: test for socket based activation in Benthos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [19:44:43] !log jhuneidi@deploy1002 jhuneidi and zabe: Backport for [[gerrit:1029562|Revert "Migrate to IReadableDatabase::newSelectQueryBuilder" (T312418 T364499)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:45:14] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [19:45:35] MatmaRex: are there any checks you need to do before I sync? 
[19:46:14] jeena: no [19:46:30] !log jhuneidi@deploy1002 jhuneidi and zabe: Continuing with sync [19:46:35] thanks! [19:46:35] the test plan is to see if the affected pages appear without errors now [19:47:14] (03PS6) 10JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - 10https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) [19:47:51] (03PS5) 10JHathaway: postfix: bump postfix version [puppet] - 10https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) [19:50:29] (03CR) 10Cwhite: gitlab: enable custom exporter on all instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [19:52:22] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [19:52:41] PROBLEM - Check whether ferm is active by checking the default input chain on mw1424 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:53:08] (03CR) 10Cwhite: prometheus::ops: scrape custom gitlab exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029169 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [19:53:43] PROBLEM - Check whether ferm is active by checking the default input chain on mw1377 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:54:02] PROBLEM - Check whether ferm is active by checking the default input chain on parse2012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:59:41] !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:1029562|Revert "Migrate to 
IReadableDatabase::newSelectQueryBuilder" (T312418 T364499)]] (duration: 17m 37s) [19:59:48] T312418: Migrate usage of Database::select to SelectQueryBuilder in Flow - https://phabricator.wikimedia.org/T312418 [19:59:48] T364499: Flow\Exception\WikitextException: Conversion from 'wikitext' to 'topic-title-wikitext' was requested, but this is not supported. - https://phabricator.wikimedia.org/T364499 [19:59:50] MatmaRex: done [20:00:02] thanks [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T2000). [20:00:05] lucaswerkmeister: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:08] o/ [20:00:39] things are working as expected for me [20:00:41] hi lucaswerkmeister, I need to delay the backport window a bit, since I need to deploy the train to all wikis [20:00:45] thanks MatmaRex! [20:00:52] ok, good luck with the train! 
[20:01:01] thank you, I'll ping you when done [20:01:43] MatmaRex: the errors are also going down now 👍 [20:02:22] (03PS1) 10TrainBranchBot: group2 wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029704 (https://phabricator.wikimedia.org/T361398) [20:02:28] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029704 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [20:03:12] (03Merged) 10jenkins-bot: group2 wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029704 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [20:06:37] FIRING: [3x] ProbeDown: Service aqs1013-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:11:37] FIRING: [4x] ProbeDown: Service aqs1013-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:15:13] FIRING: [4x] ProbeDown: Service aqs1013-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:18:13] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.4 refs T361398 [20:18:17] T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398 [20:18:40] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:22:41] RECOVERY - Check whether ferm is active by checking the default input chain on mw1424 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:22:44] lucaswerkmeister: I 
can backport your change now [20:22:50] okay! I’m ready to test it [20:23:43] RECOVERY - Check whether ferm is active by checking the default input chain on mw1377 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:23:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029552 (https://phabricator.wikimedia.org/T364539) (owner: 10Lucas Werkmeister) [20:23:59] RECOVERY - Check whether ferm is active by checking the default input chain on parse2012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:24:58] (03PS1) 10Zabe: wikireplicas: Drop gu_salt from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1029709 (https://phabricator.wikimedia.org/T364435) [20:28:59] (03PS2) 10Zabe: wikireplicas: Drop gu_salt from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1029709 (https://phabricator.wikimedia.org/T364435) [20:33:49] I was confused for a second why CI was taking so long and then I remembered this is a core patch and not a config change ;) [20:37:04] (03PS1) 10Umherirrender: specials: Fix "include templates" query builder for Special:Export [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029564 (https://phabricator.wikimedia.org/T364554) [20:37:27] hehe [20:39:13] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:43:15] (03PS6) 10JHathaway: postfix: bump postfix version [puppet] - 10https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) [20:43:24] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:47:04] (03Merged) 10jenkins-bot: Skin: Fix UrlUtils calls [core] (wmf/1.43.0-wmf.4) - 
10https://gerrit.wikimedia.org/r/1029552 (https://phabricator.wikimedia.org/T364539) (owner: 10Lucas Werkmeister) [20:47:13] (03CR) 10JHathaway: [C:03+2] postfix: bump postfix version [puppet] - 10https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:47:22] !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:1029552|Skin: Fix UrlUtils calls (T364539)]] [20:47:29] T364539: Protocol-relative URL in sidebar now interpreted as title (Query Service link in Wikidata sidebar broken) - https://phabricator.wikimedia.org/T364539 [20:48:37] (03PS7) 10JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - 10https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) [20:49:17] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:49:51] !log jhuneidi@deploy1002 jhuneidi and lucaswerkmeister: Backport for [[gerrit:1029552|Skin: Fix UrlUtils calls (T364539)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:49:58] testing… [20:51:07] hm, I’m not seeing it working quite yet [20:51:14] I wonder if some cache is involved [20:51:35] (03CR) 10JHathaway: [C:03+2] postfix: chance acme chief cert order for Postfix [puppet] - 10https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [20:54:44] lucaswerkmeister: I'm not sure...what/how are we testing? 
[20:55:05] if you look at https://test.wikidata.org/wiki/Wikidata:Main_Page the “recent changes” link in the sidebar is broken [20:55:10] (goes to https://test.wikidata.org/wiki///test.wikidata.org/wiki/Special:RecentChanges) [20:55:37] if I read the config correctly, $wgEnableSidebarCache is enabled everywhere in prod, and $wgSidebarCacheExpiry defaults to 86400 seconds (one day) [20:55:59] so I think we can’t really test it in production [20:56:06] I did test it locally, FWIW ^^ [20:56:14] oh I see [20:56:29] ($wgEnableSidebarCache defaults to false, so I wasn’t affected by that on my local wiki) [20:56:49] i wonder if there's a way to force the cache to expire? or I just continue to sync [20:57:04] hm [20:57:10] apparently the cache key includes the language code [20:57:10] let me see [20:57:21] yay, https://test.wikidata.org/wiki/Wikidata:Main_Page?uselang=aa shows a fixed link [20:57:28] okay, cool [20:57:36] (nobody’s had a reason to open test wikidata in that language in the past 24 hours, I guess ^^) [20:57:49] thanks for asking that question and making me look closer at the code ^^ [20:58:02] thanks for the fix! [20:58:16] !log jhuneidi@deploy1002 jhuneidi and lucaswerkmeister: Continuing with sync [21:11:04] !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:1029552|Skin: Fix UrlUtils calls (T364539)]] (duration: 23m 42s) [21:11:08] T364539: Protocol-relative URL in sidebar now interpreted as title (Query Service link in Wikidata sidebar broken) - https://phabricator.wikimedia.org/T364539 [21:11:24] \o/ thanks for deploying! [21:11:40] and thank you for taking care of the post-Hackathon train <3 [21:11:50] you're welcome! 
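Note on the sidebar test above: the behaviour described in the 20:55-20:57 exchange (a cached sidebar that outlives the deploy, and a cache key that includes the language code) can be sketched with a toy cache. This is only an illustration under stated assumptions, not MediaWiki code: the class `SidebarCache`, the key shape `(wiki, lang)`, the wiki id string and the two link strings are hypothetical; only the 86400-second expiry and the `uselang=aa` trick come from the conversation.

```python
import time

# Mirrors the $wgSidebarCacheExpiry default quoted in the chat (one day).
SIDEBAR_CACHE_EXPIRY = 86400

# Illustrative link values modelled on the URLs pasted in the chat.
BROKEN = "/wiki///test.wikidata.org/wiki/Special:RecentChanges"
FIXED = "//test.wikidata.org/wiki/Special:RecentChanges"


class SidebarCache:
    """Toy TTL cache keyed by (wiki, language code). Hypothetical, not MediaWiki's API."""

    def __init__(self, expiry=SIDEBAR_CACHE_EXPIRY, clock=time.time):
        self.expiry = expiry
        self.clock = clock
        self.entries = {}  # (wiki, lang) -> (stored_at, sidebar)

    def get(self, wiki, lang, build):
        key = (wiki, lang)
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and now - hit[0] < self.expiry:
            return hit[1]  # still fresh: the builder is not called again
        sidebar = build()
        self.entries[key] = (now, sidebar)
        return sidebar


def demo():
    now = [0.0]  # fake clock so the example does not sleep
    cache = SidebarCache(clock=lambda: now[0])

    # Before the fix: someone views the page in English, the broken sidebar is cached.
    cache.get("testwikidatawiki", "en", lambda: BROKEN)

    # One hour later the fix is deployed, so the builder now returns the fixed link.
    now[0] += 3600
    en = cache.get("testwikidatawiki", "en", lambda: FIXED)  # stale entry is served
    aa = cache.get("testwikidatawiki", "aa", lambda: FIXED)  # new key, built fresh

    # Once the expiry has passed, the English entry is rebuilt as well.
    now[0] += SIDEBAR_CACHE_EXPIRY
    en_later = cache.get("testwikidatawiki", "en", lambda: FIXED)
    return en, aa, en_later
```

In this sketch `demo()` returns the stale link for `en`, the fixed link for `aa`, and the fixed link for `en` once the entry has expired, which matches why requesting a rarely used language was enough to verify the backport without waiting a day.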
[21:15:38] (03PS1) 10Ryan Kemper: sre.kafka.roll-restart-reboot-brokers: fix usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1029712 [21:18:55] !log ryankemper@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [21:30:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [21:35:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [21:41:56] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [21:42:50] FIRING: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:45:00] 06SRE, 10Observability-Alerting: prometheus-icinga-am.service Fails to Start on alert2001 - https://phabricator.wikimedia.org/T358838#9784740 (10andrea.denisse) Hi @fgiunchedi , thanks for sharing your insights on this task. I'm taking a look at it again and I agree that repurposing this task to fix `prometheu... 
[21:47:43] !log [wdqs] Re-enabled puppet on `wdqs2023` [21:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:11] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:57:50] RESOLVED: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:02:31] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [22:07:43] (03PS1) 10JHathaway: Revert "postfix: chance acme chief cert order for Postfix" [puppet] - 10https://gerrit.wikimedia.org/r/1029565 [22:07:49] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588 (10xcollazo) 03NEW [22:08:59] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9784779 (10xcollazo) [22:11:38] (03CR) 10JHathaway: [C:03+2] Revert "postfix: chance acme chief cert order for Postfix" [puppet] - 10https://gerrit.wikimedia.org/r/1029565 (owner: 10JHathaway) [22:12:33] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9784798 (10xcollazo) @WDoranWMF kindly please confirm that you are my manager and that you approve of this request. 
[22:12:45] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 122.95 ms [22:27:40] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2006'] [22:28:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-main2006'] [22:33:37] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9784803 (10Jhancock.wm) [22:36:37] FIRING: [3x] ProbeDown: Service aqs1013-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:55:13] FIRING: [2x] ProbeDown: Service aqs1013-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:55:29] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9784816 (10Eevans) [22:58:55] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9784822 (10Eevans) [23:01:24] (03PS1) 10Santiago Faci: mpic-next: New release for staging environment with some fixes: v0.0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029725 (https://phabricator.wikimedia.org/T360734) [23:01:54] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9784824 (10Eevans) > [] - access request (or expansion) has sign off of group approver indicated by the approval field in data.yaml @KOfori you are group approver for cassandra-st... 
[23:03:13] (03CR) 10Santiago Faci: [C:03+2] mpic-next: New release for staging environment with some fixes: v0.0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029725 (https://phabricator.wikimedia.org/T360734) (owner: 10Santiago Faci) [23:03:17] (03CR) 10Eevans: [C:03+2] Revert "Reimage aqs1013 w/o preserving data" [puppet] - 10https://gerrit.wikimedia.org/r/1029558 (owner: 10Eevans) [23:04:05] (03Merged) 10jenkins-bot: mpic-next: New release for staging environment with some fixes: v0.0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029725 (https://phabricator.wikimedia.org/T360734) (owner: 10Santiago Faci) [23:06:14] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [23:06:26] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [23:06:39] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [23:20:57] (03PS1) 10Zabe: beta: Disable Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029727 [23:21:49] (03CR) 10Zabe: [C:03+2] beta: Disable Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029727 (owner: 10Zabe) [23:22:33] (03Merged) 10jenkins-bot: beta: Disable Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029727 (owner: 10Zabe) [23:38:43] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1028943 [23:38:43] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1028943 (owner: 10TrainBranchBot) [23:50:26] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed