[00:03:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:05:11] <rzl>	 looking
[00:05:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T361627)', diff saved to https://phabricator.wikimedia.org/P62132 and previous config saved to /var/cache/conftool/dbconfig/20240509-000554-marostegui.json
[00:06:04] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[00:07:04] <brett>	 marostegui: This related to the DB stuff?
[00:07:31] <rzl>	 brett: it's 2 AM for him, that's a script :)
[00:07:35] <brett>	 oh
[00:07:50] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1028934 (owner: TrainBranchBot)
[00:07:51] <brett>	 s8 is showing an increase in errors
[00:07:52] <rzl>	 I see a traffic spike, looks like the impact has passed, just checking for any lingering effect
[00:08:16] <brett>	 Perhaps that's just related to the script
[00:08:39] <rzl>	 why do you say that?
[00:08:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:09:28] <brett>	 I thought it happened earlier than I remembered, indeed it did happen at 00:00
[00:09:36] <brett>	 *00:03
[00:10:04] <brett>	 I'm not seeing an increase in traffic to varnish via the main grafana dashboard...
[00:21:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P62133 and previous config saved to /var/cache/conftool/dbconfig/20240509-002105-marostegui.json
[00:26:28] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:27:57] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:36:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P62134 and previous config saved to /var/cache/conftool/dbconfig/20240509-003614-marostegui.json
[00:43:01] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:51:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T361627)', diff saved to https://phabricator.wikimedia.org/P62135 and previous config saved to /var/cache/conftool/dbconfig/20240509-005122-marostegui.json
[00:51:25] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[00:51:28] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[00:51:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[00:51:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T361627)', diff saved to https://phabricator.wikimedia.org/P62136 and previous config saved to /var/cache/conftool/dbconfig/20240509-005146-marostegui.json
[00:56:07] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:02:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T361627)', diff saved to https://phabricator.wikimedia.org/P62137 and previous config saved to /var/cache/conftool/dbconfig/20240509-010250-marostegui.json
[01:02:54] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[01:17:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P62138 and previous config saved to /var/cache/conftool/dbconfig/20240509-011758-marostegui.json
[01:28:19] <icinga-wm>	 RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 108, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:33:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P62139 and previous config saved to /var/cache/conftool/dbconfig/20240509-013305-marostegui.json
[01:48:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T361627)', diff saved to https://phabricator.wikimedia.org/P62140 and previous config saved to /var/cache/conftool/dbconfig/20240509-014814-marostegui.json
[01:48:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1228.eqiad.wmnet with reason: Maintenance
[01:48:18] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[01:48:29] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1228.eqiad.wmnet with reason: Maintenance
[01:48:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1228 (T361627)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-014836-marostegui.json
[01:50:13] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:59:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T361627)', diff saved to https://phabricator.wikimedia.org/P62142 and previous config saved to /var/cache/conftool/dbconfig/20240509-015909-marostegui.json
[01:59:13] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[01:59:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P62143 and previous config saved to /var/cache/conftool/dbconfig/20240509-015942-ladsgroup.json
[01:59:48] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[02:14:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P62144 and previous config saved to /var/cache/conftool/dbconfig/20240509-021417-marostegui.json
[02:14:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P62145 and previous config saved to /var/cache/conftool/dbconfig/20240509-021452-ladsgroup.json
[02:29:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P62146 and previous config saved to /var/cache/conftool/dbconfig/20240509-022925-marostegui.json
[02:30:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P62147 and previous config saved to /var/cache/conftool/dbconfig/20240509-023000-ladsgroup.json
[02:44:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T361627)', diff saved to https://phabricator.wikimedia.org/P62148 and previous config saved to /var/cache/conftool/dbconfig/20240509-024432-marostegui.json
[02:44:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[02:44:36] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[02:44:48] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[02:44:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T361627)', diff saved to https://phabricator.wikimedia.org/P62149 and previous config saved to /var/cache/conftool/dbconfig/20240509-024455-marostegui.json
[02:45:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P62150 and previous config saved to /var/cache/conftool/dbconfig/20240509-024508-ladsgroup.json
[02:45:11] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[02:45:12] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[02:45:24] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[02:45:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T352010)', diff saved to https://phabricator.wikimedia.org/P62151 and previous config saved to /var/cache/conftool/dbconfig/20240509-024531-ladsgroup.json
[02:55:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T361627)', diff saved to https://phabricator.wikimedia.org/P62152 and previous config saved to /var/cache/conftool/dbconfig/20240509-025537-marostegui.json
[02:55:41] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[03:03:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:10:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P62153 and previous config saved to /var/cache/conftool/dbconfig/20240509-031045-marostegui.json
[03:25:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-032552-marostegui.json
[03:40:13] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:41:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T361627)', diff saved to https://phabricator.wikimedia.org/P62155 and previous config saved to /var/cache/conftool/dbconfig/20240509-034105-marostegui.json
[03:41:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[03:41:09] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[03:41:21] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[03:41:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T361627)', diff saved to https://phabricator.wikimedia.org/P62156 and previous config saved to /var/cache/conftool/dbconfig/20240509-034128-marostegui.json
[03:53:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T361627)', diff saved to https://phabricator.wikimedia.org/P62157 and previous config saved to /var/cache/conftool/dbconfig/20240509-035320-marostegui.json
[03:53:28] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[04:08:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P62158 and previous config saved to /var/cache/conftool/dbconfig/20240509-040830-marostegui.json
[04:20:29] <icinga-wm>	 RECOVERY - Router interfaces on cr1-magru is OK: OK: host 195.200.68.128, interfaces up: 48, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:23:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P62159 and previous config saved to /var/cache/conftool/dbconfig/20240509-042337-marostegui.json
[04:26:28] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:38:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T361627)', diff saved to https://phabricator.wikimedia.org/P62160 and previous config saved to /var/cache/conftool/dbconfig/20240509-043845-marostegui.json
[04:38:48] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[04:38:49] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[04:39:01] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[04:39:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T361627)', diff saved to https://phabricator.wikimedia.org/P62161 and previous config saved to /var/cache/conftool/dbconfig/20240509-043908-marostegui.json
[04:39:35] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 62390360 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:40:37] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 59472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:43:29] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:43:31] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:51:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T364067
[04:51:43] <stashbot>	 T364067: Switchover s6 master (db1173 -> db1231) - https://phabricator.wikimedia.org/T364067
[04:51:54] <wikibugs>	 (PS2) Gerrit maintenance bot: mariadb: Promote db1231 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1025916 (https://phabricator.wikimedia.org/T364067)
[04:52:04] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T364067
[04:52:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1231 with weight 0 T364067', diff saved to https://phabricator.wikimedia.org/P62162 and previous config saved to /var/cache/conftool/dbconfig/20240509-045216-marostegui.json
[04:55:02] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1231 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1025916 (https://phabricator.wikimedia.org/T364067) (owner: Gerrit maintenance bot)
[04:58:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[04:58:48] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[05:06:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[05:06:34] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[05:07:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62163 and previous config saved to /var/cache/conftool/dbconfig/20240509-050752-root.json
[05:08:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:13:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[05:14:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[05:22:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62164 and previous config saved to /var/cache/conftool/dbconfig/20240509-052258-root.json
[05:24:35] <wikibugs>	 (PS1) Marostegui: Revert "mariadb: Promote db1231 to s6 master" [puppet] - https://gerrit.wikimedia.org/r/1029249
[05:25:00] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[05:25:13] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[05:26:40] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "mariadb: Promote db1231 to s6 master" [puppet] - https://gerrit.wikimedia.org/r/1029249 (owner: Marostegui)
[05:29:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1231', diff saved to https://phabricator.wikimedia.org/P62165 and previous config saved to /var/cache/conftool/dbconfig/20240509-052912-root.json
[05:31:29] <wikibugs>	 (PS1) Marostegui: db1231: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029313
[05:32:27] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1180.eqiad.wmnet onto db1231.eqiad.wmnet
[05:32:28] <wikibugs>	 (CR) Marostegui: [C:+2] db1231: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029313 (owner: Marostegui)
[05:34:31] <wikibugs>	 (PS1) Marostegui: db1172: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029315
[05:34:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1172 T363792', diff saved to https://phabricator.wikimedia.org/P62166 and previous config saved to /var/cache/conftool/dbconfig/20240509-053442-marostegui.json
[05:34:47] <stashbot>	 T363792: Upgrade s8 to MariaDB 10.6 - https://phabricator.wikimedia.org/T363792
[05:35:34] <wikibugs>	 (CR) Marostegui: [C:+2] db1172: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029315 (owner: Marostegui)
[05:37:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1172.eqiad.wmnet with OS bookworm
[05:38:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 10%: Repooling', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-053804-root.json
[05:41:19] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[05:41:32] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[05:49:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1172.eqiad.wmnet with reason: host reimage
[05:52:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1172.eqiad.wmnet with reason: host reimage
[05:53:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62167 and previous config saved to /var/cache/conftool/dbconfig/20240509-055314-root.json
[05:54:09] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance
[05:54:23] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance
[05:54:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2123 (T364299)', diff saved to https://phabricator.wikimedia.org/P62168 and previous config saved to /var/cache/conftool/dbconfig/20240509-055429-marostegui.json
[05:54:33] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[05:58:01] <wikibugs>	 (PS1) Marostegui: Revert "db1172: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029250
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, Amir1, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T0600).
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:08:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62169 and previous config saved to /var/cache/conftool/dbconfig/20240509-060821-root.json
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:13:06] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1172: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029250 (owner: Marostegui)
[06:14:19] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1172.eqiad.wmnet with OS bookworm
[06:14:31] <wikibugs>	 (PS1) Zabe: beta: Reenable encrypted Argon2 password hashing [mediawiki-config] - https://gerrit.wikimedia.org/r/1029433
[06:14:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T364451
[06:14:58] <stashbot>	 T364451: Switchover es4 codfw master (es2020 -> es2021) - https://phabricator.wikimedia.org/T364451
[06:15:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2021 with weight 0 T364451', diff saved to https://phabricator.wikimedia.org/P62170 and previous config saved to /var/cache/conftool/dbconfig/20240509-061500-root.json
[06:15:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T364451
[06:17:05] <wikibugs>	 (PS1) Marostegui: mariadb: Promote es2021 to es4 codfw master [puppet] - https://gerrit.wikimedia.org/r/1029434 (https://phabricator.wikimedia.org/T364451)
[06:17:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62171 and previous config saved to /var/cache/conftool/dbconfig/20240509-061742-root.json
[06:17:49] <wikibugs>	 (CR) Zabe: [C:+2] beta: Reenable encrypted Argon2 password hashing [mediawiki-config] - https://gerrit.wikimedia.org/r/1029433 (owner: Zabe)
[06:17:53] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote es2021 to es4 codfw master [puppet] - https://gerrit.wikimedia.org/r/1029434 (https://phabricator.wikimedia.org/T364451) (owner: Marostegui)
[06:18:31] <marostegui>	 !log Starting es4 codfw failover from es2020 to es2021 - T364451
[06:18:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:36] <wikibugs>	 (Merged) jenkins-bot: beta: Reenable encrypted Argon2 password hashing [mediawiki-config] - https://gerrit.wikimedia.org/r/1029433 (owner: Zabe)
[06:19:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2021 to es4 primary and set section read-write T364451', diff saved to https://phabricator.wikimedia.org/P62172 and previous config saved to /var/cache/conftool/dbconfig/20240509-061904-marostegui.json
[06:19:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2020 T364451', diff saved to https://phabricator.wikimedia.org/P62173 and previous config saved to /var/cache/conftool/dbconfig/20240509-061957-root.json
[06:20:02] <stashbot>	 T364451: Switchover es4 codfw master (es2020 -> es2021) - https://phabricator.wikimedia.org/T364451
[06:20:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Give some weight to es4 codfw master', diff saved to https://phabricator.wikimedia.org/P62174 and previous config saved to /var/cache/conftool/dbconfig/20240509-062027-marostegui.json
[06:23:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62175 and previous config saved to /var/cache/conftool/dbconfig/20240509-062327-root.json
[06:24:13] <wikibugs>	 (PS1) Marostegui: es2020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029435
[06:24:22] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2020.codfw.wmnet with OS bookworm
[06:24:52] <wikibugs>	 (CR) Marostegui: [C:+2] es2020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029435 (owner: Marostegui)
[06:29:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T364299)', diff saved to https://phabricator.wikimedia.org/P62176 and previous config saved to /var/cache/conftool/dbconfig/20240509-062926-marostegui.json
[06:29:31] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[06:32:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62177 and previous config saved to /var/cache/conftool/dbconfig/20240509-063248-root.json
[06:33:41] <wikibugs>	 (PS1) Marostegui: Revert "db1231: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029251
[06:34:48] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1231: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029251 (owner: Marostegui)
[06:35:07] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1180.eqiad.wmnet onto db1231.eqiad.wmnet
[06:35:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62178 and previous config saved to /var/cache/conftool/dbconfig/20240509-063514-root.json
[06:36:53] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1231 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1028935 (https://phabricator.wikimedia.org/T364523)
[06:36:57] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1028936 (https://phabricator.wikimedia.org/T364523)
[06:38:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62179 and previous config saved to /var/cache/conftool/dbconfig/20240509-063832-root.json
[06:38:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62180 and previous config saved to /var/cache/conftool/dbconfig/20240509-063845-root.json
[06:44:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P62181 and previous config saved to /var/cache/conftool/dbconfig/20240509-064434-marostegui.json
[06:47:31] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2020.codfw.wmnet with reason: host reimage
[06:47:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62182 and previous config saved to /var/cache/conftool/dbconfig/20240509-064754-root.json
[06:50:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62183 and previous config saved to /var/cache/conftool/dbconfig/20240509-065020-root.json
[06:50:43] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2020.codfw.wmnet with reason: host reimage
[06:54:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 5%: Repooling', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-065355-root.json
[06:59:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P62185 and previous config saved to /var/cache/conftool/dbconfig/20240509-065941-marostegui.json
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T0700).
[07:00:05] <jouncebot>	 James_F and DreamRimmer: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:03:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62186 and previous config saved to /var/cache/conftool/dbconfig/20240509-070300-root.json
[07:04:29] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:04:53] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:04:57] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 213, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:05:01] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:05:19] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:05:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62187 and previous config saved to /var/cache/conftool/dbconfig/20240509-070526-root.json
[07:05:43] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51924 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:09:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62188 and previous config saved to /var/cache/conftool/dbconfig/20240509-070905-root.json
[07:10:29] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:11:23] <wikibugs>	 SRE-swift-storage, Commons: Commons: File:Gnome-edit-delete.svg not found - https://phabricator.wikimedia.org/T363995#9782546 (jcrespo) Open→Resolved a:jcrespo
[07:14:47] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2020.codfw.wmnet with OS bookworm
[07:14:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T364299)', diff saved to https://phabricator.wikimedia.org/P62189 and previous config saved to /var/cache/conftool/dbconfig/20240509-071449-marostegui.json
[07:14:52] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[07:14:52] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance
[07:14:57] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:15:05] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance
[07:15:06] <wikibugs>	 (PS1) Abijeet Patro: Fix error when marking a new page for translations [extensions/Translate] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029257 (https://phabricator.wikimedia.org/T364522)
[07:15:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[07:15:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[07:15:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T364299)', diff saved to https://phabricator.wikimedia.org/P62190 and previous config saved to /var/cache/conftool/dbconfig/20240509-071527-marostegui.json
[07:15:59] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51925 bytes in 9.869 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:16:15] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:16:23] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:17:08] <wikibugs>	 (CR) Abijeet Patro: [C:+1] Fix error when marking a new page for translations [extensions/Translate] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029257 (https://phabricator.wikimedia.org/T364522) (owner: Abijeet Patro)
[07:17:26] <abijeet>	 hello deployers, there is currently a UBN! (https://phabricator.wikimedia.org/T364522) that's blocking pages from being marked for translation. We have a patch that fixes the issue, but given CI times, it'll take a while to get merged: 1029257: Fix error when marking a new page for translations | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1029257
[07:18:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62191 and previous config saved to /var/cache/conftool/dbconfig/20240509-071805-root.json
[07:18:26] <wikibugs>	 (CR) Volans: [C:+1] "LGTM, optional nit inline" [puppet] - https://gerrit.wikimedia.org/r/1029296 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[07:18:27] <abijeet>	 We might miss the UTC morning backport window, but it would be nice to have this fix deployed given the severe impact of the issue.
[07:19:40] <zabe>	 jouncebot: nowandnext
[07:19:41] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T0700)
[07:19:41] <jouncebot>	 In 2 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1000)
[07:20:23] <zabe>	 abijeet: I can deploy it if you can test?
[07:20:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62192 and previous config saved to /var/cache/conftool/dbconfig/20240509-072032-root.json
[07:20:33] <abijeet>	 zabe, thanks. I'm around to test.
[07:20:39] <wikibugs>	 (CR) Zabe: [C:+2] Fix error when marking a new page for translations [extensions/Translate] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029257 (https://phabricator.wikimedia.org/T364522) (owner: Abijeet Patro)
[07:22:56] <wikibugs>	 (CR) Zabe: [C:+2] Move wgGroupsAddToSelf and wgGroupsRemoveFromSelf to core-Permissions [mediawiki-config] - https://gerrit.wikimedia.org/r/1029200 (owner: Zabe)
[07:23:10] <abijeet>	 zabe, added to the backport window: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T0700
[07:23:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:23:46] <wikibugs>	 (Merged) jenkins-bot: Move wgGroupsAddToSelf and wgGroupsRemoveFromSelf to core-Permissions [mediawiki-config] - https://gerrit.wikimedia.org/r/1029200 (owner: Zabe)
[07:24:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Repooling', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-072411-root.json
[07:24:36] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:1029200|Move wgGroupsAddToSelf and wgGroupsRemoveFromSelf to core-Permissions]]
[07:24:59] <zabe>	 thx
[07:28:33] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:1029200|Move wgGroupsAddToSelf and wgGroupsRemoveFromSelf to core-Permissions]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:29:11] <logmsgbot>	 !log zabe@deploy1002 zabe: Continuing with sync
[07:33:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62194 and previous config saved to /var/cache/conftool/dbconfig/20240509-073311-root.json
[07:33:55] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1381 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:34:03] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1022 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:34:15] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1469 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:35:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62195 and previous config saved to /var/cache/conftool/dbconfig/20240509-073537-root.json
[07:37:11] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1435 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:37:11] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:39:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62196 and previous config saved to /var/cache/conftool/dbconfig/20240509-073922-root.json
[07:41:34] <wikibugs>	 (Merged) jenkins-bot: Fix error when marking a new page for translations [extensions/Translate] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029257 (https://phabricator.wikimedia.org/T364522) (owner: Abijeet Patro)
[07:41:37] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:42:14] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1029200|Move wgGroupsAddToSelf and wgGroupsRemoveFromSelf to core-Permissions]] (duration: 17m 37s)
[07:42:30] <wikibugs>	 (PS1) Marostegui: Revert "es2020: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029258
[07:43:03] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:1029257|Fix error when marking a new page for translations (T364522)]]
[07:43:06] <stashbot>	 T364522: Internal error when trying to mark a page for translation not yet in translation system - https://phabricator.wikimedia.org/T364522
[07:43:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Fully repool db1172', diff saved to https://phabricator.wikimedia.org/P62197 and previous config saved to /var/cache/conftool/dbconfig/20240509-074355-marostegui.json
[07:44:01] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "es2020: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029258 (owner: Marostegui)
[07:44:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62198 and previous config saved to /var/cache/conftool/dbconfig/20240509-074408-root.json
[07:45:42] <logmsgbot>	 !log zabe@deploy1002 zabe and abi: Backport for [[gerrit:1029257|Fix error when marking a new page for translations (T364522)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:45:57] <zabe>	 abijeet: could you test?:)
[07:47:07] <abijeet>	 Sure
[07:49:27] <abijeet>	 zabe, tested. Looks good.
[07:49:57] <zabe>	 cool, syncing
[07:50:01] <logmsgbot>	 !log zabe@deploy1002 zabe and abi: Continuing with sync
[07:50:17] <James_F>	 Argh, I had the deploy window in my calendar with the wrong hour, sorry!
[07:50:41] <James_F>	 Will deploy it later instead.
[07:50:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62199 and previous config saved to /var/cache/conftool/dbconfig/20240509-075043-root.json
[07:51:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T364299)', diff saved to https://phabricator.wikimedia.org/P62200 and previous config saved to /var/cache/conftool/dbconfig/20240509-075118-marostegui.json
[07:51:23] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[07:54:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62201 and previous config saved to /var/cache/conftool/dbconfig/20240509-075429-root.json
[07:59:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62202 and previous config saved to /var/cache/conftool/dbconfig/20240509-075914-root.json
[08:02:32] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1029257|Fix error when marking a new page for translations (T364522)]] (duration: 19m 28s)
[08:02:37] <stashbot>	 T364522: Internal error when trying to mark a page for translation not yet in translation system - https://phabricator.wikimedia.org/T364522
[08:03:35] <zabe>	 abijeet: fix should be live
[08:03:53] <abijeet>	 zabe, thanks! I just verified that it works as expected
[08:03:55] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1381 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:04:03] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1022 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:04:03] <zabe>	 cool, yw
[08:04:15] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1469 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:05:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62203 and previous config saved to /var/cache/conftool/dbconfig/20240509-080549-root.json
[08:06:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P62204 and previous config saved to /var/cache/conftool/dbconfig/20240509-080627-marostegui.json
[08:07:11] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1435 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:07:11] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:08:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:09:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62205 and previous config saved to /var/cache/conftool/dbconfig/20240509-080936-root.json
[08:13:23] <godog>	 !log set batphone oncall for May 9th - T350192
[08:13:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:26] <stashbot>	 T350192: On-call batphone escalation configuration holidays FY2023-24 - https://phabricator.wikimedia.org/T350192
[08:14:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62206 and previous config saved to /var/cache/conftool/dbconfig/20240509-081422-root.json
[08:16:15] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:18:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P62207 and previous config saved to /var/cache/conftool/dbconfig/20240509-082135-marostegui.json
[08:26:28] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:29:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62208 and previous config saved to /var/cache/conftool/dbconfig/20240509-082927-root.json
[08:30:48] <godog>	 !log set batphone oncall for May 9th only for EMEA, not Americas - T350192
[08:30:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:52] <stashbot>	 T350192: On-call batphone escalation configuration holidays FY2023-24 - https://phabricator.wikimedia.org/T350192
[08:36:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T364299)', diff saved to https://phabricator.wikimedia.org/P62209 and previous config saved to /var/cache/conftool/dbconfig/20240509-083643-marostegui.json
[08:36:45] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance
[08:36:46] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[08:36:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance
[08:37:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T364299)', diff saved to https://phabricator.wikimedia.org/P62210 and previous config saved to /var/cache/conftool/dbconfig/20240509-083705-marostegui.json
[08:44:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62211 and previous config saved to /var/cache/conftool/dbconfig/20240509-084433-root.json
[08:53:41] <jynus>	 !log deploy new grants for es6, es7 backups T363812
[08:53:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:45] <stashbot>	 T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812
[08:54:53] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[08:59:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62212 and previous config saved to /var/cache/conftool/dbconfig/20240509-085939-root.json
[09:00:07] <wikibugs>	 (PS4) Jcrespo: dbbackups: Add backups for es6 and es7 [puppet] - https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812)
[09:02:25] <wikibugs>	 (PS1) Fabfur: cache:benthos: move processors in the pipeline section [puppet] - https://gerrit.wikimedia.org/r/1029480 (https://phabricator.wikimedia.org/T364379)
[09:04:59] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[09:07:06] <wikibugs>	 (CR) Jcrespo: [C:+2] dbbackups: Add backups for es6 and es7 [puppet] - https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo)
[09:07:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T364299)', diff saved to https://phabricator.wikimedia.org/P62213 and previous config saved to /var/cache/conftool/dbconfig/20240509-090726-marostegui.json
[09:07:30] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[09:08:23] <wikibugs>	 SRE, Machine-Learning-Team, MW-on-K8s, serviceops, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9782652 (jijiki)
[09:14:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T352010)', diff saved to https://phabricator.wikimedia.org/P62214 and previous config saved to /var/cache/conftool/dbconfig/20240509-091413-ladsgroup.json
[09:14:18] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[09:14:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62215 and previous config saved to /var/cache/conftool/dbconfig/20240509-091445-root.json
[09:16:26] <wikibugs>	 (CR) Fabfur: [V:+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2357/console" [puppet] - https://gerrit.wikimedia.org/r/1029480 (https://phabricator.wikimedia.org/T364379) (owner: Fabfur)
[09:22:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P62216 and previous config saved to /var/cache/conftool/dbconfig/20240509-092234-marostegui.json
[09:27:34] <wikibugs>	 (PS1) Marostegui: db1167: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029482
[09:27:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1167', diff saved to https://phabricator.wikimedia.org/P62217 and previous config saved to /var/cache/conftool/dbconfig/20240509-092757-root.json
[09:28:28] <wikibugs>	 (CR) Marostegui: [C:+2] db1167: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1029482 (owner: Marostegui)
[09:29:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P62218 and previous config saved to /var/cache/conftool/dbconfig/20240509-092921-ladsgroup.json
[09:29:58] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1167.eqiad.wmnet with OS bookworm
[09:31:03] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1150.eqiad.wmnet with reason: upgrade to 10.6
[09:31:16] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: upgrade to 10.6
[09:31:33] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1171.eqiad.wmnet with reason: upgrade to 10.6
[09:31:47] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: upgrade to 10.6
[09:32:14] <wikibugs>	 (CR) Dreamrimmer: [C:+1] ContentTranslation: Update publishing setting for cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1025300 (https://phabricator.wikimedia.org/T353049) (owner: KartikMistry)
[09:33:06] <wikibugs>	 (CR) Jcrespo: [C:+2] mariadb: Upgrade db1150, db1171 and move s4, s7, s8 backups to 10.6 [puppet] - https://gerrit.wikimedia.org/r/1025710 (https://phabricator.wikimedia.org/T362509) (owner: Jcrespo)
[09:33:14] <wikibugs>	 (PS3) Jcrespo: mariadb: Upgrade db1150, db1171 and move s4, s7, s8 backups to 10.6 [puppet] - https://gerrit.wikimedia.org/r/1025710 (https://phabricator.wikimedia.org/T362509)
[09:33:37] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Disable ParserMigration on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1027194 (https://phabricator.wikimedia.org/T364228)
[09:34:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1027194 (https://phabricator.wikimedia.org/T364228) (owner: Lucas Werkmeister (WMDE))
[09:35:41] <wikibugs>	 (Merged) jenkins-bot: Disable ParserMigration on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1027194 (https://phabricator.wikimedia.org/T364228) (owner: Lucas Werkmeister (WMDE))
[09:36:08] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1027194|Disable ParserMigration on commonswiki (T364228)]]
[09:36:11] <stashbot>	 T364228: Parsoid read views show empty SDC data - https://phabricator.wikimedia.org/T364228
[09:37:05] <wikibugs>	 (CR) Jcrespo: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1025710 (https://phabricator.wikimedia.org/T362509) (owner: Jcrespo)
[09:37:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P62219 and previous config saved to /var/cache/conftool/dbconfig/20240509-093742-marostegui.json
[09:38:50] <logmsgbot>	 !log jforrester@deploy1002 lucaswerkmeister-wmde and jforrester: Backport for [[gerrit:1027194|Disable ParserMigration on commonswiki (T364228)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:39:12] <logmsgbot>	 !log jforrester@deploy1002 lucaswerkmeister-wmde and jforrester: Continuing with sync
[09:40:03] <wikibugs>	 (PS1) Marostegui: Revert "db1167: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1029261
[09:43:38] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1167.eqiad.wmnet with reason: host reimage
[09:43:48] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1375 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:43:58] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2395 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:44:36] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-094431-ladsgroup.json
[09:45:40] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1470 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:45:52] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1167.eqiad.wmnet with reason: host reimage
[09:48:03] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9782732 (jijiki) @andrea.denisse please give me a headsup on IRC/slack to sync up, when you are planning on switching thanos-fe to cfssl, so we can kee...
[09:52:25] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1027194|Disable ParserMigration on commonswiki (T364228)]] (duration: 16m 17s)
[09:52:28] <stashbot>	 T364228: Parsoid read views show empty SDC data - https://phabricator.wikimedia.org/T364228
[09:52:31] <James_F>	 Finally!
[09:52:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T364299)', diff saved to https://phabricator.wikimedia.org/P62220 and previous config saved to /var/cache/conftool/dbconfig/20240509-095249-marostegui.json
[09:52:52] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance
[09:52:54] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[09:53:06] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance
[09:53:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T364299)', diff saved to https://phabricator.wikimedia.org/P62221 and previous config saved to /var/cache/conftool/dbconfig/20240509-095313-marostegui.json
[09:59:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T352010)', diff saved to https://phabricator.wikimedia.org/P62222 and previous config saved to /var/cache/conftool/dbconfig/20240509-095943-ladsgroup.json
[09:59:46] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[09:59:48] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[09:59:59] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1000)
[10:00:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T352010)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-100006-ladsgroup.json
[10:01:59] <wikibugs>	 (03CR) 10EoghanGaffney: [V:03+1 C:03+2] apt-staging: Add timer for gitlab package puller job [puppet] - 10https://gerrit.wikimedia.org/r/1026699 (https://phabricator.wikimedia.org/T347004) (owner: 10EoghanGaffney)
[10:03:39] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1167: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1029261 (owner: 10Marostegui)
[10:04:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62224 and previous config saved to /var/cache/conftool/dbconfig/20240509-100405-root.json
[10:06:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1167.eqiad.wmnet with OS bookworm
[10:12:53] <wikibugs>	 (03PS1) 10Marostegui: es2038: No longer in setup [puppet] - 10https://gerrit.wikimedia.org/r/1029488
[10:13:48] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1375 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:13:58] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2395 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:14:25] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es2038: No longer in setup [puppet] - 10https://gerrit.wikimedia.org/r/1029488 (owner: 10Marostegui)
[10:15:40] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1470 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:19:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62225 and previous config saved to /var/cache/conftool/dbconfig/20240509-101911-root.json
[10:25:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T364299)', diff saved to https://phabricator.wikimedia.org/P62226 and previous config saved to /var/cache/conftool/dbconfig/20240509-102512-marostegui.json
[10:25:17] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[10:31:43] <wikibugs>	 07sre-alert-triage: Alert in need of triage: PybalBackendDown (instance elastic2090:0) - https://phabricator.wikimedia.org/T364528 (10LSobanski) 03NEW
[10:32:04] <wikibugs>	 07sre-alert-triage: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T364529 (10LSobanski) 03NEW
[10:32:30] <wikibugs>	 (03PS1) 10Santiago Faci: edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493
[10:34:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62227 and previous config saved to /var/cache/conftool/dbconfig/20240509-103417-root.json
[10:35:26] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] fifo-log-demux: removed unused resources [puppet] - 10https://gerrit.wikimedia.org/r/1029191 (https://phabricator.wikimedia.org/T355905) (owner: 10Fabfur)
[10:39:53] <wikibugs>	 (03PS1) 10Santiago Faci: Revert "mediawiki_history_reduced_snaphost automation: Updating editor-analytics" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029263
[10:40:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P62228 and previous config saved to /var/cache/conftool/dbconfig/20240509-104019-marostegui.json
[10:41:33] <wikibugs>	 (03PS2) 10Btullis: Revert "mediawiki_history_reduced_snaphost automation: Updating editor-analytics" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029263 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci)
[10:46:01] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029263 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci)
[10:47:20] <wikibugs>	 (03CR) 10Aklapper: [C:03+1] "Thanks! This looks correct and I get the same results locally" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1022053 (https://phabricator.wikimedia.org/T318763) (owner: 10Pppery)
[10:49:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62229 and previous config saved to /var/cache/conftool/dbconfig/20240509-104922-root.json
[10:50:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1029480 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur)
[10:52:01] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Drop the deprecated dumps fetcher that pulls from stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/1029176 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis)
[10:53:44] <wikibugs>	 07sre-alert-triage, 06SRE Observability: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T364529#9782931 (10LSobanski)
[10:55:09] <wikibugs>	 07sre-alert-triage, 10SRE Observability (FY2023/2024-Q4): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#9782948 (10LSobanski)
[10:55:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P62230 and previous config saved to /var/cache/conftool/dbconfig/20240509-105527-marostegui.json
[10:55:31] <wikibugs>	 07sre-alert-triage, 06SRE Observability: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T364529#9782946 (10LSobanski) →14Duplicate dup:03T354255
[10:56:32] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: PybalBackendDown (instance elastic2090:0) - https://phabricator.wikimedia.org/T364528#9782956 (10LSobanski)
[10:57:07] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol)
[11:01:51] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] Revert "mediawiki_history_reduced_snaphost automation: Updating editor-analytics" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029263 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci)
[11:02:55] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mediawiki_history_reduced_snaphost automation: Updating editor-analytics" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029263 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci)
[11:02:59] <wikibugs>	 (03PS5) 10Btullis: hadoop: make analytics DB password available to analytics-product user [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol)
[11:03:30] <wikibugs>	 (03CR) 10Btullis: "I updated the commit message a bit to refer to the correct user/group." [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol)
[11:04:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62231 and previous config saved to /var/cache/conftool/dbconfig/20240509-110430-root.json
[11:05:30] <wikibugs>	 (03PS2) 10Santiago Faci: edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493
[11:05:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493 (owner: 10Santiago Faci)
[11:05:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hadoop: make analytics DB password available to analytics-product user [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol)
[11:09:49] <wikibugs>	 (03PS1) 10Majavah: site: Move cloudnet2007/8-dev back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1029496 (https://phabricator.wikimedia.org/T358761)
[11:09:51] <wikibugs>	 (03PS1) 10Majavah: site: Move cloudnet2005-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029497 (https://phabricator.wikimedia.org/T358761)
[11:09:53] <wikibugs>	 (03PS1) 10Majavah: site: Move cloudnet2006-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761)
[11:10:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T364299)', diff saved to https://phabricator.wikimedia.org/P62232 and previous config saved to /var/cache/conftool/dbconfig/20240509-111037-marostegui.json
[11:10:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance
[11:10:41] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[11:10:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance
[11:11:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T364299)', diff saved to https://phabricator.wikimedia.org/P62233 and previous config saved to /var/cache/conftool/dbconfig/20240509-111100-marostegui.json
[11:11:48] <wikibugs>	 (03PS6) 10Btullis: hadoop: make analytics DB password available to analytics-product user [puppet] - 10https://gerrit.wikimedia.org/r/1028814 (https://phabricator.wikimedia.org/T363437) (owner: 10Brouberol)
[11:19:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62234 and previous config saved to /var/cache/conftool/dbconfig/20240509-111936-root.json
[11:34:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62235 and previous config saved to /var/cache/conftool/dbconfig/20240509-113443-root.json
[11:35:17] <wikibugs>	 (03PS3) 10Santiago Faci: edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493
[11:35:24] <wikibugs>	 (03CR) 10CI reject: [V:04-1] edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493 (owner: 10Santiago Faci)
[11:35:42] <wikibugs>	 (03PS4) 10Santiago Faci: edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493
[11:35:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493 (owner: 10Santiago Faci)
[11:36:34] <wikibugs>	 (03Abandoned) 10Santiago Faci: edit*-analytics: Updating the mediawiki history reduced snapshot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029493 (owner: 10Santiago Faci)
[11:39:33] <wikibugs>	 (03PS1) 10Santiago Faci: edit*-analytics: Updating the mediawiki history reduced snaphost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029504
[11:41:06] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good. Hopefully this will be the very last time we have to do it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029504 (owner: 10Santiago Faci)
[11:41:37] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:41:44] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] edit*-analytics: Updating the mediawiki history reduced snaphost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029504 (owner: 10Santiago Faci)
[11:42:14] <wikibugs>	 (03PS1) 10Jforrester: Don't define wmgUseListings, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029506
[11:42:44] <wikibugs>	 (03Merged) 10jenkins-bot: edit*-analytics: Updating the mediawiki history reduced snaphost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029504 (owner: 10Santiago Faci)
[11:43:45] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[11:44:17] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[11:44:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T364299)', diff saved to https://phabricator.wikimedia.org/P62236 and previous config saved to /var/cache/conftool/dbconfig/20240509-114417-marostegui.json
[11:44:23] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[11:45:26] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply
[11:45:47] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply
[11:45:59] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply
[11:46:36] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply
[11:47:18] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[11:48:21] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[11:49:13] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply
[11:49:29] <wikibugs>	 (03CR) 10Majavah: [C:03+2] site: Move cloudnet2007/8-dev back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1029496 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah)
[11:49:36] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply
[11:50:02] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply
[11:50:19] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply
[11:50:28] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudnet2007-dev.codfw.wmnet with OS bookworm
[11:51:03] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudnet2008-dev.codfw.wmnet with OS bookworm
[11:51:27] <wikibugs>	 (03PS1) 10Fabfur: cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029508 (https://phabricator.wikimedia.org/T364379)
[11:52:32] <wikibugs>	 (03PS1) 10Btullis: Move snapshot1009 to insetup::data_engineering [puppet] - 10https://gerrit.wikimedia.org/r/1029509 (https://phabricator.wikimedia.org/T364456)
[11:59:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P62237 and previous config saved to /var/cache/conftool/dbconfig/20240509-115925-marostegui.json
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1200)
[12:09:40] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage
[12:09:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1192', diff saved to https://phabricator.wikimedia.org/P62239 and previous config saved to /var/cache/conftool/dbconfig/20240509-120955-root.json
[12:10:11] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2008-dev.codfw.wmnet with reason: host reimage
[12:11:01] <wikibugs>	 (03PS1) 10Marostegui: db1192: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1029516
[12:11:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1192.eqiad.wmnet with OS bookworm
[12:11:39] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1192: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1029516 (owner: 10Marostegui)
[12:12:34] <wikibugs>	 (03PS1) 10Ladsgroup: Return array from LocalAuth::getCentralLists [extensions/SecurePoll] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029265 (https://phabricator.wikimedia.org/T364538)
[12:12:57] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage
[12:13:12] <Amir1>	 jouncebot: nowandnext
[12:13:13] <jouncebot>	 For the next 0 hour(s) and 46 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1200)
[12:13:13] <jouncebot>	 In 0 hour(s) and 46 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1300)
[12:13:23] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Return array from LocalAuth::getCentralLists [extensions/SecurePoll] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029265 (https://phabricator.wikimedia.org/T364538) (owner: 10Ladsgroup)
[12:13:27] <zabe>	 oh, securepoll being broken during an election
[12:13:34] <zabe>	 this never happened before
[12:14:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P62240 and previous config saved to /var/cache/conftool/dbconfig/20240509-121433-marostegui.json
[12:16:05] <wikibugs>	 (03Merged) 10jenkins-bot: Return array from LocalAuth::getCentralLists [extensions/SecurePoll] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029265 (https://phabricator.wikimedia.org/T364538) (owner: 10Ladsgroup)
[12:16:21] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2008-dev.codfw.wmnet with reason: host reimage
[12:17:13] <taavi>	 zabe: did we ever get to removing the labtestwikitech hack from there?
[12:18:06] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1029265|Return array from LocalAuth::getCentralLists (T364538)]]
[12:18:10] <stashbot>	 T364538: Voting in U4C election is not possible anymore - https://phabricator.wikimedia.org/T364538
[12:18:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:18:46] <zabe>	 yeah I actually think it got removed a few months ago (but I wasn't involved)
[12:19:21] <zabe>	 ok
[12:19:28] <zabe>	 6 weeks ago
[12:19:29] <zabe>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/854528
[12:19:52] <zabe>	 and actually also in a rather hacky way
[12:20:48] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1029265|Return array from LocalAuth::getCentralLists (T364538)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:21:47] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[12:22:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2362/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[12:24:09] <wikibugs>	 (03PS2) 10Majavah: site: Move cloudnet2005-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029497 (https://phabricator.wikimedia.org/T358761)
[12:24:09] <wikibugs>	 (03PS2) 10Majavah: site: Move cloudnet2006-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761)
[12:24:42] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage
[12:25:32] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2363/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029497 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah)
[12:26:02] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:26:07] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1192: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1029546
[12:26:28] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:27:19] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage
[12:27:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Patch LGTM, PCC needs to run on titan hosts which now do show a diff: https://puppet-compiler.wmflabs.org/output/1028876/2362/" [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[12:28:39] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1419 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:28:39] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1384 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:28:43] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1453 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:29:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T364299)', diff saved to https://phabricator.wikimedia.org/P62241 and previous config saved to /var/cache/conftool/dbconfig/20240509-122941-marostegui.json
[12:29:44] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2192.codfw.wmnet with reason: Maintenance
[12:29:47] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1463 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:29:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2192.codfw.wmnet with reason: Maintenance
[12:30:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T364299)', diff saved to https://phabricator.wikimedia.org/P62242 and previous config saved to /var/cache/conftool/dbconfig/20240509-123004-marostegui.json
[12:30:50] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2007-dev.codfw.wmnet with OS bookworm
[12:31:33] <wikibugs>	 (03PS14) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918)
[12:32:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox)
[12:33:21] <wikibugs>	 (03PS15) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918)
[12:33:45] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2008-dev.codfw.wmnet with OS bookworm
[12:33:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox)
[12:34:17] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] site: Move cloudnet2005-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029497 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah)
[12:34:48] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1029265|Return array from LocalAuth::getCentralLists (T364538)]] (duration: 16m 41s)
[12:35:38] <wikibugs>	 (03PS16) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918)
[12:36:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox)
[12:37:21] <wikibugs>	 (03PS17) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918)
[12:37:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox)
[12:38:14] <wikibugs>	 (03PS18) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918)
[12:38:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox)
[12:39:56] <wikibugs>	 (03PS19) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918)
[12:40:20] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Create cookbook to rebuild an MD RAID array upon disk replacement - https://phabricator.wikimedia.org/T364540 (10Volans) 03NEW p:05Triage→03Medium
[12:40:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox)
[12:44:22] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudnet2005-dev.codfw.wmnet with OS bookworm
[12:44:47] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1192: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1029546 (owner: 10Marostegui)
[12:44:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62243 and previous config saved to /var/cache/conftool/dbconfig/20240509-124449-root.json
[12:45:37] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1192 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1028939 (https://phabricator.wikimedia.org/T364541)
[12:45:41] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/1028940 (https://phabricator.wikimedia.org/T364541)
[12:48:23] <wikibugs>	 (03PS2) 10Elukey: role::swift::proxy: move codfw envoys to PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412)
[12:49:47] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1192.eqiad.wmnet with OS bookworm
[12:50:33] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[12:50:55] <elukey>	 !log depool/upgrade/repool ms-fe20[09-14] to upgrade envoy to TLS PKI certs 
[12:50:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:29] <wikibugs>	 (03PS1) 10Fabfur: benthos: allow stdin/stdout/stderr in systemd template [puppet] - 10https://gerrit.wikimedia.org/r/1029529
[12:52:32] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2009.codfw.wmnet
[12:52:38] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] role::swift::proxy: move codfw envoys to PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/1028860 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey)
[12:52:47] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:52:47] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:53:05] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:56:03] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:58:16] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2009.codfw.wmnet
[12:58:39] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1419 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:58:39] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1384 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:58:43] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1453 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:58:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T364299)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240509-125843-marostegui.json
[12:58:55] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[12:58:57] <wikibugs>	 (PS20) Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918)
[12:59:11] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2010.codfw.wmnet
[12:59:12] <wikibugs>	 (PS2) Fabfur: benthos: allow stdin/stdout/stderr in systemd template [puppet] - https://gerrit.wikimedia.org/r/1029529
[12:59:31] <wikibugs>	 (CR) CI reject: [V:-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: Paladox)
[12:59:48] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1463 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:59:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62244 and previous config saved to /var/cache/conftool/dbconfig/20240509-125955-root.json
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1300)
[13:00:05] <jouncebot>	 DreamRimmer: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:26] <DreamRimmer>	 I am around
[13:00:43] <wikibugs>	 (PS21) Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918)
[13:01:33] <wikibugs>	 (CR) Paladox: Allow users to recheck tests in checkers (6 comments) [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: Paladox)
[13:03:25] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage
[13:03:32] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2010.codfw.wmnet
[13:04:01] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2011.codfw.wmnet
[13:04:48] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:05:04] <wikibugs>	 (PS1) Elukey: Add fake TLS keystore password for Cassandra clusters [labs/private] - https://gerrit.wikimedia.org/r/1029538 (https://phabricator.wikimedia.org/T352647)
[13:05:08] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:05:53] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:06:09] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage
[13:07:46] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2011.codfw.wmnet
[13:08:07] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2012.codfw.wmnet
[13:12:02] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2012.codfw.wmnet
[13:13:55] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2013.codfw.wmnet
[13:13:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P62245 and previous config saved to /var/cache/conftool/dbconfig/20240509-131355-marostegui.json
[13:15:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62246 and previous config saved to /var/cache/conftool/dbconfig/20240509-131501-root.json
[13:16:21] <DreamRimmer>	 who is the deployer today?
[13:17:29] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2013.codfw.wmnet
[13:17:45] <wikibugs>	 (CR) CDanis: [C:+1] cache:benthos: move processors in the pipeline section [puppet] - https://gerrit.wikimedia.org/r/1029480 (https://phabricator.wikimedia.org/T364379) (owner: Fabfur)
[13:19:02] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2014.codfw.wmnet
[13:20:09] <JSherman>	 DreamRimmer: I'm a WMF employee (but not a deployer), I'll see if I can raise somebody on slack.
[13:23:04] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2014.codfw.wmnet
[13:24:20] <JSherman>	 DreamRimmer: TheresNoTime will be here in 5 minutes
[13:24:49] <TheresNoTime>	 (o/ one moment)
[13:24:59] <cdanis>	 jouncebot: nowandnext
[13:24:59] <jouncebot>	 For the next 0 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1300)
[13:24:59] <jouncebot>	 In 2 hour(s) and 35 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1600)
[13:25:43] <DreamRimmer>	 thanks 
[13:26:02] <TheresNoTime>	 DreamRimmer: starting now
[13:26:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1029237 (https://phabricator.wikimedia.org/T355129) (owner: Dreamrimmer)
[13:26:49] <wikibugs>	 (PS1) Btullis: Add the airflow profile to the statistics::explorer role [puppet] - https://gerrit.wikimedia.org/r/1029541 (https://phabricator.wikimedia.org/T364542)
[13:26:58] <wikibugs>	 (Merged) jenkins-bot: quwiki: Set MetaNamespaceName to Wikipidiya [mediawiki-config] - https://gerrit.wikimedia.org/r/1029237 (https://phabricator.wikimedia.org/T355129) (owner: Dreamrimmer)
[13:27:34] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:1029237|quwiki: Set MetaNamespaceName to Wikipidiya (T355129)]]
[13:27:34] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2005-dev.codfw.wmnet with OS bookworm
[13:27:37] <stashbot>	 T355129: Localised name for quwiki - https://phabricator.wikimedia.org/T355129
[13:29:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P62247 and previous config saved to /var/cache/conftool/dbconfig/20240509-132905-marostegui.json
[13:30:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62248 and previous config saved to /var/cache/conftool/dbconfig/20240509-133009-root.json
[13:30:13] <logmsgbot>	 !log samtar@deploy1002 dreamrimmer and samtar: Backport for [[gerrit:1029237|quwiki: Set MetaNamespaceName to Wikipidiya (T355129)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:30:18] <TheresNoTime>	 DreamRimmer: patch is live on mwdebug, can you test?
[13:30:26] <DreamRimmer>	 doing
[13:32:18] <DreamRimmer>	 looks good
[13:33:43] <DreamRimmer>	 TheresNoTime: good to go
[13:34:05] <logmsgbot>	 !log samtar@deploy1002 dreamrimmer and samtar: Continuing with sync
[13:38:35] <wikibugs>	 (PS3) Fabfur: benthos: allow stdin/stdout/stderr in systemd template [puppet] - https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543)
[13:38:47] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1485 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:38:56] <wikibugs>	 (CR) CI reject: [V:-1] benthos: allow stdin/stdout/stderr in systemd template [puppet] - https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543) (owner: Fabfur)
[13:39:38] <TheresNoTime>	 (sync is a little slow..)
[13:41:00] <wikibugs>	 (PS1) Elukey: services: move Swift config in staging to local envoy proxy [deployment-charts] - https://gerrit.wikimedia.org/r/1029544 (https://phabricator.wikimedia.org/T344324)
[13:41:21] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2019 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:41:23] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2011 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:42:21] <wikibugs>	 (CR) Elukey: "I know that in the task Joe suggested otherwise, and for good reasons, but the ML team used the local proxy for recommendation-api-ng and " [deployment-charts] - https://gerrit.wikimedia.org/r/1029544 (https://phabricator.wikimedia.org/T344324) (owner: Elukey)
[13:42:24] <wikibugs>	 (PS4) Fabfur: benthos: allow stdin/stdout/stderr in systemd template [puppet] - https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543)
[13:44:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T364299)', diff saved to https://phabricator.wikimedia.org/P62249 and previous config saved to /var/cache/conftool/dbconfig/20240509-134412-marostegui.json
[13:44:15] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2201.codfw.wmnet with reason: Maintenance
[13:44:17] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[13:44:29] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2201.codfw.wmnet with reason: Maintenance
[13:45:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62250 and previous config saved to /var/cache/conftool/dbconfig/20240509-134514-root.json
[13:47:15] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1029237|quwiki: Set MetaNamespaceName to Wikipidiya (T355129)]] (duration: 19m 41s)
[13:47:18] <stashbot>	 T355129: Localised name for quwiki - https://phabricator.wikimedia.org/T355129
[13:50:16] <DreamRimmer>	 TheresNoTime: Thanks for your valuable time, I appreciate it:)
[13:50:49] <wikibugs>	 (CR) Eevans: [C:+2] Reimage aqs1013 w/o preserving data [puppet] - https://gerrit.wikimedia.org/r/1029206 (https://phabricator.wikimedia.org/T364422) (owner: Eevans)
[13:51:14] <TheresNoTime>	 DreamRimmer: just need to run the dedupe script (I think)
[13:52:00] <wikibugs>	 (CR) Fabfur: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543) (owner: Fabfur)
[13:55:47] <wikibugs>	 (PS1) Elukey: Delete the Cassandra directory in secrets [labs/private] - https://gerrit.wikimedia.org/r/1029567 (https://phabricator.wikimedia.org/T352647)
[13:57:10] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host aqs1013.eqiad.wmnet with OS bullseye
[13:57:20] <wikibugs>	 ops-eqiad, SRE, Cassandra, Patch-For-Review: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9783352 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye
[13:57:51] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s3 on db1150 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1054, Errmsg: Error Unknown column pl_namespace in where clause on query. Default database: quwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:59:02] <marostegui>	 ^ checking that 
[13:59:14] <marostegui>	 jynus Amir1 ^ 
[13:59:30] <jynus>	 quwiki
[13:59:42] <jynus>	 is it a missing schema change?
[13:59:50] <marostegui>	 No, I know what it is
[13:59:56] <jynus>	 ?
[14:00:19] <marostegui>	 ah yes
[14:00:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62252 and previous config saved to /var/cache/conftool/dbconfig/20240509-140020-root.json
[14:00:27] <marostegui>	 Sorry I misread the error message
[14:00:42] <Amir1>	 sigh
[14:00:43] <TheresNoTime>	 hi, running namespaceDupes
[14:00:47] <TheresNoTime>	 on that wiki
[14:00:52] <marostegui>	 It is part of https://phabricator.wikimedia.org/T352010
[14:00:53] <TheresNoTime>	 is now stalled it seems?
[14:00:56] <Amir1>	 TheresNoTime: do not do that
[14:01:09] <wikibugs>	 (PS1) Ssingh: varnish: disable Chrome's private prefetch proxy [puppet] - https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126)
[14:01:13] <Amir1>	 which wiki is that
[14:01:15] <marostegui>	 lots of hosts broken in s3
[14:01:19] <marostegui>	 possible same error
[14:01:20] <TheresNoTime>	 Amir1: shall I cancel the running, `quwiki`
[14:01:22] <jynus>	 quwiki.pagelinks
[14:01:27] <Amir1>	 I'll fix this
[14:01:29] <marostegui>	 yep, same error on all the broken hosts
[14:01:53] <TheresNoTime>	 (cancelled running)
[14:02:07] <jynus>	 I am going to start logging in on the status page
[14:02:08] <marostegui>	 TheresNoTime: Yes, stop it for now
[14:02:13] <TheresNoTime>	 ack
[14:02:29] <marostegui>	 jynus: Not sure if it is really needed, just 4 hosts affected
[14:02:34] <jynus>	 only 4?
[14:02:35] <TheresNoTime>	 (if useful, https://phabricator.wikimedia.org/T355129#9783354 is the result of the dry run of `mwscript namespaceDupes.php --wiki quwiki`)
[14:02:40] <marostegui>	 jynus: yep
[14:02:40] <jynus>	 ok then standing by
[14:02:49] <marostegui>	 Amir1:  need help?
[14:02:51] <jynus>	 I thought it was a widespread breakage
[14:03:11] <Amir1>	 jynus: because the schema change is running
[14:03:12] <jynus>	 in any case taking IC in case it is needed
[14:03:46] <Amir1>	 marostegui: I'm doing db1157, let me give you a schema change to run on quwiki
[14:03:51] <marostegui>	 downtimed the hosts to avoid paging
[14:04:03] <marostegui>	 Amir1: if it is just adding the column, I can do that right now
[14:04:29] <Amir1>	 yeah, pl_namespace and pl_title
[14:04:34] <marostegui>	 ok doing
[14:05:13] <wikibugs>	 (CR) Andrea Denisse: [V:+1] "That makes sense, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[14:05:16] <wikibugs>	 (CR) Andrea Denisse: [V:+1 C:+2] thanos: Update TLS certificate in Envoy config to match CFSSL provisioning [puppet] - https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[14:05:22] <jynus>	 wiki edits looking good, so no apparent user impact
[14:05:41] <marostegui>	 doing db1166
[14:05:42] <marostegui>	 done
[14:05:44] <marostegui>	 doing db1175 now
[14:05:53] <jynus>	 but please anyone speak up if you see any weird wiki errors
[14:05:59] <TheresNoTime>	 !log ftr, did run `[samtar@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki quwiki --fix` for T355129, cancelled before complete due to outage
[14:06:08] <marostegui>	 db1175 done, going for db1189
[14:06:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:13] <stashbot>	 T355129: Localised name for quwiki - https://phabricator.wikimedia.org/T355129
[14:06:20] <Amir1>	 ALTER TABLE pagelinks ADD pl_namespace INT DEFAULT 0 NOT NULL, ADD pl_title VARBINARY(255) DEFAULT '' NOT NULL;
[14:06:23] <jynus>	 TheresNoTime: I don't think it is that per se, but the interaction of that and something else
[14:06:37] <marostegui>	 fixed db1189
[14:06:41] <marostegui>	 going for the backup source now
[14:06:46] <jynus>	 TheresNoTime: please stand by until dbas give the green light
[14:06:54] <marostegui>	 all done, all hosts replicating now
[14:06:54] <TheresNoTime>	 ack :)
[14:07:43] <Amir1>	 TheresNoTime: simply don't run it again, until the code is fixed to respect migration stage
[14:07:43] <marostegui>	 Amir1: db1189 is running the optimize now, that host is depooled anyway
[14:07:50] <marostegui>	 So all fixed
[14:07:51] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s3 on db1150 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:07:56] <jynus>	 I see db2201 s5 lagging, but I am guessing that is unrelated
[14:07:56] <TheresNoTime>	 Amir1: ack, will not run it again unless told otherwise
[14:08:06] <marostegui>	 jynus: yeah, that's a different schema change
[14:08:23] <Amir1>	 this is like the fourth time namespaceDupes is breaking stuff, exactly because it bypasses the links table abstraction
[14:08:24] <jynus>	 let's make sure icinga looks all green before continuing
[14:08:25] <marostegui>	 Amir1: you will need to re-add the columns in db1189 as the schema change is running there
[14:08:29] <marostegui>	 dropping them :)
[14:08:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2211.codfw.wmnet with reason: Maintenance
[14:08:43] <Amir1>	 marostegui: I'll do it
[14:08:47] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1485 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:08:50] <marostegui>	 icinga looks good to me now
[14:08:51] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2211.codfw.wmnet with reason: Maintenance
[14:08:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T364299)', diff saved to https://phabricator.wikimedia.org/P62253 and previous config saved to /var/cache/conftool/dbconfig/20240509-140858-marostegui.json
[14:09:02] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[14:09:18] <TheresNoTime>	 is there a task for "code is fixed to respect migration stage" ?
[14:09:18] <denisse>	 !log Restarting envoyproxy on titan* hosts as part of the CFSSL migration - T360414
[14:09:20] <marostegui>	 Amir1: should we disable or do something to namespaceDupes to avoid it breaking again?
[14:09:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:22] <jynus>	 I still see lag on db1157 and db1189
[14:09:22] <stashbot>	 T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414
[14:09:34] <Amir1>	 we should disable it again
[14:09:37] <marostegui>	 jynus: orchestrator doesn't show it https://orchestrator.wikimedia.org/web/cluster/alias/s3
[14:09:42] <marostegui>	 only db1189 which is depooled
[14:09:46] <Amir1>	 db1189 is expected
[14:09:48] <jynus>	 ok
[14:09:51] <Amir1>	 it's running the schema change
[14:10:04] <jynus>	 so issue averted, any followup needed?
[14:10:06] <Amir1>	 I want to drop the columns again 
[14:10:21] <jynus>	 or maybe you can coordinate directly with TheresNoTime
[14:10:55] <jynus>	 e.g. telling him to reschedule his maintenance
[14:11:21] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2019 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:11:23] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2011 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:11:26] <Amir1>	 that's a different thing
[14:11:43] <Amir1>	 I dropped it on quwiki on db1150 again
[14:11:58] <Amir1>	 so far works normally
[14:12:26] <Amir1>	 db1157 as well
[14:12:28] <marostegui>	 right
[14:12:44] <marostegui>	 Amir1: so I guess db1189 will need to get them re-added, let the data go, and then drop them
[14:12:54] <Amir1>	 because the maint script is not running, the normal writes shouldn't affect it
[14:13:00] <Amir1>	 yeah, fun stuff
[14:13:28] <wikibugs>	 (PS1) Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491)
[14:13:48] <wikibugs>	 (CR) BBlack: varnish: disable Chrome's private prefetch proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) (owner: Ssingh)
[14:14:04] <Amir1>	 T364546
[14:14:07] <wikibugs>	 (CR) CI reject: [V:-1] (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: Effie Mouzeli)
[14:14:10] <stashbot>	 T364546: namespaceDupes is not respecting links migration stage (again) - https://phabricator.wikimedia.org/T364546
[14:14:11] <jynus>	 So should wiki maintenance be stopped for now?
[14:14:21] <jynus>	 what is the right approach?
[14:14:28] <wikibugs>	 (CR) Ssingh: varnish: disable Chrome's private prefetch proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) (owner: Ssingh)
[14:14:40] <jynus>	 or maybe you can coordinate on that ticket?
[14:14:41] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872#9783406 (MatthewVernon) Can I very tentatively ask if you have thoughts about timescales for this, please? It seems likely to be a non-trivial bi...
[14:14:43] <jynus>	 to close the issue
[14:15:07] <jynus>	 ^ TheresNoTime Amir1 would that work?
[14:15:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62254 and previous config saved to /var/cache/conftool/dbconfig/20240509-141526-root.json
[14:15:41] <TheresNoTime>	 jynus: I've commented on T355129 and do not intend to run that script until otherwise told :) 
[14:15:41] <stashbot>	 T355129: Localised name for quwiki - https://phabricator.wikimedia.org/T355129
[14:15:46] <wikibugs>	 SRE, SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9783414 (andrea.denisse) Hi @jijiki , I've abandoned the patch for the thanos-fe hosts ([[ https://gerrit.wikimedia.org/r/1028546 | #1028546 ]]) but feel free to restore it...
[14:15:50] <wikibugs>	 (PS2) Ssingh: varnish: disable Chrome's private prefetch proxy [puppet] - https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126)
[14:16:04] <jynus>	 thank you, sorry for the impact
[14:16:27] <logmsgbot>	 !log eevans@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1013.eqiad.wmnet with OS bullseye
[14:16:33] <TheresNoTime>	 no problem for me, *I* was the one who broke things :D
[14:16:45] <wikibugs>	 ops-eqiad, SRE, Cassandra, Patch-For-Review: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9783440 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye executed with errors: - aqs1013 (**FAIL**)   - Downt...
[14:16:51] <jynus>	 as far as I understood, you weren't, you only hit a bug
[14:17:11] <TheresNoTime>	  /j
[14:17:22] <jynus>	 but better be safe, as this was not very impacting but quite scary if it got more widespread
[14:17:35] <jynus>	 thank you for your understanding
[14:17:54] <wikibugs>	 SRE, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9783442 (andrea.denisse) In progress→Resolved
[14:18:03] <Amir1>	 I have said multiple times, this is the only place that writes to links tables bypassing the abstraction in place. I asked multiple times to actually use the abstraction and every time the response I got was that "it's too much work, we fix this breakage to unlock the work"
[14:18:23] <wikibugs>	 (PS2) Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491)
[14:18:54] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host aqs1013.eqiad.wmnet with OS bullseye
[14:19:09] <wikibugs>	 ops-eqiad, SRE, Cassandra, Patch-For-Review: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9783447 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye
[14:19:09] <wikibugs>	 (CR) CI reject: [V:-1] (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: Effie Mouzeli)
[14:20:13] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:23:20] <wikibugs>	 (PS1) Ladsgroup: Disable namespaceDupes again [core] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029547 (https://phabricator.wikimedia.org/T364546)
[14:23:29] <wikibugs>	 (CR) Ladsgroup: [C:+2] Disable namespaceDupes again [core] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029547 (https://phabricator.wikimedia.org/T364546) (owner: Ladsgroup)
[14:24:04] <wikibugs>	 (PS1) Ladsgroup: Disable namespaceDupes again [core] (wmf/1.43.0-wmf.3) - https://gerrit.wikimedia.org/r/1029548 (https://phabricator.wikimedia.org/T364546)
[14:24:10] <wikibugs>	 (CR) Ladsgroup: [C:+2] Disable namespaceDupes again [core] (wmf/1.43.0-wmf.3) - https://gerrit.wikimedia.org/r/1029548 (https://phabricator.wikimedia.org/T364546) (owner: Ladsgroup)
[14:26:40] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:28:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kafka-main2006 to codfw - jhancock@cumin2002"
[14:29:50] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kafka-main2006 to codfw - jhancock@cumin2002"
[14:29:50] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:31:34] <wikibugs>	 (PS3) Ssingh: varnish: disable Chrome's private prefetch proxy [puppet] - https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126)
[14:32:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED
[14:32:18] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED
[14:32:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2009.mgmt.codfw.wmnet with reboot policy FORCED
[14:32:21] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2008.mgmt.codfw.wmnet with reboot policy FORCED
[14:32:22] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2008.mgmt.codfw.wmnet with reboot policy FORCED
[14:32:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2007.mgmt.codfw.wmnet with reboot policy FORCED
[14:32:26] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2006.mgmt.codfw.wmnet with reboot policy FORCED
[14:32:30] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2009.mgmt.codfw.wmnet with reboot policy FORCED
[14:33:29] <wikibugs>	 (CR) CDanis: [C:+1] "thanks Scott and Riccardo" [software/conftool] - https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: Volans)
[14:33:47] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2008.mgmt.codfw.wmnet with reboot policy FORCED
[14:33:48] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2008.mgmt.codfw.wmnet with reboot policy FORCED
[14:33:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2009.mgmt.codfw.wmnet with reboot policy FORCED
[14:33:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED
[14:34:00] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED
[14:34:08] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2009.mgmt.codfw.wmnet with reboot policy FORCED
[14:36:50] <wikibugs>	 (CR) Fabfur: [V:+1 C:+2] cache:benthos: move processors in the pipeline section [puppet] - https://gerrit.wikimedia.org/r/1029480 (https://phabricator.wikimedia.org/T364379) (owner: Fabfur)
[14:37:25] <wikibugs>	 (CR) Scott French: [C:+2] confd: prom exporter uses resource name to find state file [puppet] - https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924) (owner: Scott French)
[14:37:27] <wikibugs>	 ops-eqiad, SRE, DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9783503 (Jclark-ctr) @Marostegui  you can put server back in rotation  even though i uploaded multiple photos yesterday to Dell.  They replied this morning requesting part number to send correct part {F51422232} I attache...
[14:37:41] <wikibugs>	 (CR) Volans: reqconfig: add command to search IP in ipblocks (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: Volans)
[14:39:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T364299)', diff saved to https://phabricator.wikimedia.org/P62255 and previous config saved to /var/cache/conftool/dbconfig/20240509-143938-marostegui.json
[14:39:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9783539 (10Marostegui) Thanks John, I will create a subtask for us to work on the formatting, reimage and recloning. Will leave this open until you've finished your side.
[14:39:43] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[14:41:24] <Amir1>	 marostegui: do you have the query that broke it handy?
[14:41:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:42:09] <wikibugs>	 (03PS4) 10Ssingh: varnish: disable Chrome's private prefetch proxy [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126)
[14:43:02] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:43:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:44:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:45:05] <wikibugs>	 (03Merged) 10jenkins-bot: Disable namespaceDupes again [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1029547 (https://phabricator.wikimedia.org/T364546) (owner: 10Ladsgroup)
[14:45:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2008.mgmt.codfw.wmnet with reboot policy FORCED
[14:45:52] <wikibugs>	 (03CR) 10Elukey: "CI reports this:" [software/cumin] - 10https://gerrit.wikimedia.org/r/1029209 (owner: 10Volans)
[14:46:03] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2009.mgmt.codfw.wmnet with reboot policy FORCED
[14:46:07] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED
[14:46:35] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED
[14:48:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1029548 (https://phabricator.wikimedia.org/T364546) (owner: 10Ladsgroup)
[14:48:52] <wikibugs>	 (03CR) 10Ssingh: "#    top  TEST varnish/text/51-chrome-private-prefetch-proxy.vtc passed (7.040)" [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) (owner: 10Ssingh)
[14:48:54] <wikibugs>	 (03CR) 10Volans: "Yes I know, thanks. I'm looking for a fix that does make it build properly both in CI and in debian upstream" [software/cumin] - 10https://gerrit.wikimedia.org/r/1029209 (owner: 10Volans)
[14:49:38] <wikibugs>	 (03CR) 10BBlack: [C:03+1] varnish: disable Chrome's private prefetch proxy [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) (owner: 10Ssingh)
[14:50:49] <wikibugs>	 (03CR) 10Ssingh: varnish: disable Chrome's private prefetch proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) (owner: 10Ssingh)
[14:51:40] <wikibugs>	 (03Merged) 10jenkins-bot: Disable namespaceDupes again [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1029548 (https://phabricator.wikimedia.org/T364546) (owner: 10Ladsgroup)
[14:51:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998#9783587 (10Andrew) 05Stalled→03Invalid I'm closing this as invalid since those hosts have come and gone :)
[14:52:10] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1029547|Disable namespaceDupes again (T364546)]], [[gerrit:1029548|Disable namespaceDupes again (T364546)]]
[14:52:10] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main2006.mgmt.codfw.wmnet with reboot policy FORCED
[14:52:11] <wikibugs>	 (03PS4) 10Volans: reqconfig: add command to search IP in ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423)
[14:52:13] <stashbot>	 T364546: namespaceDupes is not respecting links migration stage (again) - https://phabricator.wikimedia.org/T364546
[14:52:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:53:01] <wikibugs>	 (03CR) 10Volans: "addressed comment" [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans)
[14:53:10] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] varnish: disable Chrome's private prefetch proxy [puppet] - 10https://gerrit.wikimedia.org/r/1029570 (https://phabricator.wikimedia.org/T364126) (owner: 10Ssingh)
[14:53:31] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2006.mgmt.codfw.wmnet with reboot policy FORCED
[14:54:07] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1013.eqiad.wmnet with reason: host reimage
[14:54:09] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:54:17] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main2006.mgmt.codfw.wmnet with reboot policy FORCED
[14:54:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P62256 and previous config saved to /var/cache/conftool/dbconfig/20240509-145445-marostegui.json
[14:54:49] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1029547|Disable namespaceDupes again (T364546)]], [[gerrit:1029548|Disable namespaceDupes again (T364546)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:55:02] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main2007.mgmt.codfw.wmnet with reboot policy FORCED
[14:55:02] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[14:55:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2006']
[14:55:55] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-main2006']
[14:56:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2007.mgmt.codfw.wmnet with reboot policy FORCED
[14:56:52] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main2008.mgmt.codfw.wmnet with reboot policy FORCED
[14:57:10] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1013.eqiad.wmnet with reason: host reimage
[14:57:39] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main2009.mgmt.codfw.wmnet with reboot policy FORCED
[14:58:17] <wikibugs>	 (03PS5) 10Fabfur: benthos: allow stdin/stdout/stderr in systemd template [puppet] - 10https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543)
[14:58:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I see where you are going with this, let me know what you think of these alternatives:" [puppet] - 10https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543) (owner: 10Fabfur)
[14:59:33] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1361 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:59:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:59:41] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1496 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:00:17] <sukhe>	 !log sudo cumin 'A:cp' 'run-puppet-agent --enable "merging CR 1029570"'
[15:00:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:33] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1422 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:00:39] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1384 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:00:59] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:01:15] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main2007.mgmt.codfw.wmnet with reboot policy FORCED
[15:01:50] <wikibugs>	 (03CR) 10Scott French: "Thanks, Riccardo!" [puppet] - 10https://gerrit.wikimedia.org/r/1029296 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French)
[15:03:09] <wikibugs>	 (03PS8) 10Fabfur: cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029508 (https://phabricator.wikimedia.org/T364379)
[15:05:24] <volans>	 sukhe: I hope you added some batching, that's 112 hosts all running puppet together otherwise :-P
[15:06:11] <sukhe>	 volans: yeah, I usually add it but didn't this time since I tested it before. but maybe I should have.
[15:06:39] <sukhe>	 even something like -s10 could have been nice, yeah
[15:08:12] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1029547|Disable namespaceDupes again (T364546)]], [[gerrit:1029548|Disable namespaceDupes again (T364546)]] (duration: 16m 02s)
[15:08:16] <stashbot>	 T364546: namespaceDupes is not respecting links migration stage (again) - https://phabricator.wikimedia.org/T364546
[15:08:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2007']
[15:08:32] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host snapshot1010.eqiad.wmnet
[15:08:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2009']
[15:09:02] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-main2007']
[15:09:31] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2008']
[15:09:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P62257 and previous config saved to /var/cache/conftool/dbconfig/20240509-150953-marostegui.json
[15:11:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-main2009']
[15:11:16] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-main2008']
[15:11:18] <volans>	 oh you can surely do -b 30 and even more, we didn't test the max concurrency yet with the new puppetservers
[15:11:36] <volans>	 but it seems that they survived the 64 parallel runs pretty fine
[15:11:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:14:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye
[15:14:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2007.codfw.wmnet with OS bullseye
[15:14:03] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2008.codfw.wmnet with OS bullseye
[15:14:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye
[15:14:13] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783655 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye
[15:14:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kafka-main2010 to codfw - jhancock@cumin2002"
[15:14:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783656 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2007.codfw.wmnet with OS bullseye
[15:14:15] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783658 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye
[15:14:27] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783657 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2008.codfw.wmnet with OS bullseye
[15:14:35] <wikibugs>	 (03CR) 10Eevans: [C:03+1] Deprecate system::role for Cassandra services [puppet] - 10https://gerrit.wikimedia.org/r/1026940 (owner: 10Muehlenhoff)
[15:15:02] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kafka-main2010 to codfw - jhancock@cumin2002"
[15:15:02] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:15:30] <volans>	 sukhe: https://grafana.wikimedia.org/d/000000477/puppetdb?orgId=1&from=now-1h&to=now doesn't seem too bad and https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&from=now-1h&to=now too
[15:15:36] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1010.eqiad.wmnet
[15:15:50] <volans>	 so yeah potentially we might start to not care about batches for puppet runs (to be verified)
[15:16:14] <sukhe>	 volans: also depends on the change, this one was fairly light at least in Puppet resources related stuff
[15:16:44] <volans>	 in the past catalog compilation was the failing bit
[15:17:33] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:18:26] <wikibugs>	 (03CR) 10CDanis: [C:03+1] reqconfig: add command to search IP in ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans)
[15:19:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kafka-main2010 to codfw - jhancock@cumin2002"
[15:20:13] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:39] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kafka-main2010 to codfw - jhancock@cumin2002"
[15:20:39] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:21:05] <wikibugs>	 (03CR) 10Elukey: "I checked the deployment-prep config for deployment-ms-fe04.deployment-prep.eqiad1.wikimedia.cloud:" [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[15:21:30] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.83.0" for 308 hosts
[15:21:37] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:22:02] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.83.0" for 307 hosts
[15:22:42] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.83.0" completed for 307 hosts
[15:23:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED
[15:25:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T364299)', diff saved to https://phabricator.wikimedia.org/P62258 and previous config saved to /var/cache/conftool/dbconfig/20240509-152501-marostegui.json
[15:25:05] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[15:27:11] <wikibugs>	 (03CR) 10Scott French: [C:03+2] confd: clean up confd-lint-wrap after error file fixes [puppet] - 10https://gerrit.wikimedia.org/r/1029296 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French)
[15:27:21] <logmsgbot>	 !log eevans@deploy1002 Started deploy [cassandra/logstash-logback-encoder@42653e6] (aqs): (no justification provided)
[15:27:55] <logmsgbot>	 !log eevans@deploy1002 Finished deploy [cassandra/logstash-logback-encoder@42653e6] (aqs): (no justification provided) (duration: 00m 33s)
[15:29:17] <wikibugs>	 (03CR) 10Fabfur: "I think I'll go with alternative #2: I'll drop this CR and do all the work (socket unit, service unit override for StandardInput) in the o" [puppet] - 10https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543) (owner: 10Fabfur)
[15:29:27] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1013.eqiad.wmnet with OS bullseye
[15:29:33] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1361 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:29:38] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9783694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye completed: - aqs1013 (**PASS**)   - Removed from Puppet and PuppetDB if pr...
[15:29:41] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1496 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:30:33] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1422 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:30:39] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1384 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:31:04] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Bootstrapping — T364422
[15:31:08] <stashbot>	 T364422: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422
[15:31:18] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Bootstrapping — T364422
[15:31:27] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9783703 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e110d57c-bacd-48ee-8333-fae55b264d8c) set by eevans@cumin1002 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrappin...
[15:31:40] <wikibugs>	 (03PS1) 10EoghanGaffney: apt: Update gitlab package [puppet] - 10https://gerrit.wikimedia.org/r/1029608 (https://phabricator.wikimedia.org/T364481)
[15:34:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main2010.mgmt.codfw.wmnet with reboot policy FORCED
[15:34:59] <wikibugs>	 (03PS2) 10Btullis: Add the airflow profile to the statistics::explorer role [puppet] - 10https://gerrit.wikimedia.org/r/1029541 (https://phabricator.wikimedia.org/T364542)
[15:35:15] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2010']
[15:35:32] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-main2010']
[15:35:47] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2010']
[15:36:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-main2010']
[15:36:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye
[15:36:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783718 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2010.codfw.wmnet with OS bullseye
[15:37:03] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 5 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1029541 (https://phabricator.wikimedia.org/T364542) (owner: 10Btullis)
[15:39:49] <wikibugs>	 (03Abandoned) 10Fabfur: benthos: allow stdin/stdout/stderr in systemd template [puppet] - 10https://gerrit.wikimedia.org/r/1029529 (https://phabricator.wikimedia.org/T364543) (owner: 10Fabfur)
[15:39:54] <wikibugs>	 (03CR) 10Volans: [C:03+2] reqconfig: add command to search IP in ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans)
[15:43:00] <wikibugs>	 (03Merged) 10jenkins-bot: reqconfig: add command to search IP in ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans)
[16:00:05] <jouncebot>	 jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1600).
[16:00:05] <jouncebot>	 thedj: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:17] <jhathaway>	 o/
[16:01:52] <wikibugs>	 (03PS1) 10Ssingh: varnish: update handling Chrome disable private fetch response [puppet] - 10https://gerrit.wikimedia.org/r/1029614 (https://phabricator.wikimedia.org/T364126)
[16:03:06] <wikibugs>	 (03PS1) 10Fabfur: cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379)
[16:03:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur)
[16:03:32] <wikibugs>	 (03PS1) 10Andrew Bogott: Replace cloudcontrol2001-dev with cloudcontrol2006-dev [puppet] - 10https://gerrit.wikimedia.org/r/1029616 (https://phabricator.wikimedia.org/T354206)
[16:05:13] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:07:11] <wikibugs>	 (03CR) 10Eevans: [C:03+1] Add fake TLS keystore password for Cassandra clusters [labs/private] - 10https://gerrit.wikimedia.org/r/1029538 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey)
[16:07:24] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) - https://phabricator.wikimedia.org/T363514#9783825 (10Miriam) Hello, sorry for the delay! Approved on my end. Thank you!
[16:08:59] <wikibugs>	 (03PS1) 10Andrew Bogott: Move rabbitmq01.codfw1dev to cloudcontrol2006-dev [dns] - 10https://gerrit.wikimedia.org/r/1029618
[16:09:21] <wikibugs>	 (03CR) 10Eevans: [C:03+1] Delete the Cassandra directory in secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1029567 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey)
[16:10:44] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Replace cloudcontrol2001-dev with cloudcontrol2006-dev [puppet] - 10https://gerrit.wikimedia.org/r/1029616 (https://phabricator.wikimedia.org/T354206) (owner: 10Andrew Bogott)
[16:13:12] <wikibugs>	 (03PS1) 10Elukey: ml-services: Update Docker image for nllb-gpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029619 (https://phabricator.wikimedia.org/T362984)
[16:13:58] <wikibugs>	 (03CR) 10Elukey: [C:03+2] ml-services: Update Docker image for nllb-gpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029619 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey)
[16:14:52] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Update Docker image for nllb-gpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029619 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey)
[16:14:54] <wikibugs>	 (03PS2) 10Ssingh: varnish: update handling Chrome disable private fetch response [puppet] - 10https://gerrit.wikimedia.org/r/1029614 (https://phabricator.wikimedia.org/T364126)
[16:16:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9783851 (10andrea.denisse)
[16:18:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:20:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[16:22:03] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872#9783853 (10cmooney) >>! In T354872#9529469, @MatthewVernon wrote: > Sorry, I think object stores are often not really written with renumbering in m...
[16:22:33] <wikibugs>	 (03PS2) 10Fabfur: cache:benthos: test for socket based activation in Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379)
[16:23:54] <wikibugs>	 10ops-codfw, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: Create (or teach Andrew how to create) private connections+dns entries for new cloudcontrols - https://phabricator.wikimedia.org/T364559 (10Andrew) 03NEW
[16:26:28] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:29:16] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[16:31:40] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add entries for new codfw cloudcontrol nodes - cmooney@cumin1002"
[16:32:33] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add entries for new codfw cloudcontrol nodes - cmooney@cumin1002"
[16:32:34] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:32:55] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host kafka-main2007.codfw.wmnet with OS bullseye
[16:32:59] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host kafka-main2006.codfw.wmnet with OS bullseye
[16:32:59] <wikibugs>	 (03PS3) 10Ssingh: varnish: update handling Chrome disable private fetch response [puppet] - 10https://gerrit.wikimedia.org/r/1029614 (https://phabricator.wikimedia.org/T364126)
[16:33:55] <wikibugs>	 10ops-codfw, 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: Create (or teach Andrew how to create) private connections+dns entries for new cloudcontrols - https://phabricator.wikimedia.org/T364559#9783893 (10cmooney) Hey Andrew,  Yeah this is on me, I'd not completed the work to ma...
[16:34:07] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye
[16:34:13] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783905 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed...
[16:35:43] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache cloudcontrol2006-dev.private.codfw.wikimedia.cloud on all recursors
[16:35:47] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2006-dev.private.codfw.wikimedia.cloud on all recursors
[16:35:49] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2008.codfw.wmnet with OS bullseye
[16:35:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9783909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2008.codfw.wmnet with OS bullseye executed...
[16:40:25] <wikibugs>	 (03PS1) 10Elukey: ml-services: add nllb-gpu to ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029625 (https://phabricator.wikimedia.org/T362984)
[16:41:38] <wikibugs>	 (03PS4) 10Ssingh: varnish: update handling Chrome disable private fetch response [puppet] - 10https://gerrit.wikimedia.org/r/1029614 (https://phabricator.wikimedia.org/T364126)
[16:43:11] <wikibugs>	 (03CR) 10Elukey: [C:03+2] ml-services: add nllb-gpu to ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029625 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey)
[16:43:48] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] ml-services: add nllb-gpu to ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029625 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey)
[16:47:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:49:02] <wikibugs>	 (CR) BBlack: [C:+1] varnish: update handling Chrome disable private fetch response [puppet] - https://gerrit.wikimedia.org/r/1029614 (https://phabricator.wikimedia.org/T364126) (owner: Ssingh)
[16:49:11] <wikibugs>	 (CR) Ssingh: [C:+2] varnish: update handling Chrome disable private fetch response [puppet] - https://gerrit.wikimedia.org/r/1029614 (https://phabricator.wikimedia.org/T364126) (owner: Ssingh)
[16:49:29] <sukhe>	 !log sudo cumin 'A:cp' 'disable-puppet "merging CR 1029614"'
[16:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:13] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:51:08] <wikibugs>	 (PS1) Andrew Bogott: Revert "Replace cloudcontrol2001-dev with cloudcontrol2006-dev" [puppet] - https://gerrit.wikimedia.org/r/1029551
[16:51:35] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Revert "Replace cloudcontrol2001-dev with cloudcontrol2006-dev" [puppet] - https://gerrit.wikimedia.org/r/1029551 (owner: Andrew Bogott)
[16:51:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T352010)', diff saved to https://phabricator.wikimedia.org/P62259 and previous config saved to /var/cache/conftool/dbconfig/20240509-165141-ladsgroup.json
[16:51:48] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[16:53:51] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcontrol2006-dev.codfw.wmnet with OS bookworm
[16:55:32] <sukhe>	 !log sudo cumin -b30 'A:cp' 'run-puppet-agent --enable "merging CR 1029614"'
[16:55:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:41] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2010.codfw.wmnet with OS bullseye
[17:00:05] <jouncebot>	 bd808: Your horoscope predicts another Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1700)
[17:06:30] <icinga-wm>	 PROBLEM - SSH on ncmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:06:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P62260 and previous config saved to /var/cache/conftool/dbconfig/20240509-170649-ladsgroup.json
[17:09:20] <icinga-wm>	 RECOVERY - SSH on ncmonitor1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:13:19] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2006-dev.codfw.wmnet with reason: host reimage
[17:16:37] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2006-dev.codfw.wmnet with reason: host reimage
[17:18:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:21:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P62261 and previous config saved to /var/cache/conftool/dbconfig/20240509-172157-ladsgroup.json
[17:23:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:27:21] <wikibugs>	 (PS1) Lucas Werkmeister: Skin: Fix UrlUtils calls [core] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029552 (https://phabricator.wikimedia.org/T364539)
[17:30:55] <wikibugs>	 (CR) Lucas Werkmeister: "Deployment can be tested on Test Wikidata, because https://test.wikidata.org/wiki/MediaWiki:Recentchanges-url is a protocol-relative URL (" [core] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029552 (https://phabricator.wikimedia.org/T364539) (owner: Lucas Werkmeister)
[17:33:52] <wikibugs>	 (PS1) Xcollazo: Dumps: Include wikis with underscores in the list of folders to be mirrored. [puppet] - https://gerrit.wikimedia.org/r/1029633 (https://phabricator.wikimedia.org/T354687)
[17:34:03] <wikibugs>	 (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1029634
[17:34:04] <wikibugs>	 (PS1) Jforrester: Revert "Action APIs: Set most of our APIs to emit a cache header for 24 hours" [extensions/WikiLambda] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029556 (https://phabricator.wikimedia.org/T364567)
[17:34:13] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2006-dev.codfw.wmnet with OS bookworm
[17:34:39] <James_F>	 jouncebot: nowandnext
[17:34:39] <jouncebot>	 For the next 0 hour(s) and 25 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1700)
[17:34:40] <jouncebot>	 For the next 0 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1700)
[17:34:40] <jouncebot>	 In 0 hour(s) and 25 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1800)
[17:34:45] <James_F>	 I'm going to do an emergency deploy to unbreak Wikifunctions editing.
[17:34:59] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1002 using scap backport" [extensions/WikiLambda] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029556 (https://phabricator.wikimedia.org/T364567) (owner: Jforrester)
[17:35:42] <wikibugs>	 (PS1) Andrew Bogott: Revert "Revert "Replace cloudcontrol2001-dev with cloudcontrol2006-dev"" [puppet] - https://gerrit.wikimedia.org/r/1029557
[17:36:45] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Revert "Revert "Replace cloudcontrol2001-dev with cloudcontrol2006-dev"" [puppet] - https://gerrit.wikimedia.org/r/1029557 (owner: Andrew Bogott)
[17:37:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T352010)', diff saved to https://phabricator.wikimedia.org/P62262 and previous config saved to /var/cache/conftool/dbconfig/20240509-173705-ladsgroup.json
[17:37:08] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[17:37:12] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:37:21] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[17:37:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T352010)', diff saved to https://phabricator.wikimedia.org/P62263 and previous config saved to /var/cache/conftool/dbconfig/20240509-173728-ladsgroup.json
[17:40:42] <wikibugs>	 (Merged) jenkins-bot: Revert "Action APIs: Set most of our APIs to emit a cache header for 24 hours" [extensions/WikiLambda] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029556 (https://phabricator.wikimedia.org/T364567) (owner: Jforrester)
[17:40:50] <James_F>	 Finally.
[17:41:15] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1029556|Revert "Action APIs: Set most of our APIs to emit a cache header for 24 hours" (T364567)]]
[17:41:20] <stashbot>	 T364567: Editing labels in Wikifunctions' Objects doesn't get reflected in the API response because it's cached - https://phabricator.wikimedia.org/T364567
[17:43:53] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1029556|Revert "Action APIs: Set most of our APIs to emit a cache header for 24 hours" (T364567)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:45:22] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Continuing with sync
[17:53:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:58:32] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1029556|Revert "Action APIs: Set most of our APIs to emit a cache header for 24 hours" (T364567)]] (duration: 17m 17s)
[17:58:38] <stashbot>	 T364567: Editing labels in Wikifunctions' Objects doesn't get reflected in the API response because it's cached - https://phabricator.wikimedia.org/T364567
[18:00:05] <jouncebot>	 jeena and jnuche: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T1800).
[18:10:24] <wikibugs>	 (PS5) Herron: pyrra: onboard haproxy slo from grizzly [puppet] - https://gerrit.wikimedia.org/r/1029634 (https://phabricator.wikimedia.org/T302995)
[18:20:18] <wikibugs>	 (PS2) Scott French: benthos: adopt securityContext and base.external-services-networkpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T362978)
[18:20:18] <wikibugs>	 (PS2) Scott French: blubberoid: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1028911 (https://phabricator.wikimedia.org/T362978)
[18:21:42] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Move rabbitmq01.codfw1dev to cloudcontrol2006-dev [dns] - https://gerrit.wikimedia.org/r/1029618 (owner: Andrew Bogott)
[18:27:10] <wikibugs>	 (PS3) Scott French: benthos: adopt securityContext and base.external-services-networkpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T359423)
[18:27:13] <wikibugs>	 (PS3) Scott French: blubberoid: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1028911 (https://phabricator.wikimedia.org/T362978)
[18:28:50] <wikibugs>	 (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1029654
[18:31:05] <wikibugs>	 (CR) Scott French: "Ah, right! I think I've got it right based on the diffs. Please take a look when you have a chance." [deployment-charts] - https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T359423) (owner: Scott French)
[18:38:44] <wikibugs>	 SRE, SRE Observability: confd prom exporter cannot distinguish targets with a common base name - https://phabricator.wikimedia.org/T363924#9784309 (Scott_French) Open→Resolved The last two patches have been merged and subsequent confd checks commands show no issues. I believe there's nothing...
[18:47:16] <wikibugs>	 (PS1) Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255)
[18:47:54] <wikibugs>	 (CR) CI reject: [V:-1] Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: Andrea Denisse)
[18:49:29] <wikibugs>	 (PS4) Herron: pyrra: varnish: add cluster [puppet] - https://gerrit.wikimedia.org/r/1029654 (https://phabricator.wikimedia.org/T302995)
[18:51:38] <wikibugs>	 ops-eqiad, SRE, Cassandra: Reimage aqs1013 - https://phabricator.wikimedia.org/T364422#9784341 (Eevans)
[18:52:55] <wikibugs>	 (PS1) Eevans: Revert "Reimage aqs1013 w/o preserving data" [puppet] - https://gerrit.wikimedia.org/r/1029558
[18:53:38] <wikibugs>	 (PS2) Eevans: Revert "Reimage aqs1013 w/o preserving data" [puppet] - https://gerrit.wikimedia.org/r/1029558
[18:53:50] <wikibugs>	 (PS2) Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255)
[18:55:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:55:51] <wikibugs>	 (CR) Andrea Denisse: "Please review my patch if you can, thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: Andrea Denisse)
[19:00:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:03:48] <wikibugs>	 (PS1) JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395)
[19:04:09] <wikibugs>	 (CR) CI reject: [V:-1] postfix: chance acme chief cert order for Postfix [puppet] - https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[19:08:58] <denisse>	 !log Reset failed `pyrra-filesystem-notify-thanos.path`, and `reset-failed thanos-rule-reload.service` units on titan1001 
[19:08:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:01] <wikibugs>	 (PS2) JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395)
[19:09:15] <denisse>	 !log Restarting `pyrra-filesystem-notify-thanos.path`, and `reset-failed thanos-rule-reload.service` units on titan1001 
[19:09:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:27] <wikibugs>	 (CR) CI reject: [V:-1] postfix: chance acme chief cert order for Postfix [puppet] - https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[19:13:22] <wikibugs>	 (PS3) JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395)
[19:17:03] <wikibugs>	 (PS1) Zabe: Revert "Migrate to IReadableDatabase::newSelectQueryBuilder" [extensions/Flow] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029562 (https://phabricator.wikimedia.org/T312418)
[19:17:16] <wikibugs>	 (CR) Zabe: [C:+2] Revert "Migrate to IReadableDatabase::newSelectQueryBuilder" [extensions/Flow] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029562 (https://phabricator.wikimedia.org/T312418) (owner: Zabe)
[19:19:53] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudcontrol2001-dev.codfw.wmnet
[19:20:55] <MatmaRex>	 zabe: hi, are you planning to deploy? i was just talking about deploying that change with jeena
[19:21:48] <zabe>	 i don't really care who deploys, jeena: feel free to do it :)
[19:22:17] <wikibugs>	 (PS1) Andrew Bogott: Remove references to cloudcontrol2001-dev [puppet] - https://gerrit.wikimedia.org/r/1029676 (https://phabricator.wikimedia.org/T364577)
[19:22:20] <jeena>	 I can deploy if you like, was just waiting for the changes to merge to master as well
[19:23:10] <zabe>	 alright
[19:24:04] <wikibugs>	 ops-codfw, SRE, cloud-services-team, Infrastructure-Foundations, netops: Create (or teach Andrew how to create) private connections+dns entries for new cloudcontrols - https://phabricator.wikimedia.org/T364559#9784420 (Andrew) Reimaging cloudcontrol2006-dev works now, thanks!  Bonus points: I...
[19:25:21] <wikibugs>	 (PS1) JHathaway: postfix: bump postfix version [puppet] - https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395)
[19:25:34] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[19:25:36] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Remove references to cloudcontrol2001-dev [puppet] - https://gerrit.wikimedia.org/r/1029676 (https://phabricator.wikimedia.org/T364577) (owner: Andrew Bogott)
[19:26:20] <wikibugs>	 (Merged) jenkins-bot: Revert "Migrate to IReadableDatabase::newSelectQueryBuilder" [extensions/Flow] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029562 (https://phabricator.wikimedia.org/T312418) (owner: Zabe)
[19:26:25] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[19:28:28] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[19:29:02] <wikibugs>	 (PS2) JHathaway: postfix: bump postfix version [puppet] - https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395)
[19:29:35] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[19:29:35] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:29:36] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol2001-dev.codfw.wmnet
[19:29:38] <wikibugs>	 (PS4) JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395)
[19:29:55] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[19:34:49] <wikibugs>	 (PS3) JHathaway: postfix: bump postfix version [puppet] - https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395)
[19:35:01] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[19:41:33] <wikibugs>	 (PS4) JHathaway: postfix: bump postfix version [puppet] - https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395)
[19:41:57] <wikibugs>	 (PS5) JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395)
[19:42:03] <logmsgbot>	 !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:1029562|Revert "Migrate to IReadableDatabase::newSelectQueryBuilder" (T312418 T364499)]]
[19:42:08] <stashbot>	 T312418: Migrate usage of Database::select to SelectQueryBuilder in Flow - https://phabricator.wikimedia.org/T312418
[19:42:09] <stashbot>	 T364499: Flow\Exception\WikitextException: Conversion from 'wikitext' to 'topic-title-wikitext' was requested, but this is not supported. - https://phabricator.wikimedia.org/T364499
[19:43:49] <wikibugs>	 (CR) CDanis: [C:+1] cache:benthos: test for socket based activation in Benthos (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1029615 (https://phabricator.wikimedia.org/T364379) (owner: Fabfur)
[19:44:43] <logmsgbot>	 !log jhuneidi@deploy1002 jhuneidi and zabe: Backport for [[gerrit:1029562|Revert "Migrate to IReadableDatabase::newSelectQueryBuilder" (T312418 T364499)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[19:45:14] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[19:45:35] <jeena>	 MatmaRex: are there any checks you need to do before I sync?
[19:46:14] <MatmaRex>	 jeena: no
[19:46:30] <logmsgbot>	 !log jhuneidi@deploy1002 jhuneidi and zabe: Continuing with sync
[19:46:35] <jeena>	 thanks!
[19:46:35] <MatmaRex>	 the test plan is to see if the affected pages appear without errors now
[19:47:14] <wikibugs>	 (PS6) JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395)
[19:47:51] <wikibugs>	 (PS5) JHathaway: postfix: bump postfix version [puppet] - https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395)
[19:50:29] <wikibugs>	 (CR) Cwhite: gitlab: enable custom exporter on all instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[19:52:22] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[19:52:41] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1424 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:53:08] <wikibugs>	 (CR) Cwhite: prometheus::ops: scrape custom gitlab exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1029169 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[19:53:43] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1377 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:54:02] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse2012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:59:41] <logmsgbot>	 !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:1029562|Revert "Migrate to IReadableDatabase::newSelectQueryBuilder" (T312418 T364499)]] (duration: 17m 37s)
[19:59:48] <stashbot>	 T312418: Migrate usage of Database::select to SelectQueryBuilder in Flow - https://phabricator.wikimedia.org/T312418
[19:59:48] <stashbot>	 T364499: Flow\Exception\WikitextException: Conversion from 'wikitext' to 'topic-title-wikitext' was requested, but this is not supported. - https://phabricator.wikimedia.org/T364499
[19:59:50] <jeena>	 MatmaRex: done
[20:00:02] <MatmaRex>	 thanks
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240509T2000).
[20:00:05] <jouncebot>	 lucaswerkmeister: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:08] <lucaswerkmeister>	 o/
[20:00:39] <MatmaRex>	 things are working as expected for me
[20:00:41] <jeena>	 hi lucaswerkmeister, I need to delay the backport window a bit, since I need to deploy the train to all wikis
[20:00:45] <jeena>	 thanks MatmaRex!
[20:00:52] <lucaswerkmeister>	 ok, good luck with the train!
[20:01:01] <jeena>	 thank you, I'll ping you when done
[20:01:43] <jeena>	 MatmaRex: the errors are also going down now 👍
[20:02:22] <wikibugs>	 (PS1) TrainBranchBot: group2 wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1029704 (https://phabricator.wikimedia.org/T361398)
[20:02:28] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group2 wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1029704 (https://phabricator.wikimedia.org/T361398) (owner: TrainBranchBot)
[20:03:12] <wikibugs>	 (Merged) jenkins-bot: group2 wikis to 1.43.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1029704 (https://phabricator.wikimedia.org/T361398) (owner: TrainBranchBot)
[20:06:37] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service aqs1013-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:11:37] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:15:13] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:18:13] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.4  refs T361398
[20:18:17] <stashbot>	 T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398
[20:18:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:22:41] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1424 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:22:44] <jeena>	 lucaswerkmeister: I can backport your change now
[20:22:50] <lucaswerkmeister>	 okay! I’m ready to test it
[20:23:43] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1377 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:23:52] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jhuneidi@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029552 (https://phabricator.wikimedia.org/T364539) (owner: Lucas Werkmeister)
[20:23:59] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse2012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:24:58] <wikibugs>	 (PS1) Zabe: wikireplicas: Drop gu_salt from maintain-views [puppet] - https://gerrit.wikimedia.org/r/1029709 (https://phabricator.wikimedia.org/T364435)
[20:28:59] <wikibugs>	 (PS2) Zabe: wikireplicas: Drop gu_salt from maintain-views [puppet] - https://gerrit.wikimedia.org/r/1029709 (https://phabricator.wikimedia.org/T364435)
[20:33:49] <lucaswerkmeister>	 I was confused for a second why CI was taking so long and then I remembered this is a core patch and not a config change ;)
[20:37:04] <wikibugs>	 (PS1) Umherirrender: specials: Fix "include templates" query builder for Special:Export [core] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029564 (https://phabricator.wikimedia.org/T364554)
[20:37:27] <jeena>	 hehe
[20:39:13] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[20:43:15] <wikibugs>	 (PS6) JHathaway: postfix: bump postfix version [puppet] - https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395)
[20:43:24] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[20:47:04] <wikibugs>	 (Merged) jenkins-bot: Skin: Fix UrlUtils calls [core] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1029552 (https://phabricator.wikimedia.org/T364539) (owner: Lucas Werkmeister)
[20:47:13] <wikibugs>	 (CR) JHathaway: [C:+2] postfix: bump postfix version [puppet] - https://gerrit.wikimedia.org/r/1029677 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[20:47:22] <logmsgbot>	 !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:1029552|Skin: Fix UrlUtils calls (T364539)]]
[20:47:29] <stashbot>	 T364539: Protocol-relative URL in sidebar now interpreted as title (Query Service link in Wikidata sidebar broken) - https://phabricator.wikimedia.org/T364539
[20:48:37] <wikibugs>	 (PS7) JHathaway: postfix: chance acme chief cert order for Postfix [puppet] - https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395)
[20:49:17] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[20:49:51] <logmsgbot>	 !log jhuneidi@deploy1002 jhuneidi and lucaswerkmeister: Backport for [[gerrit:1029552|Skin: Fix UrlUtils calls (T364539)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:49:58] <lucaswerkmeister>	 testing…
[20:51:07] <lucaswerkmeister>	 hm, I’m not seeing it working quite yet
[20:51:14] <lucaswerkmeister>	 I wonder if some cache is involved
[20:51:35] <wikibugs>	 (CR) JHathaway: [C:+2] postfix: chance acme chief cert order for Postfix [puppet] - https://gerrit.wikimedia.org/r/1029670 (https://phabricator.wikimedia.org/T325395) (owner: JHathaway)
[20:54:44] <jeena>	 lucaswerkmeister: I'm not sure...what/how are we testing?
[20:55:05] <lucaswerkmeister>	 if you look at https://test.wikidata.org/wiki/Wikidata:Main_Page the “recent changes” link in the sidebar is broken
[20:55:10] <lucaswerkmeister>	 (goes to https://test.wikidata.org/wiki///test.wikidata.org/wiki/Special:RecentChanges)
[20:55:37] <lucaswerkmeister>	 if I read the config correctly, $wgEnableSidebarCache is enabled everywhere in prod, and $wgSidebarCacheExpiry defaults to 86400 seconds (one day)
[20:55:59] <lucaswerkmeister>	 so I think we can’t really test it in production
[20:56:06] <lucaswerkmeister>	 I did test it locally, FWIW ^^
[20:56:14] <jeena>	 oh I see
[20:56:29] <lucaswerkmeister>	 ($wgEnableSidebarCache defaults to false, so I wasn’t affected by that on my local wiki)
[20:56:49] <jeena>	 i wonder if there's a way to force the cache to expire? or I just continue to sync
[20:57:04] <lucaswerkmeister>	 hm
[20:57:10] <lucaswerkmeister>	 apparently the cache key includes the language code
[20:57:10] <lucaswerkmeister>	 let me see
[20:57:21] <lucaswerkmeister>	 yay, https://test.wikidata.org/wiki/Wikidata:Main_Page?uselang=aa shows a fixed link
[20:57:28] <jeena>	 okay, cool
[20:57:36] <lucaswerkmeister>	 (nobody’s had a reason to open test wikidata in that language in the past 24 hours, I guess ^^)
[20:57:49] <lucaswerkmeister>	 thanks for asking that question and making me look closer at the code ^^
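The caching behaviour discussed above (a sidebar cached per language for 86400 seconds, so a language nobody has requested recently shows the fixed output immediately) can be sketched roughly as follows. This is a toy in-memory model, not MediaWiki code: `SidebarCache`, `demo`, and the wiki identifier `testwikidatawiki` are hypothetical names used only for illustration, and the only value taken from the conversation is the 86400-second expiry.

```python
import time

# Default quoted in the conversation for $wgSidebarCacheExpiry (one day).
SIDEBAR_CACHE_EXPIRY = 86400


class SidebarCache:
    """Toy cache keyed by (wiki, language code). Hypothetical, not MediaWiki's implementation."""

    def __init__(self, expiry=SIDEBAR_CACHE_EXPIRY, clock=time.time):
        self.expiry = expiry
        self.clock = clock
        self.entries = {}  # (wiki, lang) -> (stored_at, html)

    def get(self, wiki, lang, build):
        """Return the cached sidebar for this wiki and language, rebuilding on miss or expiry."""
        key = (wiki, lang)
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and now - hit[0] < self.expiry:
            return hit[1]
        html = build()
        self.entries[key] = (now, html)
        return html


def demo():
    now = [0.0]  # fake clock so the example does not need to sleep
    cache = SidebarCache(clock=lambda: now[0])
    # Before the fix: the English sidebar is rendered and cached with the broken link.
    cache.get("testwikidatawiki", "en", lambda: "broken")
    # After the fix: English is still served from cache, while "aa" is a miss and rebuilds.
    en = cache.get("testwikidatawiki", "en", lambda: "fixed")
    aa = cache.get("testwikidatawiki", "aa", lambda: "fixed")
    # Once the expiry has elapsed, English is rebuilt as well.
    now[0] += SIDEBAR_CACHE_EXPIRY
    en_later = cache.get("testwikidatawiki", "en", lambda: "fixed")
    return en, aa, en_later
```

Under these assumptions, requesting the page with an unused `uselang` value acts as a cache miss, which is why it is a convenient way to verify a fix without waiting for existing entries to expire.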
[20:58:02] <jeena>	 thanks for the fix!
[20:58:16] <logmsgbot>	 !log jhuneidi@deploy1002 jhuneidi and lucaswerkmeister: Continuing with sync
[21:11:04] <logmsgbot>	 !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:1029552|Skin: Fix UrlUtils calls (T364539)]] (duration: 23m 42s)
[21:11:08] <stashbot>	 T364539: Protocol-relative URL in sidebar now interpreted as title (Query Service link in Wikidata sidebar broken) - https://phabricator.wikimedia.org/T364539
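The symptom described in T364539 (a protocol-relative URL in the sidebar being treated as a page title) can be illustrated with a small sketch. This is not the actual MediaWiki or UrlUtils code and does not claim to mirror the real fix: `buggy_href`, `fixed_href`, and `ARTICLE_PATH` are hypothetical names, and the `/wiki/` prefix is only assumed from the URLs quoted earlier in the conversation.

```python
# Assumed article path prefix, inferred from the URLs in the conversation.
ARTICLE_PATH = "/wiki/"


def buggy_href(target):
    """Only recognise targets with an explicit scheme; everything else is treated as a title."""
    if target.startswith(("http://", "https://")):
        return target
    return ARTICLE_PATH + target


def fixed_href(target):
    """Also recognise protocol-relative targets (leading '//') as URLs."""
    if target.startswith(("http://", "https://", "//")):
        return target
    return ARTICLE_PATH + target


link = "//test.wikidata.org/wiki/Special:RecentChanges"
print(buggy_href(link))  # /wiki///test.wikidata.org/wiki/Special:RecentChanges
print(fixed_href(link))  # //test.wikidata.org/wiki/Special:RecentChanges
```

The buggy variant reproduces the path shape of the broken link reported above; plain titles such as `Special:RecentChanges` are handled identically by both functions.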
[21:11:24] <lucaswerkmeister>	 \o/ thanks for deploying!
[21:11:40] <lucaswerkmeister>	 and thank you for taking care of the post-Hackathon train <3
[21:11:50] <jeena>	 you're welcome!
[21:15:38] <wikibugs>	 (PS1) Ryan Kemper: sre.kafka.roll-restart-reboot-brokers: fix usage [cookbooks] - https://gerrit.wikimedia.org/r/1029712
[21:18:55] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad
[21:30:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[21:35:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns
[21:41:56] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad
[21:42:50] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:45:00] <wikibugs>	 SRE, Observability-Alerting: prometheus-icinga-am.service Fails to Start on alert2001 - https://phabricator.wikimedia.org/T358838#9784740 (andrea.denisse) Hi @fgiunchedi , thanks for sharing your insights on this task. I'm taking a look at it again and I agree that repurposing this task to fix `prometheu...
[21:47:43] <ryankemper>	 !log [wdqs] Re-enabled puppet on `wdqs2023`
[21:47:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:57:50] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:02:31] <icinga-wm>	 PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[22:07:43] <wikibugs>	 (PS1) JHathaway: Revert "postfix: chance acme chief cert order for Postfix" [puppet] - https://gerrit.wikimedia.org/r/1029565
[22:07:49] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588 (xcollazo) NEW
[22:08:59] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9784779 (xcollazo)
[22:11:38] <wikibugs>	 (CR) JHathaway: [C:+2] Revert "postfix: chance acme chief cert order for Postfix" [puppet] - https://gerrit.wikimedia.org/r/1029565 (owner: JHathaway)
[22:12:33] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9784798 (xcollazo) @WDoranWMF kindly please confirm that you are my manager and that you approve of this request.
[22:12:45] <icinga-wm>	 RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 122.95 ms
[22:27:40] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2006']
[22:28:05] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-main2006']
[22:33:37] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9784803 (Jhancock.wm)
[22:36:37] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service aqs1013-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:55:13] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service aqs1013-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:55:29] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9784816 (Eevans)
[22:58:55] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9784822 (Eevans)
[23:01:24] <wikibugs>	 (PS1) Santiago Faci: mpic-next: New release for staging environment with some fixes: v0.0.9 [deployment-charts] - https://gerrit.wikimedia.org/r/1029725 (https://phabricator.wikimedia.org/T360734)
[23:01:54] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9784824 (Eevans) > [] - access request (or expansion) has sign off of group approver indicated by the approval field in data.yaml  @KOfori you are group approver for cassandra-st...
[23:03:13] <wikibugs>	 (CR) Santiago Faci: [C:+2] mpic-next: New release for staging environment with some fixes: v0.0.9 [deployment-charts] - https://gerrit.wikimedia.org/r/1029725 (https://phabricator.wikimedia.org/T360734) (owner: Santiago Faci)
[23:03:17] <wikibugs>	 (CR) Eevans: [C:+2] Revert "Reimage aqs1013 w/o preserving data" [puppet] - https://gerrit.wikimedia.org/r/1029558 (owner: Eevans)
[23:04:05] <wikibugs>	 (Merged) jenkins-bot: mpic-next: New release for staging environment with some fixes: v0.0.9 [deployment-charts] - https://gerrit.wikimedia.org/r/1029725 (https://phabricator.wikimedia.org/T360734) (owner: Santiago Faci)
[23:06:14] <logmsgbot>	 !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[23:06:26] <logmsgbot>	 !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[23:06:39] <logmsgbot>	 !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[23:20:57] <wikibugs>	 (PS1) Zabe: beta: Disable Graph [mediawiki-config] - https://gerrit.wikimedia.org/r/1029727
[23:21:49] <wikibugs>	 (CR) Zabe: [C:+2] beta: Disable Graph [mediawiki-config] - https://gerrit.wikimedia.org/r/1029727 (owner: Zabe)
[23:22:33] <wikibugs>	 (Merged) jenkins-bot: beta: Disable Graph [mediawiki-config] - https://gerrit.wikimedia.org/r/1029727 (owner: Zabe)
[23:38:43] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1028943
[23:38:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1028943 (owner: TrainBranchBot)
[23:50:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed