[00:09:28] (03PS3) 10Krinkle: mediawiki: Update httpbb tests for /static/current going away [puppet] - 10https://gerrit.wikimedia.org/r/778295 (https://phabricator.wikimedia.org/T302465) [00:11:00] (03CR) 10Krinkle: "Confirmed against prod:" [puppet] - 10https://gerrit.wikimedia.org/r/778295 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [00:11:32] PROBLEM - Check systemd state on elastic1053 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:11:42] PROBLEM - Check systemd state on elastic1071 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:11:44] PROBLEM - Check systemd state on elastic1054 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:24] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:18:16] RECOVERY - Check systemd state on elastic1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:18] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 108 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [00:25:12] RECOVERY - Check systemd state on elastic1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:26:36] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:41:00] PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:41:18] PROBLEM - Check systemd state on elastic1068 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:08] PROBLEM - Check systemd state on elastic1069 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:54] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:50:14] RECOVERY - Check systemd state on elastic1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:54] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:01:12] RECOVERY - Check systemd state on elastic1048 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:02:20] RECOVERY - Check systemd state on elastic1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:09:08] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [01:09:54] PROBLEM - Check systemd state on elastic1076 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:11:18] PROBLEM - Check systemd state on elastic1074 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:11:46] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:14:02] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:16:06] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:21:08] RECOVERY - Check systemd state on elastic1076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:30] RECOVERY - Check systemd state on elastic1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:36] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:30] PROBLEM - Check systemd state on elastic1078 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:41:30] PROBLEM - Check systemd state on elastic1075 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:42:50] PROBLEM - Check systemd state on elastic1056 is CRITICAL: CRITICAL - degraded: The 
following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:00] RECOVERY - Check systemd state on elastic1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:55:40] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989 (10RLazarus) @MoritzMuehlenhoff Checking in -- have you had any time to take a look at this? [01:59:00] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:07:34] RECOVERY - Check systemd state on elastic1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:10:42] RECOVERY - Check systemd state on elastic1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:15:50] PROBLEM - Check systemd state on elastic1050 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:16:30] PROBLEM - Check systemd state on elastic1077 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:16:38] PROBLEM - Check systemd state on elastic1055 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:23:18] RECOVERY - Check systemd state on elastic1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:25:52] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:27:54] RECOVERY - Check systemd state on elastic1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:29:18] RECOVERY - Check systemd state on elastic1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:29:48] PROBLEM - Check systemd state on elastic1079 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:23] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: security updates - bking@cumin1001 - T304938 [02:30:25] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:28] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:46:08] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:56:50] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:58:28] RECOVERY - Check systemd state on elastic1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:15:28] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:25:50] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:45:40] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:48:52] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 72 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [03:56:56] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:17:14] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:20:04] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 64 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:26:12] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:52] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 67 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:43:22] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 61 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:46:32] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:54:12] (03CR) 10Marostegui: [C: 03+2] dbtools: Port switchover-tmpl to python [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) (owner: 10Ladsgroup) [04:54:48] (03Merged) 10jenkins-bot: dbtools: Port switchover-tmpl 
to python [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) (owner: 10Ladsgroup) [04:57:48] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:40] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 72 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:10:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1169', diff saved to https://phabricator.wikimedia.org/P24272 and previous config saved to /var/cache/conftool/dbconfig/20220408-051044-root.json [05:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:48] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:19:01] (03PS1) 10Marostegui: db1169: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/778380 [05:19:39] (03CR) 10Marostegui: [C: 03+2] db1169: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/778380 (owner: 10Marostegui) [05:25:39] (03PS1) 10KartikMistry: Add SectionTranslation entry points as campaigns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778381 (https://phabricator.wikimedia.org/T298029) [05:35:32] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:45:04] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:02:19] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:10:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Debian changelog update [software/conftool] - 10https://gerrit.wikimedia.org/r/778293 (owner: 10Giuseppe Lavagetto) [06:12:28] (03Merged) 10jenkins-bot: Debian changelog update [software/conftool] - 10https://gerrit.wikimedia.org/r/778293 (owner: 10Giuseppe Lavagetto) [06:19:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P24273 and previous config saved to /var/cache/conftool/dbconfig/20220408-061922-root.json [06:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:36] (03PS1) 10Marostegui: Revert "db1169: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/778237 [06:21:30] (03CR) 10Marostegui: [C: 03+2] Revert "db1169: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/778237 (owner: 10Marostegui) [06:25:47] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P24274 and previous config saved to /var/cache/conftool/dbconfig/20220408-063426-root.json [06:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:05] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [06:38:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [06:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:19] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P24275 and previous config saved to /var/cache/conftool/dbconfig/20220408-064930-root.json [06:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:21] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:59:05] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 65 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220408T0700) [07:04:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P24276 and previous config saved to /var/cache/conftool/dbconfig/20220408-070434-root.json [07:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:29] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 71 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:10:01] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:12:40] !log depool cp6011 for reimage - T290005 [07:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:45] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [07:14:59] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6011 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778300 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [07:15:21] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:19:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P24277 and previous config saved to /var/cache/conftool/dbconfig/20220408-071938-root.json [07:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:02] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6011.drmrs.wmnet with OS buster [07:21:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:10] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6011.drmrs.wmnet with OS buster [07:21:47] !log depool cp6003 for reimage - T290005 [07:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:50] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [07:23:40] (03PS2) 10MMandere: site: Reimage cp6003 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778301 (https://phabricator.wikimedia.org/T290005) [07:24:34] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6003 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778301 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [07:26:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [07:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [07:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24278 and previous config saved to /var/cache/conftool/dbconfig/20220408-072615-ladsgroup.json [07:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:27:10] (03CR) 10Filippo Giunchedi: "Doh! 
Thank you for the fix" [puppet] - 10https://gerrit.wikimedia.org/r/778354 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite) [07:28:53] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6003.drmrs.wmnet with OS buster [07:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:01] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6003.drmrs.wmnet with OS buster [07:31:04] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1176.eqiad.wmnet with OS bullseye [07:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P24279 and previous config saved to /var/cache/conftool/dbconfig/20220408-073442-root.json [07:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:12] (03CR) 10JMeybohm: "Make sure to also include a checksum annotation for the new configmap into your deployment spec to have it restarted on config changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [07:36:46] (03PS1) 10Phedenskog: grafana: double-proxy for performance JSON meta data [puppet] - 10https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) [07:36:58] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2001.codfw.wmnet with OS bullseye [07:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:14] (03CR) 10Phedenskog: [C: 04-1] grafana: double-proxy for performance JSON meta data [puppet] - 10https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [07:37:49] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:38:31] (03CR) 10Phedenskog: [C: 04-1] grafana: double-proxy for performance JSON meta data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [07:39:10] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage [07:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:49] (03CR) 10JMeybohm: "Not 100% sure if it makes sense, but maybe you could fail rendering of the chart "if .Values.datahub-gms.tls.enabled && !.Values.global.da" [deployment-charts] - 10https://gerrit.wikimedia.org/r/778308 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [07:41:11] 10Puppet, 10SRE, 10Infrastructure-Foundations: Validate all yaml files in puppet.git - https://phabricator.wikimedia.org/T305676 (10fgiunchedi) [07:42:07] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage [07:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:09] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1176.eqiad.wmnet with reason: host reimage [07:42:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:14] (03CR) 10Phedenskog: [C: 04-1] "Hi Chris, I tried to follow your pattern as in https://gerrit.wikimedia.org/r/c/operations/puppet/+/547030 to add a new proxy for other da" [puppet] - 10https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [07:43:18] (03CR) 10JMeybohm: [C: 03+1] Change POD IPv4 subnet for ml-serve-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/778208 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [07:44:04] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10fgiunchedi) >>! In T299462#7839117, @Cmjohnson wrote: > @fgiunchedi Do you recall how the disks are supposed to be set up and I can fix All raid0 from the h... [07:45:26] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1176.eqiad.wmnet with reason: host reimage [07:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:46] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6003.drmrs.wmnet with reason: host reimage [07:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [07:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [07:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T305300)', diff saved to https://phabricator.wikimedia.org/P24280 and previous config saved to /var/cache/conftool/dbconfig/20220408-074723-ladsgroup.json [07:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:27] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [07:47:45] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:48:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T305300)', diff saved to https://phabricator.wikimedia.org/P24281 and previous config saved to /var/cache/conftool/dbconfig/20220408-074829-ladsgroup.json [07:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:07] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6003.drmrs.wmnet with reason: host reimage [07:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:59] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2001.codfw.wmnet with reason: host reimage [07:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:01] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1007:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [07:51:53] (03CR) 10Filippo Giunchedi: "+1 to the idea, 
although IMHO this should be a property of the cluster(s). That way operators don't have to remember/know which clusters n" [cookbooks] - 10https://gerrit.wikimedia.org/r/778325 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [07:54:26] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2001.codfw.wmnet with reason: host reimage [07:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:02] RECOVERY - dump of es4 in codfw on alert1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-04-07 09:54:28 (2917 GiB, +1.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [07:55:29] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: reprioritize dlq filter [puppet] - 10https://gerrit.wikimedia.org/r/778353 (https://phabricator.wikimedia.org/T305088) (owner: 10Cwhite) [07:59:59] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1176.eqiad.wmnet with OS bullseye [08:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:28] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2151.codfw.wmnet with OS bullseye [08:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P24282 and previous config saved to /var/cache/conftool/dbconfig/20220408-080335-ladsgroup.json [08:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:14] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:05] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2001.codfw.wmnet with OS bullseye [08:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:11] !log restart db1133 T299876 [08:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:14] T299876: Upgrade database backup sources and dbprov* hosts to Bullseye + MariaDB 10.4 - https://phabricator.wikimedia.org/T299876 [08:15:06] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2151.codfw.wmnet with reason: host reimage [08:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:45] (JobUnavailable) firing: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:17:58] (03CR) 10David Caro: add chunkeddriver.py.patch to wallaby (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [08:18:01] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2151.codfw.wmnet with reason: host reimage [08:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:27] (03PS1) 10TheDJ: Older browser do not return a promise from .play() [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/778238 (https://phabricator.wikimedia.org/T304705) [08:18:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1174', diff saved to https://phabricator.wikimedia.org/P24283 and previous config saved to /var/cache/conftool/dbconfig/20220408-081840-ladsgroup.json [08:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:46] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host dbprov1001.eqiad.wmnet with OS bullseye [08:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:52] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6011.drmrs.wmnet with OS buster [08:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:01] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6011.drmrs.wmnet with OS buster com... [08:26:06] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:26:34] !log pool cp6011 with HAProxy as TLS termination layer - T290005 [08:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:37] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:26:45] (JobUnavailable) resolved: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:29:45] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2102.codfw.wmnet with OS bullseye [08:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:15] (JobUnavailable) firing: (2) Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:33:42] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2151.codfw.wmnet with OS bullseye [08:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T305300)', diff saved to https://phabricator.wikimedia.org/P24284 and previous config saved to /var/cache/conftool/dbconfig/20220408-083345-ladsgroup.json [08:33:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [08:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [08:33:49] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [08:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T305300)', diff saved to https://phabricator.wikimedia.org/P24285 and previous config saved to 
/var/cache/conftool/dbconfig/20220408-083353-ladsgroup.json [08:33:55] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1001.eqiad.wmnet with reason: host reimage [08:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:06] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:08] (03CR) 10Lucas Werkmeister (WMDE): Use wgRestAPIAdditionalRouteFiles for WB REST API (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774901 (owner: 10Jakob) [08:36:47] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1001.eqiad.wmnet with reason: host reimage [08:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:43] (03PS3) 10Jakob: Use wgRestAPIAdditionalRouteFiles for WB REST API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774901 [08:37:45] (03CR) 10Jakob: Use wgRestAPIAdditionalRouteFiles for WB REST API (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774901 (owner: 10Jakob) [08:37:46] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6003.drmrs.wmnet with OS buster [08:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:56] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6003.drmrs.wmnet with OS buster com... [08:38:05] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:40:07] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2002.codfw.wmnet with OS bullseye [08:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:24] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2102.codfw.wmnet with reason: host reimage [08:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Should be okay to deploy on Monday." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/774901 (owner: 10Jakob) [08:41:55] !log pool cp6003 with HAProxy as TLS termination layer - T290005 [08:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:58] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:43:53] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2102.codfw.wmnet with reason: host reimage [08:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:03] !log depool cp6010 for reimage - T290005 [08:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:08] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:48:15] (JobUnavailable) resolved: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:48:36] ^that is me doing a reimage and will fix itself soon [08:48:48] ah, actually, it is the resolution [08:49:27] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov1001.eqiad.wmnet with OS bullseye [08:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:03] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6010 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778302 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [08:50:13] (03PS3) 10MMandere: site: Reimage cp6010 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778302 (https://phabricator.wikimedia.org/T290005) [08:50:45] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:53:20] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2002.codfw.wmnet with reason: host reimage [08:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:45] (JobUnavailable) firing: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:55:20] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:56:44] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2002.codfw.wmnet with reason: host reimage [08:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:05] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6010.drmrs.wmnet with OS buster [08:57:05] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2102.codfw.wmnet with OS bullseye [08:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:14] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - 
https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6010.drmrs.wmnet with OS buster [08:58:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24286 and previous config saved to /var/cache/conftool/dbconfig/20220408-085810-ladsgroup.json [08:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:16] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:59:45] (JobUnavailable) resolved: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:01:27] (03PS4) 10Btullis: Configure LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) [09:02:10] !log depool cp6002 for reimage - T290005 [09:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:14] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:04:35] (03CR) 10Btullis: Configure LDAP authentication for DataHub (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [09:05:50] (03PS2) 10MMandere: site: Reimage cp6002 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778303 (https://phabricator.wikimedia.org/T290005) [09:07:36] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [09:08:01] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6002 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778303 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [09:08:08] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2002.codfw.wmnet with OS bullseye [09:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:47] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host dbprov1002.eqiad.wmnet with OS bullseye [09:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T305300)', diff saved to https://phabricator.wikimedia.org/P24287 and previous config saved to /var/cache/conftool/dbconfig/20220408-090943-ladsgroup.json [09:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:48] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [09:13:15] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6010.drmrs.wmnet with reason: host reimage [09:13:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24288 and previous config saved to /var/cache/conftool/dbconfig/20220408-091315-ladsgroup.json [09:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:29] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6002.drmrs.wmnet with OS buster [09:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:39] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster [09:14:36] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:06] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6010.drmrs.wmnet with reason: host reimage [09:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:31] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2003.codfw.wmnet with OS bullseye [09:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:27] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1002.eqiad.wmnet with reason: host reimage [09:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:45] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1002.eqiad.wmnet with reason: host reimage [09:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P24289 and previous config saved to /var/cache/conftool/dbconfig/20220408-092448-ladsgroup.json [09:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:26] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:25:51] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS buster [09:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS buster [09:28:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24290 and previous config saved to /var/cache/conftool/dbconfig/20220408-092820-ladsgroup.json [09:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:33] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6002.drmrs.wmnet with reason: host reimage [09:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:41] !log jynus@cumin2002 
START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2003.codfw.wmnet with reason: host reimage [09:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:37] (03CR) 10jerkins-bot: [V: 04-1] Configure LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [09:32:55] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6002.drmrs.wmnet with reason: host reimage [09:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:04] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jgiannelos) [09:33:44] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jgiannelos) [09:35:47] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2003.codfw.wmnet with reason: host reimage [09:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:54] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov1002.eqiad.wmnet with OS bullseye [09:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P24291 and previous config saved to /var/cache/conftool/dbconfig/20220408-093953-ladsgroup.json [09:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24292 and previous config saved to /var/cache/conftool/dbconfig/20220408-094325-ladsgroup.json [09:43:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [09:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [09:43:29] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:53] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:30] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2003.codfw.wmnet with OS bullseye [09:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:08] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host dbprov1003.eqiad.wmnet with OS bullseye [09:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:41] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 64 probes of 664 (alerts on 65) - 
https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:54:03] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1006.eqiad.wmnet with OS buster [09:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS buster executed with errors... [09:54:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T305300)', diff saved to https://phabricator.wikimedia.org/P24293 and previous config saved to /var/cache/conftool/dbconfig/20220408-095458-ladsgroup.json [09:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:02] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [10:00:37] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1003.eqiad.wmnet with reason: host reimage [10:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:11] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1003.eqiad.wmnet with reason: host reimage [10:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:26] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6002.drmrs.wmnet with OS buster [10:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:35] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster com... [10:07:25] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6010.drmrs.wmnet with OS buster [10:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:34] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6010.drmrs.wmnet with OS buster com... 
[10:08:15] (03PS1) 10David Caro: Add debian 11 testing support [puppet] - 10https://gerrit.wikimedia.org/r/778477 [10:11:20] !log pool cp6010 with HAProxy as TLS termination layer - T290005 [10:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:23] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:12:44] (03CR) 10David Caro: "After (if) https://gerrit.wikimedia.org/r/c/operations/puppet/+/778477 gets in (enabling proper facts for bullseye for testing), this woul" [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [10:15:18] (03CR) 10David Caro: Add debian 11 testing support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778477 (owner: 10David Caro) [10:15:38] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov1003.eqiad.wmnet with OS bullseye [10:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:40] !log pool cp6002 with HAProxy as TLS termination layer - T290005 [10:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:43] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:24:36] (03PS1) 10Zabe: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) [10:24:38] (03PS1) 10Zabe: swift: remove absented stats_account cron [puppet] - 10https://gerrit.wikimedia.org/r/778486 (https://phabricator.wikimedia.org/T273673) [10:25:09] (03CR) 10jerkins-bot: [V: 04-1] swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:25:20] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:27] (03CR) 10jerkins-bot: [V: 04-1] swift: remove absented stats_account cron [puppet] - 10https://gerrit.wikimedia.org/r/778486 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:26:18] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:26:44] (03PS2) 10Zabe: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) [10:29:25] (03CR) 10Cathal Mooney: [C: 03+2] Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - 10https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) (owner: 10Cathal Mooney) [10:30:03] (03Merged) 10jenkins-bot: Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - 10https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) (owner: 10Cathal Mooney) [10:31:02] (03PS3) 10Zabe: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) [10:33:04] 10SRE-swift-storage, 10Patch-For-Review: Move swift crons to systemd timers - https://phabricator.wikimedia.org/T288806 (10Zabe) [10:34:52] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 74 probes of 664 (alerts on 65) - 
https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:38:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [10:38:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [10:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:18] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:04] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:17] (03PS4) 10Zabe: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) [10:58:56] (03PS5) 10Zabe: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) [10:59:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 62 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:03:34] ^^ not really sure what the cause of this was, been digging here, widely distributed geographically and across ASNs. [11:04:10] which points to something close to us, but not found any smoking gun. Don't expect there was much user impact. 
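(A rough sketch of one way to break the failing probes down by origin AS, since the alert only reports a total. It assumes the usual RIPE Atlas v2 API shape for ping results — one object per probe under /latest/ with prb_id and rcvd fields — and asn_v6 in the probe metadata; measurement 1790947 is the one linked in the alert above.)

    # Latest ping result per probe for the Atlas measurement in the alert;
    # keep probes that received zero replies, then count them per origin ASN.
    curl -s 'https://atlas.ripe.net/api/v2/measurements/1790947/latest/' \
      | jq -r '.[] | select(.rcvd == 0) | .prb_id' \
      | while read -r prb; do
          curl -s "https://atlas.ripe.net/api/v2/probes/${prb}/" | jq -r '"AS\(.asn_v6)"'
        done | sort | uniq -c | sort -rn

A flat spread of small counts across many ASNs, rather than one AS dominating, would match the "widely distributed geographically and across ASNs" observation above and point away from any single remote network.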
[11:04:27] (03PS6) 10Zabe: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) [11:08:44] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/34752/" [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:10:06] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 71 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:11:21] !log depool cp6009 for reimage - T290005 [11:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:25] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:16:24] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:46] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: introduce bullseye-wikimedia/component/prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/778488 (https://phabricator.wikimedia.org/T302178) [11:24:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [11:24:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [11:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [11:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [11:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:38] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:07] PROBLEM - exim queue #page on mx1001 is CRITICAL: CRITICAL: 18791 mails in exim queue. 
https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail [11:32:05] (03PS1) 10MMandere: site: Reimage cp6009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778490 (https://phabricator.wikimedia.org/T290005) [11:32:07] (03PS1) 10MMandere: site: Reimage cp6001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778491 (https://phabricator.wikimedia.org/T290005) [11:34:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1184', diff saved to https://phabricator.wikimedia.org/P24294 and previous config saved to /var/cache/conftool/dbconfig/20220408-113452-root.json [11:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 57 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:37:37] anyone looking at the mail queue? [11:38:03] <_joe_> Emperor: if I can get into mx1001, maybe [11:38:09] * Emperor is there [11:38:12] <_joe_> but looks like I can't [11:38:18] the entire queue is to wikimedia.org [11:38:34] 27453 of 29187 [11:38:44] _joe_: you can't get there network-wise? [11:38:57] looks like fr-tech's gone pop again [11:39:22] <_joe_> yep [11:39:32] <_joe_> topranks: did it on the second attempt [11:39:44] <_joe_> Emperor: I would tell them in fr-tech and nuke their emails [11:39:57] <_joe_> the ones in the queue I mean [11:40:07] hmm ok. I'm still digging around on the back of that RIPE atlas result, if you continue to have problems let me know. [11:40:31] <_joe_> topranks: I am indeed using ipv6 [11:40:33] just working on an exiqgrep rune [11:41:19] <_joe_> Emperor: cheers to becoming an eximmaster [11:42:13] _joe_: as root: exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm [11:42:16] look OK? [11:42:33] <_joe_> Emperor: let's tell them first, but yes it does [11:43:35] <_joe_> Emperor: i told them [11:43:37] <_joe_> go on [11:44:06] doing so; please hold [11:45:24] <_joe_> log your action here for posterity [11:45:42] <_joe_> also for twitter purposes [11:45:46] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:51] !log exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm on mx1001 [11:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:04] queue back to a modest 2337 mails [11:46:15] <_joe_> "modest" :P [11:46:45] <_joe_> still all their mails [11:46:59] I've acked the VO incident, it should resolve once the queue checker has another look [11:47:18] <_joe_> yep [11:47:46] modest> I do some volunteer sysadmin for a popular fandom non-profit, we send a couple of million emails a day, 2000 in the queue is nothing ;-) [11:48:03] Anyhow, back to that lunchtime walk I was about to do... [11:48:54] <_joe_> me too! [11:51:06] Same! 
After filtering for the v6 Atlas IPs that are still unreachable from eqiad LVS VIP, the pattern right now to those ASNs is no different than normal: [11:51:07] https://w.wiki/52jm [11:51:16] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1007:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [11:51:42] something is making us trip the limit, but there is a background level always, I can't see any signs of a general problem our side. [11:51:58] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 73 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:53:36] (03PS3) 10MMandere: site: Reimage cp6009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778304 (https://phabricator.wikimedia.org/T290005) [11:56:01] was fr tech notified- if not I will send an email? [11:56:21] (03CR) 10Hnowlan: [C: 03+1] "lgtm, I will deploy this on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/772453 (https://phabricator.wikimedia.org/T304984) (owner: 10Jgiannelos) [11:57:00] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:19] jynus: Joe left a message in #wikimedia-fundraising [11:57:26] ah, cool [11:58:22] (03PS1) 10Zabe: sslcert: migrate update-ocsp-all cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) [11:58:24] (03PS1) 10Zabe: sslcert: remove absented update-ocsp-all cron [puppet] - 10https://gerrit.wikimedia.org/r/778493 (https://phabricator.wikimedia.org/T273673) [11:58:55] (03CR) 10jerkins-bot: [V: 04-1] sslcert: migrate update-ocsp-all cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:59:12] (03CR) 10jerkins-bot: [V: 04-1] sslcert: remove absented update-ocsp-all cron [puppet] - 10https://gerrit.wikimedia.org/r/778493 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:02:07] (03PS2) 10Zabe: sslcert: migrate update-ocsp-all cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) [12:04:30] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 65 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:08:21] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778304 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [12:10:03] (03PS2) 10Zabe: sslcert: remove absented update-ocsp-all cron [puppet] - 10https://gerrit.wikimedia.org/r/778493 (https://phabricator.wikimedia.org/T273673) [12:10:13] (03PS2) 10Zabe: swift: remove absented stats_account cron [puppet] - 10https://gerrit.wikimedia.org/r/778486 (https://phabricator.wikimedia.org/T273673) [12:11:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:11:34] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24295 and previous config saved to /var/cache/conftool/dbconfig/20220408-121138-ladsgroup.json [12:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:42] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:11:45] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6009.drmrs.wmnet with OS buster [12:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:54] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6009.drmrs.wmnet with OS buster [12:15:45] !log depool cp6001 for reimage - T290005 [12:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:49] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:17:20] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:24] (03PS2) 10MMandere: site: Reimage cp6001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778305 (https://phabricator.wikimedia.org/T290005) [12:19:17] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778305 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [12:22:55] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6001.drmrs.wmnet with OS buster [12:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:04] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster [12:24:15] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 70 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:27:59] I can't help but notice that there are 17k messages in the queue to fr-tech again [12:29:41] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6009.drmrs.wmnet with reason: host reimage [12:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:09] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6009.drmrs.wmnet with reason: 
host reimage [12:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:55] which is well above the p.age threshold [12:35:50] (am poking on the fundraising channel) [12:39:43] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 61 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:40:44] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6001.drmrs.wmnet with reason: host reimage [12:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:50] I'm a bit worried about the disk space on mx1001 too, not just the raw queue numbers [12:43:34] (03Abandoned) 10MMandere: site: Reimage cp6009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778490 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [12:43:44] absent further input from #wikimedia-fundraising I think I should bin another pile of the fr-tech-failmail pileup. Could I get a +1 from someone before I do so, please? I'm a bit wary of just binning 22k emails on a whim... [12:44:14] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6001.drmrs.wmnet with reason: host reimage [12:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:20] (03Abandoned) 10MMandere: site: Reimage cp6001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778491 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [12:49:51] Emperor: there is a task for disk [12:50:37] Emperor: https://phabricator.wikimedia.org/T305567 [12:54:07] !log exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm on mx1001 (again) [12:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:26] (03PS2) 10JMeybohm: Move kubemaster1002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777327 (https://phabricator.wikimedia.org/T305435) [12:57:26] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kubemaster1002.eqiad.wmnet with reason: reimage [12:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:28] (03CR) 10JMeybohm: [C: 03+2] Move kubemaster1002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777327 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [12:57:28] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubemaster1002.eqiad.wmnet with reason: reimage [12:57:28] 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (10MatthewVernon) We've just had a repeat, and again exim mainlog is I think too big for logrotate to succeed with :-/ ` mvernon@mx1001:~$ ls -lsh /var/log/exim4/mainlog 3.9G -rw-r----- 1 Debian... 
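(For reference, a minimal sketch of the queue-cleanup pattern being !logged above, run as root on mx1001. The recipient address is taken from the log entries; the exiqgrep flags are standard — -r matches on recipient, -c counts, -i prints only message IDs for exim -Mrm to delete. Inspect first, then remove.)

    # How big is the problem, and is it really all addressed to the failmail alias?
    exim -bpc
    exiqgrep -c -r fr-tech-failmail@wikimedia.org

    # Remove the matching messages: -i emits just the queue IDs, exim -Mrm deletes them.
    exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm

    # Confirm the queue has dropped back under the paging threshold
    # (the exim queue check above recovers below 2000 messages).
    exim -bpc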
[12:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:28] !log gmodena@deploy1002 Started deploy [airflow-dags/research@b029f10]: (no justification provided) [12:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:32] (03PS1) 10Zabe: admin: fix a few indentation issues [puppet] - 10https://gerrit.wikimedia.org/r/778498 [12:59:11] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:59:15] !log exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm on mx1001 (again again) [12:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:15] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:00:31] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:00:40] !log gmodena@deploy1002 Finished deploy [airflow-dags/research@b029f10]: (no justification provided) (duration: 02m 11s) [13:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:00] RECOVERY - exim queue #page on mx1001 is OK: OK: Less than 2000 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail [13:03:19] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:08:25] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:08:41] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:09:11] (03PS2) 10JMeybohm: Move kubemaster1001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777328 (https://phabricator.wikimedia.org/T305435) [13:09:43] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:24] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6009.drmrs.wmnet with OS buster [13:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:33] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6009.drmrs.wmnet with OS buster com... [13:13:17] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6001.drmrs.wmnet with OS buster [13:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:26] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster com... 
[13:13:53] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 [13:14:07] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:14:09] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 73 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:15:29] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:01] !log pool cp6009 with HAProxy as TLS termination layer - T290005 [13:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:06] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [13:16:07] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:18:03] !log exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm on mx1001 (again again again; keeping queue below the p.age threshold while fr-tech work) [13:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:13] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778505 (https://phabricator.wikimedia.org/T292968) (owner: 10WMDE-Fisch) [13:18:33] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:20:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P24296 and previous config saved to /var/cache/conftool/dbconfig/20220408-132024-root.json [13:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:54] !log pool cp6001 with HAProxy as TLS termination layer - T290005 [13:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:09] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kubemaster1001.eqiad.wmnet with reason: reimage [13:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:11] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubemaster1001.eqiad.wmnet with reason: reimage [13:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:19] (03CR) 10JMeybohm: [C: 03+2] Move kubemaster1001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777328 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [13:26:16] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:26:52] (03CR) 10Awight: [C: 03+2] [beta] Enable colorblind-friendly color scheme on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778505 (https://phabricator.wikimedia.org/T292968) (owner: 10WMDE-Fisch) [13:28:11] (03Merged) 10jenkins-bot: [beta] Enable colorblind-friendly color scheme on beta cluster [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/778505 (https://phabricator.wikimedia.org/T292968) (owner: 10WMDE-Fisch) [13:29:54] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:30:19] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2008.codfw.wmnet with OS bullseye [13:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:40] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 64 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:30:44] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:31:14] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:34:52] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:35:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P24297 and previous config saved to /var/cache/conftool/dbconfig/20220408-133528-root.json [13:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:35:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:00] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:37:00] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1008.eqiad.wmnet with OS bullseye [13:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:02] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:37:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24298 and previous config saved to /var/cache/conftool/dbconfig/20220408-133715-ladsgroup.json [13:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:37:29] (03PS1) 10Giuseppe Lavagetto: requestctl: bugfixes [software/conftool] - 10https://gerrit.wikimedia.org/r/778509 [13:37:31] (03PS1) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/778510 [13:37:58] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:38:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: bugfixes 
[software/conftool] - 10https://gerrit.wikimedia.org/r/778509 (owner: 10Giuseppe Lavagetto) [13:39:22] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 71 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:39:50] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:59] (03Merged) 10jenkins-bot: requestctl: bugfixes [software/conftool] - 10https://gerrit.wikimedia.org/r/778509 (owner: 10Giuseppe Lavagetto) [13:40:12] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:40:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/778510 (owner: 10Giuseppe Lavagetto) [13:41:02] 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (10jhathaway) ok, thanks, I'll rotate it manually and plan on embiggening the existing hosts. [13:42:20] (03Merged) 10jenkins-bot: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/778510 (owner: 10Giuseppe Lavagetto) [13:43:08] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:43:36] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2008.codfw.wmnet with reason: host reimage [13:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:56] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:46:26] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2008.codfw.wmnet with reason: host reimage [13:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:59] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1008.eqiad.wmnet with reason: host reimage [13:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:32] !log exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm on mx1001 (again again again again; keeping queue below the p.age threshold while fr-tech work) [13:50:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P24299 and previous config saved to /var/cache/conftool/dbconfig/20220408-135032-root.json [13:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:20] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1008.eqiad.wmnet with reason: host reimage [13:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24300 and previous config saved to /var/cache/conftool/dbconfig/20220408-135220-ladsgroup.json [13:52:21] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:07] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:57:43] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2008.codfw.wmnet with OS bullseye [13:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:35] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:02:44] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1008.eqiad.wmnet with OS bullseye [14:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P24302 and previous config saved to /var/cache/conftool/dbconfig/20220408-140536-root.json [14:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:37] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:07:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24303 and previous config saved to /var/cache/conftool/dbconfig/20220408-140725-ladsgroup.json [14:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:13] (03CR) 10Herron: sre.kafka.reboot-workers: add --skip-mirrormaker option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/778325 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [14:08:49] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:23] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:05] (03PS1) 10Herron: sre.kafka.reboot-workers: remove systemctl stop calls [cookbooks] - 10https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) [14:19:57] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:19:59] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P24304 and previous config saved to /var/cache/conftool/dbconfig/20220408-142041-root.json [14:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:13] (03PS1) 10Jcrespo: install_server: Disable wiping of backup[12]008 after bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/778518 (https://phabricator.wikimedia.org/T305446) [14:21:44] !log exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm on mx1001 (again again again again; keeping queue below the p.age threshold while fr-tech work) [14:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24305 and previous config saved to /var/cache/conftool/dbconfig/20220408-142230-ladsgroup.json [14:22:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:22:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24306 and previous config saved to /var/cache/conftool/dbconfig/20220408-142239-ladsgroup.json [14:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:51] (03CR) 10Jcrespo: [C: 03+2] install_server: Disable wiping of backup[12]008 after bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/778518 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [14:26:12] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:54] (03CR) 10Herron: "This is a simple alternate approach to I2d96ceee31ac3f1029bc45395cebe4dac47d5a4a" [cookbooks] - 10https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [14:27:04] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:27:38] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:28:52] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:31:01] (BlazegraphJvmQuakeWarnGC) resolved: Blazegraph instance wdqs1007:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [14:31:58] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:34:28] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:34:48] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:35:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P24307 and previous config saved to /var/cache/conftool/dbconfig/20220408-143545-root.json [14:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:58] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:38:34] (03CR) 10Cwhite: [C: 03+2] logstash: reprioritize dlq filter [puppet] - 10https://gerrit.wikimedia.org/r/778353 (https://phabricator.wikimedia.org/T305088) (owner: 10Cwhite) [14:39:32] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:39:50] 10SRE, 10Prod-Kubernetes, 10Traffic, 10serviceops, 10Kubernetes: service:.catalog entries and dnsdisc for Kubernetes sevrices under Ingress - https://phabricator.wikimedia.org/T305358 (10JMeybohm) [14:40:20] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:40:48] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:41:12] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:44:22] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:45:02] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:40] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:46:52] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:57:56] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:16] RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:48] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:08:18] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:10:02] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:11:26] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:14:34] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:16:06] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:48] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:21:34] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:00] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:25:06] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:05] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34755/console" [puppet] - 10https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [15:28:22] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:20] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:30:28] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:30:29] (03CR) 10Andrew Bogott: add chunkeddriver.py.patch to wallaby (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [15:31:08] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:31:54] RECOVERY - Host logstash2024 is UP: PING OK - Packet loss = 0%, RTA = 31.79 ms [15:32:04] PROBLEM - Check systemd state on logstash2024 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:28] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [15:34:08] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:39:32] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:40:10] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:40:56] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:41:06] RECOVERY - Check systemd state on logstash2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:26] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:44:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24308 and previous config saved to /var/cache/conftool/dbconfig/20220408-154414-ladsgroup.json [15:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:45:28] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:46] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:12] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:51:11] (03PS1) 10Vgutierrez: requestctl: Fix VCL_acl matching on VSL expressions [software/conftool] - 10https://gerrit.wikimedia.org/r/778536 [15:51:17] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:20] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:39] !log dancy@deploy1002 Started scap: (no justification provided) [15:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:02] !log dancy@deploy1002: Testing mw container image build [15:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:24] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:40] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:58:50] (03PS1) 
10Giuseppe Lavagetto: Requestctl VCL/VSL fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/778537 [15:59:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24309 and previous config saved to /var/cache/conftool/dbconfig/20220408-155919-ladsgroup.json [15:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: Fix VCL_acl matching on VSL expressions [software/conftool] - 10https://gerrit.wikimedia.org/r/778536 (owner: 10Vgutierrez) [16:01:39] (03Merged) 10jenkins-bot: requestctl: Fix VCL_acl matching on VSL expressions [software/conftool] - 10https://gerrit.wikimedia.org/r/778536 (owner: 10Vgutierrez) [16:02:42] (03PS2) 10Giuseppe Lavagetto: Requestctl VCL translation fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/778537 [16:05:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Requestctl VCL translation fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/778537 (owner: 10Giuseppe Lavagetto) [16:05:32] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:05:55] (03PS1) 10Majavah: hieradata: add new eqiad1 enc servers [puppet] - 10https://gerrit.wikimedia.org/r/778538 (https://phabricator.wikimedia.org/T295247) [16:06:51] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34756/console" [puppet] - 10https://gerrit.wikimedia.org/r/778538 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [16:06:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:06:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:48] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:08:51] (03CR) 10Vivian Rook: add chunkeddriver.py.patch to wallaby (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [16:11:38] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:11:38] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:13:18] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:52] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:14:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24310 and previous config saved to /var/cache/conftool/dbconfig/20220408-161425-ladsgroup.json [16:14:32] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:32] <_joe_> jayme: can you check what went wrong with deploy_to_mwdebug ? [16:15:32] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:16:08] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:17:48] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:19:12] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:21:19] _joe_ I'll try the helmfile apply under my account to see what happens. [16:21:33] <_joe_> dancy: thanks! [16:21:45] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:21:48] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:21:57] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:22:01] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:22:13] hmm.. 
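(Side note while deploy_to_mwdebug.service is being looked at: a quick sketch of the standard systemd triage on deploy1002, using only the unit name from the alert. That the unit is run periodically is an assumption, based on the check flapping and recovering on its own.)

    # Last run result and exit status of the failed unit, then its recent log output.
    systemctl status deploy_to_mwdebug.service
    journalctl -u deploy_to_mwdebug.service -n 100 --no-pager

    # Once the underlying failure is understood/fixed, clear the degraded state so the
    # systemd-state check recovers without waiting for the next successful run.
    systemctl reset-failed deploy_to_mwdebug.service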
[16:22:54] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:22:58] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:23:48] (03CR) 10Krinkle: [C: 03+1] deployment-prep: re-point to new bullseye hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774518 (https://phabricator.wikimedia.org/T301638) (owner: 10Dave Pifke) [16:24:16] (03PS1) 10Giuseppe Lavagetto: varnish: add new-version dynamic request filter template [puppet] - 10https://gerrit.wikimedia.org/r/778543 (https://phabricator.wikimedia.org/T305606) [16:24:18] (03PS1) 10Giuseppe Lavagetto: varnish: switch to using new-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778544 (https://phabricator.wikimedia.org/T305606) [16:24:20] (03PS1) 10Giuseppe Lavagetto: varnish: remove old-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778545 (https://phabricator.wikimedia.org/T305606) [16:24:22] (03PS1) 10Giuseppe Lavagetto: varnish: remove absented resource [puppet] - 10https://gerrit.wikimedia.org/r/778546 (https://phabricator.wikimedia.org/T305606) [16:24:50] (03CR) 10jerkins-bot: [V: 04-1] varnish: add new-version dynamic request filter template [puppet] - 10https://gerrit.wikimedia.org/r/778543 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [16:25:26] (03CR) 10jerkins-bot: [V: 04-1] varnish: remove old-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778545 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [16:25:54] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:03] (03CR) 10jerkins-bot: [V: 04-1] varnish: remove absented resource [puppet] - 10https://gerrit.wikimedia.org/r/778546 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [16:27:32] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:27:32] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:29:12] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:29:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24311 and previous config saved to /var/cache/conftool/dbconfig/20220408-162930-ladsgroup.json [16:29:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [16:29:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [16:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24312 and previous config saved to /var/cache/conftool/dbconfig/20220408-162938-ladsgroup.json [16:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:50] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:29:50] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:32:50] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:34:52] (03PS1) 10Majavah: openstack: remove horizon access to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/778551 [16:37:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: add new eqiad1 enc servers [puppet] - 10https://gerrit.wikimedia.org/r/778538 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [16:38:02] (03PS2) 10Majavah: openstack: remove horizon access to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/778551 [16:38:04] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:39:22] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:39:38] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34758/console" [puppet] - 10https://gerrit.wikimedia.org/r/778551 (owner: 10Majavah) [16:41:30] (03PS3) 10Majavah: openstack: remove horizon access to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/778551 [16:41:32] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:43:00] 
PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:43:08] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34759/console" [puppet] - 10https://gerrit.wikimedia.org/r/778551 (owner: 10Majavah) [16:45:44] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:28] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:49:04] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:51:22] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:52:55] (03PS2) 10Giuseppe Lavagetto: varnish: add new-version dynamic request filter template [puppet] - 10https://gerrit.wikimedia.org/r/778543 (https://phabricator.wikimedia.org/T305606) [16:52:57] (03PS2) 10Giuseppe Lavagetto: varnish: switch to using new-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778544 (https://phabricator.wikimedia.org/T305606) [16:52:59] (03PS2) 10Giuseppe Lavagetto: varnish: remove old-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778545 (https://phabricator.wikimedia.org/T305606) [16:53:01] (03PS2) 10Giuseppe Lavagetto: varnish: remove absented resource [puppet] - 10https://gerrit.wikimedia.org/r/778546 (https://phabricator.wikimedia.org/T305606) [16:53:08] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:54:46] (03CR) 10Herron: "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [16:55:00] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:55:24] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:56:32] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:57:04] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:04] (03PS12) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [17:00:43] (03PS13) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [17:01:06] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:01:50] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:06:43] (03PS2) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 [17:07:35] (03CR) 10jerkins-bot: [V: 04-1] prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 (owner: 10Arturo Borrero Gonzalez) [17:07:56] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:08:42] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:09:42] (03PS1) 10Cathal Mooney: Increase threshold for IPv6 RIPE Atlas probes from 65 to 90 [puppet] - 10https://gerrit.wikimedia.org/r/778567 (https://phabricator.wikimedia.org/T305703) [17:10:12] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:10:58] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:11:10] (03PS1) 10Giuseppe Lavagetto: mwdebug_deploy: switch back to using the root user [puppet] - 10https://gerrit.wikimedia.org/r/778568 (https://phabricator.wikimedia.org/T305729) [17:12:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug_deploy: switch back to using the root user [puppet] - 10https://gerrit.wikimedia.org/r/778568 (https://phabricator.wikimedia.org/T305729) (owner: 10Giuseppe Lavagetto) [17:15:18] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:41] (03PS3) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 [17:16:26] (03CR) 10Cathal Mooney: [C: 03+2] Increase threshold for IPv6 RIPE Atlas probes from 65 to 90 [puppet] - 10https://gerrit.wikimedia.org/r/778567 (https://phabricator.wikimedia.org/T305703) (owner: 10Cathal Mooney) [17:18:01] (03PS1) 10Giuseppe Lavagetto: mwdebug_deploy: also re-add docker credentials [puppet] - 10https://gerrit.wikimedia.org/r/778571 [17:22:16] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:23:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10cmooney) FYI I believe PXE is failing for dumpsdata1006 as the DAC cable is plugged into the second NIC port on the server side. 
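The change just merged above raises the IPv6 RIPE Atlas probe threshold from 65 to 90, and the recovery a few lines below reports "failed 69 probes of 665 (alerts on 90)": 69 failing probes tripped the old limit but sits under the new one. A minimal sketch of that thresholding, assuming a simple greater-or-equal comparison rather than the actual check implementation:

```python
# Minimal sketch of the thresholding behind the "IPv6 ping to eqiad" alert,
# not the actual check implementation: with 69 of 665 probes failing, the old
# critical threshold of 65 fires while the new threshold of 90 does not.
def atlas_status(failed: int, total: int, critical: int) -> str:
    """Return an Icinga-style status line for a RIPE Atlas measurement."""
    state = "CRITICAL" if failed >= critical else "OK"
    return f"{state} - failed {failed} probes of {total} (alerts on {critical})"


print(atlas_status(69, 665, critical=65))  # old threshold -> CRITICAL
print(atlas_status(69, 665, critical=90))  # new threshold -> OK
```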
[17:23:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug_deploy: also re-add docker credentials [puppet] - 10https://gerrit.wikimedia.org/r/778571 (owner: 10Giuseppe Lavagetto) [17:24:28] (03PS1) 10Majavah: hieradata: use puppet-enc hostname in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/778574 (https://phabricator.wikimedia.org/T295247) [17:24:30] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 69 probes of 665 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:24:35] (03PS4) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) [17:26:42] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34760/console" [puppet] - 10https://gerrit.wikimedia.org/r/778574 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [17:31:44] (03PS1) 10Giuseppe Lavagetto: mwdebug_deploy: fix resource type [puppet] - 10https://gerrit.wikimedia.org/r/778577 [17:32:19] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mwdebug_deploy: fix resource type [puppet] - 10https://gerrit.wikimedia.org/r/778577 (owner: 10Giuseppe Lavagetto) [17:34:53] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:56] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:07] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:11] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:26] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:18] (03CR) 10Cwhite: "Change looks generally correct to me." [puppet] - 10https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [17:44:59] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10wiki_willy) a:03RobH Hi @ssingh - since this server is out of warranty and due to be refreshed in a few quarters, do you still want us to purchase a replacement DIMM to keep it up and r... [17:45:14] (03CR) 10CDanis: [C: 03+1] "Looks good. I think debian-glue is unhappy because there's a debian/changelog edit in the same patch as an 'upstream' code edit? But wha" [software/conftool] - 10https://gerrit.wikimedia.org/r/778537 (owner: 10Giuseppe Lavagetto) [17:48:13] (03CR) 10CDanis: [C: 03+1] requestctl: Fix VCL_acl matching on VSL expressions [software/conftool] - 10https://gerrit.wikimedia.org/r/778536 (owner: 10Vgutierrez) [17:48:20] * James_F grumbles at getting a hundred "page moved" e-mails from wikitech triggered by Krinkle again. :-P [17:50:00] I enabled flood flag [17:50:09] After the first 80? [17:50:13] no, from the start [17:50:22] Well I got the ENOTIF e-mails anyway. 
[17:50:45] Currently working its way through the Talk: ones having done the main set? [17:51:18] indeed, it's one big batch [17:51:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24313 and previous config saved to /var/cache/conftool/dbconfig/20220408-175120-ladsgroup.json [17:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:51:41] https://wikitech.wikimedia.org/wiki/Special:Recentchanges?hidehumans=1&limit=100&days=14&urlversion=2 [17:51:56] https://wikitech.wikimedia.org/wiki/Special:Recentchanges?hidebots=1&limit=100&days=14&urlversion=2 [17:52:00] I don't need to see it on RC, I've got it in GMail. :-P [17:52:01] none without bot show up there [17:52:10] yeah, maybe a bug, probably a controversial one [17:52:17] Meh. [17:52:19] unless it's recent, in which case it's a regression [17:53:16] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:53:18] I don't immediately see anything in Phabricator. Will file a task. [17:54:16] you're also creating double redirects btw [17:54:19] I see things about "hide bots" in the watchlist prefs but enotif isn't actually from watchlist technically. Echo prefs don't cover it. The general "email me when ..." is boolean on/off with an extra box for ".. also for minor edits" [17:54:28] AntiComposite: I know, I'm queuing that up next [17:54:50] but deciding whether to run it normally or not, since the script is kinda broken (doesn't use system user correctly) [17:54:56] I might run it with pywiki instead [17:55:28] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:55:52] Filed as T305734 FWIW. 
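On the double redirects left behind by the bulk page moves: since the usual maintenance script reportedly doesn't attribute its edits to the system user correctly, Krinkle mentions running the cleanup with pywikibot instead, which ships a ready-made script for this (`python pwb.py redirect double`). A minimal sketch of the same cleanup follows, assuming the target wiki and credentials are configured in user-config.py and that the account holds the bot/flood flag.

```python
#!/usr/bin/env python3
"""Hedged sketch of cleaning up double redirects with pywikibot.

Pywikibot's own `redirect.py` script does this job; the loop below just
spells out the same idea. Assumptions: the target wiki and credentials
come from user-config.py, the account has the bot/flood flag, and
site.double_redirects() (which wraps Special:DoubleRedirects) is used to
find candidates.
"""
import pywikibot

site = pywikibot.Site()

for page in site.double_redirects(total=500):
    first_hop = page.getRedirectTarget()
    if not first_hop.isRedirectPage():
        continue  # already fixed
    final = first_hop.getRedirectTarget()
    if final.isRedirectPage():
        continue  # chain longer than two hops; skip rather than risk a loop
    page.text = f"#REDIRECT [[{final.title()}]]"
    page.save(summary=f"Fixing double redirect to [[{final.title()}]]", minor=True)
```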
[17:55:52] T305734: Page move notification e-mails sent for watchlisted pages even when actor has the flood right - https://phabricator.wikimedia.org/T305734 [17:56:16] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:52] (03CR) 10Cwhite: sre.kafka.reboot-workers: remove systemctl stop calls (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [17:57:56] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:00:10] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:02:27] (03CR) 10Cwhite: sre: add alerts for exporter-specific unavailability (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/778259 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [18:06:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24315 and previous config saved to /var/cache/conftool/dbconfig/20220408-180625-ladsgroup.json [18:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:16:40] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:40] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.8387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:19:11] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.1248 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [18:21:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24316 and previous config saved to /var/cache/conftool/dbconfig/20220408-182130-ladsgroup.json [18:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [18:25:42] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:25:57] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5785 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [18:27:38] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:29:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [18:36:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24317 and previous config saved to /var/cache/conftool/dbconfig/20220408-183635-ladsgroup.json [18:36:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [18:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [18:36:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:36:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24318 and previous config saved to /var/cache/conftool/dbconfig/20220408-183643-ladsgroup.json [18:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:49] !log gitlab1001 - giving myself gitlab admin rights via rake console, to be able to connect/disconnect runners T297659 [18:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:52] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 [18:46:02] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:35] ^ always 1308 and no other [18:50:06] I noticed that very frequent [18:51:45] ipmi_sdr_cache_create: internal IPMI error [18:51:57] internal IPMI error .. 
I translate that to "broken DRAC" [18:52:02] or "maybe reset fixes it" [18:53:00] i'll try the "soft" DRAC reset [18:53:06] if it needs hard reset then it needs dcops [18:55:16] well, can't connect to DRAC in the first place to reset it.. so broken DRAC it is.. will make ticket [18:57:18] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:42] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 30.1 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:01:00] 10SRE, 10ops-eqiad: mw1308 - internal IPMI error - mgmt / DRAC problem - https://phabricator.wikimedia.org/T305741 (10Dzahn) [19:03:21] 10SRE, 10ops-eqiad: mw1308 - internal IPMI error - mgmt / DRAC problem - https://phabricator.wikimedia.org/T305741 (10Dzahn) p:05Triage→03Medium [19:15:28] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:34] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:21:36] PROBLEM - MariaDB Replica Lag: s4 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1286.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:22:48] PROBLEM - MariaDB Replica Lag: s8 on db1171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1353.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:23:12] PROBLEM - MariaDB Replica Lag: s1 on db1139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1377.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:23:18] PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1383.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:25:30] ACKNOWLEDGEMENT - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service daniel_zahn https://phabricator.wikimedia.org/T305741 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:44] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:55] downtime for a month [19:32:54] RECOVERY - MariaDB Replica Lag: s4 on db2139 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:33:14] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:56:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24319 and previous config saved to /var/cache/conftool/dbconfig/20220408-195614-ladsgroup.json [19:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:19] T298565: Fix mismatching field type of user table for columns 
user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:57:02] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (10wiki_willy) Hi @RobH - just followingup to see if they ever sent the DIMM for this. Thanks, Willy [20:11:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24320 and previous config saved to /var/cache/conftool/dbconfig/20220408-201119-ladsgroup.json [20:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:36] (03CR) 10Legoktm: [C: 03+1] "One small nit, otherwise LGTM. I reviewed the script in core, it's a simple select plus deletes that are guarded by a LIMIT of $wgUpdateRo" [puppet] - 10https://gerrit.wikimedia.org/r/776349 (https://phabricator.wikimedia.org/T257473) (owner: 10Zabe) [20:26:08] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:26:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24321 and previous config saved to /var/cache/conftool/dbconfig/20220408-202624-ladsgroup.json [20:26:26] (03CR) 10Dzahn: [C: 03+1] "tested and they pass the tests on mwdebug1001:" [puppet] - 10https://gerrit.wikimedia.org/r/778295 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [20:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:23] (03CR) 10Dzahn: [C: 03+2] phabricator: allow disabling ssh-phab service except on one host [puppet] - 10https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [20:35:29] uhm.. 
unexpected puppet behaviour on the passive phab host :p [20:35:53] might cause an alert in a moment..but on it [20:36:59] (03PS1) 10Dzahn: Revert "phabricator: allow disabling ssh-phab service except on one host" [puppet] - 10https://gerrit.wikimedia.org/r/778242 [20:38:42] (03CR) 10Dzahn: [C: 03+2] Revert "phabricator: allow disabling ssh-phab service except on one host" [puppet] - 10https://gerrit.wikimedia.org/r/778242 (owner: 10Dzahn) [20:40:18] PROBLEM - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:41:22] that is the one I expected but change is already reverted, this just lags a bit [20:41:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24322 and previous config saved to /var/cache/conftool/dbconfig/20220408-204129-ladsgroup.json [20:41:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [20:41:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [20:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24323 and previous config saved to /var/cache/conftool/dbconfig/20220408-204138-ladsgroup.json [20:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:15] (03PS1) 10Dzahn: Revert "Revert "phabricator: allow disabling ssh-phab service except on one host"" [puppet] - 10https://gerrit.wikimedia.org/r/778243 [20:44:14] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:44:54] RECOVERY - PyBal backends health check on lvs2008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:45:11] (03CR) 10Dzahn: "compiler shows it as "present" on phab2001 but in reality it removes the ressources !?" 
[puppet] - 10https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [20:49:00] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:52:04] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:13:39] (03PS2) 10Legoktm: Revert "Cache Badtitle 400s for 60s in varnish-fe" [puppet] - 10https://gerrit.wikimedia.org/r/769827 [21:13:53] (03CR) 10Legoktm: "Ping :)" [puppet] - 10https://gerrit.wikimedia.org/r/769827 (owner: 10Legoktm) [21:26:28] (03CR) 10Krinkle: [C: 03+1] Revert "Cache Badtitle 400s for 60s in varnish-fe" [puppet] - 10https://gerrit.wikimedia.org/r/769827 (owner: 10Legoktm) [21:35:29] (03Abandoned) 10Jdlrobson: Convert performanceNow datatype to Integer in QuickSurvey Initiation in order to resolve data type mismatch in schema. [extensions/QuickSurveys] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/777775 (https://phabricator.wikimedia.org/T305171) (owner: 10Jdlrobson) [21:35:46] (03CR) 10Dzahn: [C: 03+2] mediawiki: Update httpbb tests for /static/current going away [puppet] - 10https://gerrit.wikimedia.org/r/778295 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [21:41:22] RECOVERY - MariaDB Replica Lag: s1 on db1139 is OK: OK slave_sql_lag Replication lag: 29.77 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:41:28] RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:50:08] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:06:04] RECOVERY - MariaDB Replica Lag: s8 on db1171 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:08:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24324 and previous config saved to /var/cache/conftool/dbconfig/20220408-220827-ladsgroup.json [22:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:33] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:09:14] !log gitlab - deleted runner-1008 (to replace it with a bullseye instance), recreated runner-1020 with same flavor as existing runners T297659 [22:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:17] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 [22:17:32] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/777880 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [22:17:50] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/777882 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [22:18:08] (03CR) 10Cwhite: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/777887 (https://phabricator.wikimedia.org/T305013) (owner: 10Cwhite) [22:23:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24325 and previous config saved to /var/cache/conftool/dbconfig/20220408-222332-ladsgroup.json [22:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24326 and previous config saved to /var/cache/conftool/dbconfig/20220408-223837-ladsgroup.json [22:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24327 and previous config saved to /var/cache/conftool/dbconfig/20220408-225342-ladsgroup.json [22:53:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [22:53:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [22:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24328 and previous config saved to /var/cache/conftool/dbconfig/20220408-225350-ladsgroup.json [22:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:36] PROBLEM - MariaDB Replica Lag: s4 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1389.76 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:07:30] (03PS1) 10Krinkle: mediawiki: Remove route for /static/current/* (rewrite_static_assets) [deployment-charts] - 10https://gerrit.wikimedia.org/r/778601 (https://phabricator.wikimedia.org/T302465) [23:11:11] (03PS1) 10Krinkle: mediawiki: Remove unused rewrite_static_assets param [puppet] - 10https://gerrit.wikimedia.org/r/778602 (https://phabricator.wikimedia.org/T302465) [23:29:56] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
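A note on the repeated "Repooling after maintenance db1164" / "db1184" commits throughout this log: after maintenance a replica is brought back in weight stages rather than all at once, which is why the same message is committed several times a few minutes apart. A minimal sketch of that staged-repool pattern follows; the step sizes, wait interval, and exact dbctl invocations are illustrative assumptions, not the actual WMF automation.

```python
#!/usr/bin/env python3
"""Minimal sketch of a staged repool after database maintenance.

This is NOT the actual WMF auto-schema/dbctl tooling. The step sizes,
the wait between steps, and the exact dbctl command strings are
illustrative assumptions meant to mirror the repeated
'Repooling after maintenance dbNNNN' commits in the log above.
"""
import subprocess
import time


def run(cmd: list[str]) -> None:
    """Run a command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)


def repool_in_stages(instance: str, stages=(10, 25, 50, 75, 100), wait=900) -> None:
    """Restore a depooled replica's weight in steps, committing each one."""
    for pct in stages:
        run(["dbctl", "instance", instance, "pool", "-p", str(pct)])
        run(["dbctl", "config", "commit", "-m",
             f"Repooling after maintenance {instance}"])
        if pct != 100:
            time.sleep(wait)  # let load and replication settle before the next step


if __name__ == "__main__":
    repool_in_stages("db1164")
```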