[00:09:28] (03PS3) 10Krinkle: mediawiki: Update httpbb tests for /static/current going away [puppet] - 10https://gerrit.wikimedia.org/r/778295 (https://phabricator.wikimedia.org/T302465) [00:11:00] (03CR) 10Krinkle: "Confirmed against prod:" [puppet] - 10https://gerrit.wikimedia.org/r/778295 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [00:11:32] PROBLEM - Check systemd state on elastic1053 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:11:42] PROBLEM - Check systemd state on elastic1071 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:11:44] PROBLEM - Check systemd state on elastic1054 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:24] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:18:16] RECOVERY - Check systemd state on elastic1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:18] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 108 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [00:25:12] RECOVERY - Check systemd state on elastic1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:26:36] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:41:00] PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:41:18] PROBLEM - Check systemd state on elastic1068 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:08] PROBLEM - Check systemd state on elastic1069 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:54] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:50:14] RECOVERY - Check systemd state on elastic1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:54] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:01:12] RECOVERY - Check systemd state on elastic1048 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:02:20] RECOVERY - Check systemd state on elastic1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:09:08] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [01:09:54] PROBLEM - Check systemd state on elastic1076 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:11:18] PROBLEM - Check systemd state on elastic1074 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:11:46] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:14:02] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:16:06] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:21:08] RECOVERY - Check systemd state on elastic1076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:30] RECOVERY - Check systemd state on elastic1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:36] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:30] PROBLEM - Check systemd state on elastic1078 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:41:30] PROBLEM - Check systemd state on elastic1075 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:42:50] PROBLEM - Check systemd state on elastic1056 is CRITICAL: CRITICAL - degraded: The 
following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:00] RECOVERY - Check systemd state on elastic1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:55:40] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989 (10RLazarus) @MoritzMuehlenhoff Checking in -- have you had any time to take a look at this? [01:59:00] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:07:34] RECOVERY - Check systemd state on elastic1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:10:42] RECOVERY - Check systemd state on elastic1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:15:50] PROBLEM - Check systemd state on elastic1050 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:16:30] PROBLEM - Check systemd state on elastic1077 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:16:38] PROBLEM - Check systemd state on elastic1055 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:23:18] RECOVERY - Check systemd state on elastic1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:25:52] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:27:54] RECOVERY - Check systemd state on elastic1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:29:18] RECOVERY - Check systemd state on elastic1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:29:48] PROBLEM - Check systemd state on elastic1079 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:23] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: security updates - bking@cumin1001 - T304938 [02:30:25] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:28] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:46:08] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:56:50] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:58:28] RECOVERY - Check systemd state on elastic1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:15:28] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:25:50] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:45:40] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:48:52] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 72 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [03:56:56] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:17:14] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:20:04] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 64 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:26:12] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:52] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 67 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:43:22] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 61 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:46:32] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:54:12] (03CR) 10Marostegui: [C: 03+2] dbtools: Port switchover-tmpl to python [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) (owner: 10Ladsgroup) [04:54:48] (03Merged) 10jenkins-bot: dbtools: Port switchover-tmpl 
to python [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) (owner: 10Ladsgroup) [04:57:48] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:40] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 72 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:10:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1169', diff saved to https://phabricator.wikimedia.org/P24272 and previous config saved to /var/cache/conftool/dbconfig/20220408-051044-root.json [05:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:48] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:19:01] (03PS1) 10Marostegui: db1169: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/778380 [05:19:39] (03CR) 10Marostegui: [C: 03+2] db1169: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/778380 (owner: 10Marostegui) [05:25:39] (03PS1) 10KartikMistry: Add SectionTranslation entry points as campaigns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778381 (https://phabricator.wikimedia.org/T298029) [05:35:32] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:45:04] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:02:19] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:10:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Debian changelog update [software/conftool] - 10https://gerrit.wikimedia.org/r/778293 (owner: 10Giuseppe Lavagetto) [06:12:28] (03Merged) 10jenkins-bot: Debian changelog update [software/conftool] - 10https://gerrit.wikimedia.org/r/778293 (owner: 10Giuseppe Lavagetto) [06:19:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P24273 and previous config saved to /var/cache/conftool/dbconfig/20220408-061922-root.json [06:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:36] (03PS1) 10Marostegui: Revert "db1169: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/778237 [06:21:30] (03CR) 10Marostegui: [C: 03+2] Revert "db1169: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/778237 (owner: 10Marostegui) [06:25:47] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P24274 and previous config saved to /var/cache/conftool/dbconfig/20220408-063426-root.json [06:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:05] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [06:38:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [06:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:19] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P24275 and previous config saved to /var/cache/conftool/dbconfig/20220408-064930-root.json [06:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:21] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:59:05] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 65 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220408T0700) [07:04:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P24276 and previous config saved to /var/cache/conftool/dbconfig/20220408-070434-root.json [07:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:29] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 71 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:10:01] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:12:40] !log depool cp6011 for reimage - T290005 [07:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:45] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [07:14:59] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6011 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778300 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [07:15:21] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:19:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P24277 and previous config saved to /var/cache/conftool/dbconfig/20220408-071938-root.json [07:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:02] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6011.drmrs.wmnet with OS buster [07:21:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:10] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6011.drmrs.wmnet with OS buster [07:21:47] !log depool cp6003 for reimage - T290005 [07:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:50] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [07:23:40] (03PS2) 10MMandere: site: Reimage cp6003 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778301 (https://phabricator.wikimedia.org/T290005) [07:24:34] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6003 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778301 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [07:26:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [07:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [07:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24278 and previous config saved to /var/cache/conftool/dbconfig/20220408-072615-ladsgroup.json [07:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:27:10] (03CR) 10Filippo Giunchedi: "Doh! 
Thank you for the fix" [puppet] - 10https://gerrit.wikimedia.org/r/778354 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite) [07:28:53] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6003.drmrs.wmnet with OS buster [07:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:01] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6003.drmrs.wmnet with OS buster [07:31:04] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1176.eqiad.wmnet with OS bullseye [07:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P24279 and previous config saved to /var/cache/conftool/dbconfig/20220408-073442-root.json [07:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:12] (03CR) 10JMeybohm: "Make sure to also include a checksum annotation for the new configmap into your deployment spec to have it restarted on config changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [07:36:46] (03PS1) 10Phedenskog: grafana: double-proxy for performance JSON meta data [puppet] - 10https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) [07:36:58] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2001.codfw.wmnet with OS bullseye [07:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:14] (03CR) 10Phedenskog: [C: 04-1] grafana: double-proxy for performance JSON meta data [puppet] - 10https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [07:37:49] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:38:31] (03CR) 10Phedenskog: [C: 04-1] grafana: double-proxy for performance JSON meta data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [07:39:10] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage [07:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:49] (03CR) 10JMeybohm: "Not 100% sure if it makes sense, but maybe you could fail rendering of the chart "if .Values.datahub-gms.tls.enabled && !.Values.global.da" [deployment-charts] - 10https://gerrit.wikimedia.org/r/778308 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [07:41:11] 10Puppet, 10SRE, 10Infrastructure-Foundations: Validate all yaml files in puppet.git - https://phabricator.wikimedia.org/T305676 (10fgiunchedi) [07:42:07] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage [07:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:09] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1176.eqiad.wmnet with reason: host reimage [07:42:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:14] (03CR) 10Phedenskog: [C: 04-1] "Hi Chris, I tried to follow your pattern as in https://gerrit.wikimedia.org/r/c/operations/puppet/+/547030 to add a new proxy for other da" [puppet] - 10https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [07:43:18] (03CR) 10JMeybohm: [C: 03+1] Change POD IPv4 subnet for ml-serve-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/778208 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [07:44:04] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10fgiunchedi) >>! In T299462#7839117, @Cmjohnson wrote: > @fgiunchedi Do you recall how the disks are supposed to be set up and I can fix All raid0 from the h... [07:45:26] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1176.eqiad.wmnet with reason: host reimage [07:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:46] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6003.drmrs.wmnet with reason: host reimage [07:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [07:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [07:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T305300)', diff saved to https://phabricator.wikimedia.org/P24280 and previous config saved to /var/cache/conftool/dbconfig/20220408-074723-ladsgroup.json [07:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:27] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [07:47:45] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:48:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T305300)', diff saved to https://phabricator.wikimedia.org/P24281 and previous config saved to /var/cache/conftool/dbconfig/20220408-074829-ladsgroup.json [07:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:07] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6003.drmrs.wmnet with reason: host reimage [07:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:59] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2001.codfw.wmnet with reason: host reimage [07:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:01] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1007:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [07:51:53] (03CR) 10Filippo Giunchedi: "+1 to the idea, 
although IMHO this should be a property of the cluster(s). That way operators don't have to remember/know which clusters n" [cookbooks] - 10https://gerrit.wikimedia.org/r/778325 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [07:54:26] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2001.codfw.wmnet with reason: host reimage [07:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:02] RECOVERY - dump of es4 in codfw on alert1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-04-07 09:54:28 (2917 GiB, +1.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [07:55:29] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: reprioritize dlq filter [puppet] - 10https://gerrit.wikimedia.org/r/778353 (https://phabricator.wikimedia.org/T305088) (owner: 10Cwhite) [07:59:59] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1176.eqiad.wmnet with OS bullseye [08:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:28] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2151.codfw.wmnet with OS bullseye [08:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P24282 and previous config saved to /var/cache/conftool/dbconfig/20220408-080335-ladsgroup.json [08:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:14] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:05] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2001.codfw.wmnet with OS bullseye [08:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:11] !log restart db1133 T299876 [08:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:14] T299876: Upgrade database backup sources and dbprov* hosts to Bullseye + MariaDB 10.4 - https://phabricator.wikimedia.org/T299876 [08:15:06] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2151.codfw.wmnet with reason: host reimage [08:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:45] (JobUnavailable) firing: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:17:58] (03CR) 10David Caro: add chunkeddriver.py.patch to wallaby (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [08:18:01] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2151.codfw.wmnet with reason: host reimage [08:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:27] (03PS1) 10TheDJ: Older browser do not return a promise from .play() [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/778238 (https://phabricator.wikimedia.org/T304705) [08:18:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1174', diff saved to https://phabricator.wikimedia.org/P24283 and previous config saved to /var/cache/conftool/dbconfig/20220408-081840-ladsgroup.json [08:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:46] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host dbprov1001.eqiad.wmnet with OS bullseye [08:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:52] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6011.drmrs.wmnet with OS buster [08:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:01] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6011.drmrs.wmnet with OS buster com... [08:26:06] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:26:34] !log pool cp6011 with HAProxy as TLS termination layer - T290005 [08:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:37] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:26:45] (JobUnavailable) resolved: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:29:45] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2102.codfw.wmnet with OS bullseye [08:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:15] (JobUnavailable) firing: (2) Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:33:42] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2151.codfw.wmnet with OS bullseye [08:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T305300)', diff saved to https://phabricator.wikimedia.org/P24284 and previous config saved to /var/cache/conftool/dbconfig/20220408-083345-ladsgroup.json [08:33:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [08:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [08:33:49] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [08:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T305300)', diff saved to https://phabricator.wikimedia.org/P24285 and previous config saved to 
/var/cache/conftool/dbconfig/20220408-083353-ladsgroup.json [08:33:55] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1001.eqiad.wmnet with reason: host reimage [08:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:06] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:08] (03CR) 10Lucas Werkmeister (WMDE): Use wgRestAPIAdditionalRouteFiles for WB REST API (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774901 (owner: 10Jakob) [08:36:47] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1001.eqiad.wmnet with reason: host reimage [08:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:43] (03PS3) 10Jakob: Use wgRestAPIAdditionalRouteFiles for WB REST API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774901 [08:37:45] (03CR) 10Jakob: Use wgRestAPIAdditionalRouteFiles for WB REST API (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774901 (owner: 10Jakob) [08:37:46] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6003.drmrs.wmnet with OS buster [08:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:56] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6003.drmrs.wmnet with OS buster com... [08:38:05] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:40:07] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2002.codfw.wmnet with OS bullseye [08:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:24] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2102.codfw.wmnet with reason: host reimage [08:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Should be okay to deploy on Monday." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/774901 (owner: 10Jakob) [08:41:55] !log pool cp6003 with HAProxy as TLS termination layer - T290005 [08:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:58] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:43:53] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2102.codfw.wmnet with reason: host reimage [08:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:03] !log depool cp6010 for reimage - T290005 [08:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:08] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:48:15] (JobUnavailable) resolved: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:48:36] ^that is me doing a reimage and will fix itself soon [08:48:48] ah, actually, it is the resolution [08:49:27] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov1001.eqiad.wmnet with OS bullseye [08:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:03] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6010 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778302 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [08:50:13] (03PS3) 10MMandere: site: Reimage cp6010 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778302 (https://phabricator.wikimedia.org/T290005) [08:50:45] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:53:20] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2002.codfw.wmnet with reason: host reimage [08:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:45] (JobUnavailable) firing: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:55:20] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:56:44] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2002.codfw.wmnet with reason: host reimage [08:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:05] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6010.drmrs.wmnet with OS buster [08:57:05] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2102.codfw.wmnet with OS bullseye [08:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:14] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - 
https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6010.drmrs.wmnet with OS buster [08:58:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24286 and previous config saved to /var/cache/conftool/dbconfig/20220408-085810-ladsgroup.json [08:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:16] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:59:45] (JobUnavailable) resolved: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:01:27] (03PS4) 10Btullis: Configure LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) [09:02:10] !log depool cp6002 for reimage - T290005 [09:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:14] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:04:35] (03CR) 10Btullis: Configure LDAP authentication for DataHub (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [09:05:50] (03PS2) 10MMandere: site: Reimage cp6002 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778303 (https://phabricator.wikimedia.org/T290005) [09:07:36] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [09:08:01] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6002 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778303 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [09:08:08] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2002.codfw.wmnet with OS bullseye [09:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:47] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host dbprov1002.eqiad.wmnet with OS bullseye [09:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T305300)', diff saved to https://phabricator.wikimedia.org/P24287 and previous config saved to /var/cache/conftool/dbconfig/20220408-090943-ladsgroup.json [09:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:48] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [09:13:15] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6010.drmrs.wmnet with reason: host reimage [09:13:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24288 and previous config saved to /var/cache/conftool/dbconfig/20220408-091315-ladsgroup.json [09:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:29] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6002.drmrs.wmnet with OS buster [09:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:39] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster [09:14:36] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:06] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6010.drmrs.wmnet with reason: host reimage [09:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:31] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2003.codfw.wmnet with OS bullseye [09:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:27] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1002.eqiad.wmnet with reason: host reimage [09:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:45] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1002.eqiad.wmnet with reason: host reimage [09:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P24289 and previous config saved to /var/cache/conftool/dbconfig/20220408-092448-ladsgroup.json [09:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:26] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:25:51] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS buster [09:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS buster [09:28:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24290 and previous config saved to /var/cache/conftool/dbconfig/20220408-092820-ladsgroup.json [09:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:33] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6002.drmrs.wmnet with reason: host reimage [09:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:41] !log jynus@cumin2002 
START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2003.codfw.wmnet with reason: host reimage [09:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:37] (03CR) 10jerkins-bot: [V: 04-1] Configure LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [09:32:55] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6002.drmrs.wmnet with reason: host reimage [09:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:04] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jgiannelos) [09:33:44] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jgiannelos) [09:35:47] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2003.codfw.wmnet with reason: host reimage [09:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:54] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov1002.eqiad.wmnet with OS bullseye [09:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P24291 and previous config saved to /var/cache/conftool/dbconfig/20220408-093953-ladsgroup.json [09:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24292 and previous config saved to /var/cache/conftool/dbconfig/20220408-094325-ladsgroup.json [09:43:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [09:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [09:43:29] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:53] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:30] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2003.codfw.wmnet with OS bullseye [09:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:08] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host dbprov1003.eqiad.wmnet with OS bullseye [09:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:41] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 64 probes of 664 (alerts on 65) - 
https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:54:03] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1006.eqiad.wmnet with OS buster [09:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS buster executed with errors... [09:54:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T305300)', diff saved to https://phabricator.wikimedia.org/P24293 and previous config saved to /var/cache/conftool/dbconfig/20220408-095458-ladsgroup.json [09:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:02] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [10:00:37] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1003.eqiad.wmnet with reason: host reimage [10:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:11] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1003.eqiad.wmnet with reason: host reimage [10:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:26] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6002.drmrs.wmnet with OS buster [10:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:35] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster com... [10:07:25] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6010.drmrs.wmnet with OS buster [10:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:34] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6010.drmrs.wmnet with OS buster com... 
[10:08:15] (03PS1) 10David Caro: Add debian 11 testing support [puppet] - 10https://gerrit.wikimedia.org/r/778477 [10:11:20] !log pool cp6010 with HAProxy as TLS termination layer - T290005 [10:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:23] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:12:44] (03CR) 10David Caro: "After (if) https://gerrit.wikimedia.org/r/c/operations/puppet/+/778477 gets in (enabling proper facts for bullseye for testing), this woul" [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [10:15:18] (03CR) 10David Caro: Add debian 11 testing support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778477 (owner: 10David Caro) [10:15:38] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov1003.eqiad.wmnet with OS bullseye [10:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:40] !log pool cp6002 with HAProxy as TLS termination layer - T290005 [10:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:43] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:24:36] (03PS1) 10Zabe: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) [10:24:38] (03PS1) 10Zabe: swift: remove absented stats_account cron [puppet] - 10https://gerrit.wikimedia.org/r/778486 (https://phabricator.wikimedia.org/T273673) [10:25:09] (03CR) 10jerkins-bot: [V: 04-1] swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:25:20] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:27] (03CR) 10jerkins-bot: [V: 04-1] swift: remove absented stats_account cron [puppet] - 10https://gerrit.wikimedia.org/r/778486 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:26:18] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:26:44] (03PS2) 10Zabe: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) [10:29:25] (03CR) 10Cathal Mooney: [C: 03+2] Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - 10https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) (owner: 10Cathal Mooney) [10:30:03] (03Merged) 10jenkins-bot: Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - 10https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) (owner: 10Cathal Mooney) [10:31:02] (03PS3) 10Zabe: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) [10:33:04] 10SRE-swift-storage, 10Patch-For-Review: Move swift crons to systemd timers - https://phabricator.wikimedia.org/T288806 (10Zabe) [10:34:52] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 74 probes of 664 (alerts on 65) - 
https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:38:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [10:38:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [10:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:18] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:04] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:17] (03PS4) 10Zabe: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) [10:58:56] (03PS5) 10Zabe: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) [10:59:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 62 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:03:34] ^^ not really sure what the cause of this was, been digging here, widely distributed geographically and across ASNs. [11:04:10] which points to something close to us, but not found any smoking gun. Don't expect there was much user impact. 
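(A rough sketch of one way to break the failing probes down by origin AS, since the alert only reports a total. It assumes the usual RIPE Atlas v2 API shape for ping results — one object per probe under /latest/ with prb_id and rcvd fields — and asn_v6 in the probe metadata; measurement 1790947 is the one linked in the alert above.)

    # Latest ping result per probe for the Atlas measurement in the alert;
    # keep probes that received zero replies, then count them per origin ASN.
    curl -s 'https://atlas.ripe.net/api/v2/measurements/1790947/latest/' \
      | jq -r '.[] | select(.rcvd == 0) | .prb_id' \
      | while read -r prb; do
          curl -s "https://atlas.ripe.net/api/v2/probes/${prb}/" | jq -r '"AS\(.asn_v6)"'
        done | sort | uniq -c | sort -rn

A flat spread of small counts across many ASNs, rather than one AS dominating, would match the "widely distributed geographically and across ASNs" observation above and point away from any single remote network.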
[11:04:27] (03PS6) 10Zabe: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) [11:08:44] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/34752/" [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:10:06] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 71 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:11:21] !log depool cp6009 for reimage - T290005 [11:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:25] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:16:24] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:46] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: introduce bullseye-wikimedia/component/prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/778488 (https://phabricator.wikimedia.org/T302178) [11:24:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [11:24:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [11:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [11:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [11:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:38] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:07] PROBLEM - exim queue #page on mx1001 is CRITICAL: CRITICAL: 18791 mails in exim queue. 
https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail [11:32:05] (03PS1) 10MMandere: site: Reimage cp6009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778490 (https://phabricator.wikimedia.org/T290005) [11:32:07] (03PS1) 10MMandere: site: Reimage cp6001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778491 (https://phabricator.wikimedia.org/T290005) [11:34:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1184', diff saved to https://phabricator.wikimedia.org/P24294 and previous config saved to /var/cache/conftool/dbconfig/20220408-113452-root.json [11:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 57 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:37:37] anyone looking at the mail queue? [11:38:03] <_joe_> Emperor: if I can get into mx1001, maybe [11:38:09] * Emperor is there [11:38:12] <_joe_> but looks like I can't [11:38:18] the entire queue is to wikimedia.org [11:38:34] 27453 of 29187 [11:38:44] _joe_: you can't get there network-wise? [11:38:57] looks like fr-tech's gone pop again [11:39:22] <_joe_> yep [11:39:32] <_joe_> topranks: did it on the second attempt [11:39:44] <_joe_> Emperor: I would tell them in fr-tech and nuke their emails [11:39:57] <_joe_> the ones in the queue I mean [11:40:07] hmm ok. I'm still digging around on the back of that RIPE atlas result, if you continue to have problems let me know. [11:40:31] <_joe_> topranks: I am indeed using ipv6 [11:40:33] just working on an exiqgrep rune [11:41:19] <_joe_> Emperor: cheers to becoming an eximmaster [11:42:13] _joe_: as root: exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm [11:42:16] look OK? [11:42:33] <_joe_> Emperor: let's tell them first, but yes it does [11:43:35] <_joe_> Emperor: i told them [11:43:37] <_joe_> go on [11:44:06] doing so; please hold [11:45:24] <_joe_> log your action here for posterity [11:45:42] <_joe_> also for twitter purposes [11:45:46] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:51] !log exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm on mx1001 [11:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:04] queue back to a modest 2337 mails [11:46:15] <_joe_> "modest" :P [11:46:45] <_joe_> still all their mails [11:46:59] I've acked the VO incident, it should resolve once the queue checker has another look [11:47:18] <_joe_> yep [11:47:46] modest> I do some volunteer sysadmin for a popular fandom non-profit, we send a couple of million emails a day, 2000 in the queue is nothing ;-) [11:48:03] Anyhow, back to that lunchtime walk I was about to do... [11:48:54] <_joe_> me too! [11:51:06] Same! 
After filtering for the v6 Atlas IPs that are still unreachable from eqiad LVS VIP, the pattern right now to those ASNs is no different than normal: [11:51:07] https://w.wiki/52jm [11:51:16] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1007:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [11:51:42] something is making us trip the limit, but there is a background level always, I can't see any signs of a general problem our side. [11:51:58] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 73 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:53:36] (03PS3) 10MMandere: site: Reimage cp6009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778304 (https://phabricator.wikimedia.org/T290005) [11:56:01] was fr tech notified- if not I will send an email? [11:56:21] (03CR) 10Hnowlan: [C: 03+1] "lgtm, I will deploy this on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/772453 (https://phabricator.wikimedia.org/T304984) (owner: 10Jgiannelos) [11:57:00] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:19] jynus: Joe left a message in #wikimedia-fundraising [11:57:26] ah, cool [11:58:22] (03PS1) 10Zabe: sslcert: migrate update-ocsp-all cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) [11:58:24] (03PS1) 10Zabe: sslcert: remove absented update-ocsp-all cron [puppet] - 10https://gerrit.wikimedia.org/r/778493 (https://phabricator.wikimedia.org/T273673) [11:58:55] (03CR) 10jerkins-bot: [V: 04-1] sslcert: migrate update-ocsp-all cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:59:12] (03CR) 10jerkins-bot: [V: 04-1] sslcert: remove absented update-ocsp-all cron [puppet] - 10https://gerrit.wikimedia.org/r/778493 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:02:07] (03PS2) 10Zabe: sslcert: migrate update-ocsp-all cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) [12:04:30] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 65 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:08:21] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778304 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [12:10:03] (03PS2) 10Zabe: sslcert: remove absented update-ocsp-all cron [puppet] - 10https://gerrit.wikimedia.org/r/778493 (https://phabricator.wikimedia.org/T273673) [12:10:13] (03PS2) 10Zabe: swift: remove absented stats_account cron [puppet] - 10https://gerrit.wikimedia.org/r/778486 (https://phabricator.wikimedia.org/T273673) [12:11:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:11:34] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24295 and previous config saved to /var/cache/conftool/dbconfig/20220408-121138-ladsgroup.json [12:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:42] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:11:45] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6009.drmrs.wmnet with OS buster [12:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:54] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6009.drmrs.wmnet with OS buster [12:15:45] !log depool cp6001 for reimage - T290005 [12:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:49] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:17:20] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:24] (03PS2) 10MMandere: site: Reimage cp6001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778305 (https://phabricator.wikimedia.org/T290005) [12:19:17] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778305 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [12:22:55] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6001.drmrs.wmnet with OS buster [12:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:04] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster [12:24:15] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 70 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:27:59] I can't help but notice that there are 17k messages in the queue to fr-tech again [12:29:41] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6009.drmrs.wmnet with reason: host reimage [12:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:09] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6009.drmrs.wmnet with reason: 
host reimage [12:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:55] which is well above the p.age threshold [12:35:50] (am poking on the fundraising channel) [12:39:43] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 61 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:40:44] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6001.drmrs.wmnet with reason: host reimage [12:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:50] I'm a bit worried about the disk space on mx1001 too, not just the raw queue numbers [12:43:34] (03Abandoned) 10MMandere: site: Reimage cp6009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778490 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [12:43:44] absent further input from #wikimedia-fundraising I think I should bin another pile of the fr-tech-failmail pileup. Could I get a +1 from someone before I do so, please? I'm a bit wary of just binning 22k emails on a whim... [12:44:14] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6001.drmrs.wmnet with reason: host reimage [12:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:20] (03Abandoned) 10MMandere: site: Reimage cp6001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778491 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [12:49:51] Emperor: there is a task for disk [12:50:37] Emperor: https://phabricator.wikimedia.org/T305567 [12:54:07] !log exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm on mx1001 (again) [12:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:26] (03PS2) 10JMeybohm: Move kubemaster1002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777327 (https://phabricator.wikimedia.org/T305435) [12:57:26] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kubemaster1002.eqiad.wmnet with reason: reimage [12:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:28] (03CR) 10JMeybohm: [C: 03+2] Move kubemaster1002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777327 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [12:57:28] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubemaster1002.eqiad.wmnet with reason: reimage [12:57:28] 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (10MatthewVernon) We've just had a repeat, and again exim mainlog is I think too big for logrotate to succeed with :-/ ` mvernon@mx1001:~$ ls -lsh /var/log/exim4/mainlog 3.9G -rw-r----- 1 Debian... 
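(For reference, a minimal sketch of the queue-cleanup pattern being !logged above, run as root on mx1001. The recipient address is taken from the log entries; the exiqgrep flags are standard — -r matches on recipient, -c counts, -i prints only message IDs for exim -Mrm to delete. Inspect first, then remove.)

    # How big is the problem, and is it really all addressed to the failmail alias?
    exim -bpc
    exiqgrep -c -r fr-tech-failmail@wikimedia.org

    # Remove the matching messages: -i emits just the queue IDs, exim -Mrm deletes them.
    exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm

    # Confirm the queue has dropped back under the paging threshold
    # (the exim queue check above recovers below 2000 messages).
    exim -bpc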
[12:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:28] !log gmodena@deploy1002 Started deploy [airflow-dags/research@b029f10]: (no justification provided) [12:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:32] (03PS1) 10Zabe: admin: fix a few indentation issues [puppet] - 10https://gerrit.wikimedia.org/r/778498 [12:59:11] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:59:15] !log exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm on mx1001 (again again) [12:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:15] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:00:31] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:00:40] !log gmodena@deploy1002 Finished deploy [airflow-dags/research@b029f10]: (no justification provided) (duration: 02m 11s) [13:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:00] RECOVERY - exim queue #page on mx1001 is OK: OK: Less than 2000 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail [13:03:19] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:08:25] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:08:41] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:09:11] (03PS2) 10JMeybohm: Move kubemaster1001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777328 (https://phabricator.wikimedia.org/T305435) [13:09:43] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:24] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6009.drmrs.wmnet with OS buster [13:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:33] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6009.drmrs.wmnet with OS buster com... [13:13:17] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6001.drmrs.wmnet with OS buster [13:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:26] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster com... 
[13:13:53] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 [13:14:07] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:14:09] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 73 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:15:29] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:01] !log pool cp6009 with HAProxy as TLS termination layer - T290005 [13:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:06] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [13:16:07] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:18:03] !log exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm on mx1001 (again again again; keeping queue below the p.age threshold while fr-tech work) [13:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:13] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778505 (https://phabricator.wikimedia.org/T292968) (owner: 10WMDE-Fisch) [13:18:33] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:20:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P24296 and previous config saved to /var/cache/conftool/dbconfig/20220408-132024-root.json [13:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:54] !log pool cp6001 with HAProxy as TLS termination layer - T290005 [13:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:09] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kubemaster1001.eqiad.wmnet with reason: reimage [13:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:11] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubemaster1001.eqiad.wmnet with reason: reimage [13:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:19] (03CR) 10JMeybohm: [C: 03+2] Move kubemaster1001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777328 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [13:26:16] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:26:52] (03CR) 10Awight: [C: 03+2] [beta] Enable colorblind-friendly color scheme on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778505 (https://phabricator.wikimedia.org/T292968) (owner: 10WMDE-Fisch) [13:28:11] (03Merged) 10jenkins-bot: [beta] Enable colorblind-friendly color scheme on beta cluster [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/778505 (https://phabricator.wikimedia.org/T292968) (owner: 10WMDE-Fisch) [13:29:54] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:30:19] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2008.codfw.wmnet with OS bullseye [13:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:40] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 64 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:30:44] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:31:14] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:34:52] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:35:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P24297 and previous config saved to /var/cache/conftool/dbconfig/20220408-133528-root.json [13:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:35:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:00] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:37:00] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1008.eqiad.wmnet with OS bullseye [13:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:02] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:37:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24298 and previous config saved to /var/cache/conftool/dbconfig/20220408-133715-ladsgroup.json [13:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:37:29] (03PS1) 10Giuseppe Lavagetto: requestctl: bugfixes [software/conftool] - 10https://gerrit.wikimedia.org/r/778509 [13:37:31] (03PS1) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/778510 [13:37:58] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:38:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: bugfixes 
[software/conftool] - 10https://gerrit.wikimedia.org/r/778509 (owner: 10Giuseppe Lavagetto) [13:39:22] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 71 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:39:50] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:59] (03Merged) 10jenkins-bot: requestctl: bugfixes [software/conftool] - 10https://gerrit.wikimedia.org/r/778509 (owner: 10Giuseppe Lavagetto) [13:40:12] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:40:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/778510 (owner: 10Giuseppe Lavagetto) [13:41:02] 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (10jhathaway) ok, thanks, I'll rotate it manually and plan on embiggening the existing hosts. [13:42:20] (03Merged) 10jenkins-bot: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/778510 (owner: 10Giuseppe Lavagetto) [13:43:08] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:43:36] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2008.codfw.wmnet with reason: host reimage [13:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:56] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:46:26] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2008.codfw.wmnet with reason: host reimage [13:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:59] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1008.eqiad.wmnet with reason: host reimage [13:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:32] !log exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm on mx1001 (again again again again; keeping queue below the p.age threshold while fr-tech work) [13:50:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P24299 and previous config saved to /var/cache/conftool/dbconfig/20220408-135032-root.json [13:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:20] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1008.eqiad.wmnet with reason: host reimage [13:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24300 and previous config saved to /var/cache/conftool/dbconfig/20220408-135220-ladsgroup.json [13:52:21] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:07] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:57:43] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2008.codfw.wmnet with OS bullseye [13:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:35] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:02:44] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1008.eqiad.wmnet with OS bullseye [14:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P24302 and previous config saved to /var/cache/conftool/dbconfig/20220408-140536-root.json [14:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:37] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:07:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24303 and previous config saved to /var/cache/conftool/dbconfig/20220408-140725-ladsgroup.json [14:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:13] (03CR) 10Herron: sre.kafka.reboot-workers: add --skip-mirrormaker option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/778325 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [14:08:49] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:23] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:05] (03PS1) 10Herron: sre.kafka.reboot-workers: remove systemctl stop calls [cookbooks] - 10https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) [14:19:57] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:19:59] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P24304 and previous config saved to /var/cache/conftool/dbconfig/20220408-142041-root.json [14:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:13] (03PS1) 10Jcrespo: install_server: Disable wiping of backup[12]008 after bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/778518 (https://phabricator.wikimedia.org/T305446) [14:21:44] !log exiqgrep -i -r fr-tech-failmail@wikimedia.org | xargs exim -Mrm on mx1001 (again again again again; keeping queue below the p.age threshold while fr-tech work) [14:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24305 and previous config saved to /var/cache/conftool/dbconfig/20220408-142230-ladsgroup.json [14:22:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:22:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24306 and previous config saved to /var/cache/conftool/dbconfig/20220408-142239-ladsgroup.json [14:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:51] (03CR) 10Jcrespo: [C: 03+2] install_server: Disable wiping of backup[12]008 after bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/778518 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [14:26:12] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:54] (03CR) 10Herron: "This is a simple alternate approach to I2d96ceee31ac3f1029bc45395cebe4dac47d5a4a" [cookbooks] - 10https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [14:27:04] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:27:38] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:28:52] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:31:01] (BlazegraphJvmQuakeWarnGC) resolved: Blazegraph instance wdqs1007:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [14:31:58] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:34:28] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:34:48] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:35:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P24307 and previous config saved to /var/cache/conftool/dbconfig/20220408-143545-root.json [14:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:58] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:38:34] (03CR) 10Cwhite: [C: 03+2] logstash: reprioritize dlq filter [puppet] - 10https://gerrit.wikimedia.org/r/778353 (https://phabricator.wikimedia.org/T305088) (owner: 10Cwhite) [14:39:32] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:39:50] 10SRE, 10Prod-Kubernetes, 10Traffic, 10serviceops, 10Kubernetes: service:.catalog entries and dnsdisc for Kubernetes sevrices under Ingress - https://phabricator.wikimedia.org/T305358 (10JMeybohm) [14:40:20] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:40:48] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:41:12] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:44:22] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:45:02] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:40] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:46:52] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:57:56] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:16] RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:48] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:08:18] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:10:02] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:11:26] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:14:34] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:16:06] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:48] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:21:34] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:00] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:25:06] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:05] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34755/console" [puppet] - 10https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [15:28:22] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:20] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:30:28] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:30:29] (03CR) 10Andrew Bogott: add chunkeddriver.py.patch to wallaby (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [15:31:08] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:31:54] RECOVERY - Host logstash2024 is UP: PING OK - Packet loss = 0%, RTA = 31.79 ms [15:32:04] PROBLEM - Check systemd state on logstash2024 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:28] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [15:34:08] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:39:32] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:40:10] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:40:56] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:41:06] RECOVERY - Check systemd state on logstash2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:26] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:44:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24308 and previous config saved to /var/cache/conftool/dbconfig/20220408-154414-ladsgroup.json [15:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:45:28] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:46] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:12] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:51:11] (03PS1) 10Vgutierrez: requestctl: Fix VCL_acl matching on VSL expressions [software/conftool] - 10https://gerrit.wikimedia.org/r/778536 [15:51:17] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:20] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:39] !log dancy@deploy1002 Started scap: (no justification provided) [15:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:02] !log dancy@deploy1002: Testing mw container image build [15:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:24] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:40] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:58:50] (03PS1) 
10Giuseppe Lavagetto: Requestctl VCL/VSL fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/778537 [15:59:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24309 and previous config saved to /var/cache/conftool/dbconfig/20220408-155919-ladsgroup.json [15:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: Fix VCL_acl matching on VSL expressions [software/conftool] - 10https://gerrit.wikimedia.org/r/778536 (owner: 10Vgutierrez) [16:01:39] (03Merged) 10jenkins-bot: requestctl: Fix VCL_acl matching on VSL expressions [software/conftool] - 10https://gerrit.wikimedia.org/r/778536 (owner: 10Vgutierrez) [16:02:42] (03PS2) 10Giuseppe Lavagetto: Requestctl VCL translation fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/778537 [16:05:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Requestctl VCL translation fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/778537 (owner: 10Giuseppe Lavagetto) [16:05:32] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:05:55] (03PS1) 10Majavah: hieradata: add new eqiad1 enc servers [puppet] - 10https://gerrit.wikimedia.org/r/778538 (https://phabricator.wikimedia.org/T295247) [16:06:51] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34756/console" [puppet] - 10https://gerrit.wikimedia.org/r/778538 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [16:06:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:06:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:48] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:08:51] (03CR) 10Vivian Rook: add chunkeddriver.py.patch to wallaby (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [16:11:38] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:11:38] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:13:18] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:52] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:14:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24310 and previous config saved to /var/cache/conftool/dbconfig/20220408-161425-ladsgroup.json [16:14:32] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:32] <_joe_> jayme: can you check what went wrong with deploy_to_mwdebug ? [16:15:32] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:16:08] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:17:48] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:19:12] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:21:19] _joe_ I'll try the helmfile apply under my account to see what happens. [16:21:33] <_joe_> dancy: thanks! [16:21:45] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:21:48] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:21:57] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:22:01] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:22:13] hmm.. 
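(Side note while deploy_to_mwdebug.service is being looked at: a quick sketch of the standard systemd triage on deploy1002, using only the unit name from the alert. That the unit is run periodically is an assumption, based on the check flapping and recovering on its own.)

    # Last run result and exit status of the failed unit, then its recent log output.
    systemctl status deploy_to_mwdebug.service
    journalctl -u deploy_to_mwdebug.service -n 100 --no-pager

    # Once the underlying failure is understood/fixed, clear the degraded state so the
    # systemd-state check recovers without waiting for the next successful run.
    systemctl reset-failed deploy_to_mwdebug.service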
[16:22:54] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:22:58] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:23:48] (03CR) 10Krinkle: [C: 03+1] deployment-prep: re-point to new bullseye hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774518 (https://phabricator.wikimedia.org/T301638) (owner: 10Dave Pifke) [16:24:16] (03PS1) 10Giuseppe Lavagetto: varnish: add new-version dynamic request filter template [puppet] - 10https://gerrit.wikimedia.org/r/778543 (https://phabricator.wikimedia.org/T305606) [16:24:18] (03PS1) 10Giuseppe Lavagetto: varnish: switch to using new-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778544 (https://phabricator.wikimedia.org/T305606) [16:24:20] (03PS1) 10Giuseppe Lavagetto: varnish: remove old-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778545 (https://phabricator.wikimedia.org/T305606) [16:24:22] (03PS1) 10Giuseppe Lavagetto: varnish: remove absented resource [puppet] - 10https://gerrit.wikimedia.org/r/778546 (https://phabricator.wikimedia.org/T305606) [16:24:50] (03CR) 10jerkins-bot: [V: 04-1] varnish: add new-version dynamic request filter template [puppet] - 10https://gerrit.wikimedia.org/r/778543 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [16:25:26] (03CR) 10jerkins-bot: [V: 04-1] varnish: remove old-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778545 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [16:25:54] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:03] (03CR) 10jerkins-bot: [V: 04-1] varnish: remove absented resource [puppet] - 10https://gerrit.wikimedia.org/r/778546 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [16:27:32] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:27:32] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:29:12] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:29:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24311 and previous config saved to /var/cache/conftool/dbconfig/20220408-162930-ladsgroup.json [16:29:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [16:29:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [16:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24312 and previous config saved to /var/cache/conftool/dbconfig/20220408-162938-ladsgroup.json [16:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:50] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:29:50] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:32:50] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:34:52] (03PS1) 10Majavah: openstack: remove horizon access to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/778551 [16:37:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: add new eqiad1 enc servers [puppet] - 10https://gerrit.wikimedia.org/r/778538 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [16:38:02] (03PS2) 10Majavah: openstack: remove horizon access to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/778551 [16:38:04] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:39:22] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:39:38] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34758/console" [puppet] - 10https://gerrit.wikimedia.org/r/778551 (owner: 10Majavah) [16:41:30] (03PS3) 10Majavah: openstack: remove horizon access to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/778551 [16:41:32] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:43:00] 
PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:43:08] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34759/console" [puppet] - 10https://gerrit.wikimedia.org/r/778551 (owner: 10Majavah) [16:45:44] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:28] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:49:04] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:51:22] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:52:55] (03PS2) 10Giuseppe Lavagetto: varnish: add new-version dynamic request filter template [puppet] - 10https://gerrit.wikimedia.org/r/778543 (https://phabricator.wikimedia.org/T305606) [16:52:57] (03PS2) 10Giuseppe Lavagetto: varnish: switch to using new-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778544 (https://phabricator.wikimedia.org/T305606) [16:52:59] (03PS2) 10Giuseppe Lavagetto: varnish: remove old-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778545 (https://phabricator.wikimedia.org/T305606) [16:53:01] (03PS2) 10Giuseppe Lavagetto: varnish: remove absented resource [puppet] - 10https://gerrit.wikimedia.org/r/778546 (https://phabricator.wikimedia.org/T305606) [16:53:08] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:54:46] (03CR) 10Herron: "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [16:55:00] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:55:24] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:56:32] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:57:04] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:04] (03PS12) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [17:00:43] (03PS13) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [17:01:06] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:01:50] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:06:43] (03PS2) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 [17:07:35] (03CR) 10jerkins-bot: [V: 04-1] prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 (owner: 10Arturo Borrero Gonzalez) [17:07:56] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:08:42] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:09:42] (03PS1) 10Cathal Mooney: Increase threshold for IPv6 RIPE Atlas probes from 65 to 90 [puppet] - 10https://gerrit.wikimedia.org/r/778567 (https://phabricator.wikimedia.org/T305703) [17:10:12] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:10:58] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:11:10] (03PS1) 10Giuseppe Lavagetto: mwdebug_deploy: switch back to using the root user [puppet] - 10https://gerrit.wikimedia.org/r/778568 (https://phabricator.wikimedia.org/T305729) [17:12:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug_deploy: switch back to using the root user [puppet] - 10https://gerrit.wikimedia.org/r/778568 (https://phabricator.wikimedia.org/T305729) (owner: 10Giuseppe Lavagetto) [17:15:18] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:41] (03PS3) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 [17:16:26] (03CR) 10Cathal Mooney: [C: 03+2] Increase threshold for IPv6 RIPE Atlas probes from 65 to 90 [puppet] - 10https://gerrit.wikimedia.org/r/778567 (https://phabricator.wikimedia.org/T305703) (owner: 10Cathal Mooney) [17:18:01] (03PS1) 10Giuseppe Lavagetto: mwdebug_deploy: also re-add docker credentials [puppet] - 10https://gerrit.wikimedia.org/r/778571 [17:22:16] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:23:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10cmooney) FYI I believe PXE is failing for dumpsdata1006 as the DAC cable is plugged into the second NIC port on the server side. 
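The change just merged above raises the IPv6 RIPE Atlas probe threshold from 65 to 90, and the recovery a few lines below reports "failed 69 probes of 665 (alerts on 90)": 69 failing probes tripped the old limit but sits under the new one. A minimal sketch of that thresholding, assuming a simple greater-or-equal comparison rather than the actual check implementation:

```python
# Minimal sketch of the thresholding behind the "IPv6 ping to eqiad" alert,
# not the actual check implementation: with 69 of 665 probes failing, the old
# critical threshold of 65 fires while the new threshold of 90 does not.
def atlas_status(failed: int, total: int, critical: int) -> str:
    """Return an Icinga-style status line for a RIPE Atlas measurement."""
    state = "CRITICAL" if failed >= critical else "OK"
    return f"{state} - failed {failed} probes of {total} (alerts on {critical})"


print(atlas_status(69, 665, critical=65))  # old threshold -> CRITICAL
print(atlas_status(69, 665, critical=90))  # new threshold -> OK
```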
[17:23:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug_deploy: also re-add docker credentials [puppet] - 10https://gerrit.wikimedia.org/r/778571 (owner: 10Giuseppe Lavagetto) [17:24:28] (03PS1) 10Majavah: hieradata: use puppet-enc hostname in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/778574 (https://phabricator.wikimedia.org/T295247) [17:24:30] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 69 probes of 665 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:24:35] (03PS4) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) [17:26:42] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34760/console" [puppet] - 10https://gerrit.wikimedia.org/r/778574 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [17:31:44] (03PS1) 10Giuseppe Lavagetto: mwdebug_deploy: fix resource type [puppet] - 10https://gerrit.wikimedia.org/r/778577 [17:32:19] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mwdebug_deploy: fix resource type [puppet] - 10https://gerrit.wikimedia.org/r/778577 (owner: 10Giuseppe Lavagetto) [17:34:53] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:56] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:07] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:11] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:26] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:18] (03CR) 10Cwhite: "Change looks generally correct to me." [puppet] - 10https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [17:44:59] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10wiki_willy) a:03RobH Hi @ssingh - since this server is out of warranty and due to be refreshed in a few quarters, do you still want us to purchase a replacement DIMM to keep it up and r... [17:45:14] (03CR) 10CDanis: [C: 03+1] "Looks good. I think debian-glue is unhappy because there's a debian/changelog edit in the same patch as an 'upstream' code edit? But wha" [software/conftool] - 10https://gerrit.wikimedia.org/r/778537 (owner: 10Giuseppe Lavagetto) [17:48:13] (03CR) 10CDanis: [C: 03+1] requestctl: Fix VCL_acl matching on VSL expressions [software/conftool] - 10https://gerrit.wikimedia.org/r/778536 (owner: 10Vgutierrez) [17:48:20] * James_F grumbles at getting a hundred "page moved" e-mails from wikitech triggered by Krinkle again. :-P [17:50:00] I enabled flood flag [17:50:09] After the first 80? [17:50:13] no, from the start [17:50:22] Well I got the ENOTIF e-mails anyway. 
[17:50:45] Currently working its way through the Talk: ones having done the main set? [17:51:18] indeed, it's one big batch [17:51:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24313 and previous config saved to /var/cache/conftool/dbconfig/20220408-175120-ladsgroup.json [17:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:51:41] https://wikitech.wikimedia.org/wiki/Special:Recentchanges?hidehumans=1&limit=100&days=14&urlversion=2 [17:51:56] https://wikitech.wikimedia.org/wiki/Special:Recentchanges?hidebots=1&limit=100&days=14&urlversion=2 [17:52:00] I don't need to see it on RC, I've got it in GMail. :-P [17:52:01] none without bot show up there [17:52:10] yeah, maybe a bug, probably a controversial one [17:52:17] Meh. [17:52:19] unless it's recent, in which case it's a regression [17:53:16] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:53:18] I don't immediately see anything in Phabricator. Will file a task. [17:54:16] you're also creating double redirects btw [17:54:19] I see things about "hide bots" in the watchlist prefs but enotif isn't actually from watchlist technically. Echo prefs don't cover it. The general "email me when ..." is boolean on/off with an extra box for ".. also for minor edits" [17:54:28] AntiComposite: I know, I'm queuing that up next [17:54:50] but deciding whether to run it normally or not, since the script is kinda broken (doesn't use system user correctly) [17:54:56] I might run it with pywiki instead [17:55:28] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:55:52] Filed as T305734 FWIW. 
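On the double redirects left behind by the bulk page moves: since the usual maintenance script reportedly doesn't attribute its edits to the system user correctly, Krinkle mentions running the cleanup with pywikibot instead, which ships a ready-made script for this (`python pwb.py redirect double`). A minimal sketch of the same cleanup follows, assuming the target wiki and credentials are configured in user-config.py and that the account holds the bot/flood flag.

```python
#!/usr/bin/env python3
"""Hedged sketch of cleaning up double redirects with pywikibot.

Pywikibot's own `redirect.py` script does this job; the loop below just
spells out the same idea. Assumptions: the target wiki and credentials
come from user-config.py, the account has the bot/flood flag, and
site.double_redirects() (which wraps Special:DoubleRedirects) is used to
find candidates.
"""
import pywikibot

site = pywikibot.Site()

for page in site.double_redirects(total=500):
    first_hop = page.getRedirectTarget()
    if not first_hop.isRedirectPage():
        continue  # already fixed
    final = first_hop.getRedirectTarget()
    if final.isRedirectPage():
        continue  # chain longer than two hops; skip rather than risk a loop
    page.text = f"#REDIRECT [[{final.title()}]]"
    page.save(summary=f"Fixing double redirect to [[{final.title()}]]", minor=True)
```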
[17:55:52] T305734: Page move notification e-mails sent for watchlisted pages even when actor has the flood right - https://phabricator.wikimedia.org/T305734 [17:56:16] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:52] (03CR) 10Cwhite: sre.kafka.reboot-workers: remove systemctl stop calls (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [17:57:56] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:00:10] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:02:27] (03CR) 10Cwhite: sre: add alerts for exporter-specific unavailability (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/778259 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [18:06:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24315 and previous config saved to /var/cache/conftool/dbconfig/20220408-180625-ladsgroup.json [18:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:16:40] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:40] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.8387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:19:11] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.1248 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [18:21:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24316 and previous config saved to /var/cache/conftool/dbconfig/20220408-182130-ladsgroup.json [18:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [18:25:42] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:25:57] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5785 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [18:27:38] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:29:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [18:36:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24317 and previous config saved to /var/cache/conftool/dbconfig/20220408-183635-ladsgroup.json [18:36:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [18:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [18:36:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:36:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24318 and previous config saved to /var/cache/conftool/dbconfig/20220408-183643-ladsgroup.json [18:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:49] !log gitlab1001 - giving myself gitlab admin rights via rake console, to be able to connect/disconnect runners T297659 [18:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:52] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 [18:46:02] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:35] ^ always 1308 and no other [18:50:06] I noticed that very frequent [18:51:45] ipmi_sdr_cache_create: internal IPMI error [18:51:57] internal IPMI error .. 
I translate that to "broken DRAC" [18:52:02] or "maybe reset fixes it" [18:53:00] i'll try the "soft" DRAC reset [18:53:06] if it needs hard reset then it needs dcops [18:55:16] well, can't connect to DRAC in the first place to reset it.. so broken DRAC it is.. will make ticket [18:57:18] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:42] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 30.1 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:01:00] 10SRE, 10ops-eqiad: mw1308 - internal IPMI error - mgmt / DRAC problem - https://phabricator.wikimedia.org/T305741 (10Dzahn) [19:03:21] 10SRE, 10ops-eqiad: mw1308 - internal IPMI error - mgmt / DRAC problem - https://phabricator.wikimedia.org/T305741 (10Dzahn) p:05Triage→03Medium [19:15:28] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:34] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:21:36] PROBLEM - MariaDB Replica Lag: s4 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1286.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:22:48] PROBLEM - MariaDB Replica Lag: s8 on db1171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1353.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:23:12] PROBLEM - MariaDB Replica Lag: s1 on db1139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1377.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:23:18] PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1383.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:25:30] ACKNOWLEDGEMENT - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service daniel_zahn https://phabricator.wikimedia.org/T305741 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:44] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:55] downtime for a month [19:32:54] RECOVERY - MariaDB Replica Lag: s4 on db2139 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:33:14] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:56:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24319 and previous config saved to /var/cache/conftool/dbconfig/20220408-195614-ladsgroup.json [19:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:19] T298565: Fix mismatching field type of user table for columns 
user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:57:02] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (10wiki_willy) Hi @RobH - just followingup to see if they ever sent the DIMM for this. Thanks, Willy [20:11:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24320 and previous config saved to /var/cache/conftool/dbconfig/20220408-201119-ladsgroup.json [20:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:36] (03CR) 10Legoktm: [C: 03+1] "One small nit, otherwise LGTM. I reviewed the script in core, it's a simple select plus deletes that are guarded by a LIMIT of $wgUpdateRo" [puppet] - 10https://gerrit.wikimedia.org/r/776349 (https://phabricator.wikimedia.org/T257473) (owner: 10Zabe) [20:26:08] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:26:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24321 and previous config saved to /var/cache/conftool/dbconfig/20220408-202624-ladsgroup.json [20:26:26] (03CR) 10Dzahn: [C: 03+1] "tested and they pass the tests on mwdebug1001:" [puppet] - 10https://gerrit.wikimedia.org/r/778295 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [20:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:23] (03CR) 10Dzahn: [C: 03+2] phabricator: allow disabling ssh-phab service except on one host [puppet] - 10https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [20:35:29] uhm.. 
unexpected puppet behaviour on the passive phab host :p [20:35:53] might cause an alert in a moment..but on it [20:36:59] (03PS1) 10Dzahn: Revert "phabricator: allow disabling ssh-phab service except on one host" [puppet] - 10https://gerrit.wikimedia.org/r/778242 [20:38:42] (03CR) 10Dzahn: [C: 03+2] Revert "phabricator: allow disabling ssh-phab service except on one host" [puppet] - 10https://gerrit.wikimedia.org/r/778242 (owner: 10Dzahn) [20:40:18] PROBLEM - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:41:22] that is the one I expected but change is already reverted, this just lags a bit [20:41:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24322 and previous config saved to /var/cache/conftool/dbconfig/20220408-204129-ladsgroup.json [20:41:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [20:41:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [20:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24323 and previous config saved to /var/cache/conftool/dbconfig/20220408-204138-ladsgroup.json [20:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:15] (03PS1) 10Dzahn: Revert "Revert "phabricator: allow disabling ssh-phab service except on one host"" [puppet] - 10https://gerrit.wikimedia.org/r/778243 [20:44:14] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:44:54] RECOVERY - PyBal backends health check on lvs2008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:45:11] (03CR) 10Dzahn: "compiler shows it as "present" on phab2001 but in reality it removes the ressources !?" 
[puppet] - 10https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [20:49:00] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:52:04] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:13:39] (03PS2) 10Legoktm: Revert "Cache Badtitle 400s for 60s in varnish-fe" [puppet] - 10https://gerrit.wikimedia.org/r/769827 [21:13:53] (03CR) 10Legoktm: "Ping :)" [puppet] - 10https://gerrit.wikimedia.org/r/769827 (owner: 10Legoktm) [21:26:28] (03CR) 10Krinkle: [C: 03+1] Revert "Cache Badtitle 400s for 60s in varnish-fe" [puppet] - 10https://gerrit.wikimedia.org/r/769827 (owner: 10Legoktm) [21:35:29] (03Abandoned) 10Jdlrobson: Convert performanceNow datatype to Integer in QuickSurvey Initiation in order to resolve data type mismatch in schema. [extensions/QuickSurveys] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/777775 (https://phabricator.wikimedia.org/T305171) (owner: 10Jdlrobson) [21:35:46] (03CR) 10Dzahn: [C: 03+2] mediawiki: Update httpbb tests for /static/current going away [puppet] - 10https://gerrit.wikimedia.org/r/778295 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [21:41:22] RECOVERY - MariaDB Replica Lag: s1 on db1139 is OK: OK slave_sql_lag Replication lag: 29.77 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:41:28] RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:50:08] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:06:04] RECOVERY - MariaDB Replica Lag: s8 on db1171 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:08:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24324 and previous config saved to /var/cache/conftool/dbconfig/20220408-220827-ladsgroup.json [22:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:33] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:09:14] !log gitlab - deleted runner-1008 (to replace it with a bullseye instance), recreated runner-1020 with same flavor as existing runners T297659 [22:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:17] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 [22:17:32] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/777880 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [22:17:50] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/777882 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [22:18:08] (03CR) 10Cwhite: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/777887 (https://phabricator.wikimedia.org/T305013) (owner: 10Cwhite) [22:23:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24325 and previous config saved to /var/cache/conftool/dbconfig/20220408-222332-ladsgroup.json [22:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24326 and previous config saved to /var/cache/conftool/dbconfig/20220408-223837-ladsgroup.json [22:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24327 and previous config saved to /var/cache/conftool/dbconfig/20220408-225342-ladsgroup.json [22:53:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [22:53:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [22:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24328 and previous config saved to /var/cache/conftool/dbconfig/20220408-225350-ladsgroup.json [22:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:36] PROBLEM - MariaDB Replica Lag: s4 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1389.76 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:07:30] (03PS1) 10Krinkle: mediawiki: Remove route for /static/current/* (rewrite_static_assets) [deployment-charts] - 10https://gerrit.wikimedia.org/r/778601 (https://phabricator.wikimedia.org/T302465) [23:11:11] (03PS1) 10Krinkle: mediawiki: Remove unused rewrite_static_assets param [puppet] - 10https://gerrit.wikimedia.org/r/778602 (https://phabricator.wikimedia.org/T302465) [23:29:56] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
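A note on the repeated "Repooling after maintenance db1164" / "db1184" commits throughout this log: after maintenance a replica is brought back in weight stages rather than all at once, which is why the same message is committed several times a few minutes apart. A minimal sketch of that staged-repool pattern follows; the step sizes, wait interval, and exact dbctl invocations are illustrative assumptions, not the actual WMF automation.

```python
#!/usr/bin/env python3
"""Minimal sketch of a staged repool after database maintenance.

This is NOT the actual WMF auto-schema/dbctl tooling. The step sizes,
the wait between steps, and the exact dbctl command strings are
illustrative assumptions meant to mirror the repeated
'Repooling after maintenance dbNNNN' commits in the log above.
"""
import subprocess
import time


def run(cmd: list[str]) -> None:
    """Run a command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)


def repool_in_stages(instance: str, stages=(10, 25, 50, 75, 100), wait=900) -> None:
    """Restore a depooled replica's weight in steps, committing each one."""
    for pct in stages:
        run(["dbctl", "instance", instance, "pool", "-p", str(pct)])
        run(["dbctl", "config", "commit", "-m",
             f"Repooling after maintenance {instance}"])
        if pct != 100:
            time.sleep(wait)  # let load and replication settle before the next step


if __name__ == "__main__":
    repool_in_stages("db1164")
```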