[00:00:05] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:10:05] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service [00:10:05] @8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:17] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:19:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [00:19:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [00:23:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [01:37:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:45] (JobUnavailable) firing: (4) Reduced availability for 
job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) resolved: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:03] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:33] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service [02:07:33] @8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:20:43] RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:28:11] PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: 
thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service [02:28:11] @8818.service,thumbor@8819.service,thumbor@8820.service,thumbor@8821.service,thumbor@8822.service,thumbor@8823.service,thumbor@8824.service,thumbor@8828.service,thumbor@8829.service,thumbor@8831.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:57] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:19:53] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:29:59] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:37:25] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service [03:37:25] @8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:59:59] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:07:31] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service [04:07:31] @8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:21:15] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:30:05] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:32:01] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:37:35] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service [04:37:35] @8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:57] (PS1) Abijeet Patro: WikiPage group description: prefix source page title [extensions/Translate] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/812531 (https://phabricator.wikimedia.org/T312688) [04:46:04] (CR) Abijeet Patro: [C: +1] WikiPage group description: prefix source page title [extensions/Translate] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/812531 (https://phabricator.wikimedia.org/T312688) (owner: Abijeet Patro) [04:51:53] RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:59:21] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:00:11] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:07:43] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service [05:07:43] @8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:20:49] RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:28:23] PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service [05:28:23] @8818.service,thumbor@8819.service,thumbor@8820.service,thumbor@8821.service,thumbor@8822.service,thumbor@8823.service,thumbor@8824.service,thumbor@8828.service,thumbor@8829.service,thumbor@8831.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:30:19] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:33:23] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:40:21] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service [05:40:21] @8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:42:17] SRE, Thumbor: Thumbor units failing / service general slowness - https://phabricator.wikimedia.org/T312722 (jcrespo) Found the following exception, sending as NDA, as I suspect it is user-traffic related: {P30998} [05:47:21] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:52:07] RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:54:53] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service [05:54:53] @8818.service,thumbor@8819.service,thumbor@8820.service,thumbor@8821.service,thumbor@8823.service,thumbor@8826.service,thumbor@8827.service,thumbor@8829.service,thumbor@8835.service,thumbor@8837.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:59:39] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:21:05] RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:17] <_joe_> !log depooled thumbor1005, downgraded firejail, restarted units [06:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:33] PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service [06:28:33] @8818.service,thumbor@8819.service,thumbor@8820.service,thumbor@8821.service,thumbor@8822.service,thumbor@8823.service,thumbor@8824.service,thumbor@8828.service,thumbor@8829.service,thumbor@8831.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:47] <_joe_> !log repool thumbor1005 [06:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:29] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:05] RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:11] RECOVERY - Check systemd state on thumbor2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:37:59] RECOVERY - Check systemd state on thumbor2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:45] RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:59] 
RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:09] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:38] (PS9) David Caro: tests: add test to ensure that runbook existis if set [alerts] - https://gerrit.wikimedia.org/r/812011 [06:44:55] (PS6) David Caro: wmcs: add test to ensure we add a runbook to each alert [alerts] - https://gerrit.wikimedia.org/r/812013 [06:50:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2077.codfw.wmnet [06:51:15] (PS1) Marostegui: mariadb: Decommission db2077 [puppet] - https://gerrit.wikimedia.org/r/812703 (https://phabricator.wikimedia.org/T312191) [06:52:05] (CR) David Caro: [C: +2] wmcs: add test to ensure we add a runbook to each alert [alerts] - https://gerrit.wikimedia.org/r/812013 (owner: David Caro) [06:54:24] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [06:55:07] (Merged) jenkins-bot: tests: add test to ensure that runbook existis if set [alerts] - https://gerrit.wikimedia.org/r/812011 (owner: David Caro) [06:55:09] (Merged) jenkins-bot: wmcs: add test to ensure we add a runbook to each alert [alerts] - https://gerrit.wikimedia.org/r/812013 (owner: David Caro) [06:58:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:59:16] (CR) Marostegui: [C: +2] mariadb: Decommission db2077 [puppet] - https://gerrit.wikimedia.org/r/812703 (https://phabricator.wikimedia.org/T312191) (owner: Marostegui) [07:00:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2077.codfw.wmnet [07:00:04] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments.
Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220711T0700). [07:00:05] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:07] SRE, ops-codfw, decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2077.codfw.wmnet` - db2077.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - F... [07:00:30] SRE, ops-codfw, decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (Marostegui) @Papaul this is ready for you [07:01:06] SRE, ops-codfw, decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (Marostegui) a:Papaul [07:01:15] (PS5) David Caro: wmcs: add alerts for any node going down [alerts] - https://gerrit.wikimedia.org/r/812259 [07:01:19] (CR) David Caro: wmcs: add alerts for any node going down (1 comment) [alerts] - https://gerrit.wikimedia.org/r/812259 (owner: David Caro) [07:01:23] (PS4) David Caro: wmcs: add systemd unit down alerts [alerts] - https://gerrit.wikimedia.org/r/812313 [07:01:27] (CR) David Caro: wmcs: add systemd unit down alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/812313 (owner: David Caro) [07:01:31] (PS7) David Caro: wmcs: add openstack nodes down alerts [alerts] - https://gerrit.wikimedia.org/r/811999 [07:09:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2027.codfw.wmnet with OS bullseye [07:09:15] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host
ganeti2027.codfw.wmnet with OS bullseye [07:21:40] abijeet_: hi, it looks like no one claimed the window yet! I can deploy if you're still around. [07:21:50] (PS1) Marostegui: mariadb: Decommission db2080 [puppet] - https://gerrit.wikimedia.org/r/812704 (https://phabricator.wikimedia.org/T312618) [07:22:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2080.codfw.wmnet [07:23:46] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2027.codfw.wmnet with reason: host reimage [07:24:08] (CR) Filippo Giunchedi: "LGTM overall, see inline" [alerts] - https://gerrit.wikimedia.org/r/812424 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall) [07:25:38] SRE, Thumbor: Thumbor units failing / service general slowness - https://phabricator.wikimedia.org/T312722 (Joe) I downgraded firejail on all thumbor servers and that stopped, at least for now, the flurry of restarts we were seeing. More investigation is needed. [07:26:18] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [07:26:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2027.codfw.wmnet with reason: host reimage [07:29:31] (PS1) Majavah: UndeleteHookHandler: fix namespace conditional [extensions/PageTriage] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/812532 (https://phabricator.wikimedia.org/T311347) [07:29:38] (CR) Filippo Giunchedi: "Thanks Daniel, LGTM as a temporary measure." [puppet] - https://gerrit.wikimedia.org/r/812427 (https://phabricator.wikimedia.org/T275170) (owner: Dzahn) [07:30:22] urbanecm: are you deploying something or can I deploy a patch of my own? [07:30:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:30:44] (CR) Filippo Giunchedi: [C: +2] Add alert manager alert receivers for the Abstract Wikipedia team.
[puppet] - https://gerrit.wikimedia.org/r/811790 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang) [07:31:03] taavi: seems abijeet_ isn't here too, so feel free to squeeze in :) [07:31:19] (CR) Filippo Giunchedi: [C: +2] smokeping: remove DNS targets, moved to Prometheus [puppet] - https://gerrit.wikimedia.org/r/812329 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi) [07:31:26] (CR) Marostegui: [C: +2] mariadb: Decommission db2080 [puppet] - https://gerrit.wikimedia.org/r/812704 (https://phabricator.wikimedia.org/T312618) (owner: Marostegui) [07:31:41] thanks, will do [07:31:46] godog: ok to merge? [07:31:47] (CR) Majavah: [C: +2] UndeleteHookHandler: fix namespace conditional [extensions/PageTriage] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/812532 (https://phabricator.wikimedia.org/T311347) (owner: Majavah) [07:31:50] marostegui: yes please! [07:31:52] thank you [07:31:56] godog: done! [07:31:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2080.codfw.wmnet [07:32:01] <3 marostegui [07:32:05] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [07:32:44] (PS1) Marostegui: instances.yaml: Remove db2080 from dbctl [puppet] - https://gerrit.wikimedia.org/r/812705 (https://phabricator.wikimedia.org/T312618) [07:33:23] (CR) Marostegui: [C: +2] instances.yaml: Remove db2080 from dbctl [puppet] - https://gerrit.wikimedia.org/r/812705 (https://phabricator.wikimedia.org/T312618) (owner: Marostegui) [07:33:28] (CR) Filippo Giunchedi: [C: +2] swift: turn off uwsgi request logging [puppet] - https://gerrit.wikimedia.org/r/810276 (https://phabricator.wikimedia.org/T297959) (owner: Filippo Giunchedi) [07:33:47] !log marostegui@cumin1001 dbctl commit (dc=all):
'Remove db2080 from dbtcl T312618', diff saved to https://phabricator.wikimedia.org/P30999 and previous config saved to /var/cache/conftool/dbconfig/20220711-073346-marostegui.json [07:33:50] lol, good timing again marostegui [07:33:50] T312618: decommission db2080 - https://phabricator.wikimedia.org/T312618 [07:34:21] (PS8) David Caro: wmcs: add openstack nodes down alerts [alerts] - https://gerrit.wikimedia.org/r/811999 [07:34:23] (PS1) David Caro: wmcs: Add ceph cluster alerts [alerts] - https://gerrit.wikimedia.org/r/812706 [07:34:42] godog: I merged mine! I didn't get any output from any other change! [07:34:48] ops-codfw, decommission-hardware, Patch-For-Review: decommission db2080 - https://phabricator.wikimedia.org/T312618 (Marostegui) a:Papaul [07:34:56] (CR) David Caro: [C: +2] wmcs: add alerts for any node going down [alerts] - https://gerrit.wikimedia.org/r/812259 (owner: David Caro) [07:35:06] ops-codfw, decommission-hardware, Patch-For-Review: decommission db2080 - https://phabricator.wikimedia.org/T312618 (Marostegui) @Papaul this is ready for you!
[07:35:16] (CR) David Caro: [C: +2] wmcs: add systemd unit down alerts [alerts] - https://gerrit.wikimedia.org/r/812313 (owner: David Caro) [07:36:30] (CR) David Caro: wmcs: Add ceph cluster alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/812706 (owner: David Caro) [07:36:37] (CR) CI reject: [V: -1] wmcs: Add ceph cluster alerts [alerts] - https://gerrit.wikimedia.org/r/812706 (owner: David Caro) [07:36:42] (Merged) jenkins-bot: wmcs: add alerts for any node going down [alerts] - https://gerrit.wikimedia.org/r/812259 (owner: David Caro) [07:36:51] (CR) CI reject: [V: -1] wmcs: add openstack nodes down alerts [alerts] - https://gerrit.wikimedia.org/r/811999 (owner: David Caro) [07:37:06] (Merged) jenkins-bot: wmcs: add systemd unit down alerts [alerts] - https://gerrit.wikimedia.org/r/812313 (owner: David Caro) [07:37:12] marostegui: yeah I submitted mine, then went to puppet-merge and you were merging too (twice in a row) [07:37:34] (Merged) jenkins-bot: UndeleteHookHandler: fix namespace conditional [extensions/PageTriage] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/812532 (https://phabricator.wikimedia.org/T311347) (owner: Majavah) [07:39:13] testing on mwdebug1001 [07:40:00] works, syncing [07:41:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2027.codfw.wmnet with OS bullseye [07:41:31] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2027.codfw.wmnet with OS bullseye completed: - ganeti2027 (**PASS**) - Downtimed on...
[07:43:05] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/PageTriage/includes/HookHandlers/UndeleteHookHandler.php: Backport: [[gerrit:812532|UndeleteHookHandler: fix namespace conditional (T311347)]] (duration: 02m 54s) [07:43:09] T311347: Mark freshly undeleted articles as unreviewed automatically - https://phabricator.wikimedia.org/T311347 [07:43:12] * taavi done [07:43:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:45:21] (PS1) Marostegui: db2165: Candidate master for s8 [puppet] - https://gerrit.wikimedia.org/r/812708 (https://phabricator.wikimedia.org/T311493) [07:47:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:47:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:49:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet [07:51:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:51:57] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:52:21] !log roll-restart swift-account swift-container across swift/thanos bullseye hosts - T297959 [07:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:24] T297959: thanos-be hosts filing up root filesystem with logs - https://phabricator.wikimedia.org/T297959 [07:54:22] (CR) Marostegui: [C: +2] db2165: Candidate master for s8 [puppet] - https://gerrit.wikimedia.org/r/812708 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui) [07:57:45] (CR) Filippo Giunchedi: [V: +1 C: +2] thanos: split retention times based on resolution [puppet] - https://gerrit.wikimedia.org/r/811932 (https://phabricator.wikimedia.org/T311690) (owner: Filippo Giunchedi) [07:58:29] !log jmm@cumin2002 END (PASS) -
Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet [07:59:54] (CR) Filippo Giunchedi: [V: +1 C: +2] thanos: trim raw samples retention to 54 weeks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/811933 (https://phabricator.wikimedia.org/T311690) (owner: Filippo Giunchedi) [08:04:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2027.codfw.wmnet to cluster codfw and group A [08:05:39] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [08:06:24] !log trim thanos raw samples retention to 54w - T311690 [08:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:28] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 [08:10:55] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff) [08:14:27] (CR) Ayounsi: [C: +1] prometheus: add support to blackbox icmp probe hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812330 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi) [08:15:54] (PS4) Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS [puppet] - https://gerrit.wikimedia.org/r/810106 [08:16:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2027.codfw.wmnet to cluster codfw and group A [08:16:43] (PS5) Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS [puppet] - https://gerrit.wikimedia.org/r/810106 [08:17:16] (CR) Ayounsi: [C: +1] "LGTM nice work!"
[puppet] - https://gerrit.wikimedia.org/r/812331 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi) [08:18:01] (CR) Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS (2 comments) [puppet] - https://gerrit.wikimedia.org/r/810106 (owner: Andrea Denisse) [08:19:48] (CR) Ayounsi: [C: +1] prometheus: blackbox icmp probes for hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812331 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi) [08:20:58] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/810106 (owner: Andrea Denisse) [08:27:23] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/810106 (owner: Andrea Denisse) [08:28:38] (CR) Andrea Denisse: [C: +2] Add PHP 7.4 dependencies for LibreNMS [puppet] - https://gerrit.wikimedia.org/r/810106 (owner: Andrea Denisse) [08:30:00] (CR) Filippo Giunchedi: [C: +2] "Thanks for the review!" [puppet] - https://gerrit.wikimedia.org/r/812330 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi) [08:30:33] denisse|m: merging your change too! \o/ [08:30:51] (PS1) Slyngshede: Add per node vCPU allocations [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/812818 [08:30:56] godog: Thank you very much!
:) [08:35:11] (PS2) Slyngshede: Add per node vCPU allocations [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/812818 [08:35:17] (PS2) Filippo Giunchedi: prometheus: blackbox icmp probes for hosts [puppet] - https://gerrit.wikimedia.org/r/812331 (https://phabricator.wikimedia.org/T169860) [08:35:26] (CR) Filippo Giunchedi: prometheus: blackbox icmp probes for hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812331 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi) [08:38:27] (CR) Jbond: [C: +2] utils: chmod +x setup_rake.sh and vcl_ec2_nets.py [puppet] - https://gerrit.wikimedia.org/r/810973 (owner: Zabe) [08:38:34] (CR) Filippo Giunchedi: [C: +2] prometheus: blackbox icmp probes for hosts [puppet] - https://gerrit.wikimedia.org/r/812331 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi) [08:40:10] (PS1) David Caro: dumps.kiwix-rsync-cron: Return 0 when not failed [puppet] - https://gerrit.wikimedia.org/r/812819 [08:41:52] (PS1) Volans: tests: fix caplog usage [software/spicerack] - https://gerrit.wikimedia.org/r/812820 [08:41:54] (PS1) Volans: tests: reduce runtime by more than 80% [software/spicerack] - https://gerrit.wikimedia.org/r/812821 [08:46:49] SRE, ops-codfw, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (Volans) @Papaul I've finished my tests on db2175, it's all yours! Thanks for the help. I've sent patches to Gerrit to fix the issue and once merge...
[08:46:53] (03CR) 10Jbond: [C: 03+2] mailmap: add a few entries [puppet] - 10https://gerrit.wikimedia.org/r/809163 (owner: 10Zabe) [08:48:27] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809616 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:49:21] (03CR) 10Ayounsi: "Not 100% sure yet, but I think this could be a Custom Validator now, see T310590." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [08:49:36] (03CR) 10CI reject: [V: 04-1] tests: reduce runtime by more than 80% [software/spicerack] - 10https://gerrit.wikimedia.org/r/812821 (owner: 10Volans) [08:49:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809626 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:50:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/809624 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:52:13] (03PS1) 10Marostegui: wmnet: Update s3-master [dns] - 10https://gerrit.wikimedia.org/r/812822 (https://phabricator.wikimedia.org/T311610) [08:52:55] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:52:59] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/812822 (https://phabricator.wikimedia.org/T311610) (owner: 10Marostegui) [08:53:55] (03PS2) 10Volans: tests: reduce runtime by more than 80% [software/spicerack] - 10https://gerrit.wikimedia.org/r/812821 [08:54:18] (03CR) 10Slyngshede: "Getting CPU allocation per node is easier to do in the exporter, compared to trying to extract the information using the existing metrics " [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/812818 (owner: 10Slyngshede) [08:56:59] (03CR) 10Elukey: [C: 03+1] druid: Fixed UID/GIDs are 
universally in use now [puppet] - 10https://gerrit.wikimedia.org/r/812286 (owner: 10Muehlenhoff) [09:02:40] (03PS1) 10David Caro: rabbitmq.drain_queue: Fix requeue option for newer API [puppet] - 10https://gerrit.wikimedia.org/r/812825 [09:08:18] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) Great, so next step are: # Install the breakout panels, (document them, similar to {T304710}) # Pre-populate the ports/panels that will be used with th... [09:10:16] (03PS1) 10Jbond: puppetmaster: improve error handling for puppet-facts-upload [puppet] - 10https://gerrit.wikimedia.org/r/812827 (https://phabricator.wikimedia.org/T311742) [09:15:31] (03PS1) 10Marostegui: mariadb: Replace db2078 with db2160 [puppet] - 10https://gerrit.wikimedia.org/r/812828 (https://phabricator.wikimedia.org/T311493) [09:17:11] (03CR) 10Jbond: [C: 03+2] puppetmaster: improve error handling for puppet-facts-upload [puppet] - 10https://gerrit.wikimedia.org/r/812827 (https://phabricator.wikimedia.org/T311742) (owner: 10Jbond) [09:17:13] (03PS1) 10Jbond: pcc: add correct tools pm public key [puppet] - 10https://gerrit.wikimedia.org/r/812829 (https://phabricator.wikimedia.org/T311742) [09:17:33] (PuppetFailure) firing: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:19:29] (03CR) 10Jbond: [C: 03+2] pcc: add correct tools pm public key [puppet] - 10https://gerrit.wikimedia.org/r/812829 (https://phabricator.wikimedia.org/T311742) (owner: 10Jbond) [09:19:41] (03CR) 10Jcrespo: [C: 03+1] mariadb: Replace db2078 with db2160 [puppet] - 10https://gerrit.wikimedia.org/r/812828 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [09:19:57] (03CR) 10Marostegui: [C: 03+2] mariadb: Replace db2078 with db2160 [puppet] - 
10https://gerrit.wikimedia.org/r/812828 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [09:20:21] (03CR) 10Ayounsi: "Tested on netbox-next and works as expected:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) (owner: 10Ayounsi) [09:22:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:24:01] 10Puppet, 10Infrastructure-Foundations, 10puppet-compiler, 10Patch-For-Review: pcc-uploader failing on tools-puppetmaster-02 - https://phabricator.wikimedia.org/T311742 (10jbond) i think this is related to when the ssl certificate needed to be extended. I have uploaded the [[ https://gerrit.wikimedia.org/... [09:24:56] 10Puppet, 10Infrastructure-Foundations, 10puppet-compiler, 10Patch-For-Review: pcc-uploader failing on tools-puppetmaster-02 - https://phabricator.wikimedia.org/T311742 (10jbond) 05Open→03Resolved a:03jbond [09:26:46] jbond: ^ not sure if icinga puppet could be your patch or unrelated? [09:27:14] (03PS3) 10Slyngshede: P:dbbackups::mydumper Move mydumper from cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) [09:27:34] probably not, some other earlier patch [09:29:31] (03Abandoned) 10Jbond: P:puppet::agent: add logging of puppet calls [puppet] - 10https://gerrit.wikimedia.org/r/808984 (owner: 10Jbond) [09:30:03] seems alertmanager related, godog maybe? [09:30:59] (03CR) 10Slyngshede: "Add ignore errors to get behavior similar to crontab." 
[puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:31:29] found it, I think it is https://gerrit.wikimedia.org/r/c/operations/puppet/+/811790 [09:31:57] ^godog there must be an extra - that yaml doesn't like or something [09:36:25] https://gerrit.wikimedia.org/r/c/operations/puppet/+/811790/3/modules/alertmanager/templates/alertmanager.yml.erb#426 maybe? [09:42:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/811232 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:42:39] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811227 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:43:14] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811229 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:43:42] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811226 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:44:05] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811231 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:46:13] (03PS9) 10David Caro: wmcs: add openstack nodes down alerts [alerts] - 10https://gerrit.wikimedia.org/r/811999 [09:46:15] (03PS2) 10David Caro: wmcs: Add ceph cluster alerts [alerts] - 10https://gerrit.wikimedia.org/r/812706 [09:46:17] (03CR) 10David Caro: wmcs: Add ceph cluster alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/812706 (owner: 10David Caro) [09:47:25] (03CR) 10David Caro: [C: 03+2] rabbitmq.drain_queue: Fix requeue option for newer API [puppet] - 10https://gerrit.wikimedia.org/r/812825 (owner: 10David Caro) [09:48:43] (03CR) 10Jbond: [C: 03+1] tilerator: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811225 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:50:01] (03CR) 
10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811228 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:51:17] (03CR) 10Jcrespo: P:dbbackups::mydumper Move mydumper from cron to systemd timer. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:52:12] (03CR) 10Jcrespo: P:dbbackups::mydumper Move mydumper from cron to systemd timer. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:52:49] jynus: sorry, missed your earlier ping; however, I think you're right, the issue seems to be related to godog's CR. godog, let me know if you need a hand [09:53:16] sorry for the ping, you happened to merge something at the time [09:53:32] so I was guessing until I looked at it more deeply [09:53:50] no probs :) [09:54:05] also I pinged because I thought it was blocking icinga updates [09:54:15] (03PS1) 10Majavah: alertmanager: fix indentation [puppet] - 10https://gerrit.wikimedia.org/r/812834 [09:54:21] ^ probably fixed by this [09:54:23] (which would have been high prio) [09:54:41] ack [09:55:16] taavi: ack thanks lgtm will merge [09:55:37] (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/812834 (owner: 10Majavah) [09:56:34] (03CR) 10Jbond: [C: 03+1] "LGTM minor comment/question inline" [puppet] - 10https://gerrit.wikimedia.org/r/810956 (owner: 10Volans) [09:57:16] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/810955 (owner: 10Volans) [09:58:03] (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/810956 (owner: 10Volans) [09:59:00] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Update s3-master [dns] - 10https://gerrit.wikimedia.org/r/812822 (https://phabricator.wikimedia.org/T311610) (owner: 10Marostegui) [09:59:22] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/810956 (owner: 10Volans) [09:59:32] taavi: that did indeed fix it thanks (cc godog ) [10:00:11] CI should probably catch that [10:00:54] volans: agreed, see T305676 [10:00:55] T305676: Validate all yaml files in puppet.git - https://phabricator.wikimedia.org/T305676 [10:01:34] that's a .yml.erb so you can't just run it directly through a linter [10:01:48] and T236954 which has more comments [10:01:49] T236954: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 [10:01:58] ahh yes erb files are a whole other mess [10:02:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:03:14] probably could just move the alert routing configuration to hiera [10:04:28] (03PS1) 10Marostegui: db2160: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/812836 (https://phabricator.wikimedia.org/T311493) [10:05:20] (03CR) 10Marostegui: [C: 03+2] db2160: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/812836 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [10:06:56] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging AniketArs out of all services on: 663 hosts [10:07:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AniketArs out of all services on: 663 hosts [10:08:03] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging AniketArs out of all services on: 1292 hosts [10:08:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AniketArs out of all services on: 1292 hosts [10:08:57] (03CR) 10David Caro: [C: 03+2] wmcs: add openstack nodes down alerts [alerts] - 10https://gerrit.wikimedia.org/r/811999 (owner: 10David Caro) [10:11:37] (03Merged) 10jenkins-bot: wmcs: add openstack nodes down alerts [alerts] - 
10https://gerrit.wikimedia.org/r/811999 (owner: 10David Caro) [10:16:56] (03PS4) 10Jbond: spdx: Add csr files to the list of files to ignore. [puppet] - 10https://gerrit.wikimedia.org/r/808219 [10:17:01] (03PS1) 10Marostegui: db2078: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/812837 (https://phabricator.wikimedia.org/T312754) [10:17:33] (03CR) 10Jbond: "thanks for the feedback, updated" [puppet] - 10https://gerrit.wikimedia.org/r/808219 (owner: 10Jbond) [10:17:35] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:41] (03CR) 10Jbond: spdx: Add csr files to the list of files to ignore. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/808219 (owner: 10Jbond) [10:17:43] (03PS5) 10Jbond: spdx: Add csr files to the list of files to ignore. [puppet] - 10https://gerrit.wikimedia.org/r/808219 [10:18:11] (03PS6) 10Jbond: spdx: Add csr files to the list of files to ignore. 
[puppet] - 10https://gerrit.wikimedia.org/r/808219 [10:20:02] (03CR) 10Marostegui: [C: 03+2] db2078: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/812837 (https://phabricator.wikimedia.org/T312754) (owner: 10Marostegui) [10:20:23] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/803262 (owner: 10Ayounsi) [10:23:44] (03PS1) 10Marostegui: site.pp: db2078 no longer active misc host [puppet] - 10https://gerrit.wikimedia.org/r/812838 (https://phabricator.wikimedia.org/T312754) [10:24:37] (03CR) 10Marostegui: [C: 03+2] site.pp: db2078 no longer active misc host [puppet] - 10https://gerrit.wikimedia.org/r/812838 (https://phabricator.wikimedia.org/T312754) (owner: 10Marostegui) [10:27:57] (PuppetFailure) resolved: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:32:13] (03CR) 10Volans: "One nit inline, LGTM otherwise. To be thoroughly tested." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/803262 (owner: 10Ayounsi) [10:34:58] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/811331 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [10:35:53] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [10:38:27] (03CR) 10Volans: [C: 03+1] "LGTM (I'll leave it to you the details related to k8s APIs discussed in the comment)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/811983 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [10:40:17] (03PS1) 10Marostegui: mariadb: Switchover s3 master db1123 -> db1157 [puppet] - 10https://gerrit.wikimedia.org/r/812841 (https://phabricator.wikimedia.org/T311610) [10:41:05] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/812841 (https://phabricator.wikimedia.org/T311610) (owner: 10Marostegui) [10:42:17] (03CR) 10Jbond: "See inline for comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [10:42:30] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Switchover s3 master db1123 -> db1157 [puppet] - 10https://gerrit.wikimedia.org/r/812841 (https://phabricator.wikimedia.org/T311610) (owner: 10Marostegui) [10:42:42] (03Abandoned) 10Jbond: C:httpd: allow users to pass the listen_ports to use [puppet] - 10https://gerrit.wikimedia.org/r/797223 (owner: 10Jbond) [10:47:40] (03PS4) 10Jbond: service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) [10:47:43] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [10:47:53] (03PS5) 10Jbond: service::node::config: drop merge_config [puppet] - 
10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) [10:49:03] (03CR) 10CI reject: [V: 04-1] service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [10:55:04] jynus taavi thanks for the heads up and the fix! [10:55:19] jbond too [10:55:26] agreed re: yaml + erb, not super easy [10:58:14] (03CR) 10Volans: "Quite a large one. I've done a quick pass, left some comments." [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [11:01:50] (03CR) 10Jbond: [C: 03+1] network: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811230 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:02:11] (03CR) 10Volans: "quick reply to comments, I didn't do a full pass yet" [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [11:03:41] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [alerts] - 10https://gerrit.wikimedia.org/r/812706 (owner: 10David Caro) [11:04:36] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/812172 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:05:06] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/812173 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:06:01] (03PS1) 10Ladsgroup: wwwportals: Make sure portal assets are also visible in wikibooks vhost [puppet] - 10https://gerrit.wikimedia.org/r/812843 (https://phabricator.wikimedia.org/T273179) [11:06:32] (03CR) 10Jbond: [C: 03+1] interface: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/812176 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:07:44] (03CR) 10Jbond: [C: 03+1] nginx: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812174 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) 
[11:10:55] PROBLEM - puppet last run on idp-test1002 is CRITICAL: CRITICAL: Puppet has been disabled for 604910 seconds, message: jmm testing things, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:11:14] moritzm: ^^ [11:13:54] (03PS1) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) [11:14:28] (03CR) 10CI reject: [V: 04-1] phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [11:16:33] (03PS3) 10Jbond: reqconfig: Add a default for git_repo and ensure its a Path [software/conftool] - 10https://gerrit.wikimedia.org/r/805338 [11:16:43] (03CR) 10Jbond: "updated thanks" [software/conftool] - 10https://gerrit.wikimedia.org/r/805338 (owner: 10Jbond) [11:17:08] (03PS4) 10Jbond: reqconfig: Add a default for git_repo and ensure its a Path [software/conftool] - 10https://gerrit.wikimedia.org/r/805338 [11:18:36] (03PS10) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 [11:18:40] (03PS2) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) [11:18:50] (03PS1) 10Filippo Giunchedi: icinga: switch to prometheus-only probes for commons [puppet] - 10https://gerrit.wikimedia.org/r/812854 (https://phabricator.wikimedia.org/T305847) [11:19:37] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:24] jbond: ah yes, will roll back my test changes for now [11:22:45] ack [11:23:19] (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/797293 (owner: 
10Majavah) [11:23:23] (03PS3) 10Jbond: nrpe: move plugins off the base nrpe class [puppet] - 10https://gerrit.wikimedia.org/r/797293 (owner: 10Majavah) [11:23:35] (03PS1) 10Marostegui: db2165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/812855 (https://phabricator.wikimedia.org/T311493) [11:24:01] (03CR) 10Marostegui: [C: 03+2] db2165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/812855 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [11:25:35] (03PS1) 10Marostegui: mariadb: Productinize db2163 [puppet] - 10https://gerrit.wikimedia.org/r/812856 (https://phabricator.wikimedia.org/T311475) [11:25:59] (03PS2) 10Marostegui: mariadb: Productinize db2163 [puppet] - 10https://gerrit.wikimedia.org/r/812856 (https://phabricator.wikimedia.org/T311493) [11:27:09] (03CR) 10Marostegui: [C: 03+2] mariadb: Productinize db2163 [puppet] - 10https://gerrit.wikimedia.org/r/812856 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [11:27:13] (03CR) 10Jbond: Add a host's conftool pooled status and weight per service to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [11:27:28] (03PS2) 10Marostegui: mariadb: Switchover s3 master db1123 -> db1157 [puppet] - 10https://gerrit.wikimedia.org/r/812841 (https://phabricator.wikimedia.org/T311610) [11:28:47] RECOVERY - puppet last run on idp-test1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:28:57] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for ddw - https://phabricator.wikimedia.org/T312675 (10dr0ptp4kt) @jhathaway, that's correct, thanks! Nothing additionally needed beyond that access at the moment. 
[11:30:32] (03CR) 10Jbond: [C: 03+1] "monior nit but otherwise lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [11:40:17] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/812442 (owner: 10Volans) [11:41:14] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/812448 (owner: 10Volans) [11:47:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/812820 (owner: 10Volans) [11:53:30] (03CR) 10Jbond: [C: 03+1] "lgtm see comments" [software/spicerack] - 10https://gerrit.wikimedia.org/r/812821 (owner: 10Volans) [12:02:10] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10fgiunchedi) Overall the idea of sending additional headers is the right one @Jgiannelos: specifically for swift `x-delete-after`... [12:05:21] !log updated bullseye netboot image for Bullseye 11.4 point release T312637 [12:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:25] T312637: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 [12:06:10] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [12:11:11] (03CR) 10Muehlenhoff: [C: 03+2] apparmor: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/812172 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:14:43] 10SRE, 10Observability-Alerting, 10Traffic, 10Patch-For-Review, 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10fgiunchedi) >>! In T300723#8066385, @BCornwall wrote: > @fgiunchedi > > Looks like the rules mentioned in the t... 
[12:17:49] (03CR) 10Jbond: "see inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) (owner: 10Ayounsi) [12:19:57] PROBLEM - php7.4-fpm service on mw2301 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.171: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:22:17] RECOVERY - php7.4-fpm service on mw2301 is OK: OK - php7.4-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:22:32] (03PS6) 10Jbond: service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) [12:23:16] (03CR) 10CI reject: [V: 04-1] service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [12:26:09] 10SRE-swift-storage, 10Observability-Alerting: Port swift prometheus-based alerts from icinga to alertmanager - https://phabricator.wikimedia.org/T312765 (10fgiunchedi) [12:26:20] (03CR) 10Volans: k8s/reboot-nodes: Error if nodes are cordoned (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [12:32:48] (03PS1) 10Filippo Giunchedi: opensearch: remove icinga::monitor::elasticsearch::old_jvm_gc_checks [puppet] - 10https://gerrit.wikimedia.org/r/812860 (https://phabricator.wikimedia.org/T288622) [12:32:50] (03PS3) 10Volans: tests: reduce runtime by more than 80% [software/spicerack] - 10https://gerrit.wikimedia.org/r/812821 [12:33:34] (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/812821 (owner: 10Volans) [12:49:29] (03PS10) 10David Caro: novafullstack: add types and some names refactor [puppet] - 10https://gerrit.wikimedia.org/r/810950 [12:49:31] (03PS8) 10David Caro: novafullstack: Refactor and minor fix [puppet] - 10https://gerrit.wikimedia.org/r/811316 
[12:49:33] (03PS5) 10David Caro: novafullstack: generate prometheus stats too [puppet] - 10https://gerrit.wikimedia.org/r/812037 [12:49:35] (03CR) 10David Caro: novafullstack: generate prometheus stats too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812037 (owner: 10David Caro) [12:51:09] (03CR) 10Volans: [C: 03+2] redfish: better compare Dell SCP attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/812442 (owner: 10Volans) [12:51:12] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36244/console" [puppet] - 10https://gerrit.wikimedia.org/r/810950 (owner: 10David Caro) [12:51:19] (03CR) 10Volans: [C: 03+2] tests: fix caplog usage [software/spicerack] - 10https://gerrit.wikimedia.org/r/812820 (owner: 10Volans) [12:55:34] (03CR) 10Muehlenhoff: [C: 03+2] Avoid direct references [puppet] - 10https://gerrit.wikimedia.org/r/812287 (owner: 10Muehlenhoff) [12:55:42] (03PS2) 10Muehlenhoff: Avoid direct references [puppet] - 10https://gerrit.wikimedia.org/r/812287 [12:59:18] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DDesouza - https://phabricator.wikimedia.org/T312676 (10Ottomata) > It looks like I will not need the SSH key as my use case fits in "Dashboards in Superset / Hive interfaces (like Hue) that do access private data". Correct! A... [12:59:30] (03PS1) 10Marostegui: instances.yaml: Add db2163 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/812864 (https://phabricator.wikimedia.org/T311493) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220711T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:22] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2163 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/812864 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [13:00:41] moritzm: your change is ok to merge? [13:01:07] (03PS37) 10Jbond: sre.hardware.dell: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [13:01:38] (03Merged) 10jenkins-bot: redfish: better compare Dell SCP attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/812442 (owner: 10Volans) [13:02:28] (03CR) 10Jbond: "thanks updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [13:02:46] (03Merged) 10jenkins-bot: tests: fix caplog usage [software/spicerack] - 10https://gerrit.wikimedia.org/r/812820 (owner: 10Volans) [13:03:00] moritzm: I have merged it as it looks harmless [13:04:21] marostegui: thanks! [13:04:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2163 to s8 T311493', diff saved to https://phabricator.wikimedia.org/P31002 and previous config saved to /var/cache/conftool/dbconfig/20220711-130441-marostegui.json [13:04:47] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493 [13:05:09] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye [13:05:15] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye [13:06:29] (03CR) 10David Caro: wmcs: Add ceph cluster alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/812706 (owner: 10David Caro) [13:08:04] (03CR) 10David Caro: novafullstack: generate prometheus stats too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812037 (owner: 10David Caro) 
[13:09:16] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36245/console" [puppet] - 10https://gerrit.wikimedia.org/r/812037 (owner: 10David Caro) [13:10:00] (03CR) 10David Caro: [V: 03+1] "The pcc run is as expected, adding the absented file to the secondary nodes, and doing nothing on the primary." [puppet] - 10https://gerrit.wikimedia.org/r/810950 (owner: 10David Caro) [13:10:29] (03CR) 10David Caro: [V: 03+1] novafullstack: add types and some names refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810950 (owner: 10David Caro) [13:10:38] (03CR) 10David Caro: [V: 03+1] "The pcc run is as expected, adding the absented file to the secondary nodes, and doing nothing on the primary." [puppet] - 10https://gerrit.wikimedia.org/r/812037 (owner: 10David Caro) [13:11:23] (03CR) 10CDanis: [C: 03+2] varnish: use libvmod-querysort on Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/812450 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori) [13:14:18] (03PS2) 10CDanis: haproxy: also log high client concurrency [puppet] - 10https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580) [13:14:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:14:38] here? 
[13:15:03] there seems to be a peak of requests [13:15:16] looks like the typical saturation event [13:15:18] (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:15:22] (03PS7) 10Jbond: service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) [13:15:29] similar to what happened last week maybe? [13:16:05] PROBLEM - Apache HTTP on wtp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:16:05] PROBLEM - Apache HTTP on wtp1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:16:05] PROBLEM - Apache HTTP on wtp1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:16:05] PROBLEM - Apache HTTP on wtp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:16:05] PROBLEM - Apache HTTP on wtp1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:16:06] PROBLEM - Apache HTTP on wtp1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:16:13] PROBLEM - Apache HTTP on wtp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:16:13] PROBLEM - Apache HTTP on wtp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:16:13] PROBLEM - Apache HTTP on wtp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [13:16:13] PROBLEM - Apache HTTP on wtp1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:16:15] Wow [13:16:24] * Emperor here [13:16:30] * volans here if needed [13:16:42] * jbond here [13:17:05] PROBLEM - Apache HTTP on wtp1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:17:05] PROBLEM - Apache HTTP on wtp1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:17:25] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 1675 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:17:31] PROBLEM - Apache HTTP on wtp1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:17:49] PROBLEM - Apache HTTP on wtp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:17:49] PROBLEM - Apache HTTP on wtp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:17:49] PROBLEM - Apache HTTP on wtp1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:17:55] PROBLEM - Apache HTTP on wtp1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:17:57] PROBLEM - Apache HTTP on wtp1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:17:57] PROBLEM - Apache HTTP on wtp1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [13:18:07] should we transition to #sre and open an incident? [13:18:35] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.32.17:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.32.17:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [13:18:35] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:18:47] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.16.125:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.16.125:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [13:18:47] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:18:47] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.48.67:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.48.67:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [13:18:47] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:18:47] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url 
http://10.192.16.97:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.16.97:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [13:18:47] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:18:49] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.16.23:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.16.23:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [13:18:49] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:19:05] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.48.183:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.48.183:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [13:19:05] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:19:26] (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:19:27] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes1022.eqiad.wmnet, kubernetes1020.eqiad.wmnet, 
kubernetes1005.eqiad.wmnet, kubernetes1014.eqiad.wmnet are marked down but pooled: parsoid-php_443: Servers wtp1029.eqiad.wmnet, wtp1048.eqiad.wmnet, wtp1037.eqiad.wmnet, wtp1039.eqiad.wmnet, wtp1042.eqiad.wmnet, wtp1035.eqiad.wmnet, wtp1040.eqiad.wmnet, wtp1031.eqiad.wmne [13:19:27] 46.eqiad.wmnet, wtp1036.eqiad.wmnet, wtp1034.eqiad.wmnet, wtp1047.eqiad.wmnet, wtp1026.eqiad.wmnet, wtp1045.eqiad.wmnet, wtp1028.eqiad.wmnet, wtp1033.eqiad.wmnet, wtp1025.eqiad.wmnet, wtp1044.eqiad.wmnet, wtp1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:19:27] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.31:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.0.31:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF [13:19:27] S%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:19:29] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes1012.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet are marked down but pooled: parsoid-php_443: Servers wtp1048.eqiad.wmnet, wtp1 [13:19:29] d.wmnet, wtp1044.eqiad.wmnet, wtp1028.eqiad.wmnet, wtp1033.eqiad.wmnet, wtp1025.eqiad.wmnet, wtp1027.eqiad.wmnet, wtp1039.eqiad.wmnet, wtp1040.eqiad.wmnet, wtp1036.eqiad.wmnet, wtp1034.eqiad.wmnet, wtp1032.eqiad.wmnet, wtp1045.eqiad.wmnet, wtp1029.eqiad.wmnet, wtp1037.eqiad.wmnet, wtp1031.eqiad.wmnet, wtp1038.eqiad.wmnet, wtp1046.eqiad.wmnet, wtp1035.eqiad.wmnet, wtp1043.eqiad.wmnet, 
wtp1041.eqiad.wmnet, wtp1047.eqiad.wmnet, wtp1026.eqiad [13:19:29] wtp1030.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:19:39] PROBLEM - PHP7 rendering on wtp1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:19:39] PROBLEM - PHP7 rendering on wtp1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:19:39] PROBLEM - PHP7 rendering on wtp1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:19:39] PROBLEM - PHP7 rendering on wtp1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:19:45] PROBLEM - PHP7 rendering on wtp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:19:45] oh dear [13:19:45] PROBLEM - PHP7 rendering on wtp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:19:45] PROBLEM - PHP7 rendering on wtp1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:19:45] PROBLEM - PHP7 rendering on wtp1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:19:45] PROBLEM - PHP7 rendering on wtp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:19:46] PROBLEM - PHP7 rendering on wtp1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:19:47] PROBLEM - PHP7 rendering on wtp1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:19:47] PROBLEM - PHP7 rendering on wtp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:19:47] PROBLEM - PHP7 rendering on wtp1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:20:19] PROBLEM - PHP7 rendering on wtp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:20:57] PROBLEM - PHP7 rendering on wtp1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:20:59] PROBLEM - Apache HTTP on wtp1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:21:03] PROBLEM - PHP7 rendering on wtp1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:21:03] PROBLEM - Apache HTTP on wtp1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:21:15] PROBLEM - PHP7 rendering on wtp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:21:17] PROBLEM - Apache HTTP on wtp1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:21:23] PROBLEM - PHP7 rendering on wtp1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:21:27] PROBLEM - PHP7 rendering on wtp1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:21:29] PROBLEM - PHP7 rendering on wtp1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:22:05] (03PS8) 10Jbond: service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) [13:22:05] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:05] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:07] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:07] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:13] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: 
/en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:15] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:17] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:27] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [13:22:39] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:39] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:49] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to 
html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:49] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [13:22:52] (03CR) 10CI reject: [V: 04-1] service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [13:22:53] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:55] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [13:22:55] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:55] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:55] PROBLEM - restbase endpoints health on 
restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:55] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [13:22:55] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:57] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [13:22:59] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:59] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.48.120:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.48.120:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_ [13:22:59] 
9%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:23:01] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [13:23:01] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [13:23:05] PROBLEM - Apache HTTP on wtp1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:23:15] PROBLEM - Apache HTTP on wtp1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:23:17] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:23:27] PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [13:23:29] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:23:29] PROBLEM - restbase 
endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:23:29] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:23:30] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:23:33] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.100:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.0.100:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28W [13:23:33] MCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:23:33] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:24:03] PROBLEM - PHP7 rendering on wtp1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:24:18] (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:24:37] PROBLEM - PHP7 rendering on wtp1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:24:38] (03CR) 10Vgutierrez: haproxy: also log high client concurrency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [13:24:43] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:24:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 8 DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36247/console" [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [13:24:59] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.32.190:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein [13:25:00] t on connection while downloading http://10.192.32.190:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:25:33] (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:25:52] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/812821 (owner: 10Volans) [13:25:55] PROBLEM - PHP7 rendering on wtp1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:26:35] PROBLEM - PHP7 rendering on wtp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:27:41] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 7 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:28:05] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content [13:28:05] test page) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check 
test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/R [13:28:35] (03CR) 10Jbond: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [13:28:47] 10SRE, 10SRE-swift-storage, 10Traffic-Icebox, 10Performance-Team (Radar), 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10mark) >>! In T256217#7960730, @Krinkle wrote: > I'm not sure since when, but based on us having <14 days ats-be stor... [13:29:18] (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:29:47] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:29:51] (03CR) 10Ladsgroup: Move CirrusSearch settings from IS.php to ext-CirrusSearch.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [13:30:15] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most 
read articles for January 1 [13:30:15] timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [13:30:18] (ProbeDown) firing: (4) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:29] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [13:30:35] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2018.codfw.wmnet, restbase2019.codfw.wmnet, restbase2012.codfw.wmnet, restbase2013.codfw.wmnet, restbase2021.codfw.wmnet, restbase2023.codfw.wmnet, restbase2020.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:30:45] (JobUnavailable) firing: (4) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:31:01] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:31:16] 10SRE, 10SRE-swift-storage, 10Traffic-Icebox, 
10Performance-Team (Radar), 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10mark) This appears to be configurable now in Swift 2.24.0 and later (we currently seem to be running 2.26.0 on 6/8 o... [13:31:29] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v1/page/{language}/{title} (Fetch enwiki protected page) is CRITICAL: Test Fetch enwiki protected page returned the unexpected status 404 (expecting: 200): /v1/page/{language}/{title}/{revision} (Fetch enwiki protected page) is CRITICAL: Test Fetch enwiki protected page returned the unexpected status 404 (expecting: 200): /v2/page/{sourcelanguage}/{targetlanguage}/{tit [13:31:29] nslate enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}/{revision} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [13:31:31] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April [13:31:31] 6 returned the unexpected status 500 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/ [13:31:31] d/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) 
timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [13:31:55] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title}/{revision} (Fetch enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [13:32:13] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2023.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:32:57] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [13:32:57] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [13:33:07] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:33:15] RECOVERY - Apache HTTP on wtp1041 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 4.553 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:33:17] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [13:33:27] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:33:33] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:33:41] RECOVERY - PHP7 rendering on wtp1039 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:33:45] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:33:49] RECOVERY - PHP7 rendering on wtp1029 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 6.475 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:33:51] RECOVERY - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:33:53] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [13:33:55] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [13:33:59] RECOVERY - Apache HTTP on wtp1037 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 3.277 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:34:03] RECOVERY - PHP7 rendering on wtp1044 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 5.636 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:34:03] RECOVERY - Apache HTTP on wtp1025 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 0.365 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:34:19] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [13:34:43] RECOVERY - PyBal backends health check on 
lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:34:49] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:35:01] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:35:01] RECOVERY - Apache HTTP on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 5.616 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:35:05] RECOVERY - PHP7 rendering on wtp1031 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 1.851 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:35:09] RECOVERY - PHP7 rendering on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 4.678 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:35:11] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:35:13] RECOVERY - PHP7 rendering on wtp1047 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 9.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:35:17] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [13:35:18] (ProbeDown) firing: (4) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:35:23] RECOVERY - Apache HTTP on wtp1027 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 3.396 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [13:35:39] RECOVERY - Apache HTTP on wtp1031 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:35:45] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:35:45] RECOVERY - Apache HTTP on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 1.344 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:35:45] (JobUnavailable) resolved: (5) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:35:49] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:35:49] RECOVERY - Apache HTTP on wtp1028 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 5.804 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:35:50] RECOVERY - Apache HTTP on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 6.006 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:35:50] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:35:53] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:35:55] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:36:15] RECOVERY - Apache HTTP on wtp1047 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.037 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [13:36:19] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:36:23] RECOVERY - Apache HTTP on wtp1040 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 2.163 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:36:23] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:36:29] RECOVERY - Apache HTTP on wtp1035 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:36:29] RECOVERY - Apache HTTP on wtp1045 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 1.627 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:36:31] RECOVERY - Apache HTTP on wtp1039 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 2.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:36:33] RECOVERY - Apache HTTP on wtp1042 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 4.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:36:33] RECOVERY - Apache HTTP on wtp1029 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:36:37] RECOVERY - Apache HTTP on wtp1044 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:36:37] RECOVERY - Apache HTTP on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:36:37] RECOVERY - PHP7 rendering on wtp1045 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 0.311 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:36:39] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:36:39] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:36:39] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:36:43] RECOVERY - PHP7 rendering on wtp1038 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:36:43] RECOVERY - PHP7 rendering on wtp1028 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:36:43] RECOVERY - PHP7 rendering on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:36:43] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:36:43] RECOVERY - PHP7 rendering on wtp1040 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:36:45] RECOVERY - Apache HTTP on wtp1046 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 8.139 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:37:15] RECOVERY - PHP7 rendering on wtp1027 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:37:23] RECOVERY - restbase endpoints health on 
restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:37:23] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:37:23] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:37:27] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:37:31] RECOVERY - PHP7 rendering on wtp1025 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:37:31] RECOVERY - PHP7 rendering on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:37:31] RECOVERY - PHP7 rendering on wtp1037 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:37:31] RECOVERY - Apache HTTP on wtp1032 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:37:33] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:37:35] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:37:37] RECOVERY - PHP7 rendering on wtp1035 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:37:37] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:37:37] RECOVERY - PHP7 rendering on wtp1041 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 5.085 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:37:39] RECOVERY - PHP7 rendering on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:37:39] RECOVERY - PHP7 rendering on wtp1030 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:37:41] RECOVERY - PHP7 rendering on wtp1032 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:37:43] RECOVERY - PHP7 rendering on wtp1034 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 1.357 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:37:43] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:37:45] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:37:45] RECOVERY - PHP7 rendering on wtp1046 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 5.715 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:37:49] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:37:55] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:38:07] RECOVERY - restbase endpoints health on 
restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:13] RECOVERY - PHP7 rendering on wtp1042 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:38:17] RECOVERY - Apache HTTP on wtp1030 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:38:17] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:19] RECOVERY - Apache HTTP on wtp1034 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 1.488 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:38:23] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:38:23] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:23] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:23] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:23] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:23] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:24] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:27] RECOVERY - restbase endpoints 
health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:37] RECOVERY - Apache HTTP on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 2.719 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:38:41] RECOVERY - PHP7 rendering on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 1.798 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:38:55] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:56] 10SRE, 10SRE-Access-Requests: Requesting access to _security IRC channel for TheresNoTime - https://phabricator.wikimedia.org/T312771 (10TheresNoTime) [13:38:57] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:59] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:39:07] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:39:09] RECOVERY - Apache HTTP on wtp1038 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 2.365 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:39:18] (ProbeDown) resolved: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:33] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[13:40:18] (ProbeDown) resolved: (4) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:39] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:42:41] SRE, ops-eqiad, Infrastructure-Foundations, netops: cr2-eqiad:FPC3 partial failure (PIC2/3) - https://phabricator.wikimedia.org/T312745 (ayounsi) a:Cmjohnson Juniper agreed on an RMA, forwarded the email thread to Chris for the shipping details. @Cmjohnson please sync up with Netops once rec... [13:47:58] (CR) Elukey: [C: +2] ml-services: Add knative and egress config for eventgate-main [deployment-charts] - https://gerrit.wikimedia.org/r/812010 (https://phabricator.wikimedia.org/T301878) (owner: Elukey) [13:48:23] (CR) Volans: [C: +2] tests: reduce runtime by more than 80% [software/spicerack] - https://gerrit.wikimedia.org/r/812821 (owner: Volans) [13:48:47] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:49:36] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:50:36] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[13:53:28] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1005.wikimedia.org with OS bullseye [13:53:32] (CR) CDanis: haproxy: also log high client concurrency (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580) (owner: CDanis) [13:53:33] SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors... [13:53:40] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye [13:53:46] SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye [13:53:48] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1005.wikimedia.org with OS bullseye [13:53:52] SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors... 
[13:54:30] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye [13:54:36] SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye [13:55:49] (Merged) jenkins-bot: tests: reduce runtime by more than 80% [software/spicerack] - https://gerrit.wikimedia.org/r/812821 (owner: Volans) [13:58:16] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36248/console" [puppet] - https://gerrit.wikimedia.org/r/812449 (owner: Jbond) [13:59:02] (CR) Vgutierrez: [C: +1] haproxy: also log high client concurrency [puppet] - https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580) (owner: CDanis) [13:59:31] (CR) Jbond: [C: +2] resolvconf: add parameter to disable managing resolvconf [puppet] - https://gerrit.wikimedia.org/r/812455 (owner: Jbond) [13:59:41] (PS3) Jbond: wmflib: make wmflib::resource::import safe for puppet apply [puppet] - https://gerrit.wikimedia.org/r/812449 [13:59:58] (PS3) Jbond: resolvconf: add parameter to disable managing resolvconf [puppet] - https://gerrit.wikimedia.org/r/812455 [14:02:04] (PS1) David Caro: wmcs: use a nicer task title [puppet] - https://gerrit.wikimedia.org/r/812869 [14:05:01] (PS4) Jbond: resolvconf: add parameter to disable managing resolvconf [puppet] - https://gerrit.wikimedia.org/r/812455 [14:05:03] (PS3) Jbond: P:environment: add dependency to vim package [puppet] - https://gerrit.wikimedia.org/r/812457 [14:05:17] (PS2) Jbond: base::firewall: add flag do disable managing nf_conntrack hashsize [puppet] - https://gerrit.wikimedia.org/r/812461 [14:05:36] (PS2) Jbond: P:base: dont 
use haveged in containers [puppet] - https://gerrit.wikimedia.org/r/812555 [14:07:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2175.mgmt.codfw.wmnet with reboot policy FORCED [14:08:49] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2175.mgmt.codfw.wmnet with reboot policy FORCED [14:09:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2175.mgmt.codfw.wmnet with reboot policy FORCED [14:10:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2175.mgmt.codfw.wmnet with reboot policy FORCED [14:11:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2175.mgmt.codfw.wmnet with reboot policy FORCED [14:11:41] (CR) Filippo Giunchedi: wmcs: use a nicer task title (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812869 (owner: David Caro) [14:18:04] (CR) Jbond: [C: +2] P:environment: add dependency to vim package [puppet] - https://gerrit.wikimedia.org/r/812457 (owner: Jbond) [14:18:08] (CR) Jbond: [C: +2] base::firewall: add flag do disable managing nf_conntrack hashsize [puppet] - https://gerrit.wikimedia.org/r/812461 (owner: Jbond) [14:18:12] (CR) Jbond: [C: +2] P:base: dont use haveged in containers [puppet] - https://gerrit.wikimedia.org/r/812555 (owner: Jbond) [14:19:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:53] there we go [14:22:27] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.32.71:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.32.71:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [14:22:27] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:22:37] PROBLEM - Apache HTTP on wtp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:22:37] PROBLEM - Apache HTTP on wtp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:22:43] PROBLEM - Apache HTTP on wtp1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:22:43] PROBLEM - Apache HTTP on wtp1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:22:54] How surprising :D [14:22:57] PROBLEM - Apache HTTP on wtp1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:23:19] PROBLEM - Apache HTTP on wtp1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:23:29] PROBLEM - Apache HTTP on wtp1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:23:29] PROBLEM - Apache HTTP on wtp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:23:29] PROBLEM - Apache HTTP on wtp1039 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:23:29] PROBLEM - Apache HTTP on wtp1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:23:37] PROBLEM - Apache HTTP on wtp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:23:37] PROBLEM - Apache HTTP on wtp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:23:45] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 2202 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:24:18] (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:55] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:24:59] (03CR) 10Ori: "This change is ready for review." 
[software/varnish/libvmod-querysort] - https://gerrit.wikimedia.org/r/812873 (owner: Ori) [14:25:01] RECOVERY - Apache HTTP on wtp1030 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:25:01] RECOVERY - Apache HTTP on wtp1034 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:25:05] RECOVERY - Apache HTTP on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:25:05] RECOVERY - Apache HTTP on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:25:19] RECOVERY - Apache HTTP on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:25:43] RECOVERY - Apache HTTP on wtp1040 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:25:53] RECOVERY - Apache HTTP on wtp1045 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:25:55] RECOVERY - Apache HTTP on wtp1035 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:25:55] RECOVERY - Apache HTTP on wtp1038 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:25:55] RECOVERY - Apache HTTP on wtp1039 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:26:03] RECOVERY - Apache HTTP on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.040 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [14:26:03] RECOVERY - Apache HTTP on wtp1044 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:26:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:11] (03Abandoned) 10Ori: Dummy change to test CI [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/812873 (owner: 10Ori) [14:28:51] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:29:18] (ProbeDown) resolved: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:31:05] (03PS17) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 [14:31:37] (03CR) 10Jbond: beaker: add initial beaker files (WIP) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (owner: 10Jbond) [14:34:01] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1005.wikimedia.org with OS bullseye [14:34:06] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 
for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors... [14:34:14] (03CR) 10CI reject: [V: 04-1] beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (owner: 10Jbond) [14:34:17] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye [14:34:22] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye [14:34:24] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1005.wikimedia.org with OS bullseye [14:34:29] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors... 
[14:34:50] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye [14:34:55] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye [14:43:27] (03PS2) 10Ladsgroup: wwwportals: Make sure portal assets are also visible in wikibooks vhost [puppet] - 10https://gerrit.wikimedia.org/r/812843 (https://phabricator.wikimedia.org/T273179) [14:43:33] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wwwportals: Make sure portal assets are also visible in wikibooks vhost [puppet] - 10https://gerrit.wikimedia.org/r/812843 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [14:46:49] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:54:56] (03PS1) 10Jbond: prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/812881 [14:56:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2175.mgmt.codfw.wmnet with reboot policy FORCED [14:56:58] (03CR) 10CI reject: [V: 04-1] prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/812881 (owner: 10Jbond) [14:58:53] (03CR) 10Filippo Giunchedi: wmcs: Add ceph cluster alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/812706 (owner: 10David Caro) [15:01:59] (03PS2) 10Jbond: prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/812881 [15:03:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 
(10cmooney) >>! In T304888#8062648, @Cmjohnson wrote: > all but the cloudnets installed correctly, they're still presenting the dhcp error. I am thi... [15:03:42] (03CR) 10CI reject: [V: 04-1] prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/812881 (owner: 10Jbond) [15:07:22] (03PS4) 10Jbond: wmflib: make wmflib::resource::import safe for puppet apply [puppet] - 10https://gerrit.wikimedia.org/r/812449 [15:08:57] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1005.wikimedia.org with OS bullseye [15:09:02] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors... [15:09:29] (03PS5) 10Jbond: wmflib: make wmflib::resource::import safe for puppet apply [puppet] - 10https://gerrit.wikimedia.org/r/812449 [15:15:04] (03PS1) 10Jbond: release: 2.3.2 release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/812887 [15:17:24] (03PS1) 10Jbond: puppet_compiler: bump version number [puppet] - 10https://gerrit.wikimedia.org/r/812888 [15:17:26] (03CR) 10CI reject: [V: 04-1] release: 2.3.2 release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/812887 (owner: 10Jbond) [15:17:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10cmooney) Just for the record cloudnet1005 did seem to install ok. Or at least DHCP did not fail at PXE or debian-installer stage. It's using NI... 
[15:17:37] (03PS1) 10Nskaggs: Force depends so setup.py install works [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812889 [15:19:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2175.codfw.wmnet with OS bullseye [15:19:48] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/812891 [15:20:15] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2175.codfw.wmnet with OS bullseye [15:20:37] 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10EBernhardson) [15:22:44] (03PS3) 10Jbond: prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/812881 [15:23:31] (03PS2) 10Jbond: release: 2.3.2 release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/812887 [15:23:39] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye [15:23:45] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye [15:23:47] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1005.wikimedia.org with OS bullseye [15:23:52] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors... 
[15:25:47] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812892 (https://phabricator.wikimedia.org/T128546) [15:27:15] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye [15:27:21] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye [15:27:22] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1005.wikimedia.org with OS bullseye [15:27:28] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors... [15:27:43] (03CR) 10Jbond: [C: 03+2] prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/812881 (owner: 10Jbond) [15:27:46] (03CR) 10Jbond: [C: 03+2] release: 2.3.2 release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/812887 (owner: 10Jbond) [15:28:11] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye [15:28:17] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye [15:30:05] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220711T1530). 
[15:30:15] (03Merged) 10jenkins-bot: prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/812881 (owner: 10Jbond) [15:31:59] (03Merged) 10jenkins-bot: release: 2.3.2 release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/812887 (owner: 10Jbond) [15:32:43] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:34:28] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812892 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:34:30] (03CR) 10David Caro: Force depends so setup.py install works (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812889 (owner: 10Nskaggs) [15:35:15] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812892 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:36:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:38:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2175.codfw.wmnet with reason: host reimage [15:39:37] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:812892| Bumping portals to master (T128546)]] (duration: 02m 58s) [15:39:40] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:39:59] (03CR) 10JHathaway: [C: 03+1] "looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/808219 (owner: 10Jbond) [15:40:11] (03CR) 10David Caro: [C: 03+1] "Feel free to +2 when the comment is in :)" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812889 (owner: 10Nskaggs) [15:41:04] (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump version number [puppet] - 10https://gerrit.wikimedia.org/r/812888 (owner: 10Jbond) [15:41:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2175.codfw.wmnet with reason: host reimage [15:41:50] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:42:29] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:812892| Bumping portals to master (T128546)]] (duration: 02m 51s) [15:42:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:45:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:45:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:45:24] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1005.wikimedia.org with reason: host reimage [15:46:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36249/console" [puppet] - 10https://gerrit.wikimedia.org/r/812449 (owner: 10Jbond) [15:48:32] (03PS2) 10JMeybohm: Remove statsd from _scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/812333 [15:48:34] (03PS1) 10JMeybohm: Allow to enable access logging in tls-terminator [deployment-charts] - 10https://gerrit.wikimedia.org/r/812895 [15:48:54] (03CR) 10David Caro: wmcs: Add ceph cluster alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/812706 (owner: 10David Caro) [15:49:08] !log 
bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1005.wikimedia.org with reason: host reimage [15:49:24] 10ops-codfw, 10decommission-hardware: decommission db2076 - https://phabricator.wikimedia.org/T312190 (10Papaul) [15:49:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:49:42] 10ops-codfw, 10decommission-hardware: decommission db2076 - https://phabricator.wikimedia.org/T312190 (10Papaul) 05Open→03Resolved complete [15:50:03] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (10Papaul) [15:50:29] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DDesouza - https://phabricator.wikimedia.org/T312676 (10jhathaway) [15:50:34] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM but I'd use something local_access_log_min_code to clarify it only works on the local downstream." [deployment-charts] - 10https://gerrit.wikimedia.org/r/812895 (owner: 10JMeybohm) [15:50:36] (03CR) 10David Caro: wmcs: use a nicer task title (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812869 (owner: 10David Caro) [15:50:41] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (10Papaul) 05Open→03Resolved complete [15:51:10] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2080 - https://phabricator.wikimedia.org/T312618 (10Papaul) 05Open→03Resolved complete [15:51:54] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:52:27] (03PS2) 10JMeybohm: Allow to enable access logging in tls-terminator [deployment-charts] - 10https://gerrit.wikimedia.org/r/812895 [15:52:29] (03CR) 10Jbond: [V: 03+1 C: 03+2] wmflib: make wmflib::resource::import safe for puppet apply [puppet] - 
10https://gerrit.wikimedia.org/r/812449 (owner: 10Jbond) [15:52:41] (03PS3) 10JMeybohm: Allow to enable access logging in tls-terminator [deployment-charts] - 10https://gerrit.wikimedia.org/r/812895 [15:52:44] (03PS2) 10David Caro: wmcs: use a nicer task title [puppet] - 10https://gerrit.wikimedia.org/r/812869 [15:53:11] (03PS18) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 [15:54:06] (03PS1) 10Mforns: Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812897 (https://phabricator.wikimedia.org/T290303) [15:54:40] (03CR) 10Ottomata: [C: 03+1] Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812897 (https://phabricator.wikimedia.org/T290303) (owner: 10Mforns) [15:54:44] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Allow to enable access logging in tls-terminator [deployment-charts] - 10https://gerrit.wikimedia.org/r/812895 (owner: 10JMeybohm) [15:55:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2175.codfw.wmnet with OS bullseye [15:55:34] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2175.codfw.wmnet with OS bullseye completed: - db2... 
[15:56:21] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:56:40] (03CR) 10CI reject: [V: 04-1] beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (owner: 10Jbond) [15:56:47] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:57:09] (03CR) 10Filippo Giunchedi: "LGTM, see inline for non-blocking comment" [puppet] - 10https://gerrit.wikimedia.org/r/812869 (owner: 10David Caro) [15:58:10] (03CR) 10Filippo Giunchedi: [C: 03+1] wmcs: Add ceph cluster alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/812706 (owner: 10David Caro) [16:00:16] (03PS2) 10Krinkle: Enable wgResourceLoaderUseObjectCacheForDeps for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812147 (https://phabricator.wikimedia.org/T113916) [16:00:20] (03CR) 10Krinkle: [C: 03+2] Enable wgResourceLoaderUseObjectCacheForDeps for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812147 (https://phabricator.wikimedia.org/T113916) (owner: 10Krinkle) [16:00:35] (03PS3) 10David Caro: wmcs: use a nicer task title [puppet] - 10https://gerrit.wikimedia.org/r/812869 [16:00:37] (03CR) 10David Caro: wmcs: use a nicer task title (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812869 (owner: 10David Caro) [16:00:57] (03CR) 10David Caro: wmcs: use a nicer task title (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812869 (owner: 10David Caro) [16:01:06] (03Merged) 10jenkins-bot: Enable wgResourceLoaderUseObjectCacheForDeps for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812147 (https://phabricator.wikimedia.org/T113916) (owner: 10Krinkle) [16:02:44] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10Papaul) [16:03:33] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - 
db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10Papaul) 05Open→03Resolved @Marostegui All your's [16:04:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:05:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:05:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:06:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:07:23] (03CR) 10Filippo Giunchedi: [C: 03+1] wmcs: use a nicer task title [puppet] - 10https://gerrit.wikimedia.org/r/812869 (owner: 10David Caro) [16:10:25] (03PS4) 10David Caro: wmcs: use a nicer task title [puppet] - 10https://gerrit.wikimedia.org/r/812869 [16:11:28] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I82262ef6773ab228 (duration: 02m 55s) [16:12:30] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1005.wikimedia.org with OS bullseye [16:12:36] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye completed: - cloudel... 
[16:13:02] (03PS10) 10David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 [16:13:04] (03PS3) 10David Caro: openstack: move known nodes to the openstack lib [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810854 [16:13:06] (03PS5) 10David Caro: wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 [16:13:08] (03CR) 10David Caro: wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 (owner: 10David Caro) [16:13:10] (03PS5) 10David Caro: wmcs.openstack: Use the known cloudcontrols instead of asking [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915 [16:13:12] (03CR) 10David Caro: wmcs.openstack: Use the known cloudcontrols instead of asking (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915 (owner: 10David Caro) [16:13:14] (03PS1) 10David Caro: WIP: add alert handling to ceph custer downtime [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812900 [16:13:16] (03PS1) 10David Caro: wmcs: use run_* instead of run_sync/run_async [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812901 [16:13:18] (03PS1) 10David Caro: WIP: toolforge.grid.get_cluster_status: show extended queue info [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812902 [16:13:59] (03CR) 10Filippo Giunchedi: [C: 03+1] wmcs: use a nicer task title [puppet] - 10https://gerrit.wikimedia.org/r/812869 (owner: 10David Caro) [16:14:41] (03CR) 10David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 (owner: 10David Caro) [16:15:16] (03PS1) 10Jbond: wmflib: create a helper function for querying puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/812904 [16:15:49] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:26] (03PS1) 10JHathaway: admin: add ddesouza-ctr@wikimedia.org to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/812905 (https://phabricator.wikimedia.org/T312676) [16:19:08] (03CR) 10CI reject: [V: 04-1] wmflib: create a helper function for querying puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/812904 (owner: 10Jbond) [16:20:56] (03CR) 10CI reject: [V: 04-1] WIP: add alert handling to ceph custer downtime [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812900 (owner: 10David Caro) [16:21:14] (03CR) 10CI reject: [V: 04-1] WIP: toolforge.grid.get_cluster_status: show extended queue info [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812902 (owner: 10David Caro) [16:22:01] (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin: add ddesouza-ctr@wikimedia.org to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/812905 (https://phabricator.wikimedia.org/T312676) (owner: 10JHathaway) [16:22:09] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10bking) [16:23:20] (03CR) 10JHathaway: [C: 03+2] admin: add ddesouza-ctr@wikimedia.org to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/812905 (https://phabricator.wikimedia.org/T312676) (owner: 10JHathaway) [16:24:08] (03CR) 10CI reject: [V: 04-1] wmcs: use run_* instead of run_sync/run_async [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812901 (owner: 10David Caro) [16:24:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for DDesouza - https://phabricator.wikimedia.org/T312676 (10jhathaway) 05Open→03Resolved a:03jhathaway @DDeSouza access granted, please let me know if you have any issues. 
[16:24:47] (03CR) 10David Caro: [C: 03+2] wmcs: use a nicer task title [puppet] - 10https://gerrit.wikimedia.org/r/812869 (owner: 10David Caro) [16:26:14] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking) [16:28:14] (03CR) 10David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 (owner: 10David Caro) [16:28:47] (03PS2) 10Jbond: wmflib: create a helper function for querying puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/812904 [16:28:49] (03PS1) 10Jbond: wmflib: migrae all calls for puppetdb_query to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/812907 [16:30:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36251/console" [puppet] - 10https://gerrit.wikimedia.org/r/812904 (owner: 10Jbond) [16:31:27] (03PS6) 10Ori: New service: function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) [16:31:51] (03CR) 10Ori: New service: function-evaluator (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [16:33:31] (03CR) 10CI reject: [V: 04-1] wmflib: migrae all calls for puppetdb_query to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/812907 (owner: 10Jbond) [16:34:10] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@02ab1c2]: use mode=reschedule on all airflow sensors [16:34:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36252/console" [puppet] - 10https://gerrit.wikimedia.org/r/812907 (owner: 10Jbond) [16:36:13] !log ebernhardson@deploy1002 Finished deploy 
[wikimedia/discovery/analytics@02ab1c2]: use mode=reschedule on all airflow sensors (duration: 02m 02s) [16:49:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DDesouza - https://phabricator.wikimedia.org/T312676 (10DDeSouza) @Ottamata Thanks! I have `wmf`. @jhathaway Thanks! I was getting access denied at first but after a few tries it magically worked. [16:57:07] (03PS1) 10Aqu: Set AQS mediawiki history snapshot to 2022 June [puppet] - 10https://gerrit.wikimedia.org/r/812913 [17:00:04] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220711T1700). [17:01:20] (03PS2) 10Aqu: Set AQS mediawiki history snapshot to 2022 June [puppet] - 10https://gerrit.wikimedia.org/r/812913 [17:06:17] (03CR) 10Ottomata: [C: 03+2] Set AQS mediawiki history snapshot to 2022 June [puppet] - 10https://gerrit.wikimedia.org/r/812913 (owner: 10Aqu) [17:07:48] (03PS1) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 [17:09:50] (03PS2) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 [17:10:19] !log otto@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[17:15:09] (03CR) 10David Caro: "You can format the code with:" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [17:16:55] (03CR) 10CI reject: [V: 04-1] Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [17:25:37] (03PS3) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 [17:27:41] (03PS4) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 [17:28:17] (03CR) 10Nskaggs: Ensure quota_increase cookbook runs and validates (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [17:29:58] (03PS2) 10BCornwall: varnish: Port over traffic_drop from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/812424 (https://phabricator.wikimedia.org/T300723) [17:30:13] (03CR) 10BCornwall: varnish: Port over traffic_drop from Icinga (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/812424 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [17:34:17] (03CR) 10CI reject: [V: 04-1] Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [18:05:28] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:12:13] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (10KFrancis) Hi all, the NDA has been signed and completed for WMDE LDAP group access. Please proceed with the request. Thanks! 
[18:16:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:18:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:19:04] (03PS1) 10Ssingh: durum: add console log message [puppet] - 10https://gerrit.wikimedia.org/r/812919 [18:19:55] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36253/console" [puppet] - 10https://gerrit.wikimedia.org/r/812919 (owner: 10Ssingh) [18:23:20] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:25:02] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: add console log message [puppet] - 10https://gerrit.wikimedia.org/r/812919 (owner: 10Ssingh) [18:29:36] (03CR) 10JHathaway: [C: 03+1] "looks good, just some minor comments" [puppet] - 10https://gerrit.wikimedia.org/r/812904 (owner: 10Jbond) [18:30:20] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/812907 (owner: 10Jbond) [18:32:34] !log otto@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[18:36:25] ACKNOWLEDGEMENT - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T312626 - Still working on this https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:39:02] PROBLEM - Host mw2376.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:45:51] (03CR) 10JHathaway: [C: 03+1] wmflib: create a helper function for querying puppetdb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812904 (owner: 10Jbond) [18:50:50] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:02:39] 10SRE, 10Traffic-Icebox: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (10Krinkle) This appears to now work as expected. I'm guessing that [[ https://wikitech.wikimedia.org/wiki/HAProxy | HAProxy ]] is better at this than Nginx. I don't recall if we verified it on ATS (ats... 
[19:02:55] SRE, Traffic: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (Krinkle) [19:06:56] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:08:52] PROBLEM - Host thumbor2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:25:29] SRE, ops-eqiad: restbase1025 down - https://phabricator.wikimedia.org/T312805 (Eevans) [19:25:38] SRE, ops-eqiad: restbase1025 down - https://phabricator.wikimedia.org/T312805 (Eevans) p:Triage→Medium [19:29:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:30:18] (ProbeDown) firing: (5) Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:30:34] PROBLEM - Apache HTTP on mw1406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [19:30:36] PROBLEM - Apache HTTP on mw1392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:36] PROBLEM - Apache HTTP on mw1424 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:38] PROBLEM - Apache 
HTTP on mw1359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:40] PROBLEM - Apache HTTP on mw1408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:40] PROBLEM - Apache HTTP on mw1426 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:40] PROBLEM - Apache HTTP on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:44] PROBLEM - Apache HTTP on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:46] PROBLEM - Apache HTTP on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:48] PROBLEM - Apache HTTP on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:48] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:50] PROBLEM - Apache HTTP on mw1444 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:50] PROBLEM - Apache HTTP on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:50] PROBLEM - Apache HTTP on mw1357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:50] PROBLEM - Apache HTTP on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:50] PROBLEM - Apache HTTP on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:54] 
PROBLEM - Apache HTTP on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:54] PROBLEM - Apache HTTP on mw1402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:30:56] PROBLEM - Apache HTTP on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:06] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:31:08] PROBLEM - Apache HTTP on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:14] PROBLEM - Apache HTTP on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:16] PROBLEM - Apache HTTP on mw1422 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:16] PROBLEM - Apache HTTP on mw1447 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:16] PROBLEM - Apache HTTP on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:17] (PHPFPMTooBusy) firing: Not enough idle php7.2-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad 
prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:31:18] (PHPFPMTooBusy) firing: Not enough idle php7.2-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:31:20] PROBLEM - Apache HTTP on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:26] PROBLEM - Apache HTTP on mw1443 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:26] PROBLEM - Apache HTTP on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:30] PROBLEM - Apache HTTP on mw1450 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:30] PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:34] PROBLEM - Apache HTTP on mw1423 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:34] PROBLEM - Apache HTTP on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:54] PROBLEM - Apache HTTP on mw1425 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:31:56] PROBLEM - Apache HTTP on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:06] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: 
/api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [19:32:08] PROBLEM - Apache HTTP on mw1428 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:08] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:08] PROBLEM - Apache HTTP on mw1404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:12] PROBLEM - Apache HTTP on mw1449 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:18] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1422.eqiad.wmnet, mw1426.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1404.eqiad.wmnet, mw1447.eqiad.wmnet, mw1361.eqiad.wmnet, mw1392.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1428.eqiad.wmnet, mw1362.eqiad.wmnet, mw1358.eqiad.wmnet, mw1386.eqiad.wmnet, mw1348.eqiad.wmnet, mw1342.eqiad.wmnet, mw1402.eq [19:32:18] t, mw1448.eqiad.wmnet, mw1381.eqiad.wmnet, mw1388.eqiad.wmnet, mw1317.eqiad.wmnet, mw1340.eqiad.wmnet, mw1449.eqiad.wmnet, mw1443.eqiad.wmnet, mw1343.eqiad.wmnet, mw1421.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1396.eqiad.wmnet, mw1314.eqiad.wmnet, mw1424.eqiad.wmnet, mw1412.eqiad.wmnet, mw1408.eqiad.wmnet, mw1398.eqiad.wmnet, mw1363.eqiad.wmnet, mw1357.eqiad.wmnet, mw1423.eqiad.wmnet, mw1378.eqiad.wmnet, [19:32:18] eqiad.wmnet, mw1444.eqiad.wmnet, mw1316.eqiad.wmnet, mw1379.eqiad.wmnet, mw1376.eqiad.wmnet, mw1312.eqiad.wmnet, mw1394.eqiad.wmnet, mw1383.eqiad.wmnet, mw1427.eqiad.wmnet, mw1406.eqiad https://wikitech.wikimedia.org/wiki/PyBal [19:32:21] I see #page s on klaxon so ✨ [19:32:24] PROBLEM - restbase endpoints 
health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:32:28] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:32:28] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:32:32] PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:34] PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:34] PROBLEM - Apache HTTP on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:34] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1422.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1404.eqiad.wmnet, mw1447.eqiad.wmnet, mw1427.eqiad.wmnet, mw1361.eqiad.wmnet, mw1406.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1348.eqiad.wmnet, mw1314.eqiad.wmnet, mw1386.eqiad.wmnet, mw1448.eqiad.wmnet, mw1402.eqiad.wmnet, mw1390.eqiad.wmnet, mw1381.eq [19:32:34] t, mw1362.eqiad.wmnet, mw1340.eqiad.wmnet, mw1449.eqiad.wmnet, mw1443.eqiad.wmnet, mw1421.eqiad.wmnet, mw1347.eqiad.wmnet, mw1345.eqiad.wmnet, mw1358.eqiad.wmnet, 
mw1424.eqiad.wmnet, mw1412.eqiad.wmnet, mw1444.eqiad.wmnet, mw1376.eqiad.wmnet, mw1363.eqiad.wmnet, mw1315.eqiad.wmnet, mw1423.eqiad.wmnet, mw1317.eqiad.wmnet, mw1425.eqiad.wmnet, mw1408.eqiad.wmnet, mw1316.eqiad.wmnet, mw1379.eqiad.wmnet, mw1312.eqiad.wmnet, mw1394.eqiad.wmnet, [19:32:34] eqiad.wmnet, mw1383.eqiad.wmnet, mw1400.eqiad.wmnet, mw1392.eqiad.wmnet, mw1375.eqiad.wmnet, mw1342.eqiad.wmnet, mw1360.eqiad.wmnet, mw1382.eqiad.wmnet, mw1398.eqiad.wmnet, mw1341.eqiad https://wikitech.wikimedia.org/wiki/PyBal [19:32:36] PROBLEM - Apache HTTP on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:36] PROBLEM - Apache HTTP on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:36] PROBLEM - Apache HTTP on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:36] PROBLEM - Apache HTTP on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:38] PROBLEM - Apache HTTP on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:38] PROBLEM - Apache HTTP on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:50] PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [19:32:56] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POS [19:33:00] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:02] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:02] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:04] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:04] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [19:33:04] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:04] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:06] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:06] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:06] PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:33:06] PROBLEM - Apache HTTP on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:33:08] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:08] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:08] PROBLEM - restbase endpoints health on 
restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:08] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 2585 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:33:10] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:12] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:12] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featu [19:33:12] e data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Wikifeeds [19:33:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P31005 and previous config saved to /var/cache/conftool/dbconfig/20220711-193315-marostegui.json [19:33:18] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featu [19:33:18] e data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random [19:33:18] title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [19:33:20] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [19:33:20] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 1258 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:33:24] PROBLEM - restbase endpoints health on 
restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:24] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [19:33:28] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:28] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:30] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [19:33:30] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [19:33:30] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: 
/{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [19:33:30] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [19:33:32] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:34] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response [19:33:34] ived: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page [19:33:34] TICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:33:50] RECOVERY - Apache HTTP on mw1447 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.711 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:33:52] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is 
CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [19:33:52] RECOVERY - Apache HTTP on mw1422 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 8.653 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:33:52] RECOVERY - Apache HTTP on mw1386 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 9.801 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:33:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:33:56] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [19:34:00] RECOVERY - Apache HTTP on mw1443 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.375 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:34:02] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:34:02] RECOVERY - Apache HTTP on mw1450 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:34:04] RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.070 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:34:06] RECOVERY - Apache HTTP on mw1423 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.717 second response time https://wikitech.wikimedia.org/wiki/Application_servers 
[19:34:06] RECOVERY - Apache HTTP on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:34:14] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [19:34:16] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 503 (expecting: 200): [19:34:16] est/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [19:34:22] RECOVERY - Apache HTTP on mw1425 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:34:24] RECOVERY - Apache HTTP on mw1378 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.315 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:34:34] RECOVERY - Apache HTTP on mw1428 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:34:34] RECOVERY - Apache HTTP on mw1404 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:34:36] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.539 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [19:34:36] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:34:38] RECOVERY - Apache HTTP on mw1449 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:34:44] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wik [19:34:52] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:34:58] RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:00] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:00] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:00] RECOVERY - Apache HTTP on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:00] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:00] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[19:35:04] RECOVERY - Apache HTTP on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:04] RECOVERY - Apache HTTP on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:04] RECOVERY - Apache HTTP on mw1400 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:04] RECOVERY - Apache HTTP on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.296 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:06] RECOVERY - Apache HTTP on mw1381 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:06] RECOVERY - Apache HTTP on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:10] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:35:26] RECOVERY - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:35:28] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [19:35:32] RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:33] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - 
https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [19:35:34] RECOVERY - Apache HTTP on mw1360 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:38] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:38] RECOVERY - Apache HTTP on mw1406 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:38] RECOVERY - Apache HTTP on mw1392 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:40] RECOVERY - Apache HTTP on mw1424 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:40] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:40] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:42] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:42] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:42] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:35:42] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:42] RECOVERY - Apache HTTP on mw1359 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:44] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:44] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:44] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:44] RECOVERY - Apache HTTP on mw1408 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:44] RECOVERY - Apache HTTP on mw1426 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:45] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [19:35:46] RECOVERY - Apache HTTP on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.880 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:46] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:46] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:47] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:47] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:35:48] RECOVERY - Apache HTTP on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:48] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [19:35:50] RECOVERY - Apache HTTP on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:52] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:54] RECOVERY - Apache HTTP on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.660 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:54] RECOVERY - Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:54] RECOVERY - Apache HTTP on mw1444 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:54] RECOVERY - Apache HTTP on mw1390 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:54] RECOVERY - Apache HTTP on mw1374 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:54] RECOVERY - Apache HTTP on mw1357 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:58] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:35:58] RECOVERY - Apache HTTP on mw1396 
is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:58] RECOVERY - Apache HTTP on mw1402 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:36:00] RECOVERY - Apache HTTP on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:36:02] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:36:02] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:36:02] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [19:36:04] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [19:36:04] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:36:06] (ProbeDown) resolved: (5) Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:36:06] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:36:06] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:36:10] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:36:14] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:36:22] RECOVERY - Apache HTTP on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.221 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:36:26] RECOVERY - Apache HTTP on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:36:26] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:36:26] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [19:36:28] (PHPFPMTooBusy) resolved: Not enough idle php7.2-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:36:30] RECOVERY - Apache HTTP on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:36:33] (PHPFPMTooBusy) resolved: Not enough idle php7.2-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:36:34] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Citoid [19:36:48] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [19:36:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [19:38:08] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:38:20] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:38:30] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:38:52] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:38:58] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [19:39:04] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [19:39:08] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:40:50] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:40:56] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [19:41:21] (FrontendUnavailable) resolved: HAProxy (cache_text) 
has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [19:41:48] !log hashar@deploy1002 Started deploy [integration/docroot@fc5d65a]: Add language-data library [19:41:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [19:41:57] !log hashar@deploy1002 Finished deploy [integration/docroot@fc5d65a]: Add language-data library (duration: 00m 08s) [19:43:06] y'all behaving now..? [19:43:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [19:43:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [19:44:22] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [19:47:24] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [19:48:14] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [20:00:05] RoanKattouw, Urbanecm, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220711T2000). [20:00:05] mforns: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] hii! [20:02:21] I'm able to deploy if absolutely needed, but lets give urbanecm and cjming a few more minutes ^^ [20:02:39] 👍 [20:02:58] I can deploy but I need like 10 minutes [20:03:11] no problem! [20:03:18] ^^ phew! [20:04:17] TheresNoTime: if you want to try, you can press the buttons after I get to my laptop, and i can stand by. How does that sound? [20:04:53] urbanecm: sure :) I've taken a look at it and I'm comfortable with the deploy, just the idea of doing it alone wasn't ideal [20:06:50] Yeah, definitely. In that case, I'll ping you in a few minutes [20:06:56] sure :) [20:08:34] thanks for bearing with us mforns :) this will be my second (?) deployment so... cross your fingers! [20:09:03] sure TheresNoTime! full confidence :] [20:11:22] TheresNoTime: I'm at my laptop now, so feel free to go ahead! [20:11:30] okay :) [20:11:40] happy to answer any questions, or step in if needed. 
[20:12:00] (CR) Samtar: [C: +2] Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/812897 (https://phabricator.wikimedia.org/T290303) (owner: Mforns) [20:12:53] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I82262ef6773ab228 try again ref T311788 (duration: 03m 07s) [20:12:58] T311788: MW wmf-config tmp cache stays outdated after Scap deploy (opcache revalidation is off) - https://phabricator.wikimedia.org/T311788 [20:13:49] (PS2) Samtar: Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/812897 (https://phabricator.wikimedia.org/T290303) (owner: Mforns) [20:14:22] Krinkle: i just saw your sync-file, fyi it's B&C time now (TheresNoTime's doing the deploy) [20:15:18] ack [20:15:47] urbanecm: realise I did the rebase -> +2 the wrong way around there, doesn't make a difference though, correct? Other than now having to manually "submit" the patch to merge? [20:16:19] TheresNoTime: looks so. 
remove the -2, ensure it's on master, re-apply it is the "correct" way to fix this when it happens [20:16:26] *remove the +2, ofc [20:16:32] thank you, okay [20:16:49] (CR) Samtar: [C: +2] Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/812897 (https://phabricator.wikimedia.org/T290303) (owner: Mforns) [20:17:07] looks jenkins noticed it this time [20:18:23] (Merged) jenkins-bot: Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/812897 (https://phabricator.wikimedia.org/T290303) (owner: Mforns) [20:18:36] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:19:57] mforns: should be live on mwdebug1001 now, are you able to test? [20:20:06] yes! trying [20:24:26] SRE, ops-eqiad: restbase1025 down - https://phabricator.wikimedia.org/T312805 (Cmjohnson) Acknowledged and will look into it and update the task with what I find [20:24:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:25:27] TheresNoTime: sent a couple events and they appeared in kafka, seems all is working correctly [20:25:46] mforns: thanks :) now deploying [20:25:54] 👍 [20:27:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:27:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:28:56] !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:812897|Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis (T290303)]] (duration: 02m 53s) [20:29:01] T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 [20:29:20] mforns: that should be 
live now, can you test again if needed? [20:29:27] yes! [20:29:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:30:14] all looks okay to you urbanecm? [20:31:09] TheresNoTime: yup yup :) [20:31:16] :) [20:33:42] TheresNoTime: I'm looking at grafana, and it seems events are flowing normally. Will continue to monitor for a bit. [20:33:48] TheresNoTime: thanks a lot! :] [20:34:00] No worries, thank you for your patience mforns :) [20:34:17] 👍 [20:34:47] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@a559f82]: subgraph: Use HivePartitionRangeSensor to wait for sparql queries [20:35:06] We've got ~30 minutes, worth calling for any other patches or should I close the window urbanecm? [20:35:35] TheresNoTime: usually people say so in this chan if they have anything that's not in the calendar, so I'd close [20:36:06] !log UTC late deploys done [20:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:39] thanks for the deployment TheresNoTime! [20:36:48] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@a559f82]: subgraph: Use HivePartitionRangeSensor to wait for sparql queries (duration: 02m 00s) [20:36:51] thank you for being around! :) [20:37:12] np [20:47:15] SRE, DNS, Traffic, WMF-Legal, and 4 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Varnent) >>! In T310738#8035453, @Dzahn wrote: >>>! In T310738#8033789, @LSobanski wrote: >> @Varnent After chatting about this... 
[20:48:29] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle) [20:53:28] (PS5) Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812916 [21:00:04] Reedy, sbassett, Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220711T2100). [21:00:28] ^ no sec patches for deployment this week AFAIK [21:02:18] (CR) CI reject: [V: -1] Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812916 (owner: Nskaggs) [21:05:22] SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (bking) [21:05:27] SRE, Discovery-Search (Current work), Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (bking) [21:05:58] SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (bking) This is (finally) complete, closing... 
[21:06:08] SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (bking) Open→Resolved [21:06:10] SRE, Discovery-Search (Current work), Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (bking) [21:09:11] (PS1) JHathaway: lists: convert apache template to epp [puppet] - https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) [21:09:13] (PS1) JHathaway: lists: add apache security configs [puppet] - https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) [21:10:01] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) (owner: JHathaway) [21:10:14] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: JHathaway) [21:14:14] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: JHathaway) [21:14:22] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) (owner: JHathaway) [21:47:43] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@3ba1d4c]: subgraph_query_mapping_daily: Increase partitioning to 2048 [21:49:45] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@3ba1d4c]: subgraph_query_mapping_daily: Increase partitioning to 2048 (duration: 02m 02s) [22:21:36] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:35:17] (PS1) Krinkle: monitoring: Fix broken grafana URLs that include unencoded space [puppet] - 
https://gerrit.wikimedia.org/r/812945 [22:37:27] (CR) CI reject: [V: -1] monitoring: Fix broken grafana URLs that include unencoded space [puppet] - https://gerrit.wikimedia.org/r/812945 (owner: Krinkle) [22:47:32] (CR) Krinkle: "I'm not sure I understand the warning from build_notes_url()." [puppet] - https://gerrit.wikimedia.org/r/812945 (owner: Krinkle) [23:06:49] SRE-OnFire, SRE Observability (FY2021/2022-Q4): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (herron) [23:10:27] SRE-OnFire, SRE Observability (FY2021/2022-Q4): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (herron) Using a personal google workspace and google cloud account (for the time being) the dispatchdev instance is now creating a new google drive folde... [23:21:09] PROBLEM - Zookeeper Server #page on conf1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [23:21:28] PROBLEM - Zookeeper Server #page on conf1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [23:21:44] PROBLEM - Zookeeper Server #page on conf1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [23:21:49] Oh this happened the other day too :/ [23:22:17] here [23:22:50] TheresNoTime: what was the cause yesterday? 
[23:23:17] jhathaway: I'm trying to remember, sorry :/ fairly sure it was the exact same alerts ^ [23:24:10] PROBLEM - Check unit status of etcd-backup on conf1009 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:25:26] yeah, afaik this is T311407 and the underlying cause is T312539 [23:25:27] T311407: Put conf100[789] in production - https://phabricator.wikimedia.org/T311407 [23:25:57] last time it came up a.kosiaris silenced the alert, I think that's still the right action here but I don't know if we've made any progress on those hosts while I wasn't looking [23:26:14] doesn't look like it from phab, double-checking [23:26:19] rzl: I don't think so, I don't think zookeeper is even installed [23:26:35] at least on 1009 it is not [23:26:43] good enough for me [23:26:55] looks like host downtime ended 7min ago [23:27:36] I'm gonna extend for another week or so, minus a little bit so that it pops earlier in the day [23:27:49] hopefully we'll just cancel it before then anyway :) [23:28:01] rzl: thanks, are you using cumin to do the deed, or manually, just curious? [23:28:10] PROBLEM - Check unit status of etcd-backup on conf1008 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:29:09] jhathaway: the cookbook -- for one host I'd probably fight my way through the web ui, but honestly there's no need [23:29:28] if those hosts are not yet prod-ready you could also set the disable notification hiera settings for them [23:29:45] I think they're ready for etcd, just not for zookeeper [23:29:51] rzl: thanks, just trying to learn the ways of sre :) [23:30:11] jhathaway: oh in that case I'm gonna downtime them this way: "jhathaway could you please downtime those hosts? 
162 hours or so should be perfect" [23:30:21] :D [23:30:29] (unless you'd rather not, in which case I'm happy to finish up) [23:30:50] and `cumin2002:~$ sudo cookbook sre.hosts.downtime -h` should be all you need [23:31:18] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (29) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, conf1007, conf1008, conf1009, elastic2049, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003 [23:31:18] -fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [23:31:44] Learning the ways of SRE hmm? 👀 [23:31:54] 👀 [23:32:26] * TheresNoTime remains the personification of "a little knowledge is dangerous" [23:33:58] * perryprog is really here to just learn SRE knowledge stuff and /sometimes/ help with whatever [23:35:39] perryprog: I have the wonderful excuse that "I work here" :3 [23:36:06] * perryprog grumbles [23:44:32] (CR) Cwhite: [C: -2] "icinga does the encoding of these urls: https://phabricator.wikimedia.org/T213052" [puppet] - https://gerrit.wikimedia.org/r/812945 (owner: Krinkle) [23:44:52] (I went ahead and downtimed, decided to do just that service after all so I used the web ui) [23:45:39] systemd alerts are still CRIT for etcd-backup but those are non-paging so I'll leave em for visibility [23:47:19] thanks, rzl - I'll resolve in VO as well so they don't fire again this time tomorrow [23:47:32] ah thanks [23:58:40] (CR) Cwhite: [C: +1] "SGTM" [puppet] - https://gerrit.wikimedia.org/r/812860 (https://phabricator.wikimedia.org/T288622) (owner: Filippo Giunchedi) [23:59:35] (CR) Cwhite: [C: +1] icinga: switch to prometheus-only probes for commons [puppet] - https://gerrit.wikimedia.org/r/812854 
(https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)