[00:00:05] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:05] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service
[00:10:05] <icinga-wm>	 thumbor@8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:17] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:19:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[00:19:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[00:23:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] <jinxer-wm>	 (JobUnavailable) resolved: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:03] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:33] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service
[02:07:33] <icinga-wm>	 thumbor@8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:20:43] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:11] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service
[02:28:11] <icinga-wm>	 thumbor@8818.service,thumbor@8819.service,thumbor@8820.service,thumbor@8821.service,thumbor@8822.service,thumbor@8823.service,thumbor@8824.service,thumbor@8828.service,thumbor@8829.service,thumbor@8831.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:57] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:19:53] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:29:59] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:25] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service
[03:37:25] <icinga-wm>	 thumbor@8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:59:59] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:07:31] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service
[04:07:31] <icinga-wm>	 thumbor@8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:21:15] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:30:05] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:01] <icinga-wm>	 PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:37:35] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service
[04:37:35] <icinga-wm>	 thumbor@8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:57] <wikibugs>	 (PS1) Abijeet Patro: WikiPage group description: prefix source page title [extensions/Translate] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/812531 (https://phabricator.wikimedia.org/T312688)
[04:46:04] <wikibugs>	 (CR) Abijeet Patro: [C: +1] WikiPage group description: prefix source page title [extensions/Translate] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/812531 (https://phabricator.wikimedia.org/T312688) (owner: Abijeet Patro)
[04:51:53] <icinga-wm>	 RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:59:21] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:11] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:07:43] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service
[05:07:43] <icinga-wm>	 thumbor@8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:20:49] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:28:23] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service
[05:28:23] <icinga-wm>	 thumbor@8818.service,thumbor@8819.service,thumbor@8820.service,thumbor@8821.service,thumbor@8822.service,thumbor@8823.service,thumbor@8824.service,thumbor@8828.service,thumbor@8829.service,thumbor@8831.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:30:19] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:23] <icinga-wm>	 RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:40:21] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service
[05:40:21] <icinga-wm>	 thumbor@8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8836.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:42:17] <wikibugs>	 SRE, Thumbor: Thumbor units failing / service general slowness - https://phabricator.wikimedia.org/T312722 (jcrespo) Found the following exception, sending as NDA, as I suspect it is user-traffic related:  {P30998}
[05:47:21] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:52:07] <icinga-wm>	 RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:54:53] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service
[05:54:53] <icinga-wm>	 thumbor@8818.service,thumbor@8819.service,thumbor@8820.service,thumbor@8821.service,thumbor@8823.service,thumbor@8826.service,thumbor@8827.service,thumbor@8829.service,thumbor@8835.service,thumbor@8837.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:59:39] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:21:05] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:28:17] <_joe_>	 !log depooled thumbor1005, downgraded firejail, restarted units
[06:28:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:33] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service
[06:28:33] <icinga-wm>	 thumbor@8818.service,thumbor@8819.service,thumbor@8820.service,thumbor@8821.service,thumbor@8822.service,thumbor@8823.service,thumbor@8824.service,thumbor@8828.service,thumbor@8829.service,thumbor@8831.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:28:47] <_joe_>	 !log repool thumbor1005
[06:28:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:29] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:36:05] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:36:11] <icinga-wm>	 RECOVERY - Check systemd state on thumbor2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:59] <icinga-wm>	 RECOVERY - Check systemd state on thumbor2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:45] <icinga-wm>	 RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:59] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:40:09] <icinga-wm>	 RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:44:38] <wikibugs>	 (PS9) David Caro: tests: add test to ensure that runbook existis if set [alerts] - https://gerrit.wikimedia.org/r/812011
[06:44:55] <wikibugs>	 (PS6) David Caro: wmcs: add test to ensure we add a runbook to each alert [alerts] - https://gerrit.wikimedia.org/r/812013
[06:50:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2077.codfw.wmnet
[06:51:15] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission db2077 [puppet] - https://gerrit.wikimedia.org/r/812703 (https://phabricator.wikimedia.org/T312191)
[06:52:05] <wikibugs>	 (CR) David Caro: [C: +2] wmcs: add test to ensure we add a runbook to each alert [alerts] - https://gerrit.wikimedia.org/r/812013 (owner: David Caro)
[06:54:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[06:55:07] <wikibugs>	 (Merged) jenkins-bot: tests: add test to ensure that runbook existis if set [alerts] - https://gerrit.wikimedia.org/r/812011 (owner: David Caro)
[06:55:09] <wikibugs>	 (Merged) jenkins-bot: wmcs: add test to ensure we add a runbook to each alert [alerts] - https://gerrit.wikimedia.org/r/812013 (owner: David Caro)
[06:58:24] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:59:16] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission db2077 [puppet] - https://gerrit.wikimedia.org/r/812703 (https://phabricator.wikimedia.org/T312191) (owner: Marostegui)
[07:00:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2077.codfw.wmnet
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220711T0700).
[07:00:05] <jouncebot>	 abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:07] <wikibugs>	 SRE, ops-codfw, decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2077.codfw.wmnet` - db2077.codfw.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - F...
[07:00:30] <wikibugs>	 SRE, ops-codfw, decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (Marostegui) @Papaul this is ready for you
[07:01:06] <wikibugs>	 SRE, ops-codfw, decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (Marostegui) a:Papaul
[07:01:15] <wikibugs>	 (PS5) David Caro: wmcs: add alerts for any node going down [alerts] - https://gerrit.wikimedia.org/r/812259
[07:01:19] <wikibugs>	 (CR) David Caro: wmcs: add alerts for any node going down (1 comment) [alerts] - https://gerrit.wikimedia.org/r/812259 (owner: David Caro)
[07:01:23] <wikibugs>	 (PS4) David Caro: wmcs: add systemd unit down alerts [alerts] - https://gerrit.wikimedia.org/r/812313
[07:01:27] <wikibugs>	 (CR) David Caro: wmcs: add systemd unit down alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/812313 (owner: David Caro)
[07:01:31] <wikibugs>	 (PS7) David Caro: wmcs: add openstack nodes down alerts [alerts] - https://gerrit.wikimedia.org/r/811999
[07:09:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2027.codfw.wmnet with OS bullseye
[07:09:15] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2027.codfw.wmnet with OS bullseye
[07:21:40] <urbanecm>	 abijeet_: hi, it looks like no one claimed the window yet! I can deploy if you're still around. 
[07:21:50] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission db2080 [puppet] - https://gerrit.wikimedia.org/r/812704 (https://phabricator.wikimedia.org/T312618)
[07:22:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2080.codfw.wmnet
[07:23:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2027.codfw.wmnet with reason: host reimage
[07:24:08] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, see inline" [alerts] - https://gerrit.wikimedia.org/r/812424 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[07:25:38] <wikibugs>	 SRE, Thumbor: Thumbor units failing / service general slowness - https://phabricator.wikimedia.org/T312722 (Joe) I downgraded firejail on all thumbor servers and that stopped, at least for now, the flurry of restarts we were seeing. More investigation is needed.
[07:26:18] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[07:26:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2027.codfw.wmnet with reason: host reimage
[07:29:31] <wikibugs>	 (PS1) Majavah: UndeleteHookHandler: fix namespace conditional [extensions/PageTriage] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/812532 (https://phabricator.wikimedia.org/T311347)
[07:29:38] <wikibugs>	 (CR) Filippo Giunchedi: "Thanks Daniel, LGTM as a temporary measure." [puppet] - https://gerrit.wikimedia.org/r/812427 (https://phabricator.wikimedia.org/T275170) (owner: Dzahn)
[07:30:22] <taavi>	 urbanecm: are you deploying something or can I deploy a patch of my own?
[07:30:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:30:44] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Add alert manager alert receivers for the Abstract Wikipedia team. [puppet] - https://gerrit.wikimedia.org/r/811790 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang)
[07:31:03] <urbanecm>	 taavi: seems abijeet_ isn't here either, so feel free to squeeze in :)
[07:31:19] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] smokeping: remove DNS targets, moved to Prometheus [puppet] - https://gerrit.wikimedia.org/r/812329 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[07:31:26] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission db2080 [puppet] - https://gerrit.wikimedia.org/r/812704 (https://phabricator.wikimedia.org/T312618) (owner: Marostegui)
[07:31:41] <taavi>	 thanks, will do
[07:31:46] <marostegui>	 godog: ok to merge?
[07:31:47] <wikibugs>	 (CR) Majavah: [C: +2] UndeleteHookHandler: fix namespace conditional [extensions/PageTriage] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/812532 (https://phabricator.wikimedia.org/T311347) (owner: Majavah)
[07:31:50] <godog>	 marostegui: yes please!
[07:31:52] <godog>	 thank you
[07:31:56] <marostegui>	 godog: done!
[07:31:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2080.codfw.wmnet
[07:32:01] <godog>	 <3 marostegui 
[07:32:05] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[07:32:44] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db2080 from dbctl [puppet] - https://gerrit.wikimedia.org/r/812705 (https://phabricator.wikimedia.org/T312618)
[07:33:23] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Remove db2080 from dbctl [puppet] - https://gerrit.wikimedia.org/r/812705 (https://phabricator.wikimedia.org/T312618) (owner: Marostegui)
[07:33:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] swift: turn off uwsgi request logging [puppet] - https://gerrit.wikimedia.org/r/810276 (https://phabricator.wikimedia.org/T297959) (owner: Filippo Giunchedi)
[07:33:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2080 from dbtcl T312618', diff saved to https://phabricator.wikimedia.org/P30999 and previous config saved to /var/cache/conftool/dbconfig/20220711-073346-marostegui.json
[07:33:50] <godog>	 lol, good timing again marostegui 
[07:33:50] <stashbot>	 T312618: decommission db2080 - https://phabricator.wikimedia.org/T312618
[07:34:21] <wikibugs>	 (PS8) David Caro: wmcs: add openstack nodes down alerts [alerts] - https://gerrit.wikimedia.org/r/811999
[07:34:23] <wikibugs>	 (PS1) David Caro: wmcs: Add ceph cluster alerts [alerts] - https://gerrit.wikimedia.org/r/812706
[07:34:42] <marostegui>	 godog: I merged mine! I didn't get any output from any other change!
[07:34:48] <wikibugs>	 ops-codfw, decommission-hardware, Patch-For-Review: decommission db2080 - https://phabricator.wikimedia.org/T312618 (Marostegui) a:Papaul
[07:34:56] <wikibugs>	 (CR) David Caro: [C: +2] wmcs: add alerts for any node going down [alerts] - https://gerrit.wikimedia.org/r/812259 (owner: David Caro)
[07:35:06] <wikibugs>	 ops-codfw, decommission-hardware, Patch-For-Review: decommission db2080 - https://phabricator.wikimedia.org/T312618 (Marostegui) @Papaul this is ready for you!
[07:35:16] <wikibugs>	 (CR) David Caro: [C: +2] wmcs: add systemd unit down alerts [alerts] - https://gerrit.wikimedia.org/r/812313 (owner: David Caro)
[07:36:30] <wikibugs>	 (CR) David Caro: wmcs: Add ceph cluster alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/812706 (owner: David Caro)
[07:36:37] <wikibugs>	 (CR) CI reject: [V: -1] wmcs: Add ceph cluster alerts [alerts] - https://gerrit.wikimedia.org/r/812706 (owner: David Caro)
[07:36:42] <wikibugs>	 (Merged) jenkins-bot: wmcs: add alerts for any node going down [alerts] - https://gerrit.wikimedia.org/r/812259 (owner: David Caro)
[07:36:51] <wikibugs>	 (CR) CI reject: [V: -1] wmcs: add openstack nodes down alerts [alerts] - https://gerrit.wikimedia.org/r/811999 (owner: David Caro)
[07:37:06] <wikibugs>	 (Merged) jenkins-bot: wmcs: add systemd unit down alerts [alerts] - https://gerrit.wikimedia.org/r/812313 (owner: David Caro)
[07:37:12] <godog>	 marostegui: yeah I submitted mine, then went to puppet-merge and you were merging too (twice in a row)
[07:37:34] <wikibugs>	 (Merged) jenkins-bot: UndeleteHookHandler: fix namespace conditional [extensions/PageTriage] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/812532 (https://phabricator.wikimedia.org/T311347) (owner: Majavah)
[07:39:13] <taavi>	 testing on mwdebug1001
[07:40:00] <taavi>	 works, syncing
[07:41:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2027.codfw.wmnet with OS bullseye
[07:41:31] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2027.codfw.wmnet with OS bullseye completed: - ganeti2027 (**PASS**)   - Downtimed on...
[07:43:05] <logmsgbot>	 !log taavi@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/PageTriage/includes/HookHandlers/UndeleteHookHandler.php: Backport: [[gerrit:812532|UndeleteHookHandler: fix namespace conditional (T311347)]] (duration: 02m 54s)
[07:43:09] <stashbot>	 T311347: Mark freshly undeleted articles as unreviewed automatically - https://phabricator.wikimedia.org/T311347
[07:43:12] * taavi done
[07:43:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:45:21] <wikibugs>	 (PS1) Marostegui: db2165: Candidate master for s8 [puppet] - https://gerrit.wikimedia.org/r/812708 (https://phabricator.wikimedia.org/T311493)
[07:47:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:47:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:49:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
[07:51:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:51:57] <icinga-wm>	 PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:52:21] <godog>	 !log roll-restart swift-account swift-container across swift/thanos bullseye hosts - T297959
[07:52:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:24] <stashbot>	 T297959: thanos-be hosts filing up root filesystem with logs - https://phabricator.wikimedia.org/T297959
[07:54:22] <wikibugs>	 (CR) Marostegui: [C: +2] db2165: Candidate master for s8 [puppet] - https://gerrit.wikimedia.org/r/812708 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:57:45] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] thanos: split retention times based on resolution [puppet] - https://gerrit.wikimedia.org/r/811932 (https://phabricator.wikimedia.org/T311690) (owner: Filippo Giunchedi)
[07:58:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
[07:59:54] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] thanos: trim raw samples retention to 54 weeks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/811933 (https://phabricator.wikimedia.org/T311690) (owner: Filippo Giunchedi)
[08:04:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2027.codfw.wmnet to cluster codfw and group A
[08:05:39] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:06:24] <godog>	 !log trim thanos raw samples retention to 54w - T311690
[08:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:28] <stashbot>	 T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690
[08:10:55] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff)
[08:14:27] <wikibugs>	 (CR) Ayounsi: [C: +1] prometheus: add support to blackbox icmp probe hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812330 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[08:15:54] <wikibugs>	 (PS4) Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS [puppet] - https://gerrit.wikimedia.org/r/810106
[08:16:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2027.codfw.wmnet to cluster codfw and group A
[08:16:43] <wikibugs>	 (PS5) Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS [puppet] - https://gerrit.wikimedia.org/r/810106
[08:17:16] <wikibugs>	 (CR) Ayounsi: [C: +1] "LGTM nice work!" [puppet] - https://gerrit.wikimedia.org/r/812331 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[08:18:01] <wikibugs>	 (CR) Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS (2 comments) [puppet] - https://gerrit.wikimedia.org/r/810106 (owner: Andrea Denisse)
[08:19:48] <wikibugs>	 (CR) Ayounsi: [C: +1] prometheus: blackbox icmp probes for hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812331 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[08:20:58] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/810106 (owner: Andrea Denisse)
[08:27:23] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/810106 (owner: Andrea Denisse)
[08:28:38] <wikibugs>	 (CR) Andrea Denisse: [C: +2] Add PHP 7.4 dependencies for LibreNMS [puppet] - https://gerrit.wikimedia.org/r/810106 (owner: Andrea Denisse)
[08:30:00] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] "Thanks for the review!" [puppet] - https://gerrit.wikimedia.org/r/812330 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[08:30:33] <godog>	 denisse|m: merging your change too! \o/
[08:30:51] <wikibugs>	 (03PS1) 10Slyngshede: Add per node vCPU allocations [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/812818
[08:30:56] <denisse|m>	 godog: Thank you very much! :)
[08:35:11] <wikibugs>	 (03PS2) 10Slyngshede: Add per node vCPU allocations [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/812818
[08:35:17] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: blackbox icmp probes for hosts [puppet] - 10https://gerrit.wikimedia.org/r/812331 (https://phabricator.wikimedia.org/T169860)
[08:35:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: prometheus: blackbox icmp probes for hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812331 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi)
[08:38:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] utils: chmod +x setup_rake.sh and vcl_ec2_nets.py [puppet] - 10https://gerrit.wikimedia.org/r/810973 (owner: 10Zabe)
[08:38:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: blackbox icmp probes for hosts [puppet] - 10https://gerrit.wikimedia.org/r/812331 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi)
[08:40:10] <wikibugs>	 (03PS1) 10David Caro: dumps.kiwix-rsync-cron: Return 0 when not failed [puppet] - 10https://gerrit.wikimedia.org/r/812819
[08:41:52] <wikibugs>	 (03PS1) 10Volans: tests: fix caplog usage [software/spicerack] - 10https://gerrit.wikimedia.org/r/812820
[08:41:54] <wikibugs>	 (03PS1) 10Volans: tests: reduce runtime by more than 80% [software/spicerack] - 10https://gerrit.wikimedia.org/r/812821
[08:46:49] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10Volans) @Papaul I've finished my tests on db2175, it's all yours! Thanks for the help. I've sent patches to Gerrit to fix the issue and once merge...
[08:46:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] mailmap: add a few entries [puppet] - 10https://gerrit.wikimedia.org/r/809163 (owner: 10Zabe)
[08:48:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809616 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[08:49:21] <wikibugs>	 (03CR) 10Ayounsi: "Not 100% sure yet, but I think this could be a Custom Validator now, see T310590." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[08:49:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] tests: reduce runtime by more than 80% [software/spicerack] - 10https://gerrit.wikimedia.org/r/812821 (owner: 10Volans)
[08:49:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809626 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[08:50:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/809624 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[08:52:13] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Update s3-master [dns] - 10https://gerrit.wikimedia.org/r/812822 (https://phabricator.wikimedia.org/T311610)
[08:52:55] <icinga-wm>	 RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:52:59] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/812822 (https://phabricator.wikimedia.org/T311610) (owner: 10Marostegui)
[08:53:55] <wikibugs>	 (03PS2) 10Volans: tests: reduce runtime by more than 80% [software/spicerack] - 10https://gerrit.wikimedia.org/r/812821
[08:54:18] <wikibugs>	 (03CR) 10Slyngshede: "Getting CPU allocation per node is easier to do in the exporter, compared to trying to extract the information using the existing metrics " [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/812818 (owner: 10Slyngshede)
[08:56:59] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] druid: Fixed UID/GIDs are universally in use now [puppet] - 10https://gerrit.wikimedia.org/r/812286 (owner: 10Muehlenhoff)
[09:02:40] <wikibugs>	 (03PS1) 10David Caro: rabbitmq.drain_queue: Fix requeue option for newer API [puppet] - 10https://gerrit.wikimedia.org/r/812825
[09:08:18] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) Great, so next step are: # Install the breakout panels, (document them, similar to {T304710}) # Pre-populate the ports/panels that will be used with th...
[09:10:16] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: improve error handling for puppet-facts-upload [puppet] - 10https://gerrit.wikimedia.org/r/812827 (https://phabricator.wikimedia.org/T311742)
[09:15:31] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Replace db2078 with db2160 [puppet] - 10https://gerrit.wikimedia.org/r/812828 (https://phabricator.wikimedia.org/T311493)
[09:17:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: improve error handling for puppet-facts-upload [puppet] - 10https://gerrit.wikimedia.org/r/812827 (https://phabricator.wikimedia.org/T311742) (owner: 10Jbond)
[09:17:13] <wikibugs>	 (03PS1) 10Jbond: pcc: add correct tools pm public key [puppet] - 10https://gerrit.wikimedia.org/r/812829 (https://phabricator.wikimedia.org/T311742)
[09:17:33] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:19:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pcc: add correct tools pm public key [puppet] - 10https://gerrit.wikimedia.org/r/812829 (https://phabricator.wikimedia.org/T311742) (owner: 10Jbond)
[09:19:41] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] mariadb: Replace db2078 with db2160 [puppet] - 10https://gerrit.wikimedia.org/r/812828 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[09:19:57] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Replace db2078 with db2160 [puppet] - 10https://gerrit.wikimedia.org/r/812828 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[09:20:21] <wikibugs>	 (03CR) 10Ayounsi: "Tested on netbox-next and works as expected:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) (owner: 10Ayounsi)
[09:22:33] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:24:01] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10puppet-compiler, 10Patch-For-Review: pcc-uploader failing on tools-puppetmaster-02 - https://phabricator.wikimedia.org/T311742 (10jbond) i think this is related to when the ssl certificate needed to be extended.  I have uploaded the [[ https://gerrit.wikimedia.org/...
[09:24:56] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10puppet-compiler, 10Patch-For-Review: pcc-uploader failing on tools-puppetmaster-02 - https://phabricator.wikimedia.org/T311742 (10jbond) 05Open→03Resolved a:03jbond
[09:26:46] <jynus>	 jbond: ^ not sure if icinga puppet could be your patch or unrelated?
[09:27:14] <wikibugs>	 (03PS3) 10Slyngshede: P:dbbackups::mydumper Move mydumper from cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673)
[09:27:34] <jynus>	 probably not, some other earlier patch
[09:29:31] <wikibugs>	 (03Abandoned) 10Jbond: P:puppet::agent: add logging of puppet calls [puppet] - 10https://gerrit.wikimedia.org/r/808984 (owner: 10Jbond)
[09:30:03] <jynus>	 seems alertmanager related, godog maybe?
[09:30:59] <wikibugs>	 (03CR) 10Slyngshede: "Add ignore errors to get behavior similar to crontab." [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[09:31:29] <jynus>	 found it, I think it is https://gerrit.wikimedia.org/r/c/operations/puppet/+/811790
[09:31:57] <jynus>	 ^godog there must be an extra - that yaml doesn't like or something
[09:36:25] <jynus>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/811790/3/modules/alertmanager/templates/alertmanager.yml.erb#426 maybe?
[09:42:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/811232 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:42:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811227 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:43:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811229 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:43:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811226 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:44:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811231 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:46:13] <wikibugs>	 (03PS9) 10David Caro: wmcs: add openstack nodes down alerts [alerts] - 10https://gerrit.wikimedia.org/r/811999
[09:46:15] <wikibugs>	 (03PS2) 10David Caro: wmcs: Add ceph cluster alerts [alerts] - 10https://gerrit.wikimedia.org/r/812706
[09:46:17] <wikibugs>	 (03CR) 10David Caro: wmcs: Add ceph cluster alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/812706 (owner: 10David Caro)
[09:47:25] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] rabbitmq.drain_queue: Fix requeue option for newer API [puppet] - 10https://gerrit.wikimedia.org/r/812825 (owner: 10David Caro)
[09:48:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] tilerator: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811225 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:50:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811228 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:51:17] <wikibugs>	 (03CR) 10Jcrespo: P:dbbackups::mydumper Move mydumper from cron to systemd timer. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[09:52:12] <wikibugs>	 (03CR) 10Jcrespo: P:dbbackups::mydumper Move mydumper from cron to systemd timer. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[09:52:49] <jbond>	 jynus: sorry missed your earlier ping, however i think you're right, the issue seems to be related to godog's CR, godog let me know if you need a hand
[09:53:16] <jynus>	 sorry for the ping, you happened to merge something at the time
[09:53:32] <jynus>	 so I was guessing until I look at it more deeply
[09:53:50] <jbond>	 no probs :)
[09:54:05] <jynus>	 also I pinged because I thought it was blocking icinga updates
[09:54:15] <wikibugs>	 (03PS1) 10Majavah: alertmanager: fix indentation [puppet] - 10https://gerrit.wikimedia.org/r/812834
[09:54:21] <taavi>	 ^ probably fixed by this
[09:54:23] <jynus>	 (which would have been high prio)
[09:54:41] <jbond>	 ack
[09:55:16] <jbond>	 taavi: ack thanks lgtm will merge
[09:55:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/812834 (owner: 10Majavah)
[09:56:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM minor comment/question inline" [puppet] - 10https://gerrit.wikimedia.org/r/810956 (owner: 10Volans)
[09:57:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/810955 (owner: 10Volans)
[09:58:03] <wikibugs>	 (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/810956 (owner: 10Volans)
[09:59:00] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] wmnet: Update s3-master [dns] - 10https://gerrit.wikimedia.org/r/812822 (https://phabricator.wikimedia.org/T311610) (owner: 10Marostegui)
[09:59:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/810956 (owner: 10Volans)
[09:59:32] <jbond>	 taavi: that did indeed fix it thanks (cc godog )
[10:00:11] <volans>	 CI should probably catch that
[10:00:54] <jbond>	 volans: agreed see T305676
[10:00:55] <stashbot>	 T305676: Validate all yaml files in puppet.git - https://phabricator.wikimedia.org/T305676
[10:01:34] <taavi>	 that's a .yml.erb so you can't just run it directly through a linter
[10:01:48] <jbond>	 and T236954 which has more comments
[10:01:49] <stashbot>	 T236954: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954
[10:01:58] <jbond>	 ahh yes erb files are a whole other mess
[10:02:33] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:03:14] <taavi>	 probably could just move the alert routing configuration to hiera
[10:04:28] <wikibugs>	 (03PS1) 10Marostegui: db2160: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/812836 (https://phabricator.wikimedia.org/T311493)
[10:05:20] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2160: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/812836 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[10:06:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging AniketArs out of all services on: 663 hosts
[10:07:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AniketArs out of all services on: 663 hosts
[10:08:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging AniketArs out of all services on: 1292 hosts
[10:08:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AniketArs out of all services on: 1292 hosts
[10:08:57] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs: add openstack nodes down alerts [alerts] - 10https://gerrit.wikimedia.org/r/811999 (owner: 10David Caro)
[10:11:37] <wikibugs>	 (03Merged) 10jenkins-bot: wmcs: add openstack nodes down alerts [alerts] - 10https://gerrit.wikimedia.org/r/811999 (owner: 10David Caro)
[10:16:56] <wikibugs>	 (03PS4) 10Jbond: spdx: Add csr files to the list of files to ignore. [puppet] - 10https://gerrit.wikimedia.org/r/808219
[10:17:01] <wikibugs>	 (03PS1) 10Marostegui: db2078: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/812837 (https://phabricator.wikimedia.org/T312754)
[10:17:33] <wikibugs>	 (03CR) 10Jbond: "thanks for the feedback, updated" [puppet] - 10https://gerrit.wikimedia.org/r/808219 (owner: 10Jbond)
[10:17:35] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:41] <wikibugs>	 (03CR) 10Jbond: spdx: Add csr files to the list of files to ignore. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/808219 (owner: 10Jbond)
[10:17:43] <wikibugs>	 (03PS5) 10Jbond: spdx: Add csr files to the list of files to ignore. [puppet] - 10https://gerrit.wikimedia.org/r/808219
[10:18:11] <wikibugs>	 (03PS6) 10Jbond: spdx: Add csr files to the list of files to ignore. [puppet] - 10https://gerrit.wikimedia.org/r/808219
[10:20:02] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2078: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/812837 (https://phabricator.wikimedia.org/T312754) (owner: 10Marostegui)
[10:20:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/803262 (owner: 10Ayounsi)
[10:23:44] <wikibugs>	 (03PS1) 10Marostegui: site.pp: db2078 no longer active misc host [puppet] - 10https://gerrit.wikimedia.org/r/812838 (https://phabricator.wikimedia.org/T312754)
[10:24:37] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] site.pp: db2078 no longer active misc host [puppet] - 10https://gerrit.wikimedia.org/r/812838 (https://phabricator.wikimedia.org/T312754) (owner: 10Marostegui)
[10:27:57] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:32:13] <wikibugs>	 (03CR) 10Volans: "One nit inline, LGTM otherwise. To be thoroughly tested." [cookbooks] - 10https://gerrit.wikimedia.org/r/803262 (owner: 10Ayounsi)
[10:34:58] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/811331 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[10:35:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[10:38:27] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM (I'll leave it to you the details related to k8s APIs discussed in the comment)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/811983 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[10:40:17] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Switchover s3 master db1123 -> db1157 [puppet] - 10https://gerrit.wikimedia.org/r/812841 (https://phabricator.wikimedia.org/T311610)
[10:41:05] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/812841 (https://phabricator.wikimedia.org/T311610) (owner: 10Marostegui)
[10:42:17] <wikibugs>	 (03CR) 10Jbond: "See inline for comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff)
[10:42:30] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] mariadb: Switchover s3 master db1123 -> db1157 [puppet] - 10https://gerrit.wikimedia.org/r/812841 (https://phabricator.wikimedia.org/T311610) (owner: 10Marostegui)
[10:42:42] <wikibugs>	 (03Abandoned) 10Jbond: C:httpd: allow users to pass the listen_ports to use [puppet] - 10https://gerrit.wikimedia.org/r/797223 (owner: 10Jbond)
[10:47:40] <wikibugs>	 (03PS4) 10Jbond: service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639)
[10:47:43] <wikibugs>	 (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond)
[10:47:53] <wikibugs>	 (03PS5) 10Jbond: service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639)
[10:49:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond)
[10:55:04] <godog>	 jynus taavi thanks for the heads up and the fix!
[10:55:19] <godog>	 jbond too
[10:55:26] <godog>	 agreed re: yaml + erb, not super easy
[10:58:14] <wikibugs>	 (03CR) 10Volans: "Quite a large one. I've done a quick pass, left some comments." [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond)
[11:01:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] network: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811230 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[11:02:11] <wikibugs>	 (03CR) 10Volans: "quick reply to comments, I didn't do a full pass yet" [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff)
[11:03:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [alerts] - 10https://gerrit.wikimedia.org/r/812706 (owner: 10David Caro)
[11:04:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/812172 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[11:05:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/812173 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[11:06:01] <wikibugs>	 (03PS1) 10Ladsgroup: wwwportals: Make sure portal assets are also visible in wikibooks vhost [puppet] - 10https://gerrit.wikimedia.org/r/812843 (https://phabricator.wikimedia.org/T273179)
[11:06:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] interface: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/812176 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[11:07:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] nginx: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812174 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[11:10:55] <icinga-wm>	 PROBLEM - puppet last run on idp-test1002 is CRITICAL: CRITICAL: Puppet has been disabled for 604910 seconds, message: jmm testing things, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:11:14] <jbond>	 moritzm: ^^
[11:13:54] <wikibugs>	 (03PS1) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847)
[11:14:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[11:16:33] <wikibugs>	 (03PS3) 10Jbond: reqconfig: Add a default for git_repo and ensure its a Path [software/conftool] - 10https://gerrit.wikimedia.org/r/805338
[11:16:43] <wikibugs>	 (03CR) 10Jbond: "updated thanks" [software/conftool] - 10https://gerrit.wikimedia.org/r/805338 (owner: 10Jbond)
[11:17:08] <wikibugs>	 (03PS4) 10Jbond: reqconfig: Add a default for git_repo and ensure its a Path [software/conftool] - 10https://gerrit.wikimedia.org/r/805338
[11:18:36] <wikibugs>	 (03PS10) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815
[11:18:40] <wikibugs>	 (03PS2) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847)
[11:18:50] <wikibugs>	 (03PS1) 10Filippo Giunchedi: icinga: switch to prometheus-only probes for commons [puppet] - 10https://gerrit.wikimedia.org/r/812854 (https://phabricator.wikimedia.org/T305847)
[11:19:37] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:22:24] <moritzm>	 jbond: ah yes, will roll back my test changes for now
[11:22:45] <jbond>	 ack
[11:23:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/797293 (owner: 10Majavah)
[11:23:23] <wikibugs>	 (03PS3) 10Jbond: nrpe: move plugins off the base nrpe class [puppet] - 10https://gerrit.wikimedia.org/r/797293 (owner: 10Majavah)
[11:23:35] <wikibugs>	 (03PS1) 10Marostegui: db2165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/812855 (https://phabricator.wikimedia.org/T311493)
[11:24:01] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/812855 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[11:25:35] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productinize db2163 [puppet] - 10https://gerrit.wikimedia.org/r/812856 (https://phabricator.wikimedia.org/T311475)
[11:25:59] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Productinize db2163 [puppet] - 10https://gerrit.wikimedia.org/r/812856 (https://phabricator.wikimedia.org/T311493)
[11:27:09] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Productinize db2163 [puppet] - 10https://gerrit.wikimedia.org/r/812856 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[11:27:13] <wikibugs>	 (03CR) 10Jbond: Add a host's conftool pooled status and weight per service to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[11:27:28] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Switchover s3 master db1123 -> db1157 [puppet] - 10https://gerrit.wikimedia.org/r/812841 (https://phabricator.wikimedia.org/T311610)
[11:28:47] <icinga-wm>	 RECOVERY - puppet last run on idp-test1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:28:57] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for ddw - https://phabricator.wikimedia.org/T312675 (10dr0ptp4kt) @jhathaway, that's correct, thanks! Nothing additionally needed beyond that access at the moment.
[11:30:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "minor nit but otherwise lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[11:40:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/812442 (owner: 10Volans)
[11:41:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/812448 (owner: 10Volans)
[11:47:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/812820 (owner: 10Volans)
[11:53:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm see comments" [software/spicerack] - 10https://gerrit.wikimedia.org/r/812821 (owner: 10Volans)
[12:02:10] <wikibugs>	 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10fgiunchedi) Overall the idea of sending additional headers is the right one @Jgiannelos: specifically for swift `x-delete-after`...
[12:05:21] <moritzm>	 !log updated bullseye netboot image for Bullseye 11.4 point release T312637
[12:05:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:25] <stashbot>	 T312637: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637
[12:06:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff)
[12:11:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] apparmor: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/812172 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[12:14:43] <wikibugs>	 10SRE, 10Observability-Alerting, 10Traffic, 10Patch-For-Review, 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10fgiunchedi) >>! In T300723#8066385, @BCornwall wrote: > @fgiunchedi  >  > Looks like the rules mentioned in the t...
[12:17:49] <wikibugs>	 (03CR) 10Jbond: "see inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) (owner: 10Ayounsi)
[12:19:57] <icinga-wm>	 PROBLEM - php7.4-fpm service on mw2301 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.171: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:22:17] <icinga-wm>	 RECOVERY - php7.4-fpm service on mw2301 is OK: OK - php7.4-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:22:32] <wikibugs>	 (03PS6) 10Jbond: service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639)
[12:23:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond)
[12:26:09] <wikibugs>	 SRE-swift-storage, Observability-Alerting: Port swift prometheus-based alerts from icinga to alertmanager - https://phabricator.wikimedia.org/T312765 (fgiunchedi)
[12:26:20] <wikibugs>	 (CR) Volans: k8s/reboot-nodes: Error if nodes are cordoned (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[12:32:48] <wikibugs>	 (PS1) Filippo Giunchedi: opensearch: remove icinga::monitor::elasticsearch::old_jvm_gc_checks [puppet] - https://gerrit.wikimedia.org/r/812860 (https://phabricator.wikimedia.org/T288622)
[12:32:50] <wikibugs>	 (PS3) Volans: tests: reduce runtime by more than 80% [software/spicerack] - https://gerrit.wikimedia.org/r/812821
[12:33:34] <wikibugs>	 (CR) Volans: "replies inline" [software/spicerack] - https://gerrit.wikimedia.org/r/812821 (owner: Volans)
[12:49:29] <wikibugs>	 (PS10) David Caro: novafullstack: add types and some names refactor [puppet] - https://gerrit.wikimedia.org/r/810950
[12:49:31] <wikibugs>	 (PS8) David Caro: novafullstack: Refactor and minor fix [puppet] - https://gerrit.wikimedia.org/r/811316
[12:49:33] <wikibugs>	 (PS5) David Caro: novafullstack: generate prometheus stats too [puppet] - https://gerrit.wikimedia.org/r/812037
[12:49:35] <wikibugs>	 (CR) David Caro: novafullstack: generate prometheus stats too (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812037 (owner: David Caro)
[12:51:09] <wikibugs>	 (CR) Volans: [C: +2] redfish: better compare Dell SCP attributes [software/spicerack] - https://gerrit.wikimedia.org/r/812442 (owner: Volans)
[12:51:12] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36244/console" [puppet] - https://gerrit.wikimedia.org/r/810950 (owner: David Caro)
[12:51:19] <wikibugs>	 (CR) Volans: [C: +2] tests: fix caplog usage [software/spicerack] - https://gerrit.wikimedia.org/r/812820 (owner: Volans)
[12:55:34] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Avoid direct references [puppet] - https://gerrit.wikimedia.org/r/812287 (owner: Muehlenhoff)
[12:55:42] <wikibugs>	 (PS2) Muehlenhoff: Avoid direct references [puppet] - https://gerrit.wikimedia.org/r/812287
[12:59:18] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for DDesouza - https://phabricator.wikimedia.org/T312676 (Ottomata) > It looks like I will not need the SSH key as my use case fits in "Dashboards in Superset / Hive interfaces (like Hue) that do access private data".  Correct!  A...
[12:59:30] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db2163 to dbctl [puppet] - https://gerrit.wikimedia.org/r/812864 (https://phabricator.wikimedia.org/T311493)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220711T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:22] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db2163 to dbctl [puppet] - https://gerrit.wikimedia.org/r/812864 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[13:00:41] <marostegui>	 moritzm: your change is ok to merge?
[13:01:07] <wikibugs>	 (PS37) Jbond: sre.hardware.dell: create new cookbook for updating idrac and bios [cookbooks] - https://gerrit.wikimedia.org/r/763215
[13:01:38] <wikibugs>	 (Merged) jenkins-bot: redfish: better compare Dell SCP attributes [software/spicerack] - https://gerrit.wikimedia.org/r/812442 (owner: Volans)
[13:02:28] <wikibugs>	 (CR) Jbond: "thanks updated" [cookbooks] - https://gerrit.wikimedia.org/r/763215 (owner: Jbond)
[13:02:46] <wikibugs>	 (Merged) jenkins-bot: tests: fix caplog usage [software/spicerack] - https://gerrit.wikimedia.org/r/812820 (owner: Volans)
[13:03:00] <marostegui>	 moritzm: I have merged it as it looks harmless
[13:04:21] <moritzm>	 marostegui: thanks!
[13:04:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2163 to s8 T311493', diff saved to https://phabricator.wikimedia.org/P31002 and previous config saved to /var/cache/conftool/dbconfig/20220711-130441-marostegui.json
[13:04:47] <stashbot>	 T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[13:05:09] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye
[13:05:15] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye
[13:06:29] <wikibugs>	 (CR) David Caro: wmcs: Add ceph cluster alerts (2 comments) [alerts] - https://gerrit.wikimedia.org/r/812706 (owner: David Caro)
[13:08:04] <wikibugs>	 (CR) David Caro: novafullstack: generate prometheus stats too (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812037 (owner: David Caro)
[13:09:16] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36245/console" [puppet] - https://gerrit.wikimedia.org/r/812037 (owner: David Caro)
[13:10:00] <wikibugs>	 (CR) David Caro: [V: +1] "The pcc run is as expected, adding the absented file to the secondary nodes, and doing nothing on the primary." [puppet] - https://gerrit.wikimedia.org/r/810950 (owner: David Caro)
[13:10:29] <wikibugs>	 (CR) David Caro: [V: +1] novafullstack: add types and some names refactor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810950 (owner: David Caro)
[13:10:38] <wikibugs>	 (CR) David Caro: [V: +1] "The pcc run is as expected, adding the absented file to the secondary nodes, and doing nothing on the primary." [puppet] - https://gerrit.wikimedia.org/r/812037 (owner: David Caro)
[13:11:23] <wikibugs>	 (CR) CDanis: [C: +2] varnish: use libvmod-querysort on Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/812450 (https://phabricator.wikimedia.org/T138093) (owner: Ori)
[13:14:18] <wikibugs>	 (PS2) CDanis: haproxy: also log high client concurrency [puppet] - https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580)
[13:14:18] <jinxer-wm>	 (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:14:38] <Amir1>	 here?
[13:15:03] <elukey>	 there seems to be a peak of requests
[13:15:16] <cdanis>	 looks like the typical saturation event
[13:15:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:15:22] <wikibugs>	 (PS7) Jbond: service::node::config: drop merge_config [puppet] - https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639)
[13:15:29] <jynus>	 similar to what happened last week maybe?
[13:16:05] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:16:05] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:16:05] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:16:05] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:16:05] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:16:06] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:16:13] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:16:13] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:16:13] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:16:13] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:16:15] <akosiaris>	 Wow
[13:16:24] * Emperor here
[13:16:30] * volans here if needed
[13:16:42] * jbond here
[13:17:05] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:17:05] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:17:25] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 1675 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:17:31] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:17:49] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:17:49] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:17:49] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:17:55] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:17:57] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:17:57] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:18:07] <elukey>	 should we transition to #sre and open an incident?
[13:18:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.32.17:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.32.17:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2
[13:18:35] <icinga-wm>	 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:18:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.16.125:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.16.125:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2
[13:18:47] <icinga-wm>	 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:18:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.48.67:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.48.67:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2
[13:18:47] <icinga-wm>	 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:18:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.16.97:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.16.97:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2
[13:18:47] <icinga-wm>	 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:18:49] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.16.23:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.16.23:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2
[13:18:49] <icinga-wm>	 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:19:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.48.183:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.48.183:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2
[13:19:05] <icinga-wm>	 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:19:26] <jinxer-wm>	 (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:19:27] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes1022.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1014.eqiad.wmnet are marked down but pooled: parsoid-php_443: Servers wtp1029.eqiad.wmnet, wtp1048.eqiad.wmnet, wtp1037.eqiad.wmnet, wtp1039.eqiad.wmnet, wtp1042.eqiad.wmnet, wtp1035.eqiad.wmnet, wtp1040.eqiad.wmnet, wtp1031.eqiad.wmne
[13:19:27] <icinga-wm>	 46.eqiad.wmnet, wtp1036.eqiad.wmnet, wtp1034.eqiad.wmnet, wtp1047.eqiad.wmnet, wtp1026.eqiad.wmnet, wtp1045.eqiad.wmnet, wtp1028.eqiad.wmnet, wtp1033.eqiad.wmnet, wtp1025.eqiad.wmnet, wtp1044.eqiad.wmnet, wtp1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:19:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.31:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.0.31:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF
[13:19:27] <icinga-wm>	 S%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:19:29] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes1012.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet are marked down but pooled: parsoid-php_443: Servers wtp1048.eqiad.wmnet, wtp1
[13:19:29] <icinga-wm>	 d.wmnet, wtp1044.eqiad.wmnet, wtp1028.eqiad.wmnet, wtp1033.eqiad.wmnet, wtp1025.eqiad.wmnet, wtp1027.eqiad.wmnet, wtp1039.eqiad.wmnet, wtp1040.eqiad.wmnet, wtp1036.eqiad.wmnet, wtp1034.eqiad.wmnet, wtp1032.eqiad.wmnet, wtp1045.eqiad.wmnet, wtp1029.eqiad.wmnet, wtp1037.eqiad.wmnet, wtp1031.eqiad.wmnet, wtp1038.eqiad.wmnet, wtp1046.eqiad.wmnet, wtp1035.eqiad.wmnet, wtp1043.eqiad.wmnet, wtp1041.eqiad.wmnet, wtp1047.eqiad.wmnet, wtp1026.eqiad
[13:19:29] <icinga-wm>	 wtp1030.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:19:39] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:19:39] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:19:39] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:19:39] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:19:45] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:19:45] <TheresNoTime>	 oh dear
[13:19:45] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:19:45] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:19:45] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:19:45] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:19:46] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:19:47] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:19:47] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:19:47] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:20:19] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:20:57] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:20:59] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:21:03] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:21:03] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:21:15] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:21:17] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:21:23] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:21:27] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:21:29] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:22:05] <wikibugs>	 (PS8) Jbond: service::node::config: drop merge_config [puppet] - https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639)
[13:22:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:07] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:07] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:27] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[13:22:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:49] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:49] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[13:22:52] <wikibugs>	 (CR) CI reject: [V: -1] service::node::config: drop merge_config [puppet] - https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: Jbond)
[13:22:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:55] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[13:22:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:55] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[13:22:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:57] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[13:22:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.48.120:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.48.120:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_
[13:22:59] <icinga-wm>	 9%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:23:01] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[13:23:01] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[13:23:05] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:23:15] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:23:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:23:27] <icinga-wm>	 PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[13:23:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:23:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:23:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:23:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:23:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.100:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.0.100:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28W
[13:23:33] <icinga-wm>	 MCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:23:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:24:03] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:24:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:24:37] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:24:38] <wikibugs>	 (CR) Vgutierrez: haproxy: also log high client concurrency (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580) (owner: CDanis)
[13:24:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:24:54] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 8 DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36247/console" [puppet] - https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: Jbond)
[13:24:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.32.190:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein
[13:25:00] <icinga-wm>	 t on connection while downloading http://10.192.32.190:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:25:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:25:52] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/812821 (owner: Volans)
[13:25:55] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:26:35] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:27:41] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 7 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:28:05] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content 
[13:28:05] <icinga-wm>	  test page) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/R
[13:28:35] <wikibugs>	 (CR) Jbond: [V: +1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: Jbond)
[13:28:47] <wikibugs>	 SRE, SRE-swift-storage, Traffic-Icebox, Performance-Team (Radar), affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (mark) >>! In T256217#7960730, @Krinkle wrote: > I'm not sure since when, but based on us having <14 days ats-be stor...
[13:29:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:29:47] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[13:29:51] <wikibugs>	 (CR) Ladsgroup: Move CirrusSearch settings from IS.php to ext-CirrusSearch.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[13:30:15] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1
[13:30:15] <icinga-wm>	 timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:30:18] <jinxer-wm>	 (ProbeDown) firing: (4) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:30:29] <icinga-wm>	 PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[13:30:35] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2018.codfw.wmnet, restbase2019.codfw.wmnet, restbase2012.codfw.wmnet, restbase2013.codfw.wmnet, restbase2021.codfw.wmnet, restbase2023.codfw.wmnet, restbase2020.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:30:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:31:01] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[13:31:16] <wikibugs>	 SRE, SRE-swift-storage, Traffic-Icebox, Performance-Team (Radar), affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (mark) This appears to be configurable now in Swift 2.24.0 and later (we currently seem to be running 2.26.0 on 6/8 o...
[13:31:29] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v1/page/{language}/{title} (Fetch enwiki protected page) is CRITICAL: Test Fetch enwiki protected page returned the unexpected status 404 (expecting: 200): /v1/page/{language}/{title}/{revision} (Fetch enwiki protected page) is CRITICAL: Test Fetch enwiki protected page returned the unexpected status 404 (expecting: 200): /v2/page/{sourcelanguage}/{targetlanguage}/{tit
[13:31:29] <icinga-wm>	 nslate enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}/{revision} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[13:31:31] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April
[13:31:31] <icinga-wm>	 6 returned the unexpected status 500 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/
[13:31:31] <icinga-wm>	 d/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:31:55] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title}/{revision} (Fetch enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[13:32:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2023.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:32:57] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[13:32:57] <icinga-wm>	 RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[13:33:07] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:33:15] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1041 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 4.553 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:33:17] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[13:33:27] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[13:33:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:33:41] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1039 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:33:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:33:49] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1029 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 6.475 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:33:51] <icinga-wm>	 RECOVERY - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[13:33:53] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[13:33:55] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:33:59] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1037 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 3.277 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:34:03] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1044 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 5.636 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:34:03] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1025 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 0.365 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:34:19] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[13:34:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:34:49] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[13:35:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:35:01] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 5.616 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:35:05] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1031 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 1.851 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:35:09] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 4.678 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:35:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:35:13] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1047 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 9.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:35:17] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:35:18] <jinxer-wm>	 (ProbeDown) firing: (4) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:35:23] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1027 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 3.396 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:35:39] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1031 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:35:45] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[13:35:45] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 1.344 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:35:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:35:49] <icinga-wm>	 RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[13:35:49] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1028 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 5.804 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:35:50] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 6.006 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:35:50] <icinga-wm>	 RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[13:35:53] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[13:35:55] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[13:36:15] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1047 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:36:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:36:23] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1040 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 2.163 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:36:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:36:29] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1035 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:36:29] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1045 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 1.627 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:36:31] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1039 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 2.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:36:33] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1042 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 4.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:36:33] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1029 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:36:37] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1044 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:36:37] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:36:37] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1045 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 0.311 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:36:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:36:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:36:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:36:43] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1038 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:36:43] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1028 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:36:43] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:36:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:36:43] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1040 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:36:45] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1046 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 8.139 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:37:15] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1027 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:37:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:37:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:37:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:37:27] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:37:31] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1025 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:37:31] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:37:31] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1037 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:37:31] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1032 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:37:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:37:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:37:37] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1035 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:37:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:37:37] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1041 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 5.085 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:37:39] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:37:39] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1030 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:37:41] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1032 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:37:43] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1034 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 1.357 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:37:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:37:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:37:45] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1046 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 5.715 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:37:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:37:55] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[13:38:07] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:13] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1042 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:38:17] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1030 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:38:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:19] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1034 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 1.488 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:38:23] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[13:38:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:37] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 2.719 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:38:41] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 1.798 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:38:55] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:56] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to _security IRC channel for TheresNoTime - https://phabricator.wikimedia.org/T312771 (TheresNoTime)
[13:38:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:07] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:09] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1038 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 2.365 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:39:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:39:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:40:18] <jinxer-wm>	 (ProbeDown) resolved: (4) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:40:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:42:41] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: cr2-eqiad:FPC3 partial failure (PIC2/3) - https://phabricator.wikimedia.org/T312745 (ayounsi) a:Cmjohnson Juniper agreed on an RMA, forwarded the email thread to Chris for the shipping details.  @Cmjohnson please sync up with Netops once rec...
[13:47:58] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: Add knative and egress config for eventgate-main [deployment-charts] - https://gerrit.wikimedia.org/r/812010 (https://phabricator.wikimedia.org/T301878) (owner: Elukey)
[13:48:23] <wikibugs>	 (CR) Volans: [C: +2] tests: reduce runtime by more than 80% [software/spicerack] - https://gerrit.wikimedia.org/r/812821 (owner: Volans)
[13:48:47] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:49:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:50:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:53:28] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1005.wikimedia.org with OS bullseye
[13:53:32] <wikibugs>	 (CR) CDanis: haproxy: also log high client concurrency (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580) (owner: CDanis)
[13:53:33] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors...
[13:53:40] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye
[13:53:46] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye
[13:53:48] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1005.wikimedia.org with OS bullseye
[13:53:52] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors...
[13:54:30] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye
[13:54:36] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye
[13:55:49] <wikibugs>	 (Merged) jenkins-bot: tests: reduce runtime by more than 80% [software/spicerack] - https://gerrit.wikimedia.org/r/812821 (owner: Volans)
[13:58:16] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36248/console" [puppet] - https://gerrit.wikimedia.org/r/812449 (owner: Jbond)
[13:59:02] <wikibugs>	 (CR) Vgutierrez: [C: +1] haproxy: also log high client concurrency [puppet] - https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580) (owner: CDanis)
[13:59:31] <wikibugs>	 (CR) Jbond: [C: +2] resolvconf: add parameter to disable managing resolvconf [puppet] - https://gerrit.wikimedia.org/r/812455 (owner: Jbond)
[13:59:41] <wikibugs>	 (PS3) Jbond: wmflib: make wmflib::resource::import safe for puppet apply [puppet] - https://gerrit.wikimedia.org/r/812449
[13:59:58] <wikibugs>	 (PS3) Jbond: resolvconf: add parameter to disable managing resolvconf [puppet] - https://gerrit.wikimedia.org/r/812455
[14:02:04] <wikibugs>	 (PS1) David Caro: wmcs: use a nicer task title [puppet] - https://gerrit.wikimedia.org/r/812869
[14:05:01] <wikibugs>	 (PS4) Jbond: resolvconf: add parameter to disable managing resolvconf [puppet] - https://gerrit.wikimedia.org/r/812455
[14:05:03] <wikibugs>	 (PS3) Jbond: P:environment: add dependency to vim package [puppet] - https://gerrit.wikimedia.org/r/812457
[14:05:17] <wikibugs>	 (PS2) Jbond: base::firewall: add flag do disable managing nf_conntrack hashsize [puppet] - https://gerrit.wikimedia.org/r/812461
[14:05:36] <wikibugs>	 (PS2) Jbond: P:base: dont use haveged in containers [puppet] - https://gerrit.wikimedia.org/r/812555
[14:07:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2175.mgmt.codfw.wmnet with reboot policy FORCED
[14:08:49] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2175.mgmt.codfw.wmnet with reboot policy FORCED
[14:09:20] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2175.mgmt.codfw.wmnet with reboot policy FORCED
[14:10:23] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2175.mgmt.codfw.wmnet with reboot policy FORCED
[14:11:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2175.mgmt.codfw.wmnet with reboot policy FORCED
[14:11:41] <wikibugs>	 (CR) Filippo Giunchedi: wmcs: use a nicer task title (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812869 (owner: David Caro)
[14:18:04] <wikibugs>	 (CR) Jbond: [C: +2] P:environment: add dependency to vim package [puppet] - https://gerrit.wikimedia.org/r/812457 (owner: Jbond)
[14:18:08] <wikibugs>	 (CR) Jbond: [C: +2] base::firewall: add flag do disable managing nf_conntrack hashsize [puppet] - https://gerrit.wikimedia.org/r/812461 (owner: Jbond)
[14:18:12] <wikibugs>	 (CR) Jbond: [C: +2] P:base: dont use haveged in containers [puppet] - https://gerrit.wikimedia.org/r/812555 (owner: Jbond)
[14:19:18] <jinxer-wm>	 (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:21:18] <jinxer-wm>	 (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:21:53] <godog>	 there we go
[14:22:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.32.71:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.32.71:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:22:37] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:37] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:43] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:43] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:54] <TheresNoTime>	 How surprising :D
[14:22:57] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:19] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:29] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:29] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:29] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:29] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:37] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:37] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:45] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 2202 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:24:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:24:55] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:24:59] <wikibugs>	 (CR) Ori: "This change is ready for review." [software/varnish/libvmod-querysort] - https://gerrit.wikimedia.org/r/812873 (owner: Ori)
[14:25:01] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1030 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:25:01] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1034 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:25:05] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:25:05] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:25:19] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:25:43] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1040 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:25:53] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1045 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:25:55] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1035 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:25:55] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1038 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:25:55] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1039 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:03] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:03] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1044 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:18] <jinxer-wm>	 (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:27:11] <wikibugs>	 (Abandoned) Ori: Dummy change to test CI [software/varnish/libvmod-querysort] - https://gerrit.wikimedia.org/r/812873 (owner: Ori)
[14:28:51] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:29:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:31:05] <wikibugs>	 (PS17) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224
[14:31:37] <wikibugs>	 (CR) Jbond: beaker: add initial beaker files (WIP) (2 comments) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond)
[14:34:01] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1005.wikimedia.org with OS bullseye
[14:34:06] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors...
[14:34:14] <wikibugs>	 (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond)
[14:34:17] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye
[14:34:22] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye
[14:34:24] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1005.wikimedia.org with OS bullseye
[14:34:29] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors...
[14:34:50] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye
[14:34:55] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye
[14:43:27] <wikibugs>	 (PS2) Ladsgroup: wwwportals: Make sure portal assets are also visible in wikibooks vhost [puppet] - https://gerrit.wikimedia.org/r/812843 (https://phabricator.wikimedia.org/T273179)
[14:43:33] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] wwwportals: Make sure portal assets are also visible in wikibooks vhost [puppet] - https://gerrit.wikimedia.org/r/812843 (https://phabricator.wikimedia.org/T273179) (owner: Ladsgroup)
[14:46:49] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:54:56] <wikibugs>	 (PS1) Jbond: prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - https://gerrit.wikimedia.org/r/812881
[14:56:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2175.mgmt.codfw.wmnet with reboot policy FORCED
[14:56:58] <wikibugs>	 (CR) CI reject: [V: -1] prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - https://gerrit.wikimedia.org/r/812881 (owner: Jbond)
[14:58:53] <wikibugs>	 (CR) Filippo Giunchedi: wmcs: Add ceph cluster alerts (2 comments) [alerts] - https://gerrit.wikimedia.org/r/812706 (owner: David Caro)
[15:01:59] <wikibugs>	 (PS2) Jbond: prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - https://gerrit.wikimedia.org/r/812881
[15:03:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (cmooney) >>! In T304888#8062648, @Cmjohnson wrote: > all but the cloudnets installed correctly, they're still presenting the dhcp error. I am thi...
[15:03:42] <wikibugs>	 (CR) CI reject: [V: -1] prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - https://gerrit.wikimedia.org/r/812881 (owner: Jbond)
[15:07:22] <wikibugs>	 (PS4) Jbond: wmflib: make wmflib::resource::import safe for puppet apply [puppet] - https://gerrit.wikimedia.org/r/812449
[15:08:57] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1005.wikimedia.org with OS bullseye
[15:09:02] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors...
[15:09:29] <wikibugs>	 (PS5) Jbond: wmflib: make wmflib::resource::import safe for puppet apply [puppet] - https://gerrit.wikimedia.org/r/812449
[15:15:04] <wikibugs>	 (PS1) Jbond: release: 2.3.2 release [software/puppet-compiler] - https://gerrit.wikimedia.org/r/812887
[15:17:24] <wikibugs>	 (PS1) Jbond: puppet_compiler: bump version number [puppet] - https://gerrit.wikimedia.org/r/812888
[15:17:26] <wikibugs>	 (CR) CI reject: [V: -1] release: 2.3.2 release [software/puppet-compiler] - https://gerrit.wikimedia.org/r/812887 (owner: Jbond)
[15:17:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (cmooney) Just for the record cloudnet1005 did seem to install ok.  Or at least DHCP did not fail at PXE or debian-installer stage.  It's using NI...
[15:17:37] <wikibugs>	 (PS1) Nskaggs: Force depends so setup.py install works [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812889
[15:19:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2175.codfw.wmnet with OS bullseye
[15:19:48] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/812891
[15:20:15] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2175.codfw.wmnet with OS bullseye
[15:20:37] <wikibugs>	 SRE, ops-codfw, Discovery-Search, Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (EBernhardson)
[15:22:44] <wikibugs>	 (PS3) Jbond: prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - https://gerrit.wikimedia.org/r/812881
[15:23:31] <wikibugs>	 (PS2) Jbond: release: 2.3.2 release [software/puppet-compiler] - https://gerrit.wikimedia.org/r/812887
[15:23:39] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye
[15:23:45] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye
[15:23:47] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1005.wikimedia.org with OS bullseye
[15:23:52] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors...
[15:25:47] <wikibugs>	 (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/812892 (https://phabricator.wikimedia.org/T128546)
[15:27:15] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye
[15:27:21] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye
[15:27:22] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1005.wikimedia.org with OS bullseye
[15:27:28] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors...
[15:27:43] <wikibugs>	 (CR) Jbond: [C: +2] prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - https://gerrit.wikimedia.org/r/812881 (owner: Jbond)
[15:27:46] <wikibugs>	 (CR) Jbond: [C: +2] release: 2.3.2 release [software/puppet-compiler] - https://gerrit.wikimedia.org/r/812887 (owner: Jbond)
[15:28:11] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye
[15:28:17] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye
[15:30:05] <jouncebot>	 jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220711T1530).
[15:30:15] <wikibugs>	 (Merged) jenkins-bot: prepare: add storeconfig to production puppet.conf [software/puppet-compiler] - https://gerrit.wikimedia.org/r/812881 (owner: Jbond)
[15:31:59] <wikibugs>	 (Merged) jenkins-bot: release: 2.3.2 release [software/puppet-compiler] - https://gerrit.wikimedia.org/r/812887 (owner: Jbond)
[15:32:43] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:34:28] <wikibugs>	 (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/812892 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[15:34:30] <wikibugs>	 (CR) David Caro: Force depends so setup.py install works (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812889 (owner: Nskaggs)
[15:35:15] <wikibugs>	 (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/812892 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[15:36:56] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:38:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2175.codfw.wmnet with reason: host reimage
[15:39:37] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:812892| Bumping portals to master (T128546)]] (duration: 02m 58s)
[15:39:40] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[15:39:59] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good!" [puppet] - https://gerrit.wikimedia.org/r/808219 (owner: Jbond)
[15:40:11] <wikibugs>	 (CR) David Caro: [C: +1] "Feel free to +2 when the comment is in :)" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812889 (owner: Nskaggs)
[15:41:04] <wikibugs>	 (CR) Jbond: [C: +2] puppet_compiler: bump version number [puppet] - https://gerrit.wikimedia.org/r/812888 (owner: Jbond)
[15:41:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2175.codfw.wmnet with reason: host reimage
[15:41:50] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:42:29] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:812892| Bumping portals to master (T128546)]] (duration: 02m 51s)
[15:42:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:45:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:45:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:45:24] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1005.wikimedia.org with reason: host reimage
[15:46:19] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36249/console" [puppet] - https://gerrit.wikimedia.org/r/812449 (owner: Jbond)
[15:48:32] <wikibugs>	 (PS2) JMeybohm: Remove statsd from _scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/812333
[15:48:34] <wikibugs>	 (PS1) JMeybohm: Allow to enable access logging in tls-terminator [deployment-charts] - https://gerrit.wikimedia.org/r/812895
[15:48:54] <wikibugs>	 (CR) David Caro: wmcs: Add ceph cluster alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/812706 (owner: David Caro)
[15:49:08] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1005.wikimedia.org with reason: host reimage
[15:49:24] <wikibugs>	 ops-codfw, decommission-hardware: decommission db2076 - https://phabricator.wikimedia.org/T312190 (Papaul)
[15:49:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:49:42] <wikibugs>	 ops-codfw, decommission-hardware: decommission db2076 - https://phabricator.wikimedia.org/T312190 (Papaul) Open→Resolved complete
[15:50:03] <wikibugs>	 SRE, ops-codfw, decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (Papaul)
[15:50:29] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for DDesouza - https://phabricator.wikimedia.org/T312676 (jhathaway)
[15:50:34] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "LGTM but I'd use something local_access_log_min_code to clarify it only works on the local downstream." [deployment-charts] - https://gerrit.wikimedia.org/r/812895 (owner: JMeybohm)
[15:50:36] <wikibugs>	 (CR) David Caro: wmcs: use a nicer task title (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812869 (owner: David Caro)
[15:50:41] <wikibugs>	 SRE, ops-codfw, decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (Papaul) Open→Resolved complete
[15:51:10] <wikibugs>	 SRE, ops-codfw, decommission-hardware: decommission db2080 - https://phabricator.wikimedia.org/T312618 (Papaul) Open→Resolved complete
[15:51:54] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:52:27] <wikibugs>	 (PS2) JMeybohm: Allow to enable access logging in tls-terminator [deployment-charts] - https://gerrit.wikimedia.org/r/812895
[15:52:29] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] wmflib: make wmflib::resource::import safe for puppet apply [puppet] - https://gerrit.wikimedia.org/r/812449 (owner: Jbond)
[15:52:41] <wikibugs>	 (PS3) JMeybohm: Allow to enable access logging in tls-terminator [deployment-charts] - https://gerrit.wikimedia.org/r/812895
[15:52:44] <wikibugs>	 (PS2) David Caro: wmcs: use a nicer task title [puppet] - https://gerrit.wikimedia.org/r/812869
[15:53:11] <wikibugs>	 (PS18) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224
[15:54:06] <wikibugs>	 (PS1) Mforns: Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/812897 (https://phabricator.wikimedia.org/T290303)
[15:54:40] <wikibugs>	 (CR) Ottomata: [C: +1] Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/812897 (https://phabricator.wikimedia.org/T290303) (owner: Mforns)
[15:54:44] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] Allow to enable access logging in tls-terminator [deployment-charts] - https://gerrit.wikimedia.org/r/812895 (owner: JMeybohm)
[15:55:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2175.codfw.wmnet with OS bullseye
[15:55:34] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2175.codfw.wmnet with OS bullseye completed: - db2...
[15:56:21] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:56:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (owner: 10Jbond)
[15:56:47] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:57:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, see inline for non-blocking comment" [puppet] - 10https://gerrit.wikimedia.org/r/812869 (owner: 10David Caro)
[15:58:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] wmcs: Add ceph cluster alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/812706 (owner: 10David Caro)
[16:00:16] <wikibugs>	 (03PS2) 10Krinkle: Enable wgResourceLoaderUseObjectCacheForDeps for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812147 (https://phabricator.wikimedia.org/T113916)
[16:00:20] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] Enable wgResourceLoaderUseObjectCacheForDeps for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812147 (https://phabricator.wikimedia.org/T113916) (owner: 10Krinkle)
[16:00:35] <wikibugs>	 (03PS3) 10David Caro: wmcs: use a nicer task title [puppet] - 10https://gerrit.wikimedia.org/r/812869
[16:00:37] <wikibugs>	 (03CR) 10David Caro: wmcs: use a nicer task title (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812869 (owner: 10David Caro)
[16:00:57] <wikibugs>	 (03CR) 10David Caro: wmcs: use a nicer task title (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812869 (owner: 10David Caro)
[16:01:06] <wikibugs>	 (03Merged) 10jenkins-bot: Enable wgResourceLoaderUseObjectCacheForDeps for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812147 (https://phabricator.wikimedia.org/T113916) (owner: 10Krinkle)
[16:02:44] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10Papaul)
[16:03:33] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10Papaul) 05Open→03Resolved @Marostegui All yours
[16:04:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:05:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:05:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:06:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:07:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] wmcs: use a nicer task title [puppet] - 10https://gerrit.wikimedia.org/r/812869 (owner: 10David Caro)
[16:10:25] <wikibugs>	 (03PS4) 10David Caro: wmcs: use a nicer task title [puppet] - 10https://gerrit.wikimedia.org/r/812869
[16:11:28] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I82262ef6773ab228 (duration: 02m 55s)
[16:12:30] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1005.wikimedia.org with OS bullseye
[16:12:36] <wikibugs>	 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye completed: - cloudel...
[16:13:02] <wikibugs>	 (03PS10) 10David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368
[16:13:04] <wikibugs>	 (03PS3) 10David Caro: openstack: move known nodes to the openstack lib [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810854
[16:13:06] <wikibugs>	 (03PS5) 10David Caro: wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914
[16:13:08] <wikibugs>	 (03CR) 10David Caro: wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 (owner: 10David Caro)
[16:13:10] <wikibugs>	 (03PS5) 10David Caro: wmcs.openstack: Use the known cloudcontrols instead of asking [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915
[16:13:12] <wikibugs>	 (03CR) 10David Caro: wmcs.openstack: Use the known cloudcontrols instead of asking (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915 (owner: 10David Caro)
[16:13:14] <wikibugs>	 (03PS1) 10David Caro: WIP: add alert handling to ceph custer downtime [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812900
[16:13:16] <wikibugs>	 (03PS1) 10David Caro: wmcs: use run_* instead of run_sync/run_async [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812901
[16:13:18] <wikibugs>	 (03PS1) 10David Caro: WIP: toolforge.grid.get_cluster_status: show extended queue info [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812902
[16:13:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] wmcs: use a nicer task title [puppet] - 10https://gerrit.wikimedia.org/r/812869 (owner: 10David Caro)
[16:14:41] <wikibugs>	 (03CR) 10David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 (owner: 10David Caro)
[16:15:16] <wikibugs>	 (03PS1) 10Jbond: wmflib: create a helper function for querying puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/812904
[16:15:49] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:26] <wikibugs>	 (03PS1) 10JHathaway: admin: add ddesouza-ctr@wikimedia.org to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/812905 (https://phabricator.wikimedia.org/T312676)
[16:19:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib: create a helper function for querying puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/812904 (owner: 10Jbond)
[16:20:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: add alert handling to ceph custer downtime [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812900 (owner: 10David Caro)
[16:21:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: toolforge.grid.get_cluster_status: show extended queue info [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812902 (owner: 10David Caro)
[16:22:01] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin: add ddesouza-ctr@wikimedia.org to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/812905 (https://phabricator.wikimedia.org/T312676) (owner: 10JHathaway)
[16:22:09] <wikibugs>	 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10bking)
[16:23:20] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] admin: add ddesouza-ctr@wikimedia.org to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/812905 (https://phabricator.wikimedia.org/T312676) (owner: 10JHathaway)
[16:24:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs: use run_* instead of run_sync/run_async [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812901 (owner: 10David Caro)
[16:24:45] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for DDesouza - https://phabricator.wikimedia.org/T312676 (10jhathaway) 05Open→03Resolved a:03jhathaway @DDeSouza access granted, please let me know if you have any issues.
[16:24:47] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs: use a nicer task title [puppet] - 10https://gerrit.wikimedia.org/r/812869 (owner: 10David Caro)
[16:26:14] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking)
[16:28:14] <wikibugs>	 (03CR) 10David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 (owner: 10David Caro)
[16:28:47] <wikibugs>	 (03PS2) 10Jbond: wmflib: create a helper function for querying puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/812904
[16:28:49] <wikibugs>	 (03PS1) 10Jbond: wmflib: migrae all calls for puppetdb_query to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/812907
[16:30:45] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36251/console" [puppet] - 10https://gerrit.wikimedia.org/r/812904 (owner: 10Jbond)
[16:31:27] <wikibugs>	 (03PS6) 10Ori: New service: function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698)
[16:31:51] <wikibugs>	 (03CR) 10Ori: New service: function-evaluator (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori)
[16:33:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib: migrae all calls for puppetdb_query to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/812907 (owner: 10Jbond)
[16:34:10] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@02ab1c2]: use mode=reschedule on all airflow sensors
[16:34:27] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36252/console" [puppet] - 10https://gerrit.wikimedia.org/r/812907 (owner: 10Jbond)
[16:36:13] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@02ab1c2]: use mode=reschedule on all airflow sensors (duration: 02m 02s)
[16:49:43] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DDesouza - https://phabricator.wikimedia.org/T312676 (10DDeSouza) @Ottamata Thanks! I have `wmf`.  @jhathaway Thanks! I was getting access denied at first but after a few tries it magically worked.
[16:57:07] <wikibugs>	 (03PS1) 10Aqu: Set AQS mediawiki history snapshot to 2022 June [puppet] - 10https://gerrit.wikimedia.org/r/812913
[17:00:04] <jouncebot>	 ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220711T1700).
[17:01:20] <wikibugs>	 (03PS2) 10Aqu: Set AQS mediawiki history snapshot to 2022 June [puppet] - 10https://gerrit.wikimedia.org/r/812913
[17:06:17] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Set AQS mediawiki history snapshot to 2022 June [puppet] - 10https://gerrit.wikimedia.org/r/812913 (owner: 10Aqu)
[17:07:48] <wikibugs>	 (03PS1) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916
[17:09:50] <wikibugs>	 (03PS2) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916
[17:10:19] <logmsgbot>	 !log otto@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[17:15:09] <wikibugs>	 (03CR) 10David Caro: "You can format the code with:" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs)
[17:16:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs)
[17:25:37] <wikibugs>	 (03PS3) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916
[17:27:41] <wikibugs>	 (03PS4) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916
[17:28:17] <wikibugs>	 (03CR) 10Nskaggs: Ensure quota_increase cookbook runs and validates (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs)
[17:29:58] <wikibugs>	 (03PS2) 10BCornwall: varnish: Port over traffic_drop from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/812424 (https://phabricator.wikimedia.org/T300723)
[17:30:13] <wikibugs>	 (03CR) 10BCornwall: varnish: Port over traffic_drop from Icinga (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/812424 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall)
[17:34:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs)
[18:05:28] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:12:13] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (10KFrancis) Hi all, the NDA has been signed and completed for WMDE LDAP group access.  Please proceed with the request.  Thanks!
[18:16:32] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:18:56] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:19:04] <wikibugs>	 (03PS1) 10Ssingh: durum: add console log message [puppet] - 10https://gerrit.wikimedia.org/r/812919
[18:19:55] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36253/console" [puppet] - 10https://gerrit.wikimedia.org/r/812919 (owner: 10Ssingh)
[18:23:20] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:25:02] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: add console log message [puppet] - 10https://gerrit.wikimedia.org/r/812919 (owner: 10Ssingh)
[18:29:36] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good, just some minor comments" [puppet] - 10https://gerrit.wikimedia.org/r/812904 (owner: 10Jbond)
[18:30:20] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/812907 (owner: 10Jbond)
[18:32:34] <logmsgbot>	 !log otto@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[18:36:25] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T312626 - Still working on this https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:39:02] <icinga-wm>	 PROBLEM - Host mw2376.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:45:51] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] wmflib: create a helper function for querying puppetdb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812904 (owner: 10Jbond)
[18:50:50] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:02:39] <wikibugs>	 10SRE, 10Traffic-Icebox: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (10Krinkle) This appears to now work as expected. I'm guessing that [[ https://wikitech.wikimedia.org/wiki/HAProxy | HAProxy ]] is better at this than Nginx. I don't recall if we verified it on ATS (ats...
[19:02:55] <wikibugs>	 10SRE, 10Traffic: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (10Krinkle)
[19:06:56] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:08:52] <icinga-wm>	 PROBLEM - Host thumbor2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:25:29] <wikibugs>	 10SRE, 10ops-eqiad: restbase1025 down - https://phabricator.wikimedia.org/T312805 (10Eevans)
[19:25:38] <wikibugs>	 10SRE, 10ops-eqiad: restbase1025 down - https://phabricator.wikimedia.org/T312805 (10Eevans) p:05Triage→03Medium
[19:29:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:30:18] <jinxer-wm>	 (ProbeDown) firing: (5) Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:30:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[19:30:36] <icinga-wm>	 PROBLEM - Apache HTTP on mw1392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:36] <icinga-wm>	 PROBLEM - Apache HTTP on mw1424 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:38] <icinga-wm>	 PROBLEM - Apache HTTP on mw1359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:40] <icinga-wm>	 PROBLEM - Apache HTTP on mw1408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:40] <icinga-wm>	 PROBLEM - Apache HTTP on mw1426 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:40] <icinga-wm>	 PROBLEM - Apache HTTP on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:44] <icinga-wm>	 PROBLEM - Apache HTTP on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:46] <icinga-wm>	 PROBLEM - Apache HTTP on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:48] <icinga-wm>	 PROBLEM - Apache HTTP on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:48] <icinga-wm>	 PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:50] <icinga-wm>	 PROBLEM - Apache HTTP on mw1444 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:50] <icinga-wm>	 PROBLEM - Apache HTTP on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:50] <icinga-wm>	 PROBLEM - Apache HTTP on mw1357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:50] <icinga-wm>	 PROBLEM - Apache HTTP on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:50] <icinga-wm>	 PROBLEM - Apache HTTP on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:54] <icinga-wm>	 PROBLEM - Apache HTTP on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:54] <icinga-wm>	 PROBLEM - Apache HTTP on mw1402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:30:56] <icinga-wm>	 PROBLEM - Apache HTTP on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:06] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[19:31:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:14] <icinga-wm>	 PROBLEM - Apache HTTP on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:16] <icinga-wm>	 PROBLEM - Apache HTTP on mw1422 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:16] <icinga-wm>	 PROBLEM - Apache HTTP on mw1447 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:16] <icinga-wm>	 PROBLEM - Apache HTTP on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:17] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.2-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:31:18] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.2-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:31:20] <icinga-wm>	 PROBLEM - Apache HTTP on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:26] <icinga-wm>	 PROBLEM - Apache HTTP on mw1443 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:26] <icinga-wm>	 PROBLEM - Apache HTTP on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:30] <icinga-wm>	 PROBLEM - Apache HTTP on mw1450 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:30] <icinga-wm>	 PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1423 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:54] <icinga-wm>	 PROBLEM - Apache HTTP on mw1425 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:56] <icinga-wm>	 PROBLEM - Apache HTTP on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:06] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[19:32:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw1428 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw1404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:12] <icinga-wm>	 PROBLEM - Apache HTTP on mw1449 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:18] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1422.eqiad.wmnet, mw1426.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1404.eqiad.wmnet, mw1447.eqiad.wmnet, mw1361.eqiad.wmnet, mw1392.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1428.eqiad.wmnet, mw1362.eqiad.wmnet, mw1358.eqiad.wmnet, mw1386.eqiad.wmnet, mw1348.eqiad.wmnet, mw1342.eqiad.wmnet, mw1402.eq
[19:32:18] <icinga-wm>	 t, mw1448.eqiad.wmnet, mw1381.eqiad.wmnet, mw1388.eqiad.wmnet, mw1317.eqiad.wmnet, mw1340.eqiad.wmnet, mw1449.eqiad.wmnet, mw1443.eqiad.wmnet, mw1343.eqiad.wmnet, mw1421.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1396.eqiad.wmnet, mw1314.eqiad.wmnet, mw1424.eqiad.wmnet, mw1412.eqiad.wmnet, mw1408.eqiad.wmnet, mw1398.eqiad.wmnet, mw1363.eqiad.wmnet, mw1357.eqiad.wmnet, mw1423.eqiad.wmnet, mw1378.eqiad.wmnet,
[19:32:18] <icinga-wm>	 eqiad.wmnet, mw1444.eqiad.wmnet, mw1316.eqiad.wmnet, mw1379.eqiad.wmnet, mw1376.eqiad.wmnet, mw1312.eqiad.wmnet, mw1394.eqiad.wmnet, mw1383.eqiad.wmnet, mw1427.eqiad.wmnet, mw1406.eqiad https://wikitech.wikimedia.org/wiki/PyBal
[19:32:21] <TheresNoTime>	 I see #page s on klaxon so ✨
[19:32:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:32:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:32:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:32:32] <icinga-wm>	 PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1422.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1404.eqiad.wmnet, mw1447.eqiad.wmnet, mw1427.eqiad.wmnet, mw1361.eqiad.wmnet, mw1406.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1348.eqiad.wmnet, mw1314.eqiad.wmnet, mw1386.eqiad.wmnet, mw1448.eqiad.wmnet, mw1402.eqiad.wmnet, mw1390.eqiad.wmnet, mw1381.eq
[19:32:34] <icinga-wm>	 t, mw1362.eqiad.wmnet, mw1340.eqiad.wmnet, mw1449.eqiad.wmnet, mw1443.eqiad.wmnet, mw1421.eqiad.wmnet, mw1347.eqiad.wmnet, mw1345.eqiad.wmnet, mw1358.eqiad.wmnet, mw1424.eqiad.wmnet, mw1412.eqiad.wmnet, mw1444.eqiad.wmnet, mw1376.eqiad.wmnet, mw1363.eqiad.wmnet, mw1315.eqiad.wmnet, mw1423.eqiad.wmnet, mw1317.eqiad.wmnet, mw1425.eqiad.wmnet, mw1408.eqiad.wmnet, mw1316.eqiad.wmnet, mw1379.eqiad.wmnet, mw1312.eqiad.wmnet, mw1394.eqiad.wmnet,
[19:32:34] <icinga-wm>	 eqiad.wmnet, mw1383.eqiad.wmnet, mw1400.eqiad.wmnet, mw1392.eqiad.wmnet, mw1375.eqiad.wmnet, mw1342.eqiad.wmnet, mw1360.eqiad.wmnet, mw1382.eqiad.wmnet, mw1398.eqiad.wmnet, mw1341.eqiad https://wikitech.wikimedia.org/wiki/PyBal
[19:32:36] <icinga-wm>	 PROBLEM - Apache HTTP on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:36] <icinga-wm>	 PROBLEM - Apache HTTP on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:36] <icinga-wm>	 PROBLEM - Apache HTTP on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:36] <icinga-wm>	 PROBLEM - Apache HTTP on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:38] <icinga-wm>	 PROBLEM - Apache HTTP on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:38] <icinga-wm>	 PROBLEM - Apache HTTP on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:32:50] <icinga-wm>	 PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[19:32:56] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POS
[19:33:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:04] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[19:33:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:06] <icinga-wm>	 PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:33:06] <icinga-wm>	 PROBLEM - Apache HTTP on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:33:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:08] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 2585 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:33:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:12] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[19:33:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P31005 and previous config saved to /var/cache/conftool/dbconfig/20220711-193315-marostegui.json
[19:33:18] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[19:33:20] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[19:33:20] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 1258 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:33:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:24] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[19:33:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:30] <icinga-wm>	 PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou
[19:33:30] <icinga-wm>	  nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[19:33:30] <icinga-wm>	 PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou
[19:33:30] <icinga-wm>	  nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[19:33:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:34] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[19:33:50] <icinga-wm>	 RECOVERY - Apache HTTP on mw1447 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.711 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:33:52] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[19:33:52] <icinga-wm>	 RECOVERY - Apache HTTP on mw1422 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 8.653 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:33:52] <icinga-wm>	 RECOVERY - Apache HTTP on mw1386 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 9.801 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:33:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:33:56] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[19:34:00] <icinga-wm>	 RECOVERY - Apache HTTP on mw1443 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.375 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:34:02] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[19:34:02] <icinga-wm>	 RECOVERY - Apache HTTP on mw1450 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:34:04] <icinga-wm>	 RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.070 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:34:06] <icinga-wm>	 RECOVERY - Apache HTTP on mw1423 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.717 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:34:06] <icinga-wm>	 RECOVERY - Apache HTTP on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:34:14] <icinga-wm>	 PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[19:34:16] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 503 (expecting: 200): 
[19:34:16] <icinga-wm>	 est/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[19:34:22] <icinga-wm>	 RECOVERY - Apache HTTP on mw1425 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:34:24] <icinga-wm>	 RECOVERY - Apache HTTP on mw1378 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.315 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:34:34] <icinga-wm>	 RECOVERY - Apache HTTP on mw1428 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:34:34] <icinga-wm>	 RECOVERY - Apache HTTP on mw1404 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:34:36] <icinga-wm>	 RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.539 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:34:36] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[19:34:38] <icinga-wm>	 RECOVERY - Apache HTTP on mw1449 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:34:44] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wik
[19:34:52] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:34:58] <icinga-wm>	 RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:00] <icinga-wm>	 RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:00] <icinga-wm>	 RECOVERY - Apache HTTP on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:04] <icinga-wm>	 RECOVERY - Apache HTTP on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:04] <icinga-wm>	 RECOVERY - Apache HTTP on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:04] <icinga-wm>	 RECOVERY - Apache HTTP on mw1400 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:04] <icinga-wm>	 RECOVERY - Apache HTTP on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.296 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:06] <icinga-wm>	 RECOVERY - Apache HTTP on mw1381 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:06] <icinga-wm>	 RECOVERY - Apache HTTP on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:10] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:35:26] <icinga-wm>	 RECOVERY - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[19:35:28] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[19:35:32] <icinga-wm>	 RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:33] <jinxer-wm>	 (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[19:35:34] <icinga-wm>	 RECOVERY - Apache HTTP on mw1360 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:38] <icinga-wm>	 RECOVERY - Apache HTTP on mw1406 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:38] <icinga-wm>	 RECOVERY - Apache HTTP on mw1392 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:40] <icinga-wm>	 RECOVERY - Apache HTTP on mw1424 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:42] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[19:35:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:42] <icinga-wm>	 RECOVERY - Apache HTTP on mw1359 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1408 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1426 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:45] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[19:35:46] <icinga-wm>	 RECOVERY - Apache HTTP on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.880 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:35:48] <icinga-wm>	 RECOVERY - Apache HTTP on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:48] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[19:35:50] <icinga-wm>	 RECOVERY - Apache HTTP on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:52] <icinga-wm>	 RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.660 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1444 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1390 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1374 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1357 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:58] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[19:35:58] <icinga-wm>	 RECOVERY - Apache HTTP on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:58] <icinga-wm>	 RECOVERY - Apache HTTP on mw1402 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:36:00] <icinga-wm>	 RECOVERY - Apache HTTP on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:36:02] <icinga-wm>	 RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[19:36:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:36:02] <icinga-wm>	 RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[19:36:04] <icinga-wm>	 RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[19:36:04] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[19:36:06] <jinxer-wm>	 (ProbeDown) resolved: (5) Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:36:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:36:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:36:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:36:14] <icinga-wm>	 RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:36:22] <icinga-wm>	 RECOVERY - Apache HTTP on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.221 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:36:26] <icinga-wm>	 RECOVERY - Apache HTTP on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:36:26] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[19:36:26] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[19:36:28] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.2-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:36:30] <icinga-wm>	 RECOVERY - Apache HTTP on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:36:33] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.2-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:36:34] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:36:48] <icinga-wm>	 RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[19:36:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[19:38:08] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[19:38:20] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:38:30] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:38:52] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[19:38:58] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[19:39:04] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[19:39:08] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:40:50] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:40:56] <jinxer-wm>	 (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[19:41:21] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[19:41:48] <logmsgbot>	 !log hashar@deploy1002 Started deploy [integration/docroot@fc5d65a]: Add language-data library
[19:41:55] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[19:41:57] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [integration/docroot@fc5d65a]: Add language-data library (duration: 00m 08s)
[19:43:06] <TheresNoTime>	 y'all behaving now..?
[19:43:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[19:43:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[19:44:22] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[19:47:24] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[19:48:14] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220711T2000).
[20:00:05] <jouncebot>	 mforns: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:15] <mforns>	 hii!
[20:02:21] <TheresNoTime>	 I'm able to deploy if absolutely needed, but let's give urbanecm and cjming a few more minutes ^^
[20:02:39] <mforns>	 👍
[20:02:58] <urbanecm>	 I can deploy but I need like 10 minutes
[20:03:11] <mforns>	 no problem!
[20:03:18] <TheresNoTime>	 ^^ phew!
[20:04:17] <urbanecm>	 TheresNoTime: if you want to try, you can press the buttons after I get to my laptop, and i can stand by. How does that sound?
[20:04:53] <TheresNoTime>	 urbanecm: sure :) I've taken a look at it and I'm comfortable with the deploy, just the idea of doing it alone wasn't ideal
[20:06:50] <urbanecm>	 Yeah, definitely. In that case, I'll ping you in a few minutes
[20:06:56] <TheresNoTime>	 sure :)
[20:08:34] <TheresNoTime>	 thanks for bearing with us mforns :) this will be my second (?) deployment so... cross your fingers!
[20:09:03] <mforns>	 sure TheresNoTime! full confidence :]
[20:11:22] <urbanecm>	 TheresNoTime: I'm at my laptop now, so feel free to go ahead!
[20:11:30] <TheresNoTime>	 okay :)
[20:11:40] <urbanecm>	 happy to answer any questions, or step in if needed.
[20:12:00] <wikibugs>	 (CR) Samtar: [C: +2] Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/812897 (https://phabricator.wikimedia.org/T290303) (owner: Mforns)
[20:12:53] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I82262ef6773ab228 try again ref T311788 (duration: 03m 07s)
[20:12:58] <stashbot>	 T311788: MW wmf-config tmp cache stays outdated after Scap deploy (opcache revalidation is off) - https://phabricator.wikimedia.org/T311788
[20:13:49] <wikibugs>	 (PS2) Samtar: Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/812897 (https://phabricator.wikimedia.org/T290303) (owner: Mforns)
[20:14:22] <urbanecm>	 Krinkle: i just saw your sync-file, fyi it's B&C time now (TheresNoTime's doing the deploy)
[20:15:18] <Krinkle>	 ack
[20:15:47] <TheresNoTime>	 urbanecm: realise I did the rebase -> +2 the wrong way around there, doesn't make a difference though, correct? Other than now having to manually "submit" the patch to merge?
[20:16:19] <urbanecm>	 TheresNoTime: looks so. remove the -2, ensure it's on master, re-apply it is the "correct" way to fix this when it happens
[20:16:26] <urbanecm>	 *remove the +2, ofc
[20:16:32] <TheresNoTime>	 thank you, okay
[20:16:49] <wikibugs>	 (CR) Samtar: [C: +2] Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/812897 (https://phabricator.wikimedia.org/T290303) (owner: Mforns)
[20:17:07] <urbanecm>	 looks like jenkins noticed it this time
[20:18:23] <wikibugs>	 (Merged) jenkins-bot: Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/812897 (https://phabricator.wikimedia.org/T290303) (owner: Mforns)
[20:18:36] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:19:57] <TheresNoTime>	 mforns: should be live on mwdebug1001 now, are you able to test?
[20:20:06] <mforns>	 yes! trying
[20:24:26] <wikibugs>	 SRE, ops-eqiad: restbase1025 down - https://phabricator.wikimedia.org/T312805 (Cmjohnson) Acknowledged and will look into it and update the task with what I find
[20:24:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:25:27] <mforns>	 TheresNoTime: sent a couple events and they appeared in kafka, seems all is working correctly
[20:25:46] <TheresNoTime>	 mforns: thanks :) now deploying
[20:25:54] <mforns>	 👍
[20:27:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:27:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:28:56] <logmsgbot>	 !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:812897|Migrate WikibaseTermboxInteraction from EventLogging to EventGate on all wikis (T290303)]] (duration: 02m 53s)
[20:29:01] <stashbot>	 T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303
[20:29:20] <TheresNoTime>	 mforns: that should be live now, can you test again if needed?
[20:29:27] <mforns>	 yes!
[20:29:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:30:14] <TheresNoTime>	 all looks okay to you urbanecm?
[20:31:09] <urbanecm>	 TheresNoTime: yup yup :)
[20:31:16] <TheresNoTime>	 :)
[20:33:42] <mforns>	 TheresNoTime: I'm looking at grafana, and it seems events are flowing normally. Will continue to monitor for a bit.
[20:33:48] <mforns>	 TheresNoTime: thanks a lot! :]
[20:34:00] <TheresNoTime>	 No worries, thank you for your patience mforns :)
[20:34:17] <mforns>	 👍
[20:34:47] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@a559f82]: subgraph: Use HivePartitionRangeSensor to wait for sparql queries
[20:35:06] <TheresNoTime>	 We've got ~30 minutes, worth calling for any other patches or should I close the window urbanecm?
[20:35:35] <urbanecm>	 TheresNoTime: usually people say so in this chan if they have anything that's not in the calendar, so I'd close
[20:36:06] <TheresNoTime>	 !log UTC late deploys done
[20:36:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:39] <urbanecm>	 thanks for the deployment TheresNoTime!
[20:36:48] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@a559f82]: subgraph: Use HivePartitionRangeSensor to wait for sparql queries (duration: 02m 00s)
[20:36:51] <TheresNoTime>	 thank you for being around! :)
[20:37:12] <urbanecm>	 np
[20:47:15] <wikibugs>	 SRE, DNS, Traffic, WMF-Legal, and 4 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Varnent) >>! In T310738#8035453, @Dzahn wrote: >>>! In T310738#8033789, @LSobanski wrote: >> @Varnent After chatting about this...
[20:48:29] <wikibugs>	 SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle)
[20:53:28] <wikibugs>	 (PS5) Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812916
[21:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220711T2100).
[21:00:28] <sbassett>	 ^ no sec patches for deployment this week AFAIK
[21:02:18] <wikibugs>	 (CR) CI reject: [V: -1] Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812916 (owner: Nskaggs)
[21:05:22] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (bking)
[21:05:27] <wikibugs>	 SRE, Discovery-Search (Current work), Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (bking)
[21:05:58] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (bking) This is (finally) complete, closing...
[21:06:08] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (bking) Open→Resolved
[21:06:10] <wikibugs>	 SRE, Discovery-Search (Current work), Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (bking)
[21:09:11] <wikibugs>	 (PS1) JHathaway: lists: convert apache template to epp [puppet] - https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506)
[21:09:13] <wikibugs>	 (PS1) JHathaway: lists: add apache security configs [puppet] - https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506)
[21:10:01] <wikibugs>	 (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) (owner: JHathaway)
[21:10:14] <wikibugs>	 (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: JHathaway)
[21:14:14] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: JHathaway)
[21:14:22] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) (owner: JHathaway)
[21:47:43] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@3ba1d4c]: subgraph_query_mapping_daily: Increase partitioning to 2048
[21:49:45] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@3ba1d4c]: subgraph_query_mapping_daily: Increase partitioning to 2048 (duration: 02m 02s)
[22:21:36] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:35:17] <wikibugs>	 (PS1) Krinkle: monitoring: Fix broken grafana URLs that include unencoded space [puppet] - https://gerrit.wikimedia.org/r/812945
[22:37:27] <wikibugs>	 (CR) CI reject: [V: -1] monitoring: Fix broken grafana URLs that include unencoded space [puppet] - https://gerrit.wikimedia.org/r/812945 (owner: Krinkle)
[22:47:32] <wikibugs>	 (CR) Krinkle: "I'm not sure I understand the warning from build_notes_url()." [puppet] - https://gerrit.wikimedia.org/r/812945 (owner: Krinkle)
[23:06:49] <wikibugs>	 SRE-OnFire, SRE Observability (FY2021/2022-Q4): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (herron)
[23:10:27] <wikibugs>	 SRE-OnFire, SRE Observability (FY2021/2022-Q4): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (herron) Using a personal google workspace and google cloud account (for the time being) the dispatchdev instance is now creating a new google drive folde...
[23:21:09] <icinga-wm>	 PROBLEM - Zookeeper Server #page on conf1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[23:21:28] <icinga-wm>	 PROBLEM - Zookeeper Server #page on conf1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[23:21:44] <icinga-wm>	 PROBLEM - Zookeeper Server #page on conf1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[23:21:49] <TheresNoTime>	 Oh this happened the other day too :/
[23:22:17] <jhathaway>	 here
[23:22:50] <jhathaway>	 TheresNoTime: what was the cause yesterday?
[23:23:17] <TheresNoTime>	 jhathaway: I'm trying to remember, sorry :/ fairly sure it was the exact same alerts ^
[23:24:10] <icinga-wm>	 PROBLEM - Check unit status of etcd-backup on conf1009 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:25:26] <rzl>	 yeah, afaik this is T311407 and the underlying cause is T312539
[23:25:27] <stashbot>	 T311407: Put conf100[789] in production - https://phabricator.wikimedia.org/T311407
[23:25:57] <rzl>	 last time it came up a.kosiaris silenced the alert, I think that's still the right action here but I don't know if we've made any progress on those hosts while I wasn't looking
[23:26:14] <rzl>	 doesn't look like it from phab, double-checking
[23:26:19] <jhathaway>	 rzl: I don't think so, I don't think zookeeper is even installed
[23:26:35] <jhathaway>	 at least on 1009 it is not
[23:26:43] <rzl>	 good enough for me
[23:26:55] <cwhite>	 looks like host downtime ended 7min ago
[23:27:36] <rzl>	 I'm gonna extend for another week or so, minus a little bit so that it pops earlier in the day
[23:27:49] <rzl>	 hopefully we'll just cancel it before then anyway :)
[23:28:01] <jhathaway>	 rzl: thanks, are you using cumin to do the deed, or manually, just curious?
[23:28:10] <icinga-wm>	 PROBLEM - Check unit status of etcd-backup on conf1008 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:29:09] <rzl>	 jhathaway: the cookbook -- for one host I'd probably fight my way through the web ui, but honestly there's no need
[23:29:28] <volans>	 if those hosts are not yet prod-ready you could also set the disable notification hiera settings for them
[23:29:45] <rzl>	 I think they're ready for etcd, just not for zookeeper
[23:29:51] <jhathaway>	 rzl: thanks, just trying to learn the ways of sre :)
[23:30:11] <rzl>	 jhathaway: oh in that case I'm gonna downtime them this way: "jhathaway could you please downtime those hosts? 162 hours or so should be perfect"
[23:30:21] <rzl>	 :D
[23:30:29] <rzl>	 (unless you'd rather not, in which case I'm happy to finish up)
[23:30:50] <rzl>	 and `cumin2002:~$ sudo cookbook sre.hosts.downtime -h` should be all you need
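The "162 hours or so" figure follows from rzl's earlier plan of "another week or so, minus a little bit so that it pops earlier in the day". A minimal sketch of that arithmetic follows; the six-hour offset is inferred from 168 - 162, the start time is assumed to be roughly when the request was made (the 2022-07-11 date is taken from the deployment calendar links above), and `downtime_expiry` is a hypothetical helper, not part of the downtime cookbook.

```python
from datetime import datetime, timedelta, timezone


def downtime_expiry(start: datetime, hours: int) -> datetime:
    """Return the moment a downtime of `hours` hours, begun at `start`, ends."""
    return start + timedelta(hours=hours)


# One week, minus six hours so the alert resurfaces earlier in the day.
week_hours = 7 * 24
hours = week_hours - 6

# Assumed start: about when the downtime was requested (UTC).
start = datetime(2022, 7, 11, 23, 30, tzinfo=timezone.utc)
expiry = downtime_expiry(start, hours)

print(hours)
print(expiry.isoformat())
```

Under those assumptions the downtime would lapse in the late afternoon UTC a week later, rather than near midnight as it did this time.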
[23:31:18] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (29) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, conf1007, conf1008, conf1009, elastic2049, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[23:31:44] <TheresNoTime>	 Learning the ways of SRE hmm? 👀
[23:31:54] <perryprog>	 👀
[23:32:26] * TheresNoTime remains the personification of "a little knowledge is dangerous"
[23:33:58] * perryprog is really here to just learn SRE knowledge stuff and /sometimes/ help with whatever
[23:35:39] <TheresNoTime>	 perryprog: I have the wonderful excuse that "I work here" :3
[23:36:06] * perryprog grumbles
[23:44:32] <wikibugs>	 (CR) Cwhite: [C: -2] "icinga does the encoding of these urls: https://phabricator.wikimedia.org/T213052" [puppet] - https://gerrit.wikimedia.org/r/812945 (owner: Krinkle)
[23:44:52] <rzl>	 (I went ahead and downtimed, decided to do just that service after all so I used the web ui)
[23:45:39] <rzl>	 systemd alerts are still CRIT for etcd-backup but those are non-paging so I'll leave em for visibility
[23:47:19] <cwhite>	 thanks, rzl - I'll resolve in VO as well so they don't fire again this time tomorrow
[23:47:32] <rzl>	 ah thanks
[23:58:40] <wikibugs>	 (CR) Cwhite: [C: +1] "SGTM" [puppet] - https://gerrit.wikimedia.org/r/812860 (https://phabricator.wikimedia.org/T288622) (owner: Filippo Giunchedi)
[23:59:35] <wikibugs>	 (CR) Cwhite: [C: +1] icinga: switch to prometheus-only probes for commons [puppet] - https://gerrit.wikimedia.org/r/812854 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)