[00:31:23] (PS2) Krinkle: captchaloop: Replace deprecated blacklist parameter [puppet] - https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936) (owner: Reedy) [00:31:37] (PS3) Krinkle: mediawiki: Replace deprecated blacklist parameter in captchaloop [puppet] - https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936) (owner: Reedy) [00:31:45] (CR) Krinkle: [C: +1] "Good to go now, I think?" [puppet] - https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936) (owner: Reedy) [00:33:06] (PS2) Krinkle: wikitech.php: Minor cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/785889 (owner: Reedy) [00:33:16] (CR) Krinkle: [C: +1] wikitech.php: Minor cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/785889 (owner: Reedy) [01:37:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:13] (CR) Reedy: mediawiki: Replace deprecated blacklist parameter in captchaloop (1 comment) [puppet] - https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936) (owner: Reedy) [01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) resolved: (4) Reduced 
availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:54:41] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:05:23] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8818.service,thumbor@8833.service https://wikitech.wikimedia.org/wiki/Monit [03:05:23] eck_systemd_state [03:10:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:11:03] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:13:23] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48390 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:14:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:23:35] PROBLEM - Query Service HTTP Port on wdqs1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [03:26:05] RECOVERY - Query Service HTTP Port on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service 
[03:31:51] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:56:03] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:24:19] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:58:51] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:23] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service [05:06:23] @8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8834.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:35:55] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: 
thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8818.service,thumbor@8833.service https://wikitech.wik [05:35:55] rg/wiki/Monitoring/check_systemd_state [06:02:15] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:09:47] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8817.service,thumbor@8818.service,thumbor@8819.service,thumbor@8824.service https://wikitech.wik [06:09:47] rg/wiki/Monitoring/check_systemd_state [06:52:35] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2022-08-09 06:51:41 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/ [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220710T0700) [07:30:47] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service [07:30:47] @8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8834.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:43] RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:28:59] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:11] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8818.service,thumbor@8833.service https://wikitech.wik [08:35:11] rg/wiki/Monitoring/check_systemd_state [09:24:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:29:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:30:19] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service [09:30:19] @8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8834.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:37] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:17] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:20:57] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:33:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) 
#page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:38:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:46:53] Good, that paged so it means my patch definitely works [11:01:33] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:53] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8817.service,thumbor@8818.service,thumbor@8819.service,thumbor@8824.service https://wikitech.wik [11:08:53] rg/wiki/Monitoring/check_systemd_state [11:12:29] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [11:14:59] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 14 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [11:22:19] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:23:45] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: 
thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:41] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [11:55:13] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [12:00:35] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service [12:00:35] @8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8834.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:41] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8818.service,thumbor@8833.service https://wikitech.wik [13:05:41] rg/wiki/Monitoring/check_systemd_state [13:12:39] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:32:03] 
RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:35:05] mmhh I'll take a look [13:36:08] godog: I am around if needed [13:36:13] cheers [13:37:10] but yeah looks like thumbor has been unhappy for sure [13:38:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:35] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: 
thumbor@8801.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8817.service,thumbor@8818.service,thumbor@8819.service,thumbor@8824.service https://wikitech.wik [13:39:35] rg/wiki/Monitoring/check_systemd_state [13:40:57] looking for smoking guns, but overall I see flapping for sure in terms of availability [13:41:51] yeah I see some errors like thumbor:ERROR [ImagesHandler] Throttled by PoolCounter [13:42:14] there was also a rise in HTTP 429s [13:43:06] but from grafana no sign (in my opinion) of traffic overload [13:46:03] *nod* there's a fair few 500s in eqiad from https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&from=now-7d&to=now and one instance's latency (thumbor1005) went up considerably in the last few days [13:48:39] !log silence ProbeDown pages for thumbor:8800 until wed [13:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:28] it doesn't look like the service is in special trouble, slower than usual for sure though, I think it can wait for SMEs to look at it during the week [13:51:52] one weird thing is that some services on a node like thumbor1005 fail to start thumbor units [13:51:55] like [13:51:58] thumbor@8820.service: Main process exited, code=killed, status=6/ABRT [13:52:07] not sure if those units are supposed to be down or not though [13:53:50] they are supposed to be up yeah [13:56:26] looks like that (systemd unit failures) has been going on for a while now [13:56:29] https://logstash.wikimedia.org/goto/675a74135d800aeaa89cef5c51ab96dd [13:59:06] and firejail complains in the logs too [13:59:09] filed as T312722 [13:59:09] T312722: Thumbor units failing / service general slowness - https://phabricator.wikimedia.org/T312722 [14:00:06] added the info as well [14:03:49] firejail: util.c:906: crea [14:03:51] te_empty_dir_as_root: 
Assertion `(s.st_mode & 07777) == (mode)' failed. [14:03:54] uff sorry [14:03:59] firejail: util.c:906: crea [14:04:12] ok my copy/paste doesn't work [14:04:29] paste works fine, just not copy [14:04:33] I found some github issue with somebody having a similar issue, will post it in the task [14:04:46] AntiComposite: ? [14:06:22] SRE, Thumbor: Thumbor units failing / service general slowness - https://phabricator.wikimedia.org/T312722 (elukey) [16:01:01] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [16:03:31] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 8 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [16:49:27] (PS12) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 [16:51:51] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond) [17:23:13] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:27:58] (PS13) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 [17:28:00] (PS1) Jbond: P:base: dont use haveged in containers [puppet] - https://gerrit.wikimedia.org/r/812555 [17:29:54] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond) [17:32:45] (CR) CI reject: [V: -1] P:base: dont use haveged in containers [puppet] - https://gerrit.wikimedia.org/r/812555 
(owner: Jbond) [17:38:26] (PS14) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 [17:42:22] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond) [17:59:47] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:19] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service [18:07:19] @8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8834.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:04] (PS1) Jbond: cli: Add ability to override th amount of retries and backoffs [software/debmonitor] - https://gerrit.wikimedia.org/r/812556 [18:16:40] (PS15) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 [18:18:20] (CR) CI reject: [V: -1] cli: Add ability to override th amount of retries and backoffs [software/debmonitor] - https://gerrit.wikimedia.org/r/812556 (owner: Jbond) [18:20:25] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond) [18:36:43] PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The following units failed: 
thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8819.service,thumbor@8844.service https://wikitech.wikimedia.org/wiki/Monit [18:36:43] eck_systemd_state [19:03:57] PROBLEM - cassandra-b CQL 10.64.48.127:9042 on restbase1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [19:04:47] PROBLEM - cassandra-c CQL 10.64.48.128:9042 on restbase1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [19:04:55] PROBLEM - cassandra-a CQL 10.64.48.126:9042 on restbase1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [19:05:15] PROBLEM - Restbase root url on restbase1025 is CRITICAL: connect to address 10.64.48.125 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [19:06:47] PROBLEM - SSH on restbase1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:20:37] PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service [19:20:37] @8818.service,thumbor@8820.service,thumbor@8822.service,thumbor@8823.service,thumbor@8824.service,thumbor@8828.service,thumbor@8831.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:29:59] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:11] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8818.service,thumbor@8833.service [19:35:11] /wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:29] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service [19:37:29] @8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8834.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:39] (PS16) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 [19:50:26] (CR) CI reject: [V: -1] beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (owner: Jbond) [19:50:35] RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:58:05] PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: 
thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service [19:58:05] @8818.service,thumbor@8820.service,thumbor@8822.service,thumbor@8823.service,thumbor@8824.service,thumbor@8828.service,thumbor@8831.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:29] RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:57] PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service [20:27:57] @8818.service,thumbor@8820.service,thumbor@8822.service,thumbor@8823.service,thumbor@8824.service,thumbor@8828.service,thumbor@8831.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [20:48:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [22:05:05] RECOVERY - Check systemd state on thumbor2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:12:35] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8818.service,thumbor@8833.service [22:12:35] /wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:20:27] RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:59] PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service [22:27:59] @8818.service,thumbor@8819.service,thumbor@8820.service,thumbor@8821.service,thumbor@8822.service,thumbor@8823.service,thumbor@8824.service,thumbor@8828.service,thumbor@8831.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:53] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:23] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service [22:37:23] @8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:51] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:29:55] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:27] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service [23:37:27] @8817.service,thumbor@8818.service,thumbor@8820.service,thumbor@8824.service,thumbor@8825.service,thumbor@8827.service,thumbor@8834.service,thumbor@8843.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:03] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS13030/IPv4: Idle - Init7, AS13030/IPv6: Idle - Init7 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:47:20] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:37] RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:41] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:50:47] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [23:50:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 240, down: 2, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:57:44] (PS3) Tim Starling: Add ucfirst overrides for the PHP 7.4 migration [mediawiki-config] - https://gerrit.wikimedia.org/r/811875 (https://phabricator.wikimedia.org/T271736) [23:58:07] PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service,thumbor@8816.service,thumbor@8817.service [23:58:07] 
@8818.service,thumbor@8819.service,thumbor@8820.service,thumbor@8821.service,thumbor@8822.service,thumbor@8823.service,thumbor@8824.service,thumbor@8828.service,thumbor@8831.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:58:12] (CR) Tim Starling: [C: +2] Add ucfirst overrides for the PHP 7.4 migration [mediawiki-config] - https://gerrit.wikimedia.org/r/811875 (https://phabricator.wikimedia.org/T271736) (owner: Tim Starling) [23:58:59] (Merged) jenkins-bot: Add ucfirst overrides for the PHP 7.4 migration [mediawiki-config] - https://gerrit.wikimedia.org/r/811875 (https://phabricator.wikimedia.org/T271736) (owner: Tim Starling)