[00:00:09] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.60, 4.10, 3.93 [00:01:45] [ssl] WikiTideSSLBot pushed 1 commit to master [+0/-0/±1] https://github.com/miraheze/ssl/compare/1216a89c8882...48e18635d19b [00:01:47] [ssl] WikiTideSSLBot 48e1863 - Bot: Update SSL cert for wikitide.org [00:02:09] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.77, 3.60, 3.75 [00:03:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.96, 22.30, 21.36 [00:04:12] RECOVERY - poserdazfreebies.orain.org - LetsEncrypt on sslhost is OK: OK - Certificate 'orain.org' will expire on Wed 30 Oct 2024 09:58:01 PM GMT +0000. [00:05:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.61, 22.30, 21.49 [00:06:10] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 2.29, 2.75, 3.36 [00:06:25] [ssl] WikiTideSSLBot pushed 1 commit to master [+0/-0/±1] https://github.com/miraheze/ssl/compare/48e18635d19b...2e33a83204d6 [00:06:28] [ssl] WikiTideSSLBot 2e33a83 - Bot: Update SSL cert for dcmultiversewiki.com [00:07:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.69, 22.72, 21.72 [00:09:36] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 20.55, 19.68, 18.15 [00:10:35] [ssl] WikiTideSSLBot pushed 1 commit to master [+0/-0/±1] https://github.com/miraheze/ssl/compare/2e33a83204d6...096226865f9d [00:10:38] [ssl] WikiTideSSLBot 0962268 - Bot: Update SSL cert for wiki.kirbygang.com [00:10:51] [ssl] WikiTideSSLBot pushed 1 commit to master [+0/-0/±1] https://github.com/miraheze/ssl/compare/096226865f9d...1716da1781f1 [00:10:52] [ssl] WikiTideSSLBot 1716da1 - Bot: Update SSL cert for lgbtqia.wiki [00:11:03] PROBLEM - mw181 Current Load
on mw181 is WARNING: LOAD WARNING - total load average: 19.56, 23.26, 22.34 [00:11:30] RECOVERY - mw162 Current Load on mw162 is OK: LOAD OK - total load average: 18.74, 19.27, 18.18 [00:23:22] RECOVERY - rippaverse.wikitide.org - LetsEncrypt on sslhost is OK: OK - Certificate 'wikitide.org' will expire on Wed 30 Oct 2024 11:03:09 PM GMT +0000. [00:23:55] RECOVERY - dc.wikitide.org - LetsEncrypt on sslhost is OK: OK - Certificate 'wikitide.org' will expire on Wed 30 Oct 2024 11:03:09 PM GMT +0000. [00:25:55] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.79, 4.49, 3.70 [00:26:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [00:26:25] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [00:27:03] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.06, 18.75, 20.21 [00:27:28] RECOVERY - issue-tracker.wikitide.org - LetsEncrypt on sslhost is OK: OK - Certificate 'wikitide.org' will expire on Wed 30 Oct 2024 11:03:09 PM GMT +0000. [00:27:40] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:28:10] RECOVERY - dcmultiversewiki.com - LetsEncrypt on sslhost is OK: OK - Certificate 'dcmultiversewiki.com' will expire on Wed 30 Oct 2024 11:07:50 PM GMT +0000. [00:28:20] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.063 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [00:29:25] RECOVERY - polcompball.wikitide.org - LetsEncrypt on sslhost is OK: OK - Certificate 'wikitide.org' will expire on Wed 30 Oct 2024 11:03:09 PM GMT +0000. 
[00:30:18] RECOVERY - dcmultiverse.wikitide.org - LetsEncrypt on sslhost is OK: OK - Certificate 'wikitide.org' will expire on Wed 30 Oct 2024 11:03:09 PM GMT +0000. [00:31:23] RECOVERY - lgbtqia.wiki - LetsEncrypt on sslhost is OK: OK - Certificate 'lgbtqia.wiki' will expire on Wed 30 Oct 2024 11:12:15 PM GMT +0000. [00:31:50] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) [00:35:24] RECOVERY - www.lgbtqia.wiki - LetsEncrypt on sslhost is OK: OK - Certificate 'lgbtqia.wiki' will expire on Wed 30 Oct 2024 11:12:15 PM GMT +0000. [00:36:18] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [00:37:46] RECOVERY - wiki.kirbygang.com - LetsEncrypt on sslhost is OK: OK - Certificate 'wiki.kirbygang.com' will expire on Wed 30 Oct 2024 11:12:00 PM GMT +0000. [00:38:20] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [442 system event log (SEL) entries present] [00:40:15] RECOVERY - www.dcmultiversewiki.com - LetsEncrypt on sslhost is OK: OK - Certificate 'dcmultiversewiki.com' will expire on Wed 30 Oct 2024 11:07:50 PM GMT +0000. [00:40:31] RECOVERY - wikitide.org - LetsEncrypt on sslhost is OK: OK - Certificate 'wikitide.org' will expire on Wed 30 Oct 2024 11:03:09 PM GMT +0000. [00:45:43] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:47:50] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) [00:51:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. 
https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [00:56:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [01:01:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [01:03:43] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.52, 21.31, 19.39 [01:05:40] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.11, 21.18, 19.56 [01:11:32] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 16.79, 19.27, 19.31 [01:12:44] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.43, 3.11, 3.68 [01:16:37] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.62, 3.83, 3.86 [01:18:33] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.79, 3.76, 3.86 [01:22:18] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [01:22:26] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.08, 3.76, 3.79 [01:24:20] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical 
[442 system event log (SEL) entries present] [01:27:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.41, 21.80, 20.52 [01:30:13] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.45, 3.84, 3.89 [01:31:03] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 14.66, 19.17, 19.87 [01:34:09] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.49, 3.93, 3.92 [01:39:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.91, 21.73, 20.73 [01:39:14] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [01:41:03] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 16.61, 20.10, 20.29 [01:43:13] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.073 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [01:46:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [01:47:34] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [01:49:32] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 3.789 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [01:51:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. 
https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [01:58:09] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.40, 2.72, 3.72 [02:00:09] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.25, 3.37, 3.81 [02:01:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [02:02:09] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.99, 3.55, 3.85 [02:06:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [02:06:12] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.80, 3.70, 3.78 [02:08:09] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.67, 3.03, 3.51 [02:10:09] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.92, 4.58, 4.03 [02:10:45] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [02:11:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [02:14:20] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN:> Your server seems to be powered off. 
(Execution of FreeIPMI returned an empty output or only 1 header row!)-> /usr/sbin/ipmi-sensors was executed with the following parameters: [HIDDEN]sudo /usr/sbin/ipmi-sensors --exclude-sensor-types Drive_Slot,Entity_Presence --quiet-cache --sdr-cache-recreate --interpret-oem-data --output-sensor-state --ignore-not-available-sensors --output-sens [02:14:20] lds [02:14:49] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.701 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [02:16:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [02:16:21] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [442 system event log (SEL) entries present] [02:18:09] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.00, 3.73, 3.98 [02:24:09] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.81, 4.18, 4.04 [02:31:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [02:37:43] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.81, 20.55, 18.51 [02:39:41] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 18.04, 19.54, 18.39 [02:40:10] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.82, 3.48, 3.93 [02:41:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. 
https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [02:43:00] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD CRITICAL - total load average: 29.37, 20.29, 15.44 [02:43:25] PROBLEM - mw161 Current Load on mw161 is CRITICAL: LOAD CRITICAL - total load average: 27.52, 19.39, 14.62 [02:43:32] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 29.85, 24.96, 20.83 [02:45:08] PROBLEM - mw152 Current Load on mw152 is CRITICAL: LOAD CRITICAL - total load average: 29.95, 22.49, 16.98 [02:46:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [02:46:09] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.91, 3.68, 3.82 [02:46:09] PROBLEM - mw171 Current Load on mw171 is CRITICAL: LOAD CRITICAL - total load average: 29.36, 22.59, 15.89 [02:46:34] PROBLEM - mw172 Current Load on mw172 is CRITICAL: LOAD CRITICAL - total load average: 30.28, 22.55, 16.07 [02:50:09] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.31, 2.89, 3.45 [02:51:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. 
https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [02:52:09] PROBLEM - mw171 Current Load on mw171 is WARNING: LOAD WARNING - total load average: 19.37, 23.02, 18.52 [02:52:10] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.84, 2.54, 3.26 [02:52:34] PROBLEM - mw172 Current Load on mw172 is WARNING: LOAD WARNING - total load average: 17.50, 22.81, 18.68 [02:55:00] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 18.03, 23.57, 21.15 [02:55:24] PROBLEM - mw161 Current Load on mw161 is WARNING: LOAD WARNING - total load average: 20.33, 23.45, 20.64 [02:56:09] RECOVERY - mw171 Current Load on mw171 is OK: LOAD OK - total load average: 12.45, 17.88, 17.49 [02:56:34] RECOVERY - mw172 Current Load on mw172 is OK: LOAD OK - total load average: 12.12, 18.18, 17.82 [02:57:08] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 16.78, 23.35, 21.60 [03:01:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. 
https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:01:00] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD CRITICAL - total load average: 27.78, 24.24, 21.94 [03:01:02] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.56, 4.46, 3.80 [03:01:08] PROBLEM - mw152 Current Load on mw152 is CRITICAL: LOAD CRITICAL - total load average: 24.16, 23.55, 22.07 [03:01:24] PROBLEM - mw161 Current Load on mw161 is CRITICAL: LOAD CRITICAL - total load average: 26.57, 23.27, 21.27 [03:02:34] PROBLEM - mw172 Current Load on mw172 is CRITICAL: LOAD CRITICAL - total load average: 29.28, 23.30, 19.95 [03:03:36] PROBLEM - mw151 Current Load on mw151 is CRITICAL: LOAD CRITICAL - total load average: 27.30, 21.31, 17.21 [03:04:09] PROBLEM - mw171 Current Load on mw171 is CRITICAL: LOAD CRITICAL - total load average: 29.52, 23.67, 19.84 [03:11:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. 
https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:12:18] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [03:14:20] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [442 system event log (SEL) entries present] [03:14:27] PROBLEM - cp37 Varnish Backends on cp37 is CRITICAL: 1 backends are down. mw182 [03:17:31] PROBLEM - mw151 Current Load on mw151 is WARNING: LOAD WARNING - total load average: 18.94, 23.92, 21.88 [03:18:24] RECOVERY - cp37 Varnish Backends on cp37 is OK: All 19 backends are healthy [03:19:31] RECOVERY - mw151 Current Load on mw151 is OK: LOAD OK - total load average: 10.56, 18.84, 20.24 [03:20:09] PROBLEM - mw171 Current Load on mw171 is WARNING: LOAD WARNING - total load average: 13.51, 21.50, 23.22 [03:20:29] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.87, 3.31, 3.89 [03:20:34] PROBLEM - mw172 Current Load on mw172 is WARNING: LOAD WARNING - total load average: 12.55, 21.23, 23.46 [03:21:08] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 12.00, 20.26, 23.92 [03:21:24] PROBLEM - mw161 Current Load on mw161 is WARNING: LOAD WARNING - total load average: 13.30, 20.13, 23.54 [03:22:26] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 3.88, 3.85, 4.03 [03:23:00] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD 
WARNING - total load average: 11.94, 18.31, 22.87 [03:24:09] RECOVERY - mw171 Current Load on mw171 is OK: LOAD OK - total load average: 8.91, 14.93, 20.13 [03:26:21] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.23, 3.18, 3.72 [03:26:34] RECOVERY - mw172 Current Load on mw172 is OK: LOAD OK - total load average: 10.52, 13.59, 19.24 [03:27:00] RECOVERY - mw162 Current Load on mw162 is OK: LOAD OK - total load average: 10.09, 14.64, 20.39 [03:27:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 16.27, 18.44, 23.06 [03:27:08] RECOVERY - mw152 Current Load on mw152 is OK: LOAD OK - total load average: 11.83, 14.36, 20.04 [03:27:24] RECOVERY - mw161 Current Load on mw161 is OK: LOAD OK - total load average: 10.36, 13.75, 19.53 [03:28:17] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.11, 3.80, 3.89 [03:30:14] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.56, 3.48, 3.75 [03:31:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:34:09] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.71, 4.18, 3.94 [03:35:03] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 12.68, 15.67, 19.99 [03:36:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. 
https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:36:09] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.97, 3.55, 3.76 [03:38:09] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.63, 4.42, 4.05 [03:42:19] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [03:46:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:48:29] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 5.033 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [03:51:59] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:54:01] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) [04:06:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. 
https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [04:14:18] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [04:15:03] PROBLEM - ns2 NTP time on ns2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:15:23] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.92, 19.12, 17.98 [04:16:20] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [442 system event log (SEL) entries present] [04:17:00] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD CRITICAL - total load average: 29.30, 23.23, 18.04 [04:17:02] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset -0.0005412995815 secs [04:19:18] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 20.25, 20.16, 18.65 [04:22:01] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 22.24, 20.43, 16.77 [04:24:09] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.10, 3.12, 3.93 [04:25:00] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 20.98, 23.18, 20.48 [04:27:04] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.65, 23.30, 20.52 [04:29:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.84, 23.51, 20.98 [04:31:39] RECOVERY - mw152 Current Load on mw152 is OK: LOAD OK - total 
load average: 18.50, 20.23, 18.59 [04:32:09] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.34, 3.53, 3.73 [04:34:10] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.05, 3.18, 3.60 [04:35:00] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD CRITICAL - total load average: 24.06, 22.75, 21.49 [04:35:32] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 21.05, 20.99, 19.30 [04:37:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 28.00, 23.87, 21.81 [04:38:11] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.71, 4.14, 3.85 [04:39:15] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:41:00] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 21.50, 23.29, 22.31 [04:41:17] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) [04:41:20] RECOVERY - mw152 Current Load on mw152 is OK: LOAD OK - total load average: 16.99, 19.86, 19.49 [04:46:48] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[04:47:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.19, 23.95, 23.06 [04:48:09] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.22, 3.50, 3.91 [04:50:09] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 3.07, 3.90, 4.04 [04:51:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.09, 24.27, 23.33 [04:52:09] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.44, 3.11, 3.73 [04:53:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.26, 22.50, 22.78 [04:54:09] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.89, 3.70, 3.86 [04:55:00] RECOVERY - mw162 Current Load on mw162 is OK: LOAD OK - total load average: 17.66, 18.96, 20.30 [04:58:04] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [04:58:53] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:00:39] PROBLEM - prometheus151 APT on prometheus151 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds. [05:00:40] PROBLEM - prometheus151 Puppet on prometheus151 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds. [05:01:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:02:06] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 2.581 seconds response time. 
wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [05:02:39] RECOVERY - prometheus151 Puppet on prometheus151 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [05:02:40] RECOVERY - prometheus151 APT on prometheus151 is OK: APT OK: 52 packages available for upgrade (0 critical updates). [05:03:01] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) [05:06:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:06:09] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 0.21, 3.13, 3.98 [05:10:09] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 0.10, 1.45, 3.09 [05:11:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: High Job Queue Backlog https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:11:01] PROBLEM - cp51 Varnish Backends on cp51 is CRITICAL: 1 backends are down. 
mw182 [05:11:03] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 13.60, 17.13, 19.77 [05:12:59] RECOVERY - cp51 Varnish Backends on cp51 is OK: All 19 backends are healthy [05:13:42] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 20.91, 21.22, 20.14 [05:15:36] RECOVERY - mw162 Current Load on mw162 is OK: LOAD OK - total load average: 17.82, 19.79, 19.73 [05:16:47] PROBLEM - ns2 NTP time on ns2 is UNKNOWN: check_ntp_time: Invalid hostname/address - time.cloudflare.com Usage: check_ntp_time -H [-4|-6] [-w ] [-c ] [-v verbose] [-o