[00:00:07] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.96, 3.62, 3.59
[00:00:51] PROBLEM - mw182 Current Load on mw182 is CRITICAL: LOAD CRITICAL - total load average: 26.72, 20.81, 17.92
[00:02:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.05, 3.89, 3.66
[00:02:49] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:02:50] RECOVERY - mw182 Current Load on mw182 is OK: LOAD OK - total load average: 18.97, 19.81, 17.90
[00:03:55] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 329.10 ms
[00:03:56] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.96, 3.25, 3.44
[00:05:51] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.92, 2.81, 3.26
[00:11:53] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 262.32 ms
[00:17:52] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 353.35 ms
[00:21:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.59, 3.60, 3.42
[00:21:51] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 292.63 ms
[00:22:49] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:23:43] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.57, 3.01, 3.24
[00:25:50] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 333.73 ms
[00:29:49] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 231.48 ms
[00:31:49] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 330.11 ms
[00:32:49] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:33:43] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.47, 4.59, 3.69
[00:33:48] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 253.10 ms
[00:35:28] PROBLEM - prometheus151 APT on prometheus151 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
[00:37:49] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:38:22] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:39:48] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 322.48 ms
[00:40:01] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[00:41:05] RECOVERY - prometheus151 APT on prometheus151 is OK: APT OK: 52 packages available for upgrade (0 critical updates).
[00:41:48] RECOVERY - ping6 on cp26 is OK: PING OK - Packet loss = 0%, RTA = 173.68 ms
[00:41:57] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.059 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206
[00:42:20] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0)
[00:42:49] [Grafana] !tech RESOLVED: High Job Queue Backlog https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:49:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:49:53] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 274.51 ms
[00:51:52] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 371.62 ms
[00:53:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.51, 2.98, 3.80
[00:53:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 18.24, 22.61, 23.84
[00:54:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:55:52] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 249.71 ms
[00:59:43] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.25, 3.78, 3.78
[00:59:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[01:03:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.15, 3.50, 3.78
[01:03:50] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 399.09 ms
[01:03:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.41, 21.22, 22.39
[01:04:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[01:05:43] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 0.24, 2.39, 3.33
[01:05:49] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 262.16 ms
[01:11:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.51, 22.93, 22.88
[01:15:48] RECOVERY - ping6 on cp26 is OK: PING OK - Packet loss = 0%, RTA = 150.14 ms
[01:17:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[01:27:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.34, 21.13, 21.11
[01:31:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.81, 21.81, 21.39
[01:39:52] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 15.57, 18.28, 20.05
[02:05:34] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 30.07, 23.51, 20.78
[02:27:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all
[02:28:05] PROBLEM - mw182 Current Load on mw182 is CRITICAL: LOAD CRITICAL - total load average: 28.25, 22.43, 18.54
[02:29:33] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present]
[02:30:32] PROBLEM - mw171 Current Load on mw171 is CRITICAL: LOAD CRITICAL - total load average: 25.09, 22.68, 18.03
[02:34:46] PROBLEM - mw172 Current Load on mw172 is WARNING: LOAD WARNING - total load average: 21.44, 19.51, 15.67
[02:36:46] RECOVERY - mw172 Current Load on mw172 is OK: LOAD OK - total load average: 20.32, 19.99, 16.32
[02:37:22] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD CRITICAL - total load average: 29.51, 24.40, 18.76
[02:40:09] PROBLEM - mw161 Current Load on mw161 is WARNING: LOAD WARNING - total load average: 22.26, 19.89, 16.98
[02:40:35] PROBLEM - phorge171 issue-tracker.miraheze.org HTTPS on phorge171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 328 bytes in 0.012 second response time
[02:40:57] PROBLEM - phorge171 phorge-static.wikitide.net HTTPS on phorge171 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 502 Bad Gateway
[02:42:07] RECOVERY - mw161 Current Load on mw161 is OK: LOAD OK - total load average: 14.28, 17.83, 16.58
[02:42:25] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[02:42:54] PROBLEM - phorge171 php-fpm on phorge171 is CRITICAL: PROCS CRITICAL: 0 processes with command name 'php-fpm8.2'
[02:43:22] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 11.99, 20.76, 19.44
[02:44:32] PROBLEM - mw171 Current Load on mw171 is WARNING: LOAD WARNING - total load average: 11.43, 19.06, 20.45
[02:45:22] RECOVERY - mw162 Current Load on mw162 is OK: LOAD OK - total load average: 13.45, 18.04, 18.58
[02:46:05] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 13.06, 21.02, 22.91
[02:46:32] RECOVERY - mw171 Current Load on mw171 is OK: LOAD OK - total load average: 10.06, 16.22, 19.25
[02:47:25] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[02:52:05] RECOVERY - mw182 Current Load on mw182 is OK: LOAD OK - total load average: 13.68, 15.94, 19.92
[02:56:46] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.29, 21.05, 23.83
[02:57:41] PROBLEM - ns2 NTP time on ns2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:59:37] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset -0.001534998417 secs
[03:06:37] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 13.12, 15.77, 20.16
[03:08:26] RECOVERY - phorge171 issue-tracker.miraheze.org HTTPS on phorge171 is OK: HTTP OK: HTTP/1.1 200 OK - 19644 bytes in 0.078 second response time
[03:08:53] RECOVERY - phorge171 php-fpm on phorge171 is OK: PROCS OK: 9 processes with command name 'php-fpm8.2'
[03:08:57] RECOVERY - phorge171 phorge-static.wikitide.net HTTPS on phorge171 is OK: HTTP OK: Status line output matched "HTTP/1.1 200" - 17718 bytes in 0.036 second response time
[03:15:47] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all
[03:17:44] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present]
[03:18:22] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.05, 21.08, 20.25
[03:20:20] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.63, 20.37, 20.14
[03:37:02] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.49, 20.00, 19.11
[03:39:00] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 19.72, 20.36, 19.40
[03:42:55] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.12, 21.77, 20.20
[03:52:45] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 28.62, 24.46, 21.92
[03:52:48] PROBLEM - ns2 NTP time on ns2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:54:44] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset -0.0002720057964 secs
[04:04:34] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.21, 22.90, 23.17
[04:05:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all
[04:07:33] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present]
[04:12:27] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 15.20, 17.07, 20.23
[04:15:06] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[04:24:14] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.94, 21.33, 20.06
[04:28:05] PROBLEM - mw161 Current Load on mw161 is CRITICAL: LOAD CRITICAL - total load average: 28.85, 18.95, 14.76
[04:28:10] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 16.47, 20.37, 20.18
[04:30:03] RECOVERY - mw161 Current Load on mw161 is OK: LOAD OK - total load average: 14.10, 17.21, 14.66
[04:39:55] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.74, 20.29, 20.23
[04:41:53] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.33, 22.28, 20.96
[04:45:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.65, 22.16, 21.23
[04:49:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.69, 23.97, 22.20
[04:51:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.23, 23.71, 22.35
[04:53:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.46, 23.46, 22.39
[04:57:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all
[04:59:33] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present]
[04:59:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 17.94, 22.54, 22.64
[05:02:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[05:05:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.32, 24.05, 23.21
[05:06:36] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[05:06:37] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 9.33, 4.62, 2.00
[05:07:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[05:07:53] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.56, 22.32, 22.68
[05:08:31] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.13, 3.72, 1.98
[05:09:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 28.18, 24.42, 23.39
[05:10:28] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.39, 4.44, 2.45
[05:11:29] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:11:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.79, 23.60, 23.22
[05:12:44] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.072 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206
[05:14:42] RECOVERY - ns2 Puppet on ns2 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:15:28] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0)
[05:15:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.91, 23.97, 23.36
[05:18:07] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.13, 3.56, 2.89
[05:20:02] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.58, 2.96, 2.75
[05:29:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 18.38, 22.48, 23.51
[05:31:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.09, 23.03, 23.54
[05:33:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.72, 23.07, 23.54
[05:35:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.29, 24.40, 23.97
[05:37:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[05:39:53] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.27, 23.70, 23.78
[05:41:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.03, 24.32, 24.02
[05:45:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.96, 23.71, 23.83
[05:47:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[05:48:49] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.81, 3.99, 3.20
[05:52:40] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.98, 3.03, 2.98
[05:53:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.48, 20.90, 22.13
[05:55:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.37, 21.90, 22.40
[06:03:15] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.23, 3.90, 3.57
[06:03:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.25, 21.63, 21.80
[06:07:05] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.81, 4.18, 3.71
[06:07:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[06:13:52] PROBLEM - wiki.kirbygang.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Certificate 'wiki.kirbygang.com' expires in 7 day(s) (Thu 08 Aug 2024 05:46:20 AM GMT +0000).
[06:14:45] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.07, 3.84, 3.83
[06:17:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[06:17:50] PROBLEM - wiki.andreijiroh.uk.eu.org - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - wiki.andreijiroh.uk.eu.org All nameservers failed to answer the query.
[06:20:29] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.14, 3.71, 3.76
[06:22:23] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.12, 3.20, 3.55
[06:24:18] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 2.26, 2.67, 3.30
[06:32:57] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 21.33, 19.68, 17.40
[06:34:56] PROBLEM - mw182 Current Load on mw182 is CRITICAL: LOAD CRITICAL - total load average: 25.99, 22.12, 18.58
[06:35:22] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 21.89, 19.25, 16.04
[06:36:32] PROBLEM - mw171 Current Load on mw171 is CRITICAL: LOAD CRITICAL - total load average: 24.30, 18.68, 14.51
[06:37:22] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD CRITICAL - total load average: 27.09, 22.11, 17.49
[06:38:32] RECOVERY - mw171 Current Load on mw171 is OK: LOAD OK - total load average: 19.76, 18.44, 14.91
[06:42:06] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 22.92, 21.47, 17.21
[06:44:06] RECOVERY - mw152 Current Load on mw152 is OK: LOAD OK - total load average: 15.10, 19.71, 17.13
[06:45:22] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 13.65, 21.31, 19.64
[06:46:51] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 15.40, 22.71, 22.17
[06:47:11] RECOVERY - wiki.andreijiroh.uk.eu.org - reverse DNS on sslhost is OK: SSL OK - wiki.andreijiroh.uk.eu.org reverse DNS resolves to cp36.wikitide.net - CNAME OK
[06:47:22] RECOVERY - mw162 Current Load on mw162 is OK: LOAD OK - total load average: 10.50, 17.77, 18.56
[06:49:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all
[06:50:49] RECOVERY - mw182 Current Load on mw182 is OK: LOAD OK - total load average: 13.08, 16.70, 19.77
[06:51:34] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present]
[06:55:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.05, 20.44, 23.92
[06:56:53] PROBLEM - ns2 NTP time on ns2 is UNKNOWN: check_ntp_time: Invalid hostname/address - time.cloudflare.com Usage: check_ntp_time -H [-4|-6] [-w ] [-c ] [-v verbose] [-o