[00:00:07] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.96, 3.62, 3.59 [00:00:51] PROBLEM - mw182 Current Load on mw182 is CRITICAL: LOAD CRITICAL - total load average: 26.72, 20.81, 17.92 [00:02:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.05, 3.89, 3.66 [00:02:49] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [00:02:50] RECOVERY - mw182 Current Load on mw182 is OK: LOAD OK - total load average: 18.97, 19.81, 17.90 [00:03:55] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 329.10 ms [00:03:56] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.96, 3.25, 3.44 [00:05:51] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.92, 2.81, 3.26 [00:11:53] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 262.32 ms [00:17:52] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 353.35 ms [00:21:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.59, 3.60, 3.42 [00:21:51] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 292.63 ms [00:22:49] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. 
https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [00:23:43] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.57, 3.01, 3.24 [00:25:50] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 333.73 ms [00:29:49] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 231.48 ms [00:31:49] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 330.11 ms [00:32:49] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [00:33:43] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.47, 4.59, 3.69 [00:33:48] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 253.10 ms [00:35:28] PROBLEM - prometheus151 APT on prometheus151 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds. 
[00:37:49] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [00:38:22] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:39:48] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 322.48 ms [00:40:01] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [00:41:05] RECOVERY - prometheus151 APT on prometheus151 is OK: APT OK: 52 packages available for upgrade (0 critical updates). [00:41:48] RECOVERY - ping6 on cp26 is OK: PING OK - Packet loss = 0%, RTA = 173.68 ms [00:41:57] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.059 seconds response time. 
wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [00:42:20] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) [00:42:49] [Grafana] !tech RESOLVED: High Job Queue Backlog https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [00:49:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [00:49:53] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 274.51 ms [00:51:52] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 371.62 ms [00:53:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.51, 2.98, 3.80 [00:53:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 18.24, 22.61, 23.84 [00:54:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [00:55:52] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 249.71 ms [00:59:43] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.25, 3.78, 3.78 [00:59:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [01:03:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.15, 3.50, 3.78 [01:03:50] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 399.09 ms [01:03:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.41, 21.22, 22.39 [01:04:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [01:05:43] RECOVERY - prometheus151 Current Load on prometheus151 is OK: 
LOAD OK - total load average: 0.24, 2.39, 3.33 [01:05:49] PROBLEM - ping6 on cp26 is WARNING: PING WARNING - Packet loss = 0%, RTA = 262.16 ms [01:11:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.51, 22.93, 22.88 [01:15:48] RECOVERY - ping6 on cp26 is OK: PING OK - Packet loss = 0%, RTA = 150.14 ms [01:17:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [01:27:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.34, 21.13, 21.11 [01:31:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.81, 21.81, 21.39 [01:39:52] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 15.57, 18.28, 20.05 [02:05:34] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 30.07, 23.51, 20.78 [02:27:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [02:28:05] PROBLEM - mw182 Current Load on mw182 is CRITICAL: LOAD CRITICAL - total load average: 28.25, 22.43, 18.54 [02:29:33] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present] [02:30:32] PROBLEM - mw171 Current Load on mw171 is CRITICAL: LOAD CRITICAL - total load average: 25.09, 22.68, 18.03 [02:34:46] PROBLEM - mw172 Current Load on mw172 is WARNING: LOAD WARNING - total load average: 21.44, 19.51, 15.67 [02:36:46] RECOVERY - mw172 Current Load on mw172 is OK: LOAD OK - total load average: 20.32, 19.99, 16.32 [02:37:22] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD 
CRITICAL - total load average: 29.51, 24.40, 18.76 [02:40:09] PROBLEM - mw161 Current Load on mw161 is WARNING: LOAD WARNING - total load average: 22.26, 19.89, 16.98 [02:40:35] PROBLEM - phorge171 issue-tracker.miraheze.org HTTPS on phorge171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 328 bytes in 0.012 second response time [02:40:57] PROBLEM - phorge171 phorge-static.wikitide.net HTTPS on phorge171 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 502 Bad Gateway [02:42:07] RECOVERY - mw161 Current Load on mw161 is OK: LOAD OK - total load average: 14.28, 17.83, 16.58 [02:42:25] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [02:42:54] PROBLEM - phorge171 php-fpm on phorge171 is CRITICAL: PROCS CRITICAL: 0 processes with command name 'php-fpm8.2' [02:43:22] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 11.99, 20.76, 19.44 [02:44:32] PROBLEM - mw171 Current Load on mw171 is WARNING: LOAD WARNING - total load average: 11.43, 19.06, 20.45 [02:45:22] RECOVERY - mw162 Current Load on mw162 is OK: LOAD OK - total load average: 13.45, 18.04, 18.58 [02:46:05] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 13.06, 21.02, 22.91 [02:46:32] RECOVERY - mw171 Current Load on mw171 is OK: LOAD OK - total load average: 10.06, 16.22, 19.25 [02:47:25] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [02:52:05] RECOVERY - mw182 Current Load on mw182 is OK: LOAD OK - total load average: 13.68, 15.94, 19.92 [02:56:46] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD 
WARNING - total load average: 19.29, 21.05, 23.83 [02:57:41] PROBLEM - ns2 NTP time on ns2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:59:37] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset -0.001534998417 secs [03:06:37] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 13.12, 15.77, 20.16 [03:08:26] RECOVERY - phorge171 issue-tracker.miraheze.org HTTPS on phorge171 is OK: HTTP OK: HTTP/1.1 200 OK - 19644 bytes in 0.078 second response time [03:08:53] RECOVERY - phorge171 php-fpm on phorge171 is OK: PROCS OK: 9 processes with command name 'php-fpm8.2' [03:08:57] RECOVERY - phorge171 phorge-static.wikitide.net HTTPS on phorge171 is OK: HTTP OK: Status line output matched "HTTP/1.1 200" - 17718 bytes in 0.036 second response time [03:15:47] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [03:17:44] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present] [03:18:22] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.05, 21.08, 20.25 [03:20:20] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.63, 20.37, 20.14 [03:37:02] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.49, 20.00, 19.11 [03:39:00] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 19.72, 20.36, 19.40 [03:42:55] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.12, 21.77, 20.20 [03:52:45] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 28.62, 24.46, 21.92 
[03:52:48] PROBLEM - ns2 NTP time on ns2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:54:44] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset -0.0002720057964 secs [04:04:34] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.21, 22.90, 23.17 [04:05:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [04:07:33] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present] [04:12:27] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 15.20, 17.07, 20.23 [04:15:06] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[04:24:14] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.94, 21.33, 20.06 [04:28:05] PROBLEM - mw161 Current Load on mw161 is CRITICAL: LOAD CRITICAL - total load average: 28.85, 18.95, 14.76 [04:28:10] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 16.47, 20.37, 20.18 [04:30:03] RECOVERY - mw161 Current Load on mw161 is OK: LOAD OK - total load average: 14.10, 17.21, 14.66 [04:39:55] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.74, 20.29, 20.23 [04:41:53] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.33, 22.28, 20.96 [04:45:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.65, 22.16, 21.23 [04:49:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.69, 23.97, 22.20 [04:51:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.23, 23.71, 22.35 [04:53:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.46, 23.46, 22.39 [04:57:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [04:59:33] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present] [04:59:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 17.94, 22.54, 22.64 [05:02:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki 
exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:05:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.32, 24.05, 23.21 [05:06:36] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [05:06:37] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 9.33, 4.62, 2.00 [05:07:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:07:53] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.56, 22.32, 22.68 [05:08:31] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.13, 3.72, 1.98 [05:09:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 28.18, 24.42, 23.39 [05:10:28] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.39, 4.44, 2.45 [05:11:29] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:11:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.79, 23.60, 23.22 [05:12:44] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.072 seconds response time. 
wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [05:14:42] RECOVERY - ns2 Puppet on ns2 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:15:28] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) [05:15:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.91, 23.97, 23.36 [05:18:07] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.13, 3.56, 2.89 [05:20:02] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.58, 2.96, 2.75 [05:29:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 18.38, 22.48, 23.51 [05:31:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.09, 23.03, 23.54 [05:33:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.72, 23.07, 23.54 [05:35:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.29, 24.40, 23.97 [05:37:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:39:53] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.27, 23.70, 23.78 [05:41:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.03, 24.32, 24.02 [05:45:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.96, 23.71, 23.83 [05:47:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate 
https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:48:49] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.81, 3.99, 3.20 [05:52:40] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.98, 3.03, 2.98 [05:53:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.48, 20.90, 22.13 [05:55:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.37, 21.90, 22.40 [06:03:15] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.23, 3.90, 3.57 [06:03:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.25, 21.63, 21.80 [06:07:05] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.81, 4.18, 3.71 [06:07:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [06:13:52] PROBLEM - wiki.kirbygang.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Certificate 'wiki.kirbygang.com' expires in 7 day(s) (Thu 08 Aug 2024 05:46:20 AM GMT +0000). [06:14:45] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.07, 3.84, 3.83 [06:17:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [06:17:50] PROBLEM - wiki.andreijiroh.uk.eu.org - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - wiki.andreijiroh.uk.eu.org All nameservers failed to answer the query. 
[06:20:29] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.14, 3.71, 3.76 [06:22:23] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.12, 3.20, 3.55 [06:24:18] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 2.26, 2.67, 3.30 [06:32:57] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 21.33, 19.68, 17.40 [06:34:56] PROBLEM - mw182 Current Load on mw182 is CRITICAL: LOAD CRITICAL - total load average: 25.99, 22.12, 18.58 [06:35:22] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 21.89, 19.25, 16.04 [06:36:32] PROBLEM - mw171 Current Load on mw171 is CRITICAL: LOAD CRITICAL - total load average: 24.30, 18.68, 14.51 [06:37:22] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD CRITICAL - total load average: 27.09, 22.11, 17.49 [06:38:32] RECOVERY - mw171 Current Load on mw171 is OK: LOAD OK - total load average: 19.76, 18.44, 14.91 [06:42:06] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 22.92, 21.47, 17.21 [06:44:06] RECOVERY - mw152 Current Load on mw152 is OK: LOAD OK - total load average: 15.10, 19.71, 17.13 [06:45:22] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 13.65, 21.31, 19.64 [06:46:51] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 15.40, 22.71, 22.17 [06:47:11] RECOVERY - wiki.andreijiroh.uk.eu.org - reverse DNS on sslhost is OK: SSL OK - wiki.andreijiroh.uk.eu.org reverse DNS resolves to cp36.wikitide.net - CNAME OK [06:47:22] RECOVERY - mw162 Current Load on mw162 is OK: LOAD OK - total load average: 10.50, 17.77, 18.56 [06:49:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel 
failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [06:50:49] RECOVERY - mw182 Current Load on mw182 is OK: LOAD OK - total load average: 13.08, 16.70, 19.77 [06:51:34] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present] [06:55:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.05, 20.44, 23.92 [06:56:53] PROBLEM - ns2 NTP time on ns2 is UNKNOWN: check_ntp_time: Invalid hostname/address - time.cloudflare.com Usage: check_ntp_time -H [-4|-6] [-w ] [-c ] [-v verbose] [-o ] [06:58:50] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset -0.001155465841 secs [07:02:25] [Grafana] !tech RESOLVED: High Job Queue Backlog https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:09:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.55, 21.11, 21.28 [07:11:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.47, 20.41, 20.97 [07:12:49] PROBLEM - ns2 NTP time on ns2 is UNKNOWN: check_ntp_time: Invalid hostname/address - time.cloudflare.com Usage: check_ntp_time -H [-4|-6] [-w ] [-c ] [-v verbose] [-o ] [07:14:46] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset 0.0002555251122 secs [07:15:52] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 16.43, 19.00, 20.38 [07:16:19] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[07:25:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.77, 19.19, 19.50 [07:27:52] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 18.76, 18.67, 19.26 [07:43:59] RECOVERY - ns2 Puppet on ns2 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [08:01:34] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.60, 20.59, 19.54 [08:03:32] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 16.30, 19.74, 19.41 [08:23:10] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.69, 20.41, 19.17 [08:29:04] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 13.86, 18.08, 18.78 [08:41:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.24, 20.25, 19.34 [08:43:52] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 15.95, 18.38, 18.76 [08:47:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.05, 21.04, 19.80 [08:55:52] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 20.09, 20.24, 19.92 [09:01:38] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [09:03:35] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present] [09:25:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 18.54, 20.41, 19.33 [09:27:52] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.55, 19.22, 19.02 [09:38:43] 
PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.78, 22.71, 20.29 [09:44:38] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.31, 22.18, 20.90 [09:47:07] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_dns] [09:52:30] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.49, 19.61, 20.21 [10:07:14] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 29.45, 22.61, 20.48 [10:14:09] RECOVERY - ns2 Puppet on ns2 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [10:19:33] [dns-check-action] redbluegreenhat pushed 1 commit to master [+1/-0/±0] https://github.com/miraheze/dns-check-action/compare/f275b5620e7e...7186b8e1b808 [10:19:36] [dns-check-action] redbluegreenhat 7186b8e - Initial action [10:29:02] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 21.40, 20.02, 17.09 [10:31:01] PROBLEM - mw182 Current Load on mw182 is CRITICAL: LOAD CRITICAL - total load average: 24.25, 20.97, 17.74 [10:39:16] PROBLEM - mw161 Current Load on mw161 is WARNING: LOAD WARNING - total load average: 22.45, 18.08, 14.33 [10:39:22] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD CRITICAL - total load average: 27.36, 20.93, 16.42 [10:40:06] PROBLEM - mw152 Current Load on mw152 is CRITICAL: LOAD CRITICAL - total load average: 24.50, 20.89, 16.45 [10:41:16] PROBLEM - mw161 Current Load on mw161 is CRITICAL: LOAD CRITICAL - total load average: 26.40, 20.72, 15.73 [10:42:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. 
https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [10:42:09] [dns] redbluegreenhat created branch setup-ci - https://github.com/miraheze/dns [10:42:11] [dns] redbluegreenhat pushed 1 commit to setup-ci [+1/-0/±0] https://github.com/miraheze/dns/commit/9534f6003038 [10:42:13] [dns] redbluegreenhat 9534f60 - Setup workflow [10:42:24] [dns] redbluegreenhat opened pull request #540: Setup workflow - https://github.com/miraheze/dns/pull/540 [10:43:16] RECOVERY - mw161 Current Load on mw161 is OK: LOAD OK - total load average: 19.43, 19.92, 16.04 [10:44:06] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 17.24, 22.22, 18.25 [10:45:22] RECOVERY - mw162 Current Load on mw162 is OK: LOAD OK - total load average: 11.94, 19.61, 17.78 [10:45:43] [dns-check-action] redbluegreenhat pushed 1 commit to master [+0/-0/±1] https://github.com/miraheze/dns-check-action/compare/7186b8e1b808...0a081fb84583 [10:45:44] [dns-check-action] redbluegreenhat 0a081fb - Run apt update prior to installing [10:46:06] RECOVERY - mw152 Current Load on mw152 is OK: LOAD OK - total load average: 9.75, 17.89, 17.15 [10:46:28] [dns] redbluegreenhat pushed 1 commit to setup-ci [+0/-0/±1] https://github.com/miraheze/dns/compare/9534f6003038...a5a8bb1442c4 [10:46:31] [dns] redbluegreenhat a5a8bb1 - oops [10:46:33] [dns] redbluegreenhat synchronize pull request #540: Setup workflow - https://github.com/miraheze/dns/pull/540 [10:46:54] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 11.45, 20.52, 21.44 [10:47:00] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [10:48:53] RECOVERY - mw182 Current Load on mw182 is OK: LOAD OK - total load average: 11.38, 17.41, 20.18 [10:49:16] [mw-config] OAuthority closed pull request #5625: T12390: Support gemini:// for rainversewiki - 
https://github.com/miraheze/mw-config/pull/5625 [10:49:17] [mw-config] OAuthority pushed 1 commit to master [+0/-0/±1] https://github.com/miraheze/mw-config/compare/56c3e29de4ef...cd8c00be1a19 [10:49:18] [mw-config] BlankEclair cd8c00b - T12390: Support gemini:// for rainversewiki (#5625) [10:50:14] miraheze/mw-config - OAuthority the build passed. [10:50:58] [dns] redbluegreenhat pushed 1 commit to setup-ci [+0/-0/±1] https://github.com/miraheze/dns/compare/a5a8bb1442c4...cb1212345078 [10:51:01] [dns] redbluegreenhat cb12123 - Run on Ubuntu 24.04 [10:51:02] [dns] redbluegreenhat synchronize pull request #540: Setup workflow - https://github.com/miraheze/dns/pull/540 [10:52:55] [dns-check-action] redbluegreenhat pushed 1 commit to master [+0/-0/±1] https://github.com/miraheze/dns-check-action/compare/0a081fb84583...b0231ba9622c [10:52:56] [dns-check-action] redbluegreenhat b0231ba - fix [10:53:11] !log [@test151] starting deploy of {'config': True} to test151 [10:53:12] !log [@test151] finished deploy of {'config': True} to test151 - SUCCESS in 0s [10:53:16] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [10:53:20] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [10:56:27] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 17.49, 20.09, 23.75 [10:56:44] [dns-check-action] redbluegreenhat pushed 2 commits to master [+0/-0/±2] https://github.com/miraheze/dns-check-action/compare/b0231ba9622c...b1f796504836 [10:56:45] [dns-check-action] redbluegreenhat 15ebd25 - fix wording [10:56:46] [dns-check-action] redbluegreenhat b1f7965 - Update README [11:01:03] [dns-check-action] redbluegreenhat created tag v1 - https://github.com/miraheze/dns-check-action [11:01:05] [dns-check-action] redbluegreenhat tagged b1f7965 as v1 
https://github.com/miraheze/dns-check-action/commit/b1f7965048363ca64be81fb3960860b3195ad37c [11:01:45] [dns-check-action] redbluegreenhat published v1 | v1 - https://github.com/miraheze/dns-check-action/releases/tag/v1 [11:01:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [11:02:27] [dns] redbluegreenhat pushed 1 commit to setup-ci [+0/-0/±1] https://github.com/miraheze/dns/compare/cb1212345078...e4a8de65c5ae [11:02:29] [dns] redbluegreenhat e4a8de6 - Use v1 [11:02:31] [dns] redbluegreenhat synchronize pull request #540: Setup workflow - https://github.com/miraheze/dns/pull/540 [11:02:43] !log [@mwtask181] starting deploy of {'config': True} to all [11:02:51] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [11:02:55] !log [@mwtask181] finished deploy of {'config': True} to all - SUCCESS in 12s [11:03:01] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [11:03:19] [dns] redbluegreenhat edited pull request #540: Setup CI - https://github.com/miraheze/dns/pull/540 [11:04:01] [dns] redbluegreenhat edited pull request #540: Setup CI - https://github.com/miraheze/dns/pull/540 [11:04:10] [dns] redbluegreenhat closed pull request #540: Setup CI - https://github.com/miraheze/dns/pull/540 [11:04:11] [dns] redbluegreenhat pushed 1 commit to master [+1/-0/±0] https://github.com/miraheze/dns/compare/595471876d7c...9db6ade0032c [11:04:12] [dns] redbluegreenhat 9db6ade - Setup CI (#540) [11:04:15] [dns] redbluegreenhat deleted branch setup-ci [11:04:16] [dns] redbluegreenhat deleted branch setup-ci - https://github.com/miraheze/dns [11:06:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [11:07:36] !log [@mwtask171] starting deploy of {'config': True} to all [11:07:41] Logged the message at 
https://meta.miraheze.org/wiki/Tech:Server_admin_log [11:07:46] !log [@mwtask171] finished deploy of {'config': True} to all - SUCCESS in 10s [11:07:52] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [11:08:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [11:14:11] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.11, 18.68, 20.33 [11:17:27] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.56, 4.28, 2.63 [11:18:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [11:20:04] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.89, 20.92, 20.58 [11:21:17] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 2.86, 3.37, 2.61 [11:22:02] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.54, 21.81, 20.93 [11:24:00] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.54, 22.30, 21.25 [11:29:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [11:33:52] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.62, 19.09, 20.29 [11:39:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [11:46:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [11:47:09] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.47, 4.21, 3.30 [11:48:19] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:49:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: 
ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [11:50:13] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) [11:51:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [11:51:33] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present] [11:54:41] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.36, 22.56, 21.02 [11:56:42] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.99, 3.88, 3.80 [11:58:38] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.24, 22.54, 21.37 [11:59:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [12:04:23] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.02, 3.85, 3.68 [12:04:32] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 16.33, 18.69, 20.04 [12:06:18] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.10, 3.28, 3.51 [12:08:12] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.30, 2.86, 3.34 [12:09:26] PROBLEM - ns2 NTP time on ns2 is UNKNOWN: check_ntp_time: Invalid hostname/address - time.cloudflare.comUsage: check_ntp_time -H [-4|-6] [-w ] [-c ] [-v verbose] [-o ] [12:09:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [12:10:20] [Grafana] !tech 
FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [12:10:25] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.60, 20.22, 20.12 [12:11:27] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset -0.0002946555614 secs [12:12:03] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.70, 4.17, 3.77 [12:12:23] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.50, 22.10, 20.79 [12:14:17] PROBLEM - wiki.walkscape.app - LetsEncrypt on sslhost is WARNING: WARNING - Certificate 'wiki.walkscape.app' expires in 15 day(s) (Fri 16 Aug 2024 11:48:55 AM GMT +0000). [12:14:21] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.76, 19.99, 20.16 [12:14:29] [ssl] WikiTideSSLBot pushed 1 commit to master [+0/-0/±1] https://github.com/miraheze/ssl/compare/c0893637ca03...70cbe195de92 [12:14:32] [ssl] WikiTideSSLBot 70cbe19 - Bot: Update SSL cert for wiki.walkscape.app [12:15:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [12:15:54] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.44, 3.19, 3.45 [12:17:50] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.47, 2.78, 3.29 [12:17:58] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[12:20:15] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [12:21:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.37, 3.87, 3.70 [12:23:44] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.18, 4.76, 4.05 [12:26:13] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [12:28:10] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.200 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [12:30:15] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [12:32:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [12:41:35] PROBLEM - cp51 Varnish Backends on cp51 is CRITICAL: 5 backends are down. 
mw151 mw161 mw162 mw172 mw181 [12:41:38] PROBLEM - cp51 HTTPS on cp51 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: cURL returned 28 - Operation timed out after 10005 milliseconds with 0 bytes received [12:41:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.10, 2.99, 3.74 [12:41:48] PROBLEM - ping6 on cp51 is CRITICAL: PING CRITICAL - Packet loss = 50%, RTA = 372.57 ms [12:42:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [12:43:32] RECOVERY - cp51 Varnish Backends on cp51 is OK: All 19 backends are healthy [12:43:39] RECOVERY - cp51 HTTPS on cp51 is OK: HTTP OK: HTTP/2 404 - Status line output matched "HTTP/2 404" - 3821 bytes in 1.158 second response time [12:43:51] RECOVERY - ping6 on cp51 is OK: PING OK - Packet loss = 0%, RTA = 184.92 ms [12:45:44] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.53, 2.38, 3.33 [12:46:41] RECOVERY - ns2 Puppet on ns2 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [12:47:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [12:52:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [12:52:38] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.21, 21.58, 20.22 [12:54:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [12:54:37] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.52, 22.40, 20.61 [12:56:35] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 15.37, 19.60, 19.81 [12:59:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [13:00:29] 
PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.45, 22.16, 20.71 [13:02:27] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.16, 20.59, 20.30 [13:04:25] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.76, 19.47, 19.92 [13:12:17] RECOVERY - wiki.walkscape.app - LetsEncrypt on sslhost is OK: OK - Certificate 'wiki.walkscape.app' will expire on Tue 29 Oct 2024 11:14:23 AM GMT +0000. [13:15:42] PROBLEM - wiki.jill-jimmy.com - reverse DNS on sslhost is WARNING: LifetimeTimeout: The resolution lifetime expired after 5.401 seconds: Server 2606:4700:4700::1111 UDP port 53 answered The DNS operation timed out.; Server 2606:4700:4700::1111 UDP port 53 answered The DNS operation timed out.; Server 2606:4700:4700::1111 UDP port 53 answered The DNS operation timed out. [13:15:46] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [13:15:58] PROBLEM - www.streamerlore.com - reverse DNS on sslhost is WARNING: LifetimeTimeout: The resolution lifetime expired after 5.404 seconds: Server 2606:4700:4700::1111 UDP port 53 answered The DNS operation timed out.; Server 2606:4700:4700::1111 UDP port 53 answered The DNS operation timed out.; Server 2606:4700:4700::1111 UDP port 53 answered The DNS operation timed out. [13:19:09] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 29.81, 24.27, 21.55 [13:21:07] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.61, 23.69, 21.69 [13:21:14] PROBLEM - wiki.ventistudio.fr - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - wiki.ventistudio.fr All nameservers failed to answer the query. 
[13:23:05] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.93, 24.10, 22.04 [13:23:29] PROBLEM - wiki.case-clicker.com - reverse DNS on sslhost is WARNING: LifetimeTimeout: The resolution lifetime expired after 5.407 seconds: Server 2606:4700:4700::1111 UDP port 53 answered The DNS operation timed out.; Server 2606:4700:4700::1111 UDP port 53 answered The DNS operation timed out.; Server 2606:4700:4700::1111 UDP port 53 answered The DNS operation timed out. [13:23:33] PROBLEM - ns2 NTP time on ns2 is UNKNOWN: check_ntp_time: Invalid hostname/address - time.cloudflare.comUsage: check_ntp_time -H [-4|-6] [-w ] [-c ] [-v verbose] [-o ] [13:24:31] PROBLEM - wiki.col6.de - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - wiki.col6.de All nameservers failed to answer the query. [13:25:30] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset -0.001696288586 secs [13:29:04] PROBLEM - alternatewiki.tombricks.com - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - alternatewiki.tombricks.com All nameservers failed to answer the query. 
[13:29:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [13:31:33] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present] [13:34:54] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.96, 23.38, 23.15 [13:42:47] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.39, 23.21, 23.03 [13:44:45] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.59, 22.39, 22.77 [13:44:46] RECOVERY - www.streamerlore.com - reverse DNS on sslhost is OK: SSL OK - www.streamerlore.com reverse DNS resolves to cp36.wikitide.net - CNAME OK [13:45:42] RECOVERY - wiki.jill-jimmy.com - reverse DNS on sslhost is OK: SSL OK - wiki.jill-jimmy.com reverse DNS resolves to cp36.wikitide.net - CNAME OK [13:46:15] RECOVERY - ns2 Puppet on ns2 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:50:49] RECOVERY - wiki.ventistudio.fr - reverse DNS on sslhost is OK: SSL OK - wiki.ventistudio.fr reverse DNS resolves to cp36.wikitide.net - CNAME OK [13:52:37] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 27.61, 24.62, 23.45 [13:53:12] RECOVERY - wiki.case-clicker.com - reverse DNS on sslhost is OK: SSL OK - wiki.case-clicker.com reverse DNS resolves to cp36.wikitide.net - CNAME OK [13:54:23] RECOVERY - wiki.col6.de - reverse DNS on sslhost is OK: SSL OK - wiki.col6.de reverse DNS resolves to cp36.wikitide.net - CNAME OK [13:56:34] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.42, 23.62, 
23.41 [13:58:32] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.59, 23.87, 23.51 [13:58:55] RECOVERY - alternatewiki.tombricks.com - reverse DNS on sslhost is OK: SSL OK - alternatewiki.tombricks.com reverse DNS resolves to cp36.wikitide.net - CNAME OK [14:00:30] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.32, 22.74, 23.16 [14:06:24] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 12.48, 15.23, 19.52 [14:10:19] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.59, 19.50, 20.27 [14:12:17] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.76, 21.97, 21.08 [14:14:15] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.38, 21.66, 21.05 [14:16:32] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[14:17:14] PROBLEM - ns2 NTP time on ns2 is UNKNOWN: check_ntp_time: Invalid hostname/address - time.cloudflare.comUsage: check_ntp_time -H [-4|-6] [-w ] [-c ] [-v verbose] [-o ] [14:17:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [14:18:12] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.58, 22.90, 21.60 [14:19:11] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset -0.001590847969 secs [14:19:33] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [441 system event log (SEL) entries present] [14:20:10] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.63, 23.33, 21.94 [14:22:08] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.43, 23.95, 22.29 [14:30:05] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 22.88, 20.64, 16.73 [14:33:14] [mw-config] dependabot[bot] pushed 1 commit to dependabot/composer/justinrainbow/json-schema-6.0.0 [+0/-0/±1] https://github.com/miraheze/mw-config/commit/062057d5b616 [14:33:16] [mw-config] dependabot[bot] 062057d - Update justinrainbow/json-schema requirement from 5.3.0 to 6.0.0 [14:33:19] [mw-config] dependabot[bot] created branch dependabot/composer/justinrainbow/json-schema-6.0.0 - https://github.com/miraheze/mw-config [14:33:20] [mw-config] dependabot[bot] labeled pull request #5626: Update justinrainbow/json-schema requirement from 5.3.0 to 6.0.0 - https://github.com/miraheze/mw-config/pull/5626 [14:33:23] [mw-config] dependabot[bot] labeled pull request #5626: Update 
justinrainbow/json-schema requirement from 5.3.0 to 6.0.0 - https://github.com/miraheze/mw-config/pull/5626 [14:33:25] [mw-config] dependabot[bot] opened pull request #5626: Update justinrainbow/json-schema requirement from 5.3.0 to 6.0.0 - https://github.com/miraheze/mw-config/pull/5626 [14:34:05] PROBLEM - mw182 Current Load on mw182 is CRITICAL: LOAD CRITICAL - total load average: 26.23, 22.33, 18.17 [14:34:06] PROBLEM - mw152 Current Load on mw152 is CRITICAL: LOAD CRITICAL - total load average: 26.59, 20.92, 15.95 [14:34:07] miraheze/mw-config - dependabot[bot] the build passed. [14:37:22] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 23.34, 20.79, 16.42 [14:39:16] PROBLEM - mw161 Current Load on mw161 is CRITICAL: LOAD CRITICAL - total load average: 24.08, 21.01, 16.48 [14:41:16]