[00:00:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:01:43] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 2.25, 2.95, 3.33
[00:02:38] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 20.94, 20.54, 18.02
[00:04:36] RECOVERY - mw162 Current Load on mw162 is OK: LOAD OK - total load average: 17.96, 19.54, 17.94
[00:05:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:10:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:11:16] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:11:43] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 8.66, 5.01, 3.87
[00:12:05] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 18.97, 23.51, 23.53
[00:13:11] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0)
[00:13:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.13, 3.92, 3.60
[00:15:20] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:21:45] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.89, 3.67, 3.56
[00:23:32] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[00:25:20] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:25:28] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.246 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206
[00:26:05] RECOVERY - mw182 Current Load on mw182 is OK: LOAD OK - total load average: 13.57, 15.40, 19.62
[00:29:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.70, 3.50, 3.72
[00:29:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 16.50, 19.03, 23.87
[00:30:20] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:33:43] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.42, 4.43, 3.98
[00:39:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.78, 3.72, 3.92
[00:40:20] [Grafana] !tech RESOLVED: High Job Queue Backlog https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:43:52] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 15.36, 16.88, 19.84
[00:45:43] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.99, 3.46, 3.68
[00:46:49] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[00:47:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.89, 3.64, 3.75
[00:49:32] RECOVERY - www.dovearchives.wiki - LetsEncrypt on sslhost is OK: OK - Certificate 'www.dovearchives.wiki' will expire on Tue 29 Oct 2024 10:49:57 PM GMT +0000.
[00:49:43] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.95, 3.95, 3.84
[00:50:47] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.37, 20.86, 20.60
[00:51:01] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[00:53:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.53, 3.97, 3.93
[00:54:43] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.91, 23.21, 21.53
[00:55:43] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.72, 4.60, 4.16
[00:56:36] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:58:30] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0)
[00:59:15] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.066 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206
[01:00:38] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.91, 23.14, 22.24
[01:01:49] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[01:03:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 0.34, 2.54, 3.57
[01:05:43] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 0.12, 1.72, 3.14
[01:22:17] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 18.64, 18.86, 20.13
[01:26:12] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.40, 22.17, 21.26
[01:36:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 27.93, 23.54, 21.99
[01:38:01] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.80, 22.83, 21.89
[01:43:55] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.50, 22.71, 21.88
[02:16:05] PROBLEM - mw182 Current Load on mw182 is CRITICAL: LOAD CRITICAL - total load average: 28.09, 21.90, 17.84
[02:17:31] PROBLEM - cp26 Varnish Backends on cp26 is CRITICAL: 1 backends are down. mw181
[02:19:22] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 21.87, 19.06, 15.30
[02:19:29] RECOVERY - cp26 Varnish Backends on cp26 is OK: All 19 backends are healthy
[02:19:45] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all
[02:20:06] PROBLEM - mw152 Current Load on mw152 is CRITICAL: LOAD CRITICAL - total load average: 27.44, 20.58, 15.56
[02:21:22] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD CRITICAL - total load average: 24.31, 21.61, 16.73
[02:21:42] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [442 system event log (SEL) entries present]
[02:23:22] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 23.98, 22.44, 17.64
[02:25:22] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD CRITICAL - total load average: 25.21, 22.69, 18.27
[02:26:06] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 22.27, 23.35, 18.72
[02:27:22] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 18.96, 21.38, 18.34
[02:27:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[02:28:06] RECOVERY - mw152 Current Load on mw152 is OK: LOAD OK - total load average: 15.09, 20.34, 18.18
[02:32:26] PROBLEM - phorge171 issue-tracker.miraheze.org HTTPS on phorge171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 328 bytes in 0.013 second response time
[02:32:53] PROBLEM - phorge171 php-fpm on phorge171 is CRITICAL: PROCS CRITICAL: 0 processes with command name 'php-fpm8.2'
[02:32:57] PROBLEM - phorge171 phorge-static.wikitide.net HTTPS on phorge171 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 502 Bad Gateway
[02:33:22] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD CRITICAL - total load average: 25.88, 22.53, 19.68
[02:34:26] PROBLEM - cp26 Varnish Backends on cp26 is CRITICAL: 1 backends are down. mw181
[02:36:06] PROBLEM - mw152 Current Load on mw152 is CRITICAL: LOAD CRITICAL - total load average: 25.75, 21.66, 19.28
[02:36:23] RECOVERY - cp26 Varnish Backends on cp26 is OK: All 19 backends are healthy
[02:38:17] PROBLEM - mw172 Current Load on mw172 is CRITICAL: LOAD CRITICAL - total load average: 27.30, 21.62, 18.11
[02:38:26] RECOVERY - phorge171 issue-tracker.miraheze.org HTTPS on phorge171 is OK: HTTP OK: HTTP/1.1 200 OK - 19644 bytes in 0.069 second response time
[02:38:32] PROBLEM - mw171 Current Load on mw171 is CRITICAL: LOAD CRITICAL - total load average: 28.63, 22.83, 18.36
[02:38:53] RECOVERY - phorge171 php-fpm on phorge171 is OK: PROCS OK: 9 processes with command name 'php-fpm8.2'
[02:38:57] RECOVERY - phorge171 phorge-static.wikitide.net HTTPS on phorge171 is OK: HTTP OK: Status line output matched "HTTP/1.1 200" - 17718 bytes in 0.036 second response time
[02:39:54] PROBLEM - mw161 Current Load on mw161 is CRITICAL: LOAD CRITICAL - total load average: 27.39, 22.53, 18.19
[02:42:25] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[02:44:06] PROBLEM - cp51 Varnish Backends on cp51 is CRITICAL: 1 backends are down. mw181
[02:44:18] PROBLEM - cp26 Varnish Backends on cp26 is CRITICAL: 2 backends are down. mw181 mw182
[02:44:39] PROBLEM - cp27 Varnish Backends on cp27 is CRITICAL: 1 backends are down. mw181
[02:44:55] PROBLEM - cp41 Varnish Backends on cp41 is CRITICAL: 1 backends are down. mw181
[02:46:03] RECOVERY - cp51 Varnish Backends on cp51 is OK: All 19 backends are healthy
[02:46:16] RECOVERY - cp26 Varnish Backends on cp26 is OK: All 19 backends are healthy
[02:46:35] RECOVERY - cp27 Varnish Backends on cp27 is OK: All 19 backends are healthy
[02:46:54] RECOVERY - cp41 Varnish Backends on cp41 is OK: All 19 backends are healthy
[02:49:43] PROBLEM - mw161 Current Load on mw161 is WARNING: LOAD WARNING - total load average: 16.19, 22.58, 21.02
[02:49:59] PROBLEM - cp51 Varnish Backends on cp51 is CRITICAL: 1 backends are down. mw181
[02:51:41] RECOVERY - mw161 Current Load on mw161 is OK: LOAD OK - total load average: 13.83, 19.60, 20.12
[02:51:56] RECOVERY - cp51 Varnish Backends on cp51 is OK: All 19 backends are healthy
[02:51:58] PROBLEM - mw172 Current Load on mw172 is WARNING: LOAD WARNING - total load average: 13.11, 21.59, 22.03
[02:52:06] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 12.99, 20.98, 22.39
[02:52:32] PROBLEM - mw171 Current Load on mw171 is WARNING: LOAD WARNING - total load average: 10.88, 19.62, 21.25
[02:53:22] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 13.72, 20.93, 23.50
[02:54:32] RECOVERY - mw171 Current Load on mw171 is OK: LOAD OK - total load average: 10.07, 16.42, 19.88
[02:55:53] RECOVERY - mw172 Current Load on mw172 is OK: LOAD OK - total load average: 10.34, 15.60, 19.50
[02:56:06] RECOVERY - mw152 Current Load on mw152 is OK: LOAD OK - total load average: 13.66, 16.56, 20.24
[02:59:22] RECOVERY - mw162 Current Load on mw162 is OK: LOAD OK - total load average: 10.56, 14.19, 19.60
[03:00:05] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 15.28, 18.47, 23.08
[03:08:05] RECOVERY - mw182 Current Load on mw182 is OK: LOAD OK - total load average: 16.87, 16.08, 19.91
[03:15:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all
[03:17:33] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [442 system event log (SEL) entries present]
[03:37:25] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[03:38:13] [mw-config] The-Voidwalker pushed 1 commit to master [+0/-0/±1] https://github.com/miraheze/mw-config/compare/cd8c00be1a19...66301932f893
[03:38:14] [mw-config] The-Voidwalker 6630193 - Math install no longer requires sql
[03:39:11] miraheze/mw-config - The-Voidwalker the build passed.
[03:39:33] !log [@mwtask181] starting deploy of {'config': True} to all
[03:39:46] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log
[03:39:46] !log [@mwtask181] finished deploy of {'config': True} to all - SUCCESS in 13s
[03:40:04] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log
[03:53:27] !log [@test151] starting deploy of {'config': True} to test151
[03:53:28] !log [@test151] finished deploy of {'config': True} to test151 - SUCCESS in 0s
[03:53:38] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log
[03:53:46] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log
[04:07:30] !log [@mwtask171] starting deploy of {'config': True} to all
[04:07:36] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log
[04:07:42] !log [@mwtask171] finished deploy of {'config': True} to all - SUCCESS in 12s
[04:07:54] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log
[04:11:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all
[04:12:25] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[04:13:33] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [442 system event log (SEL) entries present]
[04:22:25] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[Grafana] !tech RESOLVED: High Job Queue Backlog https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[04:29:24] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 28%, RTA = 178.94 ms
[04:31:24] RECOVERY - ping6 on cp26 is OK: PING OK - Packet loss = 0%, RTA = 178.87 ms
[05:01:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[05:03:30] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[05:03:31] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all
[05:03:43] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.81, 3.42, 1.40
[05:04:19] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:05:26] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.076 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206
[05:05:30] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 179.62 ms
[05:05:33] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [442 system event log (SEL) entries present]
[05:06:13] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0)
[05:06:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[05:07:30] RECOVERY - ping6 on cp26 is OK: PING OK - Packet loss = 0%, RTA = 179.00 ms
[05:08:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[05:09:50] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[05:11:39] PROBLEM - ping6 on cp26 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 179.59 ms
[05:13:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[05:13:39] RECOVERY - ping6 on cp26 is OK: PING OK - Packet loss = 0%, RTA = 180.02 ms
[05:13:43] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.51, 3.44, 2.70
[05:15:43] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[05:15:58] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.079 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206
[05:17:43] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 2.71, 3.37, 2.88
[05:19:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 16.86, 20.96, 23.95
[05:20:43] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[05:21:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[05:21:52] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.59, 22.42, 24.08
[05:23:52] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.89, 22.58, 23.93
[05:25:53] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.61, 24.06, 24.31
[05:26:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[05:40:46] PROBLEM - cp41 Puppet on cp41 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/reception]
[06:04:53] RECOVERY - cp41 Puppet on cp41 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:16:05] PROBLEM - mw182 Current Load on mw182 is CRITICAL: LOAD CRITICAL - total load average: 29.23, 22.10, 17.68
[06:18:06] PROBLEM - mw152 Current Load on mw152 is CRITICAL: LOAD CRITICAL - total load average: 28.53, 20.64, 15.26
[06:18:32] PROBLEM - mw171 Current Load on mw171 is CRITICAL: LOAD CRITICAL - total load average: 28.11, 21.64, 16.26
[06:19:22] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD CRITICAL - total load average: 29.20, 21.35, 15.65
[06:21:16] PROBLEM - mw161 Current Load on mw161 is WARNING: LOAD WARNING - total load average: 23.21, 18.91, 14.76
[06:21:38] PROBLEM - mw172 Current Load on mw172 is WARNING: LOAD WARNING - total load average: 23.13, 19.88, 15.21
[06:23:35] RECOVERY - mw172 Current Load on mw172 is OK: LOAD OK - total load average: 17.55, 19.26, 15.55
[06:24:32] PROBLEM - mw171 Current Load on mw171 is WARNING: LOAD WARNING - total load average: 17.70, 22.62, 18.86
[06:27:16] RECOVERY - mw161 Current Load on mw161 is OK: LOAD OK - total load average: 15.51, 18.91, 16.30
[06:28:06] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 22.12, 23.88, 19.98
[06:28:32] PROBLEM - mw171 Current Load on mw171 is CRITICAL: LOAD CRITICAL - total load average: 28.36, 24.99, 20.63
[06:30:32] PROBLEM - mw171 Current Load on mw171 is WARNING: LOAD WARNING - total load average: 18.99, 22.69, 20.33
[06:31:22] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 19.15, 23.30, 20.66
[06:32:53] PROBLEM - phorge171 php-fpm on phorge171 is CRITICAL: PROCS CRITICAL: 0 processes with command name 'php-fpm8.2'
[06:32:57] PROBLEM - phorge171 phorge-static.wikitide.net HTTPS on phorge171 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 502 Bad Gateway
[06:34:06] PROBLEM - mw152 Current Load on mw152 is CRITICAL: LOAD CRITICAL - total load average: 25.22, 22.28, 20.37
[06:34:26] PROBLEM - phorge171 issue-tracker.miraheze.org HTTPS on phorge171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 328 bytes in 0.013 second response time
[06:34:30] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1
[06:34:32] PROBLEM - mw171 Current Load on mw171 is CRITICAL: LOAD CRITICAL - total load average: 25.81, 23.63, 21.17
[06:35:29] PROBLEM - ns2 NTP time on ns2 is UNKNOWN: check_ntp_time: Invalid hostname/address - time.cloudflare.com
Usage: check_ntp_time -H [-4|-6] [-w ] [-c ] [-v verbose] [-o