[00:01:41] PROBLEM - mw152 Current Load on mw152 is CRITICAL: LOAD CRITICAL - total load average: 25.35, 24.54, 23.29 [00:19:59] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.35, 21.58, 19.34 [00:21:56] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.20, 20.15, 19.09 [00:31:41] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.45, 20.84, 19.88 [00:36:03] PROBLEM - mw151 Current Load on mw151 is WARNING: LOAD WARNING - total load average: 22.51, 20.15, 18.59 [00:37:58] RECOVERY - mw151 Current Load on mw151 is OK: LOAD OK - total load average: 17.18, 19.06, 18.39 [00:39:00] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [00:39:27] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 19.80, 20.37, 20.04 [00:41:48] PROBLEM - mw151 Current Load on mw151 is WARNING: LOAD WARNING - total load average: 21.52, 20.89, 19.33 [00:44:00] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [00:45:37] RECOVERY - mw151 Current Load on mw151 is OK: LOAD OK - total load average: 17.08, 19.92, 19.41 [00:48:14] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.27, 20.66, 20.15 [00:50:11] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.44, 21.58, 20.53 [00:52:07] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 17.05, 20.63, 20.37 [01:01:50] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.40, 19.00, 20.00 [01:09:38] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 29.10, 23.77, 21.55 [01:13:30] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. 
https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [01:13:31] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.58, 23.64, 22.13 [01:31:03] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 18.11, 19.45, 20.39 [01:31:19] PROBLEM - mw151 Current Load on mw151 is WARNING: LOAD WARNING - total load average: 20.29, 20.51, 18.53 [01:33:19] PROBLEM - mw151 Current Load on mw151 is CRITICAL: LOAD CRITICAL - total load average: 25.07, 22.04, 19.32 [01:35:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.24, 21.62, 21.06 [01:35:19] PROBLEM - mw151 Current Load on mw151 is WARNING: LOAD WARNING - total load average: 21.60, 21.49, 19.42 [01:45:19] RECOVERY - mw151 Current Load on mw151 is OK: LOAD OK - total load average: 17.47, 19.95, 19.72 [01:49:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.23, 23.14, 23.27 [02:13:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.52, 22.74, 22.20 [02:15:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.93, 22.38, 22.09 [02:19:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.69, 22.51, 22.15 [02:27:19] PROBLEM - mw151 Current Load on mw151 is CRITICAL: LOAD CRITICAL - total load average: 25.18, 21.45, 18.42 [02:31:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 17.15, 23.37, 23.43 [02:31:19] PROBLEM - mw151 Current Load on mw151 is WARNING: LOAD WARNING - total load average: 16.17, 21.26, 19.24 [02:33:19] RECOVERY - mw151 Current Load on mw151 is OK: LOAD OK - total load average: 14.15, 18.98, 18.66 [02:45:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.68, 22.22, 21.85 [02:47:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.18, 21.49, 
21.62 [02:49:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.46, 22.88, 22.13 [02:51:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 18.78, 21.06, 21.53 [02:57:03] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 19.67, 19.04, 20.35 [03:01:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 28.03, 22.91, 21.51 [03:01:24] RECOVERY - mon181 Backups Grafana on mon181 is OK: FILE_AGE OK: /var/log/grafana-backup.log is 65 seconds old and 93 bytes [03:02:30] PROBLEM - kagaga.jp - LetsEncrypt on sslhost is CRITICAL: No address associated with hostname HTTP CRITICAL - Unable to open TCP socket [03:05:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.90, 23.16, 22.08 [03:06:04] PROBLEM - kagaga.jp - reverse DNS on sslhost is WARNING: rDNS WARNING - reverse DNS entry for kagaga.jp could not be found [03:08:30] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:15:03] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.36, 18.38, 20.18 [03:21:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.54, 20.21, 20.47 [03:23:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.72, 22.11, 21.09 [03:26:09] PROBLEM - mw151 Current Load on mw151 is CRITICAL: LOAD CRITICAL - total load average: 26.57, 22.57, 19.30 [03:27:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.92, 23.50, 22.07 [03:27:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:31:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.31, 23.37, 22.27 [03:33:03] PROBLEM - mw181 Current Load on mw181 is WARNING: 
LOAD WARNING - total load average: 21.56, 23.21, 22.39 [03:35:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.17, 24.05, 22.82 [03:37:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.19, 23.31, 22.68 [03:39:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.67, 24.30, 23.15 [03:41:26] PROBLEM - mw151 Current Load on mw151 is WARNING: LOAD WARNING - total load average: 19.47, 22.56, 22.12 [03:45:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.31, 22.57, 22.87 [03:47:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.50, 23.65, 23.19 [03:47:19] PROBLEM - mw151 Current Load on mw151 is CRITICAL: LOAD CRITICAL - total load average: 25.67, 21.64, 21.48 [03:49:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.61, 22.98, 22.99 [03:49:19] PROBLEM - mw151 Current Load on mw151 is WARNING: LOAD WARNING - total load average: 21.25, 21.48, 21.46 [03:51:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.14, 24.02, 23.39 [03:52:25] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:53:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 15.14, 20.96, 22.37 [03:53:20] PROBLEM - wiki.gab.pt.eu.org - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - wiki.gab.pt.eu.org All nameservers failed to answer the query. 
[03:55:19] RECOVERY - mw151 Current Load on mw151 is OK: LOAD OK - total load average: 19.06, 19.08, 20.37 [04:07:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.26, 21.76, 21.54 [04:08:09] PROBLEM - mw151 Current Load on mw151 is CRITICAL: LOAD CRITICAL - total load average: 28.77, 23.54, 21.42 [04:10:03] PROBLEM - mw151 Current Load on mw151 is WARNING: LOAD WARNING - total load average: 21.87, 22.54, 21.30 [04:11:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.37, 22.30, 21.93 [04:17:42] RECOVERY - mw151 Current Load on mw151 is OK: LOAD OK - total load average: 15.22, 18.31, 19.89 [04:21:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 27.80, 22.88, 22.02 [04:22:11] RECOVERY - wiki.gab.pt.eu.org - reverse DNS on sslhost is OK: SSL OK - wiki.gab.pt.eu.org reverse DNS resolves to cp36.wikitide.net - CNAME OK [04:24:11] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [04:26:11] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [Inlet Temp = Critical, 323 system event log (SEL) entries present] [04:31:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 18.43, 22.46, 22.82 [04:33:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.02, 22.85, 22.90 [04:35:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.27, 22.76, 22.87 [04:37:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.70, 23.25, 23.01 [04:39:03] PROBLEM - mw181 
Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.60, 22.04, 22.59 [04:42:25] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: High Job Queue Backlog https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [04:45:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.33, 22.25, 22.40 [04:47:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.32, 22.71, 22.57 [04:55:03] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 16.65, 17.79, 20.28 [05:02:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:03:55] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.41, 3.20, 1.32 [05:05:56] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 2.90, 2.71, 1.36 [05:07:41] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 17.72, 20.59, 23.41 [05:09:55] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.26, 5.12, 2.66 [05:12:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:13:55] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.74, 3.30, 2.47 [05:16:37] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[05:17:41] RECOVERY - mw152 Current Load on mw152 is OK: LOAD OK - total load average: 13.26, 16.00, 20.10 [05:36:19] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [05:38:20] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [Inlet Temp = Critical, 325 system event log (SEL) entries present] [05:45:08] RECOVERY - ns2 Puppet on ns2 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:14:58] PROBLEM - cloud18 Puppet on cloud18 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[ulogd2] [06:15:49] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[06:38:24] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [06:40:25] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [Inlet Temp = Critical, 327 system event log (SEL) entries present] [06:40:58] RECOVERY - cloud18 Puppet on cloud18 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:45:46] RECOVERY - ns2 Puppet on ns2 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:10] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 20.19, 20.73, 18.99 [06:50:07] RECOVERY - mw152 Current Load on mw152 is OK: LOAD OK - total load average: 17.07, 19.21, 18.76 [07:24:41] PROBLEM - mw152 Current Load on mw152 is CRITICAL: LOAD CRITICAL - total load average: 25.64, 22.22, 19.54 [07:30:36] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 20.29, 22.42, 20.76 [07:32:35] PROBLEM - mw152 Current Load on mw152 is CRITICAL: LOAD CRITICAL - total load average: 27.36, 24.13, 21.57 [07:33:58] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 29.52, 24.00, 19.94 [07:36:31] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 16.44, 21.79, 21.31 [07:39:48] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 15.51, 22.84, 21.16 [07:40:28] PROBLEM - mw152 Current Load on mw152 is CRITICAL: LOAD CRITICAL - total load average: 24.37, 22.23, 21.51 [07:42:27] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 20.55, 21.00, 21.11 [07:44:25] PROBLEM - mw152 Current Load on 
mw152 is CRITICAL: LOAD CRITICAL - total load average: 26.81, 22.93, 21.78 [07:46:24] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 23.77, 23.69, 22.24 [07:50:21] RECOVERY - mw152 Current Load on mw152 is OK: LOAD OK - total load average: 9.25, 17.17, 20.05 [07:51:27] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 12.76, 17.61, 19.67 [08:05:07] PROBLEM - mw152 Current Load on mw152 is CRITICAL: LOAD CRITICAL - total load average: 26.99, 23.78, 21.23 [08:13:51] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.90, 21.32, 19.55 [08:21:37] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 16.94, 19.81, 19.71 [08:25:32] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 28.26, 23.75, 21.28 [08:29:25] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.25, 23.48, 21.77 [08:31:30] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. 
https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [08:39:08] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.34, 23.14, 21.91 [08:41:04] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.02, 21.66, 21.50 [08:47:03] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 15.24, 18.30, 20.11 [08:49:43] PROBLEM - ns2 NTP time on ns2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:50:37] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [08:51:42] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset 0.0003411769867 secs [08:52:37] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [Inlet Temp = Critical, 330 system event log (SEL) entries present] [08:53:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.18, 21.31, 20.75 [08:55:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.68, 22.08, 21.12 [08:58:49] PROBLEM - cp27 Varnish Backends on cp27 is CRITICAL: 1 backends are down. 
mw152 [09:00:44] RECOVERY - cp27 Varnish Backends on cp27 is OK: All 19 backends are healthy [09:01:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.96, 22.73, 21.53 [09:05:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.86, 21.44, 21.25 [09:06:30] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:11:03] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 16.85, 19.07, 20.26 [09:17:14] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [09:21:03] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.93, 24.02, 21.91 [09:21:19] PROBLEM - mw151 Current Load on mw151 is CRITICAL: LOAD CRITICAL - total load average: 24.30, 19.16, 17.40 [09:23:19] PROBLEM - mw151 Current Load on mw151 is WARNING: LOAD WARNING - total load average: 20.59, 19.69, 17.82 [09:31:19] PROBLEM - mw151 Current Load on mw151 is CRITICAL: LOAD CRITICAL - total load average: 26.04, 22.18, 19.73 [09:33:19] PROBLEM - mw151 Current Load on mw151 is WARNING: LOAD WARNING - total load average: 20.48, 21.24, 19.68 [09:34:30] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:35:03] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.60, 23.55, 23.76 [09:35:19] RECOVERY - mw151 Current Load on mw151 is OK: LOAD OK - total load average: 19.73, 20.39, 19.53 [09:45:17] RECOVERY - ns2 Puppet on ns2 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:47:07] PROBLEM - ns2 NTP time on ns2 is UNKNOWN: check_ntp_time: Invalid hostname/address - time.cloudflare.com Usage: check_ntp_time -H [-4|-6] [-w ] [-c ] [-v verbose] [-o