[00:02:09] PROBLEM - cloud4 Current Load on cloud4 is CRITICAL: CRITICAL - load average: 33.22, 22.17, 17.75 [00:04:09] PROBLEM - graylog2 Puppet on graylog2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [00:04:17] PROBLEM - gluster3 Current Load on gluster3 is CRITICAL: CRITICAL - load average: 7.12, 5.45, 4.70 [00:06:08] PROBLEM - cloud4 Current Load on cloud4 is WARNING: WARNING - load average: 23.43, 23.28, 19.19 [00:06:17] RECOVERY - gluster3 Current Load on gluster3 is OK: OK - load average: 4.49, 5.03, 4.64 [00:08:09] PROBLEM - cloud4 Current Load on cloud4 is CRITICAL: CRITICAL - load average: 27.05, 25.33, 20.48 [00:10:16] PROBLEM - gluster3 Current Load on gluster3 is CRITICAL: CRITICAL - load average: 7.90, 6.54, 5.37 [00:11:25] PROBLEM - mon2 Current Load on mon2 is WARNING: WARNING - load average: 3.46, 3.19, 2.70 [00:13:20] RECOVERY - mon2 Current Load on mon2 is OK: OK - load average: 2.21, 2.85, 2.64 [00:32:07] RECOVERY - graylog2 Puppet on graylog2 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:46:07] PROBLEM - cloud4 Current Load on cloud4 is WARNING: WARNING - load average: 14.65, 18.67, 23.24 [00:56:08] RECOVERY - cloud4 Current Load on cloud4 is OK: OK - load average: 14.23, 16.63, 20.37 [00:58:33] PROBLEM - gluster3 Current Load on gluster3 is WARNING: WARNING - load average: 3.02, 4.62, 5.97 [01:04:37] PROBLEM - gluster3 Current Load on gluster3 is CRITICAL: CRITICAL - load average: 8.02, 5.26, 5.70 [01:06:36] PROBLEM - gluster3 Current Load on gluster3 is WARNING: WARNING - load average: 5.23, 5.22, 5.64 [01:16:37] RECOVERY - gluster3 Current Load on gluster3 is OK: OK - load average: 3.01, 4.33, 5.00 [01:27:49] PROBLEM - gluster3 Current Load on gluster3 is CRITICAL: CRITICAL - load average: 9.53, 6.49, 5.45 [01:29:50] PROBLEM - gluster3 Current Load on gluster3 is WARNING: WARNING - load average: 4.98, 5.63, 5.25 [01:31:51] PROBLEM - gluster3 Current Load on gluster3 is CRITICAL: CRITICAL - load average: 6.72, 6.32, 5.57 [01:35:14] PROBLEM - studentwiki.ddns.net - reverse DNS on sslhost is WARNING: rDNS WARNING - reverse DNS entry for studentwiki.ddns.net could not be found [01:39:56] PROBLEM - gluster3 Current Load on gluster3 is WARNING: WARNING - load average: 3.95, 5.32, 5.50 [01:45:54] RECOVERY - gluster3 Current Load on gluster3 is OK: OK - load average: 2.51, 3.86, 4.87 [01:53:19] PROBLEM - mon2 Current Load on mon2 is WARNING: WARNING - load average: 3.51, 3.35, 3.14 [01:53:59] PROBLEM - gluster3 Current Load on gluster3 is CRITICAL: CRITICAL - load average: 6.20, 4.96, 4.90 [01:57:10] RECOVERY - mon2 Current Load on mon2 is OK: OK - load average: 2.61, 3.06, 3.08 [01:58:00] RECOVERY - gluster3 Current Load on gluster3 is OK: OK - load average: 4.12, 4.80, 4.88 [02:36:16] PROBLEM - gluster3 Current Load on gluster3 is CRITICAL: CRITICAL - load average: 7.50, 6.44, 5.25 [02:36:20] PROBLEM - cloud4 Current Load on cloud4 is CRITICAL: CRITICAL - load average: 24.08, 22.26, 18.42 [02:36:20] PROBLEM - mw9 Current Load on mw9 is WARNING: WARNING - load average: 7.05, 5.69, 4.21 [02:38:19] PROBLEM - cloud4 Current Load on cloud4 is WARNING: WARNING - load average: 20.72, 21.54, 18.62 [02:38:19] RECOVERY - mw9 Current Load on mw9 is OK: OK - load average: 4.03, 4.92, 4.09 [02:40:14] PROBLEM - gluster3 Current Load on gluster3 is WARNING: WARNING - load average: 4.96, 5.62, 5.18 [02:40:17] RECOVERY - cloud4 Current Load on cloud4 is OK: OK - load average: 18.82, 
20.32, 18.50 [02:44:13] RECOVERY - gluster3 Current Load on gluster3 is OK: OK - load average: 3.85, 4.65, 4.89 [03:16:20] SRE, 503 backend fetch failed on `onlyonewiki` [03:16:21] ```Error 503 Backend fetch failed, forwarded for , 127.0.0.1 [03:16:21] (Varnish XID 434474458) via cp15.miraheze.org at Wed, 13 Oct 2021 03:15:26 GMT.``` [03:37:05] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 14.32, 8.39, 5.20 [03:39:02] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 4.46, 6.75, 4.98 [03:40:15] A lot of users are reporting issues when accessing pages [03:40:46] moviepediawiki, onibuswiki and closinglogosgroupwiki [04:20:11] PROBLEM - lcn.zfc.id.lv - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Certificate 'lcn.zfc.id.lv' expires in 7 day(s) (Thu 21 Oct 2021 04:19:31 GMT +0000). [04:22:48] !sre At least 4 wikis have some pages where they receive internal errors while trying to edit [04:25:09] PROBLEM - wiki.landev.vn - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Certificate 'wiki.landev.vn' expires in 7 day(s) (Thu 21 Oct 2021 04:18:15 GMT +0000). [04:27:56] having the same issue at closinglogosgroup [04:32:05] hb1290: We apologize for the inconvenience. We hope to have this fixed soon, thank you for your patience. [05:04:34] PROBLEM - ping6 on dbbackup1 is WARNING: PING WARNING - Packet loss = 0%, RTA = 165.74 ms [05:05:40] Agent: I raised the task to UBN. I'm currently mobile though so no Graylog access to see the exact error. [05:06:07] https://phabricator.miraheze.org/T8160 [05:06:08] [url] ⚓ T8160 Fatal exception of type "TypeError" | phabricator.miraheze.org [05:06:16] Oh alright, thanks. That's unfortunate but I'm sure someone else will come soon to deal with this [05:06:37] RECOVERY - ping6 on dbbackup1 is OK: PING OK - Packet loss = 0%, RTA = 92.77 ms [05:09:58] RhinosF1: When possible do you mind trying to look at ^? I'll be unavailable for the rest of the night. [05:11:31] Reception123: mind looking into https://phabricator.miraheze.org/T8160 since you're here now? I'll be unavailable till tomorrow. It's been reported on around 5 wikis already. [05:11:32] [url] ⚓ T8160 Fatal exception of type "TypeError" | phabricator.miraheze.org [05:12:28] CosmicAlpha: yes I just saw that and was just logging into Graylog [05:12:35] Thanks! [05:12:41] Argument 1 passed to Wikimedia\IPSet::__construct() must be of the type array, string given, called in /srv/mediawiki/w/extensions/StopForumSpam/includes/DenyListManager.php on line 70 [05:12:50] https://www.irccloud.com/pastebin/pE9yn3Xi/ [05:14:00] ^ CosmicAlpha [05:15:23] I don't see any recent updates for StopForumSpam either [05:18:40] Reception123: yeah I just looked. I don't see anything either, to that or vendor (that would cause this) [05:18:52] yeah, I also checked vendor and that was 12 days ago [05:18:59] so I'm really unsure what could cause this if nothing was updated [05:19:19] test3 on 1.37 is also unaffected [05:20:21] Yeah no idea. [05:20:46] No changes in config that I saw either. [05:21:49] CosmicAlpha: hmm, I just enabled it on testwiki and there's no errors [05:22:43] What schema changes did RhinosF1 do today? It might be database related.
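For context on the fatal above: Wikimedia\IPSet::__construct() is declared to take an array of CIDR strings, so DenyListManager fatals the moment whatever it loaded comes back as anything else. A minimal standalone sketch of that failure mode (illustration only, not Miraheze or StopForumSpam code):

```php
<?php
// Standalone illustration of the TypeError reported above.
// Wikimedia\IPSet::__construct( array $ipSet ) requires an array of CIDR strings.
use Wikimedia\IPSet;

$good = new IPSet( [ '127.0.0.0/8', '192.0.2.0/24' ] ); // fine: array given

$loaded = '127.0.0.0/8';     // a value of the wrong shape, e.g. from a stale or
                             // incompatible cache entry
$bad = new IPSet( $loaded ); // TypeError: must be of the type array, string given
```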
[05:23:03] let's check SAL [05:23:33] https://www.irccloud.com/pastebin/zZLNygj8/ [05:24:10] though that should just be for test3 [05:24:24] only flaggedrevs stuff should've been global [05:24:55] Reception123: https://github.com/wikimedia/mediawiki-extensions-StopForumSpam/blob/REL1_36/includes/DenyListManager.php#L66-L67 which in turn loads some database stuff, which is used in L70 (doUpdate) so it might be those schema changes as that's the only change I can think of. But I don't know how really. [05:24:56] [url] mediawiki-extensions-StopForumSpam/DenyListManager.php at REL1_36 · wikimedia/mediawiki-extensions-StopForumSpam · GitHub | github.com [05:25:30] CosmicAlpha: yeah, though they only did happen on test3 according to SAL [05:26:40] PROBLEM - gluster3 Current Load on gluster3 is CRITICAL: CRITICAL - load average: 6.30, 5.76, 4.50 [05:26:53] Reception123: That's not the impression I got earlier though maybe I misunderstood. Try running https://github.com/wikimedia/mediawiki-extensions-StopForumSpam/blob/REL1_36/maintenance/updateDenyList.php on a single broken wiki, as I'm not really sure what that will do. It would be helpful if it were reproducible on a test wiki. [05:26:53] [url] mediawiki-extensions-StopForumSpam/updateDenyList.php at REL1_36 · wikimedia/mediawiki-extensions-StopForumSpam · GitHub | github.com [05:27:14] CosmicAlpha: ok, let's see. But yeah, too bad we can't reproduce on test3 [05:28:41] RECOVERY - gluster3 Current Load on gluster3 is OK: OK - load average: 3.24, 4.97, 4.39 [05:28:43] !log [reception@mwtask1] sudo -u www-data php /srv/mediawiki/w/extensions/StopForumSpam/maintenance/updateDenyList.php --wiki=moviiepediawiki (END - exit=2) [05:28:47] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [05:28:52] !log [reception@mwtask1] sudo -u www-data php /srv/mediawiki/w/extensions/StopForumSpam/maintenance/updateDenyList.php --wiki=moviepediawiki (END - exit=65280) [05:28:53] if I had to guess, SFS 1.36 and 1.37 caching formats don't support each other (and SFS does not set cache version to detect it) [05:28:54] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [05:29:22] CosmicAlpha: running that just gives me the same error [05:29:48] majavah: oh, but 1.37 is only active on test3 so how could it affect production wikis? [05:31:13] https://github.com/wikimedia/mediawiki-extensions-StopForumSpam/blob/REL1_36/includes/DenyListManager.php#L62 [05:31:14] [url] mediawiki-extensions-StopForumSpam/DenyListManager.php at REL1_36 · wikimedia/mediawiki-extensions-StopForumSpam · GitHub | github.com [05:31:28] yeah I figured it was likely schema-related [05:31:32] Yeah that was what I was just getting at. (What majavah said, though not exactly there yet) but it does seem different, so it makes sense. I do know for sure it's related to the 1.37 update on test3 though. [05:32:23] hmm - so what can we do to fix that then? [05:33:00] check RhinosF1's bash history and see what keystrokes he issued to change the db tables, and revert them? [05:33:19] well that's already in SAL [05:33:25] No idea tbh. dmehus: it's not DB related. I was wrong about that. [05:33:26] ah, true [05:33:33] CosmicAlpha, oh [05:34:08] majavah: do you have any idea about how we could fix it? Even as a temporary measure [05:34:13] It is 1.37 related though test3 is overriding the global cache set from 1.36 I think, and 1.37 and 1.36 cache is incompatible. [05:34:46] though how would it override it if there's two separate branches? 
[05:36:13] Reception123: manually purge the cache key via shell.php, and ensure SFS is not loaded in 1.37 [05:36:41] we might be able to fix it in the extension code too, but that'll take time [05:37:09] I see, how would I purge the cache key with shell.php, I've never done that before [05:37:19] I'll disable SFS on test3 now [05:37:30] Reception123: Because it's not branch specific. I'm not too knowledgeable when it comes to object caching, but I do believe it's because the object cache is set from test3, not production, because that's what it last expired on; it's not set from a production wiki. But yeah you'll have to manually purge the cache key in eval.php, we don't use shell.php as we don't have one of the dev dependencies installed to use that. [05:38:51] CosmicAlpha: I see, I'm looking at docs but can't find the command needed to purge the cache key [05:39:25] There are no docs for it. Let me see if I can find out what it needs to be. I can't think of it right now either. [05:40:33] $wan = \MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache(); $wan->delete( $wan->makeGlobalKey( 'sfs-denylist-set ' ) ); iirc, although I'm on mobile so can't confirm [05:40:34] Reception123: `MediaWiki\StopForumSpam\DenyListUpdate::purgeDenyListIPs()` possibly. [05:41:07] majavah: ^ shouldn't just running that function do it? [05:41:10] no space inside the quotes though, oops [05:41:48] maybe, but that requires the class loads fine, which is not a guarantee if the cache is broken [05:43:07] would I have to run that on a production wiki then, or would test3 be fine? [05:43:42] If not then yeah `$wanCache = MediaWikiServices::getInstance()->getMainWANObjectCache(); return $wanCache->delete( $wanCache->makeGlobalKey( 'sfs-denylist-set' ) );` should work. Which is just that function without the SFS dependency. [05:43:59] Reception123: Any server should work I think (including test3) [05:44:21] ^that command without return. [05:44:27] ok, let me try [05:45:13] CosmicAlpha: should it be \MediaWiki\MediaWikiServices? [05:45:15] And also \MediaWiki\MediaWikiServices [05:45:17] Yeah. [05:45:20] yeah, I thought so too [05:45:23] My bad. [05:45:55] CosmicAlpha: majavah that fixed it, thanks to both! :) [05:46:02] !log > $wanCache = \MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();$wanCache->delete($wanCache->makeGlobalKey( 'sfs-denylist-set' )); on test3 [05:46:05] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [05:46:20] No problem. Glad it's fixed! [05:48:40] I'm awake [05:49:34] Reception123: I closed the task. We can't do any more on our side unless upstream does. It's not a huge deal though, it will work when we make production 1.37. [05:50:00] Morning RhinosF1 [05:50:12] RhinosF1: TL;DR StopForumSpam on test3 messed with the object cache and made wikis with SFS on prod go down partially [05:50:36] Okay [05:50:58] dmehus: for the record: yes there are 2 production schema changes running but they add stuff [05:52:27] We need to raise that [05:52:41] It should use a cache version [05:52:50] It's an upgrade blocker [05:54:24] It'll work when production is 1.37 (almost certain of that). But it can be raised upstream I guess.
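For reference, the purge that resolved this (as run above via eval.php), together with a sketch of the cache-version idea suggested as the upstream fix. The 'v2' key component below is purely illustrative; the extension does not currently version this key:

```php
<?php
// The one-liner run above in eval.php: drop the shared deny-list key so the
// next request repopulates it from a wiki running the production branch.
$wanCache = \MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
$wanCache->delete( $wanCache->makeGlobalKey( 'sfs-denylist-set' ) );

// Sketch of the suggested upstream fix: bake a format version into the key so
// 1.36 and 1.37 never read each other's entries ('v2' is hypothetical here).
$key = $wanCache->makeGlobalKey( 'sfs-denylist-set', 'v2' );
```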
[05:58:21] Wont work during the update though [06:17:18] RhinosF1, ack, okay, yeah no problem [06:17:30] Reception123: ack [06:17:38] I'll take your word for it :P [06:17:58] dmehus: also no way they'd hit moviepedia wiki [06:18:10] They are taking forever [06:18:17] * RhinosF1 curses the image table [06:18:48] RhinosF1, are you talking about the SFS issue, or something else? [06:18:52] * dmehus is a bit confused here [06:19:02] dmehus: the database changes [06:19:33] yeah, but what's moviepediawiki got to do with it? [06:26:11] dmehus: that was one with the error [06:26:18] Reception123: can you run select * from performance_schema.metadata_locks where db = '1134819wiki'; [06:26:36] the schema change is stuck [06:26:54] dmehus: we've performed the change on 1 wiki in 8. hours [06:28:19] RhinosF1: which DB is that part of? [06:29:01] RhinosF1, oh yeah, makes sense Coco would have SNS enabled [06:29:11] his wiki had a lot of spambots [06:29:22] s/SNS/SMS [06:29:22] dmehus meant to say: RhinosF1, oh yeah, makes sense Coco would have SMS enabled [06:29:27] err [06:29:31] that's B.RC.3 :P - before ReCaptcha3! [06:29:35] s/SNS/SFS [06:29:35] dmehus meant to say: RhinosF1, oh yeah, makes sense Coco would have SFS enabled [06:29:49] R123, yeah [06:31:39] Reception123: c3 [06:32:05] db12 [06:33:11] is that supposed to be a table in 1134819wiki? because I don't see one [06:34:08] Reception123: no it's a metadata table [06:34:18] it's from mysql [06:34:37] ok, night [06:36:24] oh I see [06:36:26] dmehus: good night :) [06:36:50] RhinosF1: hmm ERROR 1146 (42S02): Table 'performance_schema.metadata_locks' doesn't exist [06:37:20] paladox: can you help? my schema change is stuck [06:38:14] yeah there's all kinds of tables in performance_schema but metadata_locks isn't one of them [06:38:23] Reception123: like what [06:38:39] https://www.irccloud.com/pastebin/obN3nc29/ [06:39:03] no idea [06:40:39] Reception123: https://phabricator.miraheze.org/T8163 [06:40:40] [url] ⚓ T8163 Unstick schema change | phabricator.miraheze.org [06:41:52] ack [06:55:52] Reception123: can you deploy https://github.com/miraheze/SpriteSheet/pull/8 [06:55:52] [url] Replace usages of `$wgUser` by Universal-Omega · Pull Request #8 · miraheze/SpriteSheet · GitHub | github.com [06:55:55] to 1.37 [06:57:01] Oh, I meant to do that today, but forgot. [06:57:36] I've got to go in 5 mins so can't as that won't be enough time :( (I'd also have to reclone MW) [06:58:07] [02SpriteSheet] 07Universal-Omega edited pull request 03#8: Replace usages of `$wgUser` - 13https://git.io/JKL64 [06:58:22] [02SpriteSheet] 07Universal-Omega edited pull request 03#8: Replace usages of `$wgUser` - 13https://git.io/JKL64 [07:07:25] PROBLEM - cloud4 Current Load on cloud4 is WARNING: WARNING - load average: 23.24, 20.35, 17.26 [07:07:27] PROBLEM - cloud5 Current Load on cloud5 is WARNING: WARNING - load average: 20.50, 17.29, 14.63 [07:09:23] RECOVERY - cloud4 Current Load on cloud4 is OK: OK - load average: 16.25, 18.86, 17.11 [07:09:27] RECOVERY - cloud5 Current Load on cloud5 is OK: OK - load average: 14.40, 16.33, 14.62 [07:21:36] PROBLEM - wiki.elgeis.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:29:42] !sre All requests are timing out. [07:31:17] To what wiki [07:31:30] Any, Phab included. [07:31:46] Ye [07:31:51] Includes icinga [07:31:56] Hang by [07:32:00] We're dead [07:32:05] Reception123: ^ [07:33:10] .op [07:33:10] Attempting to OP... 
[07:34:30] This outage makes no sense [07:34:38] But I got to get to college [07:34:39] RhinosF1: mobile only for the entire day I'm afraid [07:34:53] Reception123: can you get Owen to message John [07:35:02] I might be able to look in around 3 hours [07:35:19] I'm blaming OVH [07:35:25] RhinosF1: yeah I could DM him [07:35:30] Looks to be complete network loss [07:35:39] Well at least I know that it's not just my site. Whew. [07:35:49] What's down, exactly? [07:38:24] Where did everyone go? [07:38:47] Everything is down. Matomo, Mail, MediaWiki, NS*. [07:38:50] Is this related to the supposed "downtime?" [07:38:57] RhinosF1: back up for me [07:39:11] Able to access phab and Meta with no issues [07:39:17] I cant. [07:39:22] Meta still gives 503 [07:39:28] PROBLEM - thesimswiki.com - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - thesimswiki.com All nameservers failed to answer the query. [07:39:36] My sites are all good [07:39:42] And so does Icinga, Grafana, Webmail, Phabricator. [07:39:50] Apparently, it seems like a whole DNS server might have gone down if my windows troubleshooter is anything to go by. Several other sites not on miraheze I frequent are down as well. Seems like big names like google and youtube are fine though. [07:40:04] PROBLEM - housing.wiki - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - housing.wiki All nameservers failed to answer the query. [07:40:14] Might be an issue with our provider OVH [07:40:28] PROBLEM - wiki.elgeis.com - LetsEncrypt on sslhost is WARNING: WARNING - Certificate 'wiki.elgeis.com' expires in 14 day(s) (Thu 28 Oct 2021 03:49:16 GMT +0000). [07:40:29] my wiki works [07:40:44] Ah, my wiki is back up. Probably rolling outage that's just passing through. [07:40:54] Wonder if there was a weird storm surge somewhere. [07:41:12] PROBLEM - pj-masks-info.cf - LetsEncrypt on sslhost is WARNING: WARNING - Certificate 'pj-masks-info.cf' expires in 10 day(s) (Sat 23 Oct 2021 08:19:02 GMT +0000). [07:42:03] Reception123: I've never seen an outage like this before. We went completely and entirely down across all (at least web-accessible) services. Usually one or two services is all. Not everything. [07:42:33] PROBLEM - cp12 Stunnel Http for mon2 on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:42:36] PROBLEM - cp15 Stunnel Http for mw10 on cp15 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:42:42] PROBLEM - nocyclo.tk - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - nocyclo.tk All nameservers failed to answer the query. [07:42:44] PROBLEM - ping4 on cp15 is CRITICAL: PING CRITICAL - Packet loss = 100% [07:42:45] It's not DNS [07:42:50] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 4 datacenters are down: 51.222.25.132/cpweb, 167.114.2.161/cpweb, 2607:5300:205:200::1c30/cpweb, 2607:5300:201:3100::1d3/cpweb [07:42:51] PROBLEM - cp12 Varnish Backends on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:43:02] PROBLEM - cp12 Stunnel Http for mw11 on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:43:02] PROBLEM - Host cp15 is DOWN: PING CRITICAL - Packet loss = 100% [07:43:05] CosmicAlpha: it's unlikely us [07:43:09] PROBLEM - cp12 Stunnel Http for mw8 on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:43:17] PROBLEM - ping4 on cp12 is CRITICAL: PING CRITICAL - Packet loss = 100% [07:43:18] I'm assuming a provider incident [07:43:21] This is scary, needless to say. 
[07:43:26] PROBLEM - cp12 Disk Space on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:43:36] PROBLEM - cp12 Stunnel Http for mw13 on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:43:41] PROBLEM - cp12 NTP time on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:43:46] PROBLEM - cp12 Stunnel Http for mw12 on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:43:53] PROBLEM - incubator.nocyclo.tk - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - incubator.nocyclo.tk All nameservers failed to answer the query. [07:43:53] PROBLEM - cp12 Puppet on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:43:53] PROBLEM - ns2 GDNSD Datacenters on ns2 is CRITICAL: CRITICAL - 4 datacenters are down: 51.222.25.132/cpweb, 167.114.2.161/cpweb, 2607:5300:205:200::1c30/cpweb, 2607:5300:201:3100::1d3/cpweb [07:44:00] PROBLEM - cp12 SSH on cp12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:44:05] DEFCON 1 [07:44:05] PROBLEM - cp12 HTTP 4xx/5xx ERROR Rate on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:44:10] PROBLEM - cp12 Stunnel Http for mw10 on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:44:16] PROBLEM - test3 Puppet on test3 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 8 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_MediaWiki config],Exec[git_pull_landing],Exec[git_pull_ErrorPages] [07:44:19] PROBLEM - cp12 Stunnel Http for mw9 on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:44:24] PROBLEM - mw11 Puppet on mw11 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 7 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_JobRunner],Exec[git_pull_MediaWiki config],Exec[git_pull_landing],Exec[git_pull_ErrorPages] [07:44:25] PROBLEM - cp12 HTTPS on cp12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:44:41] PROBLEM - phab2 Puppet on phab2 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 11 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_phabricator-extensions] [07:44:50] PROBLEM - cp12 PowerDNS Recursor on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:44:50] PROBLEM - mw8 Puppet on mw8 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 9 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_JobRunner],Exec[git_pull_landing],Exec[git_pull_ErrorPages] [07:44:57] PROBLEM - Host cp12 is DOWN: PING CRITICAL - Packet loss = 100% [07:45:02] PROBLEM - mw13 Puppet on mw13 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Exec[git_pull_JobRunner],Exec[git_pull_landing],Exec[git_pull_ErrorPages] [07:45:23] PROBLEM - en.nocyclo.tk - reverse DNS on sslhost is WARNING: Traceback (most recent call last): File "/usr/lib/nagios/plugins/check_reverse_dns.py", line 148, in main() File "/usr/lib/nagios/plugins/check_reverse_dns.py", line 116, in main rdns_hostname = get_reverse_dnshostname(args.hostname) File "/usr/lib/nagios/plugins/check_reverse_dns.py", line 103, in get_reverse_dnshostname rev_host = str(resolver.query(ptr_record, "PTR")[0] [07:45:23] rip('.') File "/usr/lib/python3/dist-packages/dns/resolver.py", line 1102, in query lifetime) File "/usr/lib/python3/dist-packages/dns/resolver.py", line 992, in query timeout = self._compute_timeout(start, lifetime) File "/usr/lib/python3/dist-packages/dns/resolver.py", line 799, in _compute_timeout raise Timeout(timeout=duration)dns.exception.Timeout: The DNS operation timed out after 30.002227783203125 seconds [07:45:25] PROBLEM - es.nocyclo.tk - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:45:28] PROBLEM - dreamsit.com.br - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:45:29] PROBLEM - es.nocyclo.tk - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - es.nocyclo.tk All nameservers failed to answer the query. [07:45:31] PROBLEM - meta.nocyclo.tk - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - meta.nocyclo.tk All nameservers failed to answer the query. [07:45:37] PROBLEM - mw9 Puppet on mw9 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 10 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_JobRunner],Exec[git_pull_landing],Exec[git_pull_ErrorPages] [07:45:38] PROBLEM - incubator.nocyclo.tk - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:45:41] PROBLEM - housing.wiki - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:45:45] PROBLEM - en.nocyclo.tk - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:46:08] PROBLEM - mw10 Puppet on mw10 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 10 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_JobRunner],Exec[git_pull_landing],Exec[git_pull_ErrorPages] [07:46:09] Looks like the network [07:46:33] seems like OVH is having issues [07:46:35] PROBLEM - mw12 Puppet on mw12 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 9 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_JobRunner],Exec[git_pull_landing],Exec[git_pull_ErrorPages] [07:46:49] majavah: yep [07:48:07] majavah: any incident page? 
[07:48:15] Normal one is down [07:48:27] nothing I'm aware of [07:48:40] Fun [07:48:50] Reception123: can you get on the phone [07:49:46] searching "OVH" on twitter is what I was using to make that conclusion [07:49:58] Ye [07:50:44] Very impactful for a no-op [07:52:13] PROBLEM - incubator.nocyclo.tk - reverse DNS on sslhost is WARNING: Traceback (most recent call last): File "/usr/lib/nagios/plugins/check_reverse_dns.py", line 148, in main() File "/usr/lib/nagios/plugins/check_reverse_dns.py", line 116, in main rdns_hostname = get_reverse_dnshostname(args.hostname) File "/usr/lib/nagios/plugins/check_reverse_dns.py", line 103, in get_reverse_dnshostname rev_host = str(resolver.query(ptr_record, "P [07:52:13] 0]).rstrip('.') File "/usr/lib/python3/dist-packages/dns/resolver.py", line 1102, in query lifetime) File "/usr/lib/python3/dist-packages/dns/resolver.py", line 900, in query timeout = self._compute_timeout(start, lifetime) File "/usr/lib/python3/dist-packages/dns/resolver.py", line 799, in _compute_timeout raise Timeout(timeout=duration)dns.exception.Timeout: The DNS operation timed out after 30.002792596817017 seconds [07:53:29] PROBLEM - es.nocyclo.tk - reverse DNS on sslhost is WARNING: Traceback (most recent call last): File "/usr/lib/nagios/plugins/check_reverse_dns.py", line 148, in main() File "/usr/lib/nagios/plugins/check_reverse_dns.py", line 116, in main rdns_hostname = get_reverse_dnshostname(args.hostname) File "/usr/lib/nagios/plugins/check_reverse_dns.py", line 103, in get_reverse_dnshostname rev_host = str(resolver.query(ptr_record, "PTR")[0] [07:53:29] rip('.') File "/usr/lib/python3/dist-packages/dns/resolver.py", line 1102, in query lifetime) File "/usr/lib/python3/dist-packages/dns/resolver.py", line 992, in query timeout = self._compute_timeout(start, lifetime) File "/usr/lib/python3/dist-packages/dns/resolver.py", line 799, in _compute_timeout raise Timeout(timeout=duration)dns.exception.Timeout: The DNS operation timed out after 30.00310468673706 seconds [07:57:06] PROBLEM - thesimswiki.com - reverse DNS on sslhost is WARNING: Traceback (most recent call last): File "/usr/lib/nagios/plugins/check_reverse_dns.py", line 148, in main() File "/usr/lib/nagios/plugins/check_reverse_dns.py", line 116, in main rdns_hostname = get_reverse_dnshostname(args.hostname) File "/usr/lib/nagios/plugins/check_reverse_dns.py", line 103, in get_reverse_dnshostname rev_host = str(resolver.query(ptr_record, "PTR")[ [07:57:06] strip('.') File "/usr/lib/python3/dist-packages/dns/resolver.py", line 1102, in query lifetime) File "/usr/lib/python3/dist-packages/dns/resolver.py", line 900, in query timeout = self._compute_timeout(start, lifetime) File "/usr/lib/python3/dist-packages/dns/resolver.py", line 799, in _compute_timeout raise Timeout(timeout=duration)dns.exception.Timeout: The DNS operation timed out after 30.00238847732544 seconds [07:59:06] PROBLEM - dreamsit.com.br - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - dreamsit.com.br All nameservers failed to answer the query. [08:00:11] PROBLEM - incubator.nocyclo.tk - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - incubator.nocyclo.tk All nameservers failed to answer the query. [08:01:29] PROBLEM - es.nocyclo.tk - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - es.nocyclo.tk All nameservers failed to answer the query. [08:01:35] PROBLEM - en.nocyclo.tk - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - en.nocyclo.tk All nameservers failed to answer the query. 
[08:03:39] RECOVERY - test3 Puppet on test3 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [08:03:41] cp12/15 are still down [08:04:04] !log downtimed sslhost for 2 hours to stop flapping alerts [08:04:12] RECOVERY - phab2 Puppet on phab2 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:04:28] RECOVERY - mw8 Puppet on mw8 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:05:36] RECOVERY - mw9 Puppet on mw9 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:05:54] RECOVERY - mw10 Puppet on mw10 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:05:58] We're up but our Canadian front ends are still offline [08:06:08] Please be patient especially if you're in America [08:06:16] RECOVERY - mw12 Puppet on mw12 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:07:14] RECOVERY - mw11 Puppet on mw11 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:07:50] RECOVERY - mw13 Puppet on mw13 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:15:23] * RhinosF1 has to go to a singal blackspot so back later [08:21:05] RECOVERY - Host cp15 is UP: PING OK - Packet loss = 16%, RTA = 85.07 ms [08:21:14] RECOVERY - cp15 Stunnel Http for mw10 on cp15 is OK: HTTP OK: HTTP/1.1 200 OK - 15752 bytes in 0.326 second response time [08:21:26] RECOVERY - ping4 on cp15 is OK: PING OK - Packet loss = 0%, RTA = 79.39 ms [08:22:47] RECOVERY - Host cp12 is UP: PING OK - Packet loss = 0%, RTA = 82.46 ms [08:22:49] RECOVERY - cp12 HTTP 4xx/5xx ERROR Rate on cp12 is OK: OK - NGINX Error Rate is 3% [08:22:49] RECOVERY - cp12 HTTPS on cp12 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 2987 bytes in 0.311 second response time [08:22:49] RECOVERY - cp12 Puppet on cp12 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures [08:22:49] RECOVERY - cp12 SSH on cp12 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) [08:22:49] RECOVERY - cp12 Stunnel Http for mw10 on cp12 is OK: HTTP OK: HTTP/1.1 200 OK - 15744 bytes in 0.332 second response time [08:22:54] RECOVERY - cp12 Stunnel Http for mw9 on cp12 is OK: HTTP OK: HTTP/1.1 200 OK - 15757 bytes in 0.312 second response time [08:23:08] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online [08:23:09] RECOVERY - cp12 PowerDNS Recursor on cp12 is OK: DNS OK: 0.039 seconds response time. 
miraheze.org returns 167.114.2.161,2607:5300:201:3100::1d3 [08:23:17] RECOVERY - ns2 GDNSD Datacenters on ns2 is OK: OK - all datacenters are online [08:23:20] RECOVERY - cp12 Varnish Backends on cp12 is OK: All 9 backends are healthy [08:23:34] RECOVERY - cp12 Stunnel Http for mw8 on cp12 is OK: HTTP OK: HTTP/1.1 200 OK - 15743 bytes in 0.323 second response time [08:23:34] RECOVERY - cp12 Stunnel Http for mw11 on cp12 is OK: HTTP OK: HTTP/1.1 200 OK - 15744 bytes in 0.322 second response time [08:23:34] RECOVERY - cp12 Stunnel Http for mon2 on cp12 is OK: HTTP OK: HTTP/1.1 200 OK - 34599 bytes in 0.343 second response time [08:23:39] RECOVERY - cp12 Disk Space on cp12 is OK: DISK OK - free space: / 12819 MB (33% inode=96%); [08:23:44] RECOVERY - ping4 on cp12 is OK: PING OK - Packet loss = 0%, RTA = 82.30 ms [08:23:44] RECOVERY - cp12 NTP time on cp12 is OK: NTP OK: Offset -0.005567342043 secs [08:23:44] RECOVERY - cp12 Stunnel Http for mw12 on cp12 is OK: HTTP OK: HTTP/1.1 200 OK - 15744 bytes in 0.305 second response time [08:23:49] RECOVERY - cp12 Stunnel Http for mw13 on cp12 is OK: HTTP OK: HTTP/1.1 200 OK - 15744 bytes in 0.329 second response time [08:28:22] That's everything [08:53:46] JohnLewis: we seem back alive [08:54:22] Yeah, seems like schedules maintenance gone wrong [08:54:57] JohnLewis: yep [08:55:11] I haven't checked icinga but cp12/15 came back and that's all actually broke [08:55:20] I downtimed sslhost to stop it flapping [08:58:47] From before I had to do some work [08:58:48] From my last check [09:07:44] PROBLEM - gluster3 APT on gluster3 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [09:11:51] RECOVERY - gluster3 APT on gluster3 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [09:11:52] Everything looks good [09:12:05] We'll get a lot of alerts when the sslhost downtime expires [09:13:40] Network equipment failure in the US [09:20:05] what happened? [09:21:06] 10:13:40 Network equipment failure in the US [09:21:56] https://www.bleepingcomputer.com/news/technology/ovh-hosting-provider-goes-down-during-planned-maintenance/ [09:21:57] [url] OVH hosting provider goes down during planned maintenance | www.bleepingcomputer.com [09:36:05] PROBLEM - mw11 APT on mw11 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [09:38:03] RECOVERY - mw11 APT on mw11 is OK: APT OK: 17 packages available for upgrade (0 critical updates). [09:56:42] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 11.01, 8.43, 5.18 [09:58:40] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 4.22, 6.81, 4.97 [10:00:38] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 3.03, 5.42, 4.68 [10:02:00] PROBLEM - dreamsit.com.br - reverse DNS on sslhost is WARNING: SSL WARNING - rDNS OK but records conflict. {'NS': ['ns82.domaincontrol.com.', 'ns81.domaincontrol.com.'], 'CNAME': None} [10:02:00] RECOVERY - incubator.nocyclo.tk - LetsEncrypt on sslhost is OK: OK - Certificate 'incubator.nocyclo.tk' will expire on Wed 17 Nov 2021 04:06:46 GMT +0000. [10:02:00] RECOVERY - es.nocyclo.tk - LetsEncrypt on sslhost is OK: OK - Certificate 'es.nocyclo.tk' will expire on Mon 10 Jan 2022 05:03:48 GMT +0000. [10:02:00] RECOVERY - incubator.nocyclo.tk - reverse DNS on sslhost is OK: SSL OK - incubator.nocyclo.tk reverse DNS resolves to cp12.miraheze.org - CNAME FLAT [10:02:00] PROBLEM - thesimswiki.com - reverse DNS on sslhost is WARNING: SSL WARNING - rDNS OK but records conflict. 
{'NS': ['ns2.vultr.com.', 'ns1.vultr.com.'], 'CNAME': None} [10:02:01] PROBLEM - housing.wiki - reverse DNS on sslhost is WARNING: SSL WARNING - rDNS OK but records conflict. {'NS': ['ns2.dreamhost.com.', 'ns3.dreamhost.com.', 'ns1.dreamhost.com.'], 'CNAME': None} [10:02:01] RECOVERY - nocyclo.tk - reverse DNS on sslhost is OK: SSL OK - nocyclo.tk reverse DNS resolves to cp12.miraheze.org - CNAME FLAT [10:02:02] RECOVERY - en.nocyclo.tk - reverse DNS on sslhost is OK: SSL OK - en.nocyclo.tk reverse DNS resolves to cp12.miraheze.org - CNAME FLAT [10:02:02] RECOVERY - meta.nocyclo.tk - reverse DNS on sslhost is OK: SSL OK - meta.nocyclo.tk reverse DNS resolves to cp12.miraheze.org - CNAME FLAT [10:02:03] RECOVERY - en.nocyclo.tk - LetsEncrypt on sslhost is OK: OK - Certificate 'en.nocyclo.tk' will expire on Mon 10 Jan 2022 06:03:44 GMT +0000. [10:02:03] RECOVERY - housing.wiki - LetsEncrypt on sslhost is OK: OK - Certificate 'housing.wiki' will expire on Mon 10 Jan 2022 05:59:53 GMT +0000. [10:02:04] RECOVERY - dreamsit.com.br - LetsEncrypt on sslhost is OK: OK - Certificate 'dreamsit.com.br' will expire on Sat 06 Nov 2021 13:36:51 GMT +0000. [10:02:30] PROBLEM - wiki.minecraftathome.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Certificate 'wiki.minecraftathome.com' expires in 6 day(s) (Wed 20 Oct 2021 08:57:08 GMT +0000). [10:02:30] PROBLEM - lcn.zfc.id.lv - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Certificate 'lcn.zfc.id.lv' expires in 7 day(s) (Thu 21 Oct 2021 04:19:31 GMT +0000). [10:02:30] PROBLEM - wiki.cyberfurs.org - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Certificate 'wiki.cyberfurs.org' expires in 6 day(s) (Wed 20 Oct 2021 06:31:20 GMT +0000). [10:02:30] PROBLEM - biblestrength.net - LetsEncrypt on sslhost is WARNING: WARNING - Certificate 'biblestrength.net' expires in 10 day(s) (Sun 24 Oct 2021 04:16:04 GMT +0000). [10:02:30] PROBLEM - private.revi.wiki - Sectigo on sslhost is WARNING: WARNING - Certificate 'private.revi.wiki' expires in 24 day(s) (Sat 06 Nov 2021 23:59:59 GMT +0000). [10:02:30] PROBLEM - wiki.mikrodev.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Certificate 'wiki.mikrodev.com' expires in 6 day(s) (Wed 20 Oct 2021 05:16:53 GMT +0000). [10:02:30] PROBLEM - spiral.wiki - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Certificate 'spiral.wiki' expires in 6 day(s) (Wed 20 Oct 2021 01:40:39 GMT +0000). [10:03:43] PROBLEM - mw13 Current Load on mw13 is WARNING: WARNING - load average: 7.61, 5.39, 4.16 [10:04:42] RhinosF1: resolved your problem [10:05:06] PROBLEM - mon2 Current Load on mon2 is CRITICAL: CRITICAL - load average: 4.31, 5.29, 3.28 [10:05:16] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 15.49, 9.37, 6.41 [10:05:39] PROBLEM - mw13 Current Load on mw13 is CRITICAL: CRITICAL - load average: 8.03, 5.80, 4.43 [10:05:54] PROBLEM - mw8 Current Load on mw8 is WARNING: WARNING - load average: 6.17, 6.84, 5.14 [10:06:23] JohnLewis: did it survive the network issues then? 
[10:06:32] Yeah [10:06:44] It's running across every wiki in screen via foreachwiki [10:07:21] I'll have a look when I get home how far it's moved [10:07:43] RECOVERY - mw13 Current Load on mw13 is OK: OK - load average: 4.96, 5.37, 4.45 [10:07:53] RECOVERY - mw8 Current Load on mw8 is OK: OK - load average: 3.88, 5.84, 4.98 [10:08:14] PROBLEM - mw11 Current Load on mw11 is CRITICAL: CRITICAL - load average: 6.71, 9.04, 6.43 [10:09:24] RECOVERY - mon2 Current Load on mon2 is OK: OK - load average: 1.53, 3.26, 2.92 [10:12:41] PROBLEM - mw9 Current Load on mw9 is CRITICAL: CRITICAL - load average: 10.86, 10.23, 7.03 [10:12:48] PROBLEM - mw13 Current Load on mw13 is CRITICAL: CRITICAL - load average: 10.21, 9.06, 6.28 [10:13:44] PROBLEM - mw12 Current Load on mw12 is CRITICAL: CRITICAL - load average: 6.54, 9.78, 7.57 [10:15:02] Well at least that's resolved [10:15:05] PROBLEM - mw9 Current Load on mw9 is WARNING: WARNING - load average: 4.19, 7.62, 6.50 [10:15:14] PROBLEM - mw13 Current Load on mw13 is WARNING: WARNING - load average: 6.13, 7.48, 6.06 [10:15:39] Reception123: it seems quiet [10:15:55] PROBLEM - mw12 Current Load on mw12 is WARNING: WARNING - load average: 3.41, 7.53, 7.04 [10:16:58] RECOVERY - mw9 Current Load on mw9 is OK: OK - load average: 2.57, 6.08, 6.08 [10:17:11] RECOVERY - mw13 Current Load on mw13 is OK: OK - load average: 4.98, 6.47, 5.85 [10:17:23] PROBLEM - mw11 Current Load on mw11 is WARNING: WARNING - load average: 4.29, 7.65, 7.31 [10:17:56] RECOVERY - mw12 Current Load on mw12 is OK: OK - load average: 2.93, 5.94, 6.51 [10:18:28] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 2.94, 6.19, 6.97 [10:20:31] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 2.92, 5.09, 6.47 [10:21:19] RECOVERY - mw11 Current Load on mw11 is OK: OK - load average: 3.36, 5.33, 6.43 [10:27:37] !log [rhinos@mw11] sudo -u www-data /usr/local/bin/foreachwikiindblist /srv/mediawiki/cache/databases.json /srv/mediawiki/w/maintenance/sql.php /srv/mediawiki/config/137updates.sql (END - exit=0) [12:13:15] PROBLEM - mw8 APT on mw8 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [12:15:17] RECOVERY - mw8 APT on mw8 is OK: APT OK: 17 packages available for upgrade (0 critical updates). 
[12:19:46] That's not bad [12:19:59] Pre update Schema changes done [12:24:38] RhinosF1: https://github.com/wikimedia/mediawiki-extensions-RandomImage/commit/af0263de4757b17431fd2440fe02bf7c54580716 [12:24:39] [url] Revision::newFromTitle() is deprecated · wikimedia/mediawiki-extensions-RandomImage@af0263d · GitHub | github.com [12:24:49] You can easily migrate the mass* extension [12:26:08] paladox: yeah I have no issues to it being migrated, just noting it needs doing [12:26:17] And if it needs doing, add it to the list [12:26:32] Re that commit [12:36:55] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 7.38, 5.99, 4.37 [12:38:12] * RhinosF1 disagrees that 1 hour is a quick resolution time for a total network failure [12:38:52] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 4.34, 5.31, 4.32 [12:44:19] [02miraheze/puppet] 07JohnFLewis pushed 031 commit to 03master [+0/-4/±5] 13https://git.io/JKYhr [12:44:21] [02miraheze/puppet] 07JohnFLewis 036d6dcbc - dbbackup: remove all code [12:48:05] !log unmount all mount points to dbbackup1 [12:49:50] !log unmount all mount points to dbbackup1 [12:49:54] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [12:55:46] !log deleted dbbackup1 [12:55:50] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [12:56:15] [02miraheze/dns] 07JohnFLewis pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JKOfe [12:56:17] [02miraheze/dns] 07JohnFLewis 03f5c1121 - remove dbbackup1 [13:12:28] PROBLEM - mw10 APT on mw10 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [13:12:59] PROBLEM - cp15 Stunnel Http for mw10 on cp15 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [13:13:11] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 14.15, 9.97, 6.25 [13:14:27] RECOVERY - mw10 APT on mw10 is OK: APT OK: 17 packages available for upgrade (0 critical updates). [13:15:01] RECOVERY - cp15 Stunnel Http for mw10 on cp15 is OK: HTTP OK: HTTP/1.1 200 OK - 15752 bytes in 0.346 second response time [13:17:08] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 3.17, 6.43, 5.65 [13:20:27] [02miraheze/puppet] 07JohnFLewis pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JKOLL [13:20:29] [02miraheze/puppet] 07JohnFLewis 03139bf73 - mariadb: automate daily local disk space backups [13:37:57] PROBLEM - db12 Puppet on db12 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [13:51:26] PROBLEM - db11 Puppet on db11 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [13:53:26] PROBLEM - db13 Puppet on db13 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[14:04:58] PROBLEM - mw12 Current Load on mw12 is CRITICAL: CRITICAL - load average: 8.99, 7.27, 5.17 [14:06:54] PROBLEM - mw12 Current Load on mw12 is WARNING: WARNING - load average: 6.30, 7.00, 5.33 [14:07:31] PROBLEM - mw11 Current Load on mw11 is CRITICAL: CRITICAL - load average: 8.16, 7.34, 5.24 [14:08:52] RECOVERY - mw12 Current Load on mw12 is OK: OK - load average: 4.14, 5.83, 5.10 [14:09:27] RECOVERY - mw11 Current Load on mw11 is OK: OK - load average: 3.76, 5.97, 4.98 [14:11:33] PROBLEM - gluster3 Current Load on gluster3 is WARNING: WARNING - load average: 5.87, 5.23, 3.88 [14:13:31] RECOVERY - gluster3 Current Load on gluster3 is OK: OK - load average: 4.25, 4.79, 3.88 [14:38:05] PROBLEM - mon2 Current Load on mon2 is CRITICAL: CRITICAL - load average: 4.23, 3.36, 2.87 [14:41:57] RECOVERY - mon2 Current Load on mon2 is OK: OK - load average: 2.19, 2.88, 2.79 [14:52:54] JohnLewis: puppet failing on the db servers [15:06:39] [02miraheze/puppet] 07JohnFLewis pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JKOKp [15:06:40] [02miraheze/puppet] 07JohnFLewis 03a56004e - syntax fix [15:11:16] RECOVERY - db12 Puppet on db12 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:12:20] !log running db backups on db* [15:12:24] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:13:18] RECOVERY - db13 Puppet on db13 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:13:21] RECOVERY - db11 Puppet on db11 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:14] !log rhinos@test3:~$ sudo -u www-data php /srv/mediawiki/w/maintenance/sql.php --wiki=test3wiki /srv/mediawiki/w/extensions/CommentStreams/sql/mysql/cs_replies.sql [15:14:19] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:15:16] [02miraheze/mw-config] 07RhinosF1 pushed 031 commit to 03master [+1/-0/±0] 13https://git.io/JKOP8 [15:15:18] [02miraheze/mw-config] 07RhinosF1 035867ac8 - Create commentstreams_temp.sql [15:15:20] PROBLEM - db11 Current Load on db11 is CRITICAL: CRITICAL - load average: 16.58, 7.91, 4.05 [15:15:33] PROBLEM - db13 Current Load on db13 is CRITICAL: CRITICAL - load average: 11.05, 6.38, 3.20 [15:16:05] PROBLEM - db12 Current Load on db12 is CRITICAL: CRITICAL - load average: 8.31, 5.42, 2.99 [15:16:06] [02miraheze/mw-config] 07RhinosF1 pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JKOPg [15:16:07] [02miraheze/mw-config] 07RhinosF1 0343e8657 - Update ManageWikiExtensions.php [15:16:39] miraheze/mw-config - RhinosF1 the build passed. [15:17:22] miraheze/mw-config - RhinosF1 the build passed. 
[15:17:32] !log [rhinos@mw11] starting deploy of {'files': 'config/commentstreams_temp.sql'} to skip [15:17:33] !log [rhinos@mw11] finished deploy of {'files': 'config/commentstreams_temp.sql'} to skip - SUCCESS in 0s [15:17:37] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:17:43] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:18:05] RECOVERY - db12 Current Load on db12 is OK: OK - load average: 6.76, 5.75, 3.41 [15:18:08] !log [rhinos@mw11] starting deploy of {'files': 'config/commentstreams_temp.sql'} to all [15:18:12] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:18:18] !log [rhinos@mw11] finished deploy of {'files': 'config/commentstreams_temp.sql'} to all - SUCCESS in 10s [15:18:22] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:18:34] !log [rhinos@mw11] starting deploy of {'files': 'config/ManageWikiExtensions.php'} to all [15:18:39] !log [rhinos@mw11] finished deploy of {'files': 'config/ManageWikiExtensions.php'} to all - SUCCESS in 4s [15:18:39] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:18:45] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:19:45] !log create new commentstreams tables for 1.37 on all wikis [15:19:51] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:21:40] !log [rhinos@mw11] sudo -u www-data /usr/local/bin/foreachwikiindblist /home/rhinos/commentstreams.json /srv/mediawiki/w/maintenance/sql.php /srv/mediawiki/config/commentstreams_temp.sql (END - exit=0) [15:21:48] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:22:07] PROBLEM - db12 Current Load on db12 is CRITICAL: CRITICAL - load average: 8.61, 7.19, 4.54 [15:22:29] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 7.67, 7.25, 5.27 [15:24:29] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 5.32, 6.60, 5.28 [15:28:04] PROBLEM - db12 Current Load on db12 is WARNING: WARNING - load average: 6.01, 7.20, 5.45 [15:29:04] PROBLEM - cloud3 Current Load on cloud3 is WARNING: WARNING - load average: 13.81, 13.14, 10.27 [15:30:05] RECOVERY - db12 Current Load on db12 is OK: OK - load average: 5.70, 6.75, 5.50 [15:33:12] !log [@test3] starting deploy of {'config': True} to skip [15:33:13] !log [@test3] finished deploy of {'config': True} to skip - SUCCESS in 0s [15:33:18] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:33:24] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:35:37] PROBLEM - db13 Current Load on db13 is WARNING: WARNING - load average: 6.12, 7.44, 7.13 [15:35:37] RhinosF1: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MassEditRegex/+/730566 [15:38:05] PROBLEM - db12 Current Load on db12 is WARNING: WARNING - load average: 7.13, 7.19, 6.22 [15:38:40] seen [15:39:00] RECOVERY - cloud3 Current Load on cloud3 is OK: OK - load average: 11.79, 13.03, 11.74 [15:39:13] RhinosF1: could you get someone to review/merge it pls :) [15:40:35] paladox: currently filing tasks for all blockers [15:40:41] ok, thanks [15:42:04] PROBLEM - db12 Current Load on db12 is CRITICAL: CRITICAL - load average: 8.32, 7.64, 6.62 [15:46:03] PROBLEM - db12 Current Load on db12 is WARNING: WARNING - load average: 7.62, 7.76, 6.92 [15:47:32] majavah: could you be amazing with +2? 
[15:48:06] I'll look after dinner if you add myself as a reviewer in gerrit [15:49:36] PROBLEM - db13 Current Load on db13 is CRITICAL: CRITICAL - load average: 8.37, 7.66, 7.21 [15:50:23] phan failing with [15:50:25] https://www.irccloud.com/pastebin/cDT5F9tB/ [15:50:33] https://integration.wikimedia.org/ci/job/mwext-php72-phan-docker/142572/console [15:50:33] [url] mwext-php72-phan-docker #142572 Console [Jenkins] | integration.wikimedia.org [15:51:44] PROBLEM - mon2 Current Load on mon2 is WARNING: WARNING - load average: 3.17, 3.51, 3.15 [15:53:34] PROBLEM - db13 Current Load on db13 is WARNING: WARNING - load average: 7.40, 7.91, 7.44 [15:53:40] RECOVERY - mon2 Current Load on mon2 is OK: OK - load average: 3.06, 3.25, 3.08 [15:57:12] ok, fixed [15:57:32] PROBLEM - db13 Current Load on db13 is CRITICAL: CRITICAL - load average: 8.29, 7.90, 7.51 [15:58:04] RECOVERY - db12 Current Load on db12 is OK: OK - load average: 4.90, 6.31, 6.67 [16:02:19] majavah: also did https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DisqusTag/+/699509 [16:04:33] paladox: I guess should block [16:04:49] no that shouldn't be a blocker [16:04:57] we don't have that change deployed [16:05:09] but that should have been done a while ago [16:05:35] PROBLEM - db13 Current Load on db13 is WARNING: WARNING - load average: 4.63, 6.90, 7.32 [16:06:15] paladox: so that doesn't impact 1.37 differently to 1.36? [16:06:29] yup as far as i'm aware [16:06:42] Ok [16:06:51] I'll do backports anyway [16:06:58] So it'll need watching [16:07:32] RECOVERY - db13 Current Load on db13 is OK: OK - load average: 1.89, 5.10, 6.61 [16:08:46] For the record if you do see any, any new warning or error in 1.37 should block deployment [16:09:12] New warnings might not be carried forward but we still should try to report them [16:09:24] Errors/Exceptions are more important [16:13:22] PROBLEM - db11 Current Load on db11 is WARNING: WARNING - load average: 2.76, 4.36, 7.93 [16:16:52] https://www.irccloud.com/pastebin/lm8coy4V/ [16:16:53] hmm [16:17:20] RECOVERY - db11 Current Load on db11 is OK: OK - load average: 2.91, 3.55, 6.78 [16:19:23] paladox: what doing [16:19:35] There's no req id [16:19:41] i was trying to generate some sql using a maintenance script [16:19:54] `sudo -u www-data php maintenance/generateSchemaSql.php --wiki test3wiki` [16:20:04] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/598104 [16:20:25] paladox: try from prod [16:20:39] If it's broken in 1.37 then block it [16:20:51] i don't need to i just randomly guessed how the sql would look like [16:20:58] basically just the same as below column [16:21:14] PROBLEM - db12 Disk Space on db12 is WARNING: DISK WARNING - free space: / 46804 MB (10% inode=98%); [16:21:29] paladox: it's a broken core maint script [16:21:37] think it's because that scripts requires a composer package which is in the dev column of mediawiki composer file [16:21:39] It definately needs reporting [16:21:43] we only install non-dev files [16:21:44] Oh ok [16:21:51] https://www.irccloud.com/pastebin/6t6bRNMP/ [16:21:54] Yeah we don't do dev [16:22:09] Will be [16:22:49] If you need it, just do dev on test3 and clean up after [16:24:02] ok, i don't need to do that now :) [16:33:05] majavah: addressed [16:34:00] paladox: you bumped it to 26, not 36 [16:34:46] Oh, thanks [16:34:48] fixed now majavah [16:37:04] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 6.89, 5.86, 4.73 [16:38:03] paladox: did you test the MassEditRegex change locally? 
I'm getting "Edit failed: Main Page does not exist" and I'm not sure if it's just me not knowing how to use it [16:38:16] Nope [16:38:47] ""In (at least) MediaWiki 1.31 and later, when the regex you provide is invalid, it will falsely indicate that all of the pages you selected for replacement are not found."" [16:38:49] ah [16:39:04] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 4.18, 5.17, 4.62 [16:41:09] merged, I'm getting deprecation warnings about "Deprecated: Use of WikiPage::doEditContent was deprecated in MediaWiki 1.32" but I guess they can be dealt with separately [16:41:15] thanks! [16:41:17] oh [16:41:25] and also could you review/merge https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DisqusTag/+/699509 pls majavah [16:42:09] yeah, I'll deal with the MassEditRegex REL1_37 backport first [16:42:20] majavah: could you review https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MassEditRegex/+/730569 please so i can backport the fix to 37 [16:42:26] (as there's merge conflicts) [16:44:31] PROBLEM - mw12 Current Load on mw12 is CRITICAL: CRITICAL - load average: 6.35, 8.01, 5.84 [16:45:08] PROBLEM - mw11 Current Load on mw11 is WARNING: WARNING - load average: 5.21, 7.55, 5.78 [16:45:27] PROBLEM - mw9 Current Load on mw9 is WARNING: WARNING - load average: 4.29, 7.75, 5.92 [16:45:33] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 3.74, 6.96, 5.86 [16:46:29] RECOVERY - mw12 Current Load on mw12 is OK: OK - load average: 3.31, 6.29, 5.47 [16:47:07] RECOVERY - mw11 Current Load on mw11 is OK: OK - load average: 4.29, 6.39, 5.55 [16:47:25] RECOVERY - mw9 Current Load on mw9 is OK: OK - load average: 2.80, 6.21, 5.58 [16:47:32] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 3.37, 5.67, 5.51 [16:48:20] majavah: backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MassEditRegex/+/730571 :) [16:52:55] majavah: addressed [16:54:31] majavah: backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DisqusTag/+/730572 :) [16:55:01] I like waiting for the commit to master to merge first, just in case [16:55:25] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_37 [+0/-0/±1] 13https://git.io/JK3ZF [16:55:27] [02miraheze/mediawiki] 07paladox 035461822 - Update MassEditRegex [16:55:36] oh ok [16:56:10] !log [paladox@test3] starting deploy of {'world': True, 'gitinfo': True} to skip [16:56:12] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [16:56:13] RhinosF1: ^ [16:56:44] mostly because in case there is a CI flap or something, it ensures it everything backported to a release branch actually makes to master and does not regress in the next release [16:58:15] paladox: moved to former blockers [17:00:06] !log [paladox@test3] finished deploy of {'world': True, 'gitinfo': True} to skip - SUCCESS in 236s [17:00:10] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [17:01:54] PROBLEM - db11 Current Load on db11 is CRITICAL: CRITICAL - load average: 11.66, 6.45, 4.07 [17:03:02] turns out we no longer have DisqusTag installed [17:03:09] which i didn't realise we uninstalled [17:03:33] No we got rid of it with CSP changes [17:03:38] PROBLEM - db13 Current Load on db13 is WARNING: WARNING - load average: 7.90, 5.90, 3.53 [17:03:45] now the Mass* extension fails with [17:03:45] Explicit transaction still active. A caller may have caught an error. 
Open transactions: PortableInfobox\Helpers\PagePropsProxy::set [17:03:56] https://www.irccloud.com/pastebin/KuBqQaZY/ [17:04:59] That's CosmicAlpha [17:05:12] I've already filed a bug for 1 issue [17:05:30] CosmicAlpha: ^ [17:05:34] PROBLEM - db13 Current Load on db13 is CRITICAL: CRITICAL - load average: 10.25, 7.17, 4.26 [17:05:45] https://graylog.miraheze.org/messages/graylog_332/20059661-2c47-11ec-a12b-0200001a24a4 [17:05:51] i have no idea how to fix [17:07:45] paladox: that's only on 1.37 right, not 1.36? [17:07:59] i'm not sure if it affects 1.36 [17:08:41] paladox: OK, I mean you didn't notice it to? If not then it's not an absolute priority though I'm looking now, and I do have an idea. [17:08:58] I haven't used MassEditRegex at least if i have not for quite a while [17:19:28] PROBLEM - cloud3 Current Load on cloud3 is WARNING: WARNING - load average: 13.63, 13.40, 11.69 [17:20:21] PROBLEM - db12 Current Load on db12 is WARNING: WARNING - load average: 7.86, 7.59, 6.24 [17:21:25] RECOVERY - cloud3 Current Load on cloud3 is OK: OK - load average: 12.81, 12.92, 11.69 [17:21:34] PROBLEM - db13 Current Load on db13 is WARNING: WARNING - load average: 7.37, 7.93, 6.97 [17:23:25] [02miraheze/puppet] 07JohnFLewis pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JK3RF [17:23:27] [02miraheze/puppet] 07JohnFLewis 03fb79bea - dbbackups: drop to 1 thread + offset by 6 hours [17:26:17] RECOVERY - db12 Current Load on db12 is OK: OK - load average: 2.38, 5.64, 5.92 [17:29:34] PROBLEM - db13 Current Load on db13 is CRITICAL: CRITICAL - load average: 8.02, 7.48, 7.12 [17:31:32] PROBLEM - db13 Current Load on db13 is WARNING: WARNING - load average: 5.07, 6.75, 6.90 [17:33:33] RECOVERY - db13 Current Load on db13 is OK: OK - load average: 4.77, 6.17, 6.68 [17:48:00] PROBLEM - db11 Current Load on db11 is WARNING: WARNING - load average: 5.75, 7.06, 7.95 [17:55:56] PROBLEM - db11 Current Load on db11 is CRITICAL: CRITICAL - load average: 8.25, 7.85, 7.91 [17:57:54] JohnLewis: db load warnings are new [17:57:56] PROBLEM - db11 Current Load on db11 is WARNING: WARNING - load average: 7.29, 7.66, 7.84 [17:59:55] PROBLEM - db11 Current Load on db11 is CRITICAL: CRITICAL - load average: 9.35, 8.12, 7.97 [18:04:31] RhinosF1, paladox: https://github.com/Universal-Omega/PortableInfobox/pull/34 might resolve both the issues. Untested, but I'm hoping it does. 
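For context on the "Explicit transaction still active" fatal quoted above: MediaWiki's rdbms layer raises it when an atomic section opened with startAtomic() is never closed, which is what happens if an exception escapes before endAtomic() runs. A minimal sketch of the usual guard, assuming a cancelable section; this is not PortableInfobox's actual code and the write inside the section is a placeholder:

```php
<?php
// Hedged sketch of keeping an atomic section balanced when the body can throw.
// Not PortableInfobox's actual code; the page_props write is a placeholder.
use MediaWiki\MediaWikiServices;
use Wikimedia\Rdbms\IDatabase;

$dbw = MediaWikiServices::getInstance()
	->getDBLoadBalancer()
	->getConnection( DB_PRIMARY );

$dbw->startAtomic( __METHOD__, IDatabase::ATOMIC_CANCELABLE );
try {
	// ... the actual work, e.g. an upsert into page_props ...
	$dbw->endAtomic( __METHOD__ );
} catch ( Throwable $e ) {
	// Roll back only this section so the connection is not left with an
	// open transaction ("Explicit transaction still active ...").
	$dbw->cancelAtomic( __METHOD__ );
	throw $e;
}
```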
[18:04:31] [url] Fix `DBUnexpectedError` exception by Universal-Omega · Pull Request #34 · Universal-Omega/PortableInfobox · GitHub | github.com [18:05:36] CosmicAlpha: thank you [18:06:45] No problem [18:08:23] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 8.51, 8.90, 5.86 [18:10:24] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 5.69, 7.57, 5.73 [18:12:22] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 4.36, 6.79, 5.70 [18:22:22] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 7.79, 6.85, 5.88 [18:24:24] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 8.38, 7.17, 6.10 [18:26:23] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 4.92, 6.19, 5.86 [18:31:56] [02miraheze/WikiDiscover] 07Universal-Omega pushed 031 commit to 03Universal-Omega-patch-1 [+0/-0/±1] 13https://git.io/JK3PG [18:31:58] [02miraheze/WikiDiscover] 07Universal-Omega 0359cc194 - API: Replace deprecated constants [18:31:59] [02WikiDiscover] 07Universal-Omega created branch 03Universal-Omega-patch-1 - 13https://git.io/vhUAp [18:32:01] [02WikiDiscover] 07Universal-Omega opened pull request 03#71: API: Replace deprecated constants - 13https://git.io/JK3Pn [18:33:43] [02miraheze/WikiDiscover] 07Universal-Omega pushed 031 commit to 03Universal-Omega-patch-1 [+0/-0/±1] 13https://git.io/JK3PV [18:33:45] [02miraheze/WikiDiscover] 07Universal-Omega 03e49a2d4 - Update dependencies [18:33:46] [02WikiDiscover] 07Universal-Omega synchronize pull request 03#71: API: Replace deprecated constants - 13https://git.io/JK3Pn [18:33:53] miraheze/WikiDiscover - Universal-Omega the build has errored. [18:36:28] miraheze/WikiDiscover - Universal-Omega the build has errored. [19:00:18] [02miraheze/WikiDiscover] 07Universal-Omega pushed 031 commit to 03Universal-Omega-patch-1 [+0/-0/±1] 13https://git.io/JK3MQ [19:00:20] [02miraheze/WikiDiscover] 07Universal-Omega 039b05402 - Update ApiWikiDiscover.php [19:00:21] [02WikiDiscover] 07Universal-Omega synchronize pull request 03#71: API: Replace deprecated constants - 13https://git.io/JK3Pn [19:06:35] miraheze/WikiDiscover - Universal-Omega the build passed. 
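The "API: Replace deprecated constants" pull requests above (WikiDiscover here, ManageWiki later) most likely refer to the ApiBase::PARAM_* constants that MediaWiki 1.35 deprecated in favour of ParamValidator::PARAM_*. A hedged before/after sketch of that migration, not the actual WikiDiscover diff:

```php
<?php
// Hedged sketch of the typical migration, not the actual WikiDiscover change:
// the ApiBase::PARAM_* constants were deprecated in MediaWiki 1.35 in favour
// of the equivalents on Wikimedia\ParamValidator\ParamValidator.
use Wikimedia\ParamValidator\ParamValidator;

class ApiExample extends ApiBase {
	public function execute() {
		// ... module body omitted ...
	}

	protected function getAllowedParams() {
		return [
			'limit' => [
				// Before: ApiBase::PARAM_TYPE / ApiBase::PARAM_DFLT
				ParamValidator::PARAM_TYPE => 'integer',
				ParamValidator::PARAM_DEFAULT => 10,
			],
			'state' => [
				ParamValidator::PARAM_TYPE => [ 'open', 'closed' ],
				ParamValidator::PARAM_ISMULTI => true,
			],
		];
	}
}
```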
[19:18:05] PROBLEM - db11 Current Load on db11 is WARNING: WARNING - load average: 7.10, 7.43, 7.96 [19:36:03] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 12.71, 6.92, 4.79 [19:37:42] PROBLEM - mw12 Current Load on mw12 is CRITICAL: CRITICAL - load average: 8.54, 6.47, 4.64 [19:38:01] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 4.93, 6.19, 4.80 [19:39:38] RECOVERY - mw12 Current Load on mw12 is OK: OK - load average: 5.00, 5.69, 4.56 [19:42:05] RECOVERY - db11 Current Load on db11 is OK: OK - load average: 4.55, 5.91, 6.62 [19:48:11] [02WikiDiscover] 07Universal-Omega closed pull request 03#71: API: Replace deprecated constants - 13https://git.io/JK3Pn [19:48:12] [02miraheze/WikiDiscover] 07Universal-Omega pushed 031 commit to 03master [+0/-0/±2] 13https://git.io/JK37U [19:48:14] [02miraheze/WikiDiscover] 07Universal-Omega 03749ead5 - API: Replace deprecated constants (#71) [19:48:15] [02miraheze/WikiDiscover] 07Universal-Omega deleted branch 03Universal-Omega-patch-1 [19:48:17] [02WikiDiscover] 07Universal-Omega deleted branch 03Universal-Omega-patch-1 - 13https://git.io/vhUAp [19:51:50] PROBLEM - mon2 Current Load on mon2 is CRITICAL: CRITICAL - load average: 4.44, 3.62, 3.08 [19:53:47] PROBLEM - mon2 Current Load on mon2 is WARNING: WARNING - load average: 3.42, 3.41, 3.06 [19:54:44] miraheze/WikiDiscover - Universal-Omega the build passed. [19:55:42] RECOVERY - mon2 Current Load on mon2 is OK: OK - load average: 2.01, 2.98, 2.95 [20:02:22] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_37 [+0/-0/±1] 13https://git.io/JK35C [20:02:23] [02miraheze/mediawiki] 07paladox 03b663623 - Update PortableInfobox [20:02:51] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_37 [+0/-0/±1] 13https://git.io/JK35l [20:02:52] [02miraheze/mediawiki] 07paladox 0331f816d - Update WikiDiscover [20:03:28] !log [paladox@test3] starting deploy of {'world': True, 'gitinfo': True} to skip [20:03:32] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [20:03:56] Thanks paladox [20:04:01] yw [20:06:24] !log [paladox@test3] finished deploy of {'world': True, 'gitinfo': True} to skip - SUCCESS in 175s [20:06:27] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [20:07:02] paladox: I hope that the PortableInfobox patch fixes your issue with MassEditRegex, though I'm not certain. I do think it might, since it fixes the fatal exception that is returned, preventing `endAtomic()` from being ran. Though I can't guarantee it. If not I don't know what would. [20:07:09] CosmicAlpha: can you test & move to the done list? [20:07:09] it works! [20:07:22] Just try add a file to a page [20:08:41] Oh great! Glad that's fixed then. Yeah with 1.37 it changed error handling to actual error on that issue. Though technically this has probably been broken in PortableInfobox forever, though wouldn't have fataled or interfered with MassEditRegex. 
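One loose end from the MassEditRegex review earlier: the "Use of WikiPage::doEditContent was deprecated in MediaWiki 1.32" warnings. The usual replacement is the PageUpdater API; below is a minimal sketch under that assumption, not the extension's actual code, with $wikiPage, $user and $newText assumed to come from the surrounding edit loop:

```php
<?php
// Hedged sketch of migrating off WikiPage::doEditContent to PageUpdater.
// Not MassEditRegex's actual code; $wikiPage, $user and $newText are assumed
// to be provided by the surrounding edit loop.
use MediaWiki\Revision\SlotRecord;

$content = ContentHandler::makeContent( $newText, $wikiPage->getTitle() );

$updater = $wikiPage->newPageUpdater( $user );
$updater->setContent( SlotRecord::MAIN, $content );
$updater->saveRevision(
	CommentStoreComment::newUnsavedComment( 'Regex search and replace' ),
	EDIT_UPDATE
);

if ( !$updater->wasSuccessful() ) {
	// Surface the failure instead of silently continuing with the next page.
	wfDebugLog( 'MassEditRegex', $updater->getStatus()->getWikiText() );
}
```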
[20:09:40] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_37 [+0/-0/±1] 13https://git.io/JK35H [20:09:42] [02miraheze/mediawiki] 07paladox 038aaee2c - Update SnapProjectEmbed [20:13:14] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_37 [+0/-0/±1] 13https://git.io/JK35x [20:13:15] [02miraheze/mediawiki] 07paladox 03645cbaa - Update SnapProjectEmbed [20:14:49] Marked as fixed [20:15:01] CosmicAlpha: can we get SpriteSheet deployed [20:17:29] [02SpriteSheet] 07Universal-Omega closed pull request 03#8: Replace usages of `$wgUser` - 13https://git.io/JKL64 [20:17:30] [02miraheze/SpriteSheet] 07Universal-Omega pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JK3dG [20:17:32] [02miraheze/SpriteSheet] 07Universal-Omega 03d62639c - Replace usages of `$wgUser` (#8) [20:17:33] [02miraheze/SpriteSheet] 07Universal-Omega deleted branch 03Universal-Omega-patch-1 [20:17:35] [02SpriteSheet] 07Universal-Omega deleted branch 03Universal-Omega-patch-1 - 13https://git.io/JTPvX [20:18:22] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_37 [+0/-0/±1] 13https://git.io/JK3dl [20:18:23] [02miraheze/mediawiki] 07paladox 037f52203 - Update SpriteSheet [20:18:29] miraheze/SpriteSheet - Universal-Omega the build passed. [20:18:40] Oh thanks again paladox. [20:18:48] !log [paladox@test3] starting deploy of {'world': True, 'gitinfo': True} to skip [20:18:52] yw [20:18:52] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [20:19:12] RhinosF1: ^ [20:19:17] CosmicAlpha: testing [20:19:21] paladox: tyvm [20:21:34] !log [paladox@test3] finished deploy of {'world': True, 'gitinfo': True} to skip - SUCCESS in 166s [20:21:37] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [20:22:09] CosmicAlpha: https://test3.miraheze.org/wiki/File:GlobalNewFiles_test8.png works [20:22:10] [url] File:GlobalNewFiles test8.png - Test3 | test3.miraheze.org [20:24:06] :) [20:24:11] 2 blockers fixed [20:24:29] 3 to us [20:24:30] [02miraheze/ManageWiki] 07Universal-Omega pushed 031 commit to 03Universal-Omega-patch-1 [+0/-0/±1] 13https://git.io/JK3db [20:24:31] [02miraheze/ManageWiki] 07Universal-Omega 03fc2651e - API: Replace deprecated constants [20:24:33] [02ManageWiki] 07Universal-Omega created branch 03Universal-Omega-patch-1 - 13https://git.io/vpSns [20:24:34] [02ManageWiki] 07Universal-Omega opened pull request 03#304: API: Replace deprecated constants - 13https://git.io/JK3dN [20:24:39] 6 so far left [20:25:31] Great! [20:25:55] https://graylog.miraheze.org/messages/graylog_332/e0cc5f30-2c62-11ec-a12b-0200001a24a4 [20:25:56] hmm [20:26:17] oh i guess that was before the deploy finished [20:30:42] miraheze/ManageWiki - Universal-Omega the build passed. 
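The SpriteSheet pull request above, "Replace usages of `$wgUser`", is the standard cleanup for the global that MediaWiki 1.35 deprecated. A hedged sketch of the pattern, not the actual SpriteSheet diff:

```php
<?php
// Hedged sketch of dropping the deprecated $wgUser global; not the actual
// SpriteSheet change.

// Before:
// global $wgUser;
// $performer = $wgUser;

// After, when the code has no context object of its own:
$performer = RequestContext::getMain()->getUser();

// Inside anything that extends ContextSource (special pages, API modules,
// handlers with an injected context), prefer the context's user instead:
// $performer = $this->getUser();
```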
[20:32:18] [02ManageWiki] 07Universal-Omega closed pull request 03#304: API: Replace deprecated constants - 13https://git.io/JK3dN [20:32:20] [02miraheze/ManageWiki] 07Universal-Omega pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JK3Fl [20:32:21] [02miraheze/ManageWiki] 07Universal-Omega 0356026ad - API: Replace deprecated constants (#304) [20:32:23] [02ManageWiki] 07Universal-Omega deleted branch 03Universal-Omega-patch-1 - 13https://git.io/vpSns [20:32:24] [02miraheze/ManageWiki] 07Universal-Omega deleted branch 03Universal-Omega-patch-1 [20:35:59] PROBLEM - mw12 Current Load on mw12 is CRITICAL: CRITICAL - load average: 8.02, 6.88, 5.01 [20:36:55] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 14.85, 8.58, 5.91 [20:37:03] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_37 [+0/-0/±1] 13https://git.io/JK3F6 [20:37:04] PROBLEM - mw11 Current Load on mw11 is CRITICAL: CRITICAL - load average: 14.80, 8.14, 5.63 [20:37:05] [02miraheze/mediawiki] 07paladox 03032c42b - Update ManageWiki [20:38:03] RECOVERY - mw12 Current Load on mw12 is OK: OK - load average: 3.70, 5.58, 4.77 [20:39:02] PROBLEM - mw11 Current Load on mw11 is WARNING: WARNING - load average: 6.87, 7.98, 5.91 [20:39:06] miraheze/ManageWiki - Universal-Omega the build passed. [20:40:56] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 4.34, 7.07, 6.00 [20:40:59] RECOVERY - mw11 Current Load on mw11 is OK: OK - load average: 4.61, 6.71, 5.70 [20:42:56] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 3.54, 5.89, 5.70 [20:57:42] PROBLEM - mon2 Current Load on mon2 is WARNING: WARNING - load average: 3.00, 3.46, 3.27 [20:59:38] RECOVERY - mon2 Current Load on mon2 is OK: OK - load average: 2.91, 3.35, 3.25 [21:03:28] PROBLEM - mon2 Current Load on mon2 is WARNING: WARNING - load average: 3.46, 3.36, 3.26 [21:04:45] PROBLEM - mw8 Current Load on mw8 is CRITICAL: CRITICAL - load average: 8.47, 6.34, 4.67 [21:05:23] RECOVERY - mon2 Current Load on mon2 is OK: OK - load average: 2.03, 2.88, 3.10 [21:05:45] PROBLEM - mw12 Current Load on mw12 is CRITICAL: CRITICAL - load average: 14.91, 8.56, 5.89 [21:06:12] PROBLEM - cloud4 Current Load on cloud4 is WARNING: WARNING - load average: 23.77, 19.73, 16.03 [21:06:50] RECOVERY - mw8 Current Load on mw8 is OK: OK - load average: 5.37, 5.90, 4.72 [21:06:59] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 11.36, 8.61, 6.32 [21:07:46] PROBLEM - mw12 APT on mw12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [21:09:42] RECOVERY - mw12 APT on mw12 is OK: APT OK: 17 packages available for upgrade (0 critical updates). 
[21:12:16] RECOVERY - cloud4 Current Load on cloud4 is OK: OK - load average: 11.28, 17.22, 16.43 [21:13:41] PROBLEM - mw12 Current Load on mw12 is WARNING: WARNING - load average: 4.02, 7.20, 6.97 [21:14:53] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 5.71, 7.89, 7.16 [21:15:39] RECOVERY - mw12 Current Load on mw12 is OK: OK - load average: 5.13, 6.60, 6.77 [21:18:54] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 4.34, 6.02, 6.56 [21:28:32] [02miraheze/mw-config] 07RhinosF1 pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JK3xq [21:28:34] [02miraheze/mw-config] 07RhinosF1 032268e34 - LS: remove PasswordCannotMatchUsername, superseded by PasswordCannotBeSubstringInUsername [21:29:14] !log [rhinos@test3] starting deploy of {'config': True} to skip [21:29:15] !log [rhinos@test3] finished deploy of {'config': True} to skip - SUCCESS in 0s [21:29:19] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [21:29:22] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [21:29:44] miraheze/mw-config - RhinosF1 the build passed. [21:36:50] PROBLEM - mw11 Current Load on mw11 is CRITICAL: CRITICAL - load average: 13.56, 8.84, 6.31 [21:36:51] !log [@mw11] starting deploy of {'config': True} to all [21:36:53] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [21:37:03] !log [@mw11] finished deploy of {'config': True} to all - SUCCESS in 12s [21:37:06] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [21:38:48] PROBLEM - mw11 Current Load on mw11 is WARNING: WARNING - load average: 5.06, 6.99, 5.92 [21:40:44] RECOVERY - mw11 Current Load on mw11 is OK: OK - load average: 3.29, 5.73, 5.59 [23:04:51] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 12.25, 7.81, 5.14 [23:04:57] PROBLEM - mw11 Current Load on mw11 is CRITICAL: CRITICAL - load average: 12.89, 7.53, 4.90 [23:05:47] PROBLEM - mw8 Current Load on mw8 is CRITICAL: CRITICAL - load average: 12.27, 7.99, 4.88 [23:05:49] PROBLEM - mw9 Current Load on mw9 is CRITICAL: CRITICAL - load average: 13.31, 8.32, 5.13 [23:05:55] PROBLEM - mw11 APT on mw11 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [23:06:46] PROBLEM - cloud4 Current Load on cloud4 is CRITICAL: CRITICAL - load average: 28.72, 21.31, 15.53 [23:07:21] PROBLEM - mw12 Current Load on mw12 is CRITICAL: CRITICAL - load average: 12.70, 10.29, 6.40 [23:07:38] PROBLEM - cloud5 Current Load on cloud5 is CRITICAL: CRITICAL - load average: 22.69, 24.96, 18.24 [23:07:47] RECOVERY - mw8 Current Load on mw8 is OK: OK - load average: 4.87, 6.71, 4.80 [23:07:59] RECOVERY - mw11 APT on mw11 is OK: APT OK: 17 packages available for upgrade (0 critical updates). 
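On the mw-config change above ("LS: remove PasswordCannotMatchUsername, superseded by PasswordCannotBeSubstringInUsername"): MediaWiki expresses these as $wgPasswordPolicy checks, so the LocalSettings-level change presumably amounts to something like the sketch below; the policy group shown is illustrative, not Miraheze's real configuration:

```php
<?php
// Hedged sketch of what the LocalSettings-level change presumably amounts to;
// not Miraheze's actual configuration.

// Old line, dropped because the check is superseded upstream:
// $wgPasswordPolicy['policies']['default']['PasswordCannotMatchUsername'] = true;

// Newer check (MediaWiki 1.35+) that also rejects passwords which merely
// contain the username as a substring:
$wgPasswordPolicy['policies']['default']['PasswordCannotBeSubstringInUsername'] = true;
```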
[23:08:45] PROBLEM - cloud4 Current Load on cloud4 is WARNING: WARNING - load average: 22.97, 20.98, 16.08 [23:09:39] PROBLEM - cloud5 Current Load on cloud5 is WARNING: WARNING - load average: 18.60, 22.14, 17.99 [23:09:53] PROBLEM - mw9 Current Load on mw9 is WARNING: WARNING - load average: 3.94, 7.04, 5.49 [23:10:43] RECOVERY - cloud4 Current Load on cloud4 is OK: OK - load average: 12.20, 17.80, 15.52 [23:10:56] RECOVERY - mw11 Current Load on mw11 is OK: OK - load average: 2.89, 6.44, 5.54 [23:11:37] RECOVERY - cloud5 Current Load on cloud5 is OK: OK - load average: 14.32, 19.48, 17.53 [23:11:51] RECOVERY - mw9 Current Load on mw9 is OK: OK - load average: 3.46, 5.86, 5.25 [23:12:54] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 4.79, 6.89, 6.09 [23:13:11] RECOVERY - mw12 Current Load on mw12 is OK: OK - load average: 3.65, 6.75, 6.20 [23:14:52] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 3.27, 5.55, 5.69 [23:35:39] PROBLEM - mw12 APT on mw12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [23:36:44] PROBLEM - mw11 Current Load on mw11 is CRITICAL: CRITICAL - load average: 13.72, 8.10, 5.92 [23:36:46] PROBLEM - mw12 Current Load on mw12 is CRITICAL: CRITICAL - load average: 8.23, 6.57, 5.36 [23:36:57] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 10.59, 8.11, 6.14 [23:37:34] RECOVERY - mw12 APT on mw12 is OK: APT OK: 17 packages available for upgrade (0 critical updates). [23:38:40] PROBLEM - mw11 Current Load on mw11 is WARNING: WARNING - load average: 5.53, 6.97, 5.78 [23:38:43] RECOVERY - mw12 Current Load on mw12 is OK: OK - load average: 5.00, 5.78, 5.20 [23:38:57] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 4.79, 6.67, 5.84 [23:40:35] RECOVERY - mw11 Current Load on mw11 is OK: OK - load average: 3.96, 5.96, 5.55