[09:43:30] I might have missed something so possibly a dumb question - but did something happen during the switchover that caused some services to end up pooled in both DCs? I see kartotherian pooled in both codfw and eqiad https://config-master.wikimedia.org/discovery/discovery-basic.yaml
[09:49:13] <_joe_> hnowlan: most services should be active-active
[09:49:27] <_joe_> so yes, probably we repooled some services during the switchover
[09:50:24] cool, I think maps previously wasn't, but no harm in keeping it consistent
[15:19:14] anyone looking into "Icinga/Check for large files in client bucket"? we are getting lots of failures on that check on most bare metal hosts
[15:20:46] dcaro: see -cloud-admin, majavah just sent a fix
[15:20:54] ack, thanks!
[15:20:57] sorry for the noise
[15:21:09] my patch only affects cloud VMs, not bare metal
[15:23:43] ack, I see now, looking
[15:26:16] thanks, I'm in a meeting, let me know if you need anything urgent (it's not bothering us much so far, so no rush on our side)
[15:36:21] jbond: around?
[15:36:36] * volans running puppet on alert1001 to see if it removed a bunch of the checks
[15:36:44] I think it gets removed by puppet on the target host and then Icinga still tries to run the check via NRPE
[15:36:48] seems so from the puppet run
[15:37:09] I'll try to downtime the remaining ones
[15:37:32] ah, so it wasn't a logic issue, just a race condition, right?
[15:38:09] volans: for now I have sent a patch to absent the NRPE check, so in theory it should all get cleaned up within the next ~30 minutes
[15:38:38] just looking for the correct syntax to use shell expansion in an NRPE check (if it's possible)
[15:38:56] I've downtimed for 1h
[15:39:02] ack, thanks
[15:39:13] and will run puppet again on alert1001
[15:39:17] so that they should get removed
[15:39:22] or the recovery will be in 1h
[15:39:42] SGTM, thanks
[16:57:43] fyi I sent a fix for this issue but set the wrong return code, so we are now seeing warnings instead of unknowns. I sent a fix for the return code about 15 minutes ago, so it should clear in another 15 mins
[16:58:12] I'm going to repool codfw now
[16:58:13] :) looks good
[18:16:54] Are there progress reports/queries, etc. regarding Stretch deprecation in prod?
[18:27:57] andrewbogott: yeah, mwmaint2002 is buster, which completed the MW appserver migration, and therefore stretch support was dropped from the mediawiki module the other day
[18:28:32] ok, so prod is effectively stretch-free? That's good to know.
[18:28:35] andrewbogott: https://phabricator.wikimedia.org/T247045
[18:28:40] thanks!
[18:28:46] see open subtasks though
[18:28:58] * andrewbogott nods
[18:48:34] andrewbogott: not quite, there are 373 hosts left :-)
[18:49:13] five of those are in WMCS (4x NFS and one clouddb test host in codfw)
[22:06:55] !log wdqs1004 - HTTP/1.1 503 Service Unavailable - systemctl restart wdqs-blazegraph
[22:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:36] ^ it had been like that for 18 hours, but that did it apparently: RECOVERY - Query Service HTTP Port on wdqs1004 is OK
[22:08:59] though now "WDQS high update lag" on wdqs1004 is CRITICAL
[22:09:08] dcausse: ^ CC, based on lastlog
[22:12:57] also restarted the wdqs-updater service afterwards, as the docs said to
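
For the 09:43 question about services ending up pooled in both DCs, one quick way to spot them is to parse the discovery YAML directly rather than eyeballing it. This is a minimal sketch, assuming the file is a flat mapping of service name to the list of datacenters it is pooled in; the real structure of discovery-basic.yaml may differ, in which case the parsing needs adjusting:

```python
#!/usr/bin/env python3
"""List services pooled in more than one datacenter.

Sketch only: assumes discovery-basic.yaml is a flat mapping of
service name -> list of pooled datacenters.
"""
import requests
import yaml

URL = "https://config-master.wikimedia.org/discovery/discovery-basic.yaml"


def pooled_in_multiple_dcs(url: str = URL) -> dict:
    """Return {service: [dc, ...]} for every service pooled in >1 DC."""
    data = yaml.safe_load(requests.get(url, timeout=10).text)
    return {
        service: dcs
        for service, dcs in sorted(data.items())
        if isinstance(dcs, (list, tuple)) and len(dcs) > 1
    }


if __name__ == "__main__":
    for service, dcs in pooled_in_multiple_dcs().items():
        print(f"{service}: {', '.join(dcs)}")
```

As _joe_ notes at 09:49, most services are meant to be active-active, so a long list here is expected after a switchover; the interesting entries are the ones (like maps/kartotherian) that were previously single-DC.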
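
On the 22:08 "WDQS high update lag" alert: a common way to measure update lag is to ask the query service when it last saw a Wikidata edit, via the schema:dateModified triple on <http://www.wikidata.org>, and compare that to now. A rough sketch follows; the endpoint URL is an assumption (the public query.wikidata.org endpoint is used here, but on a wdqs host you would point it at the host's own SPARQL endpoint instead):

```python
#!/usr/bin/env python3
"""Rough check of WDQS update lag via the schema:dateModified triple.

Sketch only: ENDPOINT is an assumption; swap in the local SPARQL
endpoint when checking a specific wdqs host such as wdqs1004.
"""
from datetime import datetime, timezone

import requests

ENDPOINT = "https://query.wikidata.org/sparql"  # assumption, not host-specific
QUERY = """
PREFIX schema: <http://schema.org/>
SELECT ?dateModified WHERE { <http://www.wikidata.org> schema:dateModified ?dateModified }
"""


def update_lag_seconds(endpoint: str = ENDPOINT) -> float:
    """Return the number of seconds since the last update the service has seen."""
    resp = requests.get(
        endpoint,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wdqs-lag-check (ops sketch)"},
        timeout=30,
    )
    resp.raise_for_status()
    value = resp.json()["results"]["bindings"][0]["dateModified"]["value"]
    last_update = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - last_update).total_seconds()


if __name__ == "__main__":
    print(f"update lag: {update_lag_seconds():.0f}s")
```

A large value here after a blazegraph restart is consistent with the 22:12 note: the updater (wdqs-updater) has to be restarted and then catch up before the lag alert clears.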