[09:43:30] I might have missed something so possibly a dumb question - but did something happen during the switchover that caused some services to end up pooled in both DCs? I see kartotherian pooled in both codfw and eqiad https://config-master.wikimedia.org/discovery/discovery-basic.yaml
[09:49:13] <_joe_> hnowlan: most services should be active-active
[09:49:27] <_joe_> so yes, probably we repooled some services during the switchover
[09:50:24] cool, I think maps previously wasn't, but no harm in keeping it consistent
[15:19:14] anyone looking into "Icinga/Check for large files in client bucket"? we are getting lots of failures on that check on most bare metal hosts
[15:20:46] dcaro: see -cloud-admin, majavah just sent a fix
[15:20:54] ack, thanks!
[15:20:57] sorry for the noise
[15:21:09] my patch only affects cloud VMs, not bare metal
[15:23:43] ack, I see now, looking
[15:26:16] thanks, I'm in a meeting, let me know if you need anything urgent (it's not bothering us much so far, so no rush on our side)
[15:36:21] jbond: around?
[15:36:36] * volans running puppet on alert1001 to see if it removed a bunch of the checks
[15:36:44] I think it gets removed by puppet on the target host and then Icinga still tries to run the check via NRPE
[15:36:48] seems so from the puppet run
[15:37:09] I'll try to downtime the remaining ones
[15:37:32] ah, so it wasn't a logic issue, just a race condition, right?
[15:38:09] volans: for now I have sent a patch to absent the NRPE check, so in theory it should all get cleaned up within the next ~30 minutes
[15:38:38] just looking for the correct syntax to use shell expansion in an NRPE check (if it's possible)
[15:38:56] I've downtimed for 1h
[15:39:02] ack, thanks
[15:39:13] and will run puppet again on alert1001
[15:39:17] so that they should get removed
[15:39:22] or the recovery will be in 1h
[15:39:42] SGTM, thanks
[16:57:43] fyi I sent a fix for this issue but set the wrong return code, so we are now seeing warnings instead of unknowns. I sent a fix for the return code about 15 minutes ago, so it should clear in another 15 mins
[16:58:12] I'm going to repool codfw now
[16:58:13] :) looks good
[18:16:54] Are there progress reports/queries, etc. regarding Stretch deprecation in prod?
[18:27:57] andrewbogott: yeah, mwmaint2002 is buster, which completed the MW appserver migration, and therefore stretch support was dropped from the mediawiki module the other day
[18:28:32] ok, so prod is effectively stretch-free? That's good to know.
[18:28:35] andrewbogott: https://phabricator.wikimedia.org/T247045
[18:28:40] thanks!
[18:28:46] see open subtasks though
[18:28:58] * andrewbogott nods
[18:48:34] andrewbogott: not quite, there are 373 hosts left :-)
[18:49:13] five of those are in WMCS (4x NFS and one clouddb test host in codfw)
[22:06:55] !log wdqs1004 - HTTP/1.1 503 Service Unavailable - systemctl restart wdqs-blazegraph
[22:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:36] ^ it had been like that for 18 hours, but that did it apparently: RECOVERY - Query Service HTTP Port on wdqs1004 is OK
[22:08:59] though now "WDQS high update lag" on wdqs1004 is CRITICAL
[22:09:08] dcausse: ^ CC, based on lastlog
[22:12:57] also restarted the wdqs-updater service afterwards, as the docs said to
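
For the 09:43 question about services ending up pooled in both DCs, one quick way to spot them is to parse the discovery YAML directly rather than eyeballing it. This is a minimal sketch, assuming the file is a flat mapping of service name to the list of datacenters it is pooled in; the real structure of discovery-basic.yaml may differ, in which case the parsing needs adjusting:

```python
#!/usr/bin/env python3
"""List services pooled in more than one datacenter.

Sketch only: assumes discovery-basic.yaml is a flat mapping of
service name -> list of pooled datacenters.
"""
import requests
import yaml

URL = "https://config-master.wikimedia.org/discovery/discovery-basic.yaml"


def pooled_in_multiple_dcs(url: str = URL) -> dict:
    """Return {service: [dc, ...]} for every service pooled in >1 DC."""
    data = yaml.safe_load(requests.get(url, timeout=10).text)
    return {
        service: dcs
        for service, dcs in sorted(data.items())
        if isinstance(dcs, (list, tuple)) and len(dcs) > 1
    }


if __name__ == "__main__":
    for service, dcs in pooled_in_multiple_dcs().items():
        print(f"{service}: {', '.join(dcs)}")
```

As _joe_ notes at 09:49, most services are meant to be active-active, so a long list here is expected after a switchover; the interesting entries are the ones (like maps/kartotherian) that were previously single-DC.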
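
On the 22:08 "WDQS high update lag" alert: a common way to measure update lag is to ask the query service when it last saw a Wikidata edit, via the schema:dateModified triple on <http://www.wikidata.org>, and compare that to now. A rough sketch follows; the endpoint URL is an assumption (the public query.wikidata.org endpoint is used here, but on a wdqs host you would point it at the host's own SPARQL endpoint instead):

```python
#!/usr/bin/env python3
"""Rough check of WDQS update lag via the schema:dateModified triple.

Sketch only: ENDPOINT is an assumption; swap in the local SPARQL
endpoint when checking a specific wdqs host such as wdqs1004.
"""
from datetime import datetime, timezone

import requests

ENDPOINT = "https://query.wikidata.org/sparql"  # assumption, not host-specific
QUERY = """
PREFIX schema: <http://schema.org/>
SELECT ?dateModified WHERE { <http://www.wikidata.org> schema:dateModified ?dateModified }
"""


def update_lag_seconds(endpoint: str = ENDPOINT) -> float:
    """Return the number of seconds since the last update the service has seen."""
    resp = requests.get(
        endpoint,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wdqs-lag-check (ops sketch)"},
        timeout=30,
    )
    resp.raise_for_status()
    value = resp.json()["results"]["bindings"][0]["dateModified"]["value"]
    last_update = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - last_update).total_seconds()


if __name__ == "__main__":
    print(f"update lag: {update_lag_seconds():.0f}s")
```

A large value here after a blazegraph restart is consistent with the 22:12 note: the updater (wdqs-updater) has to be restarted and then catch up before the lag alert clears.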