[00:14:58] serviceops, Performance-Team, Code-Health-Objective, Platform Team Initiatives (Session Management Service (CDP2)), and 2 others: Determine multi-dc strategy for CentralAuth - https://phabricator.wikimedia.org/T267270 (tstarling) Open→Resolved
[01:06:26] serviceops, SRE, Patch-For-Review, User-Joe: Set up A/B testing mechanism for PHP7 - https://phabricator.wikimedia.org/T216676 (Krinkle)
[01:07:44] serviceops, MediaWiki-General, MediaWiki-libs-ObjectCache, Performance-Team, User-jijiki: Use php-hrtime monotonic clock instead of microtime for perf measure in MW - https://phabricator.wikimedia.org/T245464 (Krinkle) Open→Declined a:dpifke→None In favour of {T271736}
[01:09:24] serviceops, MediaWiki-General, MediaWiki-libs-ObjectCache, Performance-Team, User-jijiki: Use php-hrtime monotonic clock instead of microtime for perf measure in MW - https://phabricator.wikimedia.org/T245464 (Krinkle) Declined→Open Re-opening for original purpose. The php-hrtime bloc...
[05:29:41] <_joe_> jelto: ping me when you're around and we can depool codfw
[07:36:08] _joe_: I'm here in ~30m. I'll ping you
[07:36:19] <_joe_> sure, no rush
[08:01:53] _joe_ I'm here now and looking at your dc-maint.sh script
[08:06:29] <_joe_> jelto: I have a small cosmetic update
[08:09:50] mw2289 timed out for kart's ongoing deploy. Not sure if known/expected.
[08:12:40] _joe_ Is there a diff/change for the update? :)
[08:13:08] <_joe_> RhinosF1: it should be set to pooled=inactive already, let me check
[08:15:10] Thanks joe!
[08:46:33] _joe_: yep
[08:47:01] <_joe_> hnowlan: if you take another look at the etherpad, there's some stuff there
[08:51:44] _joe_: will do
[08:52:08] I am actually going to be on and off a few times in the afternoon, but I can set things up in advance, it should be fine
[09:00:44] _joe_: what's the plan with depooling codfw? I saw you prepared all of our hosts. Do we wait until restbase, maps and sessionstore are DOWN too?
[09:00:58] <_joe_> no
[09:01:01] <_joe_> let's depool early
[09:01:42] <_joe_> jelto: wait a sec, though, I'll copy a new version of the script with a bit of comments and some cosmetic improvements to the output
[09:01:53] ack :)
[09:02:06] <_joe_> actually, no, let me check one thing, and let's go with the current version
[09:02:17] works for me too
[09:02:27] <_joe_> before we launch it, we need to check if there are services currently just pooled in codfw
[09:02:45] <_joe_> my way of doing that is
[09:05:22] <_joe_> for dc in eqiad codfw; do confctl --object-type discovery select "name=$dc" get | grep -F '"pooled": true' | jq .tags | sort | uniq > $dc.pooled; done
[09:06:45] <_joe_> diffing the files, the only thing that's pooled only in codfw is kartotherian
[09:06:51] <_joe_> which we have in the exclude list
[09:08:35] I can confirm that. But what about docker-registry? That service is also pooled in codfw but not in eqiad
[09:08:57] ah, docker-registry is also in the exclude list
[09:09:44] <_joe_> yes
[09:13:07] <_joe_> so, I think we're good to go
[09:13:12] <_joe_> proceed at your convenience
[09:13:20] <_joe_> and !log to #operations when you start
[09:13:34] <_joe_> I decided against flooding the chat with stuff
[09:15:02] _joe_: okay. Just to make sure I don't miss anything, it's just /home/oblivian/dc-maint.sh depool codfw ?
[09:15:15] <_joe_> jelto: yes
[09:15:52] <_joe_> sorry, gods of automation. We'll finish the kubernetes rolling restart cookbook to placate you
[09:18:10] _joe_ is the cookbook running currently? I did not find anything in SAL
[09:18:41] <_joe_> no, I mean
[09:18:51] <_joe_> we'll finish making it work as it should :P
[09:20:11] ah, you mean the logic of dc-maint.sh should move to a kubernetes cookbook at some point?
[09:30:15] _joe_: I'm ready.
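The "pooled only in one DC" check _joe_ describes boils down to comparing two sorted lists of service names. A minimal self-contained sketch with made-up service names and illustrative filenames (the real lists come from the confctl one-liner above; `comm -13` prints entries found only in the second file):

```shell
# Sample data standing in for the confctl-generated lists; the service
# names and files here are illustrative, not real confctl output.
printf 'appservers-ro\nkartotherian\nswift-ro\n' > codfw.pooled
printf 'appservers-ro\nswift-ro\n' > eqiad.pooled

# comm -13 suppresses lines unique to the first file and lines common
# to both, leaving services pooled in codfw but not in eqiad.
comm -13 eqiad.pooled codfw.pooled
# → kartotherian
```

A plain `diff eqiad.pooled codfw.pooled` works too, as in the chat; `comm` just isolates the one-sided entries without diff markers. Both require the inputs to be sorted, which the `sort | uniq` in the one-liner guarantees.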
If you are ok I'm going to start the script and log in -operations
[09:30:27] <_joe_> jelto: +1
[09:38:51] _joe_: done, script finished
[09:39:21] <_joe_> ack, I'll check results in a bit
[09:39:51] great, thanks a lot :) I updated the status in the pad
[09:40:30] <_joe_> cool, thanks
[09:51:00] hi, it looks like mediawiki_access_log.mtail could be impacted by https://phabricator.wikimedia.org/T314922
[09:52:46] <_joe_> it would, in theory, but it won't in practice
[09:52:59] <_joe_> apache timings are in nanoseconds so we never get a proper 0 value
[09:53:14] <_joe_> but yeah, we should probably still add that bucket
[09:57:56] <_joe_> jelto: lgtm
[09:59:20] thanks!
[10:34:53] maps2010 and restbase stuff is down; sessionstore is still up because I want to leave that until closer to the work
[10:35:54] aiui we're still 100% safe to lose a sessionstore host, but given that there's one host per rack (hence ownership of 100%) I would like to avoid the exposure to risk for now
[10:43:55] <_joe_> +1
[13:37:06] _joe_, jelto: A heads up that Reuven may be late / unavailable today. If that's the case I'll reach out to Daniel.
[14:56:40] I'm around after all :) appreciate it though
[14:56:45] catching up now
[16:32:54] serviceops, Gerrit, SRE, serviceops-collab, and 2 others: replacement for gerrit2001, decom gerrit2001 - https://phabricator.wikimedia.org/T243027 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: `gerrit2001.wikimedia.org` - gerrit2001.wikimedia.org (**...
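The bucket discussion above can be sketched generically (this is not the actual mediawiki_access_log.mtail config, and the thresholds and input durations are made up): with a dedicated bucket for exact-zero values, a 0 sample is counted separately from small-but-nonzero timings, which is the case _joe_ notes never occurs in practice because apache reports nanoseconds.

```shell
# Illustrative only: classify sample durations (in seconds) into
# buckets, including an explicit bucket for exact-zero values.
printf '0\n0.003\n0.05\n' | awk '
  $1 == 0    { zero++;  next }   # the bucket under discussion
  $1 <= 0.01 { small++; next }   # small but nonzero timings
             { rest++ }
  END { printf "zero=%d le0.01=%d rest=%d\n", zero, small, rest }'
# → zero=1 le0.01=1 rest=1
```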
[17:46:21] serviceops, Gerrit, SRE, serviceops-collab, and 2 others: replacement for gerrit2001, decom gerrit2001 - https://phabricator.wikimedia.org/T243027 (Dzahn)
[17:46:29] serviceops, Gerrit, SRE, serviceops-collab, and 2 others: replacement for gerrit2001, decom gerrit2001 - https://phabricator.wikimedia.org/T243027 (Dzahn) In progress→Resolved gerrit2002 is production https://gerrit-replica.wikimedia.org gerrit2001 is shut down and fully decom'ed.
[18:31:50] weird, "Host parse[2019,2020] is not in mediawiki-installation dsh group" are both still crit despite the hosts being repooled
[18:36:34] rzl: the same thing happened previously with other appservers.. then I told Icinga to reschedule and waited longer and eventually it resolved
[18:36:48] I think it's just that the check doesn't run very often
[18:37:07] or it needs a puppet run on both conf* and alert*
[18:37:58] yeah, rescheduling didn't work but I figured it was something like that puppet run -- I'll check back in a little bit
[18:39:55] yea, I had the exact same pattern.. "why is it still crit".. "why does reschedule still not do it".. then.. it did
[18:59:21] serviceops, Community-Tech, Data-Persistence (Consultation), MediaWiki-extensions-Phonos, SRE: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (MusikAnimal) >>! In T314789#8139432, @Legoktm wrote: > I would recommend...
[20:57:41] serviceops, Community-Tech, Data-Persistence (Consultation), MediaWiki-extensions-Phonos, SRE: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (Legoktm) >>! In T314789#8140256, @TheDJ wrote: > You can use lame and/or...