[13:29:36] marostegui: kormat: I believe switchover (in ~30min?) would by default kill the maintenance scripts
[13:29:47] which means it would yet again not reach the far end of the servers
[13:29:52] Krinkle: that's how it usually goes yeah
[13:30:07] Krinkle: and as far as I know, the switchover is still happening yes
[13:30:15] should we do something about it?
[13:30:38] or do you think it's fine to let slide a third time and restart?
[13:30:39] Krinkle: If you can kill it yourself now, maybe that'd be good, they won't finish anyways
[13:30:58] well, I'd rather they keep running :)
[13:31:08] Krinkle: they'll be killed anyways
[13:32:12] unless we say they shouldn't in which case we'd skip that step and improvise something that leaves these running to completion
[13:32:24] it's a risk assessment that I'm not sure how to make with the information I have.
[13:33:01] Krinkle: I would prefer if we follow the normal switchover path, that is killing everything :)
[13:33:28] ok, so that means we're pretty much guaranteed to go over 80% sometime this week, right?
[13:33:50] Krinkle: I don't know the current status
[13:33:53] kormat: do you?
[13:34:03] Hiya, qq, what is the switchover schedule? How long will codfw be primary?
[13:34:30] ottomata: all i've heard is at least a month
[13:34:40] ok good to know thank you
[13:34:54] marostegui: we're at 27% right now
[13:35:03] https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Schedule_for_June_2021_switch
[13:35:05] ottomata: ^
[13:35:05] ottomata: https://phabricator.wikimedia.org/T281515 no date for the back, but at least a month it says
[13:35:27] marostegui: oh, you meant re: disk space. let me see.
[13:35:52] estimate is good, we have an alert that I'm deciding if it is worth a patch for...and a month is definitely worth it
[13:35:53] thank you
[13:36:20] marostegui: https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=pc1007&var-datasource=thanos&var-cluster=mysql&from=now-7d&to=now&viewPanel=12 doesn't look like we're about to hit a crisis at least
[13:36:52] let's not worry about it for now in any case
[13:36:54] kormat: pc2 at 27%, pc1 at 11%, and pc3 estimated at 27*3=81%
[13:38:23] we also have new hosts in codfw, which have bigger disks, we can simply move everything there next week
[13:38:35] kormat: for your review: https://gerrit.wikimedia.org/r/702128
[13:41:17] marostegui: i guess it doesn't matter which nodes you pick for es1/2/3
[13:41:58] yeah, just picked the ones that appear as masters in codfw.json
[13:42:51] _joe_: So, patch ready!
[13:43:26] <_joe_> ack, thanks
[13:45:53] The banner is wrong by the way
[13:45:55] At least on eswiki
[13:46:02] it says 05:00 UTC - 05:30 UTC
[13:46:11] Going to report that
[13:47:49] all wikis
[13:47:50] On enwiki too
[13:47:52] yeah
[13:48:53] someone already created a tmux named "switchdc"?
[13:49:47] https://meta.wikimedia.org/w/index.php?title=MediaWiki%3ACentralnotice-template-read_only_banner&type=revision&diff=21661913&oldid=21655347
[13:49:56] nvm, see -operations
[13:51:57] :)
[13:52:11] was my job still, wasn't it legoktm?
[13:53:05] yep :D
[13:54:01] pcc
[13:54:12] (wrong window)
[15:39:04] godog: I've acked an alert in Alertmanager, NavtimingStaleBeacon, and it says it will expire in 15min (ok), but it seems to not actually expire after 15min. I think it's renewing automatically which is fine too, but I'm a bit confused as to whether this is working correctly or not.
[15:39:17] (will it stop renewing if I close the browser window?)
[16:03:32] Krinkle: yeah clicking the 'tick' icon does the right thing, namely an auto-renewing silence as you noticed
[16:04:07] Krinkle: the long version is at https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_&_acknowledgements and I'd love feedback and/or fixes to it
[16:04:12] godog: hm.. so this will indefinitely ignore it even if it resolves and then fires again a week later?
[16:04:39] Krinkle: no, if there are no matching alerts the silence/ack is deleted
[16:04:46] ok. tjat
[16:04:49] ok, tat
[16:04:56] ugh, what's wrong with my keyboard
[16:04:59] ok, that's good enough for me
[16:05:55] yeah the renewing-silence as ack isn't AM native functionality but it's been built on top, it's good enough tho
[16:19:37] I'm going to create a silence for the matching critical alert that will fire after 24 hours. Rather than auto-renew, I'm going to set it to expire when we're scheduled to switch back. That's 19 July, right?
[16:21:42] dpifke: we don't have a switch back date as far as I know
[16:21:50] wkandek: can you confirm? ^
[16:23:52] Is it a known issue that the "view in AlertManager" link on the expanded silence points to localhost:9093?
[16:24:19] https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Schedule_for_June_2021_switch says switchback is "TBD. Sometime after August 1st"
[16:24:46] Ack, thanks.
[17:30:38] I'll start working on a switchback date, it's a bunch of cross-referencing of calendars with various people trying to pick the least worst time
[18:03:51] kormat: shall I restart/resume the purge now?
[18:05:53] Krinkle: SGTM
[18:48:24] {{done}}
[20:10:47] legoktm: re: T285806, hate saying this but probably announcing to Slack is another idea
[20:10:48] T285806: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806
[20:11:35] I for one, am not subscribed to wikitech-l. I am not sure why
[20:25:27] sukhe: was it not on your onboarding checklist?
[20:26:31] heh, it's not checked off