[03:58:37] hm, not looking good (to my untrained eye) on https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=8&orgId=1&refresh=1m&from=now-2d&to=now
[07:24:04] yeah, very not good :(
[07:25:37] gehel: it seems the issue is solely with the codfw instances, can we depool them to help them catch up? lag seems to be dropping
[07:49:04] zpapierski: not yet in the office
[07:49:25] Can you check if both DCs are pooled?
[07:50:31] I need to get into the office (meaning in front of my computer) too, actually, but once I do I'll check
[07:51:03] If only codfw is pooled, that explains the issue (and confirms that we need to increase capacity - or decrease compute load via the new updater)
[08:28:29] zpapierski: you're not in #wikimedia-operations?
[08:29:11] I've just repooled eqiad, we were overly confident in depooling it earlier (as part of the DC switch)
[08:30:08] Apparently I was disconnected at some point (irccloud should be more verbose about it)
[08:30:30] You're leaving codfw in the pool?
[08:31:35] (still not home, and it looks like it will take at least 1.5h :( )
[08:33:20] Waiting to see if eqiad is doing well before depooling codfw
[08:35:45] Makes sense
[08:53:20] How's the flink deploy going?
[09:17:26] ryankemper (for when around): I've repooled wdqs eqiad and depooled codfw. We'll need to repool codfw once it has caught up.
[09:21:36] addshore: I'm waiting for a review on my potential fix for the task managers' networking woes
[09:22:03] gotcha!
[09:22:21] After that we are planning an upgrade after the next switchover, somewhere in the second half of September
[10:22:39] lunch
[12:58:23] break
[13:50:57] gehel, mpham - are we dropping the QS&W checkpoint today? our WMDE friends all declined
[13:51:18] yeah, I think we should
[13:51:27] yes, let's drop
[14:28:54] relocating
[16:55:55] Hi all. The CirrusSearch docs recommend reindexing after upgrades. How big of an upgrade would necessitate reindexing?
[16:57:07] Like if I were upgrading MW from 1.35.3 to 1.35.4, and so pulling down all gerrit updates since my pulls from REL1_35 for my 1.35.3 wikis, I would likely get some minor changes to CS, possibly Elastica and/or relevant PHP libraries.
[16:57:29] More simply put, under what conditions should I need to reindex my wikis?
[16:58:44] justinl: reindexing is generally necessary any time the underlying mappings change. These don't change particularly fast, and I don't think we ever change them in patches to REL branches
[16:59:20] justinl: so you should be safe not running any reindexing process until you move to REL1_36.
[17:00:04] even if you miss one, in 99% of cases the effect won't be noticeable. It's often various language-specific handling that's changing
[17:04:28] Cool, thank you! And since the reindexing (at least certain methods) seems to be transparent enough that it can be done without downtime, for the major version upgrades that do always require downtime, at least reindexing shouldn't extend that downtime.
[17:06:26] My dev AWS ES setup still seems to be working ok, though without any real load, so I'm currently planning a live deploy on Sep 1. I did have a meeting with an AWS specialist about our needs and properly sizing a cluster for perf and HA, and while it doesn't seem either cheap or terribly expensive, I really hope it works out okay since rolling my own with sufficient HA would be much more complex.
[17:07:01] Their recs for our needs are 3 data nodes (50 GB each) and 3 master nodes spanning 3 AZs.
[17:09:37] justinl: I can see how that would add up :) We run similar min specs, although with only 2 AZ-equivalents
[17:10:09] hope it works out, indeed maintaining it all yourself can add up to quite a bit of effort over time
[17:16:08] Yeah. Next year I finally plan to start learning how to script/automate AWS stuff with Terraform so I can stop with so much manual creation and management of all of the AWS components. It's a ton for one person to manage, on top of MW's complexity, and on top of everything else on my plate.
[17:16:38] Next year, though, I'm really hoping to move my EC2 stuff into EKS. That'll help in a lot of ways, but it will also fundamentally change my wiki management procedures.
[17:19:32] we're also on a big move to put mediawiki into containers... it's getting close, there are finally live test instances we can run requests through
[17:19:42] (although I'm not a part of any of that, just see the updates :)
[17:31:57] There are definitely going to be challenges, for sure. Even just thinking through the high-level design, I have concerns. Like sharing the MW code between k8s nodes, which would probably have to be done with an EFS filesystem with provisioned IOPS, and it will require redesigning my wiki configs to (finally) use a single copy of MW for all wikis rather than one full copy each. But then this should allow for a single copy, period, rather than one per server. So going down from 24 copies to 1 would really be nice.
[21:43:32] figures. The problem with rebuilding cirrus completion indices on wikitech came about because someone fixed a bug in some locking code where it would attempt to take a lock, fail, and then report success :)
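
A note on the reindexing discussion above (16:55-17:04): picking up mapping changes without downtime is normally done with CirrusSearch's UpdateSearchIndexConfig.php maintenance script, which builds a new index alongside the old one and swaps the alias once it is populated, so searches keep being served during the rebuild. A minimal sketch, assuming a standard MediaWiki layout with CirrusSearch under extensions/ and that the documented options below apply to your CirrusSearch version (run once per wiki):

    # Rebuild the search indices in place; the old index keeps serving
    # queries until the new one (identified by a timestamp) is ready,
    # after which the alias is swapped and the old index removed.
    php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php \
        --reindexAndRemoveOk --indexIdentifier now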