[01:08:55] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 9.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:09:53] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 8.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:10:33] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:11:31] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[08:26:13] marostegui: hi, okay if I merge this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/913662/
[08:31:03] oh it's a public holiday in Spain, so Jaime is out too. I'll just merge it to avoid a page
[09:00:26] no, I am not out
[09:02:11] today's bank holiday is only in Madrid, maybe somewhere else
[09:09:40] so db1132 and db1106 are corrupted/depooled; db2184 is down; db2139 is up but crashed; the last 2 are waiting for hw servicing
[09:09:50] am I missing something?
[10:48:03] ah, sorry
[10:48:34] yeah, that's my understanding
[10:49:17] jynus: quick q: with the maint today, if a codfw master goes down, would that break things given that we are on circular replication?
[10:49:47] *break things in eqiad
[10:51:49] mmmm
[10:52:47] if we want to be on the safe side, I suggest breaking circular replication, but I honestly have no idea how Manuel does it
[10:53:12] it is just stopping it from the eqiad primary
[10:53:33] and then probably getting rid of heartbeat rows?
[10:53:52] well, if it is temporary, no need - but ack the alerts
[10:54:21] codfw will be depooled from mw, right?
[10:54:48] yeah, that part is fine
[10:55:08] I have to downtime sections
[10:55:16] but beside that, nothing major
[10:55:18] I agree it's better to do it manually in advance than without network later
[10:55:39] downtime or stopping replication?
[10:55:41] it should be literally doing STOP SLAVE; on the eqiad master
[10:55:42] both
[10:55:48] downtime first
[10:56:08] that way it can be observed in advance and restarted if something goes wrong
[10:56:18] and the start it for consistency after maintenance and network is back
[10:56:21] *then
[10:56:58] hmm, sounds good to me
[10:57:31] maybe only stop replication on sections that have their master in row C
[10:57:41] yeah
[10:57:48] but downtime first
[10:59:23] remember the stop would be run on eqiad
[10:59:45] and then make sure alerts and mw are still happy
[11:02:24] sure
[11:05:28] stop slave and downtime things
[11:05:33] I'm breaking circular tomorrow
[11:05:38] today is a public holiday here
[11:29:00] awesome
[11:29:08] now go back to holidaying
[11:30:54] let's see which section masters are in row C.
[11:31:04] db2112 -> s1. Starting strong
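(A minimal sketch of the stop/downtime/restart sequence discussed above, assuming it is run on the eqiad primary of an affected section, e.g. db1130 for s5. These are generic MariaDB replication statements rather than the exact Wikimedia runbook; the downtiming itself happens in the alerting tooling and is not shown.)

    -- 1. With the section's alerts already downtimed, stop the
    --    codfw -> eqiad channel on the eqiad primary:
    STOP SLAVE;

    -- 2. Confirm the channel is stopped but keeps its coordinates,
    --    so it can simply be resumed later:
    SHOW SLAVE STATUS\G  -- expect Slave_IO_Running: No, Slave_SQL_Running: No

    -- 3. After the maintenance window, restart it for consistency:
    START SLAVE;
    SHOW SLAVE STATUS\G  -- Seconds_Behind_Master should catch back up to 0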
[11:33:02] replication downtime on codfw replicas should also be added, to avoid alert spam
[11:33:32] yeah yeah, I do that all the time
[11:37:08] so we have s1, s5, x1, m5 (doesn't need breaking), x1 and s8
[11:37:24] basically, which sections' masters are not in row C is the better question
[11:37:38] :D
[11:38:06] that may need rebalancing
[11:39:21] yeah, the biggest complexity is with switchovers, but we can take candidate masters into account and find a way. I want to do it but keep forgetting, and things keep happening
[11:44:49] yeah, no rush; now that codfw won't be primary it can be done with time
[11:45:13] I'm focusing on mediabackups, ping me if you need help with dbs
[11:52:04] I stopped the slave on the s5 master db1130, orch is not happy
[11:52:10] https://orchestrator.wikimedia.org/web/cluster/alias/s5
[11:52:46] started it again to avoid a page
[11:54:15] yeah, there is a heartbeat row in codfw, should I delete that row?
[11:55:18] but then restarting replication after the maint is gonna be fun
[12:03:56] jynus: sorry to ping but do you have ideas for this? :(
[12:04:31] why do you care about orchestrator? downtime alerts and done
[12:05:17] hmm, sounds okay to me. Two concerns would be: MW picking it up for heartbeat, eqiad alerting
[12:05:30] but I can try and see
[12:06:05] in any case, codfw has been depooled already: https://grafana.wikimedia.org/goto/AbyCj0sVz?orgId=1
[12:06:46] yeah, but the heartbeat is going to eqiad replicas too
[12:06:55] MW picking it up for heartbeat - that is why we have an extension to heartbeat
[12:07:07] with the primary heartbeat
[12:07:28] just double check with mw errors on 1 host
[12:07:39] *1 eqiad section
[12:08:24] yeah, so far it's clean
[12:09:03] and in theory alerts shouldn't fire, because we have the right logic
[12:09:08] neither should mw
[12:09:23] fingers crossed
[12:09:23] the issue is orchestrator is, I believe, very picky
[12:09:47] I only did s5 for now to be sure, then I'll do the rest of the sections before the maint window
[12:09:53] yup
[12:11:27] in any case, orch complains that the master is not replicating, which is to be expected
[12:11:36] worry if the replicas give that error :-D
[12:11:52] haha, fair
[12:12:14] it makes all of the replicas yellow too (=replication lag), but yeah, red would be worrying
[12:12:19] https://de.wikipedia.org/w/index.php?title=Benutzer:ASarabadani_(WMF)/test&action=history s5 works fine
[12:12:21] this is because normally we wouldn't leave the master stopped, we would remove its replication info
[12:13:07] just to be clear, this is indeed confusing, but of all people, you would be the one I'd expect to know the mw internals well enough not to be worried! :-)
[12:13:26] about how the mw load balancing and replication control works!
[12:13:43] mw's internals are so weird and overly complex that they're fragile
[12:13:49] that's what worries me
[12:14:21] and especially it's quite aggressive about going read only and basically fataling
[12:14:21] so keep in mind that the only thing we are stopping is a channel from codfw to eqiad, which should only provide useless heartbeat rows
[12:14:43] as we should be reading the ones from eqiad everywhere, including on codfw
[12:15:19] it is just that orchestrator detects a weird situation: "replication is configured but stopped"
[12:15:32] because it doesn't have the context that mw or custom alerts have
[12:16:18] and if mw is happy, it is ok, as it is quite verbose when there is lag
[12:17:06] can I give you an actionable for later?
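(Rough illustration of the heartbeat point above: each datacenter's primary writes its own heartbeat row, and consumers only look at the active primary's row, so the codfw row merely stops advancing on eqiad replicas while the channel is stopped. The shard/datacenter/ts column names follow the wmf-pt-heartbeat convention and are an assumption here, not checked against the live schema.)

    -- Run on an eqiad replica; assumes the wmf-pt-heartbeat layout with one
    -- row per (shard, datacenter). Only the eqiad row matters while eqiad is
    -- primary; a stale codfw ts is expected, not a problem.
    SELECT shard, datacenter, ts
    FROM heartbeat.heartbeat
    WHERE shard = 's5'
    ORDER BY datacenter;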
[12:18:10] sure
[12:18:21] there are 2 old dashboards for DBQuery and DBReplication (from when those channels existed, not something we chose)
[12:18:36] I am guessing those are now unified or changed into DBError
[12:18:48] in logstash?
[12:18:48] sorry, rdbms
[12:18:51] yep
[12:18:58] those should be deleted or updated
[12:19:04] yeah, delete them
[12:19:32] the db error one is linked from the main page, that's what matters the most to me
[12:21:01] up to you, just trying not to leave broken dashboards around, or to update them if they are used by some of you
[12:21:42] everything ok now, re: stopping/downtiming?
[12:22:04] seems so
[12:22:21] I stopped the slave on the other sections too
[12:23:03] ok, ping me again if you need more discussion. when programming, I at least don't read the chat very often
[12:23:18] sure. Thanks <3
[12:23:52] and I am super happy to help if you have doubts - I have many sometimes and I bug m*nuel all the time :-D
[12:25:25] the main issue here, I'd say, was orchestrator not being super-fit for our needs. I remember some things were done to make it work for us (but at least we don't have to maintain the codebase!)
[12:41:29] yeah, it's always a hard choice whether to build something in-house or adapt an existing solution (sometimes the adaptation itself is as much work as building one)
[13:08:00] https://usercontent.irccloud-cdn.com/file/fbS1u8Gj/grafik.png
[13:08:05] * Amir1 sweats heavily
[13:08:23] I know it's not anything to be worried about
[13:08:28] just.. scary
[16:10:49] not sure this is PEP-compliant 😬 https://phabricator.wikimedia.org/T327157#8819637
[18:18:27] I always love a good escaping issue
[20:42:50] urandom: o/ still looking for some guidance on https://phabricator.wikimedia.org/T330693. we're getting closer to deploying in wikikube and would love to be able to try out checkpointing before that. How can I help move that along?