[01:08:55] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 9.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:09:53] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 8.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:10:33] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:11:31] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[08:26:13] marostegui: hi, okay if I merge this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/913662/
[08:31:03] oh it's a public holiday in Spain, so Jaime is out too. I'll just merge it to avoid a page
[09:00:26] no, I am not out
[09:02:11] today's bank holiday is only in Madrid, maybe somewhere else
[09:09:40] so db1132 and db1106 are corrupted/depooled; db2184 is down; db2139 is up but crashed; the last 2 are waiting for hw servicing
[09:09:50] am I missing something?
[10:48:03] ah, sorry
[10:48:34] yeah, that's my understanding
[10:49:17] jynus: quick q: with the maint today, if a codfw master goes down, would that break things given that we are on circular replication?
[10:49:47] *break things in eqiad
[10:51:49] mmmm
[10:52:47] if we want to be on the safe side, I suggest breaking circular replication, but I honestly have no idea how Manuel does it
[10:53:12] it is just stopping it from the eqiad primary
[10:53:33] and then probably getting rid of heartbeat rows?
[10:53:52] well, if it is temporary, no need - but ack the alerts
[10:54:21] codfw will be depooled from mw, right?
[10:54:48] yeah, that part is fine
[10:55:08] I have to downtime sections
[10:55:16] but beside that, nothing major
[10:55:18] I agree it's better to do it manually in advance than without network later
[10:55:39] downtime or stopping replication?
[10:55:41] it should be literally doing STOP SLAVE; on the eqiad master
[10:55:42] both
[10:55:48] downtime first
[10:56:08] that way it can be observed in advance and restarted if something goes wrong
[10:56:18] and the start it for consistency after maintenance and network is back
[10:56:21] *then
[10:56:58] hmm, sounds good to me
[10:57:31] maybe only stop replication on sections that have their master in row C
[10:57:41] yeah
[10:57:48] but downtime first
[10:59:23] remember the stop would be run on eqiad
[10:59:45] and then make sure alerts and mw are still happy
[11:02:24] sure
[11:05:28] stop slave and downtime things
[11:05:33] I'm breaking circular tomorrow
[11:05:38] today is a public holiday here
[11:29:00] awesome
[11:29:08] now go back to holidaying
[11:30:54] let's see which section masters are in row C.
[11:31:04] db2112 -> s1. Starting strong
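(A minimal sketch of the stop/downtime/restart sequence discussed above, assuming it is run on the eqiad primary of an affected section, e.g. db1130 for s5. These are generic MariaDB replication statements rather than the exact Wikimedia runbook; the downtiming itself happens in the alerting tooling and is not shown.)

    -- 1. With the section's alerts already downtimed, stop the
    --    codfw -> eqiad channel on the eqiad primary:
    STOP SLAVE;

    -- 2. Confirm the channel is stopped but keeps its coordinates,
    --    so it can simply be resumed later:
    SHOW SLAVE STATUS\G  -- expect Slave_IO_Running: No, Slave_SQL_Running: No

    -- 3. After the maintenance window, restart it for consistency:
    START SLAVE;
    SHOW SLAVE STATUS\G  -- Seconds_Behind_Master should catch back up to 0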
[11:33:02] replication downtime on codfw replicas should also be added, to avoid alert spam
[11:33:32] yeah yeah, I do that all the time
[11:37:08] so we have s1, s5, x1, m5 (doesn't need breaking), x1 and s8
[11:37:24] basically, which sections' masters are not in row C is the better question
[11:37:38] :D
[11:38:06] that may need rebalancing
[11:39:21] yeah, the biggest complexity is with switchovers, but we can take candidate masters into account and find a way. I want to do it but keep forgetting, and things keep happening
[11:44:49] yeah, no rush; now that codfw won't be primary it can be done with time
[11:45:13] I'm focusing on mediabackups, ping me if you need help with dbs
[11:52:04] I stopped the slave on the s5 master db1130, orch is not happy
[11:52:10] https://orchestrator.wikimedia.org/web/cluster/alias/s5
[11:52:46] started it again to avoid a page
[11:54:15] yeah, there is a heartbeat row in codfw, should I delete that row?
[11:55:18] but then restarting replication after the maint is gonna be fun
[12:03:56] jynus: sorry to ping but do you have ideas for this? :(
[12:04:31] why do you care about orchestrator? downtime alerts and done
[12:05:17] hmm, sounds okay to me. Two concerns would be: MW picking it up for heartbeat, eqiad alerting
[12:05:30] but I can try and see
[12:06:05] in any case, codfw has been depooled already: https://grafana.wikimedia.org/goto/AbyCj0sVz?orgId=1
[12:06:46] yeah, but the heartbeat is going to eqiad replicas too
[12:06:55] MW picking it up for heartbeat - that is why we have an extension to heartbeat
[12:07:07] with the primary heartbeat
[12:07:28] just double check with mw errors on 1 host
[12:07:39] *1 eqiad section
[12:08:24] yeah, so far it's clean
[12:09:03] and in theory alerts shouldn't fire, because we have the right logic
[12:09:08] neither should mw
[12:09:23] fingers crossed
[12:09:23] the issue is orchestrator is, I believe, very picky
[12:09:47] I only did s5 for now to be sure, then I'll do the rest of the sections before the maint window
[12:09:53] yup
[12:11:27] in any case, orch complains that the master is not replicating, which is to be expected
[12:11:36] worry if the replicas give that error :-D
[12:11:52] haha, fair
[12:12:14] it makes all of the replicas yellow too (=replication lag), but yeah, red would be worrying
[12:12:19] https://de.wikipedia.org/w/index.php?title=Benutzer:ASarabadani_(WMF)/test&action=history s5 works fine
[12:12:21] this is because normally we wouldn't leave the master stopped, we would remove its replication info
[12:13:07] just to be clear, this is indeed confusing, but of all people, you would be the one I'd expect to know the mw internals well enough not to be worried! :-)
[12:13:26] about how the mw load balancing and replication control works!
[12:13:43] mw's internals are so weird and overly complex that they're fragile
[12:13:49] that's what worries me
[12:14:21] and especially it's quite aggressive about going read only and basically fataling
[12:14:21] so keep in mind that the only thing we are stopping is a channel from codfw to eqiad, which should only provide useless heartbeat rows
[12:14:43] as we should be reading the ones from eqiad everywhere, including on codfw
[12:15:19] it is just that orchestrator detects a weird situation: "replication is configured but stopped"
[12:15:32] because it doesn't have the context that mw or custom alerts have
[12:16:18] and if mw is happy, it is ok, as it is quite verbose when there is lag
[12:17:06] can I give you an actionable for later?
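(Rough illustration of the heartbeat point above: each datacenter's primary writes its own heartbeat row, and consumers only look at the active primary's row, so the codfw row merely stops advancing on eqiad replicas while the channel is stopped. The shard/datacenter/ts column names follow the wmf-pt-heartbeat convention and are an assumption here, not checked against the live schema.)

    -- Run on an eqiad replica; assumes the wmf-pt-heartbeat layout with one
    -- row per (shard, datacenter). Only the eqiad row matters while eqiad is
    -- primary; a stale codfw ts is expected, not a problem.
    SELECT shard, datacenter, ts
    FROM heartbeat.heartbeat
    WHERE shard = 's5'
    ORDER BY datacenter;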
[12:18:10] sure
[12:18:21] there are 2 old dashboards for DBQuery and DBReplication (from when those channels existed, not something we chose)
[12:18:36] I am guessing those are now unified or changed into DBError
[12:18:48] in logstash?
[12:18:48] sorry, rdbms
[12:18:51] yep
[12:18:58] those should be deleted or updated
[12:19:04] yeah, delete them
[12:19:32] the db error one is linked from the main page, that's what matters the most to me
[12:21:01] up to you, just trying not to leave broken dashboards around, or to update them if they are used by some of you
[12:21:42] everything ok now, re: stopping/downtiming?
[12:22:04] seems so
[12:22:21] I stopped the slave on the other sections too
[12:23:03] ok, ping me again if you need more discussion. when programming, I at least don't read the chat very often
[12:23:18] sure. Thanks <3
[12:23:52] and I am super happy to help if you have doubts - I have many sometimes and I bug m*nuel all the time :-D
[12:25:25] the main issue here, I'd say, was orchestrator not being super-fit for our needs. I remember some things were done to make it work for us (but at least we don't have to maintain the codebase!)
[12:41:29] yeah, it's always a hard choice whether to build something in-house or adapt an existing solution (sometimes the adaptation itself is as much work as building one)
[13:08:00] https://usercontent.irccloud-cdn.com/file/fbS1u8Gj/grafik.png
[13:08:05] * Amir1 sweats heavily
[13:08:23] I know it's not anything to be worried about
[13:08:28] just.. scary
[16:10:49] not sure this is PEP-compliant 😬 https://phabricator.wikimedia.org/T327157#8819637
[18:18:27] I always love a good escaping issue
[20:42:50] urandom: o/ still looking for some guidance on https://phabricator.wikimedia.org/T330693. we're getting closer to deploying in wikikube and would love to be able to try out checkpointing before that. How can I help move that along?