[11:24:45] hi all, im looking to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/968667/6, the ssl change for mariadb::misc
[11:25:13] one of the post actions is to restart mysql on one of the misc servers.
[11:25:25] the documentation i have found simply says use dbproxy to depool
[11:25:37] does anyone have any more information?
[11:25:56] this is where im looking https://wikitech.wikimedia.org/wiki/Service_restarts#MySQL/MariaDB
[12:02:14] did you talk to Manuel or Amir about that? They should be able to help you with that
[12:46:39] thanks jynus, ill wait for Amir1 or marostegui to respond
[13:58:31] hey jbond
[13:58:36] I can take care of that yes
[14:00:06] marostegui: thanks, if you can let me know when a host has been depooled i can deploy the change, restart mysql and test. unless you would prefer to merge the change etc
[14:02:26] jbond: Let's go for codfw master, it is unused, let me downtime it
[14:02:50] And let's also do phabricator codfw master
[14:03:13] So it will be db2135 and db2134, let me downtime
[14:03:35] ack
[14:04:47] jbond: Downtimed, let me know when merged, so I can run puppet and restart
[14:05:49] ack one sec
[14:07:12] marostegui: ok merged
[14:07:19] ok, going for it
[14:07:29] ok cool
[14:10:59] everything seems good
[14:11:10] As there were no backups running I have also restarted misc_multiinstance in codfw
[14:11:13] And that looks good too
[14:11:45] thats great news. i have another two changes that will likely need your help https://gerrit.wikimedia.org/r/c/operations/puppet/+/968668/6 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/968668/6
[14:12:02] not sure if you want to do them now or leave it until next week to shake out any issues?
[14:12:30] *and https://gerrit.wikimedia.org/r/c/operations/puppet/+/968669/6
[14:13:01] Amir1 arnaudb what's up with the old s1 master? can we ack that alert
[14:13:03] ?
[14:13:07] let me check jbond
[14:13:27] thanks <3
[14:13:46] jbond: We can do this for the databases that we own: https://gerrit.wikimedia.org/r/c/operations/puppet/+/968668/6 (parsercache, sanitarium and db_inventory)
[14:13:54] let me review it
[14:14:26] jbond: by the way, the links are the same XD
[14:14:37] marostegui: https://gerrit.wikimedia.org/r/c/operations/puppet/+/968669/6 is the second one
[14:14:44] we can do this too https://gerrit.wikimedia.org/r/c/operations/puppet/+/968669/6
[14:14:47] let me review
[14:14:48] copy paste error :facepalm:
[14:14:58] yes that was the second one
[14:15:14] jbond: let's go for https://gerrit.wikimedia.org/r/c/operations/puppet/+/968668 now
[14:15:24] ack ill merge now
[14:16:43] marostegui: ok thats merged
[14:16:50] ok let me try
[14:19:31] marostegui: db2112 downtimed as it's currently depooled to undergo a schema update
[14:25:15] arnaudb: can you ack the alert in icinga?
[14:25:27] jbond: all good with that first change, including the production host :)
[14:25:33] Although let me try one more thing in production
[14:25:40] nice thanks, ack
[14:25:52] I thought downtiming was doing that 🤔 will do
[14:26:52] arnaudb: It is still showing up on criticals, maybe it was downtimed after it first alerted?
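For context, a minimal sketch of what the restart-and-verify step discussed above might look like on a host that has already been depooled and downtimed (e.g. db2135). The unit names and the plain mysql invocation are assumptions for illustration, not the exact WMF procedure:

```python
#!/usr/bin/env python3
"""Hedged sketch: restart a depooled misc MariaDB host and check that it
answers again. Unit names and the mysql call are assumptions."""
import subprocess

def run(*cmd: str) -> None:
    """Run a command, echoing it first, and fail loudly on error."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Single-instance misc host: restart the main mariadb unit.
run("systemctl", "restart", "mariadb")

# On a multiinstance host (misc_multiinstance) each section has its own
# unit, e.g. mariadb@m1 -- restart only the instance you intend to touch.
# run("systemctl", "restart", "mariadb@m1")

# Minimal sanity check that the server is answering queries again.
run("mysql", "-e", "SELECT 1")
```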
[14:27:07] mmm it is gone now
[14:27:11] (I acked)
[14:27:14] thanks
[14:28:44] jbond: all good, you will need btullis to test dbstore hosts, but from my side all good
[14:29:02] jbond: go for https://gerrit.wikimedia.org/r/c/operations/puppet/+/968669/6
[14:29:19] marostegui: ack ill merge the last one then ping ben
[14:29:45] sure
[14:30:26] marostegui: ok thats merged
[14:30:48] ok, testing
[14:30:52] ack thanks
[14:32:54] jbond: it looks good
[14:33:14] marostegui: awesome
[14:33:37] marostegui: i have one more for orchestrator https://gerrit.wikimedia.org/r/c/operations/puppet/+/972367/3 do you want to do that one now as well?
[14:33:39] jbond: you've got something else?
[14:33:40] yeah
[14:33:41] let me see
[14:33:53] thanks
[14:34:01] jbond: let's merge
[14:34:05] ack
[14:35:33] marostegui: thats merged now
[14:35:41] ok, let me restart and check orchestrator
[14:35:48] see you orchestrator, it was nice using you
[14:35:59] lol :D
[14:36:15] 👀
[14:36:58] btw i love the name of the orchestrator server, sounds like debauched
[14:37:11] all good jbond!
[14:37:18] haha :-)
[14:37:29] marostegui: awesome, thanks for all the help <3
[14:37:38] thanks for the smooth change!
[14:37:42] :)
[15:08:54] I've restarted mariadb@s7 on dbstore1003 - All good. I will restart the others later, when I get a chance.
[15:09:52] btullis: probably one is enough for the test
[15:10:04] btullis: probably test matomo as it is a special case though
[15:10:13] Oh yes, will do.
[15:11:50] Also on the wikireplicas we restarted clouddb1021 and it was fine. I'm still a little unsure of how best to drain and restart the other wikireplica servers clouddb10[13-20] without causing issues for connected clients, so I haven't done that yet.
[15:12:59] I know that there's a cookbook which automates some haproxy commands on dbproxy10[18-19] but I didn't feel confident using it yet and I haven't tried draining them manually either. I was hoping to ask you if you have a procedure that you use when applying schema updates to these wikireplica servers.
[15:14:06] btullis: did you start replication after the restart? it seems to be growing ATM
[15:14:12] (lag, I mean)
[15:15:02] "PROBLEM - MariaDB Replica Lag: s7 on dbstore1003 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.92 seconds"
[15:16:23] btullis: Which schema changes to wikireplicas? They go thru replication from production
[15:23:40] yeah, I saw one with s1 as well
[15:23:45] what's going on?
[15:24:53] Oh yes, sorry I forgot to start the replication threads.
[15:26:52] Would someone be so kind as to start the replication threads on dbstore1003 s7 for me please? I'm afk for a few minutes.
[15:27:47] Amir1: s1 is unrelated to me, I believe. I'm not aware of that.
[15:44:14] dr0ptp4kt: clouddb1021's replication has been down for like 24h already and the alert isn't even acked, so I don't know if WMCS or whoever owns this now is even aware of this
[15:44:23] btullis: I can do it
[15:44:57] btullis: done, Seconds_Behind_Master: 2484
[15:48:29] Thanks marostegui
[15:48:55] OK, so clouddb1021 replication lag was also my fault.
[15:49:17] Forgetting that we need to restart the replication threads.
[15:49:46] It has been on icinga for 24h, there's no page for those things?
[15:51:04] Thx marostegui, btullis. milimetric, are we okay on any jobs running against clouddb1021 in the window since replication went down?
[15:51:08] I didn't see the alert because it's not being routed to data-engineering (I don't think). The impact for clouddb1021 won't have been great.
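The lag on dbstore1003 above came from a mariadb restart leaving the replication threads stopped. A small sketch of the "remember to start replication after a restart" step follows; it assumes the local mysql client can reach the instance without extra flags (a multiinstance host would also need the section's socket, see the later sketch):

```python
#!/usr/bin/env python3
"""Hedged sketch: restart the replication threads after a mariadb
restart and watch the lag drain. Connection details are assumptions."""
import subprocess

def mysql(query: str) -> str:
    """Run a query through the mysql CLI and return its output."""
    result = subprocess.run(
        ["mysql", "--batch", "-e", query],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

# The replication (slave) threads stay stopped after a restart until told otherwise.
mysql("START SLAVE;")

# Then check that lag is draining; Seconds_Behind_Master should trend towards 0.
status = mysql("SHOW SLAVE STATUS\\G")
for line in status.splitlines():
    if "Seconds_Behind_Master" in line or "Slave_SQL_Running" in line:
        print(line.strip())
```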
[15:51:36] dr0ptp4kt: The data is available, but not updated
[15:51:40] dr0ptp4kt: No, I don't believe any jobs were running against this host during this window.
[15:51:45] btullis: That probably merits a task :)
[15:52:43] marostegui: Yes, I agree. There are quite a few alerting things that haven't kept up with ownership changes and team reorgs. This is definitely one of them.
[15:54:57] in not unrelated news, something i mentioned with btullis yesterday - i was hoping we could do a session to have btullis step us all through a maintain-views run after merge of some schema change, then schedule another maintain-views for another day on another schema change (the point being for wmcs to be able to run it, but also any of us in case of people unavailability). i was thinking btullis taavi andrewbogott Amir1
[15:56:03] and marostegui and me if that makes sense. so i'm looking to put something on the calendar for that at a mutually available time soon.
[16:00:07] I would definitely show up for that
[16:01:06] And sorry about missing the alert, our team is down a few folks right now (including me, intermittently)
[16:02:28] btullis: looks like wednesday, 15-Nov next week would be good - would that work for you? i was thinking with https://gerrit.wikimedia.org/r/c/operations/puppet/+/966213 having already been merged, it might be a good time. i'm assuming no one has run maintain-views lately.
[16:03:17] and then for a follow up session we could do https://gerrit.wikimedia.org/r/c/operations/puppet/+/958543 , which is yet to be merged (and we have a little time, so it's actually kind of a nice one, assuming no one's changed out the underlying table)
[16:06:03] dr0ptp4kt: I am not sure I will be attending that session, feel free to send me the invite though, but we'll see if I make it there
[16:11:43] dr0ptp4kt: +1 that no jobs were affected by this, we only sqoop monthly, so as long as the replag doesn't extend to the previous month, we're good
[16:17:01] I've restarted all of the replication threads for all instances on clouddb1021 and the lag is dropping on all of them.
[16:17:17] dr0ptp4kt: Yep, Wednesday next week is fine for me.
[16:17:36] phew! thx milimetric, thx btullis
[16:18:17] (re: replag being a non-issue, that is)
[16:19:05] dr0ptp4kt: Could you also include brouberol from our side in that meeting please?
[16:19:31] will do, thx
[16:19:37] (reading)
[16:20:33] andrewbogott: How about if I were to change the contactgroup for these alerts from wmcs to data-engineering - so that we would be alerted instead of your team? Good idea or bad idea?
[16:20:49] btullis: can we do both?
[16:21:06] Probably. I'll have a look now.
[16:21:13] I mean, I wouldn't mind being taken off entirely but I don't want to leave y'all stranded :)
[17:08:40] 973203: Update the contact info for the wikireplica servers | https://gerrit.wikimedia.org/r/c/operations/puppet/+/973203
[17:10:19] Apologies if I was a bit vague earlier by mentioning schema updates. The important bit of what I was trying to say was this:
[17:10:44] > I'm still a little unsure of how best to drain and restart the other wikireplica servers clouddb10[13-20] without causing issues for connected clients, so I haven't done that yet.
[17:12:32] So this is merged: https://gerrit.wikimedia.org/r/c/operations/puppet/+/961829 but puppet is currently disabled on clouddb10[13-20] and I don't know how to restart all of the mariadb instances without adversely affecting users. I was hoping that someone knows of a good way to do it.
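On a multiinstance host such as clouddb1021, the "start all the replication threads" step above means repeating the same command once per section. A sketch of that loop follows; the section list and the socket naming scheme are assumptions for illustration, so check the host's actual mariadb@ units and sockets before running anything like this:

```python
#!/usr/bin/env python3
"""Hedged sketch: restart replication on every instance of a
multiinstance host. Sections and socket paths are placeholders."""
import subprocess

# Hypothetical list of sections hosted on this clouddb instance.
SECTIONS = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]

for section in SECTIONS:
    socket = f"/run/mysqld/mysqld.{section}.sock"  # assumed naming scheme
    print(f"== {section} ==")
    # Start replication on this instance and show how far behind it is.
    subprocess.run(
        ["mysql", "--socket", socket, "-e",
         "START SLAVE; SHOW SLAVE STATUS\\G"],
        check=True,
    )
```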
[17:19:15] btullis: The way I do it is always depooling from dbproxy1018 and dbproxy1019
[17:19:41] And for what it's worth, we don't depool for schema changes as they come through replication and affect all the replicas at the same time
[17:19:52] So they all get lagged, if the schema change is big
[17:22:33] OK, thanks. Do you always do the depooling from dbproxy servers with a sequence of puppet changes, or do you run commands, or a cookbook?
[17:24:58] btullis: I used to do it via puppet changes
[17:25:14] Just to be consistent with what we have in puppet
[17:25:31] And also because my changes are usually somewhat long, not like a 1 minute depool or something quick
[17:25:45] So someone could get confused as to why a host looks pooled in puppet but depooled in reality
[17:29:11] Yes, I see. OK thanks, that makes sense.
[17:44:41] jbond: I'm seeing various puppet alerts on clouddb hosts, are you still working on your patch rollout? (I'm happy to just ignore the warnings if so)
[17:45:16] andrewbogott: no, btullis handled that for me i believe this morning
[17:45:41] e.g. 1019 says 'deploy mysql change gerrit:961829'
[17:45:51] I mean, puppet is disabled
[18:00:20] Yes, I will reenable puppet on clouddb10[13-20] tomorrow and drain/restart/repool them each in sequence.
[18:00:36] ok! I'll happily ignore in the meantime
[18:01:15] hm, wonder why I'm getting a 'constant change' alert rather than a 'puppet disabled' alert...
[18:07:39] btullis: it is probably a good idea to upgrade their kernels too and reboot: https://phabricator.wikimedia.org/T344590
[18:07:52] And same for https://phabricator.wikimedia.org/T344591
[18:23:20] marostegui: ack, thanks. Will do.
[18:24:10] andrewbogott: I also saw one of those today, for something else.
[18:24:49] it was raised for cloudcumin1001 but that was legit
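The depooling described above is done with puppet changes against dbproxy1018/1019; the manual "drain" alternative the earlier messages allude to (and which the cookbook presumably wraps) would be issuing disable/enable commands to haproxy's admin socket on the dbproxy host. A sketch follows; the socket path and the backend/server names are placeholders, the real values come from the haproxy config on the dbproxy hosts:

```python
#!/usr/bin/env python3
"""Hedged sketch: drain one wikireplica backend via haproxy's runtime
admin socket on a dbproxy host. Paths and names are placeholders."""
import socket

HAPROXY_SOCKET = "/run/haproxy/haproxy.sock"   # assumed admin socket path

def haproxy_cmd(command: str) -> str:
    """Send one command to the haproxy admin socket and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(HAPROXY_SOCKET)
        sock.sendall(command.encode() + b"\n")
        return sock.recv(65536).decode()

# Stop sending new connections to one replica (placeholder names), let
# existing clients finish, restart mariadb on it, then re-enable it.
print(haproxy_cmd("disable server mariadb-s1/clouddb1013"))
# ... restart the depooled instance, then:
print(haproxy_cmd("enable server mariadb-s1/clouddb1013"))
```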