[01:09:01] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 13.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:09:15] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 10.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:10:37] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:12:29] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[08:01:09] Amir1: On s1 that table, unless it is read A LOT I am planning to run it with replication too
[08:01:34] nah, it's only for new pages patrolling
[08:01:39] ah cool
[08:01:58] Also: https://phabricator.wikimedia.org/T334536#8774686
[08:01:58] it is good to go, right?
[08:01:59] The code is merged
[08:02:05] So I assume it is
[08:02:25] yup, I gave my +1 on the schema last night and got merged immediately
[08:02:40] right
[08:02:42] thanks!
[08:03:48] <3
[08:04:17] marostegui: I'm running a schema change on s4, you're productionizing a couple of hosts there, the new hosts might miss the schema change, let me know once you're done
[08:04:48] Amir1: Will do - only db1121 is affected
[08:04:56] Or "could be" affected
[08:05:01] Which is the sanitarium master
[08:05:08] noted
[08:05:37] Amir1: Where is https://phabricator.wikimedia.org/T334455#8774718? You take care of that right?
[08:05:58] switchmaster.toolforge.org/
[08:06:00] yup
[08:06:03] Yeah, I mean the repo :)
[08:06:14] But I'll ping you once I am confident to do the switch anyways
[08:06:29] damn it, I thought I added the link to the repo at the bottom
[08:06:38] I'll fix it, it should be somewhere in gitlab
[08:06:48] no worries!
[08:06:51] ping me once you're done
[08:53:52] Amir1: db1121 is back up and db1221 is the new host, which is up too
[08:54:23] s4?
[08:54:28] yes, both
[08:54:36] I am not going to switchover the replicas yet
[08:54:45] So you can proceed anytime with the schema change of both of them
[08:54:48] Just let me know when done
[08:54:53] okay, thanks. I will let the script finish and then run a check everywhere
[08:55:00] I want to make sure db1221 is stable first for a few days before going to take the role of sanitarium master
[08:55:03] sure, no rush
[08:55:12] cool
[08:55:36] is it the one using the new RAID controller? or all of them are using the new one
[08:55:49] all the new ones use the new controller
[08:55:54] the old one isn't shipped anymore
[08:56:08] sad
[08:56:16] All the tests with db1206 were ok
[08:56:20] So we went ahead and bought the rest
[08:56:44] ah I see, next step, ARM
[08:56:47] We also had to stop using megacli, as the new controller uses perccli
[08:56:51] * marostegui runs
[08:59:33] s4 is almost done already
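The "sustained replica lag" alerts at the top of this log and the repool question just below both come down to the same number: how many seconds a replica is behind its primary. A minimal sketch of that check, assuming pymysql and placeholder credentials; the 2-second threshold is taken from the alerts above, but this is not the actual Icinga or depooling tooling.

```python
# Sketch only: check a replica's lag before repooling it. Host, credentials and
# the surrounding tooling are placeholders; the threshold mirrors the alerts above.
import pymysql

LAG_THRESHOLD = 2  # seconds; the CRITICAL threshold in the alerts at the top of the log


def replica_lag(host: str, port: int = 3306):
    """Return Seconds_Behind_Master for the given replica, or None if unknown."""
    conn = pymysql.connect(host=host, port=port, user="repl_check",
                           password="secret",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            if not status:
                return None  # host is not configured as a replica
            # NULL (None) here means replication is broken or not running.
            return status["Seconds_Behind_Master"]
    finally:
        conn.close()


if __name__ == "__main__":
    lag = replica_lag("db1121.eqiad.wmnet")  # hostname used for illustration
    if lag is not None and lag < LAG_THRESHOLD:
        print(f"lag={lag}s, safe to repool")
    else:
        print(f"lag={lag}, keep it depooled for now")
```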
[09:19:00] Amir1: Can I repool db1121?
[09:19:07] db1221 will stay out today
[09:19:47] give me a min, I was hoping it would finish but it actually started codfw, let me run a check quickly and tell you
[09:20:06] sure, no rush
[09:20:48] Result: {"already done in all dbs": ["db1121", "db1138", "db1141", "db1142", "db1143", "db1144:3314", "db1145:3314", "db1146:3314", "db1147", "db1148", "db1149", "db1150:3314", "db1190", "db1199", "db1221", "dbstore1007:3314"]}
[09:20:53] it's done already
[09:21:11] Can you check db1221 too?
[09:21:14] I assume it is done too
[09:21:17] But double check just in case
[09:22:56] it did that as well (last one before dbstore)
[09:23:12] oh I see
[09:23:17] I stopped reading after db1121 XD
[09:23:22] I didn't check codfw though, let me know if there is anything happening there
[09:23:26] nope
[09:23:29] cool
[09:23:30] codfw is not being touched
[09:23:38] thank goodness
[09:23:45] Reminder that thursday NEXT WEEK is the last day for maintenance
[09:24:13] and you shouldn't do any maintenance this week since I'm oncall
[09:25:29] I was planning to migrate everything to mysql 8 today
[09:25:45] xD
[09:53:48] Amir1: does switchmaster generate any writes into zarcillo? Probably not I guess?
[09:54:10] yeah, it's in toolforge, it shouldn't have any access in the first place
[09:54:26] right, can you go ahead and replace db1115 with db1215 there then?
[09:54:36] sure
[09:54:43] let me know when done :)
[10:00:33] Amir1: you aren't touching s3, are you?
[10:00:48] not in eqiad
[10:00:53] cool
[10:24:40] marostegui: does this look good? https://gitlab.wikimedia.org/toolforge-repos/switchmaster/-/commit/2060295d8ea90ea5fbdbde9b88fd8818e53592a7
[10:25:35] checking
[10:25:46] it does
[10:25:53] awesome deploying
[10:26:36] k, let me know so I can test it
[10:26:36] done
[10:26:49] let's see
[10:27:12] got a 500
[10:27:20] but the task was created
[10:27:24] https://phabricator.wikimedia.org/T334564
[10:27:32] and the patch is there too
[10:28:13] ah, I think some of the fixes I made have been deleted accidentally
[10:28:17] let me fix those
[10:28:34] ok, apart from the 500 it looks good, the task and patch
[10:29:27] I am not going to test for codfw, as the query is essentially the same
[10:29:38] yeah
[10:29:41] awesome
[10:30:05] one fix that is now gone is the extra \ in \G
[10:30:35] cool thanks
[10:38:22] sigh, marostegui it thinks eqiad is the primary dc, it tried to make the dns patch and that broke it but it also made some mistakes here and there, I think I fixed all of them
[10:38:42] Amir1: but is that a consequence of changing zarcillo?
[10:38:47] I can't see how
[10:38:47] I pushed the outdated version to git and the pulled it in toolforge destroying all of these fixes
[10:38:52] Ah right XDDDDD
[10:39:02] no, no, it was the mistake I made when trying to put it in git
[10:39:15] ok let me migrate to mysql 8
[10:39:17] that might fix it
[10:39:46] :D if you maintain it, sure :P
[10:39:54] XD
[12:18:18] * Emperor resists temptation to edit the transcript
[12:22:57] everything green in the db backups dashboard again!
[12:23:14] \o/
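The switchmaster test above filed a Phabricator task (T334564) and a patch, with a 500 on the web response. For readers unfamiliar with that flow, here is a minimal sketch of creating a task through Phabricator's Conduit API (maniphest.edit) with the requests library; the API token, task title, and project PHID are placeholders, and this is not switchmaster's actual code.

```python
# Sketch of filing a Maniphest task via Phabricator's Conduit API
# (maniphest.edit), roughly what a tool like switchmaster does for a proposed
# switchover. The API token, title and project PHID below are placeholders.
import requests

PHAB_URL = "https://phabricator.wikimedia.org/api/maniphest.edit"
API_TOKEN = "api-xxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # placeholder token


def create_task(title: str, description: str, project_phid: str) -> str:
    """Create a task and return its object PHID."""
    resp = requests.post(PHAB_URL, data={
        "api.token": API_TOKEN,
        # Conduit accepts PHP-style bracketed form parameters for lists/maps.
        "transactions[0][type]": "title",
        "transactions[0][value]": title,
        "transactions[1][type]": "description",
        "transactions[1][value]": description,
        "transactions[2][type]": "projects.add",
        "transactions[2][value][0]": project_phid,
    }, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    if payload.get("error_code"):
        raise RuntimeError(payload["error_info"])
    return payload["result"]["object"]["phid"]


# Example call with placeholder values:
# create_task("Switchover s4 master", "task body here", "PHID-PROJ-xxxx")
```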
[16:06:56] urandom re https://phabricator.wikimedia.org/T330693#8755060, any updates on talking with kofori and mvernon?
[16:07:14] I see there is an sre capex office hours today, wondering if we should go and discuss long term hardware stuff?
[16:09:29] ottomata: we did talk to kwakuofori, next steps were to have a chat with serviceops re: pvc (TTBMK that hasn't happened yet)
[16:09:42] ottomata: ..but what long term hardware stuff?
[16:12:27] urandom: next steps here:
[16:12:31] https://phabricator.wikimedia.org/T330693#8755060
[16:12:38] > Data Persistence to follow up about long term plan. @Eevans and Amir will talk with @KOfori and @MatthewVernon.
[16:12:41] so asking about that i guess
[16:12:56] i'm loooking into PVs now, trying to find the docs gabriele found about not being about to use them with zookeeper...i think we can
[16:13:04] if we can then i'll follow up with serviceops about them
[16:14:18] long term plan == how to offer a menu of storages for commonly seen use cases at wmf & more specifically: object storage needs over the next FY
[16:14:34] right, k.wakuofori was planning to set something up to chat with serviceops about their feelings on this, and what the time line would be
[16:16:04] k, cool. is there anything i can/should do about that now, given that there are some upcoming deadlines for capex requests?
[16:17:10] oooph, I know we only get a shot at this once a year, but it seems really premature to be requesting hardware for anything
[16:19:55] we can see if kwakuofori wants to chime in here, but without knowing for certain *what* we're going to do —temporarily/permanently, and on what scale— any request would be really speculative, no?
[16:22:09] Hm, I suppose, just seems like somethign that happens often: there are needs for storage, folks ask for them, DP (or others say) we don't have that now, we can't do it, so folks go away and don't ask anymore. but. anyway. that's a bigger convo
[16:22:09] okay
[16:22:10] so
[16:22:32] let's talk short term then
[16:22:45] > For short term solution, we will keep corresponding on ticket here. @Ottomata + @gmodena to answer questions in this ticket.
[16:23:08] what can/should we be doing for short term?
[16:23:18] are there more questions we can answer?
[16:23:43] I think the thing we're waiting on is serviceops feedback
[16:23:52] hm, for short term?
[16:24:06] otherwise we've pretty much been painted into a Swift corner, haven't we?
[16:25:24] yeah, short term: talk to serviceops to find out if a) they're on board with pvc, and b) if yes, is this something we could do in the short term
[16:25:42] this == pvc?
[16:25:47] yes
[16:25:49] okay
[16:25:53] will do and get back to you then
[16:26:22] check with k.wakuofori though, he was going to set something up with them
[16:26:32] (them == serviceops)
[16:26:50] okay. FWIW, here is a list of use cases: https://www.mediawiki.org/wiki/Platform_Engineering_Team/Event_Platform_Value_Stream/Event_Driven_Use_Cases
[16:42:18] urandom: looks like PVs are not an option for us for now: https://phabricator.wikimedia.org/T330693#8776772
[20:00:08] sorry for the late response. having the long weekend and the alex and giuseppe being out, the conversation hasn't happened. sorry for the delay...
[20:00:26] s/the//
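For context on "pvc" in the discussion above: a PersistentVolumeClaim is how a Kubernetes workload such as ZooKeeper asks the cluster for durable storage. Below is a minimal sketch using the official Kubernetes Python client; the namespace, claim name, storage class, and size are invented for illustration, and per T330693#8776772 this route was ruled out for the cluster in question, so it is illustrative only.

```python
# Illustrative only: requesting a PersistentVolumeClaim with the official
# Kubernetes Python client. Namespace, name, storage class and size are made up;
# per T330693#8776772, PVs were not an option on the cluster discussed above.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in a pod
core = client.CoreV1Api()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="zookeeper-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="standard",  # hypothetical storage class
        # Newer client releases use V1VolumeResourceRequirements here instead.
        resources=client.V1ResourceRequirements(requests={"storage": "20Gi"}),
    ),
)

core.create_namespaced_persistent_volume_claim(namespace="zookeeper", body=pvc)
print("PVC created; the provisioner binds a matching PersistentVolume to it")
```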