[07:58:47] kormat Amir1 can someone support me on https://phabricator.wikimedia.org/T301219 next Tuesday at 08:00 AM UTC?
[08:18:25] I am going to start taking a logical dump of zarcillo and shut down the tendril DB to create a backup from it
[10:03:57] marostegui: i can
[10:04:04] with the usual caveat
[10:04:06] thanks!
[10:12:58] Amir1: as our schema czar, do we have any tables or columns that don't fit into `[a-zA-Z0-9_]`?
[10:13:36] not that I know of, but there are some that are mysql keywords (e.g. user)
[10:16:26] Amir1: does that require special handling?
[10:17:02] kormat: proper quoting in queries
[10:17:08] otherwise it goes kaboom
[10:17:12] so, backticks?
[10:17:22] yup `
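A minimal sketch of the quoting discussed above (illustration only, not from the log): the `user` table is the MediaWiki one mentioned at 10:13, and the user_id/user_name columns are assumed from the standard MediaWiki schema.

    -- Per the discussion above, identifiers that collide with MySQL/MariaDB
    -- keywords can break queries; quoting them with backticks is always safe.
    SELECT user_id, user_name
    FROM `user`
    WHERE user_name = 'Example';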
[10:28:53] Tendril next steps: https://phabricator.wikimedia.org/T297605#7692528
[10:30:35] marostegui: what is the plan for shutting down the VM? dbmonitor1002.wikimedia.org
[10:31:34] Amir1: if it is not used for anything else apart from tendril, we can probably stop its webserver now, and on thursday decommission it
[10:32:08] the role is only tendril
[10:32:12] but let me double check
[10:33:57] hmm, it has another role: debmonitor::server
[10:34:18] Amir1: dbmonitor != debmonitor ;)
[10:34:20] we have both
[10:34:35] 🤦‍♂️
[10:34:52] it's to check if you're paying attention :D
[10:34:53] volans: did you put them next to each other in site.pp to confuse people?
[10:35:08] I guess alphabetical order did that
[10:35:28] yeah, blame it on the alphabet
[10:35:51] * Amir1 drinks the rest of his coffee
[10:36:06] always blaming Alphabet Inc.
[10:36:20] I think moritz can get rid of the php 5.6 package for bust, I'll ask him
[10:36:24] *buster
[10:44:14] wait, db1115 already has mysql shut down..?
[10:45:25] marostegui: the monitoring jobs that scrape zarcillo are broken
[10:46:02] they're all alerting on icinga
[10:50:29] yes, that's expected
[10:50:34] cause db1115's DB is down
[10:52:12] i don't understand this approach
[10:52:50] we have an es5 primary switchover tomorrow, db-switchover will fail to update zarcillo
[10:53:03] and monitoring will fail to update, too
[10:53:29] why do we need to leave db1115 down?
[10:53:41] I didn't think of that switchover
[10:53:49] why was it scheduled then?
[10:54:13] db1115 was meant to be down to see if something unexpected would still use it
[10:54:27] can't we just drop the grants that aren't needed any more?
[10:54:41] it's a pretty central service
[10:54:43] feel free to do so. the plan was on the task
[10:54:55] I can't talk now, I'm dealing with some family issues at the moment
[10:55:02] ah crap, ack
[10:55:29] feel free to take over my assigned parts
[11:00:01] ok, will do.
[11:00:50] I can postpone the switchover for later
[11:03:38] i'll look at changing puppet so the monitoring stuff scrapes db2093 instead, so at least that won't be angry
[11:08:48] Amir1: does auto-schema depend on zarcillo at all?
[11:09:03] kormat: omg does
[11:09:14] (the grant inventory code)
[11:09:25] let me check auto_schema
[11:10:00] auto_schema doesn't,
[11:10:24] for getting the list of replicas, it hits the master and gets the replicas
[11:10:50] I'm not planning to run omg.py atm
[11:11:10] ok cool.
[11:11:30] WTB proxysql in front of db_inventory
[11:14:54] if we are confident that we don't need tendril anymore we can simply start the server and drop the tendril db
[11:15:06] and I will reimage it tomorrow after the switch
[11:18:09] the reason for that is that tendril is on toku, and 10.4 doesn't have it. so I want to drop any Toku crap before going for 10.4
[11:18:37] i vote for moving the es5 switch. it's an extra complication we don't need
[11:18:50] otherwise we need to do another zarcillo backup _after_ the switch
[11:19:16] zarcillo won't change
[11:19:21] BTW, zarcillo (not tendril) backups happen every week
[11:19:22] I mean it won't get deleted or anything
[11:19:30] so if we're not doing the primary switch, and my cr makes monitoring happy in the meantime, i think we're ok to go with the original plan from the task
[11:19:44] tendril needs to be removed either with binlogs disabled or with IF EXISTS
[11:19:50] otherwise db2093 will break
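A sketch of the two options mentioned at 11:19, not the exact commands that were run; sql_log_bin only affects the current session and typically needs SUPER.

    -- Option 1: keep the DROP out of the binlog so the replica never sees it.
    SET SESSION sql_log_bin = 0;
    DROP DATABASE tendril;
    SET SESSION sql_log_bin = 1;

    -- Option 2: let it replicate, but tolerate the database being absent on
    -- the replica, so replication on db2093 does not break.
    DROP DATABASE IF EXISTS tendril;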
[11:20:03] marostegui: the data in it will change, from db-switchover. we could do this by hand, but again, i'd rather keep it simple
[11:20:06] +1 to move the switch
[11:20:07] ack
[11:20:49] kormat: yeah, I meant that the reimage once tendril is dropped won't delete zarcillo or anything
[11:21:09] marostegui: ah i see. i thought the plan was to restore the data on db1115 from backups in any case
[11:21:14] I also have an interview tomorrow right before the switch so I might be late
[11:21:28] ^same
[11:21:30] kormat: yeah, but we can avoid that if we simply drop tendril
[11:21:35] Amir1: tl;dr, please do reschedule the switch
[11:22:13] but we need to drop it before we reimage it to avoid 10.1 toku stuff before attempting to bring up 10.4
[11:22:48] ... why is there an 'officewiki' db on db1115?
[11:23:13] kormat: welcome to my world
[11:23:22] I will reschedule
[11:23:27] Amir1: <3
[11:26:13] Amir1: can i get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/760911, please?
[11:26:41] done
[11:26:46] (sorry I'm cooking)
[11:26:55] Amir1: you need aircon
[11:28:38] merged. testing on prometheus1004 (p*1003 has puppet disabled..)
[11:31:29] db1115: probably just old dusty crap from years ago, it is not used
[11:39:11] prom1004 successfully connected to db2093.
[11:39:20] heading for lunch now, ping me if something crazy happens
[11:49:23] kormat: it's cold here :P
[12:09:01] marostegui: sobanski kormat it seems we have 97 db hosts on bullseye now \o/ https://debmonitor.wikimedia.org/packages/wmf-mariadb104
[12:09:23] \o/ indeed
[12:31:35] So nothing to be done on db1115 anymore? We can wait till thursday to continue with the original plan now that kormat's patch was merged?
[12:31:55] (I am back to work mode now)
[12:33:41] My only question would be if there's any risk of tools misbehaving in an incident response scenario until db1115 is back online?
[12:36:31] I don't think so, cause dbctl doesn't use zarcillo for anything and db-switchover won't work with a dead master
[12:36:49] If it makes us feel more confident, I can bring up db1115 and drop tendril now
[12:36:55] so mysql remains alive
[12:38:38] My only reason to keep mysql stopped for 48h is to make sure nothing unexpected would break with tendril fully gone
[12:41:20] [sorry for the interjection] do we have an nc listening on 3306 (or tendril's port) to see if anything is trying to connect to it?
[12:45:00] I could set one up
[12:56:24] marostegui: based on the above I'd say we're ok to leave things as-is and wait, I think we're all aware of what the situation is and there doesn't seem to be any obvious risk.
[12:56:45] ok
[13:00:13] marostegui: if you're picking up another section for the schema change, I'm doing a schema change on s4 and a bullseye upgrade on s2
[13:00:29] Amir1: no worries, I am moving to s1
[13:01:50] I will write the bot that parses SAL and creates the maint map soon
[13:09:27] agreed re: keeping things as-is
[13:44:58] I am going to drop the tendril database from db2093 (which doesn't even have all the tables)
[13:45:47] 👍
[13:46:03] and db1115 doesn't replicate tendril downstream anyway
[13:46:11] binlog-do-db = zarcillo
[13:46:11] binlog-do-db = heartbeat
[13:46:29] there are so many moving parts involved in shutting down tendril that it is pretty scary
[13:46:49] haha
[13:46:49] mysql:root@localhost [(none)]> drop database tendril;
[13:46:50] ERROR 1030 (HY000): Got error 1 "Operation not permitted" from storage engine partition
[13:46:51] jesus
[13:47:57] wow
[13:48:49] ERROR 1033 (HY000): Incorrect information in file: './tendril/client_statistics_log.frm'
[13:48:51] You are not root enough
[13:48:53] this is going to be create
[13:48:57] great
[13:49:20] I am sure this is all because of references to tokudb and such on the tablespace
[13:49:26] which is of course not supported on 10.4
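One hedged way to check that suspicion before the 10.4 reimage (an assumption, not something run in the log; it only relies on the standard information_schema views).

    -- List tendril tables that still use TokuDB, plus partitioned ones,
    -- since both the engine and the partition handler showed up in the
    -- DROP DATABASE errors above.
    SELECT table_name, engine, create_options
    FROM information_schema.tables
    WHERE table_schema = 'tendril'
      AND (engine = 'TokuDB' OR create_options LIKE '%partitioned%');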
[13:51:13] If I can drop it from db1115 and leave that one fine, I will reclone db2093 from db1115
[13:51:30] If cleaning up db1115 is impossible I guess I will nuke all the data and recover zarcillo from a backup
[13:52:12] in case it helps: "Last dump for zarcillo at codfw (db2093.codfw.wmnet) taken on 2022-02-08 03:19:32 (0 GB)"
[13:52:19] 0GB?
[13:52:22] "Last dump for zarcillo at eqiad (db1115.eqiad.wmnet) taken on 2022-02-08 03:09:54 (0 GB)"
[13:52:28] yeah, it is very small
[13:52:52] I took a mysqldump today and it took like 5 seconds, so I think before dropping it on thursday I will run a manual one too
[13:53:41] 400KB after compression
[13:56:55] don't worry, backups lower than 300KB are always considered failed as a safeguard (in addition to the percentage-size check against the last one)
[15:21:56] just depooled the old codfw swift frontends; keeping an eye out for 🔥
[16:27:11] I've stopped and disabled swiftrepl-mw.timer on ms-fe2005; I'm also going to remove the .timer file and the swiftrepl.conf (with the credentials in), which really should stop any unexpected restarts on that host
[16:27:42] 🤞
[16:28:53] [having triple-checked the permissions, ownership, and checksum of the swiftrepl.conf on ms-fe2009]
[16:35:04] Assuming no 🔥 overnight, I'll aim to get them out of the LVS config and on their way to decommissioning tomorrow (by which point we should be able to verify that the swiftrepl timer has fired on ms-fe2009, though it should do nothing since eqiad is primary)
[16:36:50] (not just because Papaul nagged yesterday about old kit, you understand :) )
[16:56:05] PROBLEM - MariaDB sustained replica lag on s4 on db2110 is CRITICAL: 4.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2110&var-port=9104
[16:58:11] RECOVERY - MariaDB sustained replica lag on s4 on db2110 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2110&var-port=9104
[19:58:06] PROBLEM - MariaDB sustained replica lag on s4 on db1149 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1149&var-port=9104
[19:59:14] RECOVERY - MariaDB sustained replica lag on s4 on db1149 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1149&var-port=9104
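For lag alerts like the two above, a rough manual cross-check one might run directly on the affected replica (db2110 / db1149); this is an illustration and not how the alert itself is computed, which comes from the monitoring pipeline.

    -- Seconds_Behind_Master in the output should fall back under the
    -- critical threshold of 2 seconds shown in the alert.
    SHOW SLAVE STATUS\G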