[11:35:39] federico3: https://phabricator.wikimedia.org/T401906#11110300
[11:37:09] Amir1: it's different schema changes
[11:37:51] all three have the same ticket attached
[11:37:56] https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance
[11:38:33] and they get depooled and nothing is happening; that schema change shouldn't take three hours, and on top of that none of them has replication stopped, so the schema change is not running
[11:42:01] https://www.irccloud.com/pastebin/aK9muhZQ/
[11:42:33] the check is broken
[11:42:53] please stop the scripts, repool them, fix the check
[11:43:03] please add the progress checklist to the ticket too
[11:43:04] yes, I'll repool
[11:45:53] I don't think checks should be very sophisticated. Checking one alter/aspect is enough and reduces the chance of mistakes like this. We have the drift tracker to catch issues
[11:46:45] it seems the script should stop immediately if a check fails
[11:47:11] it does
[11:47:53] and afaict it only supports a bool return value, but what should we return if there's an unexpected state, e.g. an incomplete change?
[11:48:52] an incomplete change is so rare that we've never had issues like that, so I don't want to focus on fixing a non-existent problem
[11:49:01] the issue has always been a badly written check
[11:49:48] and as I said, we have another layer of defence with the drift tracker, which periodically compares the production schema with the one in code
[11:51:25] ok
[11:53:20] so I just check for one column here https://gitlab.wikimedia.org/repos/sre/schema-changes/-/merge_requests/45/diffs#bce6ccd8fabbfa20d4d57c3fceeaaa22da1254c4_0_27
[11:54:17] yeah, that looks good to me
[11:59:47] https://www.irccloud.com/pastebin/pJnkPRit/
[12:00:32] (I should have logged with repr() for clarity...)
[12:06:15] I would force a print() and run it with --check on one host to see how that works
[12:07:11] the more detailed version is that it comes from pymysql reading the information_schema table. How it's set up, how it responds, how it gets decoded: I need to read the tables
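For reference on the discussion above: a single-column check against information_schema can stay very small. Below is a minimal sketch, assuming the check is handed a pymysql connection and only needs to return a bool; the function name, signature, and parameters are illustrative and not the actual auto_schema/schema-changes check API.

```python
# Minimal sketch of a one-column existence check, assuming a pymysql
# connection; names and signature are illustrative, not the real
# auto_schema/schema-changes check API.
import pymysql


def column_exists(conn: pymysql.connections.Connection,
                  schema: str, table: str, column: str) -> bool:
    """Return True if the column is visible in information_schema."""
    query = (
        "SELECT 1 FROM information_schema.columns "
        "WHERE table_schema = %s AND table_name = %s AND column_name = %s"
    )
    with conn.cursor() as cursor:
        cursor.execute(query, (schema, table, column))
        # fetchone() returns a row when the column exists, None otherwise,
        # so the check collapses to a plain bool.
        return cursor.fetchone() is not None
```

Logging the raw fetched value with repr() before collapsing it to a bool, as suggested above, would also make it easier to see how pymysql decodes the information_schema response when a check misbehaves.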
[12:20:18] Amir1: I opened https://gitlab.wikimedia.org/repos/sre/schema-changes/-/merge_requests/46 - I'm a bit unsure why you are doing str(default) in https://wikitech.wikimedia.org/wiki/Auto_schema/examples
[12:40:12] Thanks. My internet is really bad right now
[12:51:52] federico3: please repool db2151 too.
[12:51:57] ok
[13:38:26] Amir1: what's the error on db1154? https://alerts.wikimedia.org/?q=%40cluster%3Dwikimedia.org&q=instance%3D~%5E(db%7Cpc%7Ces%7Cms%7Can-redacteddb%7Cclouddb)%5B12%5D.*
[13:38:33] summary: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error executing row event: 'Table 'azwikimedia.loginnotify_seen_net' doesn't exist'
[13:40:19] I'll handle it
[13:43:53] the tables have been dropped but the sanitarium hasn't been restarted to pick up the replication filter
[13:44:27] there's the sanitarium restart cookbook
[13:44:59] yup, running it
[13:45:25] not sure if it fixes the issue or if I have to get replication going, then run it again and drop the tables again
[13:47:03] it fixed the issue and the sanitarium, but the cloud replicas are still broken. I need to start them by hand
[13:47:12] PROBLEM - MariaDB sustained replica lag on s5 on db1154 is CRITICAL: 20 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13315
[13:48:47] fixed
[13:49:12] RECOVERY - MariaDB sustained replica lag on s5 on db1154 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13315
[16:30:17] Team, I have added a few things to the agenda for Monday's meeting, especially to go over the planned work for this quarter, the work in the MariaDB space, and, given the absences, what can wait versus what needs additional help. Please spend some time reviewing the roadmap and listing in-progress/upcoming items. I have also added a section to list and chat about the issues/incidents that had us putting out fires in the previous week. This is not for blaming anyone; rather, it is for all of us to understand and learn, so we can spend time discussing how to avoid these kinds of situations in the future.
[16:30:54] https://docs.google.com/document/u/0/d/17xEIOzV22-cvAXYbHe6F6XDJhk-qCh1g3994ZAL3dh8/edit?usp=meetingnotes&showmeetingnotespromo=true
[16:53:15] I'm signing off for the day; let's not run anything, so we have a quiet weekend. I stopped mine.
[16:54:14] Also, if we have time on Monday, I would love to learn the norms/processes we have in place in DP for releases and running new scripts, given that we deal with critical databases and infrastructure. I will add it to the agenda