[05:16:40] I'm out today but can someone investigate this
[05:16:45] 06:05:34 (SystemdUnitFailed) firing: (2) check-private-data.service Failed on db1155:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:17:24] That's a sanitarium host
[05:18:06] It's failing in all cloud hosts too
[05:18:34] So it might be something related to the python or certificates or something
[05:18:52] Please check it cause it's important
[08:26:43] will do
[08:29:28] on db1155, from what I can see: `pymysql.err.OperationalError: (1054, "Unknown column 'user_options' in 'where clause'")`
[08:34:25] That's probably something related to amir's schema change I'd say and maybe it wasn't included in the filtered columns
[09:20:44] i'll silence those alerts until monday
[09:23:19] Ideally we should really try to get it fixed asap
[09:23:37] That's the tool we use to ensure no private data is being leaked to wiki replicas
[09:24:53] It's probably just a matter of a puppet patch being missed
[09:41:08] my guess would be on the private column list
[10:30:43] I've added https://wikitech.wikimedia.org/wiki/Swift/How_To#Find_an_object_in_swift since I thought it might be useful knowledge to have next time :)
[10:33:25] nice- I would also add a link to query-media-file, as it has a nice step-by-step query system, but for backups (but it is not a substitute)
[10:34:37] what you wrote could also be a starting point for a recovery system
[10:36:08] Mmm, I don't think I want to start documenting the backups stuff under Swift/ :)
[10:36:44] No, I meant a starting point for future work
[10:37:35] I only suggested you add "See also: "
[10:37:45] as you yourself used for a past ticket, IIRC
[11:08:32] have added https://wikitech.wikimedia.org/wiki/Swift/How_To#Find_an_object_in_the_backups_database
[11:13:41] My schema changes can't really cause that issue
[11:13:46] I will dig
[13:17:32] oh I think I know what's going on
[13:25:45] marostegui: arnaudb: I know what's causing this and it's the pymysql upgrade. The column_has_private_data function does a query and catches `except pymysql.err.InternalError:` which is literally commented `# Ignore "field doesn't exist" errors` and now with the upgrade pymysql is throwing another error object which the script is not catching
[13:26:13] 😬
[13:26:27] two things we should do: update check_private_data.py to catch the new error type, and probably clean up the old dropped columns from the list maybe
[13:26:41] not sure about the latter
[13:26:52] user_options was dropped from user ages ago
[13:26:53] Amir1: maybe a rollback on those hosts could help us get through the weekend?
[13:27:02] nah, the fix should be easy
[13:27:24] I'll make a puppet patch
[13:28:53] my only worry is that there might be other things that get broken by the pymysql upgrade, I'll run the script
[13:29:17] what's the risk with the second solution you suggested Amir1? old databases with that col could still exist?
[13:29:31] yeah
[13:29:43] let's avoid the risk then!
[13:30:20] what version did we upgrade to?
[13:30:27] for commit message :D
[13:31:43] 0.11.2
[13:31:51] wmfmariadbpy==0.11.2
[13:32:05] 0.1.4 for wmfdb
[13:32:24] I found it for pymysql: 1.1.0
[13:32:34] oh that one
[13:32:42] https://gerrit.wikimedia.org/r/c/operations/puppet/+/993088
[13:32:51] 1.0.2 for deb11 hosts
[13:33:20] ah ok
[13:33:53] added to the validation
[13:43:19] arnaudb: are all sanitarium hosts using 1.0.2? if some are on the old one, this is going to break them
[13:44:09] iirc moritzm deployed on all s6 hosts + cumin
[13:44:45] so there might be a split, moritzm do you think it's feasible to upgrade all sanitarium hosts to the latest pymysql/wmfdb/mariadbpy packages?
[13:44:55] that'd break it on other hosts like db1154
[13:45:46] https://debmonitor.wikimedia.org/hosts/db1154.eqiad.wmnet
[13:45:50] yup, it has the old version
[13:46:43] only these hosts were updated (apart from cumin*):
[13:46:55] clouddb[1015,1019,1021].eqiad.wmnet,db[2097,2114,2187].codfw.wmnet,db[1155,1225].eqiad.wmnet
[13:47:11] the rest of s6 is on bookworm and already had 1.0.2 natively
[13:47:46] I made the change backward compatible as it would otherwise break in other sections (s5 for example)
[13:50:38] arnaudb: Would you mind re-reviewing it? :D
[13:50:51] i would not, please send away :D
[13:51:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/993088 :P
[13:51:26] TIL https://realpython.com/python-catch-multiple-exceptions/
[13:51:48] you can put them in a tuple *mind blown*
[13:54:10] TIL as well :o
[13:54:16] ha ha
[13:54:55] that is an interesting compatibility issue- many of my scripts query mysql, and I will have to make sure they don't break with the new library as well
[13:56:01] Do you want to create a joint ticket to review scripts?
[13:59:10] maybe we should create a landing strip for newly identified bugs
[13:59:51] or simply collect them at https://phabricator.wikimedia.org/T355531 ?
[14:00:03] after all that version is also what is in bookworm
[14:00:11] what do you mean? I was thinking of creating one for "upgrade pymysql version" and listing (at least on my side) the packages using it to make sure they work
[14:01:16] I upgraded some to bookworm but may not have noticed the particular issue, as it won't show up in unit testing
[15:36:57] https://www.irccloud.com/pastebin/vuyhJ4b4/
[15:37:20] fixed now
[16:06:59] and in db1154 it hasn't broken
[16:08:59] Double check codfw sanitarium too please
[16:10:54] sure
[16:11:56] Thanks
[16:25:42] I see "Uncommitted dbctl configuration changes- check dbctl config diff"
[16:25:53] for 2h30m, expected?
[16:26:50] it is a removal of db2194:3316 / db2194:3317
[16:28:00] That's probably arnaudb cloning
[16:28:48] oh damn sorry I might have forgotten a commit, fixing
[16:32:15] fixed, but this is weird as I added those this morning and the missing commit was a removal 🤔 sorry for the alerts
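
Note on the pymysql fix discussed between 13:25 and 13:51: a minimal sketch of the backward-compatible pattern, not the code from the actual puppet patch. The two exception classes are local stand-ins for `pymysql.err.InternalError` and `pymysql.err.OperationalError` so the example runs without pymysql installed. The `run_query` callable, the SQL text, and the check on error code 1054 are assumptions added for illustration; the log only says that the script ignores "field doesn't exist" errors.

```python
# Stand-ins for pymysql.err.InternalError / pymysql.err.OperationalError,
# defined locally so this sketch runs without pymysql (assumption).
class InternalError(Exception):
    pass


class OperationalError(Exception):
    pass


# MySQL/MariaDB error code for "Unknown column", as seen in the log.
ER_BAD_FIELD_ERROR = 1054


def column_has_private_data(run_query, table, column):
    """Return True if the column exists and holds at least one value.

    run_query is a hypothetical callable taking an SQL string and
    returning a list of rows.
    """
    sql = f"SELECT 1 FROM {table} WHERE {column} IS NOT NULL LIMIT 1"
    try:
        rows = run_query(sql)
    except (InternalError, OperationalError) as exc:
        # A tuple catches both the old and the new exception type, so the
        # check behaves the same before and after the library upgrade.
        if exc.args and exc.args[0] == ER_BAD_FIELD_ERROR:
            return False  # Ignore "field doesn't exist" errors
        raise  # Anything else (e.g. a lost connection) should still fail
    return bool(rows)


def failing_query(exc):
    """Build a fake run_query that always raises exc (test helper)."""
    def run_query(sql):
        raise exc
    return run_query


if __name__ == "__main__":
    old_style = failing_query(InternalError(1054, "Unknown column"))
    new_style = failing_query(OperationalError(1054, "Unknown column"))
    print(column_has_private_data(old_style, "user", "user_options"))
    print(column_has_private_data(new_style, "user", "user_options"))
```

Re-raising on any code other than 1054 is a design choice in this sketch: catching `OperationalError` wholesale would also swallow connection failures, which a private-data check probably should not ignore.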