[06:11:44] urandom: you aware of this? https://phabricator.wikimedia.org/T355549
[09:05:45] I am going to start stopping and shutting down hosts in codfw a1
[10:04:35] I've done a backport of pymysql 1.0.2 for bullseye, where should I add it? to "main", then it would be updatable on all bullseye hosts which have it installed? or first on cumin1002 only for some further tests?
[10:12:08] thanks moritzm! I'm not sure about the level of confidence we could have in all the other scripts that rely on pymysql, and there are quite a few: https://codesearch.wmcloud.org/search/?q=pymysql&files=&excludeFiles=&repos= I think the safest way would be to first test it on cumin1002, wdyt?
[10:18:30] I can also initially install it only on cumin1002 with dpkg and we keep it off apt.wikimedia.org for now?
[10:18:48] one other option is to use a discrete archive component, then we can make it opt-in for some roles only
[10:18:56] but that might be overly cautious
[10:19:21] unless we know of specific API breaks between 0.9 and 1.0
[10:19:37] your first idea seems the most reasonable, I've checked the changelog and saw no breaking change, but "you never know"
[10:23:55] ok, I'll install it on cumin1002 now and when you've done some tests, we can move on and figure out how to upload it to apt.wikimedia.org
[10:25:41] amazing, thanks, will keep you posted :-)
[10:36:52] cumin1002 is upgraded now
[10:55:11] thanks, will try it after lunch!
[10:55:53] A quick glance shows that db-compare still doesn't work though
[10:56:23] On cumin1002 that is, so I assume the rest won't either
[11:00:35] marostegui: I've tried before lunch, it's a fix that I have to implement on my end, will release it this pm
[11:00:52] ah excellent!
[11:01:02] good news then!
[12:01:59] I am starting the s6 eqiad switchover
[12:20:52] jynus: hi, is it okay if I drop securepoll-related rows (non-useful ones) from dbbackups.backup_files older than a year? https://phabricator.wikimedia.org/T349360
[12:21:28] from backups?
[12:21:38] no, from the db table backup_files_history
[12:22:08] ah, from the live db?
[12:22:12] yup
[12:22:21] it'll be a couple of million rows I think
[12:22:28] one sec
[12:24:35] that's m1, right?
[12:25:05] actually I need to double check
[12:26:03] Yeah, you can delete all the rows from there if you want, it was only blocked on whether you wanted to keep them
[12:26:05] based on the grants, yes
[12:26:13] from history, not from backup_files
[12:26:26] I want to keep some of them, but the securepoll ones are just taking space
[12:26:43] 8K rows for each s3 backup in each dc
[12:26:45] I can back them up on the long-term backups if you want and keep them for 5 years
[12:26:48] securepoll?
[12:26:54] yeah
[12:27:01] https://phabricator.wikimedia.org/T355594
[12:27:18] tables for elections done in 2009
[12:27:47] as in the records that those were backed up, only for those rows?
[12:27:54] that's a bit weird
[12:28:11] I thought you wanted to basically drop/truncate the full tables
[12:28:17] but not a problem for me
[12:28:23] no, just these rows
[12:28:32] but not the main table, only the history one
[12:28:32] let me tell you why it is weird
[12:28:48] the backup record contains the size of the full backup
[12:29:08] so there will be size unaccounted for
[12:29:21] so it will not be consistent - better to delete everything
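If the targeted cleanup does go ahead, deleting millions of rows is usually done in small batches so transactions stay short and replication doesn't lag. A minimal sketch with pymysql, under loud assumptions: the file_name column, the LIKE pattern, and the connection parameters are all illustrative and would need to be confirmed against the real backup_files_history schema on m1 first.

    import time
    import pymysql

    # Illustrative connection settings; the real m1 host/credentials differ.
    conn = pymysql.connect(host="m1-master.example", user="cleanup",
                           password="...", database="dbbackups",
                           autocommit=True)

    BATCH = 10000  # small batches keep locks short and replication happy

    with conn.cursor() as cur:
        while True:
            # file_name is an assumed column name; verify the schema first.
            deleted = cur.execute(
                "DELETE FROM backup_files_history "
                "WHERE file_name LIKE %s LIMIT %s",
                ("%securepoll%", BATCH),
            )
            if deleted == 0:
                break
            time.sleep(0.5)  # give replicas a moment to catch up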
[12:29:33] are there many rows with that?
[12:29:45] many, many
[12:30:00] 8k per backup per dc of s3
[12:30:37] sure, go ahead, it's just that I think we could archive and drop everything
[12:30:50] but I guess something > nothing
[12:31:37] one problem is that even if we archive and drop it, when I reuse the table to build stats, it'll take a lot of space in the analysis and table scan
[12:31:39] e.g. I can run bacula now and copy the entire archive tables
[12:32:01] from the backups to long-term backups and keep them for 5 years there
[12:32:35] then drop everything, then when you need them, recover only the parts you need
[12:32:48] somewhere outside m1
[12:33:01] so this is a long way to say: no problem on my side
[12:33:26] please log it, trace it on a ticket, etc.
[12:33:33] yeah
[12:33:39] let me think about it
[12:34:01] the reason why I am a bit surprised is that I thought the initial issue was too much wasted space
[12:34:25] so I'm offering to drop more and keep it at least somewhere where it doesn't hurt (long-term backups)
[12:34:39] until there is a better archival solution
[12:35:47] it is wasted space, but it also hurts my analysis: I will always have to exclude them, etc.
[12:54:37] Amir1: FYI this php -ddisplay_errors=On /srv/mediawiki/multiversion/MWScript.php extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki=enwiki --statsd --dbshard s1 doesn't re-load the config
[12:54:59] Do you want me to create a task for the Growth team?
[12:55:13] let me check
[12:55:58] marostegui: yeah sure
[12:57:40] there are 366442024 records in total, and 38020767 like '%securepoll%' in that table
[12:58:14] marostegui: wmfmariadb should be fixed! :-) https://www.irccloud.com/pastebin/R1SxOenM/
[13:00:22] \o/
[13:00:25] nice one!
[13:00:31] we can test db-switchover tomorrow :)
[13:00:40] can you also check all the other scripts from the task?
[13:00:49] * Amir1 suddenly feels sick 👿
[13:01:05] sure thing
[13:01:05] Amir1: if it fails, it cannot even move a slave :)
[13:01:08] so....!
[13:09:23] marostegui: I am now, thank you.
[14:27:22] as for the rollout of the updated pymysql: is the updated version only needed in conjunction with wmfdb?
[14:27:56] that said, I can't see where wmfdb gets set up in Puppet?
[14:28:56] https://debmonitor.wikimedia.org/packages/python3-pymysql is the list of where pymysql is installed fleet-wide
[14:29:25] 0.9.3-2 is the version from bullseye, so those 240 systems have the current version
[14:43:24] moritzm: in this case it's related to wmfmariadbpy, https://debmonitor.wikimedia.org/packages/python3-wmfmariadbpy which is installed with its own puppet module
[14:45:13] it should be deployed everywhere connecting to mariadb, as the new ssl-verification-disabling stuff will be required until we swap PKIs, unfortunately
[14:46:47] ok, if we need it universally, then I'd simply upload my 1.0.2 backport to the main component of apt.wikimedia.org?
[14:47:48] and then we can roll it out using cumin or debdeploy
[14:47:53] I think it'll be a good solution, yep. Amir1, if you have an opinion on this please voice it! I don't see any breaking change in the changelog between those versions, so afaict it's safe!
[14:48:15] we don't have automated updates of packages
[14:48:24] so this can be upgraded in a slow manner
[14:48:37] e.g. by first updating a few db* hosts only
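Once the first few db* hosts are upgraded, a quick smoke test confirms the 1.0.2 backport behaves before any wider rollout. A minimal sketch, assuming client credentials live in a ~/.my.cnf on the host; the connection details are illustrative:

    import os
    import pymysql

    # Confirm the upgraded package is the one actually being imported.
    print("pymysql", pymysql.__version__)

    # Illustrative connection; real hosts have their own config/credentials.
    conn = pymysql.connect(
        read_default_file=os.path.expanduser("~/.my.cnf"),  # assumed client config
        host="localhost",
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT VERSION()")
            print("server:", cur.fetchone()[0])
    finally:
        conn.close()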
[14:48:42] good to know, good idea indeed
[14:48:56] only freshly installed hosts come up with the latest version, but OTOH that is what we want anyway
[14:49:08] exactly
[14:49:55] I always fear the legacy script that's run every other year and has critical importance; those hosts are not covered by this ^^'
[14:51:08] I think Moritz is the authoritative voice in this area, I have no comments
[14:51:12] I need to complete something else, will upload in 10-15 min and I'll sync up for the next rollout steps
[14:55:29] Hey urandom! I am in the process of deploying this: https://phabricator.wikimedia.org/T339865 and while discussing it with the team we wanted to know if we have a way to remove stored responses from cassandra on RESTBase given two timestamps.
[14:56:20] I am not very familiar with the schema, but I think we have timeuuids, so maybe as a fallback scenario in case the deployment goes wrong, we can remove corrupted entries by removing the rows between two timestamps
[14:59:38] nemo-yiannis: not easily, no. the only indexing is the primary key (which is a combination of the project and title)
[14:59:53] ok
[14:59:55] basically, we'd have to do a table scan and delete matching entries
[15:00:09] yeah, that doesn't sound like an option
[15:00:17] probably not a big deal for a small table, but not something you'd want to do on the regular
[15:00:46] truncating the entire table might be an option, depending on the severity of the impact a cold cache would create
[15:24:39] nemo-yiannis: sorry, ducked out for a quick meeting... circling back: you want to do this to remove restbase-generated entries, while keeping the mw ones?
[15:25:10] no, I want to remove PCS entries in cassandra in case for some reason one deployment (that looks risky) goes wrong
[15:25:38] so given a time window in which we know PCS responses could be problematic, I want to remove those entries
[15:25:57] so on a rollback, you'd purge the entries created after the (faulty) deployment... right
[15:27:10] yes
[15:27:35] that's possible, but would require indexes on the timestamps. having those indexes would create overhead on writes and increase storage size, so without a requirement to do this, we wouldn't have created them.
[15:27:57] ok, I was wondering if we already had this in place
[15:28:51] it's really no different than, say, a mariadb database, except that a table is distributed over hosts in a cluster, rather than being on an fs backed by fast storage.
[15:30:06] if you didn't index an attribute, you can still query against it, but behind the scenes it'll require some scanning. but when that scanning, collation of results, etc. has to happen over a cluster of machines, it's... well, worse
[15:30:55] Back in the day, in a similar use case, Petr used a script to read the events between 2 timestamps from kafka and purge the content on RESTBase/Cassandra
[15:31:10] Do you know if this exists somewhere?
[15:31:19] the script, no I don't
[15:31:36] first I heard of it (but that sounds like a good approach for an ad hoc event like this)
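That kafka-replay approach could look roughly like the sketch below: re-read the relevant change events, keep only those whose timestamps fall inside the bad window, and delete the matching rows by their primary key (project + title, per the schema described above). Everything named here is an assumption: the topic, broker, keyspace, table, and event fields are illustrative, not the real RESTBase setup.

    import json
    from kafka import KafkaConsumer          # kafka-python
    from cassandra.cluster import Cluster    # cassandra-driver

    # Assumed bad-deploy window, as unix epoch milliseconds.
    WINDOW_START_MS = 1706180000000
    WINDOW_END_MS = 1706190000000

    # Topic and broker names are illustrative assumptions.
    consumer = KafkaConsumer(
        "resource_change",
        bootstrap_servers="kafka.example:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=10000,
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    # Keyspace, table, and column names are placeholders for the real schema.
    session = Cluster(["cassandra.example"]).connect()
    delete = session.prepare(
        "DELETE FROM pcs_storage.responses WHERE project = ? AND title = ?"
    )

    for msg in consumer:
        # msg.timestamp is the broker-assigned event time in milliseconds.
        if WINDOW_START_MS <= msg.timestamp <= WINDOW_END_MS:
            event = msg.value
            session.execute(delete, (event["project"], event["title"]))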
[16:35:03] Amir1: are you done with s4?
[16:36:28] or at least done with one of the s4 dcs
[16:53:58] marostegui: I'm done
[16:54:13] only masters left
[16:54:32] Oh great
[16:54:37] I'll start mine now then
[17:55:36] urandom: are you by chance familiar with cassandra-dev2001?
[17:55:59] it's in rack b5, we're hoping to move all the servers in that rack from old to new switches tomorrow
[17:56:37] so a short interruption is expected (60 sec max), I'm not sure if we should take some action in advance for this host?
[17:57:01] Task is T355549
[17:57:02] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549
[18:57:47] topranks: oh, is that the only one from that list slated for tomorrow?
[18:58:06] but to answer your question, nothing needs to be done with it
[19:00:23] topranks: I gather from the ticket that there are two restbase hosts, and a sessionstore too. Assuming they'll only briefly lose connectivity (seconds), I don't think we need to do anything with them either.
[19:01:24] but I have some work to do on restbase that I will want to pause beforehand and resume afterward
[19:22:13] urandom: oh great! those were the last I was trying to work out; yes, they will only briefly lose comms
[19:22:55] re cassandra-dev2001, that's good to confirm, I'll make a note on the sheet