[08:53:20] db2157 errored during a schema change (without making changes), I'm repooling the host for now and will wait for Amir1 to look into it
[10:05:36] federico3: you may want to check moritzm's comment from yesterday about cumin1003 and check if it will affect you
[10:08:16] @moritzm on my side I can move to a different cumin host, just let us know when you plan to reboot it
[10:21:36] ok! we're aiming for next Tues at 9 UTC; I'll send a mail to sre-at-large later
[12:06:36] clouddb1019 is down, I am checking with dhinus
[12:08:25] I didn't touch it...
[12:08:34] so it went down on its own
[12:08:46] creating a task
[12:08:59] thx
[12:09:17] * marostegui https://phabricator.wikimedia.org/T422813
[12:11:18] HW related it seems: https://phabricator.wikimedia.org/T422813#11804002
[12:11:40] I'll make sure it's depooled
[12:11:56] though haproxy should have detected it as down
[12:13:49] Thanks
[12:13:50] manually set as depooled
[12:14:41] Thank you!
[12:16:09] I'm gonna raise the query timeout on clouddb1015, as per https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Runbooks/Depool_wikireplicas#web_vs_analytics_considerations
[12:16:28] Cool thank you
[12:16:33] I pinged ops-eqiad to double check
[12:17:13] thanks, let's see what they say. I'll keep an eye on clouddb1015 to make sure it can handle all the traffic to s4/s6
[12:17:25] it did handle it in the past, so we should be fine
[12:17:31] It should be okay
[12:18:43] side note: this will block my task of checking the grants on ALL clouddbs :P
[12:18:54] haha
[12:19:30] I hope the new hw arrives soon; once they start having these issues, they could become frequent :(
[12:27:26] @marostegui meanwhile can I start the rolling restarts for PC and MS with the new rolling_restart_pc_ms.py?
[12:27:38] yep, works for me
[13:41:32] dhinus: https://phabricator.wikimedia.org/T422813 let's reimage clouddb1019 to debian trixie instead of 1015, they both belong to the same sections, so that's fine
[13:41:56] marostegui: ok!
[13:42:32] dhinus: I will take care of that
[13:43:08] dhinus: ack
[13:43:13] marostegui: ack ;)
[14:33:11] any maint happening on databases? I want to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1265374 which should be a noop, but I need to disable puppet and co just in case
[14:53:35] I have disabled puppet on all dbs
[14:53:40] ladsgroup@cumin1003:~$ sudo cumin 'A:db-all' 'disable-puppet "merging gerrit:1265374"'
[14:56:05] testing on db2156
[15:01:40] rolling it out everywhere. Please let me know if there are issues
[15:02:26] moritzm: marostegui ^
[15:02:41] Thanks!
[15:06:54] great!
[15:07:28] there's a second patch still needed, then we can aim at migrating individual DB hosts
[15:07:40] I'll look into it next week, I'm off tomorrow for on-call comp
[15:09:03] moritzm: is that for ferm -> nftables?