[05:45:22] Did anyone address the alerts jaime pointed out?
[05:45:27] federico3: ^?
[05:46:39] Guys, we really cannot leave those things unaddressed. That was the last message posted on this channel and there has been zero irc traffic since: no replies, no ticket, no action on those
[06:37:08] I need a bit of context on this, but I can open a task for it if you want. Anything else I can do first at the moment?
[06:45:00] federico3: I'd suggest starting by checking them on icinga to see what the issue may be
[07:08:10] jynus: where did you receive the warnings?
[07:08:58] * Emperor was OoO yesterday (and largely AFK), sorry, was a public holiday here
[07:11:13] federico3: if you go to icinga.wikimedia.org and select, under Services, Warning, Unhandled problems
[07:11:34] Emperor: yes I'm looking at that
[07:11:59] you'll get a table showing that es1035, es2038 and es2039 are all WARNing for memory usage
[07:12:08] thanks - I'm asking if there was an alert on irc/page/email that I have missed
[07:12:42] OIC; no I wouldn't expect so for warnings
[07:12:45] I'm seeing high memory usage for approx 25 days; I'm going to open a task
[07:13:18] (though apropos warnings, I should ping u.random later about restbase1031 which is running out of disk)
[07:20:06] federico3: can you also handle that task? Basically upgrade+reboot
[07:21:21] marostegui: the May reboots you mean? I'm doing the reboots on core dbs
[07:22:39] federico3: what I mean is that you need to restart mariadb on those hosts that are alerting. So you can take that opportunity to reboot them too (and upgrade)
[07:22:53] ah, ok
[07:29:23] do we know what the root cause is? It grows constantly, e.g. like a memory leak
[08:13:30] federico3: maybe there's a small leak, hard to say to be honest. They've been having connection issues as well, so maybe it didn't get all the buffer memory back or something
[08:18:16] Amir1: can I move on with reboots in another section, e.g. s6 in codfw?
[09:13:25] Last dump for db_inventory at codfw (db2185) taken on 2025-05-27 00:45:26 is 121 KiB, but the previous one was 109 KiB, a change of +10.9 %
[09:48:26] federico3: sure!
[09:48:50] Amir1: ok, starting now. Can we also run another codfw section in parallel if we think it's safe?
[09:50:06] I suggest doing eqiad in parallel
[09:50:54] sorry what do you mean?
[09:52:57] start on s5 in eqiad
[10:48:35] Hi folks, I'm looking for some +1s, please :) https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/12 to teach the ring manager about eqiad E8 (which has some new thanos backends in), and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151159 to add the backends to hiera, followed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151160 to add them to the rings (and drain the nodes they're replacing)
[10:48:54] Happy to answer questions about any of these
[10:57:38] Also, sorry, one apus CR too - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151166 to add apus-be2004
[11:39:25] looking
[11:44:23] Emperor: any testing required, any crosscheck or other validation I could do?
[11:45:09] none of those should need testing, really just check I've not obviously typoed a hostname or somesuch
[11:45:37] federico3: note that j.ynus has kindly reviewed a couple
[11:46:20] jynus: are you going to review the remaining ones?
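(Editor's aside: a minimal sketch, not from the channel, of how one might trend the suspected leak on the alerting hosts before the mariadb restart. The host FQDNs, the pymysql dependency, credentials in ~/.my.cnf and the choice of MariaDB's Memory_used counter are all assumptions.)

```python
# Hypothetical helper: compare MariaDB's own Memory_used counter across the
# hosts that are WARNing, to see whether the growth looks like a slow leak.
# Assumes pymysql and a read-only account configured in ~/.my.cnf.
import pymysql

HOSTS = ["es1035.eqiad.wmnet", "es2038.codfw.wmnet", "es2039.codfw.wmnet"]  # assumed FQDNs


def memory_used(host: str) -> int:
    """Return MariaDB's Memory_used global status value, in bytes."""
    conn = pymysql.connect(host=host, read_default_file="~/.my.cnf")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS LIKE 'Memory_used'")
            _, value = cur.fetchone()
            return int(value)
    finally:
        conn.close()


if __name__ == "__main__":
    for host in HOSTS:
        print(f"{host}: {memory_used(host) / 1024**3:.1f} GiB reported by MariaDB")
```

Run periodically (e.g. from cron) and plotted, this would show whether usage keeps climbing between connection storms or plateaus after buffers are warm.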
[11:46:45] I was doing https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151166/2/hieradata/codfw.yaml now
[11:47:08] ok
[11:47:10] not sure which are left, you can do the rest
[11:47:34] I think that leaves https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151160
[11:49:03] prod24_ng - is that the right one?
[11:50:43] federico3: yes, you can check by visiting one of the new nodes and seeing that there are /srv/swift-storage/objects0 -> objects23
[11:51:06] on old-style nodes you get /srv/swift-storage/sdc1 etc. instead
[11:52:37] I have to say I got stumped a bit by ipv6 prefix calculations
[11:52:57] I think I am getting old
[11:55:38] federico3: one thing that helped me approach reviews is the following: as a mental exercise, imagine that Emperor has intentionally created a mistake (design, typo, logic, etc.) in one of his submissions, but hasn't told you. This helps me stay alert and guides my review process.
[12:01:58] thanks to both of you :)
[12:02:23] jynus: regarding ipv6 https://chaos.social/@vidister/113308216924456611
[12:02:54] LOL
[12:03:18] jynus: reminds me of the good old underhanded C contest, and yes that's why I try to describe back what I see in the CR
[12:04:05] is there any explanation for the db_inventory growth? I know it is just a few kb, but it normally doesn't do that
[12:09:31] largest change seems to be: orchestrator.topology_recovery_steps, orchestrator.topology_failure_detection and orchestrator.database_instance_analysis_changelog
[12:09:38] acking for now
[12:10:43] jynus: there should not be any changes there
[12:12:04] I left the change at https://phabricator.wikimedia.org/P76471 but I don't think it is worth examining further
[12:16:25] FIRING: SystemdUnitFailed: swift_ring_manager.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:21:42] sigh, checking
[12:23:13] ah, thanos-be1007 is in F8, need to add that network
[12:26:32] https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/13 if someone could have a quick look, please? [I think what went wrong here is I logged into each of the new nodes and checked the MOTD, but had just read E8 and so misread F8 as another node in E8]
[12:31:18] I am going to take an extended lunch break - I feel I need to start giving my ankle some more movement, plus there is a late meeting today for me
[12:35:32] federico3: would you mind looking at https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/13 please? thanks (and sorry for all the reviews today, lots of hardware moves)
[12:35:42] no worries, looking
[12:48:49] how can I check the subnet number?
[12:50:25] Am I right that the update cookbook is failing here https://phabricator.wikimedia.org/P76477 because the phabricator task is private?
[12:50:29] federico3: if you visit https://netbox.wikimedia.org/ipam/prefixes/701/ (in the commit message), it should be there towards the top
[12:50:58] ...and if you click on the IP Addresses tab, you'll see thanos-be1007
[12:53:03] marostegui: it is possible, I can try to reproduce it if needed
[12:53:42] federico3: I think it needs fixing because the -t is mandatory, but if the ticket is private, then it cannot work
[12:54:07] do we want to make -t not mandatory, or detect that the task is private and skip the update?
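(Editor's aside on the device-layout check from 11:50 above: a minimal sketch, not from the channel, of verifying that a node uses the new-style /srv/swift-storage/objects0..objects23 layout that prod24_ng expects. The helper name and the count of 24 devices are taken only from the objects0 -> objects23 range mentioned in the log.)

```python
# Hypothetical check for the swift device layout: new-style nodes expose
# /srv/swift-storage/objects0 .. objects23, old-style nodes sdc1 etc.
from pathlib import Path


def is_new_style(root: str = "/srv/swift-storage", devices: int = 24) -> bool:
    """Return True if all objects0..objects{devices-1} directories exist."""
    base = Path(root)
    return all((base / f"objects{i}").is_dir() for i in range(devices))


if __name__ == "__main__":
    print("new-style layout" if is_new_style() else "old-style (or missing) layout")
```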
[12:54:11] marostegui: that should have been fixed in T314917
[12:54:11] T314917: Grant ops-monitoring-bot WMF-NDA and acl*sre-term access - https://phabricator.wikimedia.org/T314917
[12:54:43] federico3: both should work, but maybe we need to follow up on the ticket volans just pasted?
[12:54:58] ah the ticket is security
[12:55:02] so yeah not granted (yet?)
[12:56:07] could we perhaps return a more specific error?
[12:56:31] Maybe, but in this situation there's no workaround other than making -t non-mandatory
[12:57:34] federico3: IIRC the phabricator API doesn't give an easy way to know it
[12:58:12] I am going to a meeting now!
[12:58:30] if we want to always track the update with a dedicated task, perhaps we could require opening a dedicated "public" task? Anyhow, for the time being I can make -t optional
[13:00:15] volans: maybe just catch the native error and return something like "Task not found or access denied"? just my 2c :)
[13:00:19] yeah the error is the same as passing T123456789
[13:14:29] volans: can I check whether I can access a task before doing task_comment?
[13:14:55] I see only the task_comment method in the phab instance
[13:15:25] correct, the phab layer of wmflib is very thin; it can be expanded at will if the phab APIs have that functionality
[13:16:25] RESOLVED: SystemdUnitFailed: swift_ring_manager.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:17:10] e.g. we might risk breaking out of a cookbook somewhere during the run unless we check the task for "write permissions" at the very beginning
[13:19:30] I guess we could search for it; if it doesn't exist we don't have access (or the user put a wrong id)
[13:19:33] https://secure.phabricator.com/conduit/method/maniphest.search/
[13:19:55] maniphest.info, which looked more suitable, says: This method is frozen and will eventually be deprecated. New code should use "maniphest.search" instead.
[13:20:07] also if phab has a network issue or any glitch - perhaps it would be better to have a little retry logic and then log an error and continue without raising [unless the caller sets, for example, raise=True]
[13:21:28] e.g. if a task is set to private or tagged security *while* cookbooks are running we might not want the cookbooks to fail mid-run
[13:27:59] also, for context: T283980
[13:27:59] T283980: Phacility (Maintainer of Phabricator) is winding down. Upstream support ending. - https://phabricator.wikimedia.org/T283980
[13:31:05] the latest python3-phabricator has some retry logic in the requests session, but the bookworm version doesn't even use requests (sigh), and the latest release is from 2021
[16:51:47] Hey data folks, there's a request by Diamona in #wikimedia-operations for urgent approval for https://phabricator.wikimedia.org/T395350#10860354 - it looks busy in #wikimedia-operations so I thought I'd flag here too so it wasn't missed
[16:59:41] I can take it
[18:36:14] Thanks
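(Editor's aside: a minimal sketch, not from the channel, of the maniphest.search pre-check plus the retry-and-degrade behaviour discussed above. It calls the conduit HTTP API directly with requests rather than wmflib's thin phab layer; the endpoint, token handling and return convention are assumptions, and the real fix would presumably live in wmflib/spicerack.)

```python
# Hypothetical pre-flight check: ask maniphest.search for the task before
# calling task_comment. If the bot cannot see the task (private/security
# tagged) or the id is wrong, conduit returns an empty data list, and the
# cookbook can log a warning such as "Task not found or access denied"
# instead of failing mid-run. Transient conduit/network errors are retried.
import time

import requests

PHAB_API = "https://phabricator.wikimedia.org/api/maniphest.search"


def task_is_visible(task_id: int, api_token: str, retries: int = 3) -> bool:
    """Return True if the task exists and is readable with this API token."""
    for attempt in range(retries):
        try:
            resp = requests.post(
                PHAB_API,
                data={"api.token": api_token, "constraints[ids][0]": task_id},
                timeout=10,
            )
            resp.raise_for_status()
            result = resp.json().get("result") or {}
            # Empty data means either the task does not exist or we lack access.
            return bool(result.get("data"))
        except requests.RequestException:
            if attempt == retries - 1:
                # Degrade gracefully rather than aborting the whole run.
                return False
            time.sleep(1)
    return False
```

A caller could still opt into strict behaviour (e.g. a raise_on_error flag) if a cookbook genuinely must not run without being able to update the task.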