[05:45:22] <marostegui>	 Did any one addressed the alerts jaime pointed out?
[05:45:27] <marostegui>	 federico3: ^?
[05:46:39] <marostegui>	 Guys, we really cannot leave those things unaddressed. There has been zero irc traffic on this channel and that was the last message posted and yet there's been no replies, nor a ticket, nor an action on those
[06:37:08] <federico3>	 I need a bit of context on this, but I can open a task for it if you want. Anything else I can do first at the moment?
[06:45:00] <marostegui>	 federico3: I'd suggest to start checking them on icinga to see what the issue may be
[07:08:10] <federico3>	 jynus: where did you receive the warnings?
[07:08:58] * Emperor was OoO yesterday (and largely AFK), sorry, was a public holiday here
[07:11:13] <Emperor>	 federico3: if you go to icinga.wikimedia.org and select, under Services, Warning, Unhandled problems
[07:11:34] <federico3>	 Emperor: yes I'm looking at that
[07:11:59] <Emperor>	 you'll get a table which has es1035 es2038 es2039 are all WARNing for memory usage
[07:12:08] <federico3>	 thanks - I'm asking if there was a alert on irc/page/email that I have missed
[07:12:42] <Emperor>	 OIC; no I wouldn't expect so for warnings
[07:12:45] <federico3>	 I'm seeing high memory usage since 25d approx, well I'm going to open a task
[07:13:18] <Emperor>	 (though apropos warnings, I should ping u.random later about restbase1031 which is running out of disk)
[07:20:06] <marostegui>	 federico3: can you also handle that task? Basically upgrade+reboot
[07:21:21] <federico3>	 marostegui: the May reboots you mean? I'm doing the reboots on core dbs 
[07:22:39] <marostegui>	 federico3: what I mean is that you need to restart mariadb on those hosts alerting. So you can take that opportunity to reboot them too (and upgrade)
[07:22:53] <federico3>	 ah, ok
[07:29:23] <federico3>	 do we know what the root cause is? It grows constantly e.g. like a memory leak
[08:13:30] <marostegui>	 federico3: maybe there's a small leak, hard to say to be honest. They've been having connections issues as well so maybe it didn't have all the buffers memory back or something 
[08:18:16] <federico3>	 Amir1: can I move on with reboots in another section e.g. s6 in codfw?
[09:13:25] <jynus>	 Last dump for db_inventory at codfw (db2185) taken on 2025-05-27 00:45:26 is 121 KiB, but the previous one was 109 KiB, a change of +10.9 %
[09:48:26] <Amir1>	 federico3: sure!
[09:48:50] <federico3>	 Amir1: ok, starting now. Can we also run another codfw section in parallel if we thing it's safe?
[09:50:06] <Amir1>	 I suggest doing eqiad in parallel 
[09:50:54] <federico3>	 sorry what do you mean?
[09:52:57] <Amir1>	 start on s5 in eqiad
[10:48:35] <Emperor>	 Hi folks, I'm looking for some +1s, please :) https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/12 to teach the ring manager about eqiad E8 (which has some new thanos backends in) and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151159 to add the backends to hiera followed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151160 to add them to the rings (and drain the nodes they're
[10:48:35] <Emperor>	 replacing)
[10:48:54] <Emperor>	 Happy to answer questions about any of these
[10:57:38] <Emperor>	 Also, sorry, one apus CR too - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151166 to add apus-be2004
[11:39:25] <federico3>	 looking
[11:44:23] <federico3>	 Emperor: any testing required, any crosscheck or other validation I could do?
[11:45:09] <Emperor>	 none of those should need testing, really just check I've not obviously typoed a hostname or somesuch
[11:45:37] <Emperor>	 federico3: note that j.ynus has kindly reviewed a couple
[11:46:20] <federico3>	 jynus: are you going to review the remaining ones?
[11:46:45] <jynus>	 I was doing https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151166/2/hieradata/codfw.yaml now
[11:47:08] <federico3>	 ok
[11:47:10] <jynus>	 not sure which are left, you can do the rest
[11:47:34] <Emperor>	 I think that leaves https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151160
[11:49:03] <federico3>	 prod24_ng is that the right one?
[11:50:43] <Emperor>	 federico3: yes, you can see if you visit one of the new nodes and see that there are /srv/swift-storage/objects0 -> objects23 
[11:51:06] <Emperor>	 on old-style nodes you get /srv/swift-storage/sdc1 etc. instead
[11:52:37] <jynus>	 I have to say I got stumped a bit by ipv6 prefix calculations
[11:52:57] <jynus>	 I think I am getting old
[11:55:38] <jynus>	 federico3: one thing that helped me confront reviews is the following: as a mental exercise, imagine that Emperor has intentionally created a mistake (design, typo, logic, etc.) on one of his submissions, but hasn't told you. This helps me be alert and guide my review process.
[12:01:58] <Emperor>	 thanks to both of you :)
[12:02:23] <federico3>	 jynus: regarding ipv6 https://chaos.social/@vidister/113308216924456611
[12:02:54] <jynus>	 LOL
[12:03:18] <federico3>	 jynus: reminds me of the good old underhanded C contest, and yes that's why I try to describe back what I see in the CR
[12:04:05] <jynus>	 is there any explanation for the db_inventory growth? I know it is just a few kb, but it normally doesn't do that
[12:09:31] <jynus>	 largest change seems to be: orchestrator.topology_recovery_steps,  orchestrator.topology_failure_detection and orchestrator.database_instance_analysis_changelog
[12:09:38] <jynus>	 acking for now
[12:10:43] <marostegui>	 jynus: there should not be any changes there
[12:12:04] <jynus>	 I left the change at: https://phabricator.wikimedia.org/P76471 but I don't think it is worth examining it more
[12:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: swift_ring_manager.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:21:42] <Emperor>	 sigh, checking
[12:23:13] <Emperor>	 ah, thanos-be1007 is in F8, need to add that network
[12:26:32] <Emperor>	 https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/13 if someone could have a quick look, please? [I think what went wrong here is I logged into each of the new nodes and checked MOTD, but had just read E8 and so misread F8 as another node in E8]
[12:31:18] <jynus>	 I am going to take an extended lunch break- I feel I need to start giving some more movement to my ankle, plus there is a late meeting today for me
[12:35:32] <Emperor>	 federico3: would you mind looking at https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/13 please? thanks (and sorry for all the reviews today, lots of hardware moves)
[12:35:42] <federico3>	 no worries, looking
[12:48:49] <federico3>	 how can I check the subnet number?
[12:50:25] <marostegui>	 Am I right that the update cookbook is failing here https://phabricator.wikimedia.org/P76477 because the phabricator task is private?
[12:50:29] <Emperor>	 federico3: if you visit https://netbox.wikimedia.org/ipam/prefixes/701/ (in the commit message), it should be there towards the top
[12:50:58] <Emperor>	 ...and if you click on the IP Addresses tab, you'll see thanos-be1007
[12:53:03] <federico3>	 marostegui: it is possible, I can try to reproduce it if needed
[12:53:42] <marostegui>	 federico3: I think it is needs fixing because the -t is mandatory, but if the ticket is private, then it cannot work
[12:54:07] <federico3>	 do we want to make -t not mandatory or detect that the task is private and skip the update?
[12:54:11] <volans>	 marostegui: that should have been fixed in T314917
[12:54:11] <stashbot>	 T314917: Grant ops-monitoring-bot WMF-NDA and acl*sre-term access - https://phabricator.wikimedia.org/T314917
[12:54:43] <marostegui>	 federico3: both should  work, but maybe we need to follow up on the ticket volans just pasted?
[12:54:58] <volans>	 ah the ticket is security
[12:55:02] <volans>	 so yeah not granted (yet?)
[12:56:07] <federico3>	 could we perhaps return a more specific error?
[12:56:31] <marostegui>	 Maybe, but in this situation there's no work around other than making -t non mandatory
[12:57:34] <volans>	 federico3: IIRC the phabricator API doesn't give an easy way to know it
[12:58:12] <marostegui>	 I am going to a meeting now!
[12:58:30] <federico3>	 if we want to always track the update with a dedicated task perhaps we could require opening a dedicated "public" task? Anyhow for the time being I can make -t optional
[13:00:15] <federico3>	 volans: maybe just catch the native error and return something like "Task not found or access denied"? just my 2c :) 
[13:00:19] <volans>	 yeah the error is the same of passing T123456789
[13:14:29] <federico3>	 volans: can I check if I can access a task before doing task_comment?
[13:14:55] <federico3>	 I see only the task_comment method in the phab instance
[13:15:25] <volans>	 correct, the phab layer of wmflib is very thin, can be expanded at will of the phab apis have that functionality
[13:16:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: swift_ring_manager.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:17:10] <federico3>	 e.g. we might risk breaking out of a cookbook somewhere during the run unless we check the task for "write permissions" at the very beginning
[13:19:30] <volans>	 I guess we could search for it, if it doesn't exists we don't have access (or the user put a wrong id)
[13:19:33] <volans>	 https://secure.phabricator.com/conduit/method/maniphest.search/
[13:19:55] <volans>	 maniphest.info that looked more suitable says This method is frozen and will eventually be deprecated. New code should use "maniphest.search" instead.
[13:20:07] <federico3>	 also if phab has a network issue or any glitch - perhaps it could be better to have a little retry logic and then log out an error and continue without raising [unless the caller sets for example raise=True]
[13:21:28] <federico3>	 e.g. if a task is set to private or tagged security *while* cookbooks are running we might not want the cookbooks to fail mid-run
[13:27:59] <volans>	 also, for context: T283980
[13:27:59] <stashbot>	 T283980: Phacility (Maintainer of Phabricator) is winding down. Upstream support ending. - https://phabricator.wikimedia.org/T283980
[13:31:05] <volans>	 the latest python3-phabricator has some retry logic in the requests session, but the bookworm version doesn't even use requests (sigh) and latest release 2021
[16:51:47] <RhinosF1>	 Hey data folks, there's a request by Diamona in #wikimedia-operations for urgent approval for https://phabricator.wikimedia.org/T395350#10860354 - it looks busy in #wikimedia-operations so I thought I'd flag here too so it wasn't missed
[16:59:41] <jynus>	 I can take it
[18:36:14] <RhinosF1>	 Thanks