[00:40:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db2187:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:40:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db2187:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:05:25] FIRING: [5x] SystemdUnitFailed: check-private-data.service on db2187:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:26] ^-- are these expected / being looked at?
[08:03:49] marostegui, Amir1: is it a known issue or related to any ongoing event?
[08:05:07] It was discussed yesterday when we were talking about those hosts that were old sanitarium hosts and have some leftovers.
[08:05:15] federico3: can you reimage it?
[08:05:25] That's just the easiest way to get rid of those issues
[08:08:01] reimage with sudo cookbook sre.hosts.reimage? Any steps I should take before running the cookbook? At least I can create a dedicated task
[08:12:03] You have to depool, stop mariadb, and then go for the normal reimage
[08:12:14] You can use the "remove sanitarium from codfw" task
[08:12:20] I think you are even subscribed
[08:12:42] yes I am
[08:15:39] You can use that one for the reimage
[08:15:43] Disable notifications in puppet too
[08:21:26] federico3: Downtime won't prevent the page, please disable notifications in puppet
[08:24:03] I'm confused: when we run the upgrade cookbook it does depool, silence, upgrade, reboot, remove silence, pool, and we don't disable notifications in puppet during that process.
What makes the reimage page?
[08:24:16] The host gets removed from icinga
[08:24:22] And then it gets readded
[08:24:25] So the downtimes are lost
[08:24:34] And it will send a page with mariadb being down etc.
[08:25:00] thanks
[10:09:01] marostegui: I'm deploying the notification change; in the meantime 1) can I go on with reboots on another section? 2) do we want to move forward today with the switchover of es1035 for the memory issue?
[10:09:45] federico3: 1) yes 2) We tend to avoid switchovers on Fridays, so let's schedule it for Monday
[10:09:56] ok thanks
[10:34:02] marostegui: I'm ready to reimage db2187, should I pick bookworm (surprisingly I see trixie as an option :D)?
[10:34:15] yeah, bookworm
[10:34:27] moritzm: Is that available for testing already??
[10:35:00] sretest1001 is running trixie
[10:35:27] I need to fix a few things in the installer next week to make it smoother to install, but by the end of next week you should be able to install a DB test host
[10:35:34] just don't make it a master please :-)
[10:35:35] Oh sweet!
[10:35:36] There may be some tricksy bugs remaining ;-)
[10:35:36] <_joe_> yeah I would not install trixie now on a prod system
[10:35:51] <_joe_> Emperor: sigh
[10:35:54] I was planning it just for the enwiki master
[10:35:55] <-- definitely hasn't been waiting over a year to make that pun
[10:36:27] <_joe_> Emperor: you and claime will be in my personal circle of hell, making puns all the time :D
[10:36:38] XD
[10:37:03] marostegui: trixie works best with the 12.0.0 mariadb release!
[10:37:26] that's what I needed to hear!
[10:37:26] a match made in heaven for the enwiki master
[10:38:04] <_joe_> moritzm: only if we do a revival and we also run it on raid0
[10:40:09] moritzm: do you know if we can also test trixie from the internal container archive?
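[Editor's note: pulling together the reimage preparation described earlier in the log (depool, stop mariadb, disable notifications in puppet, then the sre.hosts.reimage cookbook), a rough command sketch follows. The exact flags and the use of dbctl/cumin are assumptions based on common WMF tooling, not verified against current cookbook versions; T<task> is a placeholder for the tracking task.]

```shell
# Sketch only; command names other than the cookbook itself are assumptions.

# 1) Depool the replica from database traffic (dbctl is assumed here)
sudo dbctl instance db2187 depool
sudo dbctl config commit -m "Depool db2187 for reimage T<task>"

# 2) Stop mariadb cleanly on the host
sudo cumin 'db2187*' 'systemctl stop mariadb'

# 3) Disable notifications in puppet first, then reimage. A downtime/silence
#    is not enough: the reimage removes the host from icinga and re-adds it,
#    so the downtime is lost and "mariadb down" would page.
sudo cookbook sre.hosts.reimage --os bookworm -t T<task> db2187
```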
[10:40:43] <_joe_> federico3: we can create the base image - or switch to importing debian's as bases
[10:41:08] <_joe_> given there's now a commitment to maintenance
[10:41:42] marostegui: the reimage tool in dry run is asking "Physical host on legacy vlan/IP, please consider re-imaging it using --move-vlan."
[10:41:56] federico3: let's say no for now
[10:41:57] for enabling the official prod images I'd rather wait until trixie is released, there are still things changing at quite a pace
[10:42:01] (I've never seen that)
[10:42:10] I think I know what it is for, but let's skip it
[10:42:13] so for initial experiments maybe using the official Debian images is better for now
[10:43:29] marostegui: ok, thanks
[10:44:20] <_joe_> moritzm: ack
[10:45:02] <_joe_> moritzm: what do you think of the above: importing Debian official images instead of rebuilding them ourselves?
[10:46:01] <_joe_> when we made the choice to rebuild, there was no commitment to maintenance / updates of the images, but AIUI that's changed
[10:46:25] <_joe_> so rebuilding gave us control over when to build a new image for security updates
[10:47:29] not sure. I need to check what the current status is and how well it works; I'll make a note to have a closer look
[10:48:03] there are a few cloud teams who took over maintenance of the images, but I need to figure out what the exact workflow is
[10:59:03] <_joe_> thanks <3
[12:47:19] marostegui: db2187.codfw.wmnet still has the silence set, should I leave it enabled while you investigate?
[12:50:08] federico3: No, you can enable notifications and repool it if it has caught up. The thing we are investigating isn't affecting db2187, just prometheus
[12:58:54] ok. Do we feel confident repooling on Friday, or wait for Monday?
[13:00:31] yes, you can go for it
[13:00:33] but enable notifications
[14:08:27] marostegui: for the zarcillo user/pass in the private puppet repo is this ok?
https://phabricator.wikimedia.org/P76736 (of course I'll fill the passwords in)
[14:08:28] Should I then place the GRANTS... in puppet?
[14:10:51] do you want a year timestamp in the username, e.g. "25", perhaps?
[14:53:47] it shouldn't be done like that
[14:54:10] the values should come from hiera as much as possible
[15:01:03] Amir1: sorry, which value? I see other "class passwords::mysql" blocks that contain usernames and passwords, what should I do instead?
[15:05:10] 1- only the password should come from private puppet 2- mysql in private puppet is doing it wrong. That should also be a hiera value, with the value itself kept in private puppet
[15:05:36] Check other cases
[15:07:02] I'm looking at modules/passwords/manifests/init.pp, for example at phabricator
[15:07:54] one sec, updating the example
[15:09:15] https://phabricator.wikimedia.org/P76736 updated... something like this, and the associated users go in "public" puppet? I'm not sure what you mean
[15:38:35] You'd usually have dummy credentials in labs/private to shadow the actually-private ones, e.g. https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/+/refs/heads/master/hieradata/common/profile/ceph/s3/client.yaml or https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/+/refs/heads/master/hieradata/common/profile/swift.yaml
[15:53:32] ok, I can first copy https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/+/refs/heads/master/hieradata/common/profile/swift.yaml (the key in the key-value pairs is the username?
) then create an identical entry but with the real password
[16:06:30] this is how it's done on the client side: https://github.com/wikimedia/operations-puppet/blob/2bad72658b49806996e27b7abd557186a62b2836/modules/mailman3/manifests/web.pp#L15
[16:06:58] then called from https://github.com/wikimedia/operations-puppet/blob/2bad72658b49806996e27b7abd557186a62b2836/modules/mailman3/manifests/init.pp#L47
[16:08:02] then https://github.com/wikimedia/operations-puppet/blob/2bad72658b49806996e27b7abd557186a62b2836/modules/profile/manifests/lists.pp#L20
[16:08:45] see https://wikitech.wikimedia.org/wiki/Puppet/Coding_and_style_guidelines
[16:18:03] (just for clarity: this is meant to create users on mysql instances. The client side is on k8s and uses a different credential management)
[18:40:41] I know, I'm saying this is how it should be done on the puppet side :)
[18:40:54] client- or server-side doesn't matter
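[Editor's note: the credential-handling pattern discussed above (password via hiera, real value in the private puppet repo, dummy value in labs/private) can be sketched roughly as below. All class, parameter, and hiera key names are illustrative assumptions, not the actual zarcillo profile.]

```puppet
# 1) Public operations/puppet: the profile reads credentials via lookup().
#    Only the password is secret; the username can live in public hieradata.
class profile::zarcillo (
    String $db_user     = lookup('profile::zarcillo::db_user'),
    String $db_password = lookup('profile::zarcillo::db_password'),
) {
    # ... use $db_user / $db_password to template a config file,
    # create the mysql user, etc.
}

# 2) Private puppet repo (never public), e.g.
#    hieradata/common/profile/zarcillo.yaml:
#
#      profile::zarcillo::db_password: 'the-real-secret'
#
# 3) labs/private repo, same hiera path, with a dummy value so Cloud VPS
#    and CI can still compile the catalog:
#
#      profile::zarcillo::db_password: 'dummypassword'
```

This mirrors the mailman3 example linked above: the manifest takes the secret as a parameter, the profile wires it in via hiera, and only the private repo holds the real value.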