[08:46:21] Etherpad will be down in 15 minutes for around one hour - T316421
[08:46:22] T316421: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421
[09:13:34] FYI, idp.w.o is now running on Java 17 (previously 11), as an intermediate step to the Bookworm update. No issues were found on the idp-test* hosts, but if anything is odd related to CAS-authenticated services, please ping Simon or me
[09:43:13] Etherpad maintenance finished - T316421
[09:43:13] T316421: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421
[11:00:33] hi all - I had some trouble after replacing a disk in a R540 (config F) https://phabricator.wikimedia.org/T357380#9575876 - I think mainly because the storage controller is in RAID instead of HBA mode and required manual intervention after disk replacement. Is that maybe an oversight, or on purpose?
[11:01:13] practically there are two RAID-0 configurations (one per physical drive) - and other servers of the same model do share that config
[11:01:36] mdadm then goes on top of those "raid0" virtual drives
[11:04:54] jayme: the problem being that the drive has been swapped, but you need to recreate the RAID0 device?
[11:05:32] jayme: if I'm right, then the swift howto docs have just what you want - https://wikitech.wikimedia.org/wiki/Swift/How_To#Replacing_a_disk_without_touching_the_rings
[11:06:37] <_joe_> Emperor: why are we not using HBA mode, even
[11:06:51] Emperor: maybe - but that's too well hidden :D
[11:06:54] _joe_: Swift is moving away from single-drive-RAID0 towards JBOD
[11:07:03] but not very ... swiftly ;p
[11:07:52] jayme: I think that'll do what you want (assuming I've understood the problem correctly)
[11:08:59] Emperor: I already worked around that manually. As I rebooted the server and it did not come back up, I had no chance to run megacli anymore
[11:09:30] why does a mw host have a raid controller?
[11:09:35] do you happen to know why we use this RAID-0 config?
[11:09:50] because we love pain
[11:09:54] No, wait, probably not that :)
[11:09:56] ah, makes sense. yes
[11:10:15] honestly, no, it's been that way since before I started. I hate hardware RAID myself :)
[11:10:46] and it's not even a hardware raid... it's just an obfuscated disk :D
[11:11:02] yeah, they're not fun
[11:12:34] looks from the ticket that you've got the system back running again, though?
[11:12:42] yeah
[11:12:51] but it did not feel right :)
[11:13:41] so I wanted to question the status quo
[11:14:23] from my naive POV that hassle could be avoided by configuring the controller as HBA instead of RAID
[11:15:29] that's what I'm moving swift towards, but it does involve non-trivial migration work
[11:16:02] (e.g. installer setup, maybe changing how fstab is done, supporting both old and new systems at once...)
[11:16:53] yeah, can imagine... :/
[11:17:13] I thought maybe there is a specific reason for doing it this way that I'm not seeing
[11:17:45] usually the hosts with just 2 disks don't have a raid controller at all and we use mdadm
[11:18:12] for those with many disks we have two different approaches: those that actually benefit from the raid, like mysql hosts with raid 10, use the controller
[11:18:33] and those that run software that takes care of each individual disk, like cassandra, swift, etc...
[11:19:04] those do run mdadm on top of this "raid-0 with single disk" thing
[11:19:33] jayme: which one are you referring to? Or is it just a third category? :D
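(Editor's note: a rough sketch of the kind of manual recovery the per-disk RAID-0 setup forces after a drive swap, for readers following along. This is not the exact procedure from the Swift howto linked above; the adapter number, enclosure:slot and device names below are placeholders that would have to be read off the actual host.)
    # confirm the controller sees the replacement drive and note its [enclosure:slot]
    sudo megacli -PDList -aALL
    # clear any leftover "foreign" configuration the replacement drive may carry
    sudo megacli -CfgForeign -Clear -aALL
    # recreate a single-drive RAID-0 virtual drive on it (enclosure 32, slot 1, adapter 0 here)
    sudo megacli -CfgLdAdd -r0 '[32:1]' -a0
    # then partition the new virtual drive as before and re-add it to the mdadm array it backed
    sudo mdadm /dev/md1 --add /dev/sdb2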
[11:19:57] I think some of the oldest swift nodes did that for the OS disks
[11:20:25] volans: I think it's a third one - the one I'm observing
[11:20:37] how many disks do you have?
[11:20:43] 2
[11:20:54] then the question is why it has a raid controller in the first place
[11:21:23] not sure that is the question - as they do have them and we're probably not going to rip them out ;)
[11:21:34] don't tempt me ;)
[11:24:12] I'm not sure why they have those controllers. I've naively assumed they are just part of the "Config F" hardware
[11:24:49] config F doesn't have a hw raid controller, and looking at affected hosts these are all from a single racking task: https://phabricator.wikimedia.org/T326362
[11:25:19] this is probably just a provisioning glitch?
[11:25:53] If I go to https://debmonitor.wikimedia.org/packages/megacli and filter for mw, it's exactly these servers only
[11:27:20] that was supposed to be Config C-1G from the procurement task
[11:27:24] (T325215)
[11:29:22] but Netbox lists them as Config F? at which stage is the machine type entered, manually by DC ops or via some script?
[11:29:50] manually, the configs are purely logical, netbox doesn't have enough data to distinguish them
[11:30:18] the quote had a "PERC H745 Controller, Front" item for all 32 hosts
[11:35:22] so the quote sent by Dell in T325215 was incorrect, right? config-c doesn't have a hw RAID controller after all
[14:51:17] https://phabricator.wikimedia.org/T336504 is causing widespread problems. Is there any chance last Thursday's enwiki deployment could be rolled back as a stop-gap measure until a real fix can be deployed?
[14:52:31] Emperor: could you point me to a server where you already ran the "unraid" cookbook (cumin1001 is gone, so no logs ;))?
[14:55:57] ah ms-be2044 maybe
[14:56:37] the logs are still around, Riccardo copied them over to the new host
[14:57:15] jayme: https://phabricator.wikimedia.org/T353523#9567737
[14:57:22] /var/log/cumin1001 on cumin1002
[14:57:47] roy649: #wikimedia-releng is a better place for discussing that
[14:57:59] jayme: I did a bunch of ms-be nodes recently, but then immediately decommissioned them.
[14:58:33] ah, thanks moritzm|voland - I just looked on 1002 not 2002
[14:58:40] taavi thanks, I'll pick it up over there.
[14:58:42] *volan.s
[14:58:48] Emperor: ah, damn :)
[14:58:49] jayme: if you look in cookbooks_testing/logs/sre/swift/ in my ~ on cumin2002 you'll find some more logs from developing/testing the cookbook
[14:59:16] I actually wanted to peek into how the raid controller config looks afterwards
[14:59:20] jayme: if you want to know what the end-state looks like then see e.g. ms-be1082
[14:59:35] the new-enough ms-be nodes are set up thus
[15:00:11] cool. I'll go look at ms-be1082. thanks
[15:02:45] !log Disabling meta-monitoring for the alert hosts - T333615
[15:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:59] T333615: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615
[16:02:55] hi folks. I want to rmdir an etcd directory because of a schema change (and removal of a custom schema).
[16:03:19] basically rmdir /conftool/v1/dnsbox. Puppet change has already been merged but: "conftool::cleanup: Not removing dnsbox objects" as expected
[16:03:40] I have never done this before so I wanted to check before doing it. Is it fine to do
[16:03:53] etcdctl -C https://conf1007.eqiad.wmnet:4001 rmdir /conftool/v1/dnsbox for example, or is there something else I should keep in mind?
[16:06:08] sukhe: from a cumin host?
[16:06:21] volans: yes
[16:07:00] I use sudo etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 ...
[16:07:01] DEPRECATED - "--endpoints" should be used instead
[16:07:14] (--peers/-C is deprecated)
[16:07:22] rmdir will remove it only if empty IIRC
[16:07:30] interesting
[16:08:03] rm -r is the recursive delete
[16:08:04] I can probably do rm --recursive
[16:08:20] yeah, but even then, I wanted to check before doing it and also get a +1 for it, here, since where else!
[16:08:24] are you sure the keys are not used?
[16:08:43] volans: not anywhere critical right now, no. and the change is ready to be deployed but first I want to clean up the old one
[16:08:51] I did an ls -r of /conftool/v1/dnsbox. and I know they are not used by pybal or spicerack
[16:08:52] I am quite confident they are not being used as it's bblack and me doing this
[16:09:03] then I guess +1
[16:09:20] cool, thank you for reviewing!
[16:10:07] sudo etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 rm /conftool/v1/dnsbox --recursive
[16:11:20] he
[16:11:21] Error: 110: The request requires user authentication (Insufficient credentials) [0]
[16:11:21] I would have put --recursive before the path, but hopefully the script is smart enough
[16:11:24] sudo
[16:16:03] yeah, something else is missing
[16:16:29] ?
[16:16:32] can't even rm a single key, let alone a directory
[16:17:05] sukhe@cumin2002:~$ sudo etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 rm /conftool/v1/dnsbox/ntp/dns1004.wikimedia.org
[16:17:08] Error: 110: The request requires user authentication (Insufficient credentials) [0]
[16:17:40] er the above key is incorrect but yeah, the correct one also doesn't work
[16:18:06] sukhe@cumin2002:~$ etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 get /conftool/v1/dnsbox/ntp/eqiad/dns1004.wikimedia.org
[16:18:09] {"pooled": "yes", "weight": 100, "ip": "208.80.154.6"}
[16:18:17] sorry checking my history
[16:18:42] have you tried sudo -i?
[16:18:50] the credentials are in root's home IIRC
[16:19:01] same error
[16:19:29] Hello everyone, I'm reimaging a host but the cookbook is taking longer than usual.
[16:19:52] It seems to be working on this step for almost an hour: [45/50, retrying in 135.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title alert2001 not found yet
[16:20:07] Does anybody know if this is expected?
[16:20:21] I'm worried the cookbook may be stuck.
[16:20:54] sukhe: if I run it with --username root and manually paste the password in root's .etcdc
[16:20:57] it works
[16:21:19] denisse: at which stage is it? dhcp, in d-i, first puppet run...
[16:21:36] denisse: that means that puppet is failing to compile the catalog
[16:21:47] volans: thanks, so same command as above?
[16:21:57] with --username root inside
[16:22:03] interesting
[16:22:07] ok, I guess I can live with it :)
[16:22:12] it's not reading the .etcdc
[16:22:36] volans: Thank you, I think it's in this stage: puppet agent -t --noop &> /dev/null
[16:22:57] It already generated the certificates.
[16:23:05] it's not at that stage, that one failed completely, I'm running it manually to check what's failing
[16:24:12] Thank you!
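(Editor's note: assembling the etcd cleanup thread above, the working invocation presumably ends up along these lines - etcdctl v2 syntax from a cumin host, with --username root as volans suggests; how the root password/credentials are supplied is not shown in the log.)
    # double-check what is under the directory before deleting anything
    sudo etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 --username root ls --recursive /conftool/v1/dnsbox
    # recursively delete the directory and all keys below it
    sudo etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 --username root rm --recursive /conftool/v1/dnsbox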
[16:24:44] mmmh the catalog compilation works
[16:24:47] Notice: Applied catalog in 52.50 seconds
[16:24:49] (noop)
[16:24:50] volans: thanks, appreciate the help!
[16:24:58] sukhe: no prob
[16:25:16] denisse: ahhh
[16:25:17] got it
[16:25:26] it's the snowflake of the passive alerting host
[16:25:30] it has no exported resources
[16:25:42] because of our broken puppetization of nagios_host
[16:25:54] :O
[16:26:55] unless you create on the fly an exported resource on puppetdb for alert2001 it will never continue with the current code, sorry
[16:27:36] the fact that we don't monitor a host is not supported :/
[16:28:06] Interesting. One question, what's broken with the puppetization of nagios_host?
[16:28:28] if it's an icinga host it doesn't generate exported resources for itself, because it generates "real resources"
[16:28:41] because the active icinga host monitors itself with real resources, not exported ones
[16:28:48] but the passive icinga host is a ghost
[16:28:52] not present in icinga at all
[16:28:55] it's not monitored
[16:30:08] see modules/monitoring/manifests/host.pp line 99+
[16:31:32] Oh, this makes sense. Thanks for the explanation.
[16:31:41] I'll document this on Wikitech.
[16:31:53] denisse: I see two options... 1) reimage it in insetup and then change the role to the current one
[16:32:13] 2) we can make a patch for the reimage cookbook to support this snowflake use case, but I'm still thinking of a clean one
[16:32:34] the noop is done to generate the exported resources and be able to downtime it right after
[16:32:39] to prevent noise during the first puppet run
[16:33:09] in this case you don't need either of those, but adding a --passive-icinga-host flag seems a bit of a too-specific option
[16:33:12] Yes, the first approach sounds good to proceed with the upgrade. But I'd like to explore 2 as well.
[16:33:14] but might be the only one doable
[16:33:19] It'd be my first contribution to a cookbook. :D
[16:34:16] for me 2 was a quick patch I could make right away but I'm not seeing a super clean one, if you want I can put a --passive-icinga-host flag in 5 minutes and just skip those steps
[16:37:57] denisse: patch in 3
[16:38:26] 2
[16:38:32] :)
[16:38:43] I didn't specify the unit of measure ;)
[16:39:39] volans: Nice, thank you!
[16:39:59] I also sent a patch for setting the role as insetup. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1006540
[16:40:03] well, I noticed that "It's the volans countdown" scanned, so you can all have the earworm too
[16:42:39] * volans trying a different approach, might have a better fix
[16:46:24] denisse: proper fix incoming
[16:48:15] denisse: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1006545
[16:48:32] you don't need the insetup dance
[16:49:50] volans: Wow, thank you very much for taking a look and for your help with the patch! :D
[16:50:21] * volans waiting for jenkins
[16:50:36] I had to test by querying puppetdb that it was actually working :D
[16:51:18] I appreciate the help!
[16:51:41] That bug was complicated to debug.
[16:52:26] anytime :)
[16:58:26] denisse: merged and deployed, you can retry, sorry for the trouble (but fixing that puppet snowflake would help too ;) )
[16:59:38] Thank you, I'm cleaning the node from Puppet's DB,
[16:59:52] the reimage does that for you :)
[17:01:26] Oh, but it's to reimage it as a new server and force Puppet 5.
[17:02:38] if it's in puppetdb it knows it should be puppet 5, unless I'm missing something
[17:02:49] or did the catalog not compile on p7?
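(Editor's note: the step the reimage cookbook was retrying above is a PuppetDB lookup for a Nagios_host resource named after the host. A minimal sketch of such a query, assuming a PuppetDB answering on localhost:8080 - in production the endpoint, port and TLS setup will differ.)
    # ask PuppetDB's v4 resources endpoint whether any Nagios_host resource titled "alert2001" exists
    curl -sG http://localhost:8080/pdb/query/v4/resources \
      --data-urlencode 'query=["and", ["=", "type", "Nagios_host"], ["=", "title", "alert2001"]]'
    # an empty JSON array means PuppetDB knows of no such resource (exported or real),
    # which is the state the passive alert host was stuck in before the cookbook fix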
[17:03:02] I forgot, but I remember we discussed this last week :D
[17:05:26] Yes, the current host uses Puppet 5 and we plan to do the Puppet upgrade after the Bookworm upgrade.
[17:06:02] But we were unable to reimage it to Puppet 5. So setting it up as --new with -p5 worked after removing the host from Puppet's DB, and allowed us to reimage it with p5.
[17:08:03] k
[17:26:06] denisse: how's it going?
[17:26:43] It seems to be working fine, it's doing its first Puppet run. :D
[17:28:15] yay
[18:03:10] ---
[18:03:23] there shouldn't be any issues as we are testing while we roll it out, but we are refactoring a bunch of things on the DNS hosts
[18:03:27] please ping me here if you see any issues, thank you
[19:15:41] what to do if the decom cookbook finishes with 0 but didn't remove the DNS record of the host in question, and the sync-netbox and sync-dns cookbooks both say nothing to commit?
[19:27:12] mutante: I am surprised that the IP is not being removed, but I think I would remove it from netbox (DNS name), just being very careful that it's just that IP
[19:32:55] hmm.. touching netbox directly got me in trouble before
[19:33:03] yeah :)
[20:11:56] Can anyone see the members in the discovery alerts list? I apparently don't have the perms. If you're able to see them, can you DM me with the list? https://lists.wikimedia.org/postorius/lists/discovery-alerts.lists.wikimedia.org/ Sorry to bug
[20:18:12] inflatador: [lists1001:~] $ sudo mailman-wrapper members discovery-alerts.lists.wikimedia.org
[20:22:39] excellent, thanks mutante !
[20:26:07] yw
[21:26:13] anybody happen to know what these ongoing errors from snapshot1010 are? https://logstash.wikimedia.org/goto/6e47b4196b9f54592fdcdc5c30c9d98d
[22:17:18] brennen: I don't know exactly. They're something to do with the XML dump of commonswiki that is in progress. https://dumps.wikimedia.org/commonswiki/20240220/
[22:22:29] there's a surface-level answer to that question, which is that there are 4 PHP processes with their stderr sent to a disconnected pipe, and MW is writing regular progress reports and getting EPIPE
[22:23:28] the processes have PPID=1
[22:24:08] but I don't know the deeper answer of why it is so or who is maintaining this
[22:26:03] I think it's likely that a.pergos has the most knowledge, but maintenance of these XML dumps is being transferred to the Data Engineering team with support from Data Platform SRE. So it's either on or near my plate.
[22:26:39] they have stime=Feb 21 and you can't fix that pipe or the warning without killing them
[22:27:52] although with no controlling process you have to wonder where the data is going
[22:29:28] There are some dumps triage notes in a Google Doc here: https://docs.google.com/document/d/1RuP3Qkla-UsOgvKcyK6Zbi8XOmeWx00KUMBEwbNNLG0/edit#heading=h.pek48iwt50fs mentioning this seemingly stuck commonswiki job. Also x.collazo is on the Dumps 1.x triage rota. I will pass the details to him. Is the volume of error logs causing anyone any issues?
[22:34:05] I believe that this issue is already being tracked here: https://phabricator.wikimedia.org/T358458
[22:39:11] right, the timing is very clear, xcollazo killed the python process which was the parent of the PHP processes, leading to these errors
[22:40:20] I'll just kill them then
[22:40:48] By all means, thanks.
[23:13:20] TimStarling: If a python process forks a PHP process, don't you risk an irreparable rift in the space-time continuum?
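(Editor's note: a small sketch of how the snapshot1010 situation described above could be confirmed - orphaned PHP dump workers reparented to init, writing progress to a pipe with no reader. The PID is a placeholder for whatever the first command reports.)
    # list PHP processes whose parent is init (PPID 1), with their start time
    ps -eo pid,ppid,lstart,cmd | awk '$2 == 1' | grep '[p]hp'
    # see where a given process's stderr (fd 2) points; a pipe whose reading end is gone yields EPIPE on write
    sudo ls -l /proc/<PID>/fd/2
    # once confirmed they are the stuck dump workers, terminate them
    sudo kill <PID>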