[12:45:21] Hello dear SRE team!
[12:47:55] For your information, I created an RPM spec file for packaging the varnish-kafka package for RHEL-based distros. We use it in our infrastructure and it is proven to work. I submitted a PR with exactly that. If you are interested, feel free to merge; otherwise, feel free to reject :) https://gitlab.wikimedia.org/repos/sre/varnishkafka/-/merge_requests/1
[12:48:51] (also, feel free to shoot me on sight for any errors I made)
[13:15:14] Would someone with admin permissions on Cloud VPS be willing to attempt rebooting Logstash on the beta cluster for T350786?
[13:15:14] T350786: No entries at all in beta-logs.wmcloud.org since 2023-11-06 Z 12:15:39 - https://phabricator.wikimedia.org/T350786
[13:21:01] kostajh: do you know the hostname? I don't see anything with logstash in the name in deployment-prep
[13:22:25] jbond: in the past it was deployment-logstash03, according to T274593
[13:22:25] T274593: Logstash beta is not getting any events - https://phabricator.wikimedia.org/T274593
[13:23:06] jbond: maybe https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Cloud_VPS_alert_Puppet_failure_on helps
[13:24:29] kostajh: unfortunately there are no hosts matching /deployment-logstash.*/
[13:25:18] I also don't see beta-logs.wmcloud.org as one of the domains or proxies in that project
[13:25:52] perhaps try asking in #w-cloud; they may have more info, and at the very least they should be able to help locate which host/project gets the traffic for beta-logs.wmcloud.org
[13:26:29] ok, thanks jbond
[13:27:01] is it possible logstash is running on deployment-deploy03?
[13:27:12] one sec, I can check
[13:27:42] beta-logs.wmcloud.org is in https://openstack-browser.toolforge.org/project/logging, fwiw
[13:27:45] ps aux says no
[13:28:00] kostajh: see above, looks like it's in toolforge, not deployment-prep
[13:28:03] thanks taavi
[13:28:16] sorry, the logging project
[13:28:22] (you can check it via https://openstack-browser.toolforge.org/proxy/)
[13:28:28] no, it's in the logging Cloud VPS project
[13:29:33] kostajh: I have run the following
[13:29:34] root@logging-logstash-02:~# systemctl restart logstash.service
[13:30:33] jbond: thanks! could you please add a note on T350786 with the hostname you ssh'ed to, and the command, for the next time this might happen?
[13:30:42] sure
[13:30:52] assuming this fixes the issue, ofc
[13:30:57] T350786: No entries at all in beta-logs.wmcloud.org since 2023-11-06 Z 12:15:39 - https://phabricator.wikimedia.org/T350786
[13:34:59] jbond: I don't see any logs yet.
[13:35:53] would you mind checking the logs on the server, in case logstash is outputting something there?
[13:39:12] * jbond looking
[13:40:22] kostajh: logstash is in a crash loop with the following error: https://phabricator.wikimedia.org/P53517
[13:42:20] aha, now we are getting somewhere :)
[14:01:30] <_joe_> MrBleu: thanks a lot, I'll bring it to the attention of the right people :)
[14:29:16] heads-up: I'm about to ship https://gerrit.wikimedia.org/r/c/operations/puppet/+/974500, which automates the generation of the subnet DHCP config files from hiera data. If something looks wrong in the near future, scream and I'll investigate. Thanks!
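
A minimal sketch of how the logstash crash loop above could be inspected on logging-logstash-02, assuming standard systemd tooling; the instance and unit names come from the conversation, and the exact invocations are illustrative rather than a record of what was run:

    systemctl status logstash.service                 # a crash loop shows up as "activating (auto-restart)"
    journalctl -u logstash.service -n 100 --no-pager  # recent unit output, including the crash reason
    systemctl restart logstash.service                # the restart that was actually run above
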
[16:17:55] moritzm: it looks like cumin2002 (already migrated to Puppet 7) has some issues running cookbooks involving ganeti
[16:18:14] brett hit https://paste.debian.net/plainh/d3e1253d a few minutes ago
[16:18:21] looking good from cumin1001 though
[16:19:52] oh I see, sounds like we're hitting https://phabricator.wikimedia.org/T350686
[16:20:06] just use cumin1001 until this is fixed
[16:20:15] ack
[16:48:03] hello on-callers, I am going to complete the rollout of changeprop on nodejs-18
[16:48:20] it will be both cp and cp-jobqueues in codfw, so the instances taking most of the traffic
[16:48:42] we already know that cpu usage will increase a bit, but so far everything has worked as expected
[16:49:02] in case of trouble (high backlog in jobs, etc.), https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/974476 needs to be reverted
[16:59:26] can you do that in 2 minutes? I'll not be on call anymore :D
[16:59:38] (joking, thanks for the info)
[16:59:49] fabfur: that would be true, but you work till midnight anyway so :P
[17:01:08] elukey: out of curiosity, did you ever find out why the CPU usage increased? Was it due to the ticker operation becoming more expensive?
[17:01:34] I remember reading a phab comment with a perf output, but forgot the specifics
[17:02:31] brouberol: I didn't find a clear cause; after more flame graphs it seemed that most of the cpu usage was related to the new version of librdkafka (and how it interacts with nodejs 18 timers)
[17:04:10] the changeprop code hasn't changed in years; we fixed little things here and there, but it would need a serious refactor
[17:04:46] I keep saying that it is really sad that a tool so important for us is not owned/maintained by a proper dev team
[17:05:02] we don't even have a future plan to replace it, etc.
[17:05:41] I am helping with the upgrade only because we cannot really run nodejs10 on stretch with librdkafka from years ago in production :D
[17:06:16] understood, thanks
[17:06:19] ServiceOps and Hugh have maintained cp up to now, but it is clearly out of scope..
[17:15:03] deployments done!
[17:15:08] so far I don't see any issues
[17:17:18] (going afk, but I'll check later)
[17:53:58] thanks elukey!
[20:19:11] is there anyone around whose software raid-fu is stronger than mine?
[20:23:32] I have a host with a raid10 array that had an ssd fail right when it was booted into d-i for a reimage. The ssd was replaced, and the reimage completed (using a partman recipe meant to preserve everything but the root partition)... but my assumption that I could just leave the array be and sort it out afterward was probably unwise
[20:24:27] It looks like the array has been assembled with the new ssd
[20:24:39] https://www.irccloud.com/pastebin/5T8LvY0Z/
[20:24:51] but it won't mount
[20:25:48] I'm hoping it can be saved
[20:40:53] urandom: based on the "resyncing (PENDING)" I think it's in "auto-read-only"
[20:41:55] mutante: yeah, that sounds right
[20:42:10] https://www.irccloud.com/pastebin/HJFr63ix/
[20:43:07] https://unix.stackexchange.com/questions/101072/new-md-array-is-auto-read-only-and-has-resync-pending
[20:43:23] see the top answer there
[20:43:59] "..will automatically switch from auto-read-only to read-write when it receives its first write.. only reason you'd need to run mdadm --readwrite on it is if you want it to sync before you perform any writes."
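
A minimal sketch of the auto-read-only state discussed above, assuming the array is /dev/md2 (the name appears in the dmesg lines later in the log); these are generic md tooling commands, not the exact ones run on the affected host:

    cat /proc/mdstat            # the pending resync shows up as "resync=PENDING" while auto-read-only
    mdadm --detail /dev/md2     # reports the array state, e.g. "clean, resyncing (PENDING)"
    mdadm --readwrite /dev/md2  # only needed to start the resync before the first write arrives
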
[20:44:45] if you set it --readwrite like that, it should actually start the sync
[20:44:50] yeah, I actually came across the same
[20:45:40] but it also makes me think you should be able to mount it, just read-only?
[20:45:41] so... from the answer provided, it sounds like you wouldn't need to do that
[20:46:06] it made me think I should be able to mount it, and that once I did and something was written, it would go read-write
[20:46:15] only if something would write to it, like if you created a filesystem, though?
[20:46:33] it doesn't seem like it has a filesystem, and it should
[20:46:45] or rather, if it doesn't, then I've failed :)
[20:47:17] https://www.irccloud.com/pastebin/zntW3Iux/
[20:49:49] I guess what I should have done is had the ssd replaced, then booted into a rescue image to handle re-adding it, and then done the reimage
[20:50:45] I assumed that since d-i wasn't doing anything with that array, I could just wait until afterward... but I guess it assembled it from the constituent devices, one of which was "new"?
[20:51:41] ...not even new, it was taken from a decomm'd server.
[20:53:01] so I'm thinking that the contents are gone, and that I'll have to reformat... but I want to be sure
[20:54:00] urandom: maybe there is still hope because of this
[20:54:02] [Thu Nov 16 20:35:09 2023] md/raid10:md2: not clean -- starting background reconstruction
[20:54:10] [Thu Nov 16 20:35:09 2023] md2: detected capacity change from 0 to 3790495285248
[20:54:25] those are the last lines from dmesg
[20:54:44] ~20 min ago
[20:59:33] I don't think it has a filesystem :/
[20:59:50] which feels like game over
[21:05:42] checked bacula, but no aqs machines in it. did you have an actual loss, or.. is it just about having to reinstall?
[21:08:31] urandom: I think you are right about that.. no filesystem.. the output of "lsblk -f" shows just "md2", and on another host "md2 ext4"
[21:08:55] it's recoverable
[21:09:19] not pleasantly, and it will take time, but it can be recovered
[21:09:35] gotcha
[22:16:47] is anyone using envoy to do outbound rate limiting? Just wondering if it's possible/easy (bare-metal hosts, if that matters)
[22:43:47] I added a new function to wmflib. You can now use wmflib::debian_php_version() to get the distro PHP version. No more repetition of this common pattern: https://gerrit.wikimedia.org/r/c/operations/puppet/+/974285/9/modules/profile/manifests/phabricator/httpd.pp
[22:47:11] (or the hardcoded "$php_module = 'php7.3'" lines)
[23:23:05] cccccclrjgugfltkvdirijrgrdjjkigetrhnlunlnklh
[23:23:13] bingo!
[23:23:14] oops, typo
[23:23:15] I mean: hi
[23:35:30] {◕ ◡ ◕}
[23:38:03] How do you spot a Yubikey user? They willcccccclrjtellyou
[23:47:15] and here I was thinking you just keysmashed :)
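
A minimal sketch of the kind of check behind the "no filesystem" conclusion in the raid thread above, again assuming /dev/md2; these are generic filesystem-signature checks, not a record of the exact commands used:

    lsblk -f /dev/md2   # the FSTYPE column stays empty when no filesystem is detected
    blkid /dev/md2      # prints nothing and exits non-zero if no signature is found
    file -s /dev/md2    # reads the start of the block device directly
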