[12:45:21] Hello dear SRE team!
[12:47:55] For your information, I created an RPM spec file for packaging the varnish-kafka package for RHEL-based distros. We use it in our infrastructure and it is proven to work. I submitted a PR with exactly that. If you are interested, feel free to merge; otherwise, feel free to reject :) https://gitlab.wikimedia.org/repos/sre/varnishkafka/-/merge_requests/1
[12:48:51] (also, feel free to shoot me on sight for any errors I made)
[13:15:14] Would someone with admin permissions on Cloud VPS be willing to attempt rebooting Logstash on the beta cluster for T350786?
[13:15:14] T350786: No entries at all in beta-logs.wmcloud.org since 2023-11-06 Z 12:15:39 - https://phabricator.wikimedia.org/T350786
[13:21:01] kostajh: do you know the hostname? I don't see anything with logstash in the name in deployment-prep
[13:22:25] jbond: in the past it was deployment-logstash03, according to T274593
[13:22:25] T274593: Logstash beta is not getting any events - https://phabricator.wikimedia.org/T274593
[13:23:06] jbond: maybe https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Cloud_VPS_alert_Puppet_failure_on helps
[13:24:29] kostajh: unfortunately there are no hosts matching /deployment-logstash.*/
[13:25:18] I also don't see beta-logs.wmcloud.org as one of the domains or proxies in that project
[13:25:52] perhaps try asking in #w-cloud; they may have more info, and at the very least they should be able to help locate which host/project gets the traffic for beta-logs.wmcloud.org
[13:26:29] ok, thanks jbond
[13:27:01] is it possible logstash is running on deployment-deploy03?
[13:27:12] one sec, I can check
[13:27:42] beta-logs.wmcloud.org is in https://openstack-browser.toolforge.org/project/logging, fwiw
[13:27:45] ps aux says no
[13:28:00] kostajh: see above, looks like it's in toolforge, not deployment-prep
[13:28:03] thanks taavi
[13:28:16] sorry, the logging project
[13:28:22] (you can check it via https://openstack-browser.toolforge.org/proxy/)
[13:28:28] no, it's in the logging Cloud VPS project
[13:29:33] kostajh: I have run the following
[13:29:34] root@logging-logstash-02:~# systemctl restart logstash.service
[13:30:33] jbond: thanks! could you please add a note on T350786 with the hostname you ssh'ed to, and the command, for the next time this might happen?
[13:30:42] sure
[13:30:52] assuming this fixes the issue, ofc
[13:30:57] T350786: No entries at all in beta-logs.wmcloud.org since 2023-11-06 Z 12:15:39 - https://phabricator.wikimedia.org/T350786
[13:34:59] jbond: I don't see any logs yet.
[13:35:53] would you mind checking the logs on the server, in case logstash is outputting something there?
[13:39:12] * jbond looking
[13:40:22] kostajh: logstash is in a crash loop with the following error: https://phabricator.wikimedia.org/P53517
[13:42:20] aha, now we are getting somewhere :)
[14:01:30] <_joe_> MrBleu: thanks a lot, I'll bring it to the attention of the right people :)
[14:29:16] heads-up: I'm about to ship https://gerrit.wikimedia.org/r/c/operations/puppet/+/974500, which automates the generation of the subnet DHCP config files from hiera data. If something looks wrong in the near future, scream and I'll investigate. Thanks!
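
A minimal sketch of how the logstash crash loop above could be inspected on logging-logstash-02, assuming standard systemd tooling; the instance and unit names come from the conversation, and the exact invocations are illustrative rather than a record of what was run:

    systemctl status logstash.service                 # a crash loop shows up as "activating (auto-restart)"
    journalctl -u logstash.service -n 100 --no-pager  # recent unit output, including the crash reason
    systemctl restart logstash.service                # the restart that was actually run above
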
[16:17:55] moritzm: it looks like cumin2002 (already migrated to Puppet 7) has some issues running cookbooks involving ganeti
[16:18:14] brett hit https://paste.debian.net/plainh/d3e1253d a few minutes ago
[16:18:21] looking good from cumin1001 though
[16:19:52] oh I see, sounds like we're hitting https://phabricator.wikimedia.org/T350686
[16:20:06] just use cumin1001 until this is fixed
[16:20:15] ack
[16:48:03] hello on-callers, I am going to complete the rollout of changeprop on nodejs-18
[16:48:20] it will be both cp and cp-jobqueues in codfw, so the instances taking most of the traffic
[16:48:42] we already know that cpu usage will increase a bit, but so far everything has worked as expected
[16:49:02] in case of trouble (high backlog in jobs, etc.), https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/974476 needs to be reverted
[16:59:26] can you do that in 2 minutes? I'll not be on call anymore :D
[16:59:38] (joking, thanks for the info)
[16:59:49] fabfur: that would be true, but you work till midnight anyway so :P
[17:01:08] elukey: out of curiosity, did you ever find out why the CPU usage increased? Was it due to the ticker operation becoming more expensive?
[17:01:34] I remember reading a phab comment with a perf output, but forgot the specifics
[17:02:31] brouberol: I didn't find a clear cause; after more flame graphs it seemed that most of the cpu usage was related to the new version of librdkafka (and how it interacts with nodejs 18 timers)
[17:04:10] the changeprop code hasn't changed in years; we fixed little things here and there, but it would need a serious refactor
[17:04:46] I keep saying that it is really sad that a tool so important for us is not owned/maintained by a proper dev team
[17:05:02] we don't even have a future plan to replace it, etc.
[17:05:41] I am helping with the upgrade only because we cannot really run nodejs10 on stretch with librdkafka from years ago in production :D
[17:06:16] understood, thanks
[17:06:19] ServiceOps and Hugh have maintained cp up to now, but it is clearly out of scope..
[17:15:03] deployments done!
[17:15:08] so far I don't see any issues
[17:17:18] (going afk, but I'll check later)
[17:53:58] thanks elukey!
[20:19:11] is there anyone around whose software raid-fu is stronger than mine?
[20:23:32] I have a host with a raid10 array that had an ssd fail right when it was booted into d-i for a reimage. The ssd was replaced, and the reimage completed (using a partman recipe meant to preserve everything but the root partition)... but my assumption that I could just leave the array be and sort it out afterward was probably unwise
[20:24:27] It looks like the array has been assembled with the new ssd
[20:24:39] https://www.irccloud.com/pastebin/5T8LvY0Z/
[20:24:51] but it won't mount
[20:25:48] I'm hoping it can be saved
[20:40:53] urandom: based on the "resyncing (PENDING)" I think it's in "auto-read-only"
[20:41:55] mutante: yeah, that sounds right
[20:42:10] https://www.irccloud.com/pastebin/HJFr63ix/
[20:43:07] https://unix.stackexchange.com/questions/101072/new-md-array-is-auto-read-only-and-has-resync-pending
[20:43:23] see the top answer there
[20:43:59] "..will automatically switch from auto-read-only to read-write when it receives its first write.. only reason you'd need to run mdadm --readwrite on it is if you want it to sync before you perform any writes."
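
A minimal sketch of the auto-read-only state discussed above, assuming the array is /dev/md2 (the name appears in the dmesg lines later in the log); these are generic md tooling commands, not the exact ones run on the affected host:

    cat /proc/mdstat            # the pending resync shows up as "resync=PENDING" while auto-read-only
    mdadm --detail /dev/md2     # reports the array state, e.g. "clean, resyncing (PENDING)"
    mdadm --readwrite /dev/md2  # only needed to start the resync before the first write arrives
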
[20:44:45] if you set it --readwrite like that, it should actually start the sync
[20:44:50] yeah, I actually came across the same
[20:45:40] but it also makes me think you should be able to mount it, just read-only?
[20:45:41] so... from the answer provided, it sounds like you wouldn't need to do that
[20:46:06] it made me think I should be able to mount it, and that once I did and something was written, it would go read-write
[20:46:15] only if something would write to it, like if you created a filesystem, though?
[20:46:33] it doesn't seem like it has a filesystem, and it should
[20:46:45] or rather, if it doesn't, then I've failed :)
[20:47:17] https://www.irccloud.com/pastebin/zntW3Iux/
[20:49:49] I guess what I should have done is had the ssd replaced, then booted into a rescue image to handle re-adding it, and then done the reimage
[20:50:45] I assumed that since d-i wasn't doing anything with that array, I could just wait until afterward... but I guess it assembled it from the constituent devices, one of which was "new"?
[20:51:41] ...not even new, it was taken from a decomm'd server.
[20:53:01] so I'm thinking that the contents are gone, and that I'll have to reformat... but I want to be sure
[20:54:00] urandom: maybe there is still hope because of this
[20:54:02] [Thu Nov 16 20:35:09 2023] md/raid10:md2: not clean -- starting background reconstruction
[20:54:10] [Thu Nov 16 20:35:09 2023] md2: detected capacity change from 0 to 3790495285248
[20:54:25] those are the last lines from dmesg
[20:54:44] ~20 min ago
[20:59:33] I don't think it has a filesystem :/
[20:59:50] which feels like game over
[21:05:42] checked bacula, but no aqs machines in it. did you have an actual loss, or.. is it just about having to reinstall?
[21:08:31] urandom: I think you are right about that.. no filesystem.. the output of "lsblk -f" shows just "md2", and on another host "md2 ext4"
[21:08:55] it's recoverable
[21:09:19] not pleasantly, and it will take time, but it can be recovered
[21:09:35] gotcha
[22:16:47] is anyone using envoy to do outbound rate limiting? Just wondering if it's possible/easy (bare-metal hosts, if that matters)
[22:43:47] I added a new function to wmflib. You can now use wmflib::debian_php_version() to get the distro PHP version. No more repetition of this common pattern: https://gerrit.wikimedia.org/r/c/operations/puppet/+/974285/9/modules/profile/manifests/phabricator/httpd.pp
[22:47:11] (or the hardcoded "$php_module = 'php7.3'" lines)
[23:23:05] cccccclrjgugfltkvdirijrgrdjjkigetrhnlunlnklh
[23:23:13] bingo!
[23:23:14] oops, typo
[23:23:15] I mean: hi
[23:35:30] {◕ ◡ ◕}
[23:38:03] How do you spot a Yubikey user? They willcccccclrjtellyou
[23:47:15] and here I was thinking you just keysmashed :)
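
A minimal sketch of the kind of check behind the "no filesystem" conclusion in the raid thread above, again assuming /dev/md2; these are generic filesystem-signature checks, not a record of the exact commands used:

    lsblk -f /dev/md2   # the FSTYPE column stays empty when no filesystem is detected
    blkid /dev/md2      # prints nothing and exits non-zero if no signature is found
    file -s /dev/md2    # reads the start of the block device directly
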