[08:36:09] hi there, the secondary deployment server deploy1002.eqiad.wmnet is not reachable
[08:36:30] scap uses it as an extra host to sync out the train and our presync failed last night
[08:36:49] anyone here that could take a look?
[08:40:30] Hi! Yes lemme check
[08:40:57] * hashar blames firewalls
[08:41:58] the host is up but I can't ping say deploy2002
[08:42:56] and ethtool says that there is no link
[08:43:46] ahhh [Mon Oct 23 15:24:51 2023] tg3 0000:04:00.0 eno1: renamed from eth0
[08:43:49] lovely
[08:44:24] mmm no I thought /etc/network/interfaces was stale
[08:44:27] it is correct, but
[08:44:28] [Mon Oct 23 15:25:02 2023] IPv6: ADDRCONF(NETDEV_UP): eno1: link is not ready
[08:48:31] jnuche: afaics it seems that something is wrong on the DC side
[08:49:10] maybe the ethernet cable is misplaced or similar
[08:49:17] (or needs to be replaced)
[08:49:39] I can open a task, but we'd need to wait for SREs in dcops-eqiad to be in the DC
[08:50:30] elukey: ah damn, yes please go ahead and open a task
[08:50:37] and thanks a lot for looking into it
[08:53:09] serviceops, ops-eqiad: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (elukey)
[08:53:12] jnuche: --^
[08:53:33] elukey: thank you!
[08:56:17] elukey: can't the interface renaming be caused by a kernel upgrade?
[08:56:27] serviceops, ops-eqiad: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (hashar)
[08:56:36] or some software issue rather than a cable/hardware trouble?
[08:57:13] deploy1002 was moved yesterday as a part of T308339
[08:57:29] oh joy
[08:58:02] that will do it
[09:00:02] the switch port is disabled.. that's an easy fix thankfully
[09:00:05] serviceops, ops-eqiad: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (hashar)
[09:00:27] serviceops, ops-eqiad: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (hashar)
[09:01:14] like administratively disabled / not enabled rather than a cable disconnected?
[09:01:20] hashar: so /etc/network/interfaces is up-to-date, I tried ifdown/ifup etc.. and they work.
[09:01:41] yes, I'm fixing that
[09:01:46] taavi: nice thanks!
[09:01:55] you have access to network devices? :-]
[09:02:11] that is handy
[09:03:14] hmmm homer shows a diff on cp1110 and cp1111 which I don't think I want to touch
[09:03:44] * taavi runs the cookbook instead
[09:04:24] TIL configure-network-interfaces, it has been a while since I've done it (and it was manual)
[09:05:17] deploy1002 reachable again :)
[09:05:22] * elukey bbiab
[09:05:44] my team deals with all kinds of weird network stuff these days, kind of forces you to learn all of the tools :P
[09:08:45] I am quite happy to not have to deal with networking things anymore ;-]
[09:09:50] * hashar runs Puppet on deploy1002
[09:10:04] serviceops, ops-eqiad: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (taavi) The host was moved in {T308339} but the switch ports were not updated. I ran `sre.network.configure-switch-interface` to configure the port as Homer was showing an unrelated diff. That does mean that t...
[09:10:52] taavi, elukey: thank you! :)
[09:23:18] ok, going to rerun the presync now
[09:26:57] now the train-blockers tool in toolforge is broke, ouch: 500 Server Error: Internal Server Error for url: https://train-blockers.toolforge.org/api.php
[09:36:17] seems like it's back
[11:18:41] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (jijiki)
[12:24:58] serviceops, RESTBase Sunsetting, Code-Health-Objective, Data Products (Data Products (Sprint 03)), Patch-For-Review: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 (WDoranWMF)
[13:40:53] hiya, dcausse just did a rolling restart of eventgate-main according to these instructions https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_restart
[13:41:03] is that still the recommended way to do this?
[13:41:25] I see some references to kubectl rollout restart in docs online, but that might be just for newer k8s
[13:46:20] watching the pods I had the impression that they restarted all at once, looking at logstash we seem to have ~
[13:46:52] 6k failures with "JobQueueError: Could not enqueue jobs" during the restart
[13:49:50] i'm not sure, but it kind of looks like the readiness probe failed its first attempts, but the old pods were killed anyway? Old pods killed before new pods are ready?
[13:51:26] I've never used the helmfile cmd, I have used the kubectl cmd but I think that requires root privileges
[14:29:01] it's still the way to go I'd say
[14:29:55] but, as with kubectl rollout, it will roll-restart according to the configured policy
[15:26:16] serviceops, Data-Engineering, Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (elukey) I tried to profile nodejs' code via `-prof` and `--prof-process` to have a better view of the CPU usage. I tried first with `perf` but I didn't obtain useful info...
[15:27:01] folks I added some thoughts in https://phabricator.wikimedia.org/T348950#9276824 to the changeprop's increase in cpu usage
[15:27:20] there are some details about what I checked etc..
[15:27:35] if you have time let me know what you think about it, or if you have a preference
[15:45:45] did you have the chance to run the same profiling with the old version to maybe see if that has changed?
[15:53:21] I did not, I can try tomorrow (after a rollback)
[17:19:10] if anyone is feeling brave, thumbor is ~ready to be upgraded to bullseye https://gerrit.wikimedia.org/r/920760
[17:22:08] serviceops, Thumbor, Patch-For-Review: Upgrade Thumbor to bullseye - https://phabricator.wikimedia.org/T336881 (hnowlan) The review for moving Thumbor to bullseye is ready for review - we've had to make a variety of changes as patches to imagemagick lead to slightly different outputs for generated PN...
[17:26:59] elukey: I have a very uninformed theory that we are seeing the impact of improved concurrency in nodejs and we were previously failing to poll as much as we were previously configured or some kind of positive downside of the upgrade. No evidence for this though
[17:30:22] but that combined with changeprop's overall mode of operation does make me tempted to see how well/badly the new version would cope with actual production traffic. It should be immediately obvious
[18:05:02] serviceops, CirrusSearch, MediaWiki-Configuration, MediaWiki-Engineering, Discovery-Search (Current work): Provide a method for internal services to run api requests in a private context - https://phabricator.wikimedia.org/T345185 (Tgr) >>! In T345185#9203965, @daniel wrote: > The [[https://w...