[06:43:29] netops, Infrastructure-Foundations: Upgrade Junos 20 switches - https://phabricator.wikimedia.org/T390813 (ayounsi) NEW
[07:03:40] netops, DC-Ops, Infrastructure-Foundations: Upgrade management switches to Junos 21.4 - https://phabricator.wikimedia.org/T390814 (ayounsi) NEW
[07:04:34] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10702309 (ayounsi) I went to open a JTAC case for the non-working msw but they're all Out Of Support, I opened {T390814} to track their upgrade.
[07:51:24] netops, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#10702393 (ayounsi)
[08:10:46] moritzm: I'll need to trunk the sandbox vlan on the eqiad row B ganeti fyi
[08:14:26] sure, just let me know which one to start with and we can drain them piece by piece
[08:18:13] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10702500 (ayounsi) Opened JTAC case 2025-0402-657200 for the SRXs.
[08:18:53] cool, will do
[08:35:00] moritzm: ganeti1036
[08:35:09] https://netbox.wikimedia.org/dcim/devices/?location_id=6&q=ganeti&sort=name&status=active
[08:48:25] FIRING: SystemdUnitFailed: nic-saturation-exporter.service on ganeti6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:52:45] XioNoX: ganeti1036 is ready
[08:53:25] RESOLVED: SystemdUnitFailed: nic-saturation-exporter.service on ganeti6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:57:04] netops, Infrastructure-Foundations, SRE, IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10702638 (aborrero) >>! In T389958#10683594, @cmooney wrote: > @aborrero @taavi one thing we could maybe try, if we wanted to make progress sooner (i.e. with...
[08:57:47] thx
[09:00:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on puppetboard2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:04:55] FIRING: MaxConntrack: Max conntrack at 80.63% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[09:09:55] RESOLVED: MaxConntrack: Max conntrack at 80.63% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[09:10:24] moritzm: it's working fine with ifup, should I restart it?
[09:11:31] yeah, these are drained, so good to reboot
[09:12:18] but don't we also need to run sre.network.configure-switch-interfaces for each host?
[09:15:00] moritzm: I updated netbox for all of them then ran homer for the whole switch, so it's the same end result (that change is hitless)
[09:15:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on puppetboard2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:16:18] ack!
[09:21:28] moritzm: back up
[09:23:46] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10702760 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=18436b96-18d3-4109-9dbe-088b91594c7c) set by ayounsi@cumin1002 for 0:30:00 on 1 host(s) and their services with re...
[09:26:38] topranks: https://supportportal.juniper.net/s/article/JunOS-Telemetry-How-to-use-gnmigrpc-to-get-the-devices-running-config Juniper started giving examples with gnmic :)
[09:27:34] nice :)
[09:36:13] topranks, XioNoX: FYI I'm making the homer release
[09:36:49] volans: should we include https://gerrit.wikimedia.org/r/c/operations/software/homer/+/1124437 or is it scope creep? :)
[09:37:29] I got errors when I tried to use that from my laptop for some reason
[09:37:31] doesn't pass CI, merge conflict... let's include it in the next :D I've already sent the patch for the release
[09:37:38] but I could have messed something else up also
[09:40:25] hahahahah, Colt circuit is down, I went to the Colt portal to open a ticket, and got "Your account has been disabled due to inactivity. Please contact Support."
[09:40:45] are they expecting us to open a trouble ticket every week or ?
[09:41:46] gosh
[09:45:37] calling them to open a ticket....
[09:45:46] like we're in the 90s
[09:45:49] :facepalm:
[09:45:50] send a fax
[09:50:35] alright, ticket open and they emailed noc@
[09:50:50] ahahah
[09:50:55] no fax?
[09:50:58] I'm disappointed
[09:56:57] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10702937 (ayounsi) JTAC asked us to reboot it. It didn't help.
[10:00:34] XioNoX: let me know when I should get 1039 ready
[10:00:59] moritzm: anytime
[10:01:43] ok, I'll drain it in ~ 10m, will ping you when it's ready
[10:04:58] topranks: let me know if I can help with the errors you're getting btw
[10:06:18] netops, Data-Platform-SRE, Infrastructure-Foundations, SRE: Classify ceph traffic flows for network prioritization - https://phabricator.wikimedia.org/T390044#10702957 (ayounsi)
[10:07:09] XioNoX: thanks! I'll do a proper re-test now that I have things working, it may just have been because of something else I'd modified in my attempt to overcome the other problem
[10:09:47] XioNoX: that's hilarious about Colt
[10:10:09] I found myself locked out of the Lumen portal yesterday and had to call them
[10:10:29] it was actually fine tbh fairly simple
[10:10:55] also thanks... I was being lazy hoping it'd magic itself back
[10:26:46] XioNoX: ganeti1039 is ready
[10:28:01] cool, might be after lunch
[10:37:37] the "MaxConntrack: Max conntrack at foo % on krb1001:9100" should now be gone
[10:39:29] \o/
[10:42:08] netops, Infrastructure-Foundations, SRE, IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10703121 (cmooney) >>! In T389958#10702638, @aborrero wrote: > Yes, lets try with the static routes. Thanks! Thanks Arturo - can we arrange a window for thi...
[11:26:02] netops, Infrastructure-Foundations, SRE, IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10703215 (aborrero) >>! In T389958#10703121, @cmooney wrote: >>>! In T389958#10702638, @aborrero wrote: >> Yes, lets try with the static routes. Thanks! > >...
[11:49:27] moritzm: all good with ganeti1039
[11:59:01] slyngs: I am confused as to how Bitu is deployed. CI builds a Docker image using pipelinelib but there is a Debian changelog maintained, which would imply that it is using a Debian package for deployment?
[11:59:34] slyngs: the reason I ask is CI should most probably build the Debian package whenever a file is touched under the ./debian directory
[11:59:48] Correct, the production deployment is currently a Debian package. The Docker image is used for integration tests with other systems.
[11:59:57] AHH
[11:59:59] clever :)
[12:00:10] It would be nice to have the deb package built automatically.
[12:00:41] Not sure it's clever, but it's an option :-)
[12:01:05] netops, Infrastructure-Foundations, SRE, IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10703334 (aborrero)
[12:07:02] XioNoX: ganeti1040 is ready
[12:08:41] slyngs: https://gerrit.wikimedia.org/r/c/integration/config/+/1133369 that would build the deb package and errors will be ignored (non-voting)
[12:09:34] Where does it put the package?
[12:10:45] XioNoX: I see Colt is back up, did they come back with any info on what happened?
[12:10:51] homer release: https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1133370
[12:12:00] topranks: Their email says to expect an update within 2 hours
[12:13:27] slyngs: the build artifacts are attached to the CI Jenkins build :)
[12:13:33] they are not published anywhere
[12:13:38] excellent
[12:20:52] slyngs: for deb automatic pipeline you might want to look at https://wikitech.wikimedia.org/wiki/GitLab/Gitlab_Runner/Trusted_Runners
[12:42:37] topranks: "Following my investigation, I have identified alarms in the NA portion of the circuit. I have engaged our NA provider to carry out further investigations." see emails to noc@
[12:42:51] those providers that need to go check to see alarms
[12:42:57] yeah I seen that alright, then it came up 5 mins later
[12:43:48] all good I guess though :)
[13:27:00] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10703649 (ayounsi) Bad news, JTAC told me that gNMI is not supported on SRX300 (or any branch level SRX) nor EX4300. Some pointers : https://apps.juniper.net/feature-explorer/feature/4332...
[13:38:29] moritzm: 1040 is back up
[13:40:01] I'll get 1041 ready
[13:40:19] thx
[13:47:24] XioNoX: 1041 is ready. there are two nodes running there which don't use DRBD due to latency (dse-k8s-etcd1002 and kubestagemaster1004), these are redundant on the service level, so when you reboot 1041 eventually, they will go down, but that's okay
[13:49:32] moritzm: ok
[13:49:43] rebooting
[13:51:47] XioNoX, topranks: I'm ready to deploy homer 0.8.0 to the cumin hosts, lmk when is a good moment
[13:58:20] volans: now is a good moment
[13:59:49] ok merging the plugin one too and deploying
[14:02:33] getting permission denied in cleaning up the old venv, weird, fixing manually for now, to be seen if needs tweaking in the deploy script
[14:04:12] moritzm: done with 1041
[14:06:26] deploy completed, running homer diff '*'
[14:06:39] wow, INFO:homer.devices:Initialized 100 devices
[14:06:43] hadn't realized
[14:07:25] volans: type go to commit
[14:07:39] crazy
[14:22:11] XioNoX: ganeti1042 is drained (again one non-DRBD node; ml-etcd1001)
[14:31:46] I know you're busy, for later, the diff so far is looking good, just got 1 error:
[14:31:49] ERROR:homer.transports.junos:Failed to get diff for lsw1-b8-codfw.mgmt.codfw.wmnet: RpcError(severity: warning, bad_element: speed 1G, message: mgd: 1g config will be applied to ports 20 to 23)
[14:40:08] volans: that one was in the latest daily diff I think
[14:40:18] so not related
[14:42:27] k
[14:44:08] moritzm: rebooting 1042
[14:54:46] diff completed, 98 noop, 1 change for asw2-ulsfo.mgmt.ulsfo.wmnet (seems legit), 1 failure for lsw1-b8-codfw.mgmt.codfw.wmnet
[14:54:49] \o/
[15:21:30] volans: great, thanks!