[07:25:46] <moritzm>	 I'm rebooting netmon1003
[07:26:33] <XioNoX>	 moritzm: fyi, o11y is working on netmon2002 (it's being re-imaged) so we don't have any redundancy
[07:27:02] <XioNoX>	 reboot is ok as long as it doesn't means longer downtime
[07:30:06] <moritzm>	 yeah, it's just a reboot, should be back any minute
[07:34:19] <moritzm>	 it's back
[08:13:59] <volans>	 what are our current thoughts on naming for ganeti groups with the planned migraion to per-rack? because in the pops we didn't use the rack name, but for the core DCs it might be useful to keep that information. Thughts?
[08:18:05] <moritzm>	 we use the rack names in esams and drmrs, only ulsfo/eqsin use "1", but we can easily rename them as well
[08:19:51] <volans>	 yeah sorry I wasn't clear, the cluster groups in netbox drmrs01 02...
[08:20:07] <godog>	 XioNoX: FWIW netmon2002 is back to yesterday' state, i.e. it could take a failover if needed
[08:20:21] <volans>	 but also we could remove them all once we complete the migration as we'll have just one flat level
[08:20:43] <volans>	 or group by site at that point
[08:20:45] <volans>	 not ssure
[08:29:41] <XioNoX>	 volans: we could also have 1 cluster esams, and 2 groups BY/BW
[08:30:02] <XioNoX>	 like we have 1 eqiad and A/B/C/D
[08:30:55] <volans>	 yes that's what I meant by group by site, but at that point the netbox grrouping will not match ganeti grouping I think
[08:31:17] <XioNoX>	 netbox can model whatever we want ganeti to do
[08:36:34] <moritzm>	 no preference at all on naming
[08:37:56] <volans>	 ack, thx
[08:38:55] <XioNoX>	 I think it comes down to ease of mgmt vs. blast radius
[08:39:03] <volans>	 and what will be the path for migrating VMs? 
[08:39:21] <XioNoX>	 migrating where?
[08:40:28] <volans>	 when a ganeti host will migrate from private1-a-codfw to  private1-a1-codfw
[08:40:35] <volans>	 what wil happen to its VMs?
[08:41:05] <volans>	 I guess in the end they will need re-numbering too or decom+makevm
[08:41:20] <moritzm>	 yeah, if we do this, then decom+makevm
[08:41:34] <moritzm>	 there is some Ganeti intenral dump and export mechanism, but we haven't used it yet so far
[08:43:09] <volans>	 but as I understand it for some period the physical host will have both VLANs available, so I was wondering if there could be a way to renumber a VM too like we'll probably do with physical
[08:51:56] <XioNoX>	 yeah it's all TBD
[08:52:13] <XioNoX>	 but we will need the proper automation whatever way we're going
[08:52:48] <XioNoX>	 ideally similarly to what we have for physical servers
[09:07:35] <jbond>	 hi all i forgot to say in the meeting that i have some vacation this week starting (now that i extended it( tomottow and back on the 21st
[09:07:47] <jbond>	 please let me know if there is anything yuo want me to take a look at today
[09:09:08] <XioNoX>	 jbond: quick one: https://phabricator.wikimedia.org/T102099
[09:09:19] <XioNoX>	 :)
[09:09:33] <volans>	 ack
[09:09:36] <jbond>	 lol :D
[09:21:42] <topranks>	 are there good arguments why we couldn’t statically assign IPv6 addresses and default routes?
[09:21:59] <topranks>	 assuming we bootstrap machines with IPv4 & DHCP?
[09:23:52] <jbond>	 topranks: disclaimer its a hilw since i read the whole of that class however 
[09:24:39] <jbond>	 we currently use interface::add_ip6_mapped to create the configure the correct IPv6 addresses to use for a host
[09:24:55] <jbond>	 however im not sure its enabled every where because ... fear/risk
[09:25:11] <jbond>	 but also even when it is enabled there are often now AAAA records 
[09:26:55] <jbond>	 we also have the following in late_install.sh suggesting that everything now gets a mapped addresses
[09:26:58] <jbond>	 https://github.com/wikimedia/operations-puppet/blob/production/modules/install_server/files/autoinstall/scripts/late_command.sh#L97-L119
[09:27:12] <topranks>	 Ok right, yeah I best read the whole task again I did go through it a while ago 
[09:27:33] <topranks>	 I guess I was more thinking at a high-level, and in terms of re-working our network config with systemd-networkd or similar changes in future 
[09:28:05] <jbond>	 topranks: looking at profile::base::production::enable_ip6_mapped ganetie is the only [production cluster without the mapped address
[09:28:06] <topranks>	 but yeah hadn't really considered the case where we don't want IPv6 at all on some hosts 
[09:28:34] <jbond>	 yes the whole network config managment is a bit of a mess and desperatly needs some love :)
[09:29:24] <jbond>	 also note that we dont have ipv6 in WMCS at all (unless it changed recently)
[09:30:11] <volans>	 correct, IIRC we do have ipv6 on all prod hosts, just lacking the AAAA records for the clusters not ready yet (and there are a lot of them with tracking tasks)
[09:30:14] <topranks>	 the hosts have v6 mapped addresses on their main interface (10.x)
[09:30:52] <topranks>	 volans: ack thanks for the clarification
[09:54:21] <wikibugs>	 10SRE-tools, 10Spicerack: spicerrack.decorators.retry: dynamic_params_callbacks=(set_tries,) dfosn;t seem to work as epected - https://phabricator.wikimedia.org/T346134 (10jbond) p:05Triage→03Medium
[09:55:45] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.decorators.retry: dynamic_params_callbacks=(set_tries,) dosn't seem to work as expected - https://phabricator.wikimedia.org/T346134 (10Volans)
[10:05:32] <wikibugs>	 10SRE-tools, 10netbox, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=90305a26-47b2-42a2-abe5-284f8035bf3b) set by jmm@cumin2002...
[10:10:00] <wikibugs>	 10SRE-tools, 10netbox, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2af641c9-48a3-42b7-8c75-56c12506718a) set by jmm@cumin2002...
[10:53:46] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.decorators.retry: dynamic_params_callbacks=(set_tries,) dosn't seem to work as expected - https://phabricator.wikimedia.org/T346134 (10Volans) a:03Volans Yes the issue is that the `set_tries` defined in spicerack doesn't check the function s...
[11:28:01] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team: neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10cmooney) @aborrero feel free to close this one if it's not being worked on, the status...
[11:35:28] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team: neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10aborrero) 05Open→03Declined OK, closing for now and hoping some more modern BGP-bas...
[12:23:26] <btullis>	 Could I pick someone's brain about a Debian packaging question please? It's about how best to begin, ideally under GitLab-CI.
[12:24:17] <moritzm>	 rebuilding an existing packaging or startin from scratch?
[12:24:44] <btullis>	 Brand new package. I'm looking to distribute a single jar file in a package. e.g. `spark-3.2-yarn-shuffle`
[12:25:07] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE, 10IPv6, 10User-jbond: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10cmooney) In the medium term I think we need to carefully consider how this operates, probably as part of a move away from using...
[12:25:38] <btullis>	 Until now the only build of spark we had that makes this jar was in the docker production-images repo.
[12:25:57] <jbond>	 will leave it to moritz to advice onthe best process but i thught id point you to https://phabricator.wikimedia.org/T304491 as well
[12:26:10] <btullis>	 Here is where we built it in docker: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/+/refs/heads/master/images/spark/build/Dockerfile.template#66
[12:26:38] <btullis>	 Great, thanks jbond. I've briefly seen that, but have no experience of dgit yet.
[12:27:25] <btullis>	 I'm now experimenting with building spark under GitLab-CI instead of production-images: https://gitlab.wikimedia.org/repos/data-engineering/spark/-/blob/add_initial_spark_pipeline/3.4/blubber.yaml#L70
[12:28:15] <btullis>	 I was wondering whether it would be possible/wise to try to build a deb containing this jar file as part of the GitLab-CI process. 
[12:28:36] <jbond>	 im a complete novice on dgit and often annoy Emperor for help.  however i have some very rought notes https://wikitech.wikimedia.org/wiki/User:Jbond/dgit 
[12:30:47] <btullis>	 Great, thanks. The odd thinkg about this is that it's a single binary (jar) artifact - is putting it in a Debian package even the best way? Maybe I should be publishing it to Archiva or GitLab instead and pulling from there with puppet.
[12:32:17] <btullis>	 I need to make several minor versions available concurrently on all of the Hadoop workers. (T344910) 
[12:32:17] <stashbot>	 T344910: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910
[12:32:35] <volans>	 puppet volatile could also be an option to evaluate
[12:34:09] <btullis>	 Oh, thanks volans. I hadn't thought of that. It's quite new to me.
[12:35:53] <btullis>	 The current way that we distribute the single version of the jar is a bit suboptimal and ties in with our single conda-analytics environment. https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/blob/main/docker/Dockerfile#L98-100
[12:36:14] <moritzm>	 T304491 is mostly around how to rebuild/maintain/update a package, it doesn't deal with the initial packaging at all
[12:36:15] <stashbot>	 T304491: Standardize Debian package builds on GitLab CI - https://phabricator.wikimedia.org/T304491
[12:36:47] <moritzm>	 however, you could use the cas-overlay-template repository
[12:37:00] <moritzm>	 (specifically the debian/ directory included in it)
[12:37:04] <btullis>	 So the jar ends up in a Debian package, but it doesn't feel very clean to start copying 4 different jar files into this package.
[12:37:47] <moritzm>	 for CAS we kick off the build and then the (fairly minimal) debian/ directory takes care of installing the result into a deb
[12:38:02] <moritzm>	 are those Jars prebuilt or how are they created?
[12:38:46] <btullis>	 This one? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/cas-overlay-template/+/refs/heads/master/debian/
[12:39:10] <btullis>	 We build all of the jars from clean sources. https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
[12:42:29] <moritzm>	 yeah, that one
[12:43:03] <moritzm>	 the question is rather whether the build of these Jars need to be handled as part of the whole deb build or whether it's separate
[12:43:24] <moritzm>	 if the the latter, then you can create a simply deb with just these files for a given package foo:
[12:43:47] <moritzm>	 debian/rules, debian/control, debian/changelog, debian/foo.install and debian/foo.dirs
[12:44:12] <moritzm>	 https://people.wikimedia.org/~jmm/slides/deb-101.pdf also has a quick intro
[12:46:39] <jbond>	 btullis: as uits just a jar can yuo not have a gitlab pipeline build the jar and publish it to archiva
[12:47:13] <jbond>	 moritzm: should we consider doing similar for cas (we also have the deployment logic in the deb so i suspect not but still)
[12:47:54] <jbond>	 https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Archiva#Deploy_artifacts_using_scap3
[12:48:00] * jbond sees scap and runs away
[12:49:04] <btullis>	 Yes, perhaps archiva would be simpler in this case, but also as moitzm mentioned it would be a lightweight .deb file too. 
[12:50:29] <jbond>	 ack
[12:52:31] <moritzm>	 for cas the deb seems preferable since it also deal with the deployment (extracting the WAR, moving the old deploy directory around). and it's biggest perk is no scap :-)
[12:52:41] <moritzm>	 I also don't know what the long term plan is with Archiva
[12:53:32] <jbond>	 yes +1 to no scap :) but yes agree we do a little bit more for cas
[12:55:51] <btullis>	 OK, I'll do a little more experimenting. Thanks all.
[14:10:35] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (10cmooney) 05Open→03Resolved a:03cmooney Thanks all, config applied now.  @volans I left the timeout at 30 mins.  I think (esp. in an emergency situation) it's not unlikely yo...
[14:38:33] <jinxer-wm>	 (SystemdUnitFailed) firing: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:39:46] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) Just to mention here, but the restriction described in T322937#8847201 no longer seems to be the case.  In codfw with devices on JunOS 22.2R3....
[14:43:33] <jinxer-wm>	 (SystemdUnitFailed) resolved: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:24:10] <wikibugs>	 10netops, 10Ganeti, 10Infrastructure-Foundations, 10SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10cmooney) I took a little look at the routed-mode docs from [[ https://github.com/grnet/gnt-networking/blob/develop/docs/routed.rst | here ]].  Overall the setup looks a...
[18:37:25] <wikibugs>	 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10Traffic, 10Patch-Needs-Improvement: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (10BCornwall) @Vgutierrez and @BBlack friendly poke :)
[21:09:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:14:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:29:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:34:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:08:33] <jinxer-wm>	 (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed