[07:25:46] I'm rebooting netmon1003
[07:26:33] moritzm: fyi, o11y is working on netmon2002 (it's being re-imaged) so we don't have any redundancy
[07:27:02] reboot is ok as long as it doesn't mean longer downtime
[07:30:06] yeah, it's just a reboot, should be back any minute
[07:34:19] it's back
[08:13:59] what are our current thoughts on naming for ganeti groups with the planned migration to per-rack? because in the pops we didn't use the rack name, but for the core DCs it might be useful to keep that information. Thoughts?
[08:18:05] we use the rack names in esams and drmrs, only ulsfo/eqsin use "1", but we can easily rename them as well
[08:19:51] yeah sorry I wasn't clear, the cluster groups in netbox drmrs01 02...
[08:20:07] XioNoX: FWIW netmon2002 is back to yesterday's state, i.e. it could take a failover if needed
[08:20:21] but also we could remove them all once we complete the migration as we'll have just one flat level
[08:20:43] or group by site at that point
[08:20:45] not sure
[08:29:41] volans: we could also have 1 cluster esams, and 2 groups BY/BW
[08:30:02] like we have 1 eqiad and A/B/C/D
[08:30:55] yes that's what I meant by group by site, but at that point the netbox grouping will not match the ganeti grouping I think
[08:31:17] netbox can model whatever we want ganeti to do
[08:36:34] no preference at all on naming
[08:37:56] ack, thx
[08:38:55] I think it comes down to ease of mgmt vs. blast radius
[08:39:03] and what will be the path for migrating VMs?
[08:39:21] migrating where?
[08:40:28] when a ganeti host migrates from private1-a-codfw to private1-a1-codfw
[08:40:35] what will happen to its VMs?
[08:41:05] I guess in the end they will need re-numbering too, or decom+makevm
[08:41:20] yeah, if we do this, then decom+makevm
[08:41:34] there is some Ganeti internal dump and export mechanism, but we haven't used it so far
[08:43:09] but as I understand it for some period the physical host will have both VLANs available, so I was wondering if there could be a way to renumber a VM too, like we'll probably do with physical hosts
[08:51:56] yeah it's all TBD
[08:52:13] but we will need the proper automation whatever way we're going
[08:52:48] ideally similar to what we have for physical servers
[09:07:35] hi all, i forgot to say in the meeting that i have some vacation this week starting (now that i extended it) tomorrow, back on the 21st
[09:07:47] please let me know if there is anything you want me to take a look at today
[09:09:08] jbond: quick one: https://phabricator.wikimedia.org/T102099
[09:09:19] :)
[09:09:33] ack
[09:09:36] lol :D
[09:21:42] are there good arguments why we couldn't statically assign IPv6 addresses and default routes?
[09:21:59] assuming we bootstrap machines with IPv4 & DHCP?
[09:23:52] topranks: disclaimer, it's a while since i read the whole of that class, however:
[09:24:39] we currently use interface::add_ip6_mapped to configure the correct IPv6 addresses to use for a host
[09:24:55] however i'm not sure it's enabled everywhere because ... fear/risk
[09:25:11] but also even when it is enabled there are often no AAAA records
[09:26:55] we also have the following in late_command.sh suggesting that everything now gets a mapped address
[09:26:58] https://github.com/wikimedia/operations-puppet/blob/production/modules/install_server/files/autoinstall/scripts/late_command.sh#L97-L119
[09:27:12] Ok right, yeah, I'd best read the whole task again, I did go through it a while ago
[09:27:33] I guess I was more thinking at a high level, and in terms of re-working our network config with systemd-networkd or similar changes in future
[09:28:05] topranks: looking at profile::base::production::enable_ip6_mapped, ganeti is the only production cluster without the mapped address
[09:28:06] but yeah, hadn't really considered the case where we don't want IPv6 at all on some hosts
[09:28:34] yes, the whole network config management is a bit of a mess and desperately needs some love :)
[09:29:24] also note that we don't have ipv6 in WMCS at all (unless it changed recently)
[09:30:11] correct, IIRC we do have ipv6 on all prod hosts, just lacking the AAAA records for the clusters not ready yet (and there are a lot of them with tracking tasks)
[09:30:14] the hosts have v6 mapped addresses on their main interface (10.x)
[09:30:52] volans: ack thanks for the clarification
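A minimal Python sketch of the "mapped" v6 convention referenced above, assuming (from my reading of the late_command.sh link) that the four IPv4 octets are reused verbatim as the last four groups of the IPv6 address; the prefix and host address below are hypothetical examples, not real allocations:

```python
import ipaddress

def mapped_v6(v4: str, v6_prefix: str) -> ipaddress.IPv6Address:
    """Derive a 'mapped' IPv6 address by reusing the IPv4 octets,
    written as-is, as the last four groups of the IPv6 address.
    v6_prefix is the host subnet's /64, e.g. '2620:0:861:101::'."""
    octets = ipaddress.IPv4Address(v4).exploded.split(".")
    # e.g. 10.64.0.15 under 2620:0:861:101::/64 -> 2620:0:861:101:10:64:0:15
    return ipaddress.IPv6Address(f"{v6_prefix.rstrip(':')}:{':'.join(octets)}")

# hypothetical host and prefix:
print(mapped_v6("10.64.0.15", "2620:0:861:101::"))
```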
[09:54:21] SRE-tools, Spicerack: spicerrack.decorators.retry: dynamic_params_callbacks=(set_tries,) dfosn;t seem to work as epected - https://phabricator.wikimedia.org/T346134 (jbond) p:Triage→Medium
[09:55:45] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.decorators.retry: dynamic_params_callbacks=(set_tries,) dosn't seem to work as expected - https://phabricator.wikimedia.org/T346134 (Volans)
[10:05:32] SRE-tools, netbox, Infrastructure-Foundations, Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=90305a26-47b2-42a2-abe5-284f8035bf3b) set by jmm@cumin2002...
[10:10:00] SRE-tools, netbox, Infrastructure-Foundations, Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2af641c9-48a3-42b7-8c75-56c12506718a) set by jmm@cumin2002...
[10:53:46] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.decorators.retry: dynamic_params_callbacks=(set_tries,) dosn't seem to work as expected - https://phabricator.wikimedia.org/T346134 (Volans) a:Volans Yes the issue is that the `set_tries` defined in spicerack doesn't check the function s...
[11:28:01] netops, Infrastructure-Foundations, SRE, cloud-services-team: neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (cmooney) @aborrero feel free to close this one if it's not being worked on, the status...
[11:35:28] netops, Infrastructure-Foundations, SRE, cloud-services-team: neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (aborrero) Open→Declined OK, closing for now and hoping some more modern BGP-bas...
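On the T346134 thread above: spicerack's @retry decorator accepts dynamic_params_callbacks, callables that can adjust the retry parameters at call time. The toy decorator below only illustrates that general mechanism; the names and signatures are illustrative and deliberately do not reproduce spicerack's actual API.

```python
import functools
import time

def retry(tries=3, delay=1.0, dynamic_params_callbacks=()):
    """Toy retry decorator. Each callback may rewrite the effective
    retry parameters based on the decorated callable and the arguments
    it was invoked with. Illustrative only -- NOT spicerack's code."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            params = {"tries": tries, "delay": delay}
            for callback in dynamic_params_callbacks:
                params = callback(params, func, args, kwargs)
            for attempt in range(1, params["tries"] + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == params["tries"]:
                        raise
                    time.sleep(params["delay"])
        return wrapper
    return decorator

def set_tries(params, func, args, kwargs):
    """Hypothetical callback: read 'tries' from the first positional
    argument. The failure mode discussed in the task is of this shape:
    the callback must match how the decorated function is actually
    called, or it silently keeps the defaults."""
    obj = args[0] if args else None
    params["tries"] = getattr(obj, "tries", params["tries"])
    return params

@retry(dynamic_params_callbacks=(set_tries,))
def fetch(client):
    return client.get()
```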
[12:23:26] Could I pick someone's brain about a Debian packaging question please? It's about how best to begin, ideally under GitLab-CI.
[12:24:17] rebuilding an existing package or starting from scratch?
[12:24:44] Brand new package. I'm looking to distribute a single jar file in a package. e.g. `spark-3.2-yarn-shuffle`
[12:25:07] netops, Infrastructure-Foundations, SRE, IPv6, User-jbond: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (cmooney) In the medium term I think we need to carefully consider how this operates, probably as part of a move away from using...
[12:25:38] Until now the only build of spark we had that makes this jar was in the docker production-images repo.
[12:25:57] will leave it to moritz to advise on the best process but i thought i'd point you to https://phabricator.wikimedia.org/T304491 as well
[12:26:10] Here is where we built it in docker: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/+/refs/heads/master/images/spark/build/Dockerfile.template#66
[12:26:38] Great, thanks jbond. I've briefly seen that, but have no experience of dgit yet.
[12:27:25] I'm now experimenting with building spark under GitLab-CI instead of production-images: https://gitlab.wikimedia.org/repos/data-engineering/spark/-/blob/add_initial_spark_pipeline/3.4/blubber.yaml#L70
[12:28:15] I was wondering whether it would be possible/wise to try to build a deb containing this jar file as part of the GitLab-CI process.
[12:28:36] im a complete novice on dgit and often annoy Emperor for help. however i have some very rough notes: https://wikitech.wikimedia.org/wiki/User:Jbond/dgit
[12:30:47] Great, thanks. The odd thing about this is that it's a single binary (jar) artifact - is putting it in a Debian package even the best way? Maybe I should be publishing it to Archiva or GitLab instead and pulling from there with puppet.
[12:32:17] I need to make several minor versions available concurrently on all of the Hadoop workers. (T344910)
[12:32:17] T344910: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910
[12:32:35] puppet volatile could also be an option to evaluate
[12:34:09] Oh, thanks volans. I hadn't thought of that. It's quite new to me.
[12:35:53] The current way that we distribute the single version of the jar is a bit suboptimal and ties in with our single conda-analytics environment. https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/blob/main/docker/Dockerfile#L98-100
[12:36:14] T304491 is mostly around how to rebuild/maintain/update a package, it doesn't deal with the initial packaging at all
[12:36:15] T304491: Standardize Debian package builds on GitLab CI - https://phabricator.wikimedia.org/T304491
[12:36:47] however, you could use the cas-overlay-template repository
[12:37:00] (specifically the debian/ directory included in it)
[12:37:04] So the jar ends up in a Debian package, but it doesn't feel very clean to start copying 4 different jar files into this package.
[12:37:47] for CAS we kick off the build and then the (fairly minimal) debian/ directory takes care of installing the result into a deb
[12:38:02] are those Jars prebuilt or how are they created?
[12:38:46] This one? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/cas-overlay-template/+/refs/heads/master/debian/
[12:39:10] We build all of the jars from clean sources. https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
[12:42:29] yeah, that one
[12:43:03] the question is rather whether the build of these Jars needs to be handled as part of the whole deb build or whether it's separate
[12:43:24] if the latter, then you can create a simple deb with just these files for a given package foo:
[12:43:47] debian/rules, debian/control, debian/changelog, debian/foo.install and debian/foo.dirs
[12:44:12] https://people.wikimedia.org/~jmm/slides/deb-101.pdf also has a quick intro
[12:46:39] btullis: as it's just a jar can you not have a gitlab pipeline build the jar and publish it to archiva?
[12:47:13] moritzm: should we consider doing similar for cas? (we also have the deployment logic in the deb so i suspect not, but still)
[12:47:54] https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Archiva#Deploy_artifacts_using_scap3
[12:48:00] * jbond sees scap and runs away
[12:49:04] Yes, perhaps archiva would be simpler in this case, but also as moritzm mentioned it would be a lightweight .deb file too.
[12:50:29] ack
[12:52:31] for cas the deb seems preferable since it also deals with the deployment (extracting the WAR, moving the old deploy directory around). and its biggest perk is no scap :-)
[12:52:41] I also don't know what the long term plan is with Archiva
[12:53:32] yes, +1 to no scap :) but yes, agree we do a little bit more for cas
[12:55:51] OK, I'll do a little more experimenting. Thanks all.
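To make the five-file list above concrete, here is a minimal sketch of such a debian/ directory for a package that only ships a prebuilt jar. The package name is taken from the chat; the jar path, maintainer, section and install target are placeholder assumptions, and debhelper compat 13 is assumed. The "==== file ====" lines are just separators in this sketch, not file content:

```
==== debian/rules (executable; the recipe line is a tab) ====
#!/usr/bin/make -f
%:
	dh $@

==== debian/control ====
Source: spark-3.2-yarn-shuffle
Section: java
Priority: optional
Maintainer: Placeholder <placeholder@example.org>
Build-Depends: debhelper-compat (= 13)

Package: spark-3.2-yarn-shuffle
Architecture: all
Description: Spark 3.2 YARN shuffle service jar (sketch)
 Ships a prebuilt yarn-shuffle jar so that several Spark
 versions can run side by side on the Hadoop workers.

==== debian/spark-3.2-yarn-shuffle.install ====
spark-3.2-yarn-shuffle.jar usr/lib/spark-3.2-yarn-shuffle/

==== debian/spark-3.2-yarn-shuffle.dirs ====
usr/lib/spark-3.2-yarn-shuffle
```

With those in place plus a debian/changelog (typically generated with `dch --create`), `dpkg-buildpackage -us -uc` should produce the binary package, and a GitLab-CI job could wrap that call. The .dirs file is often optional since dh_install creates target directories, but it's shown because it was named above.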
[14:10:35] netops, Infrastructure-Foundations, SRE: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (cmooney) Open→Resolved a:cmooney Thanks all, config applied now. @volans I left the timeout at 30 mins. I think (esp. in an emergency situation) it's not unlikely yo...
[14:38:33] (SystemdUnitFailed) firing: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:39:46] netops, Infrastructure-Foundations, SRE: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (cmooney) Just to mention here, but the restriction described in T322937#8847201 no longer seems to be the case. In codfw with devices on JunOS 22.2R3....
[14:43:33] (SystemdUnitFailed) resolved: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:24:10] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (cmooney) I took a little look at the routed-mode docs from [[ https://github.com/grnet/gnt-networking/blob/develop/docs/routed.rst | here ]]. Overall the setup looks a...
[18:37:25] SRE-tools, DNS, Infrastructure-Foundations, Traffic, Patch-Needs-Improvement: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (BCornwall) @Vgutierrez and @BBlack friendly poke :)
[21:09:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:14:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:29:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:34:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:08:33] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed