[06:37:45] it supports name resolution once the DNS server config option has been committed, probably not before
[06:38:31] topranks: there are the inventory items Power Supplies, which are mostly to track their serial#, models, etc https://netbox.wikimedia.org/dcim/devices/3570/inventory/
[06:39:01] and the power ports, that are like interfaces https://netbox.wikimedia.org/dcim/devices/3570/power-ports/ to connect them to PDUs
[06:41:12] XioNoX: ok thanks makes perfect sense.
[06:42:38] we’re missing those on some of our devices
[06:42:42] https://netbox.wikimedia.org/dcim/devices/3929/
[06:43:34] so the inventory items PS are compared with the LibreNMS inventory, so they should all be there
[06:43:38] I’ll get them added anywhere they’re missing and ask dc ops to complete the connection if that makes sense?
[06:44:02] but yeah for the power ports, as we don't track power cables in eqiad/codfw we've been more lax about them
[06:44:45] ah ok that’s probably how I missed that detail
[06:44:46] doesn't hurt to have the power ports (it shouldn't trigger any kind of alerting) but it won't be used
[06:44:49] cool
[06:44:50]
[06:45:21] I’ll add them to the template anyway
[06:48:17] topranks: also this might be useful https://github.com/netbox-community/devicetype-library/blob/master/device-types/Juniper/QFX5120-48Y.yaml
[06:50:11] thanks
[06:50:23] funny they list fxp0 there but it’s em0 on the 5120s we have
[06:50:59] yeah, it's community based, so could be a mistake too
[06:51:16] but that can help for the power port type and max draw at least
[07:07:14] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (fgiunchedi) >>! In T333007#8826083, @cmooney wrote: > FYI I updated the wikitech page relating to this to update it (when w...
[07:09:17] netops, DC-Ops, Infrastructure-Foundations, SRE: Access port speed <= 100Mbps False posatives - https://phabricator.wikimedia.org/T336511 (ayounsi) The issue is that the check is ran from the switch side, and for the switch the port is up `Physical interface: ge-3/0/22, Enabled, Physical link is...
[09:06:42] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:54:07] slyngs: do you still need to reimage sretest to test the reimage cookbook patch?
[09:54:33] because I have some dhcp changes that would benefit to check everything still works wrt reimages, so we could hit 2 birds with one reimage
[09:54:35] Yes, I think it best to test it before releasing it.
[09:55:16] I just decided to make my Friday sad by trying to use Google's API
[09:57:12] lmk if you need that access we talked about in the meeting
[10:05:04] netops, DC-Ops, Infrastructure-Foundations, SRE: Access port speed <= 100Mbps False posatives - https://phabricator.wikimedia.org/T336511 (jbond) WARNING: wild speculation > Is it possible that the server turns its interfaces off when the server is off? i guess if it has wake on lan, or some ty...
[10:06:42] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:07:05] Thanks, so far it's not completely clear what access I need, what kind of access I'm able to grant, what my scope is or why the documentation is outdated
[10:07:17] I don't think they want you to use it
[10:10:19] slyngs: i feel your pain, i hate working with the google API's
[10:11:00] (specifically the oauth api's)
[10:11:13] Maybe I should just have IMAP enabled and do the calendar stuff with Selenium
[10:21:19] volans: Can we schedule testing the cookbook stuff Monday or Tuesday?
[10:21:59] I was planning to do it today, no worries I can either run your cookbook or just the prod one
[10:22:12] Okay, that works too :-)
[10:23:00] Just get the latest version in the patch, it had a few bugs, but they are fixed now
[10:23:51] latest from gerrit?
[10:24:22] Yes, patchset 18 https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/904510
[10:25:02] ok
[10:25:04] thx
[10:25:29] I'm just going to get lunch, so I'll be back in a bit
[10:27:06] slyngs: I see the gmail API as an option in my "service" account, so if you need it we can use that one
[10:28:47] I think it needs Groups access, because Groups are Gmail or something, that bit isn't all that clear from the documentation
[10:29:45] And it's confusing because Groups is an unfortunate product name and can get confused with IAM groups, which may or may not be the same
[10:30:39] The Calendar API seems to work, but what I failed to understand is that it's a calendar for the service account 😵‍💫
[10:33:08] slyngs: https://stackoverflow.com/questions/68915009/whats-the-recommended-api-to-use-for-querying-conversations-for-google-groups
[10:33:24] btw doesn't necessarily have to be a google group in the longer term IMHO
[10:33:59] Could it be a POP3 account :-)
[10:34:33] yeah the maint-announce address can be anything we want
[10:34:50] the google group is for convenience as we don't have anything else so far
[10:38:47] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (cmooney) One question did arise to me, I'll mention it here but not sure we need to focus on it, at least initially. Shou...
[10:39:04] slyngs: you can see some code that interacts with gapis here
[10:39:04] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/+/refs/heads/master/reports/accounting.py
[11:01:52] fyi all im off next week so let me know if there is anything you want me to look at today
[11:03:21] ack
[11:06:42] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:08:51] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans) I've seen that option and decided that was not relevant for new host's ztp, but lmk if we need it too. The general...
[11:10:59] netops, Infrastructure-Foundations, SRE, cloud-services-team: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (aborrero)
[11:11:33] netops, Infrastructure-Foundations, SRE, cloud-services-team: cloudgw: review security policy for edge network - https://phabricator.wikimedia.org/T336368 (aborrero) Open→Resolved Fixed! thanks
[11:17:04] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (cmooney) >>! In T336485#8847055, @Volans wrote: > The general usage for that seems to me more for a "reimage" concept of u...
[11:17:42] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations: PCC: unable to run dnsquery:a() for some domains - https://phabricator.wikimedia.org/T336566 (aborrero)
[11:21:43] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (ayounsi) It would be useful during the initial provisioning to have the device running the Junos version we want on day 1....
[11:35:12] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations: PCC: unable to run dnsquery:a() for some domains - https://phabricator.wikimedia.org/T336566 (aborrero) p:Triage→Medium
[11:38:01] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations: PCC: unable to run dnsquery:a() for some domains - https://phabricator.wikimedia.org/T336566 (jbond) > We have a puppet manifest that runs dnsquery::a() that fails in PCC for the domain private.codfw.wikimedia.cloud....
[11:40:15] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations: PCC: unable to run dnsquery:a() for some domains - https://phabricator.wikimedia.org/T336566 (jbond) in fact i get SERVFAIL for the wikimedia.cloud domain ` jbond@cloudinfra-internal-puppetmaster-01:~$ dig soa wikime...
[11:40:28] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations: PCC: unable to run dnsquery:a() for some domains - https://phabricator.wikimedia.org/T336566 (aborrero)
[11:45:58] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations: PCC: unable to run dnsquery:a() for some domains - https://phabricator.wikimedia.org/T336566 (aborrero) Yes, sorry for the noise, there is nothing wrong with PCC and it may be the Openstack Designate setup: `lang=she...
[12:03:42] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (cmooney) ######Routing issue I hit an issue with the new spines in that the overlay loopback address was not reachable when they...
[12:06:42] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:26:06] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans) Is it ok to start testing without it? Based on how we want the workflow to go we would need a change in Spicerack...
[12:27:03] jbond, XioNoX: if you have a spare minute could I get a quick one on https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/919052 ?
[12:29:24] volans: lgtm!
[12:31:08] XioNoX: just FYI you were right about the 'vme' interfaces, it's no problem to delete them on the 5120s
[12:33:04] thanks a lot
[13:06:42] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:10:24] rzl: FYI this is flapping since few days ^^^
[13:11:04] everytime I look at it is for a different reason (unable to connect, 503, connection reset)
[13:20:54] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (ayounsi) sgtm as it's an additional feature and to prevent scope creep but might be worth looking at implementing it soone...
[14:02:43] netops, Infrastructure-Foundations, SRE, observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (ayounsi) An alternative (or complement) here would be to go the gNMI way, probably through gNMIc https://github.com/openconfig/gnmic https://www.youtube.com/w...
[14:06:42] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:10:13] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Papaul) I 'do agree that we can also have the Junos image for upgrade during the process. Our first goal here was to have...
[15:09:33] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans) Do we want to hardcode that in the dhcp settings? Or better to pass it dynamically to the cookbook? Based on that...
[15:15:27] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[15:21:47] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367 (Dzahn)
[15:38:35] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS...
[15:52:33] netops, Cloud-VPS, Infrastructure-Foundations, SRE, cloud-services-team: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (aborrero)
[15:55:30] netops, Cloud-VPS, Infrastructure-Foundations, SRE, cloud-services-team: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (cmooney)
[15:55:37] netops, Infrastructure-Foundations, SRE, cloud-services-team: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney)
[15:56:21] netops, Cloud-VPS, Infrastructure-Foundations, SRE, cloud-services-team: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (aborrero) p:Triage→Medium
[16:02:36] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (aborrero) a:Papaul
[16:06:33] netops, Cloud-VPS, Infrastructure-Foundations, SRE, cloud-services-team: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) Current idea that has gained some momentum as part of {T297596} and {T324992}: * hook the cloudser...
[16:14:43] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (cmooney) >>! In T336485#8847630, @Volans wrote: > Do we want to hardcode that in the dhcp settings? Or better to pass it d...
[16:23:43] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (Andrew) This plan sounds OK to me. We could also move the recursors onto VMs, at which point they'd need to be able to a...
[16:24:15] Anyone has anything to keep on netbox-next/
[16:24:16] ?
[16:24:25] we'd like to re-import a fresh dump from prod
[16:32:38] volans: ack, I've seen the same -- I think it's a mw-on-k8s issue rather than an httpbb issue, but haven't been able to dig into it yet
[16:33:56] I'm off today, do you mind putting it in a short task? I can take a look on Monday
[16:34:46] and that's now done, netbox-next has a clean DB
[16:34:56] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bul...
[16:35:05] rzl: sure can
[16:35:25] thank you!
[16:52:35] SRE-tools, Infrastructure-Foundations, serviceops: Some httpbb checks are flapping - https://phabricator.wikimedia.org/T336590 (Volans) p:Triage→Medium
[16:53:08] rzl: ^^^ all yours :) enjoy your day off
[16:53:28] much obliged
[17:10:49] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS...
[18:08:16] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS...
[18:08:26] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bul...
[18:13:12] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS...
[18:13:28] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bul...
[19:42:39] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (colewhite)
[21:05:59] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS...
[21:06:08] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bus...
[22:18:42] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudswift1001.eqiad.wmnet with OS b...
[22:32:26] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bulls...
[22:35:30] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudswift1001.eqiad.wmnet with OS b...
[22:54:42] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul) a:Papaul→Jhancock.wm @Jhancock.wm was trying to install the OS on cloudswitf1001 and the server was not getting DHCP a...
[22:59:55] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bulls...