[00:03:13] (DiskSpace) resolved: Disk space puppetmaster1001:9100:/ 4.858% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[02:58:30] (SystemdUnitFailed) firing: krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:58:30] (SystemdUnitFailed) firing: krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:48:40] netops, Cloud-Services, Infrastructure-Foundations, SRE, Patch-For-Review: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (ayounsi) > The obvious solution is to allow passing of the specific IP to use, and default to $facts['ipaddre...
[09:30:31] netbox, Infrastructure-Foundations, SRE, Traffic-Icebox: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (jbond)
[09:31:06] netbox, Infrastructure-Foundations, Traffic, Patch-For-Review: gdnsd failures when converting services from active/passive to active/active - https://phabricator.wikimedia.org/T330084 (jbond) Open→Stalled Setting to stalled as i need to test the procedure in https://phabricator.wikimedia....
[10:58:30] (SystemdUnitFailed) firing: krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:27:42] XioNoX, topranks: quick question when I deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/914260 where should I commit to? homer "*" commit or can I use something more narrow for the routers chris had access to?
[12:33:36] that's used in all devices AFAIK, so yeah I think * is needed, but I'll leave the final answer to the Experts :D
[12:48:21] moritzm: yes volans is correct, the user accounts and keys are applied on all the devices Homer manages, so it has to be done across them all
[12:48:43] it's quite laborious unfortunately, if you want to merge I'm happy to push it out if that helps
[12:52:56] topranks: thanks, I'll manage :-)
[13:03:00] I'm seeing some unrelated changes, which look harmless enough (description lvs2007 -> description lvs2007 (#12175)), going to merge those along
[13:03:05] moritzm: you can filter on 'status:active'
[13:03:13] moritzm: yeah +1
[13:04:35] ack
[13:04:37] moritzm, slyngs, other than uid, would it be possible to store this data in the new UI and expose it via an API https://github.com/wikimedia/operations-homer-public/blob/master/config/common.yaml#L43 ?
[13:06:00] Yes I don't see why not
[13:06:10] storing is perfectly fine (once account profiles are implemented)
[13:06:38] We are planning to do something similar for Puppet data.
[13:06:39] the data is stored in LDAP, not sure what kind of API you have in mind?
[13:07:08] moritzm: anything easy :)
[13:07:27] Something you can curl :-)
[13:07:33] yeah exactly
[13:07:37] or python request
[13:08:19] curl supports ldap :-P
[13:08:20] I'm just thinking authentication, but it shouldn't be too hard to do with Django REST Framework.
[13:08:46] taavi: Really... that's either amazing or terrible
[13:09:00] slyngs: no need for auth, that file is public :)
[13:09:19] That's even easier then
[13:09:38] Where does the class come from?
[13:09:58] slyngs: user set, there is no strict rule
[13:10:36] Okay... so checkout homer-public, read the YAML file and expose the users as an API
[13:11:47] probably an LDAP bit: "network access" "none/RO/RW" for example
[13:12:21] and the API endpoint that returns the list of users that don't have it as "none"
[13:12:53] Maybe groups would be better
[13:13:43] So two LDAP groups, network and network_readonly?
[13:14:40] slyngs: for admin.pp integration there is https://github.com/voxpupuli/puppet-ldapquery/. I have not looked at it in depth and it of course adds a dependency on ldap availability for puppet compilations which may not be desirable but still worth exploring
[13:16:25] XioNoX: there's still various things to be implemented in Bitu first, can you file a task about storing the Homer user credentials, then this can be designed/implemented when the necessary underlying bits are ready?
[13:16:35] moritzm: perfect, yep
[13:17:12] sounds good, just use Infrastructure-Foundations as a tag for now, there'll be a dedicated tag for Bitu in the near future
[13:19:09] jbond: It might make sense that way, have an interface to manage the data, but let Puppet handle the actual query. That way Puppet/Homer/whatever doesn't fail if the IDM is down
[13:20:05] Or have curl query LDAP....
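A minimal sketch of the endpoint idea floated above: each user carries a network access level of none/RO/RW, and the API returns everyone whose level is not "none". The `network_access` field name, the sample uids, and the JSON shape are assumptions for illustration only; nothing here reflects an existing Homer, Bitu, or LDAP schema.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical access levels, taken from the "none/RO/RW" suggestion in the chat.
VALID_LEVELS = {"none", "ro", "rw"}


@dataclass(frozen=True)
class NetworkUser:
    uid: str
    network_access: str  # hypothetical attribute name, not a real LDAP field

    def __post_init__(self):
        # Reject anything outside the three proposed levels.
        if self.network_access not in VALID_LEVELS:
            raise ValueError(f"unknown access level: {self.network_access!r}")


def users_with_network_access(users):
    """Return the users whose access level is anything other than 'none'."""
    return [u for u in users if u.network_access != "none"]


def render_api_response(users):
    """Serialize the filtered users as the JSON body an endpoint could return."""
    payload = {"users": [asdict(u) for u in users_with_network_access(users)]}
    return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    # Made-up sample data; real data would come from LDAP or the YAML file.
    sample = [
        NetworkUser("alice", "rw"),
        NetworkUser("bob", "none"),
        NetworkUser("carol", "ro"),
    ]
    print(render_api_response(sample))
```

The two-group alternative mentioned right after (network and network_readonly) would replace the single attribute with group membership; the filtering step stays the same shape either way.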
[13:22:24] slyngs: ack, in case it's not clear, not necessarily recommending, just something to add to the mix of solutions to explore
[13:39:11] SRE-tools, Infrastructure-Foundations: cookbooks.sre.ganeti.reimage: failure reported when first puppet run succeeds after a retry - https://phabricator.wikimedia.org/T335863 (herron)
[13:43:31] (SystemdUnitFailed) firing: (2) ifup@ens13.service Failed on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:44:57] (SystemdUnitFailed) firing: (2) ifup@ens13.service Failed on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:49:57] (SystemdUnitFailed) firing: (2) ifup@ens13.service Failed on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:55:59] Not really sure why ifup claims to have failed, I've silenced the alert and will look into it a bit later
[13:56:27] slyngs: maybe https://phabricator.wikimedia.org/T273026
[14:08:38] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Andrew) @papaul, note that these hosts are still pending some trial work in codfw1dev so you shouldn't spend any effort on these hosts...
[14:40:26] XioNoX: Yeah, maybe, don't clear the alert though
[14:41:20] The old "Reboot it again Sam" seems to work
[15:02:41] netops, Infrastructure-Foundations: Store network users in Bitu/LDAP - https://phabricator.wikimedia.org/T335870 (ayounsi)
[15:02:50] slyngs, moritzm: https://phabricator.wikimedia.org/T335870 (cc topranks)
[15:03:57] netops, Infrastructure-Foundations: Store network users in Bitu/LDAP - https://phabricator.wikimedia.org/T335870 (SLyngshede-WMF) p:Triage→Low a:SLyngshede-WMF
[15:22:40] volans: just trying to run the puppetdb import script on netbox-next
[15:22:53] told me "Unable to run script: RQ worker process not running", did you ever see this before?
[15:23:06] I can dig into it of course just checking if there might be another reason
[15:24:18] topranks: we did deploy to netbox-next today multiple times so leave it to me
[15:24:21] it's surely related
[15:24:27] in a meeting will look at it in a bit
[15:25:07] ok np, yeah I don't need to test anything was just using it to get the device data, I can check in puppetdb or with ssh
[15:25:08] thanks
[15:30:25] SRE-tools, Infrastructure-Foundations: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (MoritzMuehlenhoff)
[17:53:30] (SystemdUnitFailed) firing: krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:53:31] (SystemdUnitFailed) firing: (2) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:29:57] (SystemdUnitFailed) firing: (3) httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:48:35] (SystemdUnitFailed) firing: (3) httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:53:31] (SystemdUnitFailed) firing: (3) httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:23:13] (DiskSpace) firing: Disk space puppetmaster1001:9100:/ 5.949% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace