[03:02:13] (DiskSpace) firing: Disk space krb1001:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:24:42] (SystemdUnitFailed) firing: prometheus_puppet_agent_stats.service Failed on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:29:42] (SystemdUnitFailed) firing: (4) confd_prometheus_metrics.service Failed on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:34:42] (SystemdUnitFailed) firing: (4) confd_prometheus_metrics.service Failed on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:39:42] (SystemdUnitFailed) firing: (4) confd_prometheus_metrics.service Failed on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:43:40] ^ /var/log/kerberos/krb5kdc.log is a 42G file
[05:45:24] and that only since May 17th
[05:52:13] (DiskSpace) resolved: Disk space krb1001:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:54:42] (SystemdUnitFailed) resolved: (4) confd_prometheus_metrics.service Failed on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:57:51] opened https://phabricator.wikimedia.org/T337906
[05:57:59] that's https://phabricator.wikimedia.org/T337544
[05:58:47] ah great, I looked for the file name in Phab but no task showed up
[05:59:07] moritzm: I still think both tasks are valid
[05:59:10] I'll look into rotating these more frequently in the interim
[05:59:22] cool, thanks!
[05:59:29] we already run KDCs in default log level, the real issue here is Presto
[06:00:34] which makes Kerberos requests for every single operation
[06:02:01] and it grows non-linearly as well, the expansion of the previous cluster from 5-10 made requests explode: https://phabricator.wikimedia.org/T329525
[06:02:41] and it seems upstream won't change this, probably because hardly anyone else actually uses auth for their analytics systems :-)
[06:03:12] eh :)
[06:59:42] (SystemdUnitFailed) firing: wmf_auto_restart_uwsgi-debmonitor.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:34:42] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:34:42] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:58] moritzm, jbond: I've upgraded spicerack on cumin1001, tested the new changes and all looks good so far. I'm waiting for a reimage to finish on cumin2002 to upgrade there too.
[09:19:20] I've merged the change for the makevm to use float and 1.5G min and run puppet on cumin1001 in case you want to test it
[09:27:40] cheers
[09:32:52] and that's now done on cumin2002 too
[09:36:04] looks good, sre.ganeti.makevm correctly bails out if less than 1.5G are used
[09:36:14] nice
[10:07:34] Mail, Infrastructure-Foundations, TranslationNotifications, serviceops: Investigate if TranslationNotification's DigestEmailer.php is really sending emails and what happens to them - https://phabricator.wikimedia.org/T333899 (Nikerabbit) Open→Resolved a:Nikerabbit Closing this as pare...
[10:37:30] slyngs: I see linux-host-entries.ttyS0-115200 still on the install hosts, are there any leftover patches to merge for the reimage stuff?
[10:53:39] No, but perhaps there should be one to remove it
[11:01:36] yes, definitely
[11:02:06] I'll add a todo to remove it
[11:03:59] thx should take 30s :D
[11:25:41] topranks, XioNoX: what's your final on the junos image for ZTP? should we add it or not?
[11:28:19] I think it’s worth having yeah, I think it’d save a good amount of time from us having to do separately afterwards
[11:28:33] CAS-SSO, netbox, Infrastructure-Foundations, Patch-For-Review: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002 (SLyngshede-WMF) a:SLyngshede-WMF
[11:29:29] ok
[11:30:01] let's see what Arzhel says I guess
[11:33:06] yep
[11:33:12] * volans lunch
[11:50:29] yeah agreed, or save time from dcops
[11:50:50] I forgot where we left the implementation discussion at
[12:16:52] we left it there, but the idea was to move from tftp to http but via IP because of lack of dns at that stage
[12:17:09] so that means that I probably have to move the http url from the dhcp config in puppet to the cookbook
[12:17:16] I'm looking to make a patch now
[13:05:57] volans: Found the issue: https://gerrit.wikimedia.org/r/c/operations/puppet/+/888692 <- I just removed the file from Puppet; Puppet was not told to remove the file from the servers
[13:06:39] I'm trying to locate the commit that excludes the copying of the file
[13:06:58] we probably don't have purge there on the puppet side because of the automation files
[13:07:02] so that puppet doesn't remove them
[13:07:19] so I guess you can just delete the file on all install servers with cumin
[13:07:29] and then check that it doesn't get created back when running puppet
[13:07:51] I'll try with one first and check :-)
[13:14:38] volans: All gone
[13:15:14] yay
[13:15:38] Doesn't take much to make you happy :-)
[13:16:00] :)
[13:29:10] XioNoX: FYI creating a new device with just the required fields gives 'NoneType' object has no attribute 'lower'
[13:30:37] the joys of validation rules shall keep us entertained in perpetuity I think :)
[13:30:45] that's the instance.name == instance.asset_tag.lower()
[13:39:28] volans: I'm not sure what your thinking was on the image if we do that.
[13:39:51] I think the most flexible for us might be a "--image " parameter we can pass, which causes it to go into the dhcp snippet?
[13:39:59] not sure if that's simple to do or not
[13:40:24] yes, something like that
[13:43:04] volans: well, devices need an asset tag anyway :)
[13:44:54] doesn't seem user friendly for DCops to know what junos version to use
[13:47:12] XioNoX: yes but it crashes
[13:47:31] so it means that we check that first, before checking it's present
[13:47:49] perhaps it could automatically just push ".tgz" as the image name?
[13:47:50] And we can symlink our preferred version for each platform to that on the apt server?
[13:48:42] it's sufficiently rare I don't think it'd be overkill if they had to ask us what filename to use in the previous suggestion
[13:49:06] but yeah want to make it as easy to use as possible without it becoming a massive thing to implement
[14:04:07] netops, DC-Ops, Infrastructure-Foundations, Traffic, ops-knams: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (cmooney) >>! In T331886#8688489, @RobH wrote: >>>! In T331886#8688402, @ayounsi wrote: >> Ideally we should also have 1 patch panel per rack....
[14:05:31] the --image would be great as optional override, the symlink option sounds great too
[14:18:15] volans: https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/925817
[14:19:30] thx
[14:55:34] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS...
[15:52:54] jbond: do you know if the firmware update process/cookbook makes use of the Dell iDrac "virtual USB" feature?
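A minimal sketch of the check ordering raised at 13:29-13:47, assuming the rule compares `instance.name` with `instance.asset_tag.lower()` as quoted in the log. The `Device` class and `validate_name` function are hypothetical stand-ins, not the netbox-extras code, and the log does not say what the real rule does with the comparison result; a mismatch is treated as an error here purely for illustration. The only point is that the presence check runs before `.lower()` is called.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Device:
    """Hypothetical stand-in for the Netbox device instance."""
    name: str
    asset_tag: Optional[str] = None


def validate_name(instance: Device) -> List[str]:
    """Return validation errors; an empty list means the device passes."""
    # Check presence first, so a missing asset tag yields a readable error
    # rather than "'NoneType' object has no attribute 'lower'".
    if not instance.asset_tag:
        return ["asset_tag is required"]
    # Illustrative assumption: a mismatch is reported as an error.
    if instance.name != instance.asset_tag.lower():
        return ["name must equal the lower-cased asset tag"]
    return []
```

With this ordering, a device created with only the required fields gets a validation message instead of an unhandled exception.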
[15:53:09] artur.o had an issue re-imaging, partman failing
[15:53:28] issue seems to be the two disks were sdb and sdc, not sda and sdb as partman expected
[15:53:51] there was an idrac "virtual usb" drive detected as sda we found
[15:54:03] which I was able to disconnect via the idrac web gui
[15:54:15] but unsure why it might have been there to begin with
[15:54:32] I did upgrade the firmware on the on-board 1G NIC earlier though
[15:57:20] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bul...
[15:57:40] topranks: no, disk names like that are not guaranteed to stay the same
[15:58:17] we normally see this on hosts that have many disks like the swift
[15:58:29] we fix it by detecting the actual disks to use with https://github.com/wikimedia/operations-puppet/blob/production/modules/install_server/files/autoinstall/scripts/partman_early_command.sh
[15:58:52] if artur.o's server only has two disks then it's not this
[15:59:19] the error we saw in the logs was
[15:59:19] partman-auto-raid: mdadm: cannot open /dev/sda2: No such file or directory
[15:59:31] fdisk only showed sdb and sdc
[15:59:48] netops, DC-Ops, Infrastructure-Foundations, SRE, and 3 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS...
[15:59:48] lsscsi showed another device, which turned out to be idrac virtual usb
[16:00:07] ahh ok then yes this sounds like something different I have not seen that before
[16:00:08] anyway "disconnecting" that on the idrac allowed us to restart and all worked
[16:00:26] netops, DC-Ops, Infrastructure-Foundations, SRE, and 3 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bul...
[16:00:47] the firmware upgrade cookbook uses redfish and uploads the file to the idrac software catalog so I'm 80% sure it won't be the cookbook
[16:01:01] it's only 80% because who the hell knows what dell have done internally
[16:01:11] yeah exactly :)
[16:01:22] netops, DC-Ops, Infrastructure-Foundations, SRE, and 3 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS...
[16:02:30] anyway we worked through it, I guess if we see it again maybe it might be worth delving deeper into why the virtual usb showed connected
[16:03:16] yes sgtm
[16:07:09] agree, we could see if we can ensure it is "disconnected" via redfish in the cookbook
[16:16:58] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (cmooney) @Jclark-ctr let me know when it might suit to try and get more of these moves done. Thanks.
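A rough sketch of the failure mode described at 15:53-15:59, where a virtual USB drive took the `sda` name and pushed the two real disks to `sdb` and `sdc`. This is not what `partman_early_command.sh` does. `BlockDevice`, `pick_install_disks`, and the sample records (including the `sata` transport for the real disks) are hypothetical, and the sketch assumes each device's transport is already known. It only illustrates choosing disks by property instead of by a fixed kernel name.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockDevice:
    """Hypothetical record for one disk seen by the installer."""
    name: str        # kernel name, e.g. "sda"
    transport: str   # e.g. "sata", "sas", "nvme", "usb"


def pick_install_disks(devices: List[BlockDevice], wanted: int) -> List[str]:
    """Return the first `wanted` non-USB disks, sorted by kernel name."""
    candidates = sorted(d.name for d in devices if d.transport != "usb")
    if len(candidates) < wanted:
        raise ValueError(f"need {wanted} disks, found {len(candidates)}")
    return candidates[:wanted]


# Hypothetical layout matching the log: virtual USB on sda, real disks after it.
seen = [
    BlockDevice("sda", "usb"),
    BlockDevice("sdb", "sata"),
    BlockDevice("sdc", "sata"),
]
print(pick_install_disks(seen, wanted=2))
```

A recipe that hard-codes `/dev/sda2` fails on this layout, while a selection based on device properties still finds the two real disks.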
[16:24:17] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bul...
[16:28:22] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Jhancock.wm)
[16:40:23] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS...
[16:40:37] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bul...
[16:55:43] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS...
[16:55:53] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bul...
[17:05:40] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS...
[17:05:49] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bul...
[17:07:24] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Jhancock.wm)
[17:23:30] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Jhancock.wm) @Jclark-ctr when you have a moment/back can you swap the ports on the NIC? thanks!
[22:49:09] jbond and volans: I'd like to get some clarification on what's going on with the cookbooks tomorrow if that's okay :)
[23:57:36] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (Andrew) I haven't dug much, but designate is currently failing on cloudservic...