[06:56:13] (DiskSpace) firing: Disk space krb1001:9100:/ 5.481% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:45:25] the culprit of the ^^^ alert above is /var/log/kerberos/krb5kdc.log (23GB); it seems to have generated that amount of logs between Dec 20 00:02:12 and Dec 20 08:45:12. cc slyngs, moritzm
[08:46:54] yeah, I had already pinged DE SREs about the increased rate last week and prodded them again 15 mins ago
[08:50:14] an-db1001 also seems to be missing, but generally it's just a lot of traffic
[08:54:35] but why hasn't the file been touched or rotated since Dec 20th?
[08:56:22] maybe info-level logging is too verbose?
[08:56:39] every other line is just: ](info): closing down fd 14
[08:56:53] https://phabricator.wikimedia.org/T337906
[08:57:42] should it be re-opened?
[09:01:31] ignore my previous comment about it not being touched: the file is still being written, and at this rate we'll run out of space before the end of the day, and hence before the next rotation
[09:02:42] either we reduce the logging level, rotate hourly (but logrotate usually runs daily), move the log to the much larger /srv/ partition, or increase the size of /
[09:04:35] there are 79.90 GiB unallocated in the volume group
[09:15:03] we can switch to hourly rotation as a workaround, but I'd rather have DE SREs fix the root cause
[09:15:24] (but I need to finish something else now, and I have an interview in an hour)
[09:24:35] ack
[09:25:23] fwiw the logrotate timer currently has OnCalendar=daily
[10:01:13] (DiskSpace) resolved: Disk space krb1001:9100:/ 3.777% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:29:30] netbox, Infrastructure-Foundations, SRE, cloud-services-team, Patch-For-Review: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428 (ayounsi) Big and needed change, thanks ! Looking at the doc at https://wikitech.wikimedia.o...
[14:30:15] netbox, Infrastructure-Foundations, SRE, cloud-services-team, Patch-For-Review: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428 (cmooney) @ayounsi thanks for the feedback! >>! In T346428#9418490, @ayounsi wrote: > Lookin...
[15:05:53] netbox, Infrastructure-Foundations, SRE, cloud-services-team, Patch-For-Review: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428 (ayounsi) To follow up only on the Cassandra usecase, my proposal here is to actually remove...
[16:01:08] netbox, Infrastructure-Foundations, SRE, cloud-services-team, Patch-For-Review: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428 (cmooney) >>! In T346428#9418800, @ayounsi wrote: > To follow up only on the Cassandra usecas...
[17:39:42] SRE-tools, Infrastructure-Foundations, SRE: Improve sre.network.configure-switch-interfaces cookbook error-handling - https://phabricator.wikimedia.org/T353825 (cmooney) p:Triage→Low
[18:12:47] SRE-tools, Infrastructure-Foundations, SRE: Improve sre.network.configure-switch-interfaces cookbook error-handling - https://phabricator.wikimedia.org/T353825 (cmooney) FWIW here is a log of running the cookbook against a switch where the interface is not set up: ` cmooney@cumin1001:~$ sudo cookbook...
[18:25:08] netops, Infrastructure-Foundations, SRE: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 (cmooney) > it seems this limitation does not apply to 22.2 which we are using in codfw. An update on this. It seems that we do have this bug in 22.2, but we don't...
[18:35:19] netops, Infrastructure-Foundations, SRE: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 (cmooney)
[21:45:53] SRE-tools, Infrastructure-Foundations, SRE, Patch-For-Review: Improve sre.network.configure-switch-interfaces cookbook error-handling - https://phabricator.wikimedia.org/T353825 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f8c695e1-b7e4-4ad2-a1f2-118a3a1653c9) set by cmooney@...
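The 08:45 and 09:04 messages give enough numbers to size the krb5kdc.log problem. The following is a back-of-the-envelope sketch that assumes "23GB" means 10**9-byte gigabytes and that both timestamps fall on Dec 20. It estimates the average write rate, the size a single file would reach under daily versus hourly rotation, and how long the 79.90 GiB of unallocated volume-group space would last if all of it were given to /. All function names are illustrative.

```python
from datetime import datetime

# Figures quoted in the 08:45 message: krb5kdc.log reached 23GB between
# 00:02:12 and 08:45:12 on Dec 20. "GB" is assumed to mean 10**9 bytes.
LOG_SIZE_GB = 23.0
WINDOW_START = "00:02:12"
WINDOW_END = "08:45:12"
# Unallocated space in the volume group, from the 09:04 message.
VG_FREE_GIB = 79.90
GIB_IN_GB = 1024**3 / 10**9  # 1 GiB expressed in decimal GB


def window_hours(start_hms: str, end_hms: str) -> float:
    """Length of a same-day HH:MM:SS window, in hours."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end_hms, fmt) - datetime.strptime(start_hms, fmt)
    return delta.total_seconds() / 3600


def growth_rate(size_gb: float, hours: float) -> float:
    """Average write rate in GB per hour."""
    return size_gb / hours


def peak_size(rate_gb_h: float, rotate_every_h: float) -> float:
    """Largest size the live file reaches between two rotations."""
    return rate_gb_h * rotate_every_h


def headroom_hours(free_gib: float, rate_gb_h: float) -> float:
    """Hours of logging that free_gib of extra space would absorb."""
    return free_gib * GIB_IN_GB / rate_gb_h


rate = growth_rate(LOG_SIZE_GB, window_hours(WINDOW_START, WINDOW_END))
print(f"average rate: {rate:.2f} GB/h")
print(f"daily rotation:  ~{peak_size(rate, 24):.0f} GB per file")
print(f"hourly rotation: ~{peak_size(rate, 1):.1f} GB per file")
print(f"whole VG free space lasts ~{headroom_hours(VG_FREE_GIB, rate):.0f} h")
```

Under these assumptions the file grows by roughly 2.6 GB/h. That works out to about 63 GB per file with daily rotation and about 2.6 GB with hourly rotation. Giving / all of the unallocated space would absorb only about 32 hours of logging at this rate. This arithmetic supports the 09:15 view that hourly rotation or a bigger volume is a stopgap rather than a fix.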
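The 08:56 messages say that every other line is the "](info): closing down fd 14" message. The sketch below measures that share of the file instead of eyeballing it. It matches only the tail of each line, because the full krb5kdc line prefix (timestamp, host, `krb5kdc[pid]`) is assumed here and not taken from the log. The sample lines are synthetic. A share near 0.5 by line count does not mean half the bytes, since these lines are likely shorter than request lines.

```python
import re
from typing import Iterable

# Matches the message quoted at 08:56:39, "](info): closing down fd 14".
# Only the end of the line is matched; the prefix format is an assumption.
CLOSING_FD = re.compile(r"\]\(info\): closing down fd \d+$")


def noise_fraction(lines: Iterable[str]) -> float:
    """Fraction of lines that are only the 'closing down fd' message."""
    total = noisy = 0
    for line in lines:
        total += 1
        if CLOSING_FD.search(line.rstrip("\n")):
            noisy += 1
    return noisy / total if total else 0.0


# Synthetic lines for illustration only; on the host you would pass an open
# file instead, e.g. open("/var/log/kerberos/krb5kdc.log", errors="replace").
sample = [
    "Dec 20 08:45:12 krb1001 krb5kdc[1234](info): some request line\n",
    "Dec 20 08:45:12 krb1001 krb5kdc[1234](info): closing down fd 14\n",
]
print(f"{noise_fraction(sample):.0%} of lines are 'closing down fd'")
```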
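The hourly-rotation workaround discussed between 09:02 and 09:25 needs two changes. The first is an `hourly` logrotate stanza for krb5kdc.log. The second is a logrotate timer that actually fires hourly: logrotate's `hourly` directive has no effect while the timer has `OnCalendar=daily`. The hypothetical helpers below render both fragments. The retention count and the drop-in path are assumptions, and so is `copytruncate`, which is used because it is not confirmed here whether krb5kdc reopens its log on a signal. This is not the Puppet-managed config on krb1001.

```python
# Hypothetical helpers rendering the two fragments an hourly-rotation
# workaround would need. Retention count and copytruncate are assumptions.


def logrotate_stanza(path: str, keep: int = 24) -> str:
    """logrotate config block rotating `path` every hour."""
    directives = [
        "hourly",          # only effective if logrotate itself runs hourly
        f"rotate {keep}",  # number of rotated files to retain
        "compress",
        "delaycompress",   # leave the most recent rotated file uncompressed
        "missingok",
        "notifempty",
        "copytruncate",    # copy then truncate in place; no daemon signal needed
    ]
    body = "".join(f"    {d}\n" for d in directives)
    return f"{path} {{\n{body}}}\n"


def timer_dropin(schedule: str = "hourly") -> str:
    """systemd drop-in for logrotate.timer, e.g. placed (path assumed) at
    /etc/systemd/system/logrotate.timer.d/override.conf."""
    # OnCalendar= may be listed several times; the empty assignment clears
    # the OnCalendar=daily inherited from the packaged unit first.
    return f"[Timer]\nOnCalendar=\nOnCalendar={schedule}\n"


print(logrotate_stanza("/var/log/kerberos/krb5kdc.log"))
print(timer_dropin())
```

The drop-in takes effect only after a `systemctl daemon-reload`. Other logs configured as `daily` should still rotate about once a day when the timer fires hourly, because logrotate records the last rotation time in its state file.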