[02:35:25] FIRING: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:10:25] RESOLVED: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:10:55] FIRING: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:56:10] FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:00:55] FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:22:37] moritzm: are you around?
[11:22:47] yep
[11:23:00] if you had a sec for a quick sanity check of this nftables change for cloudgw
[11:23:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100077
[11:23:23] arturo is out, it's sort of emergency stuff to block their spam/ddos traffic
[11:24:02] looking
[11:26:20] thanks
[11:28:37] looks good, +1d
[11:28:49] thanks!
[13:19:41] netops, Infrastructure-Foundations, SRE: Add QoS markings to profile HDFS analytics traffic - https://phabricator.wikimedia.org/T381389 (cmooney) NEW p:Triage→Medium
[13:20:20] netops, Infrastructure-Foundations, SRE: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10375639 (cmooney)
[13:21:19] netops, Infrastructure-Foundations, SRE: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10375643 (cmooney)
[15:00:55] FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:12:58] netops, Data-Platform-SRE, Infrastructure-Foundations, SRE: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10376400 (Gehel)
[16:30:39] 11:22:55 <+jinxer-wm> FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[16:44:56] netops, Data-Platform-SRE, Infrastructure-Foundations: Enable QoS for Hadoop to Presto traffic - https://phabricator.wikimedia.org/T381412#10376533 (CDanis) Thanks Xabriel! In practice I think we'll be adding the QoS marking bits on all traffic transmitted from an-worker* with source port 50010, whi...
[16:50:28] cdanis: o/ is docker-pkg 4.0.2 going to be pushed to pypi??
[16:50:40] elukey: I have no idea who has permissions for that tbh
[16:50:54] https://pypi.org/project/docker-pkg/ I think it is joe
[16:51:06] will ask directly to the maintainer :)
[16:51:36] get him to add someone else too, always better to have at least 2 people
[16:51:51] I am trying to build jaeger locally since build-production-images returns some errors
[16:52:10] oh interesting
[16:54:02] cdanis: https://phabricator.wikimedia.org/P71502
[16:54:53] uh interesting
[16:55:06] I am still running build-production-images in a tmux, it is publishing some images, it will take a bit
[16:55:40] anyway, nothing urgent, just wanted to play with docker-pkg locally. The maintainer will push it tomorrow, and I'll ask how to add more maintainers :D
[16:55:45] <3
[16:55:45] need to run now!
[16:57:29] netops, Data-Platform-SRE, Infrastructure-Foundations: Enable QoS for Hadoop to Presto traffic - https://phabricator.wikimedia.org/T381412#10376658 (BTullis) @cmooney has already created {T381389} which might cover this, I think. Or maybe they should be parent->child tickets of each other. I won't c...
[17:00:19] netops, Data-Platform-SRE, Infrastructure-Foundations: Enable QoS for Hadoop to Presto traffic - https://phabricator.wikimedia.org/T381412#10376686 (CDanis) →Duplicate dup:T381389
[17:00:32] netops, Data-Platform-SRE, Infrastructure-Foundations, SRE: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10376692 (CDanis) I think we need an-worker* source port 50010, which I am pretty sure is just the dataplane of HDFS and not the metada...
[17:02:48] netops, Data-Platform-SRE, Infrastructure-Foundations, SRE: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10376688 (CDanis)
[17:54:02] netops, Data-Platform-SRE, Infrastructure-Foundations, SRE: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10376882 (cmooney) >>! In T381389#10376688, @CDanis wrote: > I think we need an-worker* source port 50010, which I am pretty sure is ju...
[19:00:55] FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:20:55] RESOLVED: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:56:40] Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10377666 (Andrew) a:Andrew→ssingh I've checked all the resolv.confs and they all look fine. I'm passing this task over to...