[00:04:13] (DiskSpace) resolved: Disk space puppetmaster1001:9100:/ 5.658% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:05:13] (SystemdUnitFailed) firing: (8) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:10:13] (SystemdUnitFailed) firing: (8) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:10:13] (SystemdUnitFailed) firing: (8) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:34:29] netops, DBA, Data-Engineering, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Jelto)
[07:45:13] (SystemdUnitFailed) firing: (8) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:45:27] netops, DBA, Data-Engineering, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[07:52:57] netops, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (fgiunchedi)
[08:27:50] Puppet, Data-Platform-SRE, Wikidata, Wikidata-Query-Service, and 2 others: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (Gehel)
[08:28:56] netops, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Gehel)
[08:29:14] netops, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Gehel)
[08:42:07] netops, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi)
[09:25:13] (SystemdUnitFailed) firing: (8) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:06] netops, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (BTullis)
[09:55:21] netops, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (BTullis)
[09:59:51] netops, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row C switches upgrade...
[10:15:13] (SystemdUnitFailed) firing: (8) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:44:48] netops, Infrastructure-Foundations, SRE: Add network-layer protections to avoid inadvertently lowering IRB MTU - https://phabricator.wikimedia.org/T329799 (cmooney) Open→Resolved
[10:53:04] netops, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row C switches upgrade...
[11:06:02] netops, Cloud-Services, Infrastructure-Foundations, SRE: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (cmooney) p:Triage→Medium
[11:06:23] netops, Cloud-Services, Infrastructure-Foundations, SRE: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (cmooney)
[11:10:32] netops, Cloud-Services, Infrastructure-Foundations, SRE: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (aborrero) I plan to do {T335759} then we can specify the FQDN to use for the bird config. Otherwise I think we would need to hardcod...
[11:25:18] netops, Cloud-Services, Infrastructure-Foundations, SRE: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (cmooney) @aborrero hey. Yeah I can understand why having to hardcode the IPs in the puppet tree is not a great option. Unfortunate...
[11:30:24] netops, Cloud-Services, Infrastructure-Foundations, SRE: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (aborrero) yeah I'm thinking about doing something like `resolve_ipv4(whateverserver.codfw.hw.wikimedia.cloud)`, so basically let pup...
[11:52:03] netops, Cloud-Services, Infrastructure-Foundations, SRE: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (cmooney) @aborrero yep that should work. Potentially a race condition there if we drive the DNS from Netbox, which will only get...
[12:11:22] yo! do you know if kafka_fundraising_client puppet cert is used for anything? there's an alert about the cert expiring
[12:11:23] netops, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[12:15:02] godog: try pinging Jeff or Dallas
[12:15:23] good idea yeah
[12:20:40] netops, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[12:26:01] netops, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ssingh)
[12:27:07] netops, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[12:42:11] netops, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[12:46:02] netops, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[12:55:13] (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:03:31] netops, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=21224f03-d3c2-4431-accb-64fcadd01a0f) set by ayounsi@cumin1001 for 2:00:00 on 185 host(s) and...
[13:17:53] (SystemdUnitFailed) resolved: (2) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:20:03] (SystemdUnitFailed) firing: (2) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:20:46] (SystemdUnitFailed) firing: (2) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:24:40] netops, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (klausman)
[13:25:30] netops, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[13:25:40] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[13:47:57] netops, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Andrew)
[14:01:31] netops, Infrastructure-Foundations, SRE: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (ayounsi)
[14:02:21] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[14:02:37] netops, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi)
[14:02:55] netops, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi)
[14:57:14] netops, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Jelto)
[15:00:08] netops, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row C switches upgrade - T334...
[15:03:40] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:41] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[15:07:06] netops, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi) Open→Resolved a:ayounsi Upgrade went fine! Thanks everybody.
[15:17:10] netops, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row C switches upgrade - T334...
[16:57:52] and the docker-reporter-base-images.service alert is finally gone
[17:08:07] \o/
[17:16:49] thx!
[18:58:30] (SystemdUnitFailed) firing: (2) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:21:45] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (wiki_willy) a:Papaul
[19:22:48] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (nskaggs)
[19:23:38] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (nskaggs)
[20:16:39] SRE-tools, Infrastructure-Foundations: redfish: minimum version support - https://phabricator.wikimedia.org/T328593 (Papaul) The first 51 servers on the list are R430 since we can not do any for those we are left with 209 servers out of 260.
[20:19:11] netops, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (colewhite)
[22:13:13] (DiskSpace) firing: Disk space puppetmaster1001:9100:/ 5.942% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:58:30] (SystemdUnitFailed) firing: krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed