[03:39:07] (SystemdUnitFailed) firing: (10) httpbb_hourly_appserver.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:32:29] FYI I've mentioned the httpbb failures in -serviceops
[07:33:39] thx
[07:33:52] is there a task? :)
[07:35:46] not yet, it's a missing/deleted entity in test wikidata
[07:36:09] now the real task would be the one that allows routing failed units to different teams/channels
[07:36:25] instead of alerting in here for anything cumin related, where we have units belonging to different teams
[07:36:47] at the same time we have very few hosts that are "multi-team" like the cumin hosts... so not sure how worthwhile it is
[07:36:59] yeah exactly
[07:37:37] volans: all the hosts are multi-team :)
[07:37:44] in some way :D
[07:38:18] as in a mechanism to route alerts to the proper teams will be useful in many cases
[07:39:05] it's also problematic as the team in charge of the alert might not see it
[07:39:07] (SystemdUnitFailed) firing: (10) httpbb_hourly_appserver.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:06:48] netbox, Infrastructure-Foundations, SRE, cloud-services-team, User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (aborrero)
[09:33:44] (SystemdUnitFailed) firing: (10) httpbb_hourly_appserver.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:43:44] (SystemdUnitFailed) firing: (10) httpbb_hourly_appserver.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:44:47] Packaging, Cloud-VPS, Infrastructure-Foundations, serviceops, and 2 others: Package mcrouter for Debian Bookworm - https://phabricator.wikimedia.org/T346762 (fnegri) In progress→Resolved
[09:48:44] (SystemdUnitFailed) firing: (7) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:54:48] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (cmooney) @Jhancock.wm not 100%, I will try to chase on that.
[12:28:44] (SystemdUnitFailed) firing: (4) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:42:32] netbox, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: Netbox device location information not available on the first Puppet run of a device - https://phabricator.wikimedia.org/T347375 (taavi) Tagging @jbond @volans as this is closely related to the server provisioning workflow. Looki...
[13:07:05] netbox, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: Netbox device location information not available on the first Puppet run of a device - https://phabricator.wikimedia.org/T347375 (Volans) The current assumptions are: * Hosts in Active,Failed status in Netbox must be in PuppetDB...
[13:15:16] netbox, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: Netbox device location information not available on the first Puppet run of a device - https://phabricator.wikimedia.org/T347375 (cmooney) The netbox hiera sync happens when the //sre.puppet.sync-netbox-hiera// cookbook is run (t...
[13:37:24] netops, Infrastructure-Foundations, SRE: Move cr1-esams<->cr2-esams link to QSFP port - https://phabricator.wikimedia.org/T347323 (ayounsi) Thanks, I remembered there was a reason but forgot what it was! I guess it doesn't make much sense to buy a `MIC3-3D-2X40GE-QSFPP` seeing the [[ https://www.juni...
[13:38:19] netops, Infrastructure-Foundations, SRE: Add 4x10G breakout cable to cr2-esams - https://phabricator.wikimedia.org/T347323 (ayounsi)
[13:44:34] netops, Infrastructure-Foundations, SRE: Add 4x10G breakout cable to cr2-esams - https://phabricator.wikimedia.org/T347323 (cmooney) >>! In T347323#9199448, @ayounsi wrote: > Thanks, I remembered there was a reason but forgot what it was! Yeah it's a shame. I made the same mistake while planning th...
[14:17:19] netops, Infrastructure-Foundations, SRE: Add 4x10G breakout cable to cr2-esams - https://phabricator.wikimedia.org/T347323 (ayounsi) That's a great idea! Opened {T347403}
[14:48:02] netops, DC-Ops, Infrastructure-Foundations, SRE: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (ayounsi) Open→Resolved a:RobH I think we can close that one. @RobH did the audit afaik.
[14:58:44] (SystemdUnitFailed) firing: (4) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:01:36] ^ that was some leftover from the move to puppetdb hosts, now fixed
[15:03:44] (SystemdUnitFailed) firing: (4) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:08:54] (SystemdUnitFailed) firing: (4) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:38:44] (SystemdUnitFailed) firing: (3) slapd.service Failed on ldap-rw2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:48:44] (SystemdUnitFailed) firing: (3) slapd.service Failed on ldap-rw2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:34:08] Does IF own docker-registry.wikimedia.org? I'm getting 500s during CI: https://integration.wikimedia.org/ci/job/alerts-pipeline-test/1266/console
[16:49:32] Hm, was it related to the ongoing incident? I wouldn't have thought they were related but the errors no longer occur
[17:18:44] (SystemdUnitFailed) firing: (4) slapd.service Failed on ldap-rw2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:29:48] netops, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (jbond)
[17:39:45] brett: at least one inter-datacenter internal link was saturating, so it's possible
[17:43:28] aha, good point
[17:44:51] netbox, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: Netbox device location information not available on the first Puppet run of a device - https://phabricator.wikimedia.org/T347375 (cmooney) FWIW this inspired me to write up {T347411}
[17:48:45] (SystemdUnitFailed) firing: (4) slapd.service Failed on ldap-rw2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:48:45] (SystemdUnitFailed) firing: (3) slapd.service Failed on ldap-rw2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed