[03:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:53:24] Mail, collaboration-services, Infrastructure-Foundations, Znuny: Clean up OTRS/Znuny addresses handles by gsuite - https://phabricator.wikimedia.org/T284145#9798320 (LSobanski) Resolved→Open Reopening as the alert silencing (https://gerrit.wikimedia.org/r/c/operations/puppet/+/697785) sti...
[07:26:45] CFSSL-PKI, Infrastructure-Foundations: CFSSL gencert "remote error: tls: certificate require" - https://phabricator.wikimedia.org/T355750#9798419 (ayounsi) As data point, same error today with `cumin1002:~$ sudo cookbook sre.network.tls lsw1-d1-codfw`
[07:45:48] netops, Infrastructure-Foundations, Data-Platform-SRE (2024.05.06 - 2024.05.26): an-worker1165.eqiad.wmnet and increased network activity resulting in page on May 13 2024 - https://phabricator.wikimedia.org/T364893#9798455 (Gehel)
[07:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:26:48] volans|off: I have to rename a couple of servers, do you have time by chance to review of https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1008818 (I'll do as well) so that I could maybe test it?
[08:59:14] I'll upgrade seaborgium to bullseye in a bit
[09:23:31] update is complete
[10:52:57] netops, cloud-services-team, Infrastructure-Foundations, SRE: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9799115 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm
[11:05:53] netops, cloud-services-team, Infrastructure-Foundations, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9799155 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with...
[11:11:11] netops, cloud-services-team, Infrastructure-Foundations, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9799171 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet...
[11:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:52:49] netops, cloud-services-team, Infrastructure-Foundations, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9799293 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with...
[11:59:11] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9799310 (MoritzMuehlenhoff)
[12:50:17] hi folks. we had two pages overnight (my time!) for:
[12:50:37] 00:26:45 <+jinxer-wm> FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-magru.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[12:50:42] 00:26:46 <+jinxer-wm> FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[12:50:47] https://librenms.wikimedia.org/device/device=298/tab=port/port=31837/
[12:51:55] https://librenms.wikimedia.org/device/device=298/tab=port/port=31837/
[12:52:02] er https://librenms.wikimedia.org/graphs/device=293/type=device_bits/from=1715691091/legend=yes/popup_title=Device+Traffic/
[12:52:20] timing for both these matches up and seems to point to ganeti7004
[12:52:34] anything obvious here that might have caused this?
[12:53:48] sukhe: careful with writing # p a g e, on IRC, it's a highlight word for many :)
[12:55:00] ha sorry
[12:55:21] it is for me too but I don't mind in that sense but yeah
[12:55:40] I am just more curious about this given it's magru
[13:00:06] (still looking)
[13:00:48] np thanks!
[13:16:22] sukhe: so far I'm leaning toward monitoring glitch, even though it's on 2 devices.. Individual hosts dashboard doesn't show anything, same for the netflow or sflow monitoring, it only shows up on LibreNMS
[13:17:37] XioNoX: ok thanks! I guess we can dig deeper if we see it again
[13:19:41] sukhe: yeah, thanks for the ping though! definitely worth having a look
[14:28:51] FIRING: [3x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:43:02] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:43:51] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:59:03] XioNoX: topranks: https://netbox.wikimedia.org/circuits/circuits/?site_id=11
[14:59:16] mostly for awareness, is it expected that the transit for EdgeUno is marked as provisioning here?
[14:59:28] context is that I was looking at the circuits for my own understanding and noticed it
[14:59:46] sukhe: hey good spot that’s an oversight
[14:59:48] I am not even sure if the state matters here but just in case it does :)
[14:59:51] I’ll change it now
[14:59:56] ok thanks!
[15:00:05] out of curiosity, does it matter what the status is here?
[15:00:20] Nah that field doesn’t control anything operationally, it’s just for our own info
[15:00:24] ok!
[15:00:42] was in a meeting with kwakuofori and it came up and hence
[15:28:21] netops, Infrastructure-Foundations, Data-Platform-SRE (2024.05.06 - 2024.05.26): an-worker1165.eqiad.wmnet and increased network activity resulting in page on May 13 2024 - https://phabricator.wikimedia.org/T364893#9800676 (BTullis) I think that the most likely candidate at the moment is user-gen...
[16:29:55] netops, Infrastructure-Foundations, Data-Platform-SRE (2024.05.06 - 2024.05.26): an-worker1165.eqiad.wmnet and increased network activity resulting in page on May 13 2024 - https://phabricator.wikimedia.org/T364893#9801014 (cmooney) >>! In T364893#9800673, @BTullis wrote: > I think that the most...
[16:59:07] Puppet, SRE: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870#9801220 (SMMPakPanel) Its all-encompassing strategy for social media marketing in Pakistan makes [[ https://smmpakpanel.com/ | SMM Pak Panel ]] unique. It helps businesses efficiently improve their we...
[17:02:50] Puppet, SRE: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870#9801243 (hashar)
[18:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed