[01:48:37] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:18:29] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[03:18:29] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[05:48:37] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:24:08] Mail, Infrastructure-Foundations, SRE, Wikimedia-Mailing-lists: Replace Exim on lists.wikimedia.org with Postfix - https://phabricator.wikimedia.org/T378021#10837839 (ABran-WMF)
[06:50:15] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:55:15] RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:18:29] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between
PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[07:18:29] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[08:36:10] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10838021 (ayounsi) It's actually multiple of them: * `gnmi_bfd_peer_session_state{}` missing in codfw, while it used to work u...
[08:48:49] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10838053 (cmooney) >>! In T388641#10838021, @ayounsi wrote: > * `gnmi_interfaces_interface_state_counters_in_fcs_errors{}` mi...
[08:48:55] FIRING: MaxConntrack: Max conntrack at 81.75% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[08:52:04] ^ fixed, krb1002 was initially installed with the insetup role and the KDC role got applied later (but sysctls are only read on boot)
[08:53:55] RESOLVED: MaxConntrack: Max conntrack at 81.75% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[09:00:46] Hello.
FYI I'm just about to run homer against "cr*eqiad*" like this: `homer "cr*eqiad*" commit "Adding two new workers to dse-k8s-eqiad T394647"`
[09:00:46] T394647: Repurpose spare snapshot servers as dse-k8s-workers - https://phabricator.wikimedia.org/T394647
[09:01:12] The diff looks good to me, but I don't use homer very much, so I thought I would just let you know.
[09:03:55] btullis: ack thanks!
[09:05:13] the workers are in row D/C so nothing special, cr*-eqiad is good
[09:16:25] Thanks. All done. Looks good from our side.
[09:23:20] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10838158 (ayounsi) Restarting gNMIc in esams fixed the issue for `gnmi_interfaces_interface_state_counters_out_errors{}`. It s...
[09:33:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:43:39] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10838224 (ayounsi)
[09:58:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:46:42] CAS-SSO, Infrastructure-Foundations: Error authenticating with services on CAS 7.1 - https://phabricator.wikimedia.org/T394759 (SLyngshede-WMF) NEW
[10:47:12] CAS-SSO, Infrastructure-Foundations: Error authenticating with services on CAS 7.1 - https://phabricator.wikimedia.org/T394759#10838411 (SLyngshede-WMF) p:Triage→High
[10:47:49] CAS-SSO,
Infrastructure-Foundations: Error authenticating with services on CAS 7.1 - https://phabricator.wikimedia.org/T394759#10838413 (SLyngshede-WMF) {F60294887}
[10:48:12] CAS-SSO, Infrastructure-Foundations: Error authenticating with services on CAS 7.1 - https://phabricator.wikimedia.org/T394759#10838414 (SLyngshede-WMF)
[11:05:03] CAS-SSO, Infrastructure-Foundations: Error authenticating with services on CAS 7.1 - https://phabricator.wikimedia.org/T394759#10838471 (SLyngshede-WMF) Fix: Sign out and then sign back in. This apparently clears any "broken" state left from CAS 7.0.
[11:18:29] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[11:18:29] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[11:39:49] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10838631 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8c92db5f-18b6-481b-8642-01c1d92b5cb0) set by cmooney@cumin1003 for 2:00:00 on 10 host(s) and their servi...
[12:21:52] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10838856 (ayounsi)
[12:49:35] quick question on spicerack's hiera_lookup: I get a "`Notice: Scope(Scap::Target[gervert/deploy]): mange_ssh_key=true but ssh::userkey gerrit-deploy already defined.`" while retrieving `profile::gerrit::git_dir`. I'm wondering if it is a known behavior?
I tried to change the output format to json but the string was still there ahead of the data
[12:52:37] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10838979 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f40f3f46-731d-46ef-9db5-647d735907d6) set by cmooney@cumin1003 for 3:00:00 on 1 host(s) and their servic...
[13:01:49] netops, Infrastructure-Foundations: Downgrade pfw1-codfw to Junos 23.4R2-S3 - https://phabricator.wikimedia.org/T393996#10839059 (Jgreen)
[13:08:09] arnaudb: looking
[13:08:41] sigh, puppet is not writing the notice to stderr (that is already discarded)
[13:09:14] I can escape it fairly easily because it's on 2 lines (so if notice → trim)
[13:09:38] but it can have unintended consequences if not spotted
[13:10:56] indeed and it should not happen, but at the same time it highlights an issue in the gerrit puppetization
[13:11:39] indeed!
[13:12:20] so I'm always in doubt in those cases because hiding it in spicerack might actually prevent us from noticing
[13:12:39] although in theory the 2> /dev/null should already do the right thing if puppet was properly writing to the right I/O
[13:13:18] elukey: what do you think? should we do some magic to hide any notices or not?
[13:13:37] if the pattern doesn't change it could be logged maybe?
[13:14:33] which pattern? from spicerack PoV we have no idea what a hiera key can contain and it could start with Notice too when in string format
[13:14:59] "Notice: Scope"
[13:15:15] indeed it could collide
[13:15:45] let me try to understand - there is a problem in gerrit's puppetization leading to the issue that arnaudb highlighted, that wouldn't occur if the config was right?
[13:15:48] it can be any notice or warning, given the very little use of hiera_lookup we could remove the format thing, allow only yaml and remove every line before '--- '
[13:15:59] elukey: correct AFAICT
[13:16:11] then the puppetization should be fixed :D
[13:16:17] \o/
[13:16:34] I mean the puppet code, it shouldn't emit that notice at all
[13:16:43] I don't think that we should do anything on spicerack's side
[13:17:00] elukey: only weird thing, puppet runs clean on the host, no Notice
[13:17:07] the hiera lookup triggers it
[13:17:34] e.g. https://puppetboard.wikimedia.org/report/gerrit1003.wikimedia.org/5ea413b358d8d086e4b250d8f731f7b942caa663
[13:17:42] weird indeed, but if we start trimming etc. it may lead to unintended consequences as you wrote earlier on
[13:17:48] how difficult is it to fix arnaudb ?
[13:20:04] SRE-tools, DC-Ops, Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10839156 (Volans) p:Triage→Medium
[13:35:38] let me check
[13:38:13] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10839281 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5afc68ed-eba5-4a71-b833-f809ae58201b) set by cmooney@cumin1003 for 4:00:00 on 11...
[13:47:45] it would not be simple to fix right now, I'll need to involve releng to make sure I won't be breaking anything on their side.
I'll handle it locally in my cookbook for now and follow up with a phab task in a bit
[13:53:12] arnaudb: ack, for your use case it is probably safe to just pick the last line as you expect a single line value
[13:58:29] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:07:43] Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629#10839410 (MoritzMuehlenhoff) >>! In T392629#10835971, @jhathaway wrote: > since the validate cmd runs prior to writing the file to its destination. Right, I forgot about that.
[14:12:14] SRE-tools, DC-Ops, Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10839431 (RobH) Sorry, I meant to update this task with that info sooner! Serial-ATA_Firmware_VJPKG_WN64_DL7C_A00.EXE The DL7C version.
[15:18:29] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices.
- https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:18:29] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[15:31:43] Mail, Fundraising-Backlog, Infrastructure-Foundations, SRE: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#10839809 (Aklapper) →Duplicate dup:T394788
[17:11:07] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10840439 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e24daea6-0330-4b79-bf33-b9e0f9709a10) set by cmooney@cumin1003 for 2:00:00 on 11...
[17:58:29] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:13:52] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10840704 (cmooney)
[18:14:36] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10840709 (cmooney)
[18:18:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:23:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:25:02] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10840770 (cmooney) Open→Resolved a:cmooney Ok this is now complete. A few niggles along the way that were sorted out with multiple re-seat...
[18:55:01] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10840954 (cmooney) Resolved→Open Actually there are a few bits like the license and the inventory items in Netbox to be completed which I'll take o...
[19:18:29] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[19:18:50] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[19:35:25] https://github.com/netbox-community/pynetbox/releases/tag/v7.5.0
[20:03:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:58:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:18:29] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors
between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[23:23:29] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[23:27:13] netops, DC-Ops, Infrastructure-Foundations, ops-codfw: codfw: BAD PEM3 on cr2-codfw - https://phabricator.wikimedia.org/T394868 (Papaul) NEW
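
Note on the hiera_lookup thread (12:49 to 13:53): the two workarounds floated there, dropping every line before the YAML `--- ` marker (13:15:48) and picking the last line when a single-line value is expected (13:53:12), could look roughly like the sketch below. This is not spicerack code: the function names are made up for illustration, and the `/srv/gerrit/git` value in the sample is a placeholder, not the real value of `profile::gerrit::git_dir`.

```python
def strip_before_yaml_marker(output: str) -> str:
    """Drop every line that precedes the first YAML document marker.

    Hypothetical helper: assumes the lookup output is YAML and that any
    Notice/Warning lines are printed before the '---' marker.
    """
    lines = output.splitlines()
    for index, line in enumerate(lines):
        if line == "---" or line.startswith("--- "):
            return "\n".join(lines[index:])
    raise ValueError("no YAML document marker found in output")


def last_line_value(output: str) -> str:
    """Return the last non-empty line, for lookups expected to be single-line."""
    non_empty = [line for line in output.splitlines() if line.strip()]
    if not non_empty:
        raise ValueError("empty output")
    return non_empty[-1]


# Sample output shaped like the one reported at 12:49:35; the value after
# '--- ' is a placeholder chosen for this example.
sample = (
    "Notice: Scope(Scap::Target[gervert/deploy]): mange_ssh_key=true "
    "but ssh::userkey gerrit-deploy already defined.\n"
    "--- /srv/gerrit/git\n"
)

cleaned = strip_before_yaml_marker(sample)
single = last_line_value(sample)
```

Anchoring on the marker rather than on the "Notice: Scope" prefix avoids the collision raised at 13:14:33, where a hiera value could itself start with "Notice". Either way the notice is hidden from the caller, which is the trade-off discussed at 13:12:20.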