[08:21:42] (SystemdUnitFailed) firing: prometheus_lvs_realserver_mss.service Failed on ncredir2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:25:21] looking ^^
[08:26:42] (SystemdUnitFailed) resolved: prometheus_lvs_realserver_mss.service Failed on ncredir2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:51] Traffic: prometheus-lvs-realserver-mss crashed on ncredir2002 - https://phabricator.wikimedia.org/T354721 (Vgutierrez)
[08:28:39] Traffic: prometheus-lvs-realserver-mss crashed on ncredir2002 - https://phabricator.wikimedia.org/T354721 (Vgutierrez) p:Triage→Medium
[08:43:56] Traffic: prometheus-lvs-realserver-mss crashed on ncredir2002 - https://phabricator.wikimedia.org/T354721 (Vgutierrez) apparently `get_mss` failed to get|capture a SYN/ACK: `lang=python 61 if synack is None or synack[TCP] is None: 62 print(f"[!] Unexpected answer: {synack}", file=sys.stderr) 6...
[09:23:07] Traffic: prometheus-lvs-realserver-mss crashed on ncredir2002 - https://phabricator.wikimedia.org/T354721 (fgiunchedi) An additional metric to count failures seems appropriate to me; since these are exceptional events even a general counter (i.e. not per endpoint) would work I think
[10:02:48] godog: on T354721 you mention using a "general counter" to report errors
[10:02:48] T354721: prometheus-lvs-realserver-mss crashed on ncredir2002 - https://phabricator.wikimedia.org/T354721
[10:03:31] godog: I see a small problem with that, if errors don't occur, prometheus won't get any kind of data for that metric
[10:03:59] should I set the counter to 0 if everything goes as expected or should I use a gauge instead?
[10:05:05] of course I could set the Counter to zero and increase it if needed
[10:07:35] vgutierrez: yeah set the error counter at 0 at startup and increase as needed
[10:07:50] ack
[10:22:13] godog: I can't set a counter to 0
[10:22:46] it only exposes a inc() method
[10:27:30] https://www.irccloud.com/pastebin/oYFri6CG/
[10:28:08] ^^ that's an example using a counter that's only increased when MSS can't be measured
[10:29:34] godog: so IMHO a lvs_realserver_mss_successful_measurement implemented with a gauge makes more sense
[10:29:41] or _failed_measurement
[10:30:33] vgutierrez: sure that works too!
[10:41:36] godog: are we ok with missing metrics then?
[10:41:51] https://www.irccloud.com/pastebin/YSBFb3qc/
[10:42:02] that's an example with an error
[11:23:43] vgutierrez: I'll take another look after lunch
[11:35:41] Traffic, Patch-For-Review: purged package cannot be built due to failing test - https://phabricator.wikimedia.org/T354712 (CodeReviewBot) fabfur opened https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/5 Add CI for deb building
[11:35:48] Traffic, GitLab (Project Migration), Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (CodeReviewBot) fabfur opened https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/5 Add CI for deb building
[11:41:52] Traffic, Patch-For-Review: purged package cannot be built due to failing test - https://phabricator.wikimedia.org/T354712 (CodeReviewBot) fabfur opened https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/6 Skip integration test during build
[14:12:09] Traffic, Infrastructure-Foundations, SRE: Serve an HTTP response for measurement domains directly from Varnish - https://phabricator.wikimedia.org/T332028 (JameelKaisar) Open→Resolved
[14:12:14] Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (JameelKaisar)
[14:13:32] Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (JameelKaisar)
[14:16:41] Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (JameelKaisar)
[14:18:37] Traffic, SRE: "Our servers are currently under maintenance" page shown on HTTP 429 - https://phabricator.wikimedia.org/T354718 (MatthewVernon)
[14:20:43] Traffic, SRE: "Our servers are currently under maintenance" page shown on HTTP 429 - https://phabricator.wikimedia.org/T354718 (Vgutierrez) p:Triage→Medium
[14:31:40] vgutierrez: ok I've looked at your review and the code in general, I see what you mean re: missing metrics, I think the easiest would probably to export like -1 as mss when getting mss for whatever reason fails
[14:32:28] godog: no problem with setting it to 0.0 or -1.0 but as mentioned on the task, it will trigger an alert
[14:32:55] unless we refactor the alert to ignore MSS values of 0
[14:33:20] (assuming that's feasible)
[14:35:35] yeah arguably the alert should go off if mss can't be detected tho
[14:36:35] and the error can be temporary too, maybe a little leeway in the alert
[15:03:13] Traffic, Patch-For-Review: purged package cannot be built due to failing test - https://phabricator.wikimedia.org/T354712 (CodeReviewBot) fabfur opened https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/7 add first code draft to manage eventual kafka errors
[15:03:59] Traffic, SRE, Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (CodeReviewBot) fabfur opened https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/7 add first code draft to manage eventual kafka errors
[15:06:41] godog: that would be lie alerting on UNKNOWN on icinga
[15:06:46] *like
[15:08:39] vgutierrez: meeting, will reply later
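Editor's sketch of the sentinel idea from the 14:31 to 14:36 exchange: export a placeholder MSS when the measurement fails, and give the alert some leeway so one transient failure does not fire it. This is a plain-Python model under stated assumptions, not the exporter's or the alert's actual code. `UNKNOWN_MSS`, `mss_sample`, `should_alert` and the three-sample window are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Optional, Sequence

# Hypothetical sentinel: the chat mentions 0.0 or -1.0 as candidates.
UNKNOWN_MSS = 0.0


def mss_sample(measure: Callable[[], Optional[int]]) -> float:
    """Return the measured MSS, or the sentinel if the measurement fails."""
    try:
        mss = measure()
    except Exception:
        # Any failure (for example, no SYN/ACK captured) becomes the sentinel.
        return UNKNOWN_MSS
    return float(mss) if mss is not None else UNKNOWN_MSS


def should_alert(samples: Sequence[float], expected: float, leeway: int = 3) -> bool:
    """Fire only when the last `leeway` samples all differ from `expected`."""
    recent = samples[-leeway:]
    return len(recent) == leeway and all(s != expected for s in recent)
```

With this shape, a failed measurement still produces a sample (so the series never goes missing), and a single sentinel value between good samples does not alert. In a Prometheus alerting rule the same leeway would normally be expressed with a `for:` duration rather than in code.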
[15:55:08] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (Clement_Goubert) Summary of the discussion on the linked CR: - LLDP based logic runs the risk o...
[15:57:42] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (JMeybohm)
[16:00:22] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (akosiaris) >>! In T352893#9450792, @Clement_Goubert wrote: > Summary of the discussion on the l...
[16:09:29] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (cmooney) >>! In T352893#9450792, @Clement_Goubert wrote: > I am left wondering if the fear of L...
[16:16:55] Traffic, netops, Infrastructure-Foundations, SRE, and 2 others: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (Papaul) @cmooney link moved to ssw1-a8
[16:28:51] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (Volans) I might be missing context, but why we can't get that info from netbox? Extracting it d...
[16:42:40] Traffic, SRE: Show a better error page when returning an HTTP 429, not the "Our servers are currently under maintenance" one for 5xxs - https://phabricator.wikimedia.org/T354718 (Jdforrester-WMF)
[16:58:32] godog: sure.. I've amended the CR to report 0 for an unknown MSS, (a real one shouldn't be lower than ~500)
[17:06:16] netops, Infrastructure-Foundations, SRE, ops-codfw, Patch-For-Review: Migrate atlas-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348159 (cmooney)
[17:09:16] netops, Infrastructure-Foundations, SRE, ops-codfw: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (cmooney)
[17:09:52] netops, Infrastructure-Foundations, SRE, ops-codfw, Patch-For-Review: Migrate atlas-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348159 (cmooney) Open→Resolved Work completed. Cable moved and irb.2201 added to lsw1-a2-codfw. As no other devices are o...
[17:13:44] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (cmooney) >>! In T352893#9450929, @Volans wrote: > I might be missing context, but why we can't...
[17:28:35] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate mr1-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348164 (cmooney)
[17:29:29] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate mr1-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348164 (cmooney) Link is now up and BGP has established. ` cmooney@lsw1-a2-codfw> show route receive-protocol bgp 10.192.254.9 table PRODUCTION.inet.0 ters...
[17:52:40] vgutierrez: ack thanks! LGTM
[17:56:27] Cheers
[18:06:41] Traffic, Data-Engineering, Movement-Insights, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (dr0ptp4kt)
[18:52:48] Traffic, Data-Engineering, Movement-Insights, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (dr0ptp4kt) I'm scheduling time with @Mayakp.wiki and @MGerlach to soon discuss potential future use cases, but if folks familiar with...
[18:58:23] Traffic, Data-Engineering, Movement-Insights, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (dr0ptp4kt)
[22:10:43] netops, Infrastructure-Foundations, SRE: Automate BGP peering on MR routers towards core - https://phabricator.wikimedia.org/T354809 (cmooney) p:Triage→Low
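Editor's sketch for the counter question raised at 10:03 to 10:22 (a counter that "only exposes" `inc()` and so cannot be set to 0). The point it illustrates is that a monotonic counter can start at 0 and be exported from startup, so the series exists even before any error occurs. This is a stdlib-only model, not the real client library's API. The class `FailureCounter` and the metric name `lvs_realserver_mss_failed_measurements_total` are hypothetical.

```python
class FailureCounter:
    """Monotonic counter: starts at 0 and can only be increased."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._value = 0.0  # present from startup, before any failure

    def inc(self, amount: float = 1.0) -> None:
        """Increase the counter; decreasing is rejected."""
        if amount < 0:
            raise ValueError("counters can only be incremented")
        self._value += amount

    def expose(self) -> str:
        """Render the counter in the Prometheus text exposition format."""
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {self._value}\n"
        )


# Hypothetical metric name, for illustration only.
failures = FailureCounter(
    "lvs_realserver_mss_failed_measurements_total",
    "Number of failed MSS measurements",
)
```

Calling `failures.expose()` right after creation yields a sample with value `0.0`; each failed measurement would then call `failures.inc()`.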