[02:04:43] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:53:27] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[02:53:27] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[06:04:43] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:53:27] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[06:53:27] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[09:37:41] note to self: we don't have webrequest alerts going here since we moved to airflow maybe?
[10:04:43] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:53:27] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[10:53:27] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[11:02:33] milimetric: we got some gaps between webrequest_sampled_live and webrequest_sampled_128
[11:02:48] _live is missing some data for esams
[11:04:48] vgutierrez: we had two big (~8%) losses reported, and jobs were stuck for a few hours. I'm not sure exactly how this would affect those two pipelines
[11:05:11] the loss was reported as "false positive" which in this case is because of mostly invalid records (records with bad dt coming from varnish)
[11:05:47] I believe these don't get refined, so I think we have an actual drop for those two hours, and I haven't looked but maybe it's all in esams
[11:06:07] https://w.wiki/76iy
[11:06:17] vgutierrez: it's possible that webrequest_sampled_128 doesn't care about bad dt and loads them anyway, while _live does care
[11:06:26] you can see a big gap using that query between _live and _128
[11:06:55] could be due to requestctl action being null?
[11:13:50] is _live filtering out records that have requestctl null? Otherwise I wouldn't expect that
[11:15:00] well, I might go back to sleep here, I think there's definitely something to look at but I'm a little under the weather. If you think there's something urgent, ping on slack maybe, only old timers left on here. I'll check back in later.
[11:15:15] milimetric: don't worry, go rest :)
[11:15:32] I don't understand why that value is null BTW
[11:15:45] set req.http.X-Requestctl = req.http.X-Requestctl + ",odesa";
[12:02:10] bad requests that don't hit that line, interesting
[12:49:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:00:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:04:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:41:34] Data-Engineering, Anti-Harassment, Data Engineering and Event Platform Team, Privacy Engineering, and 4 others: Exposing revIDs (nothing more) of deleted/suppressed edits for research to respect their removal - https://phabricator.wikimedia.org/T200559 (Iislucas) @Ottomata I read the documentati...
[14:53:27] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[14:53:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[17:04:58] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:53:27] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[18:53:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[20:19:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:20:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:30:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:34:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:53:27] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[22:53:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability