[04:35:56] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:09:11] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:33:15] (VarnishKafkaDeliveryErrors) firing: (2) varnishkafka has cache_text errors on cp3052:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [08:33:15] (VarnishKafkaDeliveryErrors) firing: (6) varnishkafka has cache_upload errors on cp3053:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [08:47:45] (VarnishKafkaDeliveryErrors) resolved: (8) varnishkafka has cache_text errors on cp3050:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [08:47:51] (VarnishKafkaDeliveryErrors) resolved: (8) varnishkafka has cache_upload errors on cp3051:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [08:55:42] (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp3050 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [08:55:47] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp3052 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:25:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp3059 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3059%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:25:57] (VarnishkafkaNoMessages) resolved: varnishkafka on cp3059 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3059%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:28:23] Hmm. Still a worrying number of these varnishkafka alerts coming through. Last time it tallied with an active attack, but I'm not aware of anything at the moment. [09:59:47] btullis: esams being depooled? :) [10:00:32] from the SAL: 07:50 topranks: De-pooling esams in advance of cr2-esams line card reboot [10:02:07] vgutierrez: Ah, thanks. Yes, I'm sure that's likely to be the trigger. However, this alarm is only supposed to go off when there's a discrepancy between the number of requests received by varnish and the number of messages sent by varnishkafka.
False positives like this mean that I must have got the maths wrong in the alertmanager rule somehow. [10:02:50] It was supposed to handle the case of depooling a DC, but it doesn't look like it's working properly :-( [10:03:27] btullis, vgutierrez: I did cause an issue at approx 08:30 for about 5-10 mins (not user affecting as site was de-pooled) [10:03:36] reachability to elements in esams was disrupted [10:03:52] right now everything back to normal and esams was repooled [10:05:31] Great! Thanks both. All good to know, but I probably haven't got time right now to fine-tune the rule. [11:15:37] 10Data-Engineering, 10Equity-Landscape: Readership Output Rank Metrics - https://phabricator.wikimedia.org/T306617 (10KCVelaga_WMF) @JAnstee_WMF > Some countries do not appear each year which for each internal metric so we need to ensure all are being included and zeroed into the standard frame of geo locati... [11:18:51] PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:23:23] !log decommissioning aqs1004 [11:23:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:43:51] 10Data-Engineering, 10Equity-Landscape: Readership input metrics - https://phabricator.wikimedia.org/T309273 (10ntsako) a:05ntsako→03JAnstee_WMF [12:32:04] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: [12:32:04] t per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:47:18] Ah, thanks joal. Yes, these alerts will come through from the remaining hosts in the (legacy) aqs cluster. I will downtime them while I carry on with the remaining hosts.
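A minimal sketch of the discrepancy check btullis describes at 10:02, assuming hypothetical inputs and thresholds; the real rule is a Prometheus/Alertmanager expression, and the low-traffic guard below is the piece the depooled-DC case would need:

def should_alert(varnish_requests: float, kafka_messages: float,
                 min_traffic: float = 10.0, max_loss_ratio: float = 0.05) -> bool:
    """Fire only when varnishkafka delivers noticeably fewer messages than
    varnish received requests over the same window. All names and
    thresholds here are illustrative, not the deployed rule."""
    # Depooled-DC guard: with (near-)zero traffic, any discrepancy is noise.
    if varnish_requests < min_traffic:
        return False
    lost = varnish_requests - kafka_messages
    return lost / varnish_requests > max_loss_ratio

# A depooled site sends almost nothing, so the guard suppresses the alert:
assert should_alert(100000, 90000) is True
assert should_alert(3, 0) is False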
[12:47:46] Thank you btullis [12:52:27] ACKNOWLEDGEMENT - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CR [12:52:27] Test Get per file requests returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:52:28] ACKNOWLEDGEMENT - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICA [12:52:29] Get per article page views returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:52:30] ACKNOWLEDGEMENT - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CR [12:52:31] Test Get per file requests returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:52:32] ACKNOWLEDGEMENT - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICA [12:52:33] Get per article page views returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:56:21] 10Analytics-Radar, 10Data-Engineering, 10Event-Platform Value Stream, 10SRE, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) 05Open→03Resolved a:03elukey The kafka logging clusters have the new PKI configurati...
[12:59:41] 10Analytics-Radar, 10SRE, 10Traffic, 10Patch-For-Review: Consider adding X-Analytics subfield for 'has a session cookie' - https://phabricator.wikimedia.org/T319324 (10Vgutierrez) [13:48:49] !log decommission aqs1007 (also forgot to log aqs1006) [13:48:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:05:15] (03PS18) 10Joal: Update refine to use Iceberg for event_sanitize [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739) [14:22:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4036 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4036%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:25:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4022 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4022%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:27:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4036 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4036%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:30:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4022 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4022%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:33:42] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp4022 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:55:34] joal, what do you think if we just use a slightly modified version of the paths when sorting them?
[14:55:35] re.sub('[0-9]+', lambda m: m.group(0).zfill(padding), path) [14:56:14] /some/path/year=2022/month=3/day=15/hour=0 --> /some/path/year=2022/month=0003/day=0015/hour=0000 [14:59:12] (VarnishKafkaDeliveryErrors) resolved: varnishkafka has cache_text errors on cp1089:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1089 - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [14:59:41] mforns: I think I'd rather use a time representation than changing the paths :) [14:59:51] they would not change [14:59:55] but this is open to discussion - let's see if otto wanna chime in [15:00:20] it could be: [15:00:21] for child_path in sorted(child_paths, key=pad_numbers): [15:00:36] just ordering the paths using that criteria, without changing them [15:02:14] it would be the simplest... the other way the code gets longer :/ [15:04:23] yeah, that would have been my guess that adding a time in addition to path makes things complicated [15:04:27] hm [15:04:32] BTW joal, I was wondering yesterday what the daily_aggregated_monthly job did, because I see it outputs to the Druid daily datasource, but I was not sure if I had to backfill that too? [15:04:47] unique devices in Druid ^^^ [15:05:11] This job is about re-compacting druid daily into a month [15:05:25] so, instead of running daily ones, you could just rerun those [15:05:34] oh...ok [15:06:03] I suspected that, but I saw they have a query... [15:07:10] will do! [15:21:18] 10Data-Engineering, 10Equity-Landscape: Editorship Output Rank Metrics - https://phabricator.wikimedia.org/T306618 (10JAnstee_WMF) 05Open→03In progress p:05Triage→03High [15:21:20] 10Data-Engineering, 10Equity-Landscape: Milestone: Ingest and Transform Input Data - https://phabricator.wikimedia.org/T305475 (10JAnstee_WMF) [15:21:48] 10Data-Engineering, 10Equity-Landscape: Readership Output Rank Metrics - https://phabricator.wikimedia.org/T306617 (10JAnstee_WMF) 05Open→03In progress p:05Triage→03High [15:21:50] 10Data-Engineering, 10Equity-Landscape: Milestone: Ingest and Transform Input Data - https://phabricator.wikimedia.org/T305475 (10JAnstee_WMF) [15:30:08] (03PS6) 10Mforns: Fix end-of-month/year allowed_interval issue [analytics/refinery] - 10https://gerrit.wikimedia.org/r/836295 (https://phabricator.wikimedia.org/T316746) [16:30:49] 10Analytics, 10API Platform (Product Roadmap), 10Code-Health-Objective, 10Epic, and 3 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10VirginiaPoundstone) [16:35:47] RECOVERY - Check unit status of refinery-drop-raw-netflow-event on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-raw-netflow-event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:49:25] 10Data-Engineering, 10Patch-For-Review: Fix anaconda-wmf's setting of REQUESTS_CA_BUNDLE - https://phabricator.wikimedia.org/T306197 (10BTullis) @xcollazo - if you're going to be building a new anaconda-wmf version, it would be great to roll up this change into it as well, if possible. Thanks. [16:56:13] 10Data-Engineering, 10Patch-For-Review: Fix anaconda-wmf's setting of REQUESTS_CA_BUNDLE - https://phabricator.wikimedia.org/T306197 (10xcollazo) Got it @BTullis!
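A runnable version of the pad_numbers sorting key discussed above at 15:00, for reference; the padding width of 6 is illustrative, and the paths themselves are never modified, only the key used for comparison:

import re

def pad_numbers(path, padding=6):
    # Zero-pad every digit run so numeric path components compare
    # correctly as strings (day=9 sorts before day=15).
    return re.sub(r'[0-9]+', lambda m: m.group(0).zfill(padding), path)

child_paths = [
    '/some/path/year=2022/month=10/day=1/hour=0',
    '/some/path/year=2022/month=3/day=15/hour=0',
    '/some/path/year=2022/month=3/day=9/hour=0',
]
for child_path in sorted(child_paths, key=pad_numbers):
    print(child_path)
# -> /some/path/year=2022/month=3/day=9/hour=0
#    /some/path/year=2022/month=3/day=15/hour=0
#    /some/path/year=2022/month=10/day=1/hour=0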
[16:58:06] 10Data-Engineering, 10Equity-Landscape: Editorship Output Rank Metrics - https://phabricator.wikimedia.org/T306618 (10KCVelaga_WMF) @JAnstee_WMF > Editor presence and Editor growth outputs seem to be scaled 0 to 1 rather than 0 to 10 You are right, fixed this. > There seems to be values replacements still n... [16:59:27] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python, 10Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (10xcollazo) We'll have to update `anaconda-wmf` to make Spark3 work well. Since that is a big undertaking, let's also roll thes... [16:59:45] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python, 10Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (10xcollazo) 05Open→03In progress [17:03:34] 10Data-Engineering, 10Patch-For-Review: Fix anaconda-wmf's setting of REQUESTS_CA_BUNDLE - https://phabricator.wikimedia.org/T306197 (10Ottomata) Hm. I dunno how long it will take us to move folks to conda analytics instead of anaconda-wmf, but it is def easier to update conda-analytics. @aqu ? [17:04:32] 10Data-Engineering, 10Patch-For-Review: Fix anaconda-wmf's setting of REQUESTS_CA_BUNDLE - https://phabricator.wikimedia.org/T306197 (10Ottomata) Oops, I meant to write this on the other ticket about the MariaDB dep. Ignore me here. [17:06:05] 10Data-Engineering, 10Equity-Landscape: Readership Output Rank Metrics - https://phabricator.wikimedia.org/T306617 (10JAnstee_WMF) @KCVelaga_WMF > sorry I was unclear > Some countries do not appear each year which for each internal metric so we need to ensure all are being included and zeroed into the standar... [19:20:34] 10Data-Engineering-Kanban, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: [BUG] jsonschema-tools materializes fields in yaml in a different order than in json files - https://phabricator.wikimedia.org/T308450 (10Ottomata) 05Open→03Resolved Resolving because we fixed the bug in jsonschema-... [19:22:52] (03PS1) 10Ottomata: Remove materialized json files and disable materializing them [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/839677 (https://phabricator.wikimedia.org/T315674) [19:22:54] (03PS1) 10Ottomata: Remove materialized json files and disable materializing them [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/839678 (https://phabricator.wikimedia.org/T315674) [19:40:45] !log Deployed airflow to fix projectview_hourly_dag [19:40:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:51:51] !log Killed Oozie projectview-hourly job [19:51:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:51:54] !log Started airflow projectview_hourly_dag [19:51:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:26:44] Hello. A few hours ago, the revision-create stream was showing very old events (September 29). Was it a bug? [20:46:50] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python, 10Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (10xcollazo) >>! In T319360#8292254, @Ottomata wrote: > Hm. I dunno how long it will take us to move folks to conda analytics >...
[23:37:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2031%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [23:42:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2031%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages