[04:35:56] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:09:11] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:33:15] (VarnishKafkaDeliveryErrors) firing: (2) varnishkafka has cache_text errors on cp3052:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [08:33:15] (VarnishKafkaDeliveryErrors) firing: (6) varnishkafka has cache_upload errors on cp3053:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [08:47:45] (VarnishKafkaDeliveryErrors) resolved: (8) varnishkafka has cache_text errors on cp3050:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [08:47:51] (VarnishKafkaDeliveryErrors) resolved: (8) varnishkafka has cache_upload errors on cp3051:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [08:55:42] (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp3050 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [08:55:47] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp3052 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:25:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp3059 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3059%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:25:57] (VarnishkafkaNoMessages) resolved: varnishkafka on cp3059 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3059%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:28:23] Hmm. Still a worrying number of these varnishkafka alerts coming through. Last time it tallied with an active attack, but I'm not aware of anything at the moment. [09:59:47] btullis: esams being depooled? :) [10:00:32] from the SAL: 07:50 topranks: De-pooling esams in advance of cr2-esams line card reboot [10:02:07] vgutierrez: Ah, thanks. Yes, I'm sure that's likely to be the trigger. However, this alarm is only supposed to go off when there's a discrepancy between the number of requests received by varnish and the number of messages sent by varnishkafka.
False positives like this mean that I must have got the maths wrong in the alertmanager rule somehow. [10:02:50] It was supposed to handle the case of depooling a DC, but it doesn't look like it's working properly :-( [10:03:27] btullis, vgutierrez: I did cause an issue at approx 08:30 for about 5-10 mins (not user affecting as site was de-pooled) [10:03:36] reachability to elements in esams was disrupted [10:03:52] right now everything back to normal and esams was repooled [10:05:31] Great! Thanks both. All good to know, but I probably haven't got time right now to fine-tune the rule. [11:15:37] 10Data-Engineering, 10Equity-Landscape: Readership Output Rank Metrics - https://phabricator.wikimedia.org/T306617 (10KCVelaga_WMF) @JAnstee_WMF > Some countries do not appear each year which for each internal metric so we need to ensure all are being included and zeroed into the standard frame of geo locati... [11:18:51] PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:23:23] !log decommissioning aqs1004 [11:23:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:43:51] 10Data-Engineering, 10Equity-Landscape: Readership input metrics - https://phabricator.wikimedia.org/T309273 (10ntsako) a:05ntsako→03JAnstee_WMF [12:32:04] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: [12:32:04] t per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:47:18] Ah, thanks joal. Yes, these alerts will come through from the remaining hosts in the (legacy) aqs cluster. I will downtime them while I carry on with the remaining hosts.
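A minimal sketch of the discrepancy check btullis describes at 10:02, assuming hypothetical inputs and thresholds; the real rule is a Prometheus/Alertmanager expression, and the low-traffic guard below is the piece the depooled-DC case would need:

def should_alert(varnish_requests: float, kafka_messages: float,
                 min_traffic: float = 10.0, max_loss_ratio: float = 0.05) -> bool:
    """Fire only when varnishkafka delivers noticeably fewer messages than
    varnish received requests over the same window. All names and
    thresholds here are illustrative, not the deployed rule."""
    # Depooled-DC guard: with (near-)zero traffic, any discrepancy is noise.
    if varnish_requests < min_traffic:
        return False
    lost = varnish_requests - kafka_messages
    return lost / varnish_requests > max_loss_ratio

# A depooled site sends almost nothing, so the guard suppresses the alert:
assert should_alert(100000, 90000) is True
assert should_alert(3, 0) is False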
[12:47:46] Thank you btullis [12:52:27] ACKNOWLEDGEMENT - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CR [12:52:27] Test Get per file requests returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:52:28] ACKNOWLEDGEMENT - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICA [12:52:29] Get per article page views returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:52:30] ACKNOWLEDGEMENT - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CR [12:52:31] Test Get per file requests returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:52:32] ACKNOWLEDGEMENT - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICA [12:52:33] Get per article page views returned the unexpected status 500 (expecting: 200) Btullis Decommissioning the legacy aqs cluster: T302277 https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:56:21] 10Analytics-Radar, 10Data-Engineering, 10Event-Platform Value Stream, 10SRE, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) 05Open→03Resolved a:03elukey The kafka logging clusters have the new PKI configurati...
[12:59:41] 10Analytics-Radar, 10SRE, 10Traffic, 10Patch-For-Review: Consider adding X-Analytics subfield for 'has a session cookie' - https://phabricator.wikimedia.org/T319324 (10Vgutierrez) [13:48:49] !log decommission aqs1007 (also forgot to log aqs1006) [13:48:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:05:15] (03PS18) 10Joal: Update refine to use Iceberg for event_sanitize [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739) [14:22:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4036 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4036%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:25:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4022 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4022%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:27:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4036 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4036%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:30:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4022 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4022%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:33:42] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp4022 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:55:34] joal, what do you think if we just use a slightly modified version of the paths when sorting them?
[14:55:35] re.sub('[0-9]+', lambda m: m.group(0).zfill(padding), path) [14:56:14] /some/path/year=2022/month=3/day=15/hour=0 --> /some/path/year=2022/month=0003/day=0015/hour=0000 [14:59:12] (VarnishKafkaDeliveryErrors) resolved: varnishkafka has cache_text errors on cp1089:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1089 - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors [14:59:41] mforns: I think I'd rather use a time representation than changing the paths :) [14:59:51] they would not change [14:59:55] but this is open to discussion - let's see if otto wanna chime in [15:00:20] it could be: [15:00:21] for child_path in sorted(child_paths, key=pad_numbers): [15:00:36] just ordering the paths using that criteria, without changing them [15:02:14] it would be the simplest... the other way the code gets longer :/ [15:04:23] yeah, that would have been my guess that adding a time in addition to path makes things complicated [15:04:27] hm [15:04:32] BTW joal, I was wondering yesterday what the daily_aggregated_monthly job did, because I see it outputs to the Druid daily datasource, but I was not sure if I had to backfill that too? [15:04:47] unique devices in Druid ^^^ [15:05:11] This job is about re-compacting druid daily into a month [15:05:25] so, instead of running daily ones, you could just rerun those [15:05:34] oh...ok [15:06:03] I suspected that, but I saw they have a query... [15:07:10] will do! [15:21:18] 10Data-Engineering, 10Equity-Landscape: Editorship Output Rank Metrics - https://phabricator.wikimedia.org/T306618 (10JAnstee_WMF) 05Open→03In progress p:05Triage→03High [15:21:20] 10Data-Engineering, 10Equity-Landscape: Milestone: Ingest and Transform Input Data - https://phabricator.wikimedia.org/T305475 (10JAnstee_WMF) [15:21:48] 10Data-Engineering, 10Equity-Landscape: Readership Output Rank Metrics - https://phabricator.wikimedia.org/T306617 (10JAnstee_WMF) 05Open→03In progress p:05Triage→03High [15:21:50] 10Data-Engineering, 10Equity-Landscape: Milestone: Ingest and Transform Input Data - https://phabricator.wikimedia.org/T305475 (10JAnstee_WMF) [15:30:08] (03PS6) 10Mforns: Fix end-of-month/year allowed_interval issue [analytics/refinery] - 10https://gerrit.wikimedia.org/r/836295 (https://phabricator.wikimedia.org/T316746) [16:30:49] 10Analytics, 10API Platform (Product Roadmap), 10Code-Health-Objective, 10Epic, and 3 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10VirginiaPoundstone) [16:35:47] RECOVERY - Check unit status of refinery-drop-raw-netflow-event on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-raw-netflow-event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:49:25] 10Data-Engineering, 10Patch-For-Review: Fix anaconda-wmf's setting of REQUESTS_CA_BUNDLE - https://phabricator.wikimedia.org/T306197 (10BTullis) @xcollazo - if you're going to be building a new anaconda-wmf version, it would be great to roll up this change into it as well, if possible. Thanks. [16:56:13] 10Data-Engineering, 10Patch-For-Review: Fix anaconda-wmf's setting of REQUESTS_CA_BUNDLE - https://phabricator.wikimedia.org/T306197 (10xcollazo) Got it @BTullis!
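A runnable version of the pad_numbers sorting key discussed above at 15:00, for reference; the padding width of 6 is illustrative, and the paths themselves are never modified, only the key used for comparison:

import re

def pad_numbers(path, padding=6):
    # Zero-pad every digit run so numeric path components compare
    # correctly as strings (day=9 sorts before day=15).
    return re.sub(r'[0-9]+', lambda m: m.group(0).zfill(padding), path)

child_paths = [
    '/some/path/year=2022/month=10/day=1/hour=0',
    '/some/path/year=2022/month=3/day=15/hour=0',
    '/some/path/year=2022/month=3/day=9/hour=0',
]
for child_path in sorted(child_paths, key=pad_numbers):
    print(child_path)
# -> /some/path/year=2022/month=3/day=9/hour=0
#    /some/path/year=2022/month=3/day=15/hour=0
#    /some/path/year=2022/month=10/day=1/hour=0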
[16:58:06] 10Data-Engineering, 10Equity-Landscape: Editorship Output Rank Metrics - https://phabricator.wikimedia.org/T306618 (10KCVelaga_WMF) @JAnstee_WMF > Editor presence and Editor growth outputs seem to be scaled 0 to 1 rather than 0 to 10 You are right, fixed this. > There seems to be values replacements still n... [16:59:27] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python, 10Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (10xcollazo) We'll have to update `anaconda-wmf` to make Spark3 work well. Since that is a big undertaking, let's also roll thes... [16:59:45] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python, 10Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (10xcollazo) 05Open→03In progress [17:03:34] 10Data-Engineering, 10Patch-For-Review: Fix anaconda-wmf's setting of REQUESTS_CA_BUNDLE - https://phabricator.wikimedia.org/T306197 (10Ottomata) Hm. I dunno how long it will take us to move folks to conda analytics instead of anaconda-wmf, but it is def easier to update conda-analytics. @aqu ? [17:04:32] 10Data-Engineering, 10Patch-For-Review: Fix anaconda-wmf's setting of REQUESTS_CA_BUNDLE - https://phabricator.wikimedia.org/T306197 (10Ottomata) Oops, I meant to write this on the other ticket about the MariaDB dep. Ignore me here. [17:06:05] 10Data-Engineering, 10Equity-Landscape: Readership Output Rank Metrics - https://phabricator.wikimedia.org/T306617 (10JAnstee_WMF) @KCVelaga_WMF > sorry I was unclear > Some countries do not appear each year which for each internal metric so we need to ensure all are being included and zeroed into the standar... [19:20:34] 10Data-Engineering-Kanban, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: [BUG] jsonschema-tools materializes fields in yaml in a different order than in json files - https://phabricator.wikimedia.org/T308450 (10Ottomata) 05Open→03Resolved Resolving because we fixed the bug in jsonschema-... [19:22:52] (03PS1) 10Ottomata: Remove materialized json files and disable materializing them [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/839677 (https://phabricator.wikimedia.org/T315674) [19:22:54] (03PS1) 10Ottomata: Remove materialized json files and disable materializing them [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/839678 (https://phabricator.wikimedia.org/T315674) [19:40:45] !log Deployed airflow to fix projectview_hourly_dag [19:40:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:51:51] !log Killed Oozie projectview-hourly job [19:51:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:51:54] !log Started airflow projectview_hourly_dag [19:51:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:26:44] Hello. A few hours ago, the revision-create stream was showing very old events (September 29). Was it a bug? [20:46:50] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python, 10Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (10xcollazo) >>! In T319360#8292254, @Ottomata wrote: > Hm. I dunno how long it will take us to move folks to conda analytics >...
[23:37:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2031%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [23:42:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2031%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages