[00:00:23] RECOVERY - Check unit status of drop_event on an-launcher1002 is OK: OK: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:13:49] PROBLEM - Check unit status of drop_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:18:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp3050 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp3050%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[01:23:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp3050 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp3050%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[04:26:11] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:40:50] btullis: o/
[07:41:11] I was checking an alarm for the dse cluster, some 504s registered by kube api
[07:41:15] (for LIST actions)
[07:41:39] while checking logstash though I noticed that 1005->1008 use device mapper in docker, not overlay2
[07:42:58] I am going to drain + reinit the kubelet on them
[07:43:02] err docker
[07:49:52] !log re-initialize docker on dse-k8s-worker100[5-8] - wrong storage type set (devicemapper instead of overlay2)
[07:49:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:54:27] !log re-initialize docker on dse-k8s-worker1004 - wrong storage type set (devicemapper instead of overlay2)
[07:54:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:56:59] all right, all overlay2 now
[07:57:01] weird
[07:59:56] kube api errors going down
[08:02:06] in logstash there is a mention of 1004 having troubles right when the 504s increased, Oct 6 ~ 14:%0
[08:02:09] 14:50
[08:02:33] but I don't see anything changed in the puppet logs / SAL / etc., so maybe it was just a weird config that eventually triggered an issue
[08:09:00] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:15:10] RECOVERY - Check unit status of refinery-drop-eventlogging-legacy-raw-partitions on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-eventlogging-legacy-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:27:56] PROBLEM - Check unit status of refinery-drop-eventlogging-legacy-raw-partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-eventlogging-legacy-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:25:12] elukey, thanks so much. I thought I had manually fixed all of the devicemapper hosts, but clearly not.
[09:26:26] !log delete calico pods in CrashLoop on dse (probably due to the incorrect docker settings)
[09:26:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:27:01] btullis: I remember having checked as well at the time, no idea what happened
[09:28:04] I was planning to do a https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_nodes#Reimage_the_node at some point, because I remember that going from insetup->role_dse_k8s_worker resulted in getting devicemapper first.
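For readers following along: the quickest way to spot this class of drift fleet-wide is to ask each worker which storage driver Docker is actually running with. A minimal sketch in Python, assuming passwordless SSH and sudo on the workers; the hostname suffix and range are illustrative guesses, not taken from the log:

```python
#!/usr/bin/env python3
"""Audit the Docker storage driver on a list of hosts over SSH.

Illustrative sketch only: assumes passwordless SSH, sudo rights on each
worker, and that the .eqiad.wmnet suffix is correct (an assumption).
"""
import subprocess

# Workers mentioned in the log; adjust the range/domain to taste.
HOSTS = [f"dse-k8s-worker{n}.eqiad.wmnet" for n in range(1001, 1009)]

def storage_driver(host: str) -> str:
    """Return the storage driver reported by `docker info` on host."""
    result = subprocess.run(
        ["ssh", host, "sudo", "docker", "info", "--format", "{{.Driver}}"],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout.strip() or f"error: {result.stderr.strip()}"

for host in HOSTS:
    driver = storage_driver(host)
    flag = "" if driver == "overlay2" else "  <-- needs re-init"
    print(f"{host}: {driver}{flag}")
```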
[10:15:47] 10Analytics-Clusters, 10Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (10BTullis)
[13:48:26] 10Data-Engineering: Requesting Kerberos identity for user sstefanova - https://phabricator.wikimedia.org/T320253 (10Slst2020)
[14:26:49] 10Data-Engineering, 10Equity-Landscape: Readership Output Rank Metrics - https://phabricator.wikimedia.org/T306617 (10KCVelaga_WMF) @JAnstee_WMF > The list of output countries from product data is inconsistent each year depending on metric hits, emerging spaces can sometimes disappear from year to year. It wi...
[14:27:19] 10Data-Engineering, 10Equity-Landscape: Readership input metrics - https://phabricator.wikimedia.org/T309273 (10KCVelaga_WMF) Noting that there are some large outliers in the data for yoy growth inputs ` country_code pageviews_yoy_growth unique_devices_yoy_growth BV 3832.75 857.333333 GS 34.604563 26.833...
[16:22:41] joal, aqu, xcollazo: I tried to spin up an airflow dev instance for the re-loading of unique devices into cassandra, but it cannot work on the stats machine, since it needs to be executed under the analytics user (no keytab there), and I don't want to execute it on an-launcher1002, since we know the dev instance can interfere with the production instance.
[16:23:12] so I created an MR in prod, adding 2 temporary DAGs for the re-loading (or backfilling)
[16:23:42] if you're OK, I will deploy it and try running them.
[16:23:58] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/172
[16:25:41] the temporary dags are exact copies of their production counterparts, except for: 1) the name, 2) the job config, which contains only the unique_devices properties, and 3) the start_date, which is the 1st of July.
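To make that pattern concrete, here is a hypothetical sketch of what such a temporary backfill DAG can look like. The dag_id, config keys, and placeholder operator are all invented for illustration (the real DAGs are in MR 172 above), but the three deltas match the ones listed: a new name, a config reduced to the unique_devices properties, and an earlier start_date.

```python
"""Hypothetical sketch of a temporary backfill DAG: same shape as the
production DAG, but with 1) a new dag_id, 2) a config reduced to the
unique_devices datasets, and 3) start_date moved back to 2022-07-01.
All names below are illustrative, not the real airflow-dags code."""
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Reduced job config: only the unique_devices properties (hypothetical keys).
UNIQUE_DEVICES_CONFIG = {
    "datasets": [
        "unique_devices_per_domain",
        "unique_devices_per_project_family",
    ],
}

with DAG(
    dag_id="cassandra_load_unique_devices_backfill_temp",  # 1) new name
    start_date=datetime(2022, 7, 1),                       # 3) backfill start
    schedule_interval="@daily",
    catchup=True,  # re-run every scheduled interval since start_date
    tags=["temporary", "backfill"],
) as dag:
    for dataset in UNIQUE_DEVICES_CONFIG["datasets"]:      # 2) reduced config
        BashOperator(
            task_id=f"load_{dataset}_to_cassandra",
            # Placeholder for the real loading job; {{ ds }} is the run date.
            bash_command=f"echo 'load {dataset} for {{{{ ds }}}}'",
        )
```

With catchup=True and a July start_date, deploying the DAG makes the scheduler backfill every daily run up to the present, which is exactly the re-loading behaviour described above.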
[16:40:07] RECOVERY - Check unit status of drop-features-actor-hourly on an-launcher1002 is OK: OK: Status of the systemd unit drop-features-actor-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:53:41] PROBLEM - Check unit status of drop-features-actor-hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-features-actor-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:11:43] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (10Ottomata) In [[ https://phabricator.wikimedia.org/T212482#8294070 | T212482#8294070 ]] @daniel wrote:...
[17:29:57] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python, 10Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (10Ottomata) It is more work, but not significantly. What needs changed is - https://github.com/wikimedia/puppet/blob/productio...
[17:30:48] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python, 10Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (10Ottomata) ^ would be part of {T302819}
[18:46:25] RECOVERY - Check unit status of drop-features-actor-hourly on an-launcher1002 is OK: OK: Status of the systemd unit drop-features-actor-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:48:19] RECOVERY - Check unit status of drop-anomaly-detection on an-launcher1002 is OK: OK: Status of the systemd unit drop-anomaly-detection https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:58:01] RECOVERY - Check unit status of drop-predictions-actor_label-hourly on an-launcher1002 is OK: OK: Status of the systemd unit drop-predictions-actor_label-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:03:55] RECOVERY - Check unit status of refinery-drop-banner-activity on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-banner-activity https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:19:27] RECOVERY - Check unit status of refinery-drop-pageview-actor-hourly-partitions on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-pageview-actor-hourly-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:22:51] 10Data-Engineering, 10API Platform (Product Roadmap), 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0: Pageviews Service - https://phabricator.wikimedia.org/T288296 (10VirginiaPoundstone) p:05Triage→03High
[19:23:08] 10Data-Engineering, 10API Platform (Product Roadmap), 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0: Editors service - https://phabricator.wikimedia.org/T288305 (10VirginiaPoundstone) p:05Triage→03Medium
[19:23:16] 10Data-Engineering, 10API Platform (Product Roadmap), 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0: Mediarequests Service - https://phabricator.wikimedia.org/T288303 (10VirginiaPoundstone) p:05Triage→03Medium
[20:35:40] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python, 10Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (10Antoine_Quhen) Yes, the first steps look clear. We also need to: * add jupyterhub-singleuser to conda-analytics. (Alternativ...
[20:46:29] RECOVERY - Check unit status of refinery-drop-webrequest-refined-partitions on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-webrequest-refined-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:04:03] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python: Add sql_tuple function to wmfdata-python - https://phabricator.wikimedia.org/T293706 (10nshahquinn-wmf) Since I've currently got my hands in Wmfdata and the code for this function is already written, I've put up a pull request: https://github.com/wik...
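The sql_tuple function referenced in T293706 formats a Python iterable as a SQL tuple literal, so a list of values can be dropped into an IN (...) clause. The actual implementation lives in the linked pull request; the version below is only a plausible reconstruction of the idea:

```python
def sql_tuple(items) -> str:
    """Format an iterable as a SQL tuple literal for IN (...) clauses.

    Illustrative sketch only; the real wmfdata-python implementation is
    in the pull request linked above. Strings are single-quoted with
    embedded quotes doubled; numbers are passed through unchanged.
    """
    items = list(items)
    if not items:
        raise ValueError("cannot format an empty iterable as a SQL tuple")
    parts = []
    for item in items:
        if isinstance(item, str):
            escaped = item.replace("'", "''")  # double embedded quotes
            parts.append(f"'{escaped}'")
        elif isinstance(item, (int, float)):
            parts.append(str(item))
        else:
            raise TypeError(f"unsupported type for SQL tuple: {type(item)!r}")
    return f"({', '.join(parts)})"

# Example: sql_tuple(["enwiki", "dewiki"]) -> "('enwiki', 'dewiki')"
```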