[00:00:23] RECOVERY - Check unit status of drop_event on an-launcher1002 is OK: OK: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:13:49] PROBLEM - Check unit status of drop_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:18:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp3050 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp3050%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[01:23:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp3050 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp3050%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[04:26:11] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:40:50] btullis: o/
[07:41:11] I was checking an alarm for the dse cluster, some 504s registered by kube api
[07:41:15] (for LIST actions)
[07:41:39] while checking logstash though I noticed that 1005->1008 use device mapper in docker, not overlay2
[07:42:58] I am going to drain + reinit the kubelet on them
[07:43:02] err docker
[07:49:52] !log re-initialize docker on dse-k8s-worker100[5-8] - wrong storage type set (devicemapper instead of overlay2)
[07:49:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:54:27] !log re-initialize docker on dse-k8s-worker1004 - wrong storage type set (devicemapper instead of overlay2)
[07:54:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:56:59] all right, all overlay2 now
[07:57:01] weird
[07:59:56] kube api errors going down
[08:02:06] in logstash there is a mention of 1004 having troubles right when the 504s increased, Oct 6 ~ 14:%0
[08:02:09] 14:50
[08:02:33] but I don't see anything changed in the puppet logs / SAL / etc., so maybe it was just a weird config that eventually triggered an issue
[08:09:00] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:15:10] RECOVERY - Check unit status of refinery-drop-eventlogging-legacy-raw-partitions on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-eventlogging-legacy-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:27:56] PROBLEM - Check unit status of refinery-drop-eventlogging-legacy-raw-partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-eventlogging-legacy-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:25:12] elukey, thanks so much. I thought I had manually fixed all of the devicemapper hosts, but clearly not.
[09:26:26] !log delete calico pods in CrashLoop on dse (probably due to the incorrect docker settings)
[09:26:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:27:01] btullis: I remember having checked as well at the time, no idea what happened
[09:28:04] I was planning to do a https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_nodes#Reimage_the_node at some point, because I remember that going from insetup->role_dse_k8s_worker resulted in getting devicemapper first.
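For readers following along: the quickest way to spot this class of drift fleet-wide is to ask each worker which storage driver Docker is actually running with. A minimal sketch in Python, assuming passwordless SSH and sudo on the workers; the hostname suffix and range are illustrative guesses, not taken from the log:

```python
#!/usr/bin/env python3
"""Audit the Docker storage driver on a list of hosts over SSH.

Illustrative sketch only: assumes passwordless SSH, sudo rights on each
worker, and that the .eqiad.wmnet suffix is correct (an assumption).
"""
import subprocess

# Workers mentioned in the log; adjust the range/domain to taste.
HOSTS = [f"dse-k8s-worker{n}.eqiad.wmnet" for n in range(1001, 1009)]

def storage_driver(host: str) -> str:
    """Return the storage driver reported by `docker info` on host."""
    result = subprocess.run(
        ["ssh", host, "sudo", "docker", "info", "--format", "{{.Driver}}"],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout.strip() or f"error: {result.stderr.strip()}"

for host in HOSTS:
    driver = storage_driver(host)
    flag = "" if driver == "overlay2" else "  <-- needs re-init"
    print(f"{host}: {driver}{flag}")
```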
[10:15:47] 10Analytics-Clusters, 10Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (10BTullis)
[13:48:26] 10Data-Engineering: Requesting Kerberos identity for user sstefanova - https://phabricator.wikimedia.org/T320253 (10Slst2020)
[14:26:49] 10Data-Engineering, 10Equity-Landscape: Readership Output Rank Metrics - https://phabricator.wikimedia.org/T306617 (10KCVelaga_WMF) @JAnstee_WMF > The list of output countries from product data is inconsistent each year depending on metric hits, emerging spaces can sometimes disappear from year to year. It wi...
[14:27:19] 10Data-Engineering, 10Equity-Landscape: Readership input metrics - https://phabricator.wikimedia.org/T309273 (10KCVelaga_WMF) Noting that there are some large outliers in the data for yoy growth inputs ` country_code pageviews_yoy_growth unique_devices_yoy_growth BV 3832.75 857.333333 GS 34.604563 26.833...
[16:22:41] joal, aqu, xcollazo: I tried to spin up an airflow dev instance for the re-loading of unique devices into cassandra, but it cannot work on the stats machine, since it needs to be executed under the analytics user (no keytab there), and I don't want to execute it on an-launcher1002, since we know the dev instance can interfere with the production instance.
[16:23:12] so I created an MR in prod, adding 2 temporary DAGs for the re-loading (or backfilling)
[16:23:42] if you're OK, I will deploy it and try running them.
[16:23:58] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/172
[16:25:41] the temporary dags are exact copies of their production counterparts, except for: 1) the name, 2) the job config, which contains only the unique_devices properties, and 3) the start_date, which is the 1st of July.
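To make that pattern concrete, here is a hypothetical sketch of what such a temporary backfill DAG can look like. The dag_id, config keys, and placeholder operator are all invented for illustration (the real DAGs are in MR 172 above), but the three deltas match the ones listed: a new name, a config reduced to the unique_devices properties, and an earlier start_date.

```python
"""Hypothetical sketch of a temporary backfill DAG: same shape as the
production DAG, but with 1) a new dag_id, 2) a config reduced to the
unique_devices datasets, and 3) start_date moved back to 2022-07-01.
All names below are illustrative, not the real airflow-dags code."""
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Reduced job config: only the unique_devices properties (hypothetical keys).
UNIQUE_DEVICES_CONFIG = {
    "datasets": [
        "unique_devices_per_domain",
        "unique_devices_per_project_family",
    ],
}

with DAG(
    dag_id="cassandra_load_unique_devices_backfill_temp",  # 1) new name
    start_date=datetime(2022, 7, 1),                       # 3) backfill start
    schedule_interval="@daily",
    catchup=True,  # re-run every scheduled interval since start_date
    tags=["temporary", "backfill"],
) as dag:
    for dataset in UNIQUE_DEVICES_CONFIG["datasets"]:      # 2) reduced config
        BashOperator(
            task_id=f"load_{dataset}_to_cassandra",
            # Placeholder for the real loading job; {{ ds }} is the run date.
            bash_command=f"echo 'load {dataset} for {{{{ ds }}}}'",
        )
```

With catchup=True and a July start_date, deploying the DAG makes the scheduler backfill every daily run up to the present, which is exactly the re-loading behaviour described above.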
[16:40:07] RECOVERY - Check unit status of drop-features-actor-hourly on an-launcher1002 is OK: OK: Status of the systemd unit drop-features-actor-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:53:41] PROBLEM - Check unit status of drop-features-actor-hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-features-actor-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:11:43] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (10Ottomata) In [[ https://phabricator.wikimedia.org/T212482#8294070 | T212482#8294070 ]] @daniel wrote:...
[17:29:57] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python, 10Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (10Ottomata) It is more work, but not significantly. What needs changed is - https://github.com/wikimedia/puppet/blob/productio...
[17:30:48] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python, 10Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (10Ottomata) ^ would be part of {T302819}
[18:46:25] RECOVERY - Check unit status of drop-features-actor-hourly on an-launcher1002 is OK: OK: Status of the systemd unit drop-features-actor-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:48:19] RECOVERY - Check unit status of drop-anomaly-detection on an-launcher1002 is OK: OK: Status of the systemd unit drop-anomaly-detection https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:58:01] RECOVERY - Check unit status of drop-predictions-actor_label-hourly on an-launcher1002 is OK: OK: Status of the systemd unit drop-predictions-actor_label-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:03:55] RECOVERY - Check unit status of refinery-drop-banner-activity on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-banner-activity https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:19:27] RECOVERY - Check unit status of refinery-drop-pageview-actor-hourly-partitions on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-pageview-actor-hourly-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:22:51] 10Data-Engineering, 10API Platform (Product Roadmap), 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0: Pageviews Service - https://phabricator.wikimedia.org/T288296 (10VirginiaPoundstone) p:05Triage→03High
[19:23:08] 10Data-Engineering, 10API Platform (Product Roadmap), 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0: Editors service - https://phabricator.wikimedia.org/T288305 (10VirginiaPoundstone) p:05Triage→03Medium
[19:23:16] 10Data-Engineering, 10API Platform (Product Roadmap), 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0: Mediarequests Service - https://phabricator.wikimedia.org/T288303 (10VirginiaPoundstone) p:05Triage→03Medium
[20:35:40] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python, 10Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (10Antoine_Quhen) Yes, the first steps look clear. We also need to: * add jupyterhub-singleuser to conda-analytics. (Alternativ...
[20:46:29] RECOVERY - Check unit status of refinery-drop-webrequest-refined-partitions on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-webrequest-refined-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:04:03] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python: Add sql_tuple function to wmfdata-python - https://phabricator.wikimedia.org/T293706 (10nshahquinn-wmf) Since I've currently got my hands in Wmfdata and the code for this function is already written, I've put up a pull request: https://github.com/wik...
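The sql_tuple function referenced in T293706 formats a Python iterable as a SQL tuple literal, so a list of values can be dropped into an IN (...) clause. The actual implementation lives in the linked pull request; the version below is only a plausible reconstruction of the idea:

```python
def sql_tuple(items) -> str:
    """Format an iterable as a SQL tuple literal for IN (...) clauses.

    Illustrative sketch only; the real wmfdata-python implementation is
    in the pull request linked above. Strings are single-quoted with
    embedded quotes doubled; numbers are passed through unchanged.
    """
    items = list(items)
    if not items:
        raise ValueError("cannot format an empty iterable as a SQL tuple")
    parts = []
    for item in items:
        if isinstance(item, str):
            escaped = item.replace("'", "''")  # double embedded quotes
            parts.append(f"'{escaped}'")
        elif isinstance(item, (int, float)):
            parts.append(str(item))
        else:
            raise TypeError(f"unsupported type for SQL tuple: {type(item)!r}")
    return f"({', '.join(parts)})"

# Example: sql_tuple(["enwiki", "dewiki"]) -> "('enwiki', 'dewiki')"
```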