[01:44:02] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is CRITICAL: 5.617 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All
[03:50:54] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is CRITICAL: 5.667 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All
[04:19:22] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is CRITICAL: 5.617 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All
[06:44:46] good morning folks, these are for the Marseille DC, which has some networking issues (and isn't live yet)
[08:09:00] ack elukey - thanks for the update - we'll triple-check our webrequest alerts with that in mind :)
[08:40:53] joal: good morning
[08:41:23] Hi elukey :)
[08:48:11] Morning all. I'm back from my hols.
[08:48:28] Hi btullis - welcome back :)
[08:48:42] btullis: How was time off?
[08:48:55] I'm going to fix that hive JVM heap error that keeps firing on the test cluster.
[08:49:03] thank you btullis
[08:50:07] joal: Well, as with any holiday with young kids, absolutely exhausting. Engineered specifically to drive you to the edge of insanity.
[08:50:16] But still good fun.
[08:50:20] I can relate to that :)
[09:23:19] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:34:23] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:58:29] PROBLEM - cache_upload: Varnishkafka webrequest Delivery Errors per second -drmrs- on alert1001 is CRITICAL: 5.222 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=drmrs+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All
[11:16:35] lol ben, I can vaguely remember what vacations without kids were, freedom is a foggy and distant memory
[11:19:03] thanks aqu for taking care of those airflow jobs, I'll update docs and might poke you for some details
[13:43:01] aqu: thanks for looking into the airflow deadlock problem! I reviewed your fix and left a couple comments.
[13:53:42] o/
[13:54:00] ottomata: Welcome back!
[13:55:17] htank youUuUU
[13:55:36] * ottomata starts sharpening email machete
[13:55:42] mforns: https://github.com/fsspec/filesystem_spec/issues/874#issuecomment-1048578621
[14:04:55] heya ottomata :] welcome back! I didn't get if they are updating to use the new pyarrow interface, or if they are replacing pyarrow with fsspec?
[14:05:57] uh, my understanding is they are going to make fsspec use the new pyarrow interface by default, by removing support for the deprecated pyarrow interface
[14:06:09] ok, got it
[14:19:52] (CR) Ottomata: Metrics Platform event schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[14:25:45] (CR) Ottomata: "Couple of nits, but LGTM otherwise!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/765671 (https://phabricator.wikimedia.org/T299239) (owner: Sharvaniharan)
[14:35:59] Analytics, Data-Engineering, Event-Platform, Patch-For-Review, Readers-Web-Backlog (Kanbanana-FY-2021-22): WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (Ottomata) Looks right to me! We'll also need to merge https://gerrit.wikimedia.org/r/c/schemas/even...
[14:41:15] Analytics, Data-Engineering, Pageviews-API: Track page views by page ID rather than title (handles moved pages) - https://phabricator.wikimedia.org/T159046 (Isaac) @EChetty is there an explanation for why this task was declined? As far as I know, it's still not possible to query the pageviews APIs wi...
[14:50:25] (CR) Milimetric: [V: +2 C: +2] Add wikis to sqoop list [analytics/refinery] - https://gerrit.wikimedia.org/r/765578 (https://phabricator.wikimedia.org/T299548) (owner: Milimetric)
[14:50:39] deploying refinery so we're in time for the monthly sqoop
[14:59:45] mforns: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/765314 duplicates some of the queries and table create statements, and I don't see an explanation on the change, kind of curious why we're going that way
[15:00:16] is this part of the airflow migration? (like we will eventually remove the originals?)
[15:01:35] !log deploying new wikis to sqoop list ahead of sqoop job starting in a few hours
[15:01:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:16:19] milimetric: yes! we're just moving it in 3 steps: 1) duplicate the queries, 2) migrate the jobs, 3) remove the old queries.
[15:17:06] gotcha mforns. In that case, part of step 1 should probably be to update the old queries with something like "NOTE: these are in the process of being moved to Airflow, see {folder X} if you want to make any changes"
[15:17:27] OK, makes sense
[15:17:54] milimetric: do you want us to update those before you deploy?
[15:22:36] mforns: nono, I already started, it's not urgent. I just think it'll get really confusing over the next few months as we move stuff slowly and half the jobs are in one situation, the other half in another, and a problem happens
[15:26:55] milimetric: yes we'll do it
[16:00:40] !log refinery done deploying and syncing, new sqoop list is up
[16:00:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:13:52] Data-Engineering, Airflow: Low Risk Oozie Migration: APIs - https://phabricator.wikimedia.org/T300028 (Snwachukwu) a: Snwachukwu
[16:14:37] Data-Engineering, Airflow: Low Risk Oozie Migration: 4 wikidata metrics jobs - https://phabricator.wikimedia.org/T300021 (Snwachukwu) a: Snwachukwu
[16:24:26] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Define the Kubernetes Deployments for Datahub - https://phabricator.wikimedia.org/T301454 (BTullis) I'm significantly further forward I think, but I'm now at the point where I need guidance in terms of setting up: * ingres...
[16:29:09] aqu1: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/23#note_4107
[16:30:28] (PS43) AGueyte: Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415)
[16:34:20] (CR) AGueyte: Basic ipinfo instrument setup (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[16:35:42] Analytics, Data-Engineering, Event-Platform, Metrics-Platform, Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (EChetty)
[16:35:57] Analytics, Data-Engineering, Event-Platform, Metrics-Platform, Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (EChetty)
[16:36:36] Data-Engineering: Analytics Platform Future State Planning - https://phabricator.wikimedia.org/T302728 (EChetty)
[16:37:37] ottomata: Thanks for the commands!
[16:55:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[16:55:34] Data-Engineering, Data-Engineering-Kanban, Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (Antoine_Quhen) a: Ottomata→Antoine_Quhen
[17:05:05] aqu1: yeah sorry it is not more straightforward
[17:05:15] i bet we could make the run_dev_instance.sh stuff support all that
[17:05:31] but, it won't work if you try to change users after creating the dev instance
[17:10:36] org.apache.spark.api.python.PythonUtils.isEncryptionEnabled does not exist in the JVM
[17:10:37] D:
[17:15:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[18:26:41] milimetric: I have read the code of the webrequest-checker and I think I understand what troubles you about its results
[18:27:48] the results contain every checked host's results except for the ones having a min(sequence-id) = 0 - this therefore includes hosts having no sequence-id misalignment
[18:29:41] The results say "false_positive = TRUE" when there is no data loss - so the behavior is expected, even if it's cumbersome to read - we could change the check from false_positive to real_data_loss to make it more explicit (and change the value as well, obviously :)
[18:31:38] joal: makes sense, but couldn't we exclude the min(sequence-id) <> 0 results if there's nothing else wrong with them?
[18:31:52] we know based on records previous hour and records next hour?
[18:32:42] It's usually useful to have them in there, to compare in case there's some data loss - if it's preferred to remove them, we can do it :)
[18:41:59] Analytics, Data-Engineering, Event-Platform, Patch-For-Review, Readers-Web-Backlog (Kanbanana-FY-2021-22): WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (LGoto)
[18:59:24] (CR) Phuedx: [C: -1] Metrics Platform event schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[19:02:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[19:12:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[19:22:07] Data-Engineering, Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (Mayakp.wiki) >>! In T300164#7714313, @JAllemandou wrote: > I have started a job extracting daily pageviews...
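The per-host sequence-number check joal and milimetric discuss above can be sketched roughly as follows. This is a simplified illustration, not the actual webrequest-checker HQL: the function name, the tuple layout, and the `restarted` flag are invented for the example. The idea is that a gap between the span of sequence ids and the count of distinct ids indicates lost messages, while `min(sequence_id) == 0` usually means the varnishkafka instance restarted and its counter was reset.

```python
def host_loss_stats(sequence_ids):
    """Summarize one host's varnishkafka sequence ids for an hour of data.

    Hypothetical helper mirroring the webrequest-checker logic:
    - 'lost' is the shortfall between expected and observed ids;
    - 'restarted' flags a reset counter (excluded from the checker output);
    - 'false_positive' means the row shows no real data loss.
    """
    lo, hi = min(sequence_ids), max(sequence_ids)
    expected = hi - lo + 1           # ids we should have seen in this span
    actual = len(set(sequence_ids))  # ids we actually saw
    return {
        "expected": expected,
        "actual": actual,
        "lost": expected - actual,
        "restarted": lo == 0,
        "false_positive": expected == actual,
    }

# A host missing one id (13) in the middle of its span:
print(host_loss_stats([10, 11, 12, 14, 15]))
```

Renaming `false_positive` to something like `real_data_loss` (with the inverted value), as suggested above, would make rows like these easier to read at a glance.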
[19:26:50] Gone for tonight - see you tomorrow folks
[19:28:13] (CR) Ottomata: Metrics Platform event schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[21:01:40] Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (razzi) VMs have been created and are all running opensearch 1.2.4! Side note: puppet needed to be run twice for some reason. As I understand puppet should a...
[21:02:26] Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Set up opensearch cluster for datahub - https://phabricator.wikimedia.org/T301382 (razzi)
[22:14:35] (PS3) Sharvaniharan: Add a required field in mobile_apps fragment Added a new required field in a new fragment that will be used only by the android app. We will maintain any future android related fields here. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/765671 (https://phabricator.wikimedia.org/T299239)
[22:15:07] (CR) jerkins-bot: [V: -1] Add a required field in mobile_apps fragment Added a new required field in a new fragment that will be used only by the android app. We will maintain any future android related fields here. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/765671 (https://phabricator.wikimedia.org/T299239) (owner: Sharvaniharan)
[22:25:33] (CR) Sharvaniharan: "@Mikhail I have also renamed the folder to `wikipedia_android_app` from `android_app`, because of the conversation @Ottomata and I had off" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/765671 (https://phabricator.wikimedia.org/T299239) (owner: Sharvaniharan)
[22:31:20] (CR) Ottomata: Add a required field in mobile_apps fragment Added a new required field in a new fragment that will be used only by the android app. We will (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/765671 (https://phabricator.wikimedia.org/T299239) (owner: Sharvaniharan)
[22:34:10] (CR) Sharvaniharan: Add a required field in mobile_apps fragment Added a new required field in a new fragment that will be used only by the android app. We will (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/765671 (https://phabricator.wikimedia.org/T299239) (owner: Sharvaniharan)
[22:39:12] (PS2) Krinkle: build: Add brief "Getting started" guide [schemas/event/secondary] - https://gerrit.wikimedia.org/r/714874
[22:39:15] (PS5) Krinkle: build: Document simpler alternative contribution flow [schemas/event/secondary] - https://gerrit.wikimedia.org/r/714875 (https://phabricator.wikimedia.org/T290074)
[22:41:49] Analytics, Data-Engineering, Event-Platform, Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (Krinkle) >>! In T290074#7683286, @Ottomata wrote: > The change isn't hard, but the docum...
[22:55:13] (CR) Ottomata: build: Document simpler alternative contribution flow (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/714875 (https://phabricator.wikimedia.org/T290074) (owner: Krinkle)
[22:57:09] Analytics, Data-Engineering, Event-Platform, Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (Ottomata) Ah right! And, I guess this won't remove any previously added .git hooks in p...
[23:23:56] Analytics, Metrics-Platform, Performance-Team, Product-Analytics, MW-1.35-notes (1.35.0-wmf.21; 2020-02-25): Switch mw.user.sessionId back to session-cookie persistence - https://phabricator.wikimedia.org/T223931 (nshahquinn-wmf) It would be helpful to close this task and open a separate one...