[00:03:05] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:35:57] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:20:03] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:54:05] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:21:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:24:47] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:10:07] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:56:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[05:06:55] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:52:19] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:26:21] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:37:35] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:11:25] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:22:47] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:26:02] Good morning aqu - I created https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/155 for your review please :)
[07:28:28] Hello joal , thx
[07:28:56] np aqu - we'll still have to fix the 'empty parameter' issue
[07:56:50] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:08:11] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:14:01] joal currently working on your branch to fix unit tests + lints
[08:14:20] aqu: ok thanks - I didn't check - please excuse me
[08:14:38] aqu: sending a PR for the apis-graphite issue in minutes
[08:19:48] hm actually aqu, those linting errors are not related to my code - weird
[08:20:30] Yes, it's on the main branch.
[08:22:36] hm
[08:23:28] have there been a change on linter version? this is weird that it starts failing now
[08:24:30] I don't think so, but the concerned files were recently changed.
[08:24:35] Ah
[08:24:54] and merged without waiting for checks I guess :(
[08:26:02] I'll tell Marcel and Sandra to not push to master directly
[08:26:47] This is not good practice
[08:26:59] We should even configure gitlab to prevent that
[08:27:08] Data-Engineering-Operations: RAID battery alert in an-worker1085 - https://phabricator.wikimedia.org/T318659 (BTullis)
[08:32:20] > We should even configure gitlab to prevent that
[08:32:20] Morning. Yes, I agree. At the moment it looks like 'Maintainers' and above can push to the main branch, but not force push. https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/settings/repository
[08:32:29] Maybe we should change it to 'Owners'
[08:32:44] hm actually I think it should be prevented from now on
[08:32:54] any change to the main branch should go through PR
[08:33:44] joal: OK, so apply this change, right?
[08:33:47] https://usercontent.irccloud-cdn.com/file/CYyGt1DG/image.png
[08:33:48] Cause for instance almost all of analytics is "owner" of the project
[08:34:30] +1 btullis
[08:34:32] thanks for that
[08:35:04] Done.
[08:36:38] joal: Thanks also for the update about the Cassandra 2 loading. I'll find the ticket about decommissioning aqs100[4-9] and put it on the board. I'll write out the decom plan.
[08:36:46] aqu: I have issues running unit-test on my machine
[08:36:58] Awesome btullis :) thank you for that!
[08:38:06] aqu: I have a lot (29) tests failing, all due to 'No module named ...' with different modules (analytics, wmf_airflow_common.sensors ...)
[08:38:10] aqu: any idea?
[08:39:36] joal: `export PYTHONPATH=.` ?
[08:39:49] AHHHHHH! thank you - I forgot about that!
[08:40:14] * joal is going to update the docs right now
[08:40:26] I'm using `autoenv` to forget about those kind of things
[08:41:00] I could commit a `.env.example`
[08:41:12] Could be nice
[08:44:04] also aqu, I think it would have been nicer to make a different PR for the artifact addition instead of adding it as a patch to something different
[08:44:44] aqu: nit details, but scope-creep-PR are not great :)
[08:45:37] aqu: nonetheless, thank you a lot for the fix!
[08:45:41] Actually I was thinking to cherry-pick to main directly before merging your branch. But then you both talked about not pushing to main directly... wrong timing :)
[08:46:08] ehehe :)
[08:46:33] PRs, even for small stuff, and as small patchsets as possible :)
[08:47:11] * joal is becoming a git extremist
[08:51:15] aqu: still a nit on types
[08:51:27] I also put myself in the git hygiene extremist camp. Sorry :-0
[08:51:30] I also put myself in the git hygiene extremist camp. Sorry :-)
[09:24:07] RECOVERY - MegaRAID on an-worker1146 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:17:43] Data-Engineering: Stop ingesting data to the old AQS cluster - https://phabricator.wikimedia.org/T302276 (BTullis) With the completion of {T306962} this step is now complete. This unblocks the decommissioning of aqs100[4-9].
[10:18:00] Data-Engineering: Stop ingesting data to the old AQS cluster - https://phabricator.wikimedia.org/T302276 (BTullis) Open→Resolved
[10:18:02] Data-Engineering, Cassandra, Epic, Platform Team Workboards (Platform Engineering Reliability): Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (BTullis)
[10:53:37] joal: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/157
[10:53:59] reading!
[10:56:46] aqu: one nit about a comment, then I'll merge
[11:11:14] Done
[11:43:14] Merging aqu - sorry I missed the ping
[11:43:21] actually - merged
[13:01:12] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (BPirkle) @FGoodwin , for the final test, "character encoding in article titles is happily handled (should handle per article queries wit...
[13:01:16] Data-Engineering: Failed to find any Kerberos tgt - https://phabricator.wikimedia.org/T318063 (bmansurov)
[13:21:22] (CR) Mforns: [C: +1] "LGTM! Neil, please feel free to read the session_days field! See inline comment." [analytics/refinery] - https://gerrit.wikimedia.org/r/832383 (https://phabricator.wikimedia.org/T312262) (owner: Neil P. Quinn-WMF)
[13:25:51] Data-Engineering: Failed to find any Kerberos tgt - https://phabricator.wikimedia.org/T318063 (Ottomata) https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide#Run_a_command_as_a_system_user?
[13:30:02] (PS1) Joal: Fix mediawiki-history-denormalize for spark 3 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/835618 (https://phabricator.wikimedia.org/T318589)
[13:31:04] mforns: if you're happy, I think all three reviews I have created this morning are ready to be merged
[13:31:34] joal: I'm going over a looooot of alerts, can you refresh my memory? :]
[13:33:53] mforns: those 3 PRs are about fixing all alerts
[13:34:04] oh
[13:35:30] Analytics, API Platform, Code-Health-Objective: Synchronize .gitignore files - https://phabricator.wikimedia.org/T315113 (VirginiaPoundstone) a:codebug
[13:40:51] joal, merged 2 of them, for the other one left a minor comment
[13:40:58] ack - reviewing
[13:46:43] update sent mforns - thanks for the reviews :)
[13:46:49] lookin
[14:05:27] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: Create k8s deployment of AQS 2.0 - https://phabricator.wikimedia.org/T288661 (VirginiaPoundstone)
[14:08:01] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: Obtain a security review of AQS 2.0 - https://phabricator.wikimedia.org/T288663 (VirginiaPoundstone)
[14:08:35] Analytics, API Platform (Product Roadmap), Code-Health-Objective, Epic, and 3 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (BPirkle)
[14:08:49] Data-Engineering, API Platform, Code-Health-Objective, Platform Engineering Roadmap, User-Eevans: Dashboards for AQS 2.0 - https://phabricator.wikimedia.org/T288667 (VirginiaPoundstone)
[14:09:01] Data-Engineering, API Platform, Code-Health-Objective, Epic, and 3 others: Problem details for HTTP APIs (rfc7807) - https://phabricator.wikimedia.org/T302536 (BPirkle) Open→Resolved a:BPirkle Done
[14:23:03] !log deployed Airflow for 3 fixes
[14:23:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:23:11] Data-Engineering-Kanban, Data Engineering Planning, Data Pipelines: Optimization of conda-analytics deb package - https://phabricator.wikimedia.org/T318397 (EChetty)
[14:23:35] !log rolled back Airflow
[14:23:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:24:26] Analytics, Data Engineering Planning, Event-Platform Value Stream, Patch-For-Review, User-Elukey: Port architecture of irc-recentchanges to Kafka - https://phabricator.wikimedia.org/T234234 (EChetty)
[14:24:55] Data-Engineering-Operations, Data Engineering Planning, Mail, SRE: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (EChetty) p:Medium→High
[14:32:55] hey mforns - what happened?
[14:33:17] joal: There's a wrong artifact coordinates
[14:33:20] I think
[14:33:24] I'm looking into it
[14:34:50] thanks a lot
[14:38:04] joal: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/158
[14:40:06] mforns: merging!
[14:40:13] thankssss!!
[14:56:31] !log deployed Airflow (fixed)
[14:56:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:59:43] !log re-ran apis_metrics_to_graphite_hourly
[14:59:43] failed tasks
[14:59:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:03:00] !log re-ran cassandra_daily_load failed airflow tasks
[15:03:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:05:36] !log re-ran wikidata_metrics_to_graphite_daily failed airflow tasks
[15:05:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:06:28] thx mforns
[15:30:36] Data-Engineering, Event-Platform Value Stream, Metrics-Platform, Patch-For-Review: Incorporate librarized Metrics Platform PHP client into EventLogging - https://phabricator.wikimedia.org/T281762 (EChetty)
[15:32:02] Data-Engineering, Metrics-Platform, Patch-For-Review: Incorporate librarized Metrics Platform PHP client into EventLogging - https://phabricator.wikimedia.org/T281762 (EChetty)
[16:25:44] (CR) Xcollazo: Fix mediawiki-history-denormalize for spark 3 (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/835618 (https://phabricator.wikimedia.org/T318589) (owner: Joal)
[16:30:33] (CR) Joal: Fix mediawiki-history-denormalize for spark 3 (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/835618 (https://phabricator.wikimedia.org/T318589) (owner: Joal)
[16:30:46] xcollazo: I added an answer for you there --^ :)
[17:14:12] Data-Engineering, Product-Analytics (Kanban): Superset Date Filter fix needed - https://phabricator.wikimedia.org/T318299 (mpopov)
[17:14:31] Data-Engineering, Product-Analytics (Kanban): Superset Date Filter fix needed - https://phabricator.wikimedia.org/T318299 (mpopov) a:Mayakp.wiki
[17:23:45] Data-Engineering-Radar, Gerrit-Privilege-Requests, Release-Engineering-Team (Blocking 🧱): Requesting membership of the analytics group in gerrit for 'snwachukwu' and 'nokafor' - https://phabricator.wikimedia.org/T314592 (xcollazo) Do we need anything else to move this forward @thcipriani ? This is b...
[17:24:53] Data-Engineering-Radar, Gerrit-Privilege-Requests, Release-Engineering-Team (Blocking 🧱): Requesting membership of the analytics group in gerrit for 'snwachukwu', 'nokafor', and 'xcollazo' - https://phabricator.wikimedia.org/T314592 (xcollazo)
[18:10:37] Data-Engineering, Product-Analytics (Kanban): Superset Date Filter fix needed - https://phabricator.wikimedia.org/T318299 (Mayakp.wiki)
[18:18:59] Quarry: Comment on phabricator task on github update - https://phabricator.wikimedia.org/T318741 (rook)
[18:19:11] Quarry: Comment on phabricator task on github update - https://phabricator.wikimedia.org/T318741 (rook) Open→In progress a:rook
[18:28:33] Quarry: Comment on phabricator task on github update - https://phabricator.wikimedia.org/T318741 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/quarry/pull/12
[18:30:57] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Editors service - https://phabricator.wikimedia.org/T288305 (BPirkle)
[18:32:09] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Editors service - https://phabricator.wikimedia.org/T288305 (BPirkle)
[18:32:33] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Editors service - https://phabricator.wikimedia.org/T288305 (FGoodwin)
[18:32:35] Data-Engineering, Product-Analytics (Kanban): Superset Date Filter fix needed - https://phabricator.wikimedia.org/T318299 (Mayakp.wiki)
[18:34:14] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Editors service - https://phabricator.wikimedia.org/T288305 (FGoodwin)
[18:41:24] Quarry: Comment on phabricator task on github update - https://phabricator.wikimedia.org/T318741 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/12
[18:41:48] Quarry: Comment on phabricator task on github update - https://phabricator.wikimedia.org/T318741 (rook) In progress→Resolved
[19:26:43] Data-Engineering-Radar, Gerrit-Privilege-Requests, Release-Engineering-Team (Blocking 🧱): Requesting membership of the analytics group in gerrit for 'snwachukwu', 'nokafor', and 'xcollazo' - https://phabricator.wikimedia.org/T314592 (thcipriani) Open→Resolved a:thcipriani @xcollazo sorry...
[19:46:37] Data-Engineering-Radar, Gerrit-Privilege-Requests, Release-Engineering-Team (Blocking 🧱): Requesting membership of the analytics group in gerrit for 'snwachukwu', 'nokafor', and 'xcollazo' - https://phabricator.wikimedia.org/T314592 (xcollazo) >This should be done now. Confirmed I see I can now do +2...
[20:28:51] (CR) Xcollazo: "Went ahead with a full review, mostly because I want to pickup your changes. 😊" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/835618 (https://phabricator.wikimedia.org/T318589) (owner: Joal)
[22:50:14] Data-Engineering, Research, Epic: Add more languages to Wikipedia Clickstream - https://phabricator.wikimedia.org/T289532 (Isaac) Adding some details about the impact of extending the language list to also include all of the languages listed under **Desired state** in the task description (languages...
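Note on the 08:38-08:41 exchange: the 'No module named ...' test failures went away once the repository root was put on Python's import path with `export PYTHONPATH=.`. A minimal sketch of why that works, done from inside Python instead of the shell. The temporary directory and the module name `demo_pkg_for_path_check` are hypothetical stand-ins for illustration; neither comes from the airflow-dags repository.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def can_import(module_name: str) -> bool:
    """Return True if module_name is importable with the current sys.path."""
    importlib.invalidate_caches()
    sys.modules.pop(module_name, None)  # forget any earlier import attempt
    try:
        importlib.import_module(module_name)
        return True
    except ModuleNotFoundError:
        return False


# Hypothetical stand-in for the repository root, holding a single module.
repo_root = Path(tempfile.mkdtemp())
(repo_root / "demo_pkg_for_path_check.py").write_text("VALUE = 1\n")

# The root is not on the import path yet, so the import attempt fails
# with the same kind of "No module named ..." error seen in the tests.
before = can_import("demo_pkg_for_path_check")

# Same effect as launching the tests with PYTHONPATH pointing at the root.
sys.path.insert(0, str(repo_root))
after = can_import("demo_pkg_for_path_check")

print(before, after)
```

This is only an illustration of the import-path mechanism; the fix actually used in the channel was the shell export, with a committed `.env.example` for `autoenv` floated as a way to avoid forgetting it.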