[01:48:26] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:22:18] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:35:08] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:09:10] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:54:30] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:26:09] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:30:30] Data-Engineering: Check home/HDFS leftovers of dpifke - https://phabricator.wikimedia.org/T315841 (MoritzMuehlenhoff)
[08:49:09] Data-Engineering: RAID battery alert in an-worker1090 - https://phabricator.wikimedia.org/T315850 (BTullis)
[08:50:32] ACKNOWLEDGEMENT - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Acknowledged: T315850 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:52:39] btullis: https://phabricator.wikimedia.org/T315748
[08:53:36] RhinosF1: Ah, sorry. I didn't spot that. I will merge the two tickets. :-)
[08:54:28] Data-Engineering-Operations: an-worker1090 MegaRaid issues - https://phabricator.wikimedia.org/T315748 (BTullis)
[08:55:58] btullis: np
[09:03:14] btullis: i'm 99% sure that every single host in that batch will go at some point
[09:03:21] unless they get decom'd first
[09:12:45] RhinosF1: Yes, I will speak to DC Ops about options. We have three concurrent failures at the moment, but they're all out of warranty and they're not yet up for refresh. The actual impact of failure is pretty low, apart from the alert noise, so it's difficult to prioritize this against the bigger ticket items.
[09:13:45] btullis: yeah, i'm quite surprised tbh that they've all failed so consistently though. I don't remember any others being so annoying.
[09:13:59] I can try to do an assessment of how many are likely to fail in the next few months, out of the whole set of 86 (or so) hadoop workers.
[09:18:32] btullis: 18 in that batch
[09:18:41] https://phabricator.wikimedia.org/T207192
[09:18:46] so probably another 13/14 servers
[09:23:46] Thanks. I was thinking of running a megacli command through cumin across the hadoop worker fleet as well, to check for low charge percentages.
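
The WriteBack/WriteThrough flapping above is the typical signature of a failing or relearning RAID controller battery: whenever the BBU cannot guarantee the cache contents, the controller falls back to WriteThrough. A minimal sketch of the kind of check btullis describes, on one host and then fleet-wide; the megacli binary path, the `A:hadoop-worker` cumin alias, and the grep pattern are assumptions, and exact output fields vary by controller:

    # On a single host: current write-cache policy of every logical drive,
    # then battery status. A low "Relative State of Charge", a bad
    # "Battery State" or an active learn cycle pushes the controller
    # back to WriteThrough.
    sudo megacli -LDGetProp -Cache -LAll -a0
    sudo megacli -AdpBbuCmd -GetBbuStatus -a0

    # Fleet-wide from a cluster-management host via cumin:
    sudo cumin 'A:hadoop-worker' \
      'megacli -AdpBbuCmd -GetBbuStatus -aALL | grep -E "Battery State|Relative State of Charge"'
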
[09:23:53] Data-Engineering: Drop MediaViewer and MultimediaViewer* tables - https://phabricator.wikimedia.org/T311229 (phuedx)
[09:26:24] Data-Engineering: Drop event.flowreplies table - https://phabricator.wikimedia.org/T315857 (phuedx)
[09:29:17] Data-Engineering: Drop event.flowreplies table - https://phabricator.wikimedia.org/T315857 (phuedx)
[09:31:30] Hi! Could someone un-WIP https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/821711 ? I think you need to be a repo owner to do it.
[09:33:28] Data-Engineering: Drop event.flowreplies table - https://phabricator.wikimedia.org/T315857 (phuedx)
[09:33:34] Analytics-Kanban, Data-Engineering, Event-Platform Value Stream, Fundraising-Backlog, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (phuedx)
[09:43:22] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:50:24] btullis: cool
[09:52:18] (CR) Kosta Harlan: [C: +2] Update to reflect multiple possible rejection reasons [schemas/event/secondary] - https://gerrit.wikimedia.org/r/821801 (https://phabricator.wikimedia.org/T314899) (owner: Nettrom)
[09:53:20] (CR) CI reject: [V: -1] Update to reflect multiple possible rejection reasons [schemas/event/secondary] - https://gerrit.wikimedia.org/r/821801 (https://phabricator.wikimedia.org/T314899) (owner: Nettrom)
[10:15:07] Data-Engineering, Event-Platform Value Stream (Sprint 00), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Protsack.stephan) Only question I would ask is //rendered content// is different enough so we need to p...
[10:17:54] (PS2) Kosta Harlan: Update to reflect multiple possible rejection reasons [schemas/event/secondary] - https://gerrit.wikimedia.org/r/821801 (https://phabricator.wikimedia.org/T314899) (owner: Nettrom)
[10:18:27] (CR) Kosta Harlan: "I bumped the version for this and generated the updated files." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/821801 (https://phabricator.wikimedia.org/T314899) (owner: Nettrom)
[10:30:37] Data-Engineering, Data-Engineering-Operations, SRE, SRE-Access-Requests: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (JayCano) Hi @cmooney. I can confirm that Tšepo requires this level of access for some work that we are going to do. Thank you.
[11:38:25] Data-Engineering, Data-Engineering-Operations, SRE, SRE-Access-Requests: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (Ladsgroup) a: cmooney → Ladsgroup Taking over as I'm on clinic duty this week. This also needs approval from @Ottomata or @odimitrijevi...
[11:45:20] Data-Engineering, Data-Engineering-Operations, SRE, SRE-Access-Requests: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (Ladsgroup)
[11:49:37] (CR) Vivian Rook: [C: +2] mypy: configure via tox.ini and make stricter [analytics/quarry/web] - https://gerrit.wikimedia.org/r/824438 (owner: RhinosF1)
[11:54:05] (Merged) jenkins-bot: mypy: configure via tox.ini and make stricter [analytics/quarry/web] - https://gerrit.wikimedia.org/r/824438 (owner: RhinosF1)
[12:59:27] Data-Engineering, Event-Platform Value Stream: Remove materialized .json files from event schema repositories - https://phabricator.wikimedia.org/T315674 (Ottomata) > Would this change impact existing pipelines or code? Not that I am aware of! :)
[12:59:57] Data-Engineering: Anomaly detection alarms for the edit event stream - https://phabricator.wikimedia.org/T250845 (mforns) a: mforns → None
[13:00:47] hello joal! :] would you have time later today to look at the queries? Maybe in 30 mins?
[13:15:28] Hey, we are trying to run a SQL query in superset to get all geoline requests for one day, but the query never returns a result - the timer just stops at some point. The same query over one hour of data is fine. Do you have any hints or tricks we could use?
[13:16:42] Hi mforns - I'll have time yes - meeting in 15 minutes should be fine :)
[13:16:46] lilients: Could you share the query that you're using, please?
[13:18:11] Sure: https://phabricator.wikimedia.org/P32739
[13:18:30] Ah! thanks btullis - you've been faster at asking :)
[13:19:20] Hi lilients - The problem you're seeing is 'kinda' expected
[13:19:46] Analytics-Wikistats, Data-Engineering: Country pageview breakdown by language - https://phabricator.wikimedia.org/T250001 (Aklapper) a: ashgrigas → None Removing inactive assignee as their email address bounces
[13:19:58] lilients: You're querying one of our biggest datasets (wmf.webrequest), and superset is not (yet) sized to query more than an hour of it
[13:20:32] Ah, okay, thanks, that is helpful!
[13:20:41] Data-Engineering, Data-Engineering-Operations, SRE, SRE-Access-Requests: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (Ottomata) > running monthly data dump script for similarusers It isn't clear that analytics-privatedata-users is the right group for this....
[13:20:56] Would there be a better place to run this?
[13:22:21] lilients: just to give you an idea of the sizes: one hour of webrequest upload is ~20Gb, and a day ~450Gb
[13:24:48] lilients: You could get those results using Spark - are you familiar with using our hadoop cluster?
[13:29:22] Okay, thanks! We will look into that.
[13:32:57] lilients: There are some links on wikitech - IMO the best option would be for you to use jupyter and pyspark: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter
[13:33:54] 👍 joal, I'm in da cave :]
[13:33:59] mforns: joining!
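
For reference, a rough sketch of how a day-long webrequest query like the one discussed above might run via pyspark from a Jupyter notebook, as joal suggests. The date and the geoline filter are illustrative placeholders (the real query is in P32739 and is not reproduced here), and `spark` is assumed to be the session object provided by the PySpark kernel:

    # Python, in a PySpark-enabled Jupyter notebook where `spark` exists.
    # wmf.webrequest is partitioned by webrequest_source/year/month/day/hour;
    # constraining those partitions is what keeps a full-day scan tractable.
    geoline_day = spark.sql("""
        SELECT uri_host, uri_path, COUNT(*) AS requests
        FROM wmf.webrequest
        WHERE webrequest_source = 'upload'           -- or 'text', per the real query
          AND year = 2022 AND month = 8 AND day = 22 -- illustrative date
          AND uri_path LIKE '%geoline%'              -- illustrative filter
        GROUP BY uri_host, uri_path
    """)
    geoline_day.show(50, truncate=False)
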
[13:35:40] Data-Engineering-Operations, Data Engineering Planning, Infrastructure-Foundations, Mail, SRE: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (Ottomata)
[13:39:41] Analytics-Radar, SRE, Traffic-Icebox, Privacy: Add request_id to webrequest logs as well as other event records ingested into Hadoop - https://phabricator.wikimedia.org/T113817 (Ottomata) a: Ottomata → None
[13:40:20] Analytics-Jupyter, Data-Engineering, Infrastructure-Foundations, CAS-SSO, User-MoritzMuehlenhoff: Allow login to JupyterHub via CAS - https://phabricator.wikimedia.org/T260386 (Ottomata) a: Ottomata → None
[14:22:51] Data-Engineering, Equity-Landscape: Load language data - https://phabricator.wikimedia.org/T315886 (ntsako)
[14:23:14] PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:23:40] ^^ I will look into this now.
[14:24:04] PROBLEM - Check systemd state on an-worker1123 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:08] PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:20] PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:50] PROBLEM - Check systemd state on an-worker1110 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:51] It looks like yarn received a lot of out-of-order chunks and then `Halting due to Out Of Memory Error...`
[14:27:54] PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:28:20] PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:28:50] Maybe this was a really big job that has taken down several yarn workers.
[14:29:28] PROBLEM - Hadoop NodeManager on an-worker1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:30:24] btullis: wow - we're running big jobs with mforns
[14:30:40] btullis: we're gonna stop for a minute and see if it changes something
[14:31:16] OK, cool. I'm not sure that the yarn workers will restart by themselves until a puppet run, perhaps. Should I restart them on the affected nodes above?
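
Had the NodeManagers needed a manual kick, a restart across the affected hosts could be batched with cumin. A sketch only, assuming FQDN globbing works as on WMF cluster-management hosts; the host list is taken from the alerts above:

    sudo cumin 'an-worker[1110,1123,1126].eqiad.wmnet,analytics1060.eqiad.wmnet' \
      'systemctl restart hadoop-yarn-nodemanager.service'

A successful restart also clears the unit's failed flag, so the "Check systemd state" alert recovers along with the NodeManager process check.
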
[14:31:40] btullis: there is no rush, it can wait a few minutes
[14:31:44] thanks a lot btullis
[14:32:36] PROBLEM - Check systemd state on an-worker1139 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:33:04] PROBLEM - Hadoop NodeManager on an-worker1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:33:53] https://usercontent.irccloud-cdn.com/file/waG1HErK/image.png
[14:34:10] RECOVERY - Check systemd state on analytics1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:34:56] RECOVERY - Hadoop NodeManager on analytics1060 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:35:07] OK, so it'll be interesting to see how long these take to come back on their own. They should all be within 30 minutes, but probably randomly spaced through that period.
[14:35:12] I think.
[14:35:46] RECOVERY - Check systemd state on an-worker1123 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:16] RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:45:52] RECOVERY - Hadoop NodeManager on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:46:04] RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:34] RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:47:04] RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:51:22] RECOVERY - Check systemd state on an-worker1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:51:52] RECOVERY - Hadoop NodeManager on an-worker1139 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:52:45] joal: jfyi, we were able to run the query in jupyter, thanks for the tip!
[14:52:52] \o/
[14:52:59] thanks for letting me know awight :)
[16:02:53] Data-Engineering, Event-Platform Value Stream (Sprint 00), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) Thanks @Protsack.stephan. IIUC then, your preference is for the separate `rendered_content_s...
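
A side note on the recoveries above: the "within 30 minutes, randomly spaced" expectation matches a half-hourly puppet agent schedule with random splay, since puppet ensures the NodeManager service is running. To shortcut the wait on a single host, one could trigger the run by hand (a sketch; `run-puppet-agent` is the wrapper script commonly found on WMF production hosts):

    sudo run-puppet-agent                              # force a puppet run now
    systemctl status hadoop-yarn-nodemanager.service   # confirm the unit is back up
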
[16:21:29] Data-Engineering, Event-Platform Value Stream (Sprint 00), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Milimetric) My opinions on this topic stem from my experience with Graphoid and the problems it ran int...
[16:25:42] !log btullis@an-airflow1004:~$ sudo systemctl reset-failed ifup@ens13.service
[16:25:43] RECOVERY - Check systemd state on an-airflow1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:25:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:47:22] Data-Engineering, Event-Platform Value Stream (Sprint 00), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) > This would be filled in asynchronously AKA wikifunctions AKA [[ https://docs.google.com/do...
[18:07:08] Analytics, Data-Engineering, Event-Platform Value Stream: Discussion of Event Driven Systems - https://phabricator.wikimedia.org/T290203 (Ottomata) These 2 articles provide nice summaries of some event driven patterns: - https://medium.com/wix-engineering/6-event-driven-architecture-patterns-part-1-9...
[18:15:08] Data-Engineering, Event-Platform Value Stream: Remove materialized .json files from event schema repositories - https://phabricator.wikimedia.org/T315674 (Tgr) YAML is a bit of a trainwreck of a format, with significant differences between its various revisions, significant differences between different...
[18:28:04] Data-Engineering, Data Pipelines: airflow instances should use specific artifact cache directories - https://phabricator.wikimedia.org/T315374 (xcollazo) Open → In progress
[18:28:33] Data-Engineering, Data Pipelines: airflow instances should use specific artifact cache directories - https://phabricator.wikimedia.org/T315374 (xcollazo) Working on this one as part of T315633.
[18:29:04] (CR) Milimetric: [C: +2] Added ArrayAvgUDF [analytics/refinery/source] - https://gerrit.wikimedia.org/r/824683 (owner: Nmaphophe)
[18:29:10] Data-Engineering, Data Pipelines: airflow instances should use specific artifact cache directories - https://phabricator.wikimedia.org/T315374 (xcollazo)
[18:29:24] Data-Engineering, Data Pipelines: airflow instances should use specific artifact cache directories - https://phabricator.wikimedia.org/T315374 (xcollazo) a: xcollazo
[18:29:55] Data-Engineering, Data Pipelines: airflow instances should use specific artifact cache directories - https://phabricator.wikimedia.org/T315374 (xcollazo) MR available for review: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/118
[18:48:16] Hey mforns - any news on our uniques issue?
[18:48:34] heya joal! well, I'm trying new stuff
[18:48:38] wanna pair?
[18:48:45] mforns: I'm about to disconnect for tonight, and wanted to check with you before I go
[18:48:53] mforns: show me :)
[18:48:59] ok
[20:17:03] Data-Engineering, Data Pipelines: airflow instances should use specific artifact cache directories - https://phabricator.wikimedia.org/T315374 (xcollazo) `scap deploy` successfully to `platform_eng`, `analytics_test` and `analytics` instances. The `research` instance is tracking a different branch than...
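
Re the an-airflow1004 fix at 16:25 above: `systemctl reset-failed` clears a unit's remembered "failed" state, which is what the "Check systemd state" alert keys on even when the failed unit no longer matters (here a one-shot ifup@ens13.service). The usual sequence on any host, using only standard systemd commands:

    systemctl --failed                               # list units keeping the state "degraded"
    sudo systemctl reset-failed ifup@ens13.service   # forget that unit's failure
    systemctl is-system-running                      # should now report "running"
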
[20:17:20] Data-Engineering, Data Pipelines: airflow instances should use specific artifact cache directories - https://phabricator.wikimedia.org/T315374 (xcollazo) In progress → Resolved
[20:52:51] Analytics-Radar, Machine-Learning-Team, SRE: Running docker containers in a non-production environment - https://phabricator.wikimedia.org/T275551 (fkaelin) Reviving this discussion, though I renamed the phab to "Running docker containers in a non-production environment", as the issue boils down to t...
[20:53:16] Analytics-Radar, Machine-Learning-Team, SRE: Running docker containers in a non-production environment - https://phabricator.wikimedia.org/T275551 (fkaelin)
[21:01:14] (PS9) NOkafor: Updated usage for files Cassandra Loading HQL files [Draft] Bug: T311507 [analytics/refinery] - https://gerrit.wikimedia.org/r/812095 (https://phabricator.wikimedia.org/T311507)
[21:29:32] (CR) Gergő Tisza: [C: +2] Update to reflect multiple possible rejection reasons [schemas/event/secondary] - https://gerrit.wikimedia.org/r/821801 (https://phabricator.wikimedia.org/T314899) (owner: Nettrom)
[21:30:18] (Merged) jenkins-bot: Update to reflect multiple possible rejection reasons [schemas/event/secondary] - https://gerrit.wikimedia.org/r/821801 (https://phabricator.wikimedia.org/T314899) (owner: Nettrom)