[00:18:30] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:41:18] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:12:40] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:24:06] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:58:24] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:55:02] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:51:33] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:14:21] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:48:35] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:40:32] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:23:26] (CR) Lucas Werkmeister (WMDE): [C: +1] Add metric_id column to Wikidata EntitySchema text HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/817837 (owner: Michael Große)
[10:35:27] (PS1) Aqu: Update ua-parser library [analytics/refinery/source] - https://gerrit.wikimedia.org/r/818083 (https://phabricator.wikimedia.org/T306829)
[10:44:19] Data-Engineering-Kanban, Event-Platform, Metrics-Platform, Wikidata, and 5 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (EChetty)
[10:48:18] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:45:24] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:07:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:10:44] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:16:56] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:19:36] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:22:08] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:42:28] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:30:29] hi team! are both Andrew and Olja on vacation? we (SRE) got some access requests pending on their approval
[13:39:34] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:13:01] Data-Engineering-Kanban, Data Engineering Planning (Sprint 02), Data Pipelines (Sprint 00): Create cassandra loading HQL files from their oozie definition - https://phabricator.wikimedia.org/T311507 (JArguello-WMF)
[14:13:03] Data-Engineering-Kanban, Data Engineering Planning (Sprint 02), Data Pipelines (Sprint 00): Migrate the projectview jobs - https://phabricator.wikimedia.org/T305844 (JArguello-WMF)
[14:13:25] Data-Engineering, SRE, Data Pipelines (Sprint 00), Patch-For-Review: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (JArguello-WMF)
[14:20:57] vgutierrez: olja_ should be around (see above about pending access requests, olja_)
[14:21:57] AndyRussG: sorry our Jupyter folks are out on vacation, I'm not sure
[14:22:17] milimetric: heyyyy no worries! not urgent heheh thx
[14:22:20] also hiiiiiii!
[14:22:22] mforns: I thought aqs1004 was still part of the cluster, no? https://github.com/wikimedia/puppet/blob/ddb45a3075c4d578cedd19c84d04694020b7c57d/modules/profile/manifests/aqs.pp#L67
[14:22:36] hi :)
[14:22:47] heya milimetric :] welcome back
[14:22:53] they're not away visiting Jupyter?
[14:23:02] oooh, that'd be so cool and also confusing
[14:23:06] is it not part of the old cluster? the cassandra1 cluster?
[14:23:12] note to self: change names when we go inter-planetary
[14:23:43] mforns: is that whole cluster useless at this point? Like are we 100% serving from the new machines?
[14:24:35] I think we're serving from the new machines, but we're still loading data in the old cluster for some reason
[14:25:01] I think we were about to shut down those machines, but maybe waiting for migration to spark3?
[14:25:04] like for some kind of fallback probably
[14:25:18] yes
[14:25:37] hm, wonder if jo had any risks in mind, should've asked him before his break
[14:25:47] yea, we should
[14:26:03] maybe we can find some context in a phab ticket
[14:26:11] let me see
[14:33:10] Analytics, Voice & Tone: Rename geoeditors_blacklist_country - https://phabricator.wikimedia.org/T259804 (Isaac) @JArguello-WMF thanks for the clarification (and no worries -- I've mis-handled many a Phabricator ticket)! > For this specific task, T259804, What would you say is the size? My guess is some...
[14:34:41] milimetric: the only tasks related to the cassandra migration I could find are: https://phabricator.wikimedia.org/T305102 and https://phabricator.wikimedia.org/T297944 but I could not find any reference about why we are still ingesting to the old nodes...
[14:36:36] olja_: what would be the best way to move forward with these requests? I've subscribed you and Andrew to those requests and commented, but let me know if some other workflow would work better for you
[15:15:35] Data-Engineering-Kanban, Event-Platform, Metrics-Platform, Wikidata, and 6 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (phuedx)
[15:15:37] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:12:33] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:23:59] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:34:40] Data-Engineering: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (xcollazo) Declined→Open Re-opening this ticket. As part of the work I am doing to fix T311976, I now need to be able to sudo into the `analytics` user. @mforns informs me that this is part of the `analytic...
[16:55:43] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:16:27] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:28:01] Data-Engineering: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (WDoranWMF) Not sure if it's required but I approve as a manager.
[17:50:25] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:01:51] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:36:07] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:16:10] Data-Engineering-Kanban, Data Engineering Planning, Data Pipelines: Investigate why airflow sensor tasks fail without sending errors - https://phabricator.wikimedia.org/T311976 (xcollazo)
[19:19:06] Data-Engineering-Kanban, Data Engineering Planning, Data Pipelines: Investigate why airflow sensor tasks fail without sending errors - https://phabricator.wikimedia.org/T311976 (EChetty)
[19:21:51] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:56:07] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:05:46] !log killing Oozie projectview-hourly and projectview-geo jobs to deploy corresponding jobs on airflow.
[20:05:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:17:00] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:15:59] Data-Engineering: requesting Kerbos password for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313316 (odimitrijevic) Request is approved.
[21:16:22] Data-Engineering: requesting Kerbos password for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313316 (odimitrijevic) a:RKemper
[21:26:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:31:14] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:46:08] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:00:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:02:46] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-c1505.scope,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:05:30] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:09:00] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:09:01] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 3 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Dreamy_Jazz)
[22:54:42] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:40:26] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:57:25] Data-Engineering, Patch-For-Review: requesting Kerbos password for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313316 (RKemper) #### Following https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_principal_for_a_real_user: Found current server: `1150:kerberos_kadmin_ser...