[00:25:31] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:15] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:30:57] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:37:40] 10Data-Engineering, 10Data-Services, 10Privacy Engineering, 10cloud-services-team (Kanban): Increased visibility in wiki-replicas for volunteers fighting vandals - https://phabricator.wikimedia.org/T284944 (10aokomoriuta) > If people are looking to harass anti-vandalism editors, I'm pretty sure they're not...
[02:04:53] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:50:13] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:01:33] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:35:35] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:46:53] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:02:26] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:23:44] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:57:46] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:09:08] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:41:03] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:51:51] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:37:18] 10Data-Engineering, 10Event-Platform Value Stream: [Shared Event Platform] enrichment module should not depend on flink-scala - https://phabricator.wikimedia.org/T310680 (10lbowmaker) 05Open→03Resolved
[09:49:27] 10Data-Engineering-Kanban, 10Shared-Data-Infrastructure: Determine IP ranges for dse-k8s cluster - https://phabricator.wikimedia.org/T310169 (10BTullis) I notice that we don't have any IPv6 ranges allocated for this yet, nor a specific ASN. I'm planning to create them in netbox. * `2620:0:861:302::/64` - DSE...
[09:55:06] 10Data-Engineering-Kanban, 10Shared-Data-Infrastructure: Determine IP ranges for dse-k8s cluster - https://phabricator.wikimedia.org/T310169 (10BTullis) We already have the IPv4 ranges defined, but we might want to update the descriptions for clarity, now that we're sharing responsibility for this cluster betw...
[10:21:04] btullis: I've merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/115 which was approved last week. I think it's part of the fix for too much airflow logging, so it would be good to deploy. Adding to the etherpad
[10:39:54] milimetric: Thanks. I'll deploy it later today. 👍
[10:40:02] thx!
[10:52:55] 10Analytics-Radar, 10Recommendation-API, 10SRE: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (10Aklapper) a:05dpifke→03None Removing inactive task assignee (please do so as part of offboarding processes).
[11:13:49] btullis: I checked the squash commit box! I'm so sorry... is it too late to rewrite history and force push?
[11:15:09] oh... I only see two commits in the log, the squashed one and the merge commit
[11:16:50] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[11:17:49] Oh yes, I see. The `main` branch has the squashed commit. Let's not bother doing anything here. It's just nice if ever you go back to an MR to see a clean list of commits, with what each was trying to achieve. I find it also helps with the review process.
[11:18:58] ^^^ Uh oh, I'd better investigate this RPC queue length. I wonder if there is a big job running.
[11:21:50] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[11:28:34] 10Data-Engineering-Kanban, 10Shared-Data-Infrastructure: Determine IP ranges for dse-k8s cluster - https://phabricator.wikimedia.org/T310169 (10BTullis) I have now added these records to Netbox {F35458908,width=70%} https://netbox.wikimedia.org/ipam/asns/55/ {F35459272,width=70%} https://netbox.wikimedia.org/...
[11:52:14] milimetric: qq I've updated the wikireplicas with this maintain-views change: https://phabricator.wikimedia.org/T313281
[11:52:35] Hi team! I'm back onto the connected world :)
[11:53:00] I don't think I need to do anything about updating the sqoop job or an allowlist for when we import this, do I?
[11:53:03] welcome back joal !
[11:53:13] Checking jo
[11:53:13] joal: Hi, welcome back!
[11:54:15] You're good btullis, sqoop list is manual but I'd rather automate updating that
[11:54:22] I'm onto unpacking emails (will take some time :S) - Is there anything you'd like me to help with now?
[11:54:36] and pageview allowlist will alert us if something new shows up
[11:55:18] milimetric: Thanks for checking. I can resolve that ticket then, I think.
[11:56:51] joal: I can't think of anything critical right now, thanks. My mind's gone blank, but I'm sure that there are many things that will occur to me :-)
[11:57:28] All good btullis - nothing critical feels great :)
[12:27:42] 10Quarry, 10Patch-Needs-Improvement: Add rate limiting on queries execution - https://phabricator.wikimedia.org/T225869 (10Aklapper)
[13:02:23] 10Data-Engineering, 10Epic, 10Event-Platform Value Stream (Sprint 00): [Shared Event Platform] Design and Implement POC Flink Service to Combine Existing Streams, Enrich and Output to New Topic - https://phabricator.wikimedia.org/T307959 (10gmodena)
[13:02:40] 10Data-Engineering, 10Event-Platform Value Stream, 10Epic: [Shared Event Platform] Design and Implement POC Flink Service to Combine Existing Streams, Enrich and Output to New Topic - https://phabricator.wikimedia.org/T307959 (10gmodena)
[13:03:05] 10Data-Engineering, 10Event-Platform Value Stream: [Shared Event Platform][SPIKE] investigate Flink metric reporters and prometheus integration - https://phabricator.wikimedia.org/T310805 (10gmodena) Resolving. Follow up work has been identified in https://phabricator.wikimedia.org/T311070
[13:03:19] 10Data-Engineering, 10Event-Platform Value Stream: [Shared Event Platform][SPIKE] investigate Flink metric reporters and prometheus integration - https://phabricator.wikimedia.org/T310805 (10gmodena) 05Open→03Resolved
[14:01:06] joal: helllOoOOoOoOo o/ hope you had a nice time off!
[14:01:35] Hi ottomata :) Wonderful indeed, thank you for asking :)
[14:04:34] ottomata: I've seen your invite for tomorrow's meeting on mediawiki page events refactoring - I won't be able to attend :S I'm sorry for that
[14:27:06] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Patch-For-Review: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (10EChetty)
[14:45:04] 10Data-Engineering, 10Data Pipelines: [Airflow] Add log rotation to scheduler logs - https://phabricator.wikimedia.org/T315326 (10mforns)
[15:03:24] 10Data-Engineering, 10Data Pipelines, 10Patch-For-Review: Install spark3 in analytics clusters - https://phabricator.wikimedia.org/T295072 (10EChetty)
[15:06:14] 10Data-Engineering, 10Data Pipelines (Sprint 00), 10Patch-For-Review: Install spark3 in analytics clusters - https://phabricator.wikimedia.org/T295072 (10EChetty)
[15:08:15] 10Data-Engineering: Update Search Engine list - https://phabricator.wikimedia.org/T315329 (10Isaac)
[15:09:11] 10Data-Engineering: Update Search Engine list - https://phabricator.wikimedia.org/T315329 (10Isaac) I'm happy to take a pass on updating these regexes but would want someone from DE to validate / deploy
[15:40:42] ottomata: You'll be working with Xabriel on the Geoeditors jobs deploy, right? Is there anything I have to do for the train on this?
[15:40:58] 10Data-Engineering, 10Foundational Technology Requests, 10SRE: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (10Ottomata) I wanted to get a very stupid simple example of using Flink to sample webrequest in Kafka. Here's an example using purely strea...
[15:48:25] btullis: I could! I'm going to work with him today on the platform eng airflow instance
[15:48:38] milimetric: was mostly helping shepherd that out, but I was there trying too.
[15:48:53] lemme check with him
[15:55:26] I'm about to do the airflow-dags deploy right now - as per https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#analytics
[16:02:23] !log deploying airflow-dags
[16:02:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:03:14] mforns: SandraEbele: Are you able to check that the airflow-dags change to the SLA level has worked please?
[16:03:26] btullis: lookin!
[16:17:33] btullis: I think all DAGs except for one (wikidata_item_page_link_weekly_dag) aren't seeing any SLA errors any more.
[16:17:44] I wonder what happened with that one, lookin
[16:18:54] Oh! I see that it has not been updated to specify the SLA at the DAG level; it's still giving it at the task level.
[16:19:00] will create a fix
[16:19:20] btullis: do you have the task handy?
[16:32:51] is there a convenient way to force a kafka consumer group to have a particular offset without custom code in the thing itself? Essentially I'd like to turn off the consumer, run some command to set the offset, then turn the consumer back on and have it resume from the set position
[16:33:33] i suppose i can probably write up some python script, but thought there might already be a thing
[20:43:08] PROBLEM - Check systemd state on an-airflow1004 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:49:58] 10Data-Engineering, 10Data-Engineering-Kanban: Investigate Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263 (10Aklapper)
[21:00:58] 10Data-Engineering, 10Data Pipelines: airflow instances should use specific artifact cache directories - https://phabricator.wikimedia.org/T315374 (10Ottomata)
[21:41:14] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 00), 10Spike: [SPIKE] Decide on technical solution for page state stream backfill process - https://phabricator.wikimedia.org/T314389 (10tchin) - Are we backfilling both the page state change stream and/or the one with content? - Do we want both the f...
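Editor's note on the SLA discussion above (SLA specified per task instead of once at the DAG level): a common way to express a DAG-wide SLA in stock Airflow is to set `sla` once in `default_args`, so every task inherits it rather than repeating it per operator. This is only a minimal sketch of that pattern; the owner, DAG id, and six-hour SLA below are hypothetical, and the airflow-dags repo may use its own helper instead.

```python
from datetime import timedelta

# In stock Airflow, any key in default_args is applied to every task
# in the DAG, so a single "sla" entry here acts as a DAG-wide SLA.
# Values are hypothetical illustrations, not the repo's real config.
default_args = {
    "owner": "analytics",       # hypothetical owner
    "sla": timedelta(hours=6),  # one SLA inherited by all tasks
}

# The DAG would then be declared roughly as:
#   with DAG("wikidata_item_page_link_weekly",
#            default_args=default_args, ...):
#       ...  # tasks no longer need their own sla=... arguments
```

The fix described in the log would amount to deleting the per-task `sla=...` arguments and relying on this single inherited value.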
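Editor's note on the question above about resetting a consumer group's offset without custom code: Kafka ships with the `kafka-consumer-groups.sh` tool, which does exactly this, provided the consumer is stopped first (offsets can only be reset while the group has no active members). A sketch; the broker address, group, topic, and offset below are hypothetical placeholders.

```shell
# Stop the consumer first, then preview the reset (hypothetical names).
# --dry-run shows the offsets that would be set without applying them.
kafka-consumer-groups.sh --bootstrap-server kafka1001:9092 \
  --group my-consumer-group --topic my-topic \
  --reset-offsets --to-offset 12345 --dry-run

# Apply the reset for real, then restart the consumer; it resumes
# from the committed offset.
kafka-consumer-groups.sh --bootstrap-server kafka1001:9092 \
  --group my-consumer-group --topic my-topic \
  --reset-offsets --to-offset 12345 --execute
```

Besides `--to-offset`, the tool also supports `--to-earliest`, `--to-latest`, `--shift-by N`, and `--to-datetime` for time-based resets.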