[00:18:40] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:10] (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:31:14] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:03] Data-Platform-SRE (2023/24 Q3 Milestone 1): Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (RKemper) @MoritzMuehlenhoff Oops it appears we made the same mistake twice :P Can you do one more check for us? I think everything is all set n...
[00:34:10] (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:39:10] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:49:10] (SystemdUnitFailed) firing: (9) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:15:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:10] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:42:15] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:59:15] (SystemdUnitFailed) firing: (8) wmf_auto_restart_prometheus-mysqld-exporter@s1.service Failed on dbstore1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:04:10] (SystemdUnitFailed) firing: (9) wmf_auto_restart_prometheus-mysqld-exporter@s1.service Failed on dbstore1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:42:15] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:04:10] (SystemdUnitFailed) firing: (9) wmf_auto_restart_prometheus-mysqld-exporter@s1.service Failed on dbstore1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:04:21] Analytics, Data-Engineering, Data-Engineering-Wikistats, Data Products (Data Products Sprint 05): Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (Spnq) I reviewed the data available to me and realised that I had previously misrea...
[07:30:24] (PS3) Cyndywikime: Add UserIsTemp property [schemas/event/primary] - https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273)
[07:30:56] (CR) CI reject: [V: -1] Add UserIsTemp property [schemas/event/primary] - https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[08:17:36] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (MoritzMuehlenhoff) >>! In T353392#9444003, @bking wrote: > Per today's pairing session with @RKemper , it looks like our current version of Elastic (7.10.2) do...
[08:41:27] Analytics, Data-Engineering, Data-Engineering-Wikistats, Data Products (Data Products Sprint 05): Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (Sfaci) Do you mean we don't need the last patch? Which should be the right value...
[08:48:15] Data-Engineering, Data-Engineering-Wikistats, Patch-For-Review: Incorrect number of content pages on stats.wikimedia.org - https://phabricator.wikimedia.org/T354489 (Sfaci) We are still working in a secondary bug related to the first one. You can catch up in the related ticket {T354074}. The new pat...
[09:37:03] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1): analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703 (Gehel) p:Triage→High
[09:37:07] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1): analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703 (Gehel)
[09:37:10] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Create 3 microsites for wdqs full graph, main graph, & scholarly articles - https://phabricator.wikimedia.org/T354658 (Gehel) p:Triage→High
[09:37:17] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Create 3 microsites for wdqs full graph, main graph, & scholarly articles - https://phabricator.wikimedia.org/T354658 (Gehel)
[09:37:23] Data-Engineering, Data-Platform-SRE, Event-Platform: [NEEDS GROOMING] schema services should be moved to k8s - https://phabricator.wikimedia.org/T347421 (Gehel) p:Triage→Low
[09:42:15] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:42:38] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 07), and 3 others: EventLoggingTest::testDispatch fails when time ticks within the test run - https://phabricator.wikimedia.org/T353243 (Sfaci) a:Sfaci
[09:47:37] Data-Engineering, Data-Platform-SRE: Create Conda Analytics environment for new Spark version - https://phabricator.wikimedia.org/T354733 (Gehel)
[09:49:27] Data-Engineering, Data-Platform-SRE: Validating that current jobs are running correctly with the new Conda Analytics environment - https://phabricator.wikimedia.org/T354735 (Gehel)
[09:50:29] Data-Engineering, Data-Platform-SRE: Deploy new Spark version on production - https://phabricator.wikimedia.org/T354737 (Gehel)
[09:50:46] Data-Platform-SRE: Create Conda Analytics environment for new Spark version - https://phabricator.wikimedia.org/T354733 (Gehel)
[09:51:11] Data-Engineering, Data-Platform-SRE: Deploy new Spark version on production - https://phabricator.wikimedia.org/T354737 (Gehel) p:Triage→Medium
[09:51:15] Data-Engineering, Data-Platform-SRE: Deploy new Spark version on production - https://phabricator.wikimedia.org/T354737 (Gehel) p:Medium→High
[09:51:24] Data-Platform-SRE: Create Conda Analytics environment for new Spark version - https://phabricator.wikimedia.org/T354733 (Gehel) p:Triage→High
[09:51:27] Data-Engineering, Data-Platform-SRE: Validating that current jobs are running correctly with the new Conda Analytics environment - https://phabricator.wikimedia.org/T354735 (Gehel) p:Triage→High
[09:53:12] Data-Engineering, Data-Platform-SRE: Deploy new Spark version on production - https://phabricator.wikimedia.org/T354737 (Gehel) p:High→Medium
[09:53:14] Data-Platform-SRE: Create Conda Analytics environment for new Spark version - https://phabricator.wikimedia.org/T354733 (Gehel) p:High→Medium
[09:53:16] Data-Engineering, Data-Platform-SRE: Validating that current jobs are running correctly with the new Conda Analytics environment - https://phabricator.wikimedia.org/T354735 (Gehel) p:High→Medium
[10:04:10] (SystemdUnitFailed) firing: (9) wmf_auto_restart_prometheus-mysqld-exporter@s1.service Failed on dbstore1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:25:00] Data-Platform-SRE (2023/24 Q3 Milestone 1), decommission-hardware: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (Stevemunene)
[10:25:59] Data-Platform-SRE (2023/24 Q3 Milestone 1), decommission-hardware: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (Stevemunene)
[10:26:52] Data-Platform-SRE (2023/24 Q3 Milestone 1), decommission-hardware: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (Stevemunene)
[10:27:28] Data-Platform-SRE (2023/24 Q3 Milestone 1), decommission-hardware: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (Stevemunene)
[10:27:30] Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene)
[10:27:45] Data-Platform-SRE (2023/24 Q3 Milestone 1), decommission-hardware: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (Stevemunene)
[10:27:47] Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene)
[10:28:09] Data-Platform-SRE (2023/24 Q3 Milestone 1), decommission-hardware: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (Stevemunene)
[10:28:11] Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene)
[11:29:23] FYI, I am about to merge this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/989213/ which will set up an-master1003 as a hadoop-master-in-waiting for T332573
[11:29:24] T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573
[11:30:11] Data-Engineering, Release-Engineering-Team, GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (gmodena) Thanks for clarifying @Antoine_Quhen . >>! In T351792#9446831, @Antoine_Quhen wrote: > 1/ Splitting the C...
[11:34:13] (PS4) Cyndywikime: Add UserIsTemp property [schemas/event/primary] - https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273)
[11:34:39] (CR) CI reject: [V: -1] Add UserIsTemp property [schemas/event/primary] - https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[11:39:05] !log decommission druid1004.eqiad.wmnet T354741
[11:39:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:39:09] T354741: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741
[11:39:10] (SystemdUnitFailed) firing: (10) hadoop-hdfs-namenode.service Failed on an-master1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:40:19] (SystemdUnitFailed) firing: (11) hadoop-hdfs-namenode.service Failed on an-master1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:44:41] (PS5) Cyndywikime: Add UserIsTemp property [schemas/event/primary] - https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273)
[11:45:11] (CR) CI reject: [V: -1] Add UserIsTemp property [schemas/event/primary] - https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[11:56:06] Data-Platform-SRE (2023/24 Q3 Milestone 1), decommission-hardware: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1002 for hosts: `druid1004.eqiad.wmnet` - druid1004.eqiad.wmnet (**PASS**)...
[11:57:28] (PS1) Cyndywikime: Add UserIsTemp property [schemas/event/primary] - https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273)
[11:58:06] Data-Engineering, Release-Engineering-Team, GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (BTullis) >>! In T351792#9449781, @gmodena wrote: > Thanks for clarifying @Antoine_Quhen . >> I have an alternative i...
[11:58:44] (Abandoned) Cyndywikime: Add UserIsTemp property [schemas/event/primary] - https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[12:05:51] !log decommission druid1005.eqiad.wmnet T354742
[12:05:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:05:54] T354742: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742
[12:21:31] Data-Platform-SRE (2023/24 Q3 Milestone 1), decommission-hardware: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1002 for hosts: `druid1005.eqiad.wmnet` - druid1005.eqiad.wmnet (**PASS**)...
[12:22:37] !log decommission druid1006.eqiad.wmnet T354743
[12:22:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:22:40] T354743: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743
[12:39:03] Data-Platform-SRE (2023/24 Q3 Milestone 1), decommission-hardware: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by stevemunene@cumin1002 for hosts: `druid1006.eqiad.wmnet` - druid1006.eqiad.wmnet (**PASS**)...
[12:40:51] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene)
[12:47:30] !log roll restarting hadoop test workers to pick up new JRE
[12:47:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:09:59] (PuppetFailure) firing: Puppet has failed on an-master1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:23:06] (KafkaReplicationFactorTooLow) firing: (860) Kafka topic PlatformEvent_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[13:28:43] (KafkaReplicationFactorTooLow) resolved: (860) Kafka topic PlatformEvent_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[13:39:11] (SystemdUnitFailed) firing: (11) hadoop-hdfs-namenode.service Failed on an-master1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:40:08] ^ this can be ignored. I thought I had downtimed it, whilst an-master1003 is being brought into service.
[13:42:15] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:45:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:06:48] Data-Engineering (Sprint 7): [Maintenance] Migrate ReportUpdater browser queries to Airflow - https://phabricator.wikimedia.org/T354552 (Snwachukwu) a:Snwachukwu
[14:42:26] Data-Engineering, MediaWiki-General, Event-Platform, Patch-For-Review: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 (Ottomata) Oh, and actually, we only need to count requests to mediawiki.org/beacon/event,...
[14:49:10] (SystemdUnitFailed) firing: (11) hadoop-yarn-resourcemanager.service Failed on an-master1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:49:59] (PuppetFailure) firing: (2) Puppet has failed on an-master1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:02:18] (PS2) Snwachukwu: Migration of browser General table to iceberg format. 1. Add a iceberg create table statement hql file fof browser_general table 2. Add hql file to update browser_general iceberg table with values. [analytics/refinery] - https://gerrit.wikimedia.org/r/988711 (https://phabricator.wikimedia.org/T352670)
[15:04:47] (CR) Snwachukwu: Migration of browser General table to iceberg format. 1. Add a iceberg create table statement hql file fof browser_general table 2. Add hql (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/988711 (https://phabricator.wikimedia.org/T352670) (owner: Snwachukwu)
[15:09:10] (SystemdUnitFailed) firing: (12) hadoop-mapreduce-historyserver.service Failed on an-master1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:11:39] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis) This going quite well, so far. I have brought up an-master1003 in the `analytics_cluster::hadoop::master` role and bootstrapped the namenod...
[15:13:24] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis)
[15:13:30] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e7c141a8-f6e4-4385-b205-cac59bd12d90) set by btullis@cumin1002 for 7 days, 0:00:00 o...
[15:43:51] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search: SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (bking) Open→In progress
[15:52:45] Data-Platform-SRE (2023/24 Q3 Milestone 1): Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (MoritzMuehlenhoff) Looks all good now, thanks
[16:06:55] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer) For reference, here's a screenshot of more kafka metrics around enabling compaction: {F41...
[16:15:08] Data-Engineering (Sprint 7), Data-Platform-SRE (2023/24 Q3 Milestone 1): Investigate Spark History Server silent errors when downloading some files from HDFS - https://phabricator.wikimedia.org/T354777 (brouberol)
[16:17:24] Data-Engineering (Sprint 7), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[16:19:14] Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (VRiley-WMF)
[16:20:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[16:38:48] Data-Engineering, Data Products (Data Products Sprint 05): Make defaults immutable for Airflow confs - https://phabricator.wikimedia.org/T325014 (xcollazo) This was deployed today and it is looking good in prod.
[16:38:54] (PS1) Mforns: Add query to load MediaWiki snapshot to Cassandra AQS config table [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948)
[16:42:23] Data-Engineering (Sprint 7), Data-Platform-SRE: Pin docker images to an explicit tag instead of using latest - https://phabricator.wikimedia.org/T354785 (brouberol)
[16:42:36] Data-Engineering (Sprint 7), Data-Platform-SRE (2023/24 Q3 Milestone 1): Pin docker images to an explicit tag instead of using latest - https://phabricator.wikimedia.org/T354785 (brouberol)
[16:47:32] (PS2) Mforns: Add query to load MediaWiki snapshot to Cassandra AQS config table [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948)
[17:07:33] Analytics, Data-Engineering, Data-Engineering-Wikistats, Data Products (Data Products Sprint 07): Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (WDoranWMF)
[17:20:40] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (bking) Thanks for the update! I'll set this to blocked. No rush or anything, but do reach out when the packages are ready or if we can do anything to help.
[17:22:17] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search: SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (bking)
[17:29:18] Data-Engineering (Sprint 7): [Refine System] Define a concept and an approach for refactoring the Refine system - https://phabricator.wikimedia.org/T354696 (Ahoelzl) a:JAllemandou
[17:29:31] Data-Engineering (Sprint 7): [Maintenance] Safeguard VarnishKafka to HAProxy analytics transition - https://phabricator.wikimedia.org/T354694 (Ahoelzl) a:gmodena
[17:30:02] Data-Engineering (Sprint 7): [Data Quality] Implement basic data quality metrics for MW history - https://phabricator.wikimedia.org/T354692 (Ahoelzl) a:Antoine_Quhen
[17:30:31] Data-Engineering (Sprint 7): [Iceberg Migration] Define sensor concept - https://phabricator.wikimedia.org/T354695 (Ahoelzl) a:Antoine_Quhen
[17:46:59] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:06:41] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (dr0ptp4kt)
[18:07:47] Quarry: build container on PR - https://phabricator.wikimedia.org/T316958 (rook) Open→Resolved
[18:07:57] Quarry, GitLab (Project Migration): Move Quarry from Gerrit to GitHub - https://phabricator.wikimedia.org/T308978 (rook)
[18:09:28] Data-Platform-SRE (2023/24 Q3 Milestone 1): Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis)
[18:17:31] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) In progress→Resolved
[18:37:53] Data-Engineering (Sprint 7), Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Pin docker images to an explicit tag instead of using latest - https://phabricator.wikimedia.org/T354785 (brouberol) Open→Resolved
[18:37:58] Data-Engineering (Sprint 7), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[18:39:44] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (bking)
[18:44:10] (CR) Sergio Gimeno: Add analytics for Impressions, Success and Abandonment of account creation (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[18:49:59] (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:52:48] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (dr0ptp4kt) I'm scheduling time with @Mayakp.wiki and @MGerlach to soon discuss potential future use cases, but if folks familiar with...
[18:58:23] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (dr0ptp4kt)
[19:04:10] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:05:19] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:16:02] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (bking)
[19:25:07] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search: SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (bking)
[21:47:16] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:03:20] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search: SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (bking)
[22:49:59] (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:05:19] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed