[00:10:31] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (bking) >>! In T355617#9482787, @cmooney wrote: >>>! In T355617#9482761, @bking wrote: >> @cmooney Unfortunately, cloudelastic is a stateful, clustere...
[01:15:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:20] (SystemdUnitFailed) firing: (12) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:22:50] (DiskSpace) firing: Disk space stat1005:9100:/ 2.552% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:27:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[01:42:15] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[01:48:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[02:08:16] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[02:13:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[02:18:15] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[05:20:38] (SystemdUnitFailed) firing: (11) user-runtime-dir@24065.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:22:50] (DiskSpace) firing: Disk space stat1005:9100:/ 2.549% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:46:02] Data-Engineering (Sprint 7), Patch-For-Review: [Iceberg Migration] Migrate interlanguage tables to Iceberg - https://phabricator.wikimedia.org/T352671 (CodeReviewBot) tchin opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/590 Add iceberg version of interlanguage_d...
[08:34:54] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Change TLS/load balancer configuration for cloudelastic - https://phabricator.wikimedia.org/T355720 (taavi) This is a bit more complicated unfortunately. The node FQDNs (cloudelasticXXXX) will indeed move to eqiad.wmnet and need to use CFSSL...
[08:56:05] (PS4) Aqu: Adopt a more resilient approach to use webrequest x-analytics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992475 (https://phabricator.wikimedia.org/T355391)
[09:20:38] (SystemdUnitFailed) firing: (11) user-runtime-dir@24065.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:22:51] (DiskSpace) firing: Disk space stat1005:9100:/ 2.57% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:33:54] Data-Engineering, Data-Platform-SRE: Check home/HDFS leftovers of daniram - https://phabricator.wikimedia.org/T355108 (Gehel)
[09:35:01] Data-Engineering, Data-Platform-SRE: Check home/HDFS leftovers of daniram - https://phabricator.wikimedia.org/T355108 (Gehel) p:Triage→Low
[09:44:34] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (Gehel)
[10:06:53] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (akosiaris) The deeper reason behind most of this mess is the probably the uniqueness of the `test` release. There is no other environment whe...
[10:13:32] Data-Engineering, Data-Platform-SRE: Check home/HDFS leftovers of nickifeajika - https://phabricator.wikimedia.org/T354241 (Gehel)
[10:16:21] Data-Engineering, Data-Platform-SRE: Check home/HDFS leftovers of nickifeajika - https://phabricator.wikimedia.org/T354241 (Gehel) p:Triage→Low
[10:17:32] Data-Engineering, Data-Platform-SRE, Infrastructure-Foundations, SRE: Investigate crypto KDC deprecations after Bullseye update - https://phabricator.wikimedia.org/T337544 (Gehel)
[10:19:07] Data-Engineering, Data-Platform-SRE, Infrastructure-Foundations, SRE: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802 (Gehel)
[10:19:49] Data-Engineering (Sprint 7), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol) Thanks for the kind words @xcollazo!
[10:26:23] (Abandoned) Aqu: [WIP] Rewrite x-analytics header hash parsing [analytics/refinery/source] - https://gerrit.wikimedia.org/r/991805 (owner: Aqu)
[10:51:23] (CR) Joal: "If I don't mistake we should remove the _iceberg computation files as well as updating those" [analytics/refinery] - https://gerrit.wikimedia.org/r/983673 (https://phabricator.wikimedia.org/T347879) (owner: Aqu)
[10:53:23] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (taavi) Trying to change the node FQDN without a re-image seems more trouble than it's worth if that can be avoided relatively easily (which it seems...
[12:18:34] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (Lucas_Werkmeister_WMDE) >>! In T355685#9484091, @akosiaris wrote: > My high level suggestion would be to re-evaluate if the `test` helm relea...
[12:47:36] (CR) Joal: "This comment is completely wrong - I messed between druid-loading dags and unique-devices preparation jobs - I'm very sorry." [analytics/refinery] - https://gerrit.wikimedia.org/r/983673 (https://phabricator.wikimedia.org/T347879) (owner: Aqu)
[13:01:45] (PS3) Aqu: Unique devices druid ingestion job - Iceberg migration [analytics/refinery] - https://gerrit.wikimedia.org/r/983673 (https://phabricator.wikimedia.org/T347879)
[13:12:20] (CR) Aqu: [V: +2 C: +2] "Review done and the code has been tested in isolation on production. Ready for deploy." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992475 (https://phabricator.wikimedia.org/T355391) (owner: Aqu)
[13:20:38] (SystemdUnitFailed) firing: (11) user-runtime-dir@24065.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:22:51] (DiskSpace) firing: Disk space stat1005:9100:/ 2.65% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:25:59] (Merged) jenkins-bot: Adopt a more resilient approach to use webrequest x-analytics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992475 (https://phabricator.wikimedia.org/T355391) (owner: Aqu)
[13:28:25] (PS1) Aqu: Update Changelog for v0.2.29 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992694
[13:29:00] (CR) Aqu: [V: +2 C: +2] Update Changelog for v0.2.29 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992694 (owner: Aqu)
[13:29:40] Data-Engineering (Sprint 7), Spike: [Data Quality] [SPIKE] Can we migrate the anomaly detection job to DeeQu checks - https://phabricator.wikimedia.org/T354566 (gmodena) An update on progress so far. Anomaly detection on traffic, UA, and mobile OS uses an entropy based method to estimate irregularities...
[13:38:17] (CR) Joal: [C: +1] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948) (owner: Mforns)
[13:40:02] (Merged) jenkins-bot: Update Changelog for v0.2.29 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992694 (owner: Aqu)
[13:41:17] (CR) Aqu: [V: +2 C: +2] Adopt a more resilient approach to use webrequest x-analytics [analytics/refinery] - https://gerrit.wikimedia.org/r/992477 (https://phabricator.wikimedia.org/T355391) (owner: Aqu)
[13:41:47] (CR) Joal: "Still one change needed" [analytics/refinery] - https://gerrit.wikimedia.org/r/983673 (https://phabricator.wikimedia.org/T347879) (owner: Aqu)
[13:44:33] (PS4) Aqu: Unique devices druid ingestion job - Iceberg migration [analytics/refinery] - https://gerrit.wikimedia.org/r/983673 (https://phabricator.wikimedia.org/T347879)
[13:44:58] (CR) Aqu: Unique devices druid ingestion job - Iceberg migration (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/983673 (https://phabricator.wikimedia.org/T347879) (owner: Aqu)
[13:45:39] (CR) Joal: [C: +1] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/983673 (https://phabricator.wikimedia.org/T347879) (owner: Aqu)
[13:46:05] Starting build #94 for job analytics-refinery-update-jars-docker
[13:46:14] Project analytics-refinery-update-jars-docker build #94: FAILURE in 9.3 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/94/
[13:47:19] Starting build #95 for job analytics-refinery-update-jars-docker
[13:47:36] Project analytics-refinery-update-jars-docker build #95: STILL FAILING in 17 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/95/
[13:49:18] Starting build #134 for job analytics-refinery-maven-release-docker
[14:04:21] Data-Platform-SRE: Fix "requests triggering circuit breakers" Elastic alert - https://phabricator.wikimedia.org/T355795 (bking)
[14:08:55] Project analytics-refinery-maven-release-docker build #134: SUCCESS in 19 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/134/
[14:11:18] Starting build #96 for job analytics-refinery-update-jars-docker
[14:11:43] (PS1) Maven-release-user: Add refinery-source jars for v0.2.29 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/992650
[14:11:43] Yippee, build fixed!
[14:11:43] Project analytics-refinery-update-jars-docker build #96: FIXED in 25 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/96/
[14:12:49] (CR) Joal: "Some minimal changes" [analytics/refinery] - https://gerrit.wikimedia.org/r/986839 (https://phabricator.wikimedia.org/T352671) (owner: TChin)
[14:13:51] (CR) Aqu: [C: +2] Add refinery-source jars for v0.2.29 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/992650 (owner: Maven-release-user)
[14:13:54] (CR) Aqu: [V: +2 C: +2] Add refinery-source jars for v0.2.29 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/992650 (owner: Maven-release-user)
[14:16:23] (PS5) Aqu: Unique devices druid ingestion job - Iceberg migration [analytics/refinery] - https://gerrit.wikimedia.org/r/983673 (https://phabricator.wikimedia.org/T347879)
[14:18:01] (CR) Aqu: [V: +2 C: +2] Unique devices druid ingestion job - Iceberg migration [analytics/refinery] - https://gerrit.wikimedia.org/r/983673 (https://phabricator.wikimedia.org/T347879) (owner: Aqu)
[14:20:00] (CR) Aqu: [C: +2] Add iceberg version of aqs_hourly table [analytics/refinery] - https://gerrit.wikimedia.org/r/982869 (https://phabricator.wikimedia.org/T352669) (owner: TChin)
[14:20:05] (CR) Aqu: [V: +2 C: +2] Add iceberg version of aqs_hourly table [analytics/refinery] - https://gerrit.wikimedia.org/r/982869 (https://phabricator.wikimedia.org/T352669) (owner: TChin)
[14:21:29] Data-Engineering (Sprint 7), Patch-For-Review: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables - https://phabricator.wikimedia.org/T347879 (CodeReviewBot) aqu merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/582 Feed Druid unique de...
[14:22:27] Data-Engineering (Sprint 7), Patch-For-Review: [Iceberg Migration] Migrate aqs hourly tables to Iceberg - https://phabricator.wikimedia.org/T352669 (CodeReviewBot) aqu merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/569 Add iceberg version of aqs_hourly dag
[14:29:48] Data-Engineering (Sprint 7), Patch-For-Review: Fix refinery-source.refinery-core.Utilities::getValueForKey - https://phabricator.wikimedia.org/T355391 (CodeReviewBot) aqu opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/591 Fix Refine webrequest to include bugfix...
[14:31:16] !log Refinery weekly deployment train - begin
[14:31:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:54:39] (CR) Aqu: [C: +2] Migration of browser General table to iceberg format. [analytics/refinery] - https://gerrit.wikimedia.org/r/988711 (https://phabricator.wikimedia.org/T352670) (owner: Snwachukwu)
[14:54:41] (CR) Aqu: [V: +2 C: +2] Migration of browser General table to iceberg format. [analytics/refinery] - https://gerrit.wikimedia.org/r/988711 (https://phabricator.wikimedia.org/T352670) (owner: Snwachukwu)
[15:19:44] Data-Engineering (Sprint 7), Patch-For-Review: Fix refinery-source.refinery-core.Utilities::getValueForKey - https://phabricator.wikimedia.org/T355391 (CodeReviewBot) aqu merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/591 Fix Refine webrequest to include bugfix...
[15:21:43] !log Refinery weekly deployment train - end (scap, then deployed onto hdfs) (test cluster deploy still broken T354703)
[15:21:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:21:46] T354703: analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703
[15:29:02] Data-Platform-SRE (2024.01.22 - 2024.02.11), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:33:27] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search (Current work): SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (bking)
[15:38:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[15:40:23] Data-Platform-SRE (2024.01.22 - 2024.02.11): Fix "requests triggering circuit breakers" Elastic alert - https://phabricator.wikimedia.org/T355795 (bking) Open→In progress a:bking
[15:44:31] Data-Platform-SRE (2024.01.22 - 2024.02.11): Fix "requests triggering circuit breakers" Elastic alert - https://phabricator.wikimedia.org/T355795 (bking) Per 1x1 with @dcausse , the healthcheck's metric [[ https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/890ea5ff1acc9bac57aa8bf08b9008a1e8eb...
[15:45:42] Data-Platform-SRE (2024.01.22 - 2024.02.11): Fix "requests triggering circuit breakers" Elastic alert - https://phabricator.wikimedia.org/T355795 (bking)
[16:06:45] Data-Platform-SRE (2024.01.22 - 2024.02.11): Fix "requests triggering circuit breakers" Elastic alert - https://phabricator.wikimedia.org/T355795 (fgiunchedi) Thank you for the quick update, from a very quick look I concur that switching to logstash-based metrics/alerts is the right thing to do here; I belie...
[16:11:17] (PS1) Aqu: Fix null pointer error in isPageviewUdf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992752 (https://phabricator.wikimedia.org/T355391)
[16:15:56] Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate Search Platform-owned hosts to Puppet 7 - https://phabricator.wikimedia.org/T354959 (MoritzMuehlenhoff)
[16:17:09] (PS2) Aqu: Fix null pointer error in isPageviewUdf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992752 (https://phabricator.wikimedia.org/T355391)
[16:41:53] (PS18) Btullis: Update to Superset version 3.1.0 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[16:45:25] Data-Engineering, Data Products (Data Products Sprint 04): Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest - https://phabricator.wikimedia.org/T351909 (WDoranWMF) Open→Resolved
[16:48:41] (CR) Brouberol: [C: +1] Update to Superset version 3.1.0 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356) (owner: Btullis)
[16:48:46] Data-Engineering, Data Products (Data Products Sprint 05): Make defaults immutable for Airflow confs - https://phabricator.wikimedia.org/T325014 (WDoranWMF) Open→Resolved
[16:48:48] Data-Engineering (Sprint 8), serviceops-radar, Data Products (Data Products Sprint 05): Rewrite all Airflow sensors that use datacenter prepartitions to depend on both datacenters - https://phabricator.wikimedia.org/T338796 (WDoranWMF) Open→Resolved
[17:20:38] (SystemdUnitFailed) firing: (11) user-runtime-dir@24065.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:22:51] (DiskSpace) firing: Disk space stat1005:9100:/ 2.67% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:24:44] (CR) Btullis: [V: +2 C: +2] Update to Superset version 3.1.0 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356) (owner: Btullis)
[17:29:24] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (BTullis)
[17:55:32] (CR) Nik Gkountas: [C: +1] cx event: add event sources for return to the dashboard [schemas/event/secondary] - https://gerrit.wikimedia.org/r/991111 (https://phabricator.wikimedia.org/T355200) (owner: KCVelaga)
[17:55:47] (CR) Nik Gkountas: [C: +1] "Looks good to me" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/991111 (https://phabricator.wikimedia.org/T355200) (owner: KCVelaga)
[18:08:08] (PS3) Aqu: Fix null pointer error in isPageviewUdf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992752 (https://phabricator.wikimedia.org/T355391)
[18:36:39] (PS4) Aqu: Fix null pointer error in isPageviewUdf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992752 (https://phabricator.wikimedia.org/T355391)
[19:38:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[19:44:52] (PS5) Aqu: Fix null pointer error in isPageviewUdf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992752 (https://phabricator.wikimedia.org/T355391)
[20:06:15] (PS6) Aqu: Fix null pointer error in isPageviewUdf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992752 (https://phabricator.wikimedia.org/T355391)
[20:15:28] (CR) Joal: "One nit" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992752 (https://phabricator.wikimedia.org/T355391) (owner: Aqu)
[20:28:09] (PS7) Aqu: Fix null pointer error in isPageviewUdf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992752 (https://phabricator.wikimedia.org/T355391)
[20:32:10] (CR) Joal: [C: +1] "LGTM" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992752 (https://phabricator.wikimedia.org/T355391) (owner: Aqu)
[20:35:25] Data-Engineering (Sprint 7), Patch-For-Review: Fix refinery-source.refinery-core.Utilities::getValueForKey - https://phabricator.wikimedia.org/T355391 (CodeReviewBot) aqu opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/592 Fix Refine webrequest to include bug fix...
[20:37:45] (CR) Aqu: [V: +2 C: +2] Fix null pointer error in isPageviewUdf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/992752 (https://phabricator.wikimedia.org/T355391) (owner: Aqu)
[20:41:56] Starting build #135 for job analytics-refinery-maven-release-docker
[20:46:25] Data-Engineering, Product-Analytics, superset.wikimedia.org: Investigate Superset query templating as a mean to optimize partition pruning - https://phabricator.wikimedia.org/T299961 (JAllemandou) Open→Declined Closing as the strategy is to migrate to Iceberg.
[20:56:52] Project analytics-refinery-maven-release-docker build #135: SUCCESS in 14 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/135/
[21:01:28] Data-Engineering, MediaWiki-extensions-EventLogging, CSS: Schema code samples popup appears under the JSON table - https://phabricator.wikimedia.org/T272857 (VirginiaPoundstone)
[21:04:55] Data-Engineering (Sprint 7), Patch-For-Review: Fix refinery-source.refinery-core.Utilities::getValueForKey - https://phabricator.wikimedia.org/T355391 (CodeReviewBot) aqu merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/592 Fix Refine webrequest to include bug fix...
[21:13:04] Data-Engineering: Set up regular-repairs for AQS cassandra cluster tables - https://phabricator.wikimedia.org/T297944 (VirginiaPoundstone) @JAllemandou and @BTullis what are next steps on this work? Is this still valid (the task is quite old).
[21:15:56] Data-Engineering, AQS2.0, Data Products, PageViewInfo: MediaWiki frequently receives HTTP 500 from AQS (via PageViewInfo extension) - https://phabricator.wikimedia.org/T341634 (VirginiaPoundstone) @BTullis is this issue still present? This task was filed prior to the switchover to AQS 2, so I'm...
[21:20:38] (SystemdUnitFailed) firing: (11) user-runtime-dir@24065.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:22:51] (DiskSpace) firing: Disk space stat1005:9100:/ 2.659% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:26:37] Data-Engineering: Set up regular-repairs for AQS cassandra cluster tables - https://phabricator.wikimedia.org/T297944 (JAllemandou) The task is old but the objective is still valid IMO. We should talk to @Eevans about this.
[21:41:13] Analytics, AQS2.0, Tech-Docs-Team, Data Products (Epics Timeline), and 3 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (VirginiaPoundstone)
[22:10:57] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye
[22:11:09] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2104.codfw.wmnet with OS bullseye
[22:11:32] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2106.codfw.wmnet with OS bullseye
[22:11:34] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2105.codfw.wmnet with OS bullseye
[22:13:34] Data-Platform-SRE, DC-Ops: Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (bking)
[22:50:42] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2106.codfw.wmnet with OS bullseye completed: - elastic2106 (**PASS**)...
[22:58:07] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (RKemper) Forgot to add the `Bug:` label but https://gerrit.wikimedia.org/r/c/operations/puppet/+/992826 is part of this ticket as well
[23:04:37] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye executed with errors: - elastic2103...
[23:32:25] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2104.codfw.wmnet with OS bullseye executed with errors: - elastic2104...
[23:32:28] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2105.codfw.wmnet with OS bullseye executed with errors: - elastic2105...
[23:34:33] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye
[23:38:31] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[23:44:53] Analytics, AQS2.0, Tech-Docs-Team, Data Products (Epics Timeline), and 3 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (apaskulin)