[00:05:45] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm completed: - elastic1107 (**PASS**)...
[00:09:45] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1105.eqiad.wmnet with OS bookworm completed: - elastic1105 (**PASS**)...
[00:12:20] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (Jclark-ctr)
[00:12:40] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[00:22:33] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm
[00:23:10] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm executed with errors: - elastic1107...
[00:38:21] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2094.codfw.wmnet with OS bookworm completed: - elastic2094 (**PASS**)...
[00:38:25] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2097.codfw.wmnet with OS bookworm completed: - elastic2097 (**WARN**)...
[00:38:34] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2099.codfw.wmnet with OS bookworm completed: - elastic2099 (**WARN**)...
[00:38:52] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2098.codfw.wmnet with OS bookworm completed: - elastic2098 (**WARN**)...
[00:40:19] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2100.codfw.wmnet with OS bookworm
[00:41:12] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Papaul)
[00:42:00] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Papaul)
[00:44:32] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2101.codfw.wmnet with OS bookworm
[00:49:37] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2102.codfw.wmnet with OS bookworm
[00:54:02] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2103.codfw.wmnet with OS bookworm
[00:59:57] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2104.codfw.wmnet with OS bookworm
[01:09:40] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm)
[01:11:28] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 5.816% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:29:59] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2101.codfw.wmnet with OS bookworm completed: - elastic2101 (**WARN**)...
[01:30:05] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2100.codfw.wmnet with OS bookworm completed: - elastic2100 (**PASS**)...
[01:31:37] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2102.codfw.wmnet with OS bookworm completed: - elastic2102 (**PASS**)...
[01:33:00] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2105.codfw.wmnet with OS bookworm
[01:34:07] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2103.codfw.wmnet with OS bookworm completed: - elastic2103 (**PASS**)...
[01:36:42] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2106.codfw.wmnet with OS bookworm
[01:40:26] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2107.codfw.wmnet with OS bookworm
[01:40:40] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2104.codfw.wmnet with OS bookworm completed: - elastic2104 (**PASS**)...
[01:43:33] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2108.codfw.wmnet with OS bookworm
[01:49:42] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2109.codfw.wmnet with OS bookworm
[01:50:48] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm)
[01:51:18] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm) note to self to check the network port on ceph2002
[01:51:21] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm) a:Jhancock.wm
[02:16:51] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2105.codfw.wmnet with OS bookworm completed: - elastic2105 (**PASS**)...
[02:18:03] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2106.codfw.wmnet with OS bookworm completed: - elastic2106 (**PASS**)...
[02:27:29] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2108.codfw.wmnet with OS bookworm completed: - elastic2108 (**WARN**)...
[02:27:34] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2107.codfw.wmnet with OS bookworm completed: - elastic2107 (**PASS**)...
[02:31:46] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2109.codfw.wmnet with OS bookworm completed: - elastic2109 (**PASS**)...
[02:34:53] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Papaul)
[02:35:32] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Papaul) Open→Resolved @bking all your's
[03:57:16] (PS3) Clare Ming: Add custom schema for *uiactionstracking instruments [schemas/event/secondary] - https://gerrit.wikimedia.org/r/978718 (https://phabricator.wikimedia.org/T351298)
[05:11:28] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 5.743% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:51:21] Data-Engineering: Check home/HDFS leftovers of jbond - https://phabricator.wikimedia.org/T352511 (MoritzMuehlenhoff)
[09:11:29] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 5.689% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:16:38] Data-Platform-SRE, Discovery-Search (Current work): CirrusSearch: make p95 alerts more granular - https://phabricator.wikimedia.org/T349340 (Gehel) Open→Resolved a:Gehel
[09:17:31] Data-Platform-SRE, Discovery-Search (Current work): Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (Gehel) Open→Resolved
[09:17:33] Data-Platform-SRE, Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (Gehel)
[09:21:05] Data-Platform-SRE, Discovery-Search (Current work), Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (Gehel)
[09:26:16] Data-Engineering (Sprint 5), Data-Platform-SRE: Define a docker image containing kerberos-related tooling - https://phabricator.wikimedia.org/T352406 (CodeReviewBot) brouberol updated https://gitlab.wikimedia.org/repos/data-engineering/kerberos-kinit/-/merge_requests/3 Fix: the kinit apt package has to...
[09:26:22] Data-Engineering (Sprint 5), Data-Platform-SRE: Define a docker image containing kerberos-related tooling - https://phabricator.wikimedia.org/T352406 (CodeReviewBot) brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/kerberos-kinit/-/merge_requests/3 Fix: the kinit apt package has to d...
[09:26:33] Data-Engineering (Sprint 5), Data-Platform-SRE: Define a docker image containing kerberos-related tooling - https://phabricator.wikimedia.org/T352406 (CodeReviewBot) brouberol updated https://gitlab.wikimedia.org/repos/data-engineering/kerberos-kinit/-/merge_requests/2 Fix gitlab yml file until the pipe...
[09:33:52] Data-Engineering (Sprint 5), Data-Platform-SRE: Define a docker image containing kerberos-related tooling - https://phabricator.wikimedia.org/T352406 (brouberol) Open→Resolved
[09:33:55] Data-Engineering (Sprint 5), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[09:34:13] Data-Engineering (Sprint 5), Data-Platform-SRE: Define a docker image containing kerberos-related tooling - https://phabricator.wikimedia.org/T352406 (brouberol) Build pipeline: https://gitlab.wikimedia.org/repos/data-engineering/kerberos-kinit/-/pipelines/32660
[09:35:18] Data-Engineering (Sprint 5), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[09:55:43] Data-Platform-SRE, Patch-For-Review: Create a helm chart for the spark-history service - https://phabricator.wikimedia.org/T351722 (brouberol) The helm chart is _mostly_ done. @BTullis @Antoine_Quhen whenever possible and convenient, could I get your opinion on the general design and the default values b...
[10:04:13] !log marked TaskInstance: pageview_hourly.move_data_to_archive scheduled__2023-12-01T06:00:00+00:00 as succeeded in airflow analytics
[10:04:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:04:59] (PuppetFailure) firing: Puppet has failed on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:09:59] (PuppetFailure) firing: (2) Puppet has failed on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:11:38] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (Gehel) Open→Resolved
[10:11:41] Data-Platform-SRE, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (Gehel)
[10:12:18] Data-Platform-SRE, Discovery-Search (Current work): Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (Gehel) In progress→Resolved
[10:12:21] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Ensure WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 (Gehel)
[10:15:28] Data-Engineering, Diffusion-Repository-Administrators, Observability-Metrics, Projects-Cleanup, Wikimedia-GitHub: Consider archiving Gerrit repository "operations/software/dropwizard-metrics" (20150219) - https://phabricator.wikimedia.org/T352103 (fgiunchedi) Pretty sure we can nuke this, I'l...
[10:16:28] Data-Platform-SRE, Discovery-Search (Current work): Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (Gehel) Resolved→Open
[10:16:30] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Ensure WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 (Gehel)
[10:16:33] Data-Platform-SRE, Discovery-Search (Current work): Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (dcausse) The ticket description mentions wdqs2022 but the slow down was also observed on wdqs1022 and wdqs1023 which are both 2.40Ghz CPUs,...
[10:19:59] (PuppetFailure) firing: (2) Puppet has failed on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:44:59] (PuppetFailure) resolved: Puppet has failed on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:13:23] !log pool druid1010 after reimage T336043
[11:13:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:13:26] T336043: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043
[12:06:44] Data-Platform-SRE: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (BTullis) Open→Resolved a:BTullis I have now archived these files: ` sudo -u hdfs kerberos-run-command hdfs hdfs dfs -mv /user/paramd /wmf/data/archive/user/ sudo -u hdfs kerberos-run-command hdfs h...
[12:12:11] (CR) Mabualruz: "I cannot judge the schema as I do not know all the required parts" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/978718 (https://phabricator.wikimedia.org/T351298) (owner: Clare Ming)
[12:16:12] Data-Platform-SRE: Check home/HDFS leftovers of aranyap - https://phabricator.wikimedia.org/T340945 (BTullis) a:BTullis I see that @aranyap has now been employed by the WMF ([[https://wikimedia.slack.com/archives/CSG7RKWTY/p1691081664977309|see Slack thread]]), after having completed her internship. Welc...
[12:25:05] Data-Platform-SRE: Check home/HDFS leftovers of ntsako - https://phabricator.wikimedia.org/T343189 (BTullis) @WDoranWMF - Would you have any update on this question of whether we should delete or archive the files belonging to @ntsako please? Thanks.
[12:33:24] Data-Platform-SRE, sre-alert-triage: Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (BTullis) a:BTullis
[12:38:43] Data-Platform-SRE: Check home/HDFS leftovers of jbond - https://phabricator.wikimedia.org/T352511 (lbowmaker)
[12:43:19] Data-Platform-SRE, sre-alert-triage: Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (BTullis) This is confirmed. ` Enclosure Device ID: 32 Slot Number: 9 Drive's position: DiskGroup: 12, Span: 0, Arm: 0 Enclosure position: 1 Device Id: 9...
[12:51:50] Data-Platform-SRE, sre-alert-triage: Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (BTullis)
[12:53:03] Data-Platform-SRE, sre-alert-triage: Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (BTullis) I've created {T352168} and tagged it with #ops-eqiad so I'll move this ticket to waiting.
[12:57:02] Data-Platform-SRE, sre-alert-triage: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (BTullis) a:BTullis
[12:57:31] (PS1) Mforns: Add Commons Impact Metrics code drafts for later [analytics/refinery] - https://gerrit.wikimedia.org/r/979341 (https://phabricator.wikimedia.org/T351836)
[13:02:26] Data-Engineering (Sprint 5): [Data Platform] Document proposal for data-product configuration store - https://phabricator.wikimedia.org/T349746 (lbowmaker) Open→Resolved
[13:02:30] Data-Engineering (Sprint 5): [Data Quality] Visualize platform and system alerts on a dashboard - https://phabricator.wikimedia.org/T349765 (lbowmaker) Open→Resolved
[13:02:42] Data-Engineering (Sprint 5), Data-Platform, Movement-Insights: Iceberg unique devices table reporting incorrect numbers for 2023-10-01 - https://phabricator.wikimedia.org/T350920 (lbowmaker) Open→Resolved
[13:03:02] Data-Engineering (Sprint 5), serviceops, Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (lbowmaker) Open→Resolved
[13:03:24] Data-Engineering (Sprint 5), Event-Platform, Patch-For-Review: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 (lbowmaker) Open→Resolved
[13:03:37] Data-Engineering (Sprint 5): [Maintenance] Understand and inventory change-propagation use cases, deployments, and custom business logic - https://phabricator.wikimedia.org/T350156 (lbowmaker) Open→Resolved
[13:04:35] Data-Engineering, Event-Platform: [Event Platform] mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (lbowmaker)
[13:05:44] Data-Engineering (Sprint 6), Event-Platform: [Event Platform] mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (lbowmaker)
[13:11:29] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 5.567% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:25:42] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (pfischer)
[13:32:30] Data-Platform-SRE, sre-alert-triage: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (BTullis) I'm having a look at this now. I believe that it is related to the dumps architecture and specifically with wikid...
[13:32:49] Data-Platform-SRE, sre-alert-triage: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (BTullis) Open→Resolved
[13:36:21] Hi btullis - I'm interacting with Airflow-analytics and getting errors 500 when trying to update a task-notes - any idea?
[13:37:38] joal: Oh, no I have no idea. Do you want to look together?
[13:37:42] sure!
[13:38:07] Cave!
[13:47:19] Data-Engineering (Sprint 6), Data-Platform-SRE, Observability-Metrics, Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (lbowmaker)
[13:47:52] Data-Engineering (Sprint 6), Event-Platform: [Event Platform] mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (lbowmaker)
[13:48:07] Data-Engineering (Sprint 6), Patch-For-Review: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (lbowmaker)
[13:48:17] Data-Engineering (Sprint 6), Event-Platform, Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (lbowmaker)
[13:48:20] Analytics, Data-Engineering (Sprint 6), Event-Platform, Patch-For-Review, User-notice: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (lbowmaker)
[13:48:27] Data-Engineering (Sprint 6): [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (lbowmaker)
[13:48:29] Data-Engineering (Sprint 6): [Data Quality] [Needs Grooming] Define concept for Alerting in coordination with SRE - https://phabricator.wikimedia.org/T351093 (lbowmaker)
[13:48:32] Data-Engineering (Sprint 6): [Data Quality] [Needs Grooming] Collect requirements to define prioritized data pipeline and data metrics - https://phabricator.wikimedia.org/T350409 (lbowmaker)
[13:48:38] Data-Engineering (Sprint 6), Data Pipelines, Discovery-Search, Java-Scala-Standardization, Patch-For-Review: We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (lbowmaker)
[13:49:05] Data-Engineering (Sprint 6), Section-Level-Image-Suggestions, Patch-For-Review, Structured-Data-Backlog (Current Work): [S] Coalesce section alignment image suggestions output - https://phabricator.wikimedia.org/T347558 (lbowmaker)
[13:49:07] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (lbowmaker)
[13:49:09] Data-Engineering (Sprint 6), Data Products, Structured-Data-Backlog: [Maintenance] Set up deletion jobs for Structured Data's data pipelines - https://phabricator.wikimedia.org/T347561 (lbowmaker)
[13:49:12] Data-Engineering (Sprint 6), Data-Platform-SRE, Event-Platform: [Event Platform] Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (lbowmaker)
[13:49:14] Data-Platform-SRE: Downloading from Archiva.wikimedia.org is slower than Maven Central - https://phabricator.wikimedia.org/T273086 (hashar)
[13:49:16] Data-Engineering (Sprint 6), Data-Platform-SRE, SRE Observability: [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (lbowmaker)
[13:49:18] Data-Engineering, Data-Platform-SRE: [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis)
[13:49:32] Data-Engineering (Sprint 6), Event-Platform: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (lbowmaker)
[13:49:34] Data-Engineering, Data-Platform-SRE: [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis) p:Triage→High
[13:49:36] Data-Engineering: [Data Quality] Calculate and log post processing record counts metrics for unique devices - https://phabricator.wikimedia.org/T349455 (lbowmaker)
[13:49:45] Data-Engineering: [Data Quality] Define persona and user stories for system and data monitoring and alerting - https://phabricator.wikimedia.org/T349454 (lbowmaker)
[13:49:54] Data-Engineering: [Data Quality] [Needs Grooming] Calculate and log comprehensive post processing metrics for webrequests - https://phabricator.wikimedia.org/T349456 (lbowmaker)
[13:49:59] Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (lbowmaker)
[13:50:14] Data-Engineering (Sprint 6), Machine-Learning-Team, Wikimedia Enterprise, Epic, Event-Platform: [Event Platform] Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (lbowmaker)
[13:51:20] Data-Engineering, Data-Platform-SRE: [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis)
[14:05:14] joal: I will update T352534 with our findings. i.e. the error seems to be related to the fact that we are not logged into Airflow when trying to add notes. I'll discuss options around fixing it.
[14:05:14] T352534: [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534
[14:19:00] (PS1) Gmodena: refinery-job: add WebrequestMetrics. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763)
[14:24:30] Data-Engineering, Data-Platform-SRE: [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis) @JAllemandou and I did some investigation of this issue. ##Cause The key part of the log appears to be the this part. ` [SQL: INSERT INTO...
[14:38:03] Data-Platform-SRE, Discovery-Search (Current work): Load Wikidata split graphs into test servers - https://phabricator.wikimedia.org/T350465 (Gehel) >>! In T350465#9372880, @bking wrote: > @Gehel Is this a duplicate of T347504? No, it's not. T347504 is about loading the full data set, T350465 is about l...
[16:00:43] Data-Platform-SRE, Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/merge_requests/52 Add the data-engineering/superset project to...
[16:25:51] Data-Engineering, Data-Platform-SRE: [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis) In order to ascertain if we can work around this easily in the short term, I have created a local user on the analytics_test airflow inst...
[16:28:19] Data-Engineering, Data-Platform-SRE: [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis) Success! So this is a valid workaround, if we want to use it. {F41551274,width=40%}
[16:29:07] (Abandoned) Ladsgroup: Pass spark_job_jar as an argument in ArticlePlaceholder oozie job [analytics/refinery] - https://gerrit.wikimedia.org/r/572713 (https://phabricator.wikimedia.org/T236895) (owner: Ladsgroup)
[16:36:37] Data-Platform-SRE, Data Pipelines: Can't save dagrun notes in airflow after 2.7.3 migration - https://phabricator.wikimedia.org/T352483 (EBernhardson)
[16:36:41] Data-Engineering, Data-Platform-SRE: [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (EBernhardson)
[16:43:09] Data-Engineering, Data-Platform-SRE: [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis) Ah, thanks @EBernhardson and apologies for the inconvenience. It's interesting to find out that you also use this feature. As mentioned a...
[16:50:42] Data-Engineering, Data-Platform-SRE: [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (EBernhardson) We do use the feature, although tbh I don't know how useful it is. Sometimes we skip runs that failed because canary events didn't...
[16:55:57] Data-Engineering, Data-Platform-SRE: [airflow] Inserting task notes is not working since upgrade to version 2.7.3 - https://phabricator.wikimedia.org/T352534 (BTullis) a:BTullis
[16:59:24] Data-Engineering, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ceph2001.codfw.wmnet with OS bullseye
[17:02:11] Data-Platform-SRE: Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234 (BTullis) a:BTullis
[17:08:18] Data-Engineering, Discovery-Search, IPv6: Some Search clusters have inconsistent AAAA DNS records for the primary IPv6 of the hosts - https://phabricator.wikimedia.org/T312555 (Volans)
[17:16:13] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 5.494% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:36:16] Data-Platform-SRE: Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234 (BTullis) >>! In T338234#9254848, @XenoRyet wrote: > @BTullis Sorry for not getting back to you on this sooner. Yes, dropping a tarball of this stuff in my home directory sounds like a good idea. @XenoRyet...
[17:47:25] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ceph2001.codfw.wmnet with OS bullseye executed with errors: - ceph200...
[18:22:58] Data-Engineering, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ceph2001.codfw.wmnet with OS bullseye
[18:34:07] (PS1) Xcollazo: Fix recursion for Maps with Structs on SanitizeTransformation [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979406 (https://phabricator.wikimedia.org/T349121)
[19:00:16] Data-Platform-SRE: Check home/HDFS leftovers of aranyap - https://phabricator.wikimedia.org/T340945 (Jcross) Hi, sorry for the delay on this. Aranya does not require production shell access but we'd like to keep her in the analytics-priveatedata-users group if it's not too much trouble. Thank you! @BTullis
[19:45:32] (PS1) Clare Ming: Add readme to product_metrics schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/979407
[20:37:31] (PS1) Milimetric: Update usage example [analytics/refinery] - https://gerrit.wikimedia.org/r/979414
[20:37:42] (CR) Milimetric: [V: +2 C: +2] Update usage example [analytics/refinery] - https://gerrit.wikimedia.org/r/979414 (owner: Milimetric)
[21:16:14] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 5.324% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:53:26] Data-Engineering, Data Products: Airflow unittests failing with TypeError: Pool.create_or_update_pool() - https://phabricator.wikimedia.org/T352577 (xcollazo)
[22:43:54] Data-Engineering, CommonsMetadata, DiscussionTools, MediaWiki-extensions-Scribunto, and 7 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (matmarex)