[02:12:09] 10Data-Engineering, 10Event-Platform: [Event Platform] eventutilities-python should convert pyflink Instants to python DateTimes - https://phabricator.wikimedia.org/T349640 (10Ottomata) WIP https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/-/tree/pyflink_python_conversion?ref_type=heads [03:17:30] (DiskSpace) firing: Disk space an-test-ui1001:9100:/ 3.08% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:17:30] (DiskSpace) firing: Disk space an-test-ui1001:9100:/ 3.104% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:16:26] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Trust-and-Safety, 10Russian-Sites: Indicate that some country data are unavailable on Wikistats - https://phabricator.wikimedia.org/T339318 (10stjn) 05Open→03Resolved a:03Milimetric @Milimetric basically has done this in {T333716}, thanks! [08:17:04] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Russian-Sites: "Active editors by country" doesn't display numbers for Belarus, Kazakhstan, Russia - https://phabricator.wikimedia.org/T333716 (10stjn) @Milimetric: this is great, but I think it should be also indicated under the map that some countries do... [08:28:02] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1), 10Patch-For-Review: Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10CodeReviewBot) brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/spark/-/mer... 
[08:43:28] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10brouberol) I deployed the chart with a Deployment running the `repos/data-engineering/spark/spark3.4-history` image (... [08:44:05] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10brouberol) [08:44:07] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1), 10Patch-For-Review: Build an image for spark-history with user uid=909 - https://phabricator.wikimedia.org/T352850 (10brouberol) 05Open→03Resolved [08:49:08] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10brouberol) I've redeployed the app with ` spark.history.kerberos.principal: spark-history/spark-history.svc.eqiad... [08:51:55] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10brouberol) Reading https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Superusers.html, I think w... 
[09:50:23] 10Data-Platform-SRE, 10Wikidata, 10collaboration-services: Review alerts for miscweb-hosted domains commons-query.wikimedia.org and query.wikidata.org - https://phabricator.wikimedia.org/T352921 (10Gehel) [09:51:08] 10Data-Platform-SRE: Consider implementing Envoy for Elastic hosts - https://phabricator.wikimedia.org/T352872 (10Gehel) p:05Triage→03Medium [09:51:49] 10Data-Platform-SRE: Troubleshoot recurring systemd unit failures for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (10Gehel) p:05Triage→03High [09:52:11] 10Data-Platform-SRE, 10Wikidata, 10collaboration-services: Review alerts for miscweb-hosted domains commons-query.wikimedia.org and query.wikidata.org - https://phabricator.wikimedia.org/T352921 (10Gehel) p:05Triage→03Medium [10:10:00] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10BTullis) >>! In T352838#9389431, @brouberol wrote: > Reading https://hadoop.apache.org/docs/stable/hadoop-project-dis... [10:21:42] (SystemdUnitFailed) firing: refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:22:49] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10BTullis) Or the alternative, I suppose, is that we do add a POSIX user for `spark-history` and make it part of... [10:23:58] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10brouberol) Right, I came to the same conclusion as well. 
I'd still like to understand whether proxy user would work,... [10:26:42] (SystemdUnitFailed) firing: (2) refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:27:33] 10Data-Platform-SRE (23/24 Q2 Milestone 1), 10Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/39 Update the version of wmfdata-python used in... [10:27:39] 10Data-Platform-SRE (23/24 Q2 Milestone 1), 10Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/39 Update the vers... 
[10:38:21] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Data Products (Data Products Sprint 05): Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (10phuedx) [10:38:50] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Data Products (Data Products Sprint 05): Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (10phuedx) [10:48:30] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (10phuedx) [10:50:24] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (10phuedx) [10:50:42] 10Data-Engineering, 10Data-Platform, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (10phuedx) [10:52:34] 10Data-Engineering, 10Data-Platform, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (10phuedx) [10:52:48] 10Data-Engineering, 10Data-Platform, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (10phuedx) [10:54:25] 10Data-Engineering, 10Data-Platform, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (10phuedx) [10:55:09] 10Data-Engineering-Radar, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Data Products (Data Products Sprint 05), 10Technical-Debt: Non-deterministic unit test "streamInSample() - 
session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx) [10:57:31] 10Data-Engineering, 10Data-Platform, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (10phuedx) [11:12:13] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (10Gehel) A few notes in reply to the comment above: * This ticket is specifically for 3 experimental endpoints that are temporary... [11:12:29] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10brouberol) > But yes, I'm not sure either how we get the spark-history process to use the doAs() method outlined here... [11:17:30] (DiskSpace) firing: Disk space an-test-ui1001:9100:/ 3.101% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [11:19:52] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10BTullis) Yes, I think that your testing in T352838#9389418 actually reveals a simpler solution. It looks like the na... [11:22:15] (DiskSpace) resolved: Disk space an-test-ui1001:9100:/ 3.101% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [11:28:09] joal: Would you have a moment to help me understand the refinery deployment better please? 
I've deployed this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/980445 but the jars are not present on the servers, so refine jobs are failing on the test cluster. [11:41:41] Starting build #91 for job analytics-refinery-update-jars-docker [11:41:52] Project analytics-refinery-update-jars-docker build #91: 04FAILURE in 10 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/91/ [11:43:35] Starting build #92 for job analytics-refinery-update-jars-docker [11:43:58] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.2.27 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/980837 [11:43:59] Yippee, build fixed! [11:43:59] Project analytics-refinery-update-jars-docker build #92: 09FIXED in 23 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/92/ [11:45:01] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add refinery-source jars for v0.2.27 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/980837 (owner: 10Maven-release-user) [11:48:09] !log deploying refinery to hadoop-test only [11:48:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:00:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [12:10:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. 
- https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [12:21:42] (SystemdUnitFailed) firing: (2) refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:26:42] (SystemdUnitFailed) resolved: (2) refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:31:33] !log deploying conda-analytics v0.0.26 to hadoop-test [12:31:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:49:58] 10Data-Platform-SRE (23/24 Q2 Milestone 1), 10Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (10BTullis) Ugh! conda-analytics version 0.0.26 is failing to run `conda-analytics-clone mycoolenv` in the test environment. `lines=10 C... 
[13:51:07] 10Data-Engineering, 10Data-Platform-SRE, 10Epic: Alluxio for Improved Superset Query Performance - https://phabricator.wikimedia.org/T288252 (10Gehel) p:05High→03Medium [13:52:09] 10Data-Platform-SRE, 10Product-Analytics: Fix presto kerberos support for system users - https://phabricator.wikimedia.org/T292072 (10Gehel) p:05High→03Medium [13:52:53] 10Data-Platform-SRE: Add Authentication/Encryption to Kafka Jumbo's clients - https://phabricator.wikimedia.org/T250146 (10Gehel) p:05High→03Medium [14:32:26] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10brouberol) We went with @BTullis 's suggestion, which worked flawlessly. [14:34:25] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10brouberol) [14:48:01] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Russian-Sites: "Active editors by country" doesn't display numbers for Belarus, Kazakhstan, Russia - https://phabricator.wikimedia.org/T333716 (10Milimetric) >>! In T333716#9389355, @stjn wrote: > @Milimetric: this is great, but I think it should be also in... [14:49:34] 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) [14:49:50] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Trust-and-Safety, 10Russian-Sites: Indicate that some country data are unavailable on Wikistats - https://phabricator.wikimedia.org/T339318 (10Milimetric) I'm really sorry this didn't get through the pipeline sooner, someone only told me about the issue l... 
[14:49:54] 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search, 10Epic, 10Event-Platform: [Epic] Set up multi DC Kafka stretch cluster - https://phabricator.wikimedia.org/T340492 (10Ottomata) 05Open→03Declined After a discussion with @Gehel and @dcausse, there isn't a lot of interest in using Kafka stre... [14:50:42] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye [14:50:51] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye executed with errors: - ce... [14:53:08] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye [14:56:09] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Russian-Sites: "Active editors by country" doesn't display numbers for Belarus, Kazakhstan, Russia - https://phabricator.wikimedia.org/T333716 (10stjn) You can split it in two translatable messages, it doesn’t have to be added to the one that was there. But... 
[14:57:56] 10Data-Engineering (Sprint 6): [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (10gmodena) a:03gmodena [15:00:11] 10Data-Engineering (Sprint 6): [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (10Ahoelzl) For the alerting backend, let's discuss with the observability team whether the Alert Manager infrastructure can be used for data metrics alerting and platform alerting in general going forward. [15:05:01] 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search, 10Epic, 10Event-Platform: [Epic] Set up multi DC Kafka stretch cluster - https://phabricator.wikimedia.org/T340492 (10BTullis) I think that we should: * repurpose kafka-stretch100[1-2] to add them to the analytics **Hadoop** cluster in eqiad (... [15:06:08] 10Data-Platform-SRE (23/24 Q2 Milestone 1), 10Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/40 Downgrade some... [15:08:22] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10brouberol) ` brouberol@an-master1001:~$ sudo kerberos-run-command hdfs hdfs dfs -mkdir /var/log/spark brouberol@an-ma... 
[15:08:32] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10brouberol) [15:08:45] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10brouberol) 05Open→03Resolved [15:36:20] (03PS1) 10Milimetric: Move linktarget table to the private folder [analytics/refinery] - 10https://gerrit.wikimedia.org/r/981351 (https://phabricator.wikimedia.org/T352879) [15:37:37] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Move linktarget table to the private folder [analytics/refinery] - 10https://gerrit.wikimedia.org/r/981351 (https://phabricator.wikimedia.org/T352879) (owner: 10Milimetric) [15:37:45] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Sanitize the linktarget sqoop [analytics/refinery] - 10https://gerrit.wikimedia.org/r/980937 (https://phabricator.wikimedia.org/T352879) (owner: 10Milimetric) [15:38:14] 10Data-Engineering, 10Movement-Insights, 10Traffic, 10Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (10Ottomata) > This is the varnish code (VCL) that does analytics-y things to create and update the X-analytics header. Can we do this i... [15:41:00] 10Data-Engineering, 10Data-Platform-SRE, 10Patch-For-Review: Update the sqoop configuration for mediawiki to obtain linktarget from the production replicas, instead of wikireplicas - https://phabricator.wikimedia.org/T352879 (10CodeReviewBot) milimetric merged https://gitlab.wikimedia.org/repos/data-engineer... 
[15:43:27] 10Data-Engineering, 10Movement-Insights, 10Traffic, 10Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (10Ottomata) Alternative using X-Analytics VCL: https://gerrit.wikimedia.org/r/c/operations/puppet/+/981352 [15:45:17] !log deploying refinery for the sqoop fix [15:45:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:52:17] 10Data-Engineering, 10Data-Platform-SRE (23/24 Q2 Milestone 1): Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895 (10Gehel) This can be unblocked when we have a newer version of Superset that allows better browsing of data. [15:59:26] 10Data-Engineering (Sprint 6), 10Data Pipelines, 10Discovery-Search, 10Java-Scala-Standardization, 10Patch-For-Review: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (10Ahoelzl) [16:11:46] 10Data-Engineering, 10Tech-Docs-Team, 10Goal: Redesign Data Platform docs on Wikitech - https://phabricator.wikimedia.org/T350911 (10TBurmeister) Status update: * Finished and shared [[ https://docs.google.com/document/d/1oNHvfmdtWOxYhm4yOnT19REIrfex-0uIrjQIJZ8RWu4/edit?usp=sharing| draft of content strategy... 
[16:12:04] !log finished deploying and syncing refinery [16:12:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:20:42] (SystemdUnitFailed) firing: refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:22:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-sqoop-whole-mediawiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:05] 10Data-Engineering, 10Tech-Docs-Team, 10Goal: Redesign Data Platform docs on Wikitech - https://phabricator.wikimedia.org/T350911 (10TBurmeister) [16:28:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:01] 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 (10brouberol) The next attempt to enable the LVS service for the dse k8s ingress gateway should work, as ports are now open: ` brouberol@lvs1019:~$ for i in $(seq 1 8); do echo... [16:29:30] 10Data-Engineering, 10Diffusion-Repository-Administrators, 10Observability-Metrics, 10Projects-Cleanup, 10Wikimedia-GitHub: Consider archiving Gerrit repository "operations/software/dropwizard-metrics" (20150219) - https://phabricator.wikimedia.org/T352103 (10fgiunchedi) I couldn't find any dropwizard-me... 
[16:29:54] 10Data-Engineering, 10Diffusion-Repository-Administrators, 10Projects-Cleanup, 10Wikimedia-GitHub: Consider archiving Gerrit repository "operations/software/dropwizard-metrics" (20150219) - https://phabricator.wikimedia.org/T352103 (10fgiunchedi) [16:35:42] (SystemdUnitFailed) resolved: refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:38:20] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol) [16:38:26] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (23/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (10brouberol) 05Resolved→03Open I'm going to re-open this, as @elukey mentioned that he'd much rather we create a `s... [16:39:11] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye executed with errors: - ce... [16:39:52] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye [16:44:55] hi folks. running a hive query and I see "NULL" in the output for wmf.webrequest for pageview_info['page_title']. any idea where I can find more information about what it could mean, and why, in this context? 
[16:57:53] 10Data-Platform-SRE: Improve observability for non-k8s Envoy proxies (wdqs) - https://phabricator.wikimedia.org/T353003 (10bking) [17:00:52] 10Data-Platform-SRE: Improve observability for non-k8s Envoy proxies (wdqs) - https://phabricator.wikimedia.org/T353003 (10bking) [[ https://phabricator.wikimedia.org/P54276 | Here's ]] the envoy.yaml file I modified to include access logging. I also had to create the `/var/log/envoy/access_log` file with `env... [17:01:48] 10Data-Platform-SRE (23/24 Q2 Milestone 1), 10Data-Platform: Deploy an echoserver service on dse-k8s-eqiad behind ingress - https://phabricator.wikimedia.org/T353004 (10brouberol) [17:02:04] (03CR) 10Joal: "I think this strategy is a bad idea." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/980937 (https://phabricator.wikimedia.org/T352879) (owner: 10Milimetric) [17:10:51] btullis: Heya - sorry for not answering, I've been busy AFK all day today :( [17:11:05] btullis: have you managed to find help? [17:16:05] 10Analytics, 10Data-Engineering (Sprint 6), 10Event-Platform, 10Patch-For-Review, 10User-notice: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) > Is there any place or anyhow we can test this before going live? @REsquito-WMF hm, no... [17:16:31] 10Analytics, 10Data-Engineering (Sprint 6), 10Event-Platform, 10Patch-For-Review, 10User-notice: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) I plan to deploy {T351247} on Monday Dec 11, and then if all goes well, will enable the... [17:19:27] sukhe: iirc, page_title will only be populated for pageviews? is_pageview must be true. https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Current_Schema [17:20:13] ah got it, so it must meet that definition otherwise it's null [17:20:24] definition of what constitutes a "pageview" [17:20:29] ya [17:20:44] thanks ottomata! 
<3 [17:23:13] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-codfw: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye executed with errors: - ce... [17:27:22] 10Data-Engineering (Sprint 6): [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (10Ottomata) FWIW, Alert Manager won't work well for historical dataset based alerts. The best we can do in Alert Manager is 'there is a problem in the last