[02:04:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:19:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:18:29] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[03:18:29] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[05:00:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:45] (SystemdUnitFailed) resolved: refinery-sqoop-wikifunctions-production.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:30] Data-Engineering, Data-Platform-SRE, Infrastructure-Foundations, Puppet-Infrastructure: GeoIP2-Anonymous-IP Subscription expired - https://phabricator.wikimedia.org/T342878 (odimitrijevic) Thank you @jbond!
[05:32:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:34:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:23:27] good morning folks :)
[06:23:37] going to upgrade kafka jumbo with the new threads settings
[06:23:46] the cookbook will likely take some hours to complete
[06:24:56] !log roll restart kafka jumbo brokers to apply new threads settings
[06:24:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:09:50] https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&var-datasource=thanos&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&var-disk_device=All&viewPanel=75
[07:09:54] the change is working nicelu
[07:09:56] *nicely
[07:18:29] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[07:18:29] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[07:40:02] Data-Engineering: Check home/HDFS leftovers of ntsako - https://phabricator.wikimedia.org/T343189 (MoritzMuehlenhoff)
[08:29:33] elukey: Thanks ever so much.
[08:29:45] <3
[08:30:05] I am truly amazed how kafka manages this amount of traffic
[08:30:15] before today only one thread was enough :D
[08:31:27] Maybe it'll be even better when we get to Kafka 3 :-)
[08:32:09] I wish!
[08:34:32] I'm going to press on with some more hadoop worker upgrades today, starting with analytics10[76-77]
[08:38:38] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1076.eqiad.wmnet with OS bullseye
[08:40:13] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1077.eqiad.wmnet with OS bullseye
[08:55:34] Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (BTullis) Thanks @xcollazo - I can definitely see the use case here. I remember reading something about this in {T300937} and it doesn't quite add up. @JAllemandou...
[09:03:23] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1077.eqiad.wmnet with OS bullseye executed with errors: - analytics1077 (**FAIL**) - Downtimed on Icinga...
[09:33:08] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1076.eqiad.wmnet with OS bullseye executed with errors: - analytics1076 (**FAIL**) - Downtimed on Icinga...
[09:33:31] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1076.eqiad.wmnet with OS bullseye
[09:34:27] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis) Unfortunately, both analytics1076 and analytics1077 are repeatedly stopping at a partman prompt, saying that the reuse-parts recipe is incorrect. {F37158323} Continuing to investigate.
[09:36:52] (PS10) Peter Fischer: Provide internal schema for CirrusSearch update-pipeline updates. [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202)
[09:37:41] (CR) Peter Fischer: Provide internal schema for CirrusSearch update-pipeline updates. (3 comments) [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202) (owner: Peter Fischer)
[10:05:22] Data-Engineering, Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (Ladsgroup) Open→Resolved a:Ladsgroup→None
[10:12:40] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (jbond) @Stevemunene >>! In T305874#9054412, @Stevemunene wrote: > - name: AUTH_OIDC_CLIENT_ID > value: "our-client-id" This is `datahub` > - na...
[10:18:33] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1077.eqiad.wmnet with OS bullseye executed with errors: - analytics1077 (**FAIL**) - Removed from Puppet...
[10:28:27] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1076.eqiad.wmnet with OS bullseye executed with errors: - analytics1076 (**FAIL**) - Removed from Puppet...
[10:32:49] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis) @jbond - It's possible that this behaviour is caused by a recent change to `reuse_parts.sh` in https://gerrit.wikimedia.org/r/c/operations/puppet/+/938898 under {T95064} I've run `install_console anal...
[10:55:12] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (jbond) > It's possible that this behaviour is caused by a recent change to reuse_parts.sh yes that's possible , that script is quite complicated. im happy for you to revert as it was mostly to fix some stylis...
[11:02:08] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (jbond) In fact looking at the changes im not sure if we change this part of the script however the following seems like something that could be an issue > ls: /var/lib/partman/devices/*: No such file or directory
[11:03:47] btullis: i commented on the task however it might be worth trying a revert just to rule that out
[11:05:13] (DiskSpace) firing: Disk space an-tool1009:9100:/ 4.545% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-tool1009 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:08:09] jbond: Thanks. I'll have a look at it. I'm just trying to find out if there's an ephemeral copy of the reuse-parts.sh script on the install server used during the installation.
[11:12:32] Oh I see, it's on the local apt server and there's no other copy.
[11:18:29] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:18:29] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[11:20:15] Data-Platform-SRE, Patch-For-Review: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis) I've reverted the commit just for diagnostic purposes. I'll try another pair of reimages now.
[11:21:54] Data-Platform-SRE, Patch-For-Review: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1076.eqiad.wmnet with OS bullseye
[11:22:33] Data-Platform-SRE, Patch-For-Review: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1077.eqiad.wmnet with OS bullseye
[11:28:35] Data-Platform-SRE, Patch-For-Review: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis) It's gone past the partman error on both servers, so that's fairly certain. I'll allow these two reimages to complete, but I'll try looking again at the `reuse-parts.sh` script af...
[11:37:15] !log ran apt clean on an-tool1009 to free up disk space
[11:37:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:39:14] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (jbond) >>! In T332570#9058674, @BTullis wrote: > It's gone past the partman error on both servers, so that's fairly certain. I'll allow these two reimages to complete, but I'll try looking again at the `reuse-p...
[11:40:13] (DiskSpace) resolved: Disk space an-tool1009:9100:/ 4.543% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-tool1009 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:47:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:48:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:54] (PS1) AikoChou: Update mediawiki/page/prediction_classification_change to 1.1.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/944183 (https://phabricator.wikimedia.org/T343002)
[12:01:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:03:37] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1077.eqiad.wmnet with OS bullseye completed: - analytics1077 (**PASS**) - Removed from Puppet and Puppet...
[12:06:47] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1076.eqiad.wmnet with OS bullseye completed: - analytics1076 (**PASS**) - Removed from Puppet and Puppet...
[12:58:53] Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (xcollazo) >So did we change the log retention from 40 days to 7 in between these times Ah, what I meant to say is that about a month passed between job start and w...
[13:00:36] Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (xcollazo) > 40 days' worth of YARN logs and that these are about 3.1 TB of uncompressed text. Hmm, that is a lot. I wonder if there is any setting that would compre...
[13:08:47] Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (xcollazo) >>! In T342923#9059010, @xcollazo wrote: >> 40 days' worth of YARN logs and that these are about 3.1 TB of uncompressed text. > Hmm, that is a lot. I wond...
[13:16:33] (CR) Elukey: [C: +1] Update mediawiki/page/prediction_classification_change to 1.1.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/944183 (https://phabricator.wikimedia.org/T343002) (owner: AikoChou)
[13:18:04] hello!
[13:24:10] Data-Engineering, Data Engineering and Event Platform Team, Machine-Learning-Team, Event-Platform: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (achou) > @achou, reading the Model Card Data section it looks...
[13:35:22] hola mforns
[13:35:25] <3 :)
[13:40:03] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) Thanks @jbond Adding a datahub_staging oidc entry with `service_id: 'https://datahub-frontend\.k8s-staging\.discovery\.wmnet(/.*)?'` which we access...
[14:21:12] heya elukey :]
[14:42:42] (SystemdUnitFailed) firing: (4) ferm.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:51:21] Data-Engineering, Data-Platform-SRE: Configure airflow to send metrics to prometheus - https://phabricator.wikimedia.org/T343232 (BTullis)
[14:52:25] Data-Engineering, Data-Platform-SRE: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (BTullis)
[15:00:28] Data-Engineering, Data-Platform-SRE, Observability-Alerting: Explore the use of Airflow notifiers for more flexible DAG failure handling - https://phabricator.wikimedia.org/T343234 (BTullis)
[15:00:51] Data-Engineering, Data-Platform-SRE, Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (BTullis)
[15:17:42] (SystemdUnitFailed) resolved: ferm.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:18:29] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[15:18:29] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[15:20:26] Data-Engineering, Data-Platform-SRE: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (BTullis)
[15:21:50] Data-Engineering, Data-Platform-SRE: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (BTullis) p:Triage→High This would be really useful in order to help us with {T305874} as well.
[15:24:02] Data-Engineering, Data-Platform-SRE: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (BTullis)
[15:29:40] Data-Platform-SRE, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (BTullis)
[15:29:44] Data-Platform-SRE, Patch-For-Review: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (BTullis) Open→Resolved
[15:31:53] Data-Platform-SRE, Data Engineering and Event Platform Team, GitLab (Project Migration), Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (BTullis) Open→Resolved
[15:32:05] Data-Platform-SRE, Data Pipelines, Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (BTullis) Open→Resolved
[15:32:34] Data-Platform-SRE, API Platform, Anti-Harassment, Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (BTullis)
[15:40:07] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) Moving to blocked whilst we carry out {T343236}
[16:02:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:04:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:15:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:17:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:47:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:49:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:01:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:02:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:37:34] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (BTullis)
[17:39:59] Data-Engineering, Data Engineering and Event Platform Team, Data-Catalog, Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (odimitrijevic) @Htriedman we are picking this work up again. Is the POC that you did available in a repository on gitlab?
[17:48:52] Data-Engineering, Data Engineering and Event Platform Team, Data-Catalog, Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (Htriedman) Hi @odimitrijevic! Here's the gitlab repo I worked on during the documentathon :) https://gitlab.wikimedia.org/h...
[18:37:49] Data-Platform-SRE, API Platform, Anti-Harassment, Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (apaskulin)
[18:38:19] Data-Platform-SRE, API Platform, Anti-Harassment, Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (apaskulin)
[18:39:51] Data-Platform-SRE, API Platform, Anti-Harassment, Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (apaskulin)
[19:18:29] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:18:29] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[20:17:50] Data-Engineering: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (xcollazo) Forwarding learnings from T343238#9060658 to this ticket: SIGTERMs on the Airflow instance only kill the Airflow process, with no (current) mechanism to forward the kill t...
[20:19:11] Data-Engineering: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (xcollazo) Perhaps we should catch the SIGTERM and do a best effort to forward it to Skein/Spark?
[20:53:50] Data-Engineering: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (xcollazo) The code that does the conversion from dumps to the table, MediawikiXMLDumpsConverter, [[ https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/s...
[23:18:29] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:18:29] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
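Editor's note on the 20:19:11 suggestion (T342911) to catch SIGTERM and make a best effort to forward it: a minimal Python sketch of the general pattern, not the Airflow or Skein code. It assumes the job is a local child process started with `subprocess`; the function name `run_with_sigterm_forwarding` is hypothetical. A Skein/Spark application on YARN would presumably also need the remote application to be killed, which is not shown here.

```python
import signal
import subprocess
import sys


def run_with_sigterm_forwarding(cmd):
    """Run cmd as a child process and relay SIGTERM to it (best effort).

    Hypothetical helper for illustration only; it covers a local child
    process, not a remote YARN application.
    """
    child = subprocess.Popen(cmd)

    def forward(signum, frame):
        # Relay the signal only while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    # signal.signal() must be called from the main thread.
    previous = signal.signal(signal.SIGTERM, forward)
    try:
        # Returns the child's exit code; on POSIX a negative value -N
        # means the child was terminated by signal N.
        return child.wait()
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGTERM, previous)


if __name__ == "__main__":
    sys.exit(run_with_sigterm_forwarding(sys.argv[1:]))
```

With a wrapper of this shape, a SIGTERM sent to the parent is passed on instead of leaving the child orphaned; whether the child then shuts down cleanly depends on its own signal handling.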