[02:04:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:19:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:18:29] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[03:18:29] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[05:00:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:45] <jinxer-wm>	 (SystemdUnitFailed) resolved: refinery-sqoop-wikifunctions-production.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:30] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Infrastructure-Foundations, Puppet-Infrastructure: GeoIP2-Anonymous-IP Subscription expired - https://phabricator.wikimedia.org/T342878 (odimitrijevic) Thank you @jbond!
[05:32:42] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:34:32] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:02] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:47:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:23:27] <elukey>	 good morning folks :)
[06:23:37] <elukey>	 going to upgrade kafka jumbo with the new threads settings
[06:23:46] <elukey>	 the cookbook will likely take some hours to complete
[06:24:56] <elukey>	 !log roll restart kafka jumbo brokers to apply new threads settings
[06:24:57] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:09:50] <elukey>	 https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&var-datasource=thanos&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&var-disk_device=All&viewPanel=75
[07:09:54] <elukey>	 the change is working nicely
[07:18:29] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[07:18:29] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[07:40:02] <wikibugs>	 Data-Engineering: Check home/HDFS leftovers of ntsako - https://phabricator.wikimedia.org/T343189 (MoritzMuehlenhoff)
[08:29:33] <btullis>	 elukey: Thanks ever so much.
[08:29:45] <elukey>	 <3
[08:30:05] <elukey>	 I am truly amazed how kafka manages this amount of traffic 
[08:30:15] <elukey>	 before today only one thread was enough :D
[08:31:27] <btullis>	 Maybe it'll be even better when we get to Kafka 3 :-)
[08:32:09] <elukey>	 I wish!
[08:34:32] <btullis>	 I'm going to press on with some more hadoop worker upgrades today, starting with analytics10[76-77]
[08:38:38] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1076.eqiad.wmnet with OS bullseye
[08:40:13] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1077.eqiad.wmnet with OS bullseye
[08:55:34] <wikibugs>	 Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (BTullis) Thanks @xcollazo - I can definitely see the use case here. I remember reading something about this in {T300937} and it doesn't quite add up.  @JAllemandou...
[09:03:23] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1077.eqiad.wmnet with OS bullseye executed with errors: - analytics1077 (**FAIL**)   - Downtimed on Icinga...
[09:33:08] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1076.eqiad.wmnet with OS bullseye executed with errors: - analytics1076 (**FAIL**)   - Downtimed on Icinga...
[09:33:31] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1076.eqiad.wmnet with OS bullseye
[09:34:27] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis) Unfortunately, both analytics1076 and analytics1077 are repeatedly stopping at a partman prompt, saying that the reuse-parts recipe is incorrect. {F37158323} Continuing to investigate.
[09:36:52] <wikibugs>	 (PS10) Peter Fischer: Provide internal schema for CirrusSearch update-pipeline updates. [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202)
[09:37:41] <wikibugs>	 (CR) Peter Fischer: Provide internal schema for CirrusSearch update-pipeline updates. (3 comments) [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202) (owner: Peter Fischer)
[10:05:22] <wikibugs>	 Data-Engineering, Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (Ladsgroup) Open→Resolved a:Ladsgroup→None
[10:12:40] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (jbond) @Stevemunene   >>! In T305874#9054412, @Stevemunene wrote:   >     - name: AUTH_OIDC_CLIENT_ID >       value: "our-client-id" This is `datahub`  >     - na...
[10:18:33] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1077.eqiad.wmnet with OS bullseye executed with errors: - analytics1077 (**FAIL**)   - Removed from Puppet...
[10:28:27] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1076.eqiad.wmnet with OS bullseye executed with errors: - analytics1076 (**FAIL**)   - Removed from Puppet...
[10:32:49] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis) @jbond - It's possible that this behaviour is caused by a recent change to `reuse_parts.sh` in https://gerrit.wikimedia.org/r/c/operations/puppet/+/938898 under {T95064}  I've run `install_console anal...
[10:55:12] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (jbond) >  It's possible that this behaviour is caused by a recent change to reuse_parts.sh yes that's possible , that script is quite complicated.  im happy for you to revert as it was mostly to fix some stylis...
[11:02:08] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (jbond) In fact looking at the changes im not sure if we change this part of the script however the following seems like something that could be an issue  > ls: /var/lib/partman/devices/*: No such file or directory
[11:03:47] <jbond>	 btullis: I commented on the task; however, it might be worth trying a revert just to rule that out
[11:05:13] <jinxer-wm>	 (DiskSpace) firing: Disk space an-tool1009:9100:/ 4.545% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-tool1009 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:08:09] <btullis>	 jbond: Thanks. I'll have a look at it. I'm just trying to find out if there's an ephemeral copy of the reuse-parts.sh script on the install server used during the installation.
[11:12:32] <btullis>	 Oh I see, it's on the local apt server and there's no other copy.
[11:18:29] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:18:29] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[11:20:15] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis) I've reverted the commit just for diagnostic purposes. I'll try another pair of reimages now.
[11:21:54] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1076.eqiad.wmnet with OS bullseye
[11:22:33] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1077.eqiad.wmnet with OS bullseye
[11:28:35] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis) It's gone past the partman error on both servers, so that's fairly certain. I'll allow these two reimages to complete, but I'll try looking again at the `reuse-parts.sh` script af...
[11:37:15] <btullis>	 !log ran apt clean on an-tool1009 to free up disk space
[11:37:16] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:39:14] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (jbond) >>! In T332570#9058674, @BTullis wrote: > It's gone past the partman error on both servers, so that's fairly certain. I'll allow these two reimages to complete, but I'll try looking again at the `reuse-p...
[11:40:13] <jinxer-wm>	 (DiskSpace) resolved: Disk space an-tool1009:9100:/ 4.543% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-tool1009 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:47:42] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:48:21] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:54] <wikibugs>	 (PS1) AikoChou: Update mediawiki/page/prediction_classification_change to 1.1.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/944183 (https://phabricator.wikimedia.org/T343002)
[12:01:17] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:03:37] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1077.eqiad.wmnet with OS bullseye completed: - analytics1077 (**PASS**)   - Removed from Puppet and Puppet...
[12:06:47] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1076.eqiad.wmnet with OS bullseye completed: - analytics1076 (**PASS**)   - Removed from Puppet and Puppet...
[12:58:53] <wikibugs>	 Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (xcollazo) >So did we change the log retention from 40 days to 7 in between these times  Ah, what I meant to say is that about a month passed between job start and w...
[13:00:36] <wikibugs>	 Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (xcollazo) > 40 days' worth of YARN logs and that these are about 3.1 TB of uncompressed text. Hmm, that is a lot. I wonder if there is any setting that would compre...
[13:08:47] <wikibugs>	 Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (xcollazo) >>! In T342923#9059010, @xcollazo wrote: >> 40 days' worth of YARN logs and that these are about 3.1 TB of uncompressed text. > Hmm, that is a lot. I wond...
[13:16:33] <wikibugs>	 (CR) Elukey: [C: +1] Update mediawiki/page/prediction_classification_change to 1.1.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/944183 (https://phabricator.wikimedia.org/T343002) (owner: AikoChou)
[13:18:04] <mforns>	 hello!
[13:24:10] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, Machine-Learning-Team, Event-Platform: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (achou) > @achou, reading the Model Card Data section it looks...
[13:35:22] <elukey>	 hola mforns 
[13:35:25] <elukey>	 <3 :)
[13:40:03] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) Thanks @jbond  Adding a datahub_staging oidc entry with `service_id: 'https://datahub-frontend\.k8s-staging\.discovery\.wmnet(/.*)?'` which we access...
[14:21:12] <mforns>	 heya elukey :]
[14:42:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) ferm.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:51:21] <wikibugs>	 Data-Engineering, Data-Platform-SRE: Configure airflow to send metrics to prometheus - https://phabricator.wikimedia.org/T343232 (BTullis)
[14:52:25] <wikibugs>	 Data-Engineering, Data-Platform-SRE: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (BTullis)
[15:00:28] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Observability-Alerting: Explore the use of Airflow notifiers for more flexible DAG failure handling - https://phabricator.wikimedia.org/T343234 (BTullis)
[15:00:51] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (BTullis)
[15:17:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: ferm.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:18:29] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[15:18:29] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[15:20:26] <wikibugs>	 Data-Engineering, Data-Platform-SRE: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (BTullis)
[15:21:50] <wikibugs>	 Data-Engineering, Data-Platform-SRE: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (BTullis) p:Triage→High This would be really useful in order to help us with {T305874} as well.
[15:24:02] <wikibugs>	 Data-Engineering, Data-Platform-SRE: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (BTullis)
[15:29:40] <wikibugs>	 Data-Platform-SRE, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (BTullis)
[15:29:44] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (BTullis) Open→Resolved
[15:31:53] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, GitLab (Project Migration), Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (BTullis) Open→Resolved
[15:32:05] <wikibugs>	 Data-Platform-SRE, Data Pipelines, Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (BTullis) Open→Resolved
[15:32:34] <wikibugs>	 Data-Platform-SRE, API Platform, Anti-Harassment, Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (BTullis)
[15:40:07] <wikibugs>	 Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) Moving to blocked whilst we carry out {T343236}
[16:02:42] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:04:55] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:15:15] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:17:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:47:42] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:49:19] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:01:11] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:02:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:37:34] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (BTullis)
[17:39:59] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, Data-Catalog, Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (odimitrijevic) @Htriedman we are picking this work up again. Is the POC that you did available in a repository on gitlab?
[17:48:52] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, Data-Catalog, Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (Htriedman) Hi @odimitrijevic! Here's the gitlab repo I worked on during the documentathon :) https://gitlab.wikimedia.org/h...
[18:37:49] <wikibugs>	 Data-Platform-SRE, API Platform, Anti-Harassment, Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (apaskulin)
[18:38:19] <wikibugs>	 Data-Platform-SRE, API Platform, Anti-Harassment, Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (apaskulin)
[18:39:51] <wikibugs>	 Data-Platform-SRE, API Platform, Anti-Harassment, Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (apaskulin)
[19:18:29] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:18:29] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[20:17:50] <wikibugs>	 Data-Engineering: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (xcollazo) Forwarding learnings from T343238#9060658 to this ticket: SIGTERMs on the Airflow instance only kill the Airflow process, with no (current) mechanism to forward the kill t...
[20:19:11] <wikibugs>	 Data-Engineering: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (xcollazo) Perhaps we should catch the SIGTERM and do a best effort to forward it to Skein/Spark?
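(Editor's note) The "catch the SIGTERM and forward it" idea from the ticket can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `run_forwarding_sigterm` and the `poll_seconds` parameter are hypothetical, a plain child process stands in for the Skein/Spark job, and nothing here reflects how Airflow or Skein actually launch or track applications.

```python
import signal
import subprocess


def run_forwarding_sigterm(cmd, poll_seconds=0.1):
    """Run cmd as a child process and relay any SIGTERM we receive to it.

    Hypothetical helper: a local child process stands in for the remote job.
    Must be called from the main thread, because signal.signal() requires it.
    """
    received = []  # signals seen by the handler, relayed by the wait loop

    def remember(signum, frame):
        # Only record the signal here; the loop below does the forwarding.
        received.append(signum)

    previous = signal.signal(signal.SIGTERM, remember)
    try:
        child = subprocess.Popen(cmd)
        while True:
            if received:
                received.clear()
                if child.poll() is None:  # best effort: child still running
                    child.send_signal(signal.SIGTERM)
            try:
                # Short timeout so a pending signal is relayed promptly.
                return child.wait(timeout=poll_seconds)
            except subprocess.TimeoutExpired:
                pass
    finally:
        # Restore whatever handler was installed before we started.
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)
```

The handler only records the signal and the wait loop relays it, so a SIGTERM that arrives before the child exists is not lost. The return value is the child's exit code, which on POSIX is negative when the child was killed by a signal.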
[20:53:50] <wikibugs>	 Data-Engineering: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (xcollazo) The code that does the conversion from dumps to the table, MediawikiXMLDumpsConverter, [[ https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/s...
[23:18:29] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:18:29] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability