[01:16:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:18:43] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:42:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:43:46] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [02:44:26] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:49:28] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [02:50:09] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:52:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:41:42] (SystemdUnitFailed) firing: presto-server.service 
Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:42:42] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [03:43:24] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:49:40] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [03:50:22] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:51:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:53:43] (MediawikiPageContentChangeEnrichAvailability) firing: ... 
[03:53:43] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [05:37:30] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1113.eqiad.wmnet with OS bullseye [05:38:44] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1114.eqiad.wmnet with OS bullseye [06:17:56] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1113.eqiad.wmnet with OS bullseye completed: - an-worker1113 (**PASS**) - Downtimed on Icinga/Alertm... [06:21:35] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1114.eqiad.wmnet with OS bullseye completed: - an-worker1114 (**PASS**) - Downtimed on Icinga/Alertm... 
[06:22:03] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1115.eqiad.wmnet with OS bullseye [06:46:28] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1116.eqiad.wmnet with OS bullseye [07:05:43] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1115.eqiad.wmnet with OS bullseye completed: - an-worker1115 (**WARN**) - Downtimed on Icinga/Alertm... [07:27:22] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1116.eqiad.wmnet with OS bullseye completed: - an-worker1116 (**PASS**) - Downtimed on Icinga/Alertm... [07:48:42] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye [07:53:43] (MediawikiPageContentChangeEnrichAvailability) firing: ... 
[07:53:44] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [08:19:24] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10Gehel) a:05BTullis→03Stevemunene [08:31:42] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Gehel) [11:22:20] I'm about to start testing a new build of archiva for T299645 - Let me know if anything seems amiss with it. [11:27:18] Looks good so far. Version 2.2.10 of archiva is installed and appears to be working correctly. I'm checking that the `archiva-gitfat-link.service` works as expected. [11:53:44] (MediawikiPageContentChangeEnrichAvailability) firing: ... [11:53:44] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [12:01:14] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10Stevemunene) While reimaging `an-worker1117.eqiad.wmnet` we found that the server did not reimage with the right mountpoints resulting in puppet error ` Error while evaluating a Function Call: Number of datanod... 
[12:15:37] 10Data-Platform-SRE, 10Patch-For-Review, 10sre-alert-triage: Alert: Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - https://phabricator.wikimedia.org/T343318 (10gmodena) >>! In T343318#9101649, @gerritbot wrote: > Change 939651 had a related patch set uploaded (by Gmode... [12:29:57] (MediawikiPageContentChangeEnrichAvailability) resolved: ... [12:29:57] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [12:48:12] (03CR) 10Mforns: "It seems reusing preexisting core fragments prevents us from organizing the fields in an optimal way for consumers down the pipeline, no? " [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: 10Clare Ming) [12:50:00] 10Data-Engineering, 10Dumps 2.0, 10Data Products (Sprint 0): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (10xcollazo) In general, I think the document includes great knowledge to have as a newcomer to Dumps 1.0. From a runbook perspective, I think we should include a listing of... [12:51:15] 10Data-Engineering, 10Dumps 2.0, 10Data Products (Sprint 0): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (10JEbe-WMF) >>! In T343325#9109206, @xcollazo wrote: > In general, I think the document includes great knowledge to have as a newcomer to Dumps 1.0. > > From a runbook persp... 
[12:54:23] 10Data-Engineering, 10Dumps 2.0, 10Data Products (Sprint 0): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (10ArielGlenn) I can try to dust off and restructure the troubleshooting guide on wikitech for the sql/xml dumps, if that would be helpful. This would by no means be a replace...
[13:01:05] 10Data-Platform-SRE, 10Data-Services, 10cloud-services-team: Drop several views from ptwikisource - https://phabricator.wikimedia.org/T332596 (10BTullis) 05Open→03Resolved
[13:02:57] 10Data-Platform-SRE, 10Patch-For-Review: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (10BTullis) Deploying this change.
[13:03:44] !log deploying the change to the yarn log retention and compression for T342923
[13:03:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:03:48] T342923: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923
[13:06:58] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 1), 10Event-Platform: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (10lbowmaker)
[13:07:00] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Stevemunene) Got some errors from the first test, but they're mostly related to the current setup. Looking into this ` ERROR: exit status 1 EXIT STATUS 1 S...
[13:07:03] 10Data-Platform-SRE, 10Patch-For-Review: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (10BTullis) I'll leave the ticket open for a while, whilst we check to make sure that there are no unintended consequences.
[13:14:19] 10Data-Platform-SRE, 10Data-Services, 10cloud-services-team: Drop several views from ptwikisource - https://phabricator.wikimedia.org/T332596 (10Kizule) Sorry, I've missed notification about this, I don't know how, but that's alright. Thank you for taking a look into this, everything looks fine to me. :)
[13:17:57] 10Data-Platform-SRE, 10Data-Services, 10cloud-services-team: Drop several views from ptwikisource - https://phabricator.wikimedia.org/T332596 (10BTullis) >>! In T332596#9109379, @Kizule wrote: > Sorry, I've missed notification about this, I don't know how, but that's alright. > > Thank you for taking a look...
[13:29:07] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Jhancock.wm) @Papaul fixed the thing.
[13:39:21] 10Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (10fkaelin) Would it be possible to define the retention period based on the queue an application is running in? It is not required to keep the logs for development /...
[13:44:52] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul) @Jhancock.wm thanks all good now ` papaul@asw-a-codfw> show interfaces xe-2/0/19 descriptions Interface Admin Link Description xe-2/0/19 up...
[13:56:04] 10Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (10BTullis) >>! In T342923#9109622, @fkaelin wrote: > Would it be possible to define the retention period based on the queue an application is running in? It is not re...
[14:01:51] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2023.codfw.wmnet with OS bullseye [14:05:53] 10Data-Platform-SRE, 10Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (10bking) [14:05:56] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Epic, 10Event-Platform: Flink Operations - https://phabricator.wikimedia.org/T328561 (10bking) [14:08:08] !log deploying refinery using scap [14:08:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:19:12] gmodena: You may want to be aware of : T334493 (there is a workaround for the refinery deployment deploy-refinery-to-hdfs step) [14:19:12] T334493: analytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493 [14:20:01] btullis I'm deploying refinery with joal and we stumbled upon an issue with git fat on hadoop-test: Unhandled error: [14:20:01] deploy-local failed: {'exitcode': 1, 'stdout': '', 'stderr': "git: 'fat' is not a git command. See 'git --help'.\n\nThe most similar commands are\n\tfetch\n\tmktag\n\tstage\n\tstash\n\ttag\n\tvar\n"} [14:20:18] 10Data-Engineering, 10Dumps 2.0, 10Data Products (Sprint 00): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (10xcollazo) [14:20:38] btullis I'm rolling back deployment on hadoop-test [14:20:58] gmodena: Oh, interesting. That's the second time I've seen git-fat not being installed on bullseye. I think. [14:22:23] gmodena: Ah, it's here: https://github.com/wikimedia/operations-puppet/blob/production/modules/base/manifests/standard_packages.pp#L43-L46 [14:22:46] btullis ack re T334493. Thanks for the heads up. 
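The deploy-local stderr above means the `git-fat` helper executable simply isn't on PATH on the target host: git prints "'fat' is not a git command" whenever it cannot find a `git-<name>` binary for a subcommand. A minimal sketch of that check, with PATH overridden so the result is the same as on the broken host; the packaging details in the comment are assumptions, not taken from this log:

```shell
# Sketch: reproduce the "git: 'fat' is not a git command" condition by
# checking whether a git-fat executable resolves. PATH is overridden here
# to simulate a host where the package is absent.
if PATH=/nonexistent command -v git-fat >/dev/null 2>&1; then
  echo "git-fat present"
else
  echo "git-fat missing"   # on a real Debian host: install the git-fat package
fi
```

On the production hosts the package is meant to come from puppet (the standard_packages manifest linked at 14:22:23), so a manual install like the one done at 14:26:38 is only a stop-gap.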
[14:24:13] (DiskSpace) firing: Disk space an-test-client1001:9100:/ 0.8115% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-client1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [14:26:38] gmodena: Could you try your deploy to hadoop-test again please? I have manually installed git-fat on an-test-coord1001 and I'll make a note just after this comment: https://phabricator.wikimedia.org/T279509#8936938 [14:27:25] btullis it worked [14:27:41] \o/ [14:28:40] Thanks btullis :) [14:29:13] (DiskSpace) resolved: Disk space an-test-client1001:9100:/ 0.8115% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-client1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [14:29:20] !log deploying refinery with hdfs [14:29:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:32:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:34:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:27] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Stevemunene) Was able to get the deployment to staging done, login redirected to the right SSO page and I was able to enter my login details, however authenticati... 
[14:43:22] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products, 10Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472 (10BTullis) Adding more tags and subscribers so that we can gain visibility and prioritise this within the DPE group. I'm not necessarily sure yet who is best... [14:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:21] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2023.codfw.wmnet with OS bullseye executed with errors: - wdqs2... [14:49:49] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products, 10Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472 (10BTullis) [14:49:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:42] (SystemdUnitFailed) firing: refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:09:42] (SystemdUnitFailed) firing: (2) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:24:42] (SystemdUnitFailed) firing: (3) 
refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:26:46] 10Data-Platform-SRE: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (10Gehel) As discussed in DPE SRE meeting: * let's try to be consistent in naming, using "Data Platform" for all services where DPE SRE is responsible [15:28:00] 10Data-Engineering, 10Data-Platform-SRE: Define a list of exactly which alerts should page the Analytics team in VictorOps - https://phabricator.wikimedia.org/T296552 (10Gehel) [15:29:07] 10Data-Platform-SRE: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (10Gehel) [15:29:42] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:32:59] 10Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (10Gehel) a:03BTullis [15:37:56] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Stevemunene) [15:38:09] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 (10Gehel) a:03RKemper [15:49:50] 10Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10Gehel) [15:50:11] 
10Data-Platform-SRE, 10Discovery-Search (Current work): Unmanaged envoyproxy installation on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T341042 (10bking) @JMeybohm we are reimaging these hosts to Bullseye in https://phabricator.wikimedia.org/T343124 , we will revisit this once the hosts have bee... [16:04:42] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:22] 10Data-Engineering, 10Event-Platform, 10GitLab (Pipeline Services Migration🐤): Migrate Data Engineering Pipelinelib repos to GitLab - https://phabricator.wikimedia.org/T344730 (10thcipriani) [16:09:42] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:12:02] 10Data-Platform-SRE: Export Blazegraph JNL file from wdqs1009 - https://phabricator.wikimedia.org/T344732 (10bking) [16:16:53] 10Data-Platform-SRE: Export Blazegraph JNL file from wdqs1009 - https://phabricator.wikimedia.org/T344732 (10bking) We noticed that wdqs-blazegraph keeps restarting itself, even with Puppet disabled. To work around this, I've changed mode to 000 for `/srv/deployment/wdqs/wdqs/runBlazegraph.sh` . This file is not... [16:17:32] (GobblinLastSuccessfulRunTooLongAgo) firing: Last successful gobblin run of job event_default_test was more than 2 hours ago. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default_test - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [16:22:32] (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [16:27:05] joal: gmodena: It looks like there is an issue with gobblin and refine on an-test-coord1001. I can help look. [16:27:34] Thanks for the ping btullis - I'm in a meeting now, will help in a bit [16:28:54] joal: ack. Maybe it's to do with the Archiva upgrade. I see `java.lang.ClassNotFoundException` errors. [16:30:03] Mwarf :( [16:32:32] (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [16:33:51] Similar with gobblin job, cannot find jars. [16:34:02] `Error opening job jar: /srv/deployment/analytics/refinery/artifacts/org/wikimedia/gobblin-wmf/gobblin-wmf-core-1.0.1-jar-with-dependencies.jar` [16:34:28] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2023.codfw.wmnet with OS bullseye completed: - wdqs2023 (**WARN... [16:34:29] That's from `journalctl -u gobblin-eventlogging_legacy_test --output short` on an-test-coord1001. [16:36:14] But it appears to be there. [16:36:20] https://www.irccloud.com/pastebin/CeZAcO7s/ [16:36:30] I think I know the issue btullis [16:36:42] Great. 
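At 16:36 the jar "appears to be there" yet the gobblin job cannot open it. One plausible failure mode with git-fat (an assumption on my part, never confirmed in this log) is a file that exists on disk but is only a small git-fat placeholder stub rather than the real archive; a quick size-and-magic check distinguishes the two. Everything below runs in a temp directory with made-up content:

```shell
set -eu
# Simulate a git-fat stub sitting where a multi-MB jar should be, then flag it.
# The filename matches the one in the gobblin error; the content is invented.
work="$(mktemp -d)"
jar="$work/gobblin-wmf-core-1.0.1-jar-with-dependencies.jar"
printf '#$# git-fat 0123456789abcdef0123456789abcdef01234567 12345678\n' > "$jar"
size=$(wc -c < "$jar")
# Real jars are megabytes and start with the zip magic "PK"; git-fat stubs
# are tens of bytes and start with "#$#".
if [ "$size" -lt 512 ] && head -c 4 "$jar" | grep -q '#\$#'; then
  echo "likely git-fat stub (${size} bytes), not a real jar"
else
  echo "jar looks real"
fi
rm -rf "$work"
```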
[16:37:46] i.e: We have rolled-back an-test-coord1001, and then re-deployed
[16:40:13] And in those cases, scap keeps the rolled-back folder and tries to reuse it for the next deploy - but that folder was broken (no jars), and thus scap doesn't see it
[16:41:24] Ah, so do you think it will fix itself on the next scheduled gobblin runs and refine runs?
[16:41:33] Nope
[16:41:39] We need to take action
[16:41:40] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products, 10Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472 (10xcollazo) I would rephrase the problem as: Why do we need to keep generated artifacts inside our version control solution? If we migrate `analytics/refine...
[16:44:53] here's what I suggest: we manually drop the scap rev folder on an-test-coord1001, and then redeploy
[16:45:23] Would you like me to help with this? Shall we batcave?
[16:45:57] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul)
[16:46:09] btullis: after my meeting - give me 10 minutes :)
[16:47:19] joal: Sorry for hassling. I have to be afk for a bit now.
[16:47:43] ack btullis - we'll do later
[16:53:32] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul) 05Open→03Resolved complete
[17:00:39] 10Data-Engineering, 10Event-Platform, 10GitLab (Pipeline Services Migration🐤): Migrate Data Engineering Pipelinelib repos to GitLab - https://phabricator.wikimedia.org/T344730 (10thcipriani)
[17:04:57] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:09:42] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:11:43] btullis: I'm assuming you're gone for now - ping stevemunene in case?
[17:19:46] joal: I am here. Different desk, but logged in and with ssh. How can I help?
[17:20:41] Thanks btullis - Can you please go to an-test-coord1001
[17:21:10] I'm in
[17:23:05] And, there, drop the /srv/deployment/analytics/refinery-cache folder please
[17:23:24] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 1), 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10tchin) After experimenting a lot, I have a Datahub transformer for Kafka that generates an Event Streams platfor...
[17:23:40] btullis: --^
[17:23:50] sorry, it took me time to devise what to do
[17:24:06] joal: done
[17:24:14] ack, I'll try to redeploy
[17:24:24] Great!
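The remediation agreed above - drop `/srv/deployment/analytics/refinery-cache` on an-test-coord1001 so the next deploy repopulates it instead of reusing the rolled-back, jar-less revision - can be sketched as follows. Only the cache path comes from the conversation; the `revs` sub-layout is an assumption about scap's cache structure, and the destructive step is simulated in a temp directory rather than run against /srv:

```shell
set -eu
# Simulate scap's per-repo cache: a stale rev survives a rollback and would be
# reused (without its jars) on the next deploy, so the fix is dropping the cache.
root="$(mktemp -d)"
cache="$root/refinery-cache"             # real path: /srv/deployment/analytics/refinery-cache
mkdir -p "$cache/revs/stale-broken-rev"  # stand-in for the rolled-back, jar-less rev
rm -rf "$cache"                          # the manual step taken on an-test-coord1001
mkdir -p "$cache/revs/fresh-rev"         # what the subsequent scap deploy recreates
ls "$cache/revs"                         # only the fresh rev remains
rm -rf "$root"
```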
[17:24:50] !log Redeploying refinery onto Hadoop-test to try to fix jar issue
[17:24:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:25:15] I see a new refinery-cache folder.
[17:25:20] Expected!
[17:25:39] 👍
[17:26:11] once deploy is done (and hopefully successful), we should have the missing jars
[17:26:20] And the next gobblin run should succeed
[17:26:28] Then it'll be about rerunning the failed refine
[17:27:21] Ok, stepping away from this keyboard for now, but will keep monitoring. Let me know how it goes.
[17:27:23] ok, deploy successful
[17:27:30] Thanks a million btullis
[17:27:50] A pleasure.
[17:28:29] joal btullis ack. Saw the alerts re missing jars.
[17:30:02] catching up with history; what was the issue with the test deployment? stale dir?
[17:32:32] (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[17:34:33] gmodena: let's sync on this after standup (or in standup if we can)
[17:38:51] joal ack. let's sync at standup.
I need to drop around 20:00ish [17:39:02] sure [17:50:50] gmodena btullis no urgency, but I've added you as reviewers on the latest rdf-streaming-updater test patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/951551 [17:54:42] inflatador ack [17:57:10] Thanks for all the great documentation btw [17:57:24] really makes our lives easier ;) [18:04:42] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:07:32] (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [18:09:42] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:12:32] (GobblinLastSuccessfulRunTooLongAgo) resolved: (2) Last successful gobblin run of job event_default_test was more than 2 hours ago. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[18:15:32] \o/
[18:15:36] Gobblin error resolved
[18:15:56] We'll talk about those issues tomorrow with gmodena, and rerun the failed jobs
[18:16:38] note: those errors were on the test cluster, no real production issue
[18:24:42] (SystemdUnitFailed) firing: (3) refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:29:42] (SystemdUnitFailed) resolved: (2) refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:30:35] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 1), 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10tchin) From the recent meeting: - `Event Streams` will be the name of the platform - Streams are upstream to Kaf...
[18:31:52] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 (10bking)
[18:32:58] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products, 10Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472 (10BTullis) >>! In T328472#9110768, @xcollazo wrote: > I would rephrase the problem as: Why do we need to keep generated artifacts inside our version control...
[18:58:44] (03PS1) 10Clare Ming: Experiment with including fragments inside data objects.
[schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951560 (https://phabricator.wikimedia.org/T343557) [19:09:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:10:12] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [19:21:00] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [19:24:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:27:18] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) 05Open→03Resolved a:03thcipriani This task got too big to be useful. I've broken down each individual row in... 
[19:27:38] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:27:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:27:46] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [19:28:12] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:28:20] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products, 10Data Pipelines (Sprint 12): Non-mobile UAs on mobile (2g/gprs, etc) IP-blocks - https://phabricator.wikimedia.org/T58628 (10VirginiaPoundstone) p:05Medium→03Triage [19:28:44] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:29:34] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:29:54] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:30:16] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:24] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:49:34] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [19:49:34] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:09:57] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:11:00] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [20:11:00] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:34] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [20:19:34] RECOVERY - 
Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:18] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [20:25:18] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:49:42] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [20:49:42] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:57] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:07:57] (03PS3) 10Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557)