[01:16:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:18:43] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:42:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:43:46] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [02:44:26] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:49:28] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [02:50:09] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:52:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:41:42] (SystemdUnitFailed) firing: presto-server.service 
Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:42:42] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [03:43:24] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:49:40] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [03:50:22] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:51:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:53:43] (MediawikiPageContentChangeEnrichAvailability) firing: ... 
[03:53:43] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [05:37:30] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1113.eqiad.wmnet with OS bullseye [05:38:44] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1114.eqiad.wmnet with OS bullseye [06:17:56] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1113.eqiad.wmnet with OS bullseye completed: - an-worker1113 (**PASS**) - Downtimed on Icinga/Alertm... [06:21:35] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1114.eqiad.wmnet with OS bullseye completed: - an-worker1114 (**PASS**) - Downtimed on Icinga/Alertm... 
[06:22:03] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1115.eqiad.wmnet with OS bullseye [06:46:28] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1116.eqiad.wmnet with OS bullseye [07:05:43] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1115.eqiad.wmnet with OS bullseye completed: - an-worker1115 (**WARN**) - Downtimed on Icinga/Alertm... [07:27:22] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1116.eqiad.wmnet with OS bullseye completed: - an-worker1116 (**PASS**) - Downtimed on Icinga/Alertm... [07:48:42] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye [07:53:43] (MediawikiPageContentChangeEnrichAvailability) firing: ... 
[07:53:44] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [08:19:24] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10Gehel) a:05BTullis→03Stevemunene [08:31:42] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Gehel) [11:22:20] I'm about to start testing a new build of archiva for T299645 - Let me know if anything seems amiss with it. [11:27:18] Looks good so far. Version 2.2.10 of archiva is installed and appears to be working correctly. I'm checking that the `archiva-gitfat-link.service` works as expected. [11:53:44] (MediawikiPageContentChangeEnrichAvailability) firing: ... [11:53:44] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [12:01:14] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10Stevemunene) While reimaging `an-worker1117.eqiad.wmnet` we found that the server did not reimage with the right mountpoints resulting in puppet error ` Error while evaluating a Function Call: Number of datanod... 
[12:15:37] 10Data-Platform-SRE, 10Patch-For-Review, 10sre-alert-triage: Alert: Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - https://phabricator.wikimedia.org/T343318 (10gmodena) >>! In T343318#9101649, @gerritbot wrote: > Change 939651 had a related patch set uploaded (by Gmode... [12:29:57] (MediawikiPageContentChangeEnrichAvailability) resolved: ... [12:29:57] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [12:48:12] (03CR) 10Mforns: "It seems reusing preexisting core fragments prevents us from organizing the fields in an optimal way for consumers down the pipeline, no? " [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: 10Clare Ming) [12:50:00] 10Data-Engineering, 10Dumps 2.0, 10Data Products (Sprint 0): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (10xcollazo) In general, I think the document includes great knowledge to have as a newcomer to Dumps 1.0. From a runbook perspective, I think we should include a listing of... [12:51:15] 10Data-Engineering, 10Dumps 2.0, 10Data Products (Sprint 0): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (10JEbe-WMF) >>! In T343325#9109206, @xcollazo wrote: > In general, I think the document includes great knowledge to have as a newcomer to Dumps 1.0. > > From a runbook persp... 
[12:54:23] 10Data-Engineering, 10Dumps 2.0, 10Data Products (Sprint 0): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (10ArielGlenn) I can try to dust off and restructure the troubleshooting guide on wikitech for the sql/xml dumps, if that would be helpful. This would by no means be a replace...
[13:01:05] 10Data-Platform-SRE, 10Data-Services, 10cloud-services-team: Drop several views from ptwikisource - https://phabricator.wikimedia.org/T332596 (10BTullis) 05Open→03Resolved
[13:02:57] 10Data-Platform-SRE, 10Patch-For-Review: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (10BTullis) Deploying this change.
[13:03:44] !log deploying the change to the yarn log retention and compression for T342923
[13:03:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:03:48] T342923: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923
[13:06:58] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 1), 10Event-Platform: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (10lbowmaker)
[13:07:00] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Stevemunene) Got some errors from the first test, but they're mostly related to the current setup. Looking into this ` ERROR: exit status 1 EXIT STATUS 1 S...
[13:07:03] 10Data-Platform-SRE, 10Patch-For-Review: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (10BTullis) I'll leave the ticket open for a while, whilst we check to make sure that there are no unintended consequences.
[13:14:19] 10Data-Platform-SRE, 10Data-Services, 10cloud-services-team: Drop several views from ptwikisource - https://phabricator.wikimedia.org/T332596 (10Kizule) Sorry, I've missed notification about this, I don't know how, but that's alright. Thank you for taking a look into this, everything looks fine to me. :)
[13:17:57] 10Data-Platform-SRE, 10Data-Services, 10cloud-services-team: Drop several views from ptwikisource - https://phabricator.wikimedia.org/T332596 (10BTullis) >>! In T332596#9109379, @Kizule wrote: > Sorry, I've missed notification about this, I don't know how, but that's alright. > > Thank you for taking a look...
[13:29:07] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Jhancock.wm) @Papaul fixed the thing.
[13:39:21] 10Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (10fkaelin) Would it be possible to define the retention period based on the queue an application is running in? It is not required to keep the logs for development /...
[13:44:52] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul) @Jhancock.wm thanks all good now ` papaul@asw-a-codfw> show interfaces xe-2/0/19 descriptions Interface Admin Link Description xe-2/0/19 up...
[13:56:04] 10Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (10BTullis) >>! In T342923#9109622, @fkaelin wrote: > Would it be possible to define the retention period based on the queue an application is running in? It is not re...
[14:01:51] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2023.codfw.wmnet with OS bullseye [14:05:53] 10Data-Platform-SRE, 10Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (10bking) [14:05:56] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Epic, 10Event-Platform: Flink Operations - https://phabricator.wikimedia.org/T328561 (10bking) [14:08:08] !log deploying refinery using scap [14:08:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:19:12] gmodena: You may want to be aware of : T334493 (there is a workaround for the refinery deployment deploy-refinery-to-hdfs step) [14:19:12] T334493: analytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493 [14:20:01] btullis I'm deploying refinery with joal and we stumbled upon an issue with git fat on hadoop-test: Unhandled error: [14:20:01] deploy-local failed: {'exitcode': 1, 'stdout': '', 'stderr': "git: 'fat' is not a git command. See 'git --help'.\n\nThe most similar commands are\n\tfetch\n\tmktag\n\tstage\n\tstash\n\ttag\n\tvar\n"} [14:20:18] 10Data-Engineering, 10Dumps 2.0, 10Data Products (Sprint 00): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (10xcollazo) [14:20:38] btullis I'm rolling back deployment on hadoop-test [14:20:58] gmodena: Oh, interesting. That's the second time I've seen git-fat not being installed on bullseye. I think. [14:22:23] gmodena: Ah, it's here: https://github.com/wikimedia/operations-puppet/blob/production/modules/base/manifests/standard_packages.pp#L43-L46 [14:22:46] btullis ack re T334493. Thanks for the heads up. 
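The deploy-local stderr above means the `git-fat` helper executable simply isn't on PATH on the target host: git prints "'fat' is not a git command" whenever it cannot find a `git-<name>` binary for a subcommand. A minimal sketch of that check, with PATH overridden so the result is the same as on the broken host; the packaging details in the comment are assumptions, not taken from this log:

```shell
# Sketch: reproduce the "git: 'fat' is not a git command" condition by
# checking whether a git-fat executable resolves. PATH is overridden here
# to simulate a host where the package is absent.
if PATH=/nonexistent command -v git-fat >/dev/null 2>&1; then
  echo "git-fat present"
else
  echo "git-fat missing"   # on a real Debian host: install the git-fat package
fi
```

On the production hosts the package is meant to come from puppet (the standard_packages manifest linked at 14:22:23), so a manual install like the one done at 14:26:38 is only a stop-gap.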
[14:24:13] (DiskSpace) firing: Disk space an-test-client1001:9100:/ 0.8115% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-client1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [14:26:38] gmodena: Could you try your deploy to hadoop-test again please? I have manually installed git-fat on an-test-coord1001 and I'll make a note just after this comment: https://phabricator.wikimedia.org/T279509#8936938 [14:27:25] btullis it worked [14:27:41] \o/ [14:28:40] Thanks btullis :) [14:29:13] (DiskSpace) resolved: Disk space an-test-client1001:9100:/ 0.8115% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-client1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [14:29:20] !log deploying refinery with hdfs [14:29:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:32:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:34:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:27] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Stevemunene) Was able to get the deployment to staging done, login redirected to the right SSO page and I was able to enter my login details, however authenticati... 
[14:43:22] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products, 10Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472 (10BTullis) Adding more tags and subscribers so that we can gain visibility and prioritise this within the DPE group. I'm not necessarily sure yet who is best... [14:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:21] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2023.codfw.wmnet with OS bullseye executed with errors: - wdqs2... [14:49:49] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products, 10Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472 (10BTullis) [14:49:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:42] (SystemdUnitFailed) firing: refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:09:42] (SystemdUnitFailed) firing: (2) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:24:42] (SystemdUnitFailed) firing: (3) 
refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:26:46] 10Data-Platform-SRE: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (10Gehel) As discussed in DPE SRE meeting: * let's try to be consistent in naming, using "Data Platform" for all services where DPE SRE is responsible [15:28:00] 10Data-Engineering, 10Data-Platform-SRE: Define a list of exactly which alerts should page the Analytics team in VictorOps - https://phabricator.wikimedia.org/T296552 (10Gehel) [15:29:07] 10Data-Platform-SRE: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (10Gehel) [15:29:42] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:32:59] 10Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (10Gehel) a:03BTullis [15:37:56] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Stevemunene) [15:38:09] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 (10Gehel) a:03RKemper [15:49:50] 10Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10Gehel) [15:50:11] 
10Data-Platform-SRE, 10Discovery-Search (Current work): Unmanaged envoyproxy installation on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T341042 (10bking) @JMeybohm we are reimaging these hosts to Bullseye in https://phabricator.wikimedia.org/T343124 , we will revisit this once the hosts have bee... [16:04:42] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:22] 10Data-Engineering, 10Event-Platform, 10GitLab (Pipeline Services Migration🐤): Migrate Data Engineering Pipelinelib repos to GitLab - https://phabricator.wikimedia.org/T344730 (10thcipriani) [16:09:42] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:12:02] 10Data-Platform-SRE: Export Blazegraph JNL file from wdqs1009 - https://phabricator.wikimedia.org/T344732 (10bking) [16:16:53] 10Data-Platform-SRE: Export Blazegraph JNL file from wdqs1009 - https://phabricator.wikimedia.org/T344732 (10bking) We noticed that wdqs-blazegraph keeps restarting itself, even with Puppet disabled. To work around this, I've changed mode to 000 for `/srv/deployment/wdqs/wdqs/runBlazegraph.sh` . This file is not... [16:17:32] (GobblinLastSuccessfulRunTooLongAgo) firing: Last successful gobblin run of job event_default_test was more than 2 hours ago. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default_test - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [16:22:32] (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [16:27:05] joal: gmodena: It looks like there is an issue with gobblin and refine on an-test-coord1001. I can help look. [16:27:34] Thanks for the ping btullis - I'm in a meeting now, will help in a bit [16:28:54] joal: ack. Maybe it's to do with the Archiva upgrade. I see `java.lang.ClassNotFoundException` errors. [16:30:03] Mwarf :( [16:32:32] (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [16:33:51] Similar with gobblin job, cannot find jars. [16:34:02] `Error opening job jar: /srv/deployment/analytics/refinery/artifacts/org/wikimedia/gobblin-wmf/gobblin-wmf-core-1.0.1-jar-with-dependencies.jar` [16:34:28] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2023.codfw.wmnet with OS bullseye completed: - wdqs2023 (**WARN... [16:34:29] That's from `journalctl -u gobblin-eventlogging_legacy_test --output short` on an-test-coord1001. [16:36:14] But it appears to be there. [16:36:20] https://www.irccloud.com/pastebin/CeZAcO7s/ [16:36:30] I think I know the issue btullis [16:36:42] Great. 
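At 16:36 the jar "appears to be there" yet the gobblin job cannot open it. One plausible failure mode with git-fat (an assumption on my part, never confirmed in this log) is a file that exists on disk but is only a small git-fat placeholder stub rather than the real archive; a quick size-and-magic check distinguishes the two. Everything below runs in a temp directory with made-up content:

```shell
set -eu
# Simulate a git-fat stub sitting where a multi-MB jar should be, then flag it.
# The filename matches the one in the gobblin error; the content is invented.
work="$(mktemp -d)"
jar="$work/gobblin-wmf-core-1.0.1-jar-with-dependencies.jar"
printf '#$# git-fat 0123456789abcdef0123456789abcdef01234567 12345678\n' > "$jar"
size=$(wc -c < "$jar")
# Real jars are megabytes and start with the zip magic "PK"; git-fat stubs
# are tens of bytes and start with "#$#".
if [ "$size" -lt 512 ] && head -c 4 "$jar" | grep -q '#\$#'; then
  echo "likely git-fat stub (${size} bytes), not a real jar"
else
  echo "jar looks real"
fi
rm -rf "$work"
```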
[16:37:46] i.e: We have rolled-back an-test-coord1001, and then re-deployed
[16:40:13] And in those cases, scap keeps the rolled-back folder and tries to reuse it for the next deploy - but that folder was broken (no jars), and thus scap doesn't see it
[16:41:24] Ah, so do you think it will fix itself on the next scheduled gobblin runs and refine runs?
[16:41:33] Nope
[16:41:39] We need to take action
[16:41:40] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products, 10Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472 (10xcollazo) I would rephrase the problem as: Why do we need to keep generated artifacts inside our version control solution? If we migrate `analytics/refine...
[16:44:53] here's what I suggest: we manually drop the scap rev folder on an-test-coord1001, and then redeploy
[16:45:23] Would you like me to help with this? Shall we batcave?
[16:45:57] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul)
[16:46:09] btullis: after my meeting - give me 10 minutes :)
[16:47:19] joal: Sorry for hassling. I have to be afk for a bit now.
[16:47:43] ack btullis - we'll do later
[16:53:32] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul) 05Open→03Resolved complete
[17:00:39] 10Data-Engineering, 10Event-Platform, 10GitLab (Pipeline Services Migration🐤): Migrate Data Engineering Pipelinelib repos to GitLab - https://phabricator.wikimedia.org/T344730 (10thcipriani)
[17:04:57] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:09:42] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:11:43] btullis: I'm assuming you're gone for now - ping stevemunene in case?
[17:19:46] joal: I am here. Different desk, but logged in and with ssh. How can I help?
[17:20:41] Thanks btullis - Can you please go to an-test-coord1001
[17:21:10] I'm in
[17:23:05] And, there, drop the /srv/deployment/analytics/refinery-cache folder please
[17:23:24] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 1), 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10tchin) After experimenting a lot, I have a Datahub transformer for Kafka that generates an Event Streams platfor...
[17:23:40] btullis: --^
[17:23:50] sorry, it took me time to devise what to do
[17:24:06] joal: done
[17:24:14] ack, I'll try to redeploy
[17:24:24] Great!
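The remediation agreed above - drop `/srv/deployment/analytics/refinery-cache` on an-test-coord1001 so the next deploy repopulates it instead of reusing the rolled-back, jar-less revision - can be sketched as follows. Only the cache path comes from the conversation; the `revs` sub-layout is an assumption about scap's cache structure, and the destructive step is simulated in a temp directory rather than run against /srv:

```shell
set -eu
# Simulate scap's per-repo cache: a stale rev survives a rollback and would be
# reused (without its jars) on the next deploy, so the fix is dropping the cache.
root="$(mktemp -d)"
cache="$root/refinery-cache"             # real path: /srv/deployment/analytics/refinery-cache
mkdir -p "$cache/revs/stale-broken-rev"  # stand-in for the rolled-back, jar-less rev
rm -rf "$cache"                          # the manual step taken on an-test-coord1001
mkdir -p "$cache/revs/fresh-rev"         # what the subsequent scap deploy recreates
ls "$cache/revs"                         # only the fresh rev remains
rm -rf "$root"
```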
[17:24:50] !log Redeploying refinery onto Hadoop-test to try to fix jar issue
[17:24:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:25:15] I see a new refinery-cache folder.
[17:25:20] Expected!
[17:25:39] 👍
[17:26:11] once deploy is done (and hopefully successful), we should have the missing jars
[17:26:20] And the next gobblin run should succeed
[17:26:28] Then it'll be about rerunning the failed refine
[17:27:21] Ok, stepping away from this keyboard for now, but will keep monitoring. Let me know how it goes.
[17:27:23] ok, deploy successful
[17:27:30] Thanks a million btullis
[17:27:50] A pleasure.
[17:28:29] joal btullis ack. Saw the alerts re missing jars.
[17:30:02] catching up with history; what was the issue with the test deployment? stale dir?
[17:32:32] (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[17:34:33] gmodena: let's sync on this after standup (or in standup if we can)
[17:38:51] joal ack. let's sync at standup.
I need to drop around 20:00ish [17:39:02] sure [17:50:50] gmodena btullis no urgency, but I've added you as reviewers on the latest rdf-streaming-updater test patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/951551 [17:54:42] inflatador ack [17:57:10] Thanks for all the great documentation btw [17:57:24] really makes our lives easier ;) [18:04:42] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:07:32] (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [18:09:42] (SystemdUnitFailed) firing: (4) refine_event_sanitized_analytics_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:12:32] (GobblinLastSuccessfulRunTooLongAgo) resolved: (2) Last successful gobblin run of job event_default_test was more than 2 hours ago. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[18:15:32] \o/
[18:15:36] Gobblin error resolved
[18:15:56] We'll talk about those issues tomorrow with gmodena, and rerun the failed jobs
[18:16:38] note: those errors were on the test cluster, no real production issue
[18:24:42] (SystemdUnitFailed) firing: (3) refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:29:42] (SystemdUnitFailed) resolved: (2) refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:30:35] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 1), 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10tchin) From the recent meeting: - `Event Streams` will be the name of the platform - Streams are upstream to Kaf...
[18:31:52] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 (10bking)
[18:32:58] 10Data-Engineering, 10Data-Platform-SRE, 10Data Products, 10Scap: analytics/refinery: Stop using git-fat - https://phabricator.wikimedia.org/T328472 (10BTullis) >>! In T328472#9110768, @xcollazo wrote: > I would rephrase the problem as: Why do we need to keep generated artifacts inside our version control...
[18:58:44] (03PS1) 10Clare Ming: Experiment with including fragments inside data objects.
[schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951560 (https://phabricator.wikimedia.org/T343557) [19:09:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:10:12] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [19:21:00] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [19:24:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:27:18] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) 05Open→03Resolved a:03thcipriani This task got too big to be useful. I've broken down each individual row in... 
[19:27:38] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:27:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:27:46] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [19:28:12] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:28:20] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products, 10Data Pipelines (Sprint 12): Non-mobile UAs on mobile (2g/gprs, etc) IP-blocks - https://phabricator.wikimedia.org/T58628 (10VirginiaPoundstone) p:05Medium→03Triage [19:28:44] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:29:34] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:29:54] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:30:16] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:24] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10thcipriani) [19:49:34] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [19:49:34] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:09:57] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:11:00] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [20:11:00] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:34] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [20:19:34] RECOVERY - 
Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:18] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [20:25:18] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:49:42] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [20:49:42] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:57] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:07:57] (03PS3) 10Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557)