[00:31:45] (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:15:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:45] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:16:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:36:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:41:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:48:36] Analytics, Data-Engineering-Icebox, CX-analytics, Language-analytics, Technical-Debt: Special:ContentTranslationStats is slow and getting crowded - https://phabricator.wikimedia.org/T325790 (santhosh) @MNeisler Thanks for listing these options. There are a few "must have" requirements from my...
[08:40:47] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (gmodena) > This is the lag of the most lagged rep...
[09:09:31] About to start the decommissioning of analytics1058-1060; the hosts are in decommissioned state and removed from the HDFS topology as per T317861
[09:09:31] T317861: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861
[09:10:04] Ack, thanks stevemunene.
[09:13:08] !log Decommissioning analytics1058.eqiad.wmnet -t T338227
[09:13:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:13:11] T338227: decommission analytics1058.eqiad.wmnet - https://phabricator.wikimedia.org/T338227
[09:55:41] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines (Sprint 14), Patch-For-Review: Wikistats Bug: Small countries not displayed on the map - https://phabricator.wikimedia.org/T338033 (Robertsky) >>! In T338033#8936671, @gerritbot wrote: > Change 929816 **merged** by jenkins-bot: > %%%[an...
[09:57:11] Data-Engineering: Increase webrequest_sampled_live Druid datasource's retention - https://phabricator.wikimedia.org/T337460 (Volans) After a chat with @elukey I now better understand the two pipelines: `webrequest_sampled_128`: ` varnish -> varnishkafka -> kafka jumbo -> HDFS -> druid ` `webrequest_sampled...
[10:41:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:47:55] !log decommission host analytics1059.eqiad.wmnet -t T338408
[10:47:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:47:57] T338408: decommission analytics1059.eqiad.wmnet - https://phabricator.wikimedia.org/T338408
[11:28:01] !log decommission host analytics1060.eqiad.wmnet -t T338409
[11:28:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:28:04] T338409: decommission analytics1060.eqiad.wmnet - https://phabricator.wikimedia.org/T338409
[11:51:32] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (Stevemunene) Done decommissioning the hosts in the first batches in /eqiad/A. Next is to [[ https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Kerbero...
[13:14:13] PROBLEM - puppet last run on kafka-test1006 is CRITICAL: CRITICAL: Puppet has been disabled for 605053 seconds, message: Elukey - elukey, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:15:11] elukey: ^ is puppet on kafka-test1006 safe to re-enable now?
[13:16:50] ah yes sorry, lemme fix it
[13:18:16] done :)
[13:18:37] Many thanks :-)
[13:19:17] btullis: Sandra is testing a patch for refine-sanitize - she will probably have it ready later today :)
[13:19:47] RECOVERY - puppet last run on kafka-test1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:21:18] Great! I've got patchy availability this afternoon and I'm away tomorrow.
I can try to fit in a deployment today, but stevemunene might also be able to help if I'm unavailable.
[13:22:45] ack btullis - thanks for the heads up
[13:31:44] PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:33:24] PROBLEM - Check systemd state on analytics1061 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:30] PROBLEM - Check systemd state on analytics1062 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:34:18] PROBLEM - Hadoop NodeManager on analytics1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:34:34] PROBLEM - Check systemd state on analytics1063 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:41:26] heya team, joal btullis, can I help with something?
[13:42:51] Hi mforns :) - Nothing from my side, thanks. All good. The alerts about nodemanagers are related to work that stevemunene is doing to decom some hadoop workers.
[13:43:12] ok, btullis, thanks for the info!
[13:44:12] Hi mforns - I have purposely not taken on fixing refine alerts etc. - I wish to point out during standup/retro that we have been behind on this lately
[13:44:49] Sandra is working on a puppet patch to re-migrate refine-sanitize to spark3, after the roll-back last Friday.
[13:45:02] ok, thanks all
[13:46:00] Steve is on ops week and I am supposedly shadowing him, but I have not been very active in doing so. Apologies for that.
[14:00:58] np btullis - it's not on anyone personally :)
[14:02:33] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (Krinkle) >>! In T169452#8919440, @rook wrote: >>>! In T169452#8919426, @Krinkle wrote: >> I'm trying out Superset as per the banner on Quarr...
[14:04:43] !log move varnishkafka instances in eqsin to PKI - T337825
[14:04:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:04:46] T337825: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825
[14:04:53] all working as expected folks
[14:06:12] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (BTullis) I've now made some good progress on a change to the datahub helm charts that I believe will help with progress on the upgrade. https://gerrit.wikimedia.org/r/c/operations/deployment-chart...
[14:06:23] elukey: Great!
[14:06:51] \o/
[14:06:55] Thanks elukey
[14:07:50] by the way folks, I made a change to wikibugs, so this IRC channel now gets updates from tickets tagged with #data-platform-sre in phabricator: https://gerrit.wikimedia.org/r/c/labs/tools/wikibugs2/+/930834
[14:09:16] It means that we might start to see some updates here from bking and rkemper on search-platform related work, but it seemed the best thing to do in order to make sure that what we're doing is still visible here.
Feel free to comment/revert if you have any suggestions.
[14:11:05] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (Krinkle) I noticed something rather creepy in Superset. It seems to track by default every link and query you open, even if you merely click...
[14:16:44] Data-Engineering: Increase webrequest_sampled_live Druid datasource's retention - https://phabricator.wikimedia.org/T337460 (JAllemandou) >>! In T337460#8946096, @Volans wrote: > My only question is if in the future it would be possible to evaluate ways to backfill data into druid in case of issues in the pi...
[14:21:06] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (Snaevar) Looked at Recent activity on my account and saw nothing. That is despite me having around 30 saved queries. I have alpha and OAuth...
[15:02:58] RECOVERY - Check systemd state on analytics1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:10:03] Data-Platform-SRE, Patch-For-Review: Upgrade Airflow to version 2.6.0 - https://phabricator.wikimedia.org/T336286 (mforns) Just stumbled on this task, I created the missing patch. It should work regardless of the version of Airflow that is running it. https://gitlab.wikimedia.org/repos/data-engineering/a...
[15:23:10] Data-Platform-SRE, Patch-For-Review: Upgrade Airflow to version 2.6.0 - https://phabricator.wikimedia.org/T336286 (BTullis) Thanks so much @mforns - That looks excellent.
[15:31:45] Data-Engineering, Data-Platform-SRE, Observability-Alerting: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager - https://phabricator.wikimedia.org/T337052 (mforns) >> In T337052#8865357, @mforns wrote: >> Another reason is that we hopefully(2) won't need to execu...
[15:37:14] Data-Platform-SRE, Data Pipelines (Sprint 14): Upgrade Presto to release that aligns with Iceberg 1.2.1 - https://phabricator.wikimedia.org/T337335 (BTullis) @xcollazo - could you have another go please? I just tried to reproduce your error on an-test-client1001 and it's working for me. Creating the ice...
[15:38:24] Data-Platform-SRE: The /srv volume is full on an-launcher1002 - https://phabricator.wikimedia.org/T339002 (BTullis) Open→Resolved
[15:46:38] RECOVERY - Check systemd state on analytics1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:32] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (rook) >>! In T169452#8946824, @Krinkle wrote: > In the simplest case of a select query that needs no LIMIT, indeed the parenthesis can be om...
[16:04:34] Data-Engineering-Planning, Data Pipelines (Sprint 14), Patch-For-Review: Setup config to allow lineage instrumentation - https://phabricator.wikimedia.org/T333004 (JArguello-WMF)
[16:56:39] Data-Engineering, Data-Platform-SRE, Observability-Alerting: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager - https://phabricator.wikimedia.org/T337052 (Ottomata) > But it also means that when an hourly partition fails refinement, it fails every time the parti...
[16:57:32] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (Ottomata) > I wonder if the is_bad_rev_id field (...
[16:59:17] Data-Engineering, Data-Platform-SRE, Observability-Alerting: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager - https://phabricator.wikimedia.org/T337052 (mforns) > There is a _FAILURE flag that gets written, and by default previous failures are excluded from th...
[17:36:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:31:23] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (gmodena) mw-page-content-change-enrich has been deployed with Kubernetes HA (ConfigMaps) on staging. So far so good during routine a...
[18:38:08] (Abandoned) Aklapper: Service Worker to cache locally AQS data [analytics/dashiki] - https://gerrit.wikimedia.org/r/302755 (https://phabricator.wikimedia.org/T138647) (owner: Nuria)
[18:56:22] (PS7) Gmodena: page_change: add a flag for missing revision data [schemas/event/primary] - https://gerrit.wikimedia.org/r/929786 (https://phabricator.wikimedia.org/T309699)
[21:36:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed