[00:31:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:15:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:16:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:36:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:41:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:48:36] <wikibugs>	 10Analytics, 10Data-Engineering-Icebox, 10CX-analytics, 10Language-analytics, 10Technical-Debt: Special:ContentTranslationStats is slow and getting crowded - https://phabricator.wikimedia.org/T325790 (10santhosh) @MNeisler Thanks for listing these options. There are a few "must have" requirements from my...
[08:40:47] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (10gmodena) > This is the lag of the most lagged rep...
[09:09:31] <stevemunene>	 About to start decommissioning analytics1058-1060; the hosts are in decommissioned state and have been removed from the HDFS topology as per T317861
[09:09:31] <stashbot>	 T317861: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861
[09:10:04] <btullis>	 Ack, thanks stevemunene .
[09:13:08] <stevemunene>	 !log Decommissioning analytics1058.eqiad.wmnet -t T338227
[09:13:11] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:13:11] <stashbot>	 T338227: decommission analytics1058.eqiad.wmnet - https://phabricator.wikimedia.org/T338227
[09:55:41] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Pipelines (Sprint 14), 10Patch-For-Review: Wikistats Bug: Small countries not displayed on the map - https://phabricator.wikimedia.org/T338033 (10Robertsky) >>! In T338033#8936671, @gerritbot wrote: > Change 929816 **merged** by jenkins-bot: > %%%[an...
[09:57:11] <wikibugs>	 10Data-Engineering: Increase webrequest_sampled_live Druid datasource's retention - https://phabricator.wikimedia.org/T337460 (10Volans) After a chat with @elukey I now better understand the two pipelines:  `webrequest_sampled_128`: ` varnish -> varnishkafka -> kafka jumbo -> HDFS -> druid `  `webrequest_sampled...
[10:41:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:47:55] <stevemunene>	 !log decommission host analytics1059.eqiad.wmnet -t T338408
[10:47:57] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:47:57] <stashbot>	 T338408: decommission analytics1059.eqiad.wmnet - https://phabricator.wikimedia.org/T338408
[11:28:01] <stevemunene>	 !log decommission host analytics1060.eqiad.wmnet -t T338409
[11:28:04] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:28:04] <stashbot>	 T338409: decommission analytics1060.eqiad.wmnet - https://phabricator.wikimedia.org/T338409
[11:51:32] <wikibugs>	 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) Done decommissioning the hosts in the first batches in /eqiad/A. Next is to [[ https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Kerbero...
[13:14:13] <icinga-wm>	 PROBLEM - puppet last run on kafka-test1006 is CRITICAL: CRITICAL: Puppet has been disabled for 605053 seconds, message: Elukey - elukey, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:15:11] <btullis>	 elukey: ^ is puppet on kafka-test1006 safe to re-enable now?
[13:16:50] <elukey>	 ah yes sorry, lemme fix it
[13:18:16] <elukey>	 done :)
[13:18:37] <btullis>	 Many thanks :-)
[13:19:17] <joal>	 btullis: Sandra is onto testing a patch for refine-sanitize - She will probably have it for later today :)
[13:19:47] <icinga-wm>	 RECOVERY - puppet last run on kafka-test1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:21:18] <btullis>	 Great! I've got patchy availability this afternoon and I'm away tomorrow. I can try to fit in a deployment today, but stevemunene might also be able to help if I'm unavailable.
[13:22:45] <joal>	 ack btullis - thanks for the heads up
[13:31:44] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:33:24] <icinga-wm>	 PROBLEM - Check systemd state on analytics1061 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:30] <icinga-wm>	 PROBLEM - Check systemd state on analytics1062 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:34:18] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:34:34] <icinga-wm>	 PROBLEM - Check systemd state on analytics1063 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:41:26] <mforns>	 heya team, joal btullis, can I help with something?
[13:42:51] <btullis>	 Hi mforns :) - Nothing from my side, thanks. All good. The alerts about nodemanagers are related to work that stevemunene is doing to decom some hadoop workers.
[13:43:12] <mforns>	 ok, btullis, thanks for the info!
[13:44:12] <joal>	 Hi mforns - I have on purpose not taken on fixing refine alerts etc - I wish to point out during standup/retro that we have been behind on this lately
[13:44:49] <btullis>	 Sandra is working on a puppet patch to re-migrate refine-sanitize to spark3, after the roll-back last Friday.
[13:45:02] <mforns>	 ok, thanks all
[13:46:00] <btullis>	 Steve is on ops week and I am supposedly shadowing him, but I have not been very active in doing so. Apologies for that.
[14:00:58] <joal>	 np btullis - it's not on anyone personally :)
[14:02:33] <wikibugs>	 10Quarry, 10superset.wmcloud.org, 10cloud-services-team (FY2022/2023-Q4): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (10Krinkle) >>! In T169452#8919440, @rook wrote: >>>! In T169452#8919426, @Krinkle wrote: >> I'm trying out Superset as per the banner on Quarr...
[14:04:43] <elukey>	 !log move varnishkafka instances in eqsin to PKI - T337825
[14:04:46] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:04:46] <stashbot>	 T337825: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825
[14:04:53] <elukey>	 all working as expected folks
[14:06:12] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (10BTullis) I've now made some good progress on a change to the datahub helm charts that I believe will help with progress on the upgrade. https://gerrit.wikimedia.org/r/c/operations/deployment-chart...
[14:06:23] <btullis>	 elukey: Great!
[14:06:51] <joal>	 \o/
[14:06:55] <joal>	 Thanks elukey
[14:07:50] <btullis>	 by the way folks, I made a change to wikibugs, so this IRC channel now gets updates from tickets tagged with #data-platform-sre in phabricator: https://gerrit.wikimedia.org/r/c/labs/tools/wikibugs2/+/930834
[14:09:16] <btullis>	 It means that we might start to see some updates here from bking and rkemper on search-platform related work, but it seemed the best thing to do in order to make sure that what we're doing is still visible here. Feel free to comment/revert if you have any suggestions.
[14:11:05] <wikibugs>	 10Quarry, 10superset.wmcloud.org, 10cloud-services-team (FY2022/2023-Q4): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (10Krinkle) I noticed something rather creepy in Superset. It seems to track by default every link and query you open, even if you merely click...
[14:16:44] <wikibugs>	 10Data-Engineering: Increase webrequest_sampled_live Druid datasource's retention - https://phabricator.wikimedia.org/T337460 (10JAllemandou) >>! In T337460#8946096, @Volans wrote: > My only question is if in the future it would be possible to evaluate ways to backfill data into druid in case of issues in the pi...
[14:21:06] <wikibugs>	 10Quarry, 10superset.wmcloud.org, 10cloud-services-team (FY2022/2023-Q4): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (10Snaevar) Looked at Recent activity on my account and saw nothing. That is despite me having around 30 saved queries. I have alpha and OAuth...
[15:02:58] <icinga-wm>	 RECOVERY - Check systemd state on analytics1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:10:03] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.0 - https://phabricator.wikimedia.org/T336286 (10mforns) Just stumbled on this task, I created the missing patch. It should work regardless of the version of Airflow that is running it. https://gitlab.wikimedia.org/repos/data-engineering/a...
[15:23:10] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.0 - https://phabricator.wikimedia.org/T336286 (10BTullis) Thanks so much @mforns - That looks excellent.
[15:31:45] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Alerting: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager - https://phabricator.wikimedia.org/T337052 (10mforns) >> In T337052#8865357, @mforns wrote: >> Another reason is that we hopefully(2) won't need to execu...
[15:37:14] <wikibugs>	 10Data-Platform-SRE, 10Data Pipelines (Sprint 14): Upgrade Presto to release that aligns with Iceberg 1.2.1 - https://phabricator.wikimedia.org/T337335 (10BTullis) @xcollazo - could you have another go please? I just tried to reproduce your error on an-test-client1001 and it's working for me.  Creating the ice...
[15:38:24] <wikibugs>	 10Data-Platform-SRE: The /srv volume is full on an-launcher1002 - https://phabricator.wikimedia.org/T339002 (10BTullis) 05Open→03Resolved
[15:46:38] <icinga-wm>	 RECOVERY - Check systemd state on analytics1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:32] <wikibugs>	 10Quarry, 10superset.wmcloud.org, 10cloud-services-team (FY2022/2023-Q4): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (10rook) >>! In T169452#8946824, @Krinkle wrote: > In the simplest case of a select query that needs no LIMIT, indeed the parenthesis can be om...
[16:04:34] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines (Sprint 14), 10Patch-For-Review: Setup config to allow lineage instrumentation - https://phabricator.wikimedia.org/T333004 (10JArguello-WMF)
[16:56:39] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Alerting: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager - https://phabricator.wikimedia.org/T337052 (10Ottomata) > But it also means that when an hourly partition fails refinement, it fails every time the parti...
[16:57:32] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (10Ottomata) > I wonder if the is_bad_rev_id field (...
[16:59:17] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Alerting: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager - https://phabricator.wikimedia.org/T337052 (10mforns) > There is a _FAILURE flag that gets written, and by default previous failures are excluded from th...
[17:36:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:31:23] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (10gmodena) mw-page-content-change-enrich has been deployed with Kubernetes HA (ConfigMaps) on staging. So far so good during routine a...
[18:38:08] <wikibugs>	 (03Abandoned) 10Aklapper: Service Worker to cache locally AQS data [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/302755 (https://phabricator.wikimedia.org/T138647) (owner: 10Nuria)
[18:56:22] <wikibugs>	 (03PS7) 10Gmodena: page_change: add a flag for missing revision data [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/929786 (https://phabricator.wikimedia.org/T309699)
[21:36:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed