[00:02:29] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2100.codfw.wmnet with OS bullseye
[00:05:19] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2101.codfw.wmnet with OS bullseye
[00:26:35] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2098.codfw.wmnet with OS bullseye completed: - elastic2098 (**PASS**) - Re...
[00:26:53] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2096.codfw.wmnet with OS bullseye
[00:30:52] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Create 3 microsites for wdqs full graph, main graph, & scholarly articles - https://phabricator.wikimedia.org/T354658 (Dzahn)
[00:31:01] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2097.codfw.wmnet with OS bullseye
[00:34:32] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2099.codfw.wmnet with OS bullseye completed: - elastic2099 (**PASS**) - Re...
[00:40:38] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2100.codfw.wmnet with OS bullseye completed: - elastic2100 (**PASS**) - Re...
[00:42:15] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2101.codfw.wmnet with OS bullseye completed: - elastic2101 (**PASS**) - Re...
[00:49:07] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye
[00:57:36] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye
[01:03:51] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2096.codfw.wmnet with OS bullseye completed: - elastic2096 (**PASS**) - Re...
[01:08:07] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2097.codfw.wmnet with OS bullseye completed: - elastic2097 (**PASS**) - Re...
[01:28:30] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye
[02:06:08] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1104.eqiad.wmnet with OS bullseye
[02:09:23] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors: - elastic2094 (**FAI...
[02:09:29] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1105.eqiad.wmnet with OS bullseye
[02:12:55] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1106.eqiad.wmnet with OS bullseye
[02:17:43] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors: - elastic2088 (**FAI...
[02:18:29] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye
[02:41:37] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1104.eqiad.wmnet with OS bullseye completed: - elastic1104 (**PASS**) - Do...
[02:45:14] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1105.eqiad.wmnet with OS bullseye completed: - elastic1105 (**PASS**) - Do...
[02:48:45] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1106.eqiad.wmnet with OS bullseye completed: - elastic1106 (**PASS**) - Do...
[02:49:42] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye executed with errors: - elastic1103 (**FAI...
[03:35:32] (SystemdUnitFailed) firing: (8) user-runtime-dir@24065.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:38:45] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors: - elastic2094 (**FAI...
[04:40:38] PROBLEM - Check systemd state on an-worker1083 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:40:52] PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:46:02] PROBLEM - Hadoop NodeManager on an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:46:08] PROBLEM - Check systemd state on an-worker1107 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:49:40] RECOVERY - Check systemd state on an-worker1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:49:52] RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:50:02] PROBLEM - Check systemd state on an-worker1146 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:18] PROBLEM - Hadoop NodeManager on an-worker1146 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:02:12] RECOVERY - Check systemd state on an-worker1146 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:28] RECOVERY - Hadoop NodeManager on an-worker1146 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:10:12] RECOVERY - Hadoop NodeManager on an-worker1107 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:10:20] RECOVERY - Check systemd state on an-worker1107 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:53:12] PROBLEM - Check systemd state on an-worker1153 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:54:12] PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:19:44] RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:20:14] RECOVERY - Check systemd state on an-worker1153 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:35:32] (SystemdUnitFailed) firing: (8) user-runtime-dir@24065.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:18:48] Data-Engineering (Sprint 7): Fix refinery-source.refinery-core.Utilities::getValueForKey - https://phabricator.wikimedia.org/T355391 (JAllemandou)
[08:19:01] Data-Engineering (Sprint 7): Fix refinery-source.refinery-core.Utilities::getValueForKey - https://phabricator.wikimedia.org/T355391 (JAllemandou) p:Triage→Unbreak!
[09:00:41] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ...
[09:00:41] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[09:00:41] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[09:00:46] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[09:19:22] Data-Engineering, Data-Platform, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (awight) Please scrub wikitech docs for mentions of `mw.eventLog.dispatch`.
[09:50:23] Data-Engineering (Sprint 7), Patch-For-Review: Fix refinery-source.refinery-core.Utilities::getValueForKey - https://phabricator.wikimedia.org/T355391 (BTullis) This issue was discovered shortly after this patch was deployed: https://gerrit.wikimedia.org/r/c/operations/puppet/+/981352 It relates to this...
[09:56:29] Data-Engineering, Data-Platform, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (phuedx)
[10:10:03] could someone from the DE team comment what server groups/LDAP groups are suitable for the access requested at https://phabricator.wikimedia.org/T355395 ?
[10:10:44] moritzm: I'll take a look.
[10:14:02] cheers
[10:33:39] Data-Engineering (Sprint 7): Fix refinery-source.refinery-core.Utilities::getValueForKey - https://phabricator.wikimedia.org/T355391 (BTullis) The change that uncovered this behaviour in refinery has been temporarily reverted, so we should not receive any more of these for the time being.
[10:37:51] Analytics, Data-Engineering, Data-Engineering-Wikistats: No page views by country data for Turkey - https://phabricator.wikimedia.org/T355404 (Chidgk1)
[11:35:33] (SystemdUnitFailed) firing: (8) user-runtime-dir@24065.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:35:41] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ...
[12:35:41] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[12:35:42] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ...
[12:35:47] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[12:42:41] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ...
[12:42:41] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[12:47:11] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[12:47:11] (2) The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[13:00:53] is anyone aware of changes to kafka jumbo?
[13:01:01] we just got hit by " Caused by: org.apache.kafka.common.errors.RecordTooLargeException: The message is 6004688 bytes when serialized which is larger than 4194304, which is the value of the max.request.size configuration.", "ecs.version": "1.2.0","process.thread.name":"flink-akka.actor.default-dispatcher-14","log.logger":"org.apache.flink.runtime.dispatcher.StandaloneDispatcher"
[13:01:22] that looks like a regression of https://phabricator.wikimedia.org/T344688
[13:03:08] config in puppet looks like expected https://github.com/wikimedia/operations-puppet/blob/4a162a9f8a2151addc00bcc64b0407a458e9c8f5/hieradata/role/common/kafka/jumbo/broker.yaml#L7
[13:25:36] brouberol: ^
[13:45:53] Analytics, Data-Engineering, Data-Engineering-Wikistats, Data Products: No page views by country data for Turkey - https://phabricator.wikimedia.org/T355404 (lbowmaker)
[13:48:41] Data-Platform-SRE (2023/24 Q2 Milestone 1), Discovery-Search (Current work): Load Wikidata split graphs into test servers - https://phabricator.wikimedia.org/T350465 (Gehel)
[13:48:46] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 (Gehel)
[13:49:44] Data-Platform-SRE (2023/24 Q3 Milestone 1), serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (Gehel) Open→Resolved
[13:52:11] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ...
[13:52:11] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[13:52:41] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ...
[13:52:41] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[13:55:41] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ...
[13:55:41] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[13:58:11] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[13:58:11] (2) The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[13:59:16] (PS17) Btullis: Update to Superset version 3.1.0 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[14:02:47] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye
[14:03:11] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ...
[14:03:11] (2) The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[14:08:11] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[14:08:11] (3) The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[14:11:22] Data-Engineering (Sprint 7): [Event Platform] mw-page-content-change-enrich: increase producer max.request.size - https://phabricator.wikimedia.org/T355426 (gmodena)
[14:11:33] gehel brouberol ack. we followed up on slack. Looks like a client side issue. Phab at https://phabricator.wikimedia.org/T355426
[14:13:17] Data-Engineering (Sprint 7): [Event Platform] mw-page-content-change-enrich: increase producer max.request.size - https://phabricator.wikimedia.org/T355426 (brouberol) Reference slack conversation: https://wikimedia.slack.com/archives/C02291Z9YQY/p1705664308474179
[14:14:03] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1107.eqiad.wmnet with OS bullseye
[14:15:41] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ...
[14:15:41] mw_page_content_change_enrich in codfw is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[14:18:11] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ...
[14:18:11] (2) The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[14:19:41] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ...
[14:19:41] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[14:24:41] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ...
[14:24:41] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[14:34:41] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ...
[14:34:41] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[14:35:23] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye
[14:36:26] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ...
[14:36:26] (2) High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[14:37:40] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye completed: - elastic1103 (**PASS**) - Re...
[14:42:59] Data-Engineering (Sprint 7): [Event Platform] mw-page-content-change-enrich: increase producer max.request.size - https://phabricator.wikimedia.org/T355426 (gmodena) a:gmodena
[14:43:45] Data-Engineering (Sprint 7): [Event Platform] mw-page-content-change-enrich: increase producer max.request.size - https://phabricator.wikimedia.org/T355426 (gmodena)
[14:44:21] Data-Engineering (Sprint 7): [Event Platform] mw-page-content-change-enrich: increase producer max.request.size - https://phabricator.wikimedia.org/T355426 (gmodena)
[14:49:41] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ...
[14:49:41] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[14:50:33] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1107.eqiad.wmnet with OS bullseye completed: - elastic1107 (**PASS**) - Do...
[14:51:33] Data-Engineering (Sprint 7): [Event Platform] mw-page-content-change-enrich: increase producer max.request.size - https://phabricator.wikimedia.org/T355426 (gmodena) The service has been redeployed. Flink caught up (back pressure dropped), and kafka max consumer offset lag is down to 0.
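Note on the RecordTooLargeException quoted at 13:01:01 (T355426): the error compares the serialized record size with the producer's max.request.size. The sketch below only redoes that arithmetic with the two figures from the log. The helper names `fits_request` and `overshoot_bytes` are hypothetical, and the comparison is a simplification of whatever the Kafka client actually checks internally.

```python
# Figures quoted in the RecordTooLargeException at 13:01:01.
MAX_REQUEST_SIZE = 4_194_304   # producer max.request.size from the error (4 MiB)
MESSAGE_SIZE = 6_004_688       # serialized size of the rejected record


def fits_request(serialized_size: int, max_request_size: int) -> bool:
    """Hypothetical helper: True if a record of this size is within the limit."""
    return serialized_size <= max_request_size


def overshoot_bytes(serialized_size: int, max_request_size: int) -> int:
    """Bytes by which the record exceeds the limit (0 if it fits)."""
    return max(0, serialized_size - max_request_size)


if __name__ == "__main__":
    print(fits_request(MESSAGE_SIZE, MAX_REQUEST_SIZE))
    print(overshoot_bytes(MESSAGE_SIZE, MAX_REQUEST_SIZE))
```

With these numbers the record is roughly 1.8 MB over the limit, which is consistent with the task title asking for a larger producer max.request.size rather than a broker change.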
[14:53:41] Data-Engineering (Sprint 7): Fix refinery-source.refinery-core.Utilities::getValueForKey - https://phabricator.wikimedia.org/T355391 (Antoine_Quhen) a:Antoine_Quhen
[14:56:16] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (bking) a:bking
[14:56:19] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye
[15:06:51] Data-Engineering (Sprint 6), Patch-For-Review: [Data Quality] Adopt iceberg as the data quality metrics table backend - https://phabricator.wikimedia.org/T352687 (CodeReviewBot) gmodena merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/550 Webrequest: add metrics g...
[15:06:53] Data-Engineering (Sprint 7), Patch-For-Review: [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (CodeReviewBot) gmodena merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/550 Webrequest: add metrics generation and quality check alerting dag.
[15:06:56] Data-Engineering (Sprint 7), Patch-For-Review: [Data Quality] Move MetricsExporter to refinery-spark - https://phabricator.wikimedia.org/T352688 (CodeReviewBot) gmodena merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/550 Webrequest: add metrics generation and qua...
[15:06:58] Data-Engineering (Sprint 7), Patch-For-Review: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (CodeReviewBot) gmodena merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_re...
[15:33:02] 10Data-Engineering: db1208 analytics_meta replication broken - https://phabricator.wikimedia.org/T355435 (10Marostegui) [15:34:15] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): db1208 analytics_meta replication broken - https://phabricator.wikimedia.org/T355435 (10BTullis) p:05Triage→03High a:03BTullis [15:35:33] (SystemdUnitFailed) firing: (8) user-runtime-dir@24065.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:39:23] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): db1208 analytics_meta replication broken - https://phabricator.wikimedia.org/T355435 (10Marostegui) [15:45:24] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): db1208 analytics_meta replication broken - https://phabricator.wikimedia.org/T355435 (10BTullis) I know why this is and I caused it. I have been working on T335356 and trying to upgrade our superset-staging instance to Superset 3.x However, ther... [15:47:27] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): db1208 analytics_meta replication broken - https://phabricator.wikimedia.org/T355435 (10Marostegui) You have two ways of reverting this, one would be to create the table manually on the slave (it doesn't have to have the same structure and do it... [15:48:00] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): db1208 analytics_meta replication broken - https://phabricator.wikimedia.org/T355435 (10Marostegui) And I am specifically not pasting the skip transaction command just so it is clear we normally don't like it :) [15:48:25] 10Data-Engineering, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): db1208 analytics_meta replication broken - https://phabricator.wikimedia.org/T355435 (10BTullis) Great, thanks. 
I noted that from here: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Data_inconsistency_between_nodes_%22drift%22_/_...
[15:49:40] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1): db1208 analytics_meta replication broken - https://phabricator.wikimedia.org/T355435 (BTullis) I will try the first option that you mentioned. Thanks.
[15:51:47] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1): db1208 analytics_meta replication broken - https://phabricator.wikimedia.org/T355435 (Marostegui) Ah wait, you issued the DROP on the master right? So unless the command for the creation was issued with `IF NOT EXISTS`, then your only option i...
[15:52:44] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1): db1208 analytics_meta replication broken - https://phabricator.wikimedia.org/T355435 (Marostegui) So it would be: - Skip transaction (just 1) - Create the table manually - Replication will break again - Stop slave; start slave;
[15:57:26] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors: - elastic2088 (**FAI...
[15:57:38] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1): db1208 analytics_meta replication broken - https://phabricator.wikimedia.org/T355435 (BTullis) I think it's all OK now. {F41699955,width=50%} ` Slave_IO_Running: Yes Slave_SQL_Running: Yes Seconds_Behind_Master: 0 ` Thanks for your help.
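Editor's note: the three fields quoted in the message above (`Slave_IO_Running`, `Slave_SQL_Running`, `Seconds_Behind_Master`) are what MariaDB reports in `SHOW SLAVE STATUS` output. The sketch below shows one way such output could be checked once captured as `Key: Value` text. It is an illustration only, not the tooling used on db1208: the function names `parse_status` and `replication_healthy` and the `max_lag` parameter are hypothetical.

```python
def parse_status(text: str) -> dict[str, str]:
    """Parse 'Key: Value' lines (one field per line) into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing ':' survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def replication_healthy(text: str, max_lag: int = 0) -> bool:
    """Return True if both replication threads run and lag is <= max_lag seconds.

    max_lag is a hypothetical threshold chosen for this example.
    """
    fields = parse_status(text)
    # The lag field is not numeric (e.g. NULL) when replication is stopped.
    lag = fields.get("Seconds_Behind_Master", "NULL")
    return (
        fields.get("Slave_IO_Running") == "Yes"
        and fields.get("Slave_SQL_Running") == "Yes"
        and lag.isdigit()
        and int(lag) <= max_lag
    )


# Values taken from the status quoted in the log entry above.
sample = """
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
"""
print(replication_healthy(sample))  # True
```

A broken replica, where the SQL thread reports `No` and the lag is `NULL`, would make the function return `False`.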
[15:58:16] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1): db1208 analytics_meta replication broken - https://phabricator.wikimedia.org/T355435 (Marostegui) Excellent
[15:58:23] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1): db1208 analytics_meta replication broken - https://phabricator.wikimedia.org/T355435 (Marostegui) Open→Resolved
[16:12:15] (PS1) Xcollazo: Fix code serialization for MediawikiDumper.scala job. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/991795 (https://phabricator.wikimedia.org/T346278)
[16:16:39] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors: - elastic2094 (**FAI...
[16:27:55] (CR) Xcollazo: [C: -2] "As per the tests, this change does not address the serialization issue."
[analytics/refinery/source] - https://gerrit.wikimedia.org/r/990631 (https://phabricator.wikimedia.org/T346278) (owner: Jennifer Ebe)
[16:28:56] (Abandoned) Xcollazo: Fix scala job that publishes the XML dumps [analytics/refinery/source] - https://gerrit.wikimedia.org/r/990631 (https://phabricator.wikimedia.org/T346278) (owner: Jennifer Ebe)
[16:38:47] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye
[16:41:52] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye
[16:41:53] (CR) Aqu: [C: +1] "lgtm" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/991795 (https://phabricator.wikimedia.org/T346278) (owner: Xcollazo)
[16:46:56] (CR) Xcollazo: [C: +2] "Merging, thanks for the review aqu!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/991795 (https://phabricator.wikimedia.org/T346278) (owner: Xcollazo)
[16:58:46] (Merged) jenkins-bot: Fix code serialization for MediawikiDumper.scala job. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/991795 (https://phabricator.wikimedia.org/T346278) (owner: Xcollazo)
[17:04:32] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors: - elastic2088 (**FAI...
[17:06:55] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye
[17:52:09] (PS1) Aqu: [WIP] Rewrite x-analytics header hash parsing [analytics/refinery/source] - https://gerrit.wikimedia.org/r/991805
[18:02:08] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors: - elastic2094 (**FAI...
[18:27:12] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors: - elastic2088 (**FAI...
[18:53:48] Data-Engineering (Sprint 7), Patch-For-Review: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (CodeReviewBot) gmodena updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_r...
[19:35:33] (SystemdUnitFailed) firing: (8) user-runtime-dir@24065.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:45:49] Data-Platform-SRE (2023/24 Q3 Milestone 1): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye
[20:24:48] Analytics, Data-Engineering, Data-Engineering-Wikistats, Data Products: No page views by country data for Turkey - https://phabricator.wikimedia.org/T355404 (MusikAnimal) It is intentionally hidden for privacy reasons.
[20:27:35] Data-Engineering, Fundraising-Backlog, MediaWiki-Core-Tests, MediaWiki-extensions-CentralNotice, and 3 others: CentralNotice failing in browser test on master - https://phabricator.wikimedia.org/T354977 (Umherirrender) Found the two tests making everything fail and going to skip them in https://g...
[20:51:53] Analytics, Data-Engineering, Data-Engineering-Wikistats, Data Products: No page views by country data for Turkey - https://phabricator.wikimedia.org/T355404 (VirginiaPoundstone) Open→Declined Declining this task, because, as @MusikAnimal said, this information is intentionally hidden for...
[23:35:33] (SystemdUnitFailed) firing: (8) user-runtime-dir@24065.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed