[00:36:42] (SystemdUnitFailed) firing: monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:37:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:42] (SystemdUnitFailed) firing: monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:39:49] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) Expanding/adding the `AUTH_OIDC_SCOPE` doesn't seem to have had much impact on the SSO process, we are still getting the same error. ` 2023-08-28 14...
[05:26:23] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye
[06:04:52] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye completed: - an-worker1117 (**PASS**) - Removed from Puppet and Pu...
[06:09:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:10:25] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (Stevemunene) >>! In T332570#9116572, @Stevemunene wrote: > `an-worker1117` is stuck at install with an error no root filesystem is defined. Looking into this. Looking into this, we recently changed the partman...
[06:11:42] (SystemdUnitFailed) resolved: monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:17:16] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1118.eqiad.wmnet with OS bullseye
[06:17:35] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1119.eqiad.wmnet with OS bullseye
[06:46:11] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (SLyngshede-WMF) One issue we ran into with Gitlab also involved Gitlab not being able to locate OIDC attributes. This was as a result of how CAS returns the attri...
[06:57:18] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1118.eqiad.wmnet with OS bullseye completed: - an-worker1118 (**PASS**) - Downtimed on Icinga/Alertm...
[06:59:02] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1119.eqiad.wmnet with OS bullseye completed: - an-worker1119 (**PASS**) - Downtimed on Icinga/Alertm...
[07:01:07] Data-Platform-SRE, Research, WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (awight)
[07:07:26] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) >>! In T305874#9125628, @SLyngshede-WMF wrote: > One issue we ran into with Gitlab also involved Gitlab not being able to locate OIDC attributes. Thi...
[07:12:12] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1120.eqiad.wmnet with OS bullseye
[07:12:33] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1121.eqiad.wmnet with OS bullseye
[07:45:47] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) >> You can try switching the format to "FLAT" as with Gitlab, that might help datahub locate the attributes >> >> Example fro...
[07:51:44] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1120.eqiad.wmnet with OS bullseye completed: - an-worker1120 (**PASS**) - Downtimed on Icinga/Alertm...
[07:54:21] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1121.eqiad.wmnet with OS bullseye completed: - an-worker1121 (**PASS**) - Downtimed on Icinga/Alertm...
[07:58:03] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene)
[08:10:19] (PS2) Joal: Remove unused cassandra module [analytics/refinery/source] - https://gerrit.wikimedia.org/r/940154
[08:21:00] (PS1) Joal: Remove special KaiOS App checks from pageview def [analytics/refinery/source] - https://gerrit.wikimedia.org/r/953198 (https://phabricator.wikimedia.org/T344071)
[08:40:19] Data-Platform-SRE, Research, WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (awight) a:awight
[08:41:05] Data-Platform-SRE, Research, WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (awight)
[09:44:33] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1122.eqiad.wmnet with OS bullseye
[09:44:51] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1123.eqiad.wmnet with OS bullseye
[10:15:27] Analytics-Radar, Data-Engineering, MediaWiki-extensions-EventLogging, Wikimedia-production-error: Exception: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T286610 (phuedx) AFAICT this regression was introduced in {3c4021e9be27b0ecaeb131b71b0dc9ccb5600939}. >>! In...
[10:25:23] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1123.eqiad.wmnet with OS bullseye completed: - an-worker1123 (**PASS**) - Downtimed on Icinga/Alertm...
[10:27:23] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1122.eqiad.wmnet with OS bullseye completed: - an-worker1122 (**PASS**) - Downtimed on Icinga/Alertm...
[10:41:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@staging.service Failed on dbstore1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:52:10] !log Deploy airflow-dags/analytics
[10:52:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:01:23] !log Update mediawiki_history_check_denormalize airflow job variables to send job-reports to both data-engineering-alerts and product-analytics
[11:01:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:04:57] Data-Engineering, Product-Analytics: Email notifications of new MediaWiki history snapshot availabilty - https://phabricator.wikimedia.org/T344854 (JAllemandou) Code has been deployed and airflow-variable updated. The "product-analytics@wikimedia.org" list should receive an email in case of success or er...
[11:30:49] Data-Engineering, Data Pipelines (Sprint 14), Data Products (Sprint 00), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban): [SPIKE] Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (VirginiaPoundstone) @mforns and @Milimetri...
[12:04:41] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: mw-page-content-change-enrich: filter out events larger than max.request.size - https://phabricator.wikimedia.org/T342399 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data...
[12:12:25] heads up, dse-k8s-etcd1003 will briefly go down for a ganeti node reboot
[12:39:53] (CR) Phuedx: Add analytics/metrics_platform/{app,web}_click schemas (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[12:41:24] (PS4) Phuedx: Add analytics/metrics_platform/{app,web}_click schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[12:41:56] (CR) CI reject: [V: -1] Add analytics/metrics_platform/{app,web}_click schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[12:47:05] Data-Platform-SRE, Research, WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (awight) @BTullis @Stevemunene I'm homing in on https://analytics.wikimedia.org/published/datasets/one-off/ as a final resting place for this data set and wanted to ch...
[12:52:45] and dse-k8s-etcd1002
[12:52:46] (PS5) Phuedx: Add analytics/metrics_platform/{app,web}/click schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[13:00:12] Data-Engineering, Java-Scala-Standardization, Code-Health: Integrate SonarCloud analysis as part of the analytics refinery builds - https://phabricator.wikimedia.org/T258800 (Gehel)
[13:00:32] Data-Engineering, Java-Scala-Standardization, Code-Health: Cleanup Maven dependencies in analytics/refinery - https://phabricator.wikimedia.org/T258802 (Gehel)
[13:03:22] (CR) Phuedx: Add analytics/metrics_platform/{app,web}/click schemas (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[13:44:59] Data-Platform-SRE: Rework our gitlab runner on VPS Cloud - https://phabricator.wikimedia.org/T330915 (Gehel) Open→Declined This seems to be lacking too much context. If someone knows more details, please feel free to re-open!
[14:38:57] (PS1) Gerrit maintenance bot: Add tly.wikipedia to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/952972 (https://phabricator.wikimedia.org/T345170)
[14:41:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@staging.service Failed on dbstore1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:24] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1124.eqiad.wmnet with OS bullseye
[14:51:12] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1125.eqiad.wmnet with OS bullseye
[15:06:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:06:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:15:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:23:14] Data-Platform-SRE: Export Blazegraph JNL file from wdqs1009 - https://phabricator.wikimedia.org/T344732 (Gehel) Open→Resolved
[15:24:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:27:58] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1124.eqiad.wmnet with OS bullseye completed: - an-worker1124 (**PASS**) - Downtimed on Icinga/Alertm...
[15:30:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:54] (CR) Mforns: Add Metrics Platform fragments by platform only (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[15:31:16] (CR) Mforns: Add analytics/metrics_platform/{app,web}/click schemas (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[15:31:37] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (Gehel) Open→Resolved a:Gehel It looks like the federation is configured on our side, but ther...
[15:32:01] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1125.eqiad.wmnet with OS bullseye completed: - an-worker1125 (**PASS**) - Downtimed on Icinga/Alertm...
[15:36:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:39:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:41:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:46:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:51:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:54:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:56:43] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:00:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:43] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:01:57] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:06:43] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:09:35] Data-Engineering, Product-Analytics: Email notifications of new MediaWiki history snapshot availabilty - https://phabricator.wikimedia.org/T344854 (mpopov) Thank you, Joseph!
[16:10:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:11:43] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:15:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:43] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:24:16] (CR) Phuedx: Add Metrics Platform fragments by platform only (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[16:25:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:43] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:31:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:36:43] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:39:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:41:43] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:49:29] Data-Platform-SRE, Patch-For-Review: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (OSefu-WMF) HI @BTullis following up on the discussion from T344257#9114255. I'm still encountering issues running queries in Superset.
[16:51:43] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:55:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:56:43] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:00:42] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:01:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:06:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:15:52] (CR) Joal: "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/952972 (https://phabricator.wikimedia.org/T345170) (owner: Gerrit maintenance bot)
[17:16:39] (CR) Joal: [V: +2 C: +2] Add tly.wikipedia to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/952972 (https://phabricator.wikimedia.org/T345170) (owner: Gerrit maintenance bot)
[17:26:43] (SystemdUnitFailed) resolved: user@114.service Failed on stat1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:44:57] Data-Engineering, Product-Analytics, Data Engineering and Event Platform Team (Sprint 1): Email notifications of new MediaWiki history snapshot availabilty - https://phabricator.wikimedia.org/T344854 (JAllemandou)
[17:45:35] Data-Engineering, Product-Analytics, Data Engineering and Event Platform Team (Sprint 1): Email notifications of new MediaWiki history snapshot availabilty - https://phabricator.wikimedia.org/T344854 (JAllemandou) a:JAllemandou
[17:49:07] Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (Jclark-ctr)
[17:49:09] Data-Platform-SRE, DC-Ops, SRE, decommission-hardware, ops-eqiad: hw troubleshooting: ipmi down for wdqs1005.eqiad.wmnet - https://phabricator.wikimedia.org/T345081 (Jclark-ctr) Open→Resolved a:Papaul→Jclark-ctr performed flea power drain idrac connection came back
[17:51:43] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:52:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:06:43] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:11:39] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (Jclark-ctr)
[18:11:49] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (Jclark-ctr) stat1011 C 3 , U40 Port 40 Cableid 3750
[18:27:54] Data-Platform-SRE, Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (bking) I've added config and redeployed rdf-streaming-updater via helmfile on the DSE cluster, but I don't see any evidence that the...
[18:38:15] (CR) Gmodena: [C: +1] "LGTM. The removal of deprecated code seems in line with the desired behavior described in phab." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/953198 (https://phabricator.wikimedia.org/T344071) (owner: Joal)
[19:01:43] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:06:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:06:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:08:37] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (gmodena) @BTullis re the MW content enrichment job mentioned in the task: Should w...
[19:14:23] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (Jclark-ctr)
[19:15:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:16:24] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (Jclark-ctr) a:Jclark-ctr
[19:16:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:21:43] (SystemdUnitFailed) resolved: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:27:03] Data-Engineering, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Movement-Insights, and 2 others: Unique Devices seasonal trends on small projects - https://phabricator.wikimedia.org/T344381 (AKanji-WMF)
[20:40:42] (CR) Neil Shah-Quinn (WMF): [C: +1] Remove special KaiOS App checks from pageview def [analytics/refinery/source] - https://gerrit.wikimedia.org/r/953198 (https://phabricator.wikimedia.org/T344071) (owner: Joal)
[20:46:22] Data-Platform-SRE, Discovery-Search, Patch-For-Review: Create and publish new elastic dev image - https://phabricator.wikimedia.org/T344841 (CodeReviewBot) bking opened https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/49 elasticsearch: bump elastic plugins version
[20:48:08] Data-Platform-SRE, Discovery-Search, Patch-For-Review: Create and publish new elastic dev image - https://phabricator.wikimedia.org/T344841 (CodeReviewBot) bking closed https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/48 elasticsearch: Update wmf-elasticsearch-search-plugins
[21:04:42] Data-Platform-SRE, Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (bking) I can confirm that the Flink app is writing to [[ https://github.com/wikimedia/operations-deployment-charts/blob/master/helmf...
[23:12:12] (PS2) Jon Harald Søby: Remove text-transform:capitalize; and clean up capital letter use [analytics/wikistats2] - https://gerrit.wikimedia.org/r/928802