[01:15:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:28] (SystemdUnitFailed) firing: (19) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:23:28] (SystemdUnitFailed) firing: (19) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:25:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:33:15] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.684 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[03:44:13] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.9047 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[05:23:31] (SystemdUnitFailed) firing: (19) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:48:28] (SystemdUnitFailed) firing: (20) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:27:19] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (JMeybohm) The logging pipeline on k8s is described here https://wikitech.wikimedia.org/wiki/Kubernetes/L...
[12:48:31] (SystemdUnitFailed) firing: (20) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:56:07] (PS3) Snwachukwu: Migrate pageview druid load hql queries to Airflow [analytics/refinery] - https://gerrit.wikimedia.org/r/910520 (https://phabricator.wikimedia.org/T334104)
[13:00:14] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12): Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (tchin) a:tchin
[13:35:06] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (Ottomata) Okay, so, from reading those docs and looking at app logs in logstash, e.g. [[ https://logstas...
[13:41:51] Data-Engineering-Planning, Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (Ottomata) @DDeSouza done.
[14:10:13] Data-Engineering-Planning, Event-Platform Value Stream, Machine-Learning-Team, Patch-For-Review: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (Ottomata) Deployed stream config. It should now be possible to produce events to the `mediawiki.page_outl...
[14:10:39] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 12), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata)
[14:16:36] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 12), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata) @JMeybohm I'd like to proceed, but first we need to create the flink-operator namespace in staging-eqiad a...
[14:17:23] (PS4) Snwachukwu: Migrate pageview druid load hql queries to Airflow [analytics/refinery] - https://gerrit.wikimedia.org/r/910520 (https://phabricator.wikimedia.org/T334104)
[14:20:01] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 12), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (JMeybohm) You will need to add the namespace like you did in DSE (https://gerrit.wikimedia.org/r/c/operations/deploy...
[15:01:56] Data-Engineering, Event-Platform Value Stream: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (Ottomata)
[15:03:15] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Ottomata) > Please, file another ticket for that discussion and keep this ticket for future reference for the user_is_temp column and any discussion related to it. @JayC...
[15:04:10] Data-Engineering, Data-Persistence, Event-Platform Value Stream, IP Masking: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (Ottomata)
[15:04:22] Data-Engineering, Data-Persistence, Event-Platform Value Stream, IP Masking, Platform Engineering: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (Ottomata)
[15:08:36] Data-Engineering, Data-Persistence, Event-Platform Value Stream, IP Masking, Platform Engineering: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (Ottomata) I'm particularly interested in this decision because Event Platform is about to 'release' a new [[ https://phabricator...
[15:34:45] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (LSobanski)
[15:36:46] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (JArguello-WMF)
[15:37:28] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (JArguello-WMF)
[15:45:26] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (Ottomata) @tchin I think we should move forward with this python ECS logging as planned. At the very le...
[15:57:58] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (colewhite) >>! In T335802#8833411, @Ottomata wrote: > It'd be nice to have our ECS log messages consumed...
[16:31:09] Data-Engineering-Planning: Data Engineering Pairing system - https://phabricator.wikimedia.org/T327790 (JArguello-WMF)
[16:33:28] Data-Engineering-Planning, SRE-swift-storage, Event-Platform Value Stream (Sprint 12): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Eevans) After discussions with serviceops about use of Persistent Volume Clai...
[16:35:27] Data-Engineering-Planning, SRE-swift-storage, Event-Platform Value Stream (Sprint 12): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Ottomata) Okay thank you!
[16:48:31] (SystemdUnitFailed) firing: (20) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:54:01] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (Ottomata) @colewhite we are logging in ECS format for Java logs, so the `log` field from docker json-fil...
[16:54:58] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (Ottomata) > We extract the structured logs from the "log"field when shipped by rsyslog input-file @tchin...
[17:00:39] Data-Engineering, Event-Platform Value Stream: Enable HA failover for flink-kubernetes-operator - https://phabricator.wikimedia.org/T336185 (Ottomata)
[17:31:16] Data-Engineering, SRE, ops-eqiad, Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster executed with errors: - an-wo...
[18:11:08] Data-Engineering, Event-Platform Value Stream: Enable HA failover for flink-kubernetes-operator - https://phabricator.wikimedia.org/T336185 (Ottomata) Hm, looks like i'm going to need to add (or enable?) some perms to manage leases in the flink-operator namespace: ` User \"system:serviceaccount:flink-op...
[19:09:17] Data-Engineering, Data-Persistence, Event-Platform Value Stream, IP Masking, Platform Engineering: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (Tgr) One thing to note is that some of these user types (temp users, system users, normal users) are stored in both the user tab...
[19:12:36] Data-Engineering, Event-Platform Value Stream (Sprint 12): Enable HA failover for flink-kubernetes-operator - https://phabricator.wikimedia.org/T336185 (Ottomata)
[19:18:26] Data-Engineering, Event-Platform Value Stream: eventutilities-python EventProcessFunction throws NPE if user func returns None - https://phabricator.wikimedia.org/T335706 (Ottomata)
[19:54:36] Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (Ottomata) In https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/914867, @pfischer is mod...
[20:19:46] Data-Engineering, Data-Persistence, Event-Platform Value Stream, IP Masking, Platform Engineering: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (daniel) I support the effort to introduce the concept of a user type (or, more appropriately, as Tgr pointed out, actor type) in...
[20:31:00] Data-Engineering, Event-Platform Value Stream (Sprint 12): Enable HA failover for flink-kubernetes-operator - https://phabricator.wikimedia.org/T336185 (Ottomata) Asking flink user mailing list. https://lists.apache.org/thread/yq89jm0szkcodfocm5x7vqnqdmh0h1l0
[20:31:21] (CR) Kimberly Sarabia: "This is a follow-up ticket to 911412 to add the skin field to `editatteamptstep` and `mediawiki_web_ui_scroll`" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/916625 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[20:33:10] (PS1) Xcollazo: Add iceberg version of referrer_daily table. [analytics/refinery] - https://gerrit.wikimedia.org/r/917404 (https://phabricator.wikimedia.org/T335305)
[20:48:31] (SystemdUnitFailed) firing: (20) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:56:01] (CR) Clare Ming: "I would separate these patches per schema" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/916625 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[21:19:27] Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (pfischer) Currently the `link_target` entity does map to MW's `LinkTarget`. It is related to t...
[21:22:10] !log deployed airflow analytics for a quick fix
[21:22:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[23:23:16] Data-Engineering, Product-Analytics: Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (kzimmerman) p:Triage→High