[00:15:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:44] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:59:29] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): jsonschema-tools deterministic schema test should fail if a schema uses oneOf with different types - https://phabricator.wikimedia.org/T337855 (Ottomata) a:tchin
[01:20:06] Data-Engineering, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata)
[01:20:17] Data-Engineering-Planning, Event-Platform Value Stream, MW-1.40-notes (1.40.0-wmf.24; 2023-02-20), MW-1.41-notes (1.41.0-wmf.2; 2023-03-27), and 2 others: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client ... - https://phabricator.wikimedia.org/T286344
[01:20:19] Data-Engineering-Planning, Event-Platform Value Stream, serviceops: k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (Ottomata) Open→Resolved Ya dunno why we didn't!
[01:20:40] Data-Engineering-Planning, Patch-For-Review: Migrate pagecounts-ez generation to hadoop - https://phabricator.wikimedia.org/T192474 (Ottomata)
[01:21:21] Data-Engineering, Event-Platform Value Stream, MediaWiki-extensions-EventLogging: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (Ottomata)
[01:21:25] Data-Engineering, Event-Platform Value Stream, Metrics-Platform-Planning, Goal: BUOD-KR1-Q3: Require that all new schema/instruments are created with the MEP system - https://phabricator.wikimedia.org/T259157 (Ottomata) Open→Resolved a:Ottomata Being bold and declining.
[01:21:29] Data-Engineering, Event-Platform Value Stream, Metrics-Platform-Planning, Goal: BUOD-KR1-Q3: Require that all new schema/instruments are created with the MEP system - https://phabricator.wikimedia.org/T259157 (Ottomata) Resolved→Declined
[01:21:31] Data-Engineering, Event-Platform Value Stream, MediaWiki-extensions-EventLogging: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (Ottomata)
[01:22:38] Data-Engineering, Event-Platform Value Stream, Metrics-Platform-Planning, Goal: BUOD-KR1-Q3: Require that all new schema/instruments are created with the MEP system - https://phabricator.wikimedia.org/T259157 (Ottomata)
[01:24:01] Analytics-Radar, Data-Engineering, MediaWiki-extensions-EventLogging, Epic: Review and evolve client environment around EventLogging - https://phabricator.wikimedia.org/T240462 (Ottomata) Open→Resolved a:Ottomata Most relevant subtasks are resolved.
[01:24:58] Data-Engineering, Event-Platform Value Stream, Metrics-Platform-Planning, Goal: BUOD-KR1-Q3: Require that all new schema/instruments are created with the MEP system - https://phabricator.wikimedia.org/T259157 (Ottomata)
[01:28:45] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12), Patch-For-Review: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (Ottomata) Open→Resolved
[01:29:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:31:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:41:20] !log hadoop-yarn-resourcemanager restart for T317861
[05:41:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[05:41:23] T317861: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861
[05:49:03] Data-Engineering, DBA: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 (Ladsgroup) The binlogs I deleted for s1 were not that old but it's okay. s3 binlogs were quite old though. deleted most of them. I think templatelinks needs optimization in s3. I spot checked a wiki between differen...
[05:56:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:06:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:24:08] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B): eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (gmodena) a:gmodena
[08:16:23] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure (Q4 Wrap up): Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (BTullis) Sorry @stevemunene, but you've removed the wrong host from the cluster. Your patch removed an-worker1058 instead of analytics10...
[08:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:22:01] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data-enginee...
[08:22:13] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (CodeReviewBot)
[08:24:15] Data-Engineering, Data Pipelines (Sprint 14), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban): Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (kostajh) >>! In T336084#8904646, @Mayakp.wiki wrote: > Based on the findings of t...
[08:27:31] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure (Q4 Wrap up): Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (Stevemunene) Thanks for catching that @BTullis, issuing a revert of the submitted patch.
[08:43:48] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (gmodena) Hey @jnuche, Couple of questions re integrating this workflow in o...
[09:01:27] ok to install the latest Postgresql sec updates on the an-db hosts now?
[09:02:15] Erm. Can you wait a sec for an-db1001 please? It will cause a bump to airflow instances I think, so I'd rather watch closely.
[09:05:19] Data-Engineering, Data Pipelines, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: Fix wikimedia-event-utilities Guava dependencies issues - https://phabricator.wikimedia.org/T337421 (gmodena) a:gmodena
[09:05:45] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure (Q4 Wrap up): Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (BTullis) What I would do for this ticket is to create a new decom sub-ticket for each, using the form here: https://phabricator.wikimedia...
[09:06:59] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure (Q4 Wrap up): Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (BTullis) We can find the existing journal nodes using cumin like this: ` btullis@cumin1001:~$ sudo cumin A:hadoop-hdfs-journal 5 hosts wi...
[09:07:36] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (gmodena) @tchin an alternative path for coverage reporting could be integrating with https://gitlab.wikimedia.org/repos/releng/docpub/-/blob/main/README.md and lin...
[09:08:50] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure (Q4 Wrap up): Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (BTullis) In terms of choosing a replacement journal node, how about if we add another 10 to the value, as per the pattern of an-worker108...
[09:13:58] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure (Q4 Wrap up): Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (BTullis) Ah, I didn't think about the topology. We would like to make sure that it is row aware, in case we were to lose a whole row (lik...
[09:16:26] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure (Q4 Wrap up): Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (BTullis) So how about if we choose the first an-worker node that is in row E. which is an-worker1142. https://netbox.wikimedia.org/dcim/d...
[09:17:13] btullis: sure thing. just ping me when it works for you
[09:58:28] (PS9) Peter Fischer: Encode redirect targets in page change events. [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315)
[10:06:58] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:22:28] (PS10) Peter Fischer: Encode redirect targets in page change events. [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315)
[10:28:28] (PS11) Peter Fischer: Encode redirect targets in page change events. [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315)
[10:29:37] (CR) Peter Fischer: Encode redirect targets in page change events. (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[10:30:07] (CR) Peter Fischer: Encode redirect targets in page change events. (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[10:33:03] moritzm: Is 11:00 UTC (28 mins from now) OK for you? I'm asking people to let me know if there are any errors resulting from running Airflow jobs.
[10:34:51] sounds good to me
[10:40:30] https://usercontent.irccloud-cdn.com/file/GkeSbvaL/image.png
[10:40:44] * btullis Spammed everyone
[10:48:58] ack
[10:50:16] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (jnuche) > As a test, I wanted to trigger .docpub:publish-docs (derived) manua...
[11:00:08] all good to go?
[11:00:21] Yes please.
[11:00:35] and done
[11:01:03] for an-db1001, not sure if 1002 also needs some kind of syncup? if not I can also do that one next
[11:01:17] Nice. Thanks for your patience. You can do an-db1002 at any time.
[11:01:28] k, doing that now then
[11:02:27] No errors from the analytics jobs yet. Hopefully that's a good sign that the sqlalchemy behind airflow is happy to reconnect automatically.
[11:03:01] PROBLEM - Checks that the local airflow scheduler for airflow @platform_eng is working properly on an-airflow1004 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-platform_eng /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1004.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[11:03:12] Spoke too soon.
[11:05:25] !log restart airflow-scheduler service on an-launcher1002 for postgresql restart
[11:05:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:06:15] RECOVERY - Checks that the local airflow scheduler for airflow @platform_eng is working properly on an-airflow1004 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-platform_eng /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1004.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[11:06:30] !log restart airflow-scheduler service on an-launcher1004 for postgresql restart
[11:06:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:07:50] !log (correction) that should have read an-airflow1004 for platform_eng instance
[11:07:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:08:25] !log restart airflow-scheduler service on an-airflow1002 for research instance
[11:08:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:12:09] !log restart airflow-scheduler service on an-airflow1005 for search instance
[11:12:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:12:54] !log restart airflow-scheduler service on an-airflow1006 for product_analytics instance
[11:12:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:13:51] !log restart airflow-scheduler service on an-test-client1001 for analytics_test instance
[11:13:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:17:12] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-pyth...
[11:17:23] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (CodeReviewBot)
[11:29:15] !log service hadoop-yarn-resourcemanager restart for T317861
[11:29:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:29:21] T317861: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861
[11:59:43] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure (Q4 Wrap up): Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (Stevemunene)
[12:12:22] stevemunene: You're logging that hadoop-yarn-resourcemanager against the wrong ticket aren't you? Which host has the problem with yarn at the moment?
[12:18:12] That's my mistake, meant to restart the namenodes. No problems observed with yarn, btullis
[12:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[12:24:29] btullis: about T315426, do you need help in any way?
[12:26:51] (PS6) Nick Ifeajika: Remove all console logs Change 'timestamp' to 'dt' remove totals handler object [analytics/aqs] - https://gerrit.wikimedia.org/r/915678 (https://phabricator.wikimedia.org/T337059)
[12:30:19] (CR) CI reject: [V: -1] Remove all console logs Change 'timestamp' to 'dt' remove totals handler object [analytics/aqs] - https://gerrit.wikimedia.org/r/915678 (https://phabricator.wikimedia.org/T337059) (owner: Nick Ifeajika)
[12:47:06] Data-Engineering-Planning, Data-Platform-SRE: Deploy ceph radosgw processes to data-engineering cluster - https://phabricator.wikimedia.org/T330152 (JArguello-WMF)
[12:47:38] Data-Engineering, Event-Platform Value Stream: jsonschema-tools deterministic schema test should fail if a object field does not have schema - https://phabricator.wikimedia.org/T338228 (Ottomata)
[12:47:51] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): jsonschema-tools deterministic schema test should fail if a object field does not have schema - https://phabricator.wikimedia.org/T338228 (Ottomata)
[12:49:38] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): jsonschema-tools deterministic schema test should fail if a object field does not have schema - https://phabricator.wikimedia.org/T338228 (Ottomata)
[12:50:59] Data-Engineering-Planning, Data-Platform-SRE: Rebuild hive-hcatalog package for bullseye to address missing symlinks - https://phabricator.wikimedia.org/T337465 (JArguello-WMF)
[12:51:30] Data-Engineering, Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (JArguello-WMF)
[12:51:36] Data-Engineering-Planning, Data-Platform-SRE: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (JArguello-WMF)
[12:52:25] gehel: Yes, I think I could do with a hand to work out why I haven't seen any change here: https://phabricator.wikimedia.org/T315426#8903291
[12:53:00] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (JArguello-WMF)
[12:53:02] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (JArguello-WMF)
[12:53:15] btullis: I'm happy to be the rubber duck if that helps. Or I can try to find someone who actually understands those things
[12:55:57] I've almost finished my draft of the wikireplicas outage incident doc from yesterday: https://docs.google.com/document/d/1yo0pCpOSQ4waAPtWU06xMQUs7UMhM5wnAj2U4NOObj8/edit
[12:56:27] ...although I'm hoping that traffic might be able to add more detail about the pybal issue, then I'll look again at the wikireplicas themselves.
[13:01:22] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (Ottomata) Hm, when will we be publishing docs? I had assumed just on tag releases? The coverage is probably more useful on main?
[13:08:27] (CR) Ottomata: Encode redirect targets in page change events. (4 comments) [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[13:09:16] !log EventStreamConfig - temporarily Disable canary events and hadoop ingestion for development.network.probe stream - T332024
[13:09:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:09:20] T332024: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024
[13:13:38] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): jsonschema-tools tests - ensure that array items type is set - https://phabricator.wikimedia.org/T329515 (JArguello-WMF)
[13:22:44] (PS1) Jameel Kaisar: Fix: Add schema to ctx field of network probe schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/927669 (https://phabricator.wikimedia.org/T332024)
[13:23:29] (CR) CI reject: [V: -1] Fix: Add schema to ctx field of network probe schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/927669 (https://phabricator.wikimedia.org/T332024) (owner: Jameel Kaisar)
[13:26:36] (PS2) Jameel Kaisar: Fix: Add schema to ctx field of network probe schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/927669 (https://phabricator.wikimedia.org/T332024)
[13:26:37] Data-Engineering, Event-Platform Value Stream: mw-page-content-change-enrich should partition by and process by wiki_id,page_id - https://phabricator.wikimedia.org/T338169 (Ottomata)
[13:26:39] Data-Engineering, Event-Platform Value Stream: mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (Ottomata)
[13:29:18] (PS3) Jameel Kaisar: Fix: Add schema to ctx field of network probe schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/927669 (https://phabricator.wikimedia.org/T332024)
[13:49:51] Data-Engineering, Event-Platform Value Stream: mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (Ottomata)
[13:52:47] (PS12) Peter Fischer: Encode redirect targets in page change events. [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315)
[13:53:39] (CR) Peter Fischer: "Thanks, extracted redirect_target_page definition to definitions" [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[13:54:37] Data-Engineering-Planning, Event-Platform Value Stream, Epic: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (Ottomata)
[13:55:15] Data-Engineering: Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234 (MoritzMuehlenhoff)
[13:57:30] Data-Engineering, Event-Platform Value Stream: mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (Ottomata) a:gmodena→None
[13:57:51] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (Ottomata)
[13:58:13] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (Ottomata) a:gmodena
[14:02:31] (PS4) Jameel Kaisar: Fix: Add schema to ctx field of network probe schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/927669 (https://phabricator.wikimedia.org/T332024)
[14:11:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:12:03] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): jsonschema-tools deterministic schema test should fail if a object field does not have schema - https://phabricator.wikimedia.org/T338228 (JameelKaisar) This should also not be allowed. ` ctx: type: object additionalProperties: type:...
[14:18:21] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (Ottomata)
[14:28:45] gehel: I am now making good headway with the wikireplicas view work. No need for rubber duck after all \o/ Found a bug.
[14:29:05] I saw the comment :=
[14:44:02] (CR) CDanis: [C: +2] Fix: Add schema to ctx field of network probe schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/927669 (https://phabricator.wikimedia.org/T332024) (owner: Jameel Kaisar)
[14:44:33] (Merged) jenkins-bot: Fix: Add schema to ctx field of network probe schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/927669 (https://phabricator.wikimedia.org/T332024) (owner: Jameel Kaisar)
[15:07:06] !log deployed airflow analytics to try and fix the edit_hourly DAG again
[15:07:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:16:47] Data-Engineering, Data-Catalog, Product-Analytics: Propagate field descriptions from event schemas to Hive event tables - https://phabricator.wikimedia.org/T307040 (xcollazo) Just passing by to say that it would be nice to see this ticket happen. CC @lbowmaker.
[15:27:19] btullis, stevemunene, I cannot reach https://yarn.wikimedia.org/, I read the scrollback and I imagine it has to do with the comments above. Just to let you know!
[15:50:00] mforns: o/ It seems a little weird, in theory the yarn master shouldn't be on 1002
[15:52:52] !log restart yarn resourcemanager on an-master1002 to restore the Yarn UI (that works only when the active yarn RM is on 1001)
[15:52:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:53:11] mforns: should work now
[15:53:29] stevemunene: o/ see above --^ Remember to check the Yarn status after restarts :)
[15:56:52] Thanks for the spot, joal. I will keep that in mind, elukey, thanks.
[16:01:24] Data-Engineering-Planning, Epic, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (Ottomata)
[16:03:57] Data-Engineering-Planning, Event-Platform Value Stream, Epic: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (Ottomata)
[16:06:23] Data-Engineering-Planning, Event-Platform Value Stream: Refactor EventBus extension Hooks to use new hook system - https://phabricator.wikimedia.org/T320655 (Ottomata)
[16:07:08] Data-Engineering-Planning, Event-Platform Value Stream, Epic: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (Ottomata)
[16:07:33] Data-Engineering-Planning, Event-Platform Value Stream, Epic: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (Ottomata)
[16:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[16:22:18] aqu: We have stayed in the same meeting!
[16:22:39] stevemunene: I don't think that we need to restart the yarn resourcemanager now, when we include/exclude nodes. Maybe you found some outdated documentation.
[16:23:42] stevemunene: We have this now, so it refreshes automatically: https://github.com/wikimedia/operations-puppet/blob/production/modules/bigtop/manifests/hadoop/resourcemanager.pp#L69-L74
[16:25:54] Thanks btullis updating the docs to remove the restart statement
[16:26:09] Great, thanks.
[16:28:43] (CR) Milimetric: Remove all console logs Change 'timestamp' to 'dt' remove totals handler object (9 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/915678 (https://phabricator.wikimedia.org/T337059) (owner: Nick Ifeajika)
[16:37:18] (PS7) Nick Ifeajika: Remove all code relating to the totals data subset [analytics/aqs] - https://gerrit.wikimedia.org/r/915678 (https://phabricator.wikimedia.org/T337059)
[16:41:13] (CR) CI reject: [V: -1] Remove all code relating to the totals data subset [analytics/aqs] - https://gerrit.wikimedia.org/r/915678 (https://phabricator.wikimedia.org/T337059) (owner: Nick Ifeajika)
[16:58:51] Data-Engineering, Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (Stevemunene) The `rsync-published.service` is yet to function as expected on the stat host. ` Jun 06 16:15:08 stat1009 systemd[1]: Starting Rsync push to analytics.wikimedia.org web host... Jun 0...
[17:11:10] Data-Engineering, Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (BTullis) >>! In T336036#8906836, @Stevemunene wrote: > The `rsync-published.service` is yet to function as expected on the stat host. > ` > Jun 06 16:15:08 stat1009 systemd[1]: Starting Rsync pus...
[17:16:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:18:13] Data-Engineering, Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (Stevemunene) The service is up and running as expected, thanks @BTullis ` stevemunene@stat1009:/srv$ systemctl status rsync-published.service ● rsync-published.service - Rsync push to analytics...
[17:18:39] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:19:59] (PS8) Milimetric: Create new knowledge-gaps endpoint [analytics/aqs] - https://gerrit.wikimedia.org/r/915678 (https://phabricator.wikimedia.org/T337059) (owner: Nick Ifeajika)
[17:20:16] (CR) Milimetric: [V: +2 C: +2] "Well done" [analytics/aqs] - https://gerrit.wikimedia.org/r/915678 (https://phabricator.wikimedia.org/T337059) (owner: Nick Ifeajika)
[17:21:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:30:59] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:56:31] Analytics, AQS2.0, API Platform (AQS 2.0 Roadmap), Documentation, and 4 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (apaskulin)
[18:02:07] Data-Engineering, Data Pipelines, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: Fix wikimedia-event-utilities Guava dependencies issues - https://phabricator.wikimedia.org/T337421 (gmodena) a:gmodena→None
[18:04:04] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (CodeReviewBot) gmodena merged https://gitlab.wikimedia.org/repos/data-enginee...
[18:45:12] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (CodeReviewBot) gmodena updated https://gitlab.wikimedia.org/repos/data-engine...
[18:51:39] elukey: thank you! it works for me now :]
[18:51:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:02:24] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-pyth...
[19:06:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:14:25] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (dancy) >>! In T337400#8901472, @tchin wrote: > Ok so just recounting my experiments: > > I used a build flag to copy the insides of the kokkuri container into the... [19:30:56] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (gmodena) Project documentation is available at https://doc.wikimedia.org/dat... [19:32:57] (CR) Ottomata: Encode redirect targets in page change events. (4 comments) [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer) [19:50:29] (CR) Ottomata: "Thanks for struggling with this and being patient with my comments. I'd like to take a try at getting this right, if you don't mind? You " [schemas/event/primary] - https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer) [19:56:46] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (tchin) >>! In T337400#8907322, @dancy wrote: > > Can you point me to the job output showing the hang? https://gitlab.wikimedia.org/repos/data-engineering/mediawi... 
[20:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:48:25] (CR) Milimetric: simplify totals query and write it to the same destination table as by_category (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059) (owner: Nick Ifeajika) [20:55:39] Data-Engineering, Data Pipelines (Sprint 14), Patch-For-Review: Airflow job to load Knowledge Gap metrics into Cassandra - https://phabricator.wikimedia.org/T337060 (CodeReviewBot) milimetric opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/420 Add knowledge_g... [20:55:50] Data-Engineering, Data Pipelines (Sprint 14), Patch-For-Review: Airflow job to load Knowledge Gap metrics into Cassandra - https://phabricator.wikimedia.org/T337060 (CodeReviewBot) [23:06:44] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed