[01:52:09] RECOVERY - Check systemd state on analytics1067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:46] (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:18:39] Data-Engineering, Research: Adding data from centralauth to the lake and the mediawiki_history dataset - https://phabricator.wikimedia.org/T282657 (Pablo) Thank you @leila for checking. This action is no longer necessary, so I will proceed to close the ticket,
[07:18:55] Data-Engineering, Research: Adding data from centralauth to the lake and the mediawiki_history dataset - https://phabricator.wikimedia.org/T282657 (Pablo) Thank you @leila for checking. This action is no longer necessary, so I will proceed to close the ticket.
[07:19:41] Data-Engineering, Research: Adding data from centralauth to the lake and the mediawiki_history dataset - https://phabricator.wikimedia.org/T282657 (Pablo) Open→Resolved
[08:01:33] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: EventBus should set dt fields with greater precision than second - https://phabricator.wikimedia.org/T340067 (gmodena) > It looks like doing this for any timestamps that are provided by MediaWiki to EventBus will be more d...
[08:01:46] (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:34:13] stevemunene: o/ let's downtime the decom nodes, multiple alerts are firigin
[08:34:16] *firign
[09:28:35] Data-Platform-SRE: Restart buster query service hosts (wdqs/wcqs) to apply java8 sec upgrades - https://phabricator.wikimedia.org/T340482 (MoritzMuehlenhoff) >>! In T340482#8968619, @RKemper wrote: > This should be done, but I haven't yet ran a validation command to sanity check that the correct version is i...
[09:42:46] elukey: Steve is off today. I'll downtime them.
[09:45:08] super thanks!
[09:46:55] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0453fd24-8db4-4ba7-9753-ae2833e9b5fb) set by btullis@cumin1001 for 7 days, 0:00:00 on 8 host(s)...
[09:48:18] I've just realised, today is my two-year wikiversary \o/
[09:48:57] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (gmodena)
[09:49:47] btullis: wow!
[09:51:57] Time to party hard :-)
[09:52:12] btullis: congrats! :)
[09:55:02] Cheers. Might take myself out to lunch to celebrate.
[09:56:33] Happy wikiversary btullis!
[09:57:37] \ (•◡•) / Thanks aqu
[10:29:37] Data-Platform-SRE, Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (BTullis) p:Triage→Medium
[10:29:47] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (BTullis) p:Triage→High
[10:30:41] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (BTullis) Open→Resolved
[10:30:48] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis)
[10:35:28] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 14 B): [NEEDS GROOMING] Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (gmodena)
[10:43:43] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (gmodena) > It's ok for staging, but for prod deployments we would need HA state To clarify: we should be able to recover even if Co...
[11:41:32] (PS1) Btullis: Use the latest version of the create_indices script [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/933890 (https://phabricator.wikimedia.org/T329514)
[11:45:17] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) a:Stevemunene→BTullis I'm going to look into this task because the cookbook is failing. For some reason, an-test-worker1003 isn't booti...
[11:46:12] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) p:Triage→High
[12:08:41] (CR) Btullis: [C: +2] Use the latest version of the create_indices script [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/933890 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[12:10:37] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (MoritzMuehlenhoff) >>! In T329363#8971253, @BTullis wrote: > I'm going to look into this task because the cookbook is failing. For some reason, an-test-...
[12:13:02] (CR) CI reject: [V: -1] Use the latest version of the create_indices script [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/933890 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[12:13:42] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) >>! In T329363#8971308, @MoritzMuehlenhoff wrote: >>>! In T329363#8971253, @BTullis wrote: >> I'm going to look into this task because the cook...
[12:14:27] (CR) Btullis: [C: +2] "recheck" [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/933890 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[12:33:47] (CR) CI reject: [V: -1] Use the latest version of the create_indices script [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/933890 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[12:35:48] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (MoritzMuehlenhoff) >>! In T329363#8971313, @BTullis wrote: > Oh thanks! It falls back to the installed OS. I'll do as you suggest with the sre.hardware....
[13:00:08] (CR) Btullis: [C: +2] "recheck" [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/933890 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[13:05:43] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (JArguello-WMF)
[13:08:55] !log upgrading idrac firmware of an-test-worker1003 via the cookbook for T329363
[13:08:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:08:58] T329363: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363
[13:11:21] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Use ECS logging fields when adding extra info to mediawiki-event-enrichment - https://phabricator.wikimedia.org/T337399 (Ottomata)
[13:13:27] (CR) CI reject: [V: -1] Use the latest version of the create_indices script [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/933890 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[13:17:27] Data-Platform-SRE, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (jnuche)
[13:25:09] * btullis !log upgrading an-test-worker1003 to bullseye, after upgrading firmware
[13:25:14] !log upgrading an-test-worker1003 to bullseye, after upgrading firmware
[13:25:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:26:25] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye
[13:29:42] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye execu...
[13:29:47] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye
[13:29:50] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed...
[13:30:29] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye
[13:46:59] Data-Engineering: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (mforns)
[13:54:34] Data-Engineering: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (AndrewTavis_WMDE) Thank you, @mforns! 🙏 Looking forward to having this up and running :)
[13:56:34] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: EventBus should set dt fields with greater precision than second - https://phabricator.wikimedia.org/T340067 (xcollazo) >>! In T340067#8970553, @gmodena wrote: > What are cases where millisecond fractions matter to Event P...
[14:08:24] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed...
[14:08:38] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye
[14:43:58] (PS16) Nick Ifeajika: Add query to load data from knowledge_gap.content_gap_metrics to cassandra [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059)
[15:03:01] (CR) Btullis: [C: +2] "recheck" [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/933890 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[15:08:31] Data-Engineering-Planning, Event-Platform Value Stream, Epic: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (Ottomata)
[15:13:28] Data-Engineering, Event-Platform Value Stream, serviceops: Flink k8s operator in staging sometimes will not sync changes to FlinkDeployments - https://phabricator.wikimedia.org/T340059 (Ottomata) p:Triage→High
[15:14:26] Data-Engineering-Planning, Event-Platform Value Stream: [Shared Event Platform][NEEDS GROOMING] should we guarantee ordering in Mediawiki Stream Enrichment? - https://phabricator.wikimedia.org/T311603 (Ottomata) Open→Resolved a:Ottomata This should be done, even though we process in async, we...
[15:14:38] Data-Engineering-Planning, Event-Platform Value Stream: [Shared Event Platform][NEEDS GROOMING] should we guarantee ordering in Mediawiki Stream Enrichment? - https://phabricator.wikimedia.org/T311603 (Ottomata) a:Ottomata→gmodena
[15:15:22] Data-Engineering-Planning, Event-Platform Value Stream: Flink Restart Strategy for Enrichment Service - https://phabricator.wikimedia.org/T325359 (Ottomata) Open→Invalid Handled by flink operator and in config/documentation
[15:15:53] Data-Engineering, Event-Platform Value Stream: jsonschema-tools tests should fail if schema $id does not match title or path - https://phabricator.wikimedia.org/T300404 (Ottomata) a:tchin
[15:15:55] (CR) CI reject: [V: -1] Use the latest version of the create_indices script [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/933890 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[15:15:58] Data-Engineering, Event-Platform Value Stream: jsonschema-tools tests should fail if schema $id does not match title or path - https://phabricator.wikimedia.org/T300404 (Ottomata) p:Triage→Medium
[15:21:24] Data-Engineering, Event-Platform Value Stream: mw-page-content-change-enrich should partition by and process by wiki_id,page_id - https://phabricator.wikimedia.org/T338169 (Ottomata) p:Triage→Medium
[15:23:36] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed...
[15:24:37] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye
[15:33:54] Data-Engineering, Discovery-Search, Event-Platform Value Stream: Flink Enrichment job alerting - https://phabricator.wikimedia.org/T340666 (Ottomata)
[15:34:09] Data-Engineering, Discovery-Search, Event-Platform Value Stream: Flink Enrichment job alerting - https://phabricator.wikimedia.org/T340666 (Ottomata) p:Triage→Medium
[15:34:20] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) @MoritzMuehlenhoff this host is still not booting into PXE. I've updated the BIOS, iDrac, and NIC to the latest versions. {F37122172,width=40...
[15:35:04] Data-Engineering-Planning, Event-Platform Value Stream, Epic: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (Ottomata)
[15:44:55] Data-Engineering-Planning, Event-Platform Value Stream: Refactor EventBus extension Hooks to use new hook system - https://phabricator.wikimedia.org/T320655 (Ottomata) a:Ottomata→None
[16:46:11] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed...
[16:49:42] Data-Platform-SRE, decommission-hardware: Decommission an-test-coord1002 - https://phabricator.wikimedia.org/T336062 (BTullis) a:BTullis
[16:50:53] Data-Platform-SRE, decommission-hardware: Decommission an-test-coord1002 - https://phabricator.wikimedia.org/T336062 (BTullis)
[16:56:14] Data-Platform-SRE, decommission-hardware, Patch-For-Review: Decommission an-test-coord1002 - https://phabricator.wikimedia.org/T336062 (BTullis)
[16:57:48] (CR) Btullis: [C: +2] "recheck" [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/933890 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[17:05:31] Analytics-Radar, Data-Engineering-Icebox, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (fkaelin) I agree @leila, we can close this as resolved. There are already follow-up tasks for historical dumps (T333419) and making them available on DE...
[17:15:05] Data-Platform-SRE, decommission-hardware: Decommission an-test-coord1002 - https://phabricator.wikimedia.org/T336062 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by btullis@cumin1001 for hosts: `an-test-coord1002.eqiad.wmnet` - an-test-coord1002.eqiad.wmnet (**WARN**) - Downtimed host...
[17:17:08] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (MoritzMuehlenhoff) Adding @Papaul Does that ring a bell? I think we had some systems recently where we could not use the most recent NIC firmware, but w...
[17:17:58] (Merged) jenkins-bot: Use the latest version of the create_indices script [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/933890 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[17:19:48] Data-Platform-SRE, decommission-hardware, ops-eqiad, Patch-For-Review: Decommission an-test-coord1002 - https://phabricator.wikimedia.org/T336062 (BTullis) a:BTullis→Jclark-ctr
[17:21:02] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis)
[17:52:14] Analytics-Radar, Data-Engineering-Icebox, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (leila) Open→Resolved
[18:00:00] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: EventBus should set dt fields with greater precision than second - https://phabricator.wikimedia.org/T340067 (Ottomata) Open→Declined I'm declining this task. I realized that we let eventgate handle setting meta.d...
[18:05:12] Data-Engineering, Product-Analytics: TikTok referral data gaps - https://phabricator.wikimedia.org/T340677 (Isaac)
[21:22:30] Data-Engineering-Planning, Event-Platform Value Stream, Patch-For-Review: Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 (Ottomata) > could implement Draft-3 required field support in eventutilities-core...