[00:35:27] Data-Engineering, Product-Analytics, Wmfdata-Python: Suppress tracebacks for Kerberos errors - https://phabricator.wikimedia.org/T345219 (nshahquinn-wmf)
[01:37:42] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:38:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:27:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:37:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:03:24] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for tlywiki - https://phabricator.wikimedia.org/T345169 (Marostegui) The database has been sanitized, the `_p` database created and the grant added. The views addition can proceed now.
[06:03:36] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for tlywiki - https://phabricator.wikimedia.org/T345169 (Marostegui) p:Triage→Medium
[06:18:35] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1126.eqiad.wmnet with OS bullseye
[06:19:43] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1127.eqiad.wmnet with OS bullseye
[06:58:57] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1126.eqiad.wmnet with OS bullseye completed: - an-worker1126 (**PASS**) - Downtimed on Icinga/Alertm...
[07:01:07] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1127.eqiad.wmnet with OS bullseye completed: - an-worker1127 (**PASS**) - Downtimed on Icinga/Alertm...
[07:09:19] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1128.eqiad.wmnet with OS bullseye
[07:09:31] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye
[07:15:58] stevemunene: o/
[07:16:08] are you checking the hadoop metrics?
[07:16:20] I see some corrupt blocks reported by the 1002's namenode - https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=now-3h&to=now&viewPanel=39
[07:16:33] it may be a stale jmx metric, but we should check before proceeding
[07:16:37] o/ elukey
[07:16:38] do you know how?
[07:18:25] just the basics from https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks
[07:23:58] yeah fsck is good!
[07:24:16] we have seen this bug before, namely the standby namenode reporting some issues and the active not
[07:24:21] but better be sure :)
[07:24:50] when you operate on the cluster always check the namenode metrics in grafana every now and then, just to avoid surprises (my 2c)
[07:29:15] Thanks elukey , both report 0 CORRUPT files
[07:31:14] super then we are fine
[07:31:27] metrics will clear after the next restart probably
[07:37:57] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:44] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye completed: - an-worker1129 (**PASS**) - Downtimed on Icinga/Alertm...
[07:51:44] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1128.eqiad.wmnet with OS bullseye completed: - an-worker1128 (**PASS**) - Downtimed on Icinga/Alertm...
[08:01:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:02:42] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:28:44] Data-Platform-SRE, Research, WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (Gehel) @awight: @BTullis is out this week and we really want his input on this. Sorry for the delay, we'll try to move this forward early next week.
[08:30:40] Data-Platform-SRE, Research, WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (awight) @Gehel Thanks for the acknowledgement! There's no huge rush, waiting a week or two to hear back is fine. We know we'll publish the data //somewhere//, so we...
[08:40:41] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Notifications, and 3 others: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 (phuedx)
[08:49:48] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Notifications, and 3 others: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 (phuedx)
[08:50:11] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Notifications, and 3 others: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 (phuedx)
[09:00:20] Thank you elukey for keeping an eye on everything <3
[09:00:26] <3
[09:01:58] <3
[09:07:39] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Notifications, and 3 others: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 (phuedx)
[09:25:43] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Notifications, and 3 others: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 (phuedx) > [] Request the deletion of any previously logged data, if necessary Any previously–logge...
[11:49:57] heads up, dse-k8s-etcd1001 will briefly go down for a ganeti node reboot
[12:14:20] Hi stevemunene and btullis - We (aqu and myself) have an ops issue with the hadoop cluster due to the ongoing reimaging - Could you please join us? https://meet.google.com/mgr-ndbe-tka
[12:14:47] (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/950183 (https://phabricator.wikimedia.org/T344167) (owner: Phuedx)
[12:15:19] Hi joal joining in 2
[12:26:42] !log restart hadoop-yarn-nodemanager.service on an-worker1147
[12:26:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:41:07] !log disable puppet on an-worker1147 test hadoop-yarn log aggregation compression algorithm. The compression was set to gzip but should have been set to gz
[12:41:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:06:13] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (Hannah_Bast) Is it possible to configure Blazegraph to send the following Accept header: ` Accept: appli...
[13:13:00] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Notifications, and 3 others: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 (phuedx)
[13:13:20] Analytics-Kanban, Data-Engineering, Data Engineering and Event Platform Team, Fundraising-Backlog, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (phuedx)
[13:13:50] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Notifications, and 3 others: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 (phuedx) Open→Resolved a:phuedx Being **bold**.
[13:21:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:22:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:26:04] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Allow federated queries with the NLG endpoint (data.nlg.gr) - https://phabricator.wikimedia.org/T337296 (bking) @Epidosis sorry for the delay on this ticket. We've added your endpoint, can you please test it an...
[13:30:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:47:32] Data-Platform-SRE, Discovery-Search, Patch-For-Review: Create and publish new elastic dev image - https://phabricator.wikimedia.org/T344841 (CodeReviewBot) bking merged https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/49 elasticsearch: bump elastic plugins version
[13:51:41] Data-Platform-SRE, DC-Ops, ops-eqiad: wdqs1010 unreachable from SSH or DRAC - https://phabricator.wikimedia.org/T344518 (bking)
[14:06:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:08:56] !log restart hadoop-yarn-nodemanager.service on an-worker10[78-99].eqiad.wmnet in batches of 2 with 3 minutes in between
[14:08:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:10:36] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Allow federated queries with the NLG endpoint (data.nlg.gr) - https://phabricator.wikimedia.org/T337296 (Epidosis) Thanks! I checked with an easy one, https://w.wiki/7Mv3, and it fails due to Could not identify...
[14:22:52] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking) a:bking
[14:31:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:46:02] !log restart hadoop-yarn-nodemanager.service on an-worker11[00-28].eqiad.wmnet in batches of 2 with 3 minutes in between
[14:46:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:13:43] Data-Engineering, Data-Engineering-Jupyter, Data-Platform-SRE, Product-Analytics: Functionality to share & view notebooks - https://phabricator.wikimedia.org/T156934 (mpopov)
[15:13:50] (PS7) Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557)
[15:13:58] Data-Engineering, Data-Engineering-Jupyter, Product-Analytics: Internal nbviewer instance for sharing notebooks among 'wmf' and 'nda' members - https://phabricator.wikimedia.org/T290693 (mpopov) Open→Declined Actually this was solved (in a way) by T305082: https://www.mediawiki.org/wiki/GitLa...
[15:20:42] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Allow federated queries with the NLG endpoint (data.nlg.gr) - https://phabricator.wikimedia.org/T337296 (bking) Thanks for the quick response. It could very well be our fault, as [[ https://phabricator.wikimedia...
[15:21:27] (PS8) Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557)
[15:22:18] (CR) Clare Ming: Add Metrics Platform fragments by platform only (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[15:36:24] o/ joal aqu Done with the restarts on the reimaged hosts, moving on to the buster hosts that were not affected.
[15:37:11] Thank you!
[15:37:43] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (Jclark-ctr) dbstore1008 E 2. U 41. Port 38 Cableid 230304500161 dbstore1009. F 2. U 40. Port. 39 Cableid 230304500156
[15:43:07] !log restart hadoop-yarn-nodemanager.service on an-worker11[29-48].eqiad.wmnet in batches of 2 with 3 minutes in between
[15:43:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:22:24] joal aqu we are done with the restarts, all looks good but still monitoring.
[16:23:08] ack stevemunene - Thanks a lot!
[18:19:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:24:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:25:17] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: wdqs1010 unreachable from SSH or DRAC - https://phabricator.wikimedia.org/T344518 (RKemper) a:Papaul
[18:27:43] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: hw troubleshooting: wdqs1010 unreachable from SSH or DRAC - https://phabricator.wikimedia.org/T344518 (RKemper)
[18:28:23] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: mediawiki page_content_change should generate new meta.id field - https://phabricator.wikimedia.org/T341277 (CodeReviewBot) tchin merged https://gitlab.wikimedia.org/repos/data-engineering/med...
[18:47:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:52:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:14:24] (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/939265 (https://phabricator.wikimedia.org/T341888) (owner: KCVelaga)
[19:21:54] Data-Platform-SRE, Discovery-Search: Create and publish new elastic dev image - https://phabricator.wikimedia.org/T344841 (bking) Open→Resolved
[19:22:20] Data-Platform-SRE, Discovery-Search: Create and publish new elastic dev image - https://phabricator.wikimedia.org/T344841 (bking) This is complete, moving to "done".
[19:31:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:51:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:15:48] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (Papaul)
[21:15:54] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: hw troubleshooting: wdqs1010 unreachable from SSH or DRAC - https://phabricator.wikimedia.org/T344518 (Papaul) Open→Resolved @bking IDRAC and BIOS updated. All yours. As for 10/03/2023 the latest IDRAC version for R430 is iDRAC 2.84.84.84
[21:58:40] Analytics-Radar, Data-Engineering-Icebox, SRE, Traffic, and 2 others: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803 (Krinkle)
[21:58:59] Analytics-Radar, Data-Engineering-Icebox, SRE, Traffic, and 3 others: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803 (Krinkle)
[23:53:08] Data-Engineering, Movement-Insights, Product-Analytics, Research-Freezer: Investigate relation of UA deprecation to increase in automated traffic and reduction in unique devices - https://phabricator.wikimedia.org/T336715 (Mayakp.wiki) Decisions: With the assumption that we tag pre fetch requests...