[00:16:42] (SystemdUnitFailed) firing: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:19:32] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:18] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:42] (SystemdUnitFailed) resolved: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:42:58] PROBLEM - puppet last run on an-worker1145 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[04:53:42] (SystemdUnitFailed) firing: kube-controller-manager.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:58:42] (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:04:12] (SystemdUnitFailed) firing: (2) kube-controller-manager.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:09:12] (SystemdUnitFailed) resolved: (2) kube-controller-manager.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:02:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:04:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:25] Data-Platform-SRE, decommission-hardware: decommission analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T341209 (Stevemunene)
[08:29:25] Data-Engineering, Content-Transform-Team, Event-Platform: [session length] Investigate slight drop at sessions of 30 minutes or more - https://phabricator.wikimedia.org/T280254 (Aklapper)
[08:29:48] Data-Engineering, Content-Transform-Team, Event-Platform: [session length] Change domain of event collection to avoid ad-blocker issue - https://phabricator.wikimedia.org/T280256 (Aklapper)
[08:34:35] Data-Engineering-Planning, Data Engineering and Event Platform Team, Data Pipelines: [Iceberg] Migrate event_sanitized_iceberg to event_sanitized - https://phabricator.wikimedia.org/T311737 (Aklapper) Please do add also codebase project tags to tasks and not only team tags as WMF loves to change teams.
[08:42:37] FYI; I'm doing a rolling restart of aqs to pick up the c-ares security updates
[08:43:06] Ack, many thanks. Of the cassandra services?
[08:43:22] no, just the node-based aqs service itself
[08:43:31] OK, thanks.
[08:43:34] node uses c-ares heavily under the hood
[08:45:49] Gotcha, thanks.
[08:53:27] Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis)
[09:24:05] dse-k8s-etcd1001 will briefly go down for a Ganeti reboot
[09:24:27] moritzm: 👍 thx
[09:32:42] (SystemdUnitFailed) firing: nagios-nrpe-server.service Failed on dse-k8s-etcd1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:37:42] (SystemdUnitFailed) resolved: nagios-nrpe-server.service Failed on dse-k8s-etcd1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:38:31] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (BTullis) I've noticed that the production and staging instances of datahub share a single schema registry, namely `karapace1001.eqiad.wmnet:8081` I think that this is likely to cause issue for us...
[10:30:33] Data-Engineering, Event-Platform (Sprint 14 B), Patch-For-Review: mediawiki-event-enrichment taskmanager crashes at startup - https://phabricator.wikimedia.org/T341096 (CodeReviewBot) gmodena merged https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/-/merge_requests/77 Add sch...
[10:42:33] btullis: o/
[10:42:40] I am going to deploy eventgate-main
[10:43:16] elukey: OK, is this a new schema?
[10:45:18] elukey: Because if it is, this step might not be necessary any more, since Andrew did this: https://phabricator.wikimedia.org/T340166
[10:46:06] btullis: ah nono sorry it is a kafka queueing settings
[10:46:34] elukey: OK, still cool by me :-)
[10:47:42] ack thanks! :)
[10:49:48] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (BTullis) Oh, now we have a really useful error from the kafka-setup job. ` Error while executing config command with args '--command-config /tmp/connection.properties --bootstrap-server kafka-test...
[10:53:46] I'm about to fail back the hadoop namenode service from an-master1002 to an-master1001
[10:53:51] https://www.irccloud.com/pastebin/04zCSN4K/
[10:55:56] !log `sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet` on an-master1001
[10:55:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:56:47] https://www.irccloud.com/pastebin/dKmh3ews/
[11:15:14] Analytics, Data-Engineering-Icebox: Create a tool checking HDFS data size - https://phabricator.wikimedia.org/T256644 (JAllemandou)
[11:16:17] Data-Engineering: spark3 in yarn master mode exhibits warnings when the HDFS namenodes are in the failed over state - https://phabricator.wikimedia.org/T338137 (ntsako) I also experienced similar issues to the one above. I was running my Spark application on `stat1004` via Airflow using the `analytics-priva...
[11:16:22] Analytics, Data-Engineering-Icebox: Create a tool checking HDFS data size - https://phabricator.wikimedia.org/T256644 (JAllemandou) Hi @Gopavasanth, I updated the task description and title. Let me know if you wish more details!
[11:46:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[11:51:51] (HdfsFSImageAge) resolved: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[11:55:44] Data-Engineering, Event-Platform (Sprint 14 B), Patch-For-Review: mediawiki-event-enrichment taskmanager crashes at startup - https://phabricator.wikimedia.org/T341096 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_requests/66 ev...
[12:21:48] moritzm: Are you happy for me to go ahead and create a new karapace VM: https://phabricator.wikimedia.org/T341464
[12:43:16] Data-Engineering, Data Engineering and Event Platform Team (Sprint 0), Event-Platform (Sprint 14 B): jsonschema-tools test should fail if fields are removed in new (non major) version - https://phabricator.wikimedia.org/T340765 (tchin)
[12:55:53] btullis: sure thing, can you use group A?
[12:56:02] it's the least used currently
[13:10:17] Great, thanks.
[13:11:09] Data-Platform-SRE, SRE, vm-requests: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (BTullis)
[13:19:43] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (BTullis) Initial testing of the internal schema registry for datahub didn't work very well, so rather than proceeding with that right now I'm going to create a second karapace instance in {T341464...
[13:20:36] Data-Engineering, Event-Platform (Sprint 14 B), Patch-For-Review: mediawiki-event-enrichment taskmanager crashes at startup - https://phabricator.wikimedia.org/T341096 (CodeReviewBot) gmodena merged https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_requests/66 ev...
[13:26:00] Data-Platform-SRE, decommission-hardware: decommission analytics1058.eqiad.wmnet - https://phabricator.wikimedia.org/T338227 (Stevemunene)
[13:26:32] Data-Platform-SRE, decommission-hardware: decommission analytics1059.eqiad.wmnet - https://phabricator.wikimedia.org/T338408 (Stevemunene)
[13:26:47] Data-Platform-SRE, decommission-hardware: decommission analytics1060.eqiad.wmnet - https://phabricator.wikimedia.org/T338409 (Stevemunene)
[13:27:05] Data-Platform-SRE, decommission-hardware: decommission analytics1061.eqiad.wmnet - https://phabricator.wikimedia.org/T339199 (Stevemunene)
[13:27:08] Data-Platform-SRE, SRE, vm-requests: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host karapace1002.eqiad.wmnet with OS bullseye
[13:27:17] Data-Platform-SRE, decommission-hardware: decommission analytics1062.eqiad.wmnet - https://phabricator.wikimedia.org/T339200 (Stevemunene)
[13:27:31] Data-Platform-SRE, decommission-hardware: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 (Stevemunene)
[13:27:52] Data-Platform-SRE, decommission-hardware: decommission analytics1064.eqiad.wmnet - https://phabricator.wikimedia.org/T341204 (Stevemunene)
[13:28:27] Data-Platform-SRE, decommission-hardware: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 (Stevemunene)
[13:28:57] Data-Platform-SRE, decommission-hardware: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 (Stevemunene)
[13:29:25] Data-Platform-SRE, decommission-hardware: decommission analytics1068.eqiad.wmnet - https://phabricator.wikimedia.org/T341208 (Stevemunene)
[13:29:40] Data-Platform-SRE, decommission-hardware: decommission analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T341209 (Stevemunene)
[13:53:03] Data-Platform-SRE, SRE, vm-requests, Patch-For-Review: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host karapace1002.eqiad.wmnet with OS b...
[13:54:43] Data-Platform-SRE, SRE, vm-requests, Patch-For-Review: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (BTullis) Open→Resolved
[14:00:34] Data-Platform-SRE: an-worker1145 has a problem - https://phabricator.wikimedia.org/T341481 (BTullis)
[14:00:50] Data-Platform-SRE: an-worker1145 has a problem - https://phabricator.wikimedia.org/T341481 (BTullis) p:Triage→High a:BTullis
[14:01:37] Data-Platform-SRE: an-worker1145 has a problem - https://phabricator.wikimedia.org/T341481 (BTullis) I've logged in via the SOL console and I can see that there is a problem with the storage controller. This kind of thing is scrolling past on the console. ` [14278134.771367] systemd[22798]: confd.service: Fa...
[14:02:33] !log powered off an-worker1145 for T341481
[14:02:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:02:36] T341481: an-worker1145 has a problem - https://phabricator.wikimedia.org/T341481
[14:03:07] Giving it a few minutes to think about what it's done.
[14:03:58] PROBLEM - Host an-worker1145 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:22] !log powered on an-worker1145
[14:04:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:05:14] ACKNOWLEDGEMENT - SSH on an-worker1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Btullis Cold booted for T341481 https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:05:14] ACKNOWLEDGEMENT - Hadoop DataNode on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode Btullis Cold booted for T341481 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[14:05:14] ACKNOWLEDGEMENT - Host an-worker1145 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Cold booted for T341481
[14:06:09] Data-Platform-SRE: an-worker1145 has a problem - https://phabricator.wikimedia.org/T341481 (BTullis) Cold booted the host. We'll see if this reinitializes the storage system, or whether it fails to boot.
[14:07:26] RECOVERY - Host an-worker1145 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[14:07:36] RECOVERY - Check systemd state on an-worker1145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:38] RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:09:34] RECOVERY - Hadoop DataNode on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[14:11:12] Data-Platform-SRE: an-worker1145 has a problem - https://phabricator.wikimedia.org/T341481 (BTullis) Server appears to have booted correctly and all services are recovering.
[14:14:48] RECOVERY - puppet last run on an-worker1145 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:14:50] Data-Platform-SRE, Discovery-Search (Current work): Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (bking)
[15:15:08] Data-Platform-SRE, Discovery-Search (Current work): Determine whether or not to change CPU frequency governor on Search Platform-owned hosts - https://phabricator.wikimedia.org/T340554 (bking)
[15:19:38] Data-Platform-SRE, Discovery-Search (Current work): Unmanaged envoyproxy installation on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T341042 (bking)
[15:21:30] Data-Platform-SRE, serviceops-radar, Discovery-Search (Current work): Test version compatibility between production Kafka and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (bking)
[15:29:53] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (bking)
[15:32:51] Data-Platform-SRE, Discovery-Search (Current work), KaiOS-Wikipedia-app (Discovery), Patch-For-Review: Implement depool (source only) and keep-downtime options on data-transfer cookbook - https://phabricator.wikimedia.org/T340793 (bking)
[15:36:30] Data-Platform-SRE, Discovery-Search (Current work): Document SRE steps for deploying a new WDQS (and WCQS) host - https://phabricator.wikimedia.org/T330714 (bking)
[15:37:29] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Diagnose and fix WDQS deployment process - https://phabricator.wikimedia.org/T341290 (bking) Open→Declined a:bking
[15:42:01] Data-Platform-SRE, Discovery-Search (Current work), Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (TJones)
[15:42:05] Data-Platform-SRE, Discovery-Search (Current work): Determine whether or not to change CPU frequency governor on Search Platform-owned hosts - https://phabricator.wikimedia.org/T340554 (bking) Open→Invalid a:bking
[15:44:08] Data-Platform-SRE, Discovery-Search (Current work): Unmanaged envoyproxy installation on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T341042 (bking)
[17:08:34] (CR) Ebernhardson: [C: +1] "do we need to decide anything else before merging this?" [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: DCausse)
[17:10:39] (CR) DCausse: "@Sam adding you as a reviewer per https://wikitech.wikimedia.org/wiki/Event_Platform/Maintainers. I don't think anyone in the search platf" [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: DCausse)
[17:46:20] Data-Platform-SRE, Discovery-Search (Current work), Sustainability (Incident Followup): Create Turnilo/Superset dashboards for identifying users w/ excessive WDQS queries - https://phabricator.wikimedia.org/T338159 (EBernhardson) It looks like we added only the link, could we add a paragraph about ho...
[19:38:14] Data-Platform-SRE, Discovery-Search (Current work): Reimage WDQS servers to Bullseye - https://phabricator.wikimedia.org/T328325 (bking) This is complete. Closing... ` ansible codfw_tbd -i wdqs.hosts -m shell -a "cat /etc/debian_version" wdqs2017.codfw.wmnet | CHANGED | rc=0 >> 11.7 wdqs2016.codfw.wm...
[19:43:48] Data-Engineering, Data-Engineering-Wikistats: no view data by country for the last month (June 2023) - https://phabricator.wikimedia.org/T341523 (Rtfroot)
[21:23:50] Data-Platform-SRE, Discovery-Search (Current work): Ensure WCQS/WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 (bking)
[21:23:53] Data-Platform-SRE, Discovery-Search (Current work), KaiOS-Wikipedia-app (Discovery), Patch-For-Review: Implement depool (source only) and keep-downtime options on data-transfer cookbook - https://phabricator.wikimedia.org/T340793 (bking)
[21:38:58] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (bking) Update: I forgot to target 2013 in my last command, here is the latest list of hosts that need a data trans...
[21:54:01] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (BTullis) This is a first! I've successfully ingested sample data to the staging deployment of datahub. This is great because it shows that end-to-end ingestion works with 0.10.4. {F37135199,width=...
[22:30:45] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (BTullis) I'm going to aim for an upgrade of the production deployments tomorrow at approximately 10:00 UTC. I'll take a `mydumper` backup of the database on an-coord1001 before I start, in case I...