[01:29:43] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:30:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:33] PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:24:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:25:11] PROBLEM - Check systemd state on an-worker1133 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:59] (PS1) Clare Ming: Add custom schemas for *uiactionstracking instruments [schemas/event/secondary] - https://gerrit.wikimedia.org/r/978718 (https://phabricator.wikimedia.org/T351298)
[04:27:28] (CR) CI reject: [V: -1] Add custom schemas for *uiactionstracking instruments [schemas/event/secondary] - https://gerrit.wikimedia.org/r/978718 (https://phabricator.wikimedia.org/T351298) (owner: Clare Ming)
[04:30:27] (PS2) Clare Ming: Add custom schemas for *uiactionstracking instruments [schemas/event/secondary] - https://gerrit.wikimedia.org/r/978718 (https://phabricator.wikimedia.org/T351298)
[04:31:41] (CR) Clare Ming: "Not sure if we should try adding
just one custom schema for both instruments or keep them separate -- seems reasonable to combine?" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/978718 (https://phabricator.wikimedia.org/T351298) (owner: Clare Ming)
[04:33:31] PROBLEM - Hadoop NodeManager on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:34:01] RECOVERY - Check systemd state on an-worker1133 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:34:05] PROBLEM - Check systemd state on an-worker1145 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:34:43] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:34:51] RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:38:29] RECOVERY - Check systemd state on an-worker1145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:39:25] RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:39:42] (SystemdUnitFailed) firing: (4)
monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:44:42] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:45:03] PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:45:09] PROBLEM - Check systemd state on an-worker1113 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:59:43] RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:59:47] RECOVERY - Check systemd state on an-worker1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:11:34] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503
(brouberol) @pfischer Once you agree on the config, I can create and configure the topic for you. As this is to be a compacted...
[08:12:34] Data-Platform-SRE, SRE: Harden the netboot configuration against typos - https://phabricator.wikimedia.org/T351059 (brouberol) Thanks @MoritzMuehlenhoff !
[08:28:08] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1010.eqiad.wmnet with OS bullseye
[08:28:10] !log reimage druid1010 to pick up the right raid config and corresponding partman recipe T336043
[08:28:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:28:26] T336043: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043
[09:04:43] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:19:14] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1010.eqiad.wmnet with OS bullseye completed: - druid1010 (**WARN**) - Downtimed on Icinga/Aler...
[09:22:05] stevemunene: I heard from the grapevine that you wanted to pair on some puppet work?
[09:29:19] Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene) druid10[09-11] now have all been reimaged with the right raid config and we can proceed with the decommission of druid100[4-6] once druid1010 is fully back in the cluster and balanced. ` stevemunene@druid...
[09:29:59] o/ brouberol yes, would you be available later in the day or tomorrow?
[09:43:08] Data-Platform-SRE, Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/1 Add initial files for building superset
[09:52:02] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (JMeybohm)
[09:53:16] later in the day, for sure
[09:53:19] Data-Platform-SRE, Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (BTullis) I have made some progress on the Superset image in [[https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/1|this MR]]. I w...
[10:06:56] Data-Platform-SRE, Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/merge_requests/52 Add the data-engineering/superset project...
[10:11:13] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (dcausse)
[11:08:47] Data-Engineering, MediaWiki-extensions-EventLogging, MW-1.42-notes (1.42.0-wmf.5; 2023-11-14): Hard-deprecate mw.eventLog.inSample() - https://phabricator.wikimedia.org/T348776 (phuedx) Following up on this: There were 2 deprecation notices emitted (see https://grafana.wikimedia.org/d/000000037/mw-j...
[11:09:04] Data-Engineering, MediaWiki-extensions-EventLogging, MW-1.42-notes (1.42.0-wmf.5; 2023-11-14): Hard-deprecate mw.eventLog.inSample() - https://phabricator.wikimedia.org/T348776 (phuedx) Open→Resolved a:phuedx
[11:15:20] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (elukey) @pfischer option A) is fine, if there is a way to add the new traffic incrementally (to double check space used by th...
[11:33:50] Data-Engineering (Sprint 5), Data-Platform-SRE: Define a docker image containing kerberos-related tooling - https://phabricator.wikimedia.org/T352406 (brouberol)
[11:37:41] Data-Engineering (Sprint 5), Data-Platform-SRE: Define a docker image containing kerberos-related tooling - https://phabricator.wikimedia.org/T352406 (brouberol) I have created https://gitlab.wikimedia.org/repos/data-engineering/kerberos-kinit with a blubber file as well as a kokkuri-based release pipeline.
[11:37:45] Data-Platform-SRE: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis) I have made new kerberos principals and keytabs for the new coordinators. ` hive/an-coord1003.eqiad.wmnet@WIKIMEDIA analytics/an-coord1003.eqiad.wmnet@WIKIMEDIA hadoop/an-coord1003.eqiad.wmnet@WIKIMED...
[11:40:39] Data-Engineering (Sprint 5), Section-Level-Image-Suggestions, Structured-Data-Backlog (Current Work): [S] Coalesce section alignment image suggestions output - https://phabricator.wikimedia.org/T347558 (CodeReviewBot) mfossati updated https://gitlab.wikimedia.org/repos/structured-data/section-image-r...
[11:41:46] Data-Engineering (Sprint 5), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[11:42:17] Data-Engineering (Sprint 5), Section-Level-Image-Suggestions, Structured-Data-Backlog (Current Work): [S] Coalesce section alignment image suggestions output - https://phabricator.wikimedia.org/T347558 (mfossati) ### Report | script | coalesce | files before | files after | partition before | partiti...
[11:43:05] Data-Engineering (Sprint 5), Section-Level-Image-Suggestions, Patch-For-Review, Structured-Data-Backlog (Current Work): [S] Coalesce section alignment image suggestions output - https://phabricator.wikimedia.org/T347558 (mfossati)
[11:46:28] Data-Engineering (Sprint 5), Data-Platform-SRE, Patch-For-Review: Define a docker image containing kerberos-related tooling - https://phabricator.wikimedia.org/T352406 (CodeReviewBot) brouberol opened https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/merge_requests/53 Request access...
[11:48:54] Data-Platform-SRE: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis) One thing that is crucial here is for us to make sure that the keytabs for hive and presto that are deployed to the new coordinators have the principals which match the service names (the DNS CNAME) t...
[11:51:00] Quarry: [bug] query/77794: "This query was stopped" - https://phabricator.wikimedia.org/T352211 (Novem_Linguae)
[12:04:53] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (pfischer) @elukey, sure. We would start onboarding smaller wikis first (test, it, fr) before moving on to the bigger ones. @...
[12:08:15] Data-Platform-SRE: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis) As per the guidelines here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Kerberos/Administration#Create_a_custom_principal_and_keytab_entry Adding the existing `hive/analytics-hive.eq...
[12:39:14] Quarry: [bug] query/77794: "This query was stopped" - https://phabricator.wikimedia.org/T352211 (Boshomi_Phabricator)
[12:42:31] Quarry: [bug] query/77794: "This query was stopped" - https://phabricator.wikimedia.org/T352211 (taavi) The wiki replicas were accidentally running on shorter-than-usual network timeouts due to the work going on in {T346947}. That's now been fixed, try again?
[13:04:43] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:04:51] If anyone is free to review this small patch, I'd be grateful. It brings an-coord1003 into service, ready to take over from an-coord1001: https://gerrit.wikimedia.org/r/c/operations/puppet/+/979086
[13:13:52] btullis stevemunene: whenever you get the chance, here's a _tiny_ PR https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/977994. Thanks!
[13:13:57] btullis: looking
[13:44:11] Data-Engineering (Sprint 5), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[13:45:07] Quarry, Data-Services: [bug] query/77794: "This query was stopped" - https://phabricator.wikimedia.org/T352211 (taavi) Open→Resolved a:taavi
[13:49:40] Data-Engineering (Sprint 5), Data-Platform-SRE: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (brouberol) Open→Resolved The keytabs have been added to the private hieradata in puppet, under `role/common/deployme...
[13:49:42] Data-Engineering (Sprint 5), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[14:00:22] Data-Engineering (Sprint 5), Section-Level-Image-Suggestions, Patch-For-Review, Structured-Data-Backlog (Current Work): [S] Coalesce section alignment image suggestions output - https://phabricator.wikimedia.org/T347558 (JAllemandou) Thanks folks :) This will make HDFS a lot happier <3
[14:46:14] Data-Engineering (Sprint 5), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol) I see we have "Configure ingress to the services" as part of the list of things to do. Do we need to though? IIRC this will only be an internal tool, not ex...
[14:54:43] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:18] (KafkaReplicationFactorTooLow) firing: (393) SKafka topic replication factor is too low - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[14:59:42] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:00:21] * brouberol looks at KafkaReplicationFactorTooLow
[15:00:24] Data-Engineering (Sprint 5): [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (gmodena) > Implement a serializer that gives us some flexibility wrt storing result metadata. E.g. keep result key fields (in...
[15:03:25] (KafkaReplicationFactorTooLow) resolved: (1276) SKafka topic replication factor is too low - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[15:06:41] I opened https://gerrit.wikimedia.org/r/979116 to improve this ^ message
[15:21:02] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (Jclark-ctr)
[15:21:26] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (Jclark-ctr) Open→Resolved
[15:56:12] Data-Engineering (Sprint 5), Patch-For-Review: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (CodeReviewBot) aqu opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/549 Provide Airflow metrics develo...
[16:36:02] Data-Engineering, Data Products (Data Product Sprint 04): Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest - https://phabricator.wikimedia.org/T351909 (Milimetric) >>! In T351909#9366161, @phuedx wrote: > Is it possible to have the monitorin...
[16:36:42] Data-Platform-SRE: Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (bking) Open→Resolved I believe this work is complete. Closing, but please reopen if we missed anything.
[17:19:53] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (EBernhardson) For deleting the topic, if we need to pause all writers and consumers that can relatively easily be done. test...
[17:22:02] Data-Engineering (Sprint 5), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (BTullis) >>! In T330176#9371468, @brouberol wrote: > I see we have "Configure ingress to the services" as part of the list of things to do. Do we need to though? IIRC...
[17:38:24] Data-Engineering (Sprint 5), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol) Oh I see. Indeed, if we can't resolve the k8s service names outside of within the k8s cluster, then yes, we'd indeed need that. Point taken, thank you!
[17:41:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:41:57] !log reran refine_event for mediawiki_cirrussearch_request
[17:41:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:44:06] Data-Platform-SRE, serviceops, Discovery-Search (Current work), Patch-For-Review: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (brouberol) The partition count can be changed on the fly (only increased, never decreased), that's no i...
[17:44:42] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:15:23] Data-Engineering, Data Products (Data Product Sprint 04): Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest - https://phabricator.wikimedia.org/T351909 (JAllemandou) @gmodena is working on adding data-quality metrics on the webrequest dataset...
[18:49:49] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (bking) Moving to "blocked/waiting" until we have confirmation on the reload data.
[18:50:00] Data-Platform-SRE, Discovery-Search (Current work): Load Wikidata split graphs into test servers - https://phabricator.wikimedia.org/T350465 (bking) @Gehel Is this a duplicate of T347504?
[18:50:08] Data-Engineering (Sprint 5), Data-Platform-SRE, Patch-For-Review: Define a docker image containing kerberos-related tooling - https://phabricator.wikimedia.org/T352406 (CodeReviewBot) dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/merge_requests/53 Request access to t...
[18:59:45] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (bking) After some thought, I think the problem is the blackbox check's association with miscweb. We are actually cutting around miscweb when we ac...
[19:07:24] Looks like I have been disconnected all day without even realizing :( sorry team
[19:54:10] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (Jclark-ctr)
[19:58:07] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1104.eqiad.wmnet with OS bookworm
[19:58:09] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1103.eqiad.wmnet with OS bookworm
[19:58:13] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1106.eqiad.wmnet with OS bookworm
[19:58:19] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm
[19:58:25] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1105.eqiad.wmnet with OS bookworm
[20:37:22] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook
cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1103.eqiad.wmnet with OS bookworm completed: - elastic1103 (**PASS**)...
[20:37:29] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1106.eqiad.wmnet with OS bookworm completed: - elastic1106 (**WARN**)...
[20:37:43] Data-Platform-SRE, Data Pipelines: Can't save dagrun notes in airflow after 2.7.3 migration - https://phabricator.wikimedia.org/T352483 (EBernhardson)
[20:38:24] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1104.eqiad.wmnet with OS bookworm completed: - elastic1104 (**PASS**)...
[21:45:57] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2092.codfw.wmnet with OS bookworm completed: - elastic2092 (**PASS**)...
[21:46:17] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2094.codfw.wmnet with OS bookworm
[21:50:29] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (Jclark-ctr)
[21:51:07] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (Jclark-ctr)
[21:54:21] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1105.eqiad.wmnet with OS bookworm
[21:54:27] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm
[22:24:01] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2093.codfw.wmnet with OS bookworm completed: - elastic2093 (**PASS**)...
[22:24:38] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2095.codfw.wmnet with OS bookworm
[23:05:44] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Papaul)
[23:06:05] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2095.codfw.wmnet with OS bookworm completed: - elastic2095 (**PASS**)...
[23:06:38] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2094.codfw.wmnet with OS bookworm executed with errors: - elastic2094...
[23:11:36] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2096.codfw.wmnet with OS bookworm
[23:16:58] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2094.codfw.wmnet with OS bookworm
[23:25:05] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking)
[23:25:26] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 (bking) Open→Resolved Apologies for the confusion. We have already migrated the r...
[23:31:18] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1105.eqiad.wmnet with OS bookworm
[23:31:20] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm
[23:36:22] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2097.codfw.wmnet with OS bookworm
[23:37:17] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Papaul)
[23:44:23] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2098.codfw.wmnet with OS bookworm
[23:55:56] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2096.codfw.wmnet with OS bookworm completed: - elastic2096 (**PASS**)...
[23:56:16] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2099.codfw.wmnet with OS bookworm