[00:50:11] Analytics, AQS2.0, Tech-Docs-Team, API Platform (AQS 2.0 Roadmap), and 4 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (Pppery)
[02:19:44] Data-Engineering, API Platform: Update AQS API with September net new content data - https://phabricator.wikimedia.org/T348598 (nshahquinn-wmf)
[02:25:01] Data-Engineering, API Platform: Update AQS API with September net new content data - https://phabricator.wikimedia.org/T348598 (nshahquinn-wmf)
[06:38:59] joal: o/ bonjour
[06:39:07] if you have time later on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/964848
[07:20:19] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) I found a bug in `topicmappr` as the chunked force rebuild cause a broker to be listed multiple times among the replica list: ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reass...
[07:21:34] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) I'm then dropping the `--force-rebuild` flag, even if it will lead to a less optimal placement at the end. We can always correct with a rebalancing, should we need it. ` brouberol@kafka-jumbo1010:~/to...
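Editor's sketch (not part of the original log): the bug reported at 07:20:19 is a reassignment plan in which one broker appears more than once in a partition's replica list. A plan in the standard Kafka reassignment JSON layout (`version` plus a `partitions` array of `topic` / `partition` / `replicas`) could be checked for that before running `--execute`, roughly as below. The function name, the sample plan and the broker IDs are hypothetical and are not taken from the ticket.

```python
import json
from collections import Counter


def find_duplicate_replicas(plan):
    """Return (topic, partition, duplicated_broker_ids) for each bad entry."""
    problems = []
    for entry in plan.get("partitions", []):
        # Count how often each broker id appears in this replica list.
        counts = Counter(entry["replicas"])
        dupes = sorted(broker for broker, n in counts.items() if n > 1)
        if dupes:
            problems.append((entry["topic"], entry["partition"], dupes))
    return problems


# Hypothetical plan: partition 1 lists broker 1010 twice.
sample_plan = json.loads("""
{
  "version": 1,
  "partitions": [
    {"topic": "webrequest_text", "partition": 0, "replicas": [1007, 1008, 1009]},
    {"topic": "webrequest_text", "partition": 1, "replicas": [1010, 1011, 1010]}
  ]
}
""")

for topic, partition, dupes in find_duplicate_replicas(sample_plan):
    print(f"{topic}-{partition}: broker(s) {dupes} listed more than once")
```

An empty result only says the plan has no repeated brokers; it says nothing about whether the placement is well balanced.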
[07:36:14] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (MoritzMuehlenhoff)
[07:47:13] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) In the meantime, I've reported the issue upstream: https://github.com/DataDog/kafka-kit/issues/432
[07:54:53] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 3 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @Antoine_Quhen I've enabled your access on the...
[08:02:00] brouberol: o/ just checking, the jumbo under replicated alert is related to maintenance right?
[08:03:23] oops, sorry, I forgot to prolong the silence. my bad!
[08:03:25] yes it is
[08:03:43] let me set a silence for ~6h
[08:04:00] nono all good, I imagined, just double checking :)
[08:05:24] for context, I'm evacuating the webrequest_text topic from brokers 1001->1006 (in 6 phases), after which we can shut them down, and I'll be ~done working on jumbo
[08:05:47] modulo a bullseye upgrade for 1007->1009
[08:23:32] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Manuel) Hi @Stevemunene and @BTullis, thank you again for making us aware of this important issue! It turns out that these timers are indeed critical for different teams, so we must...
[08:26:30] Data-Platform-SRE, Data Engineering and Event Platform Team: Scap deployment on Hadoop test cluster broken - https://phabricator.wikimedia.org/T347491 (BTullis) @Antoine_Quhen - Are you happy for us to resolve the ticket? Have you tested another deployment to the Hadoop test cluster, since I installed `g...
[08:28:33] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[08:38:10] Data-Engineering, Data-Platform-SRE, Data Products, Wikidata, Wikidata-Query-Service: Publish WDQS JNL files to dumps.wikimedia.org - https://phabricator.wikimedia.org/T344905 (Gehel) p:Triage→Low
[08:39:07] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Patch-For-Review: Prepare new WDQS hosts for graph splitting - https://phabricator.wikimedia.org/T347505 (Gehel) p:Triage→High
[08:39:17] Data-Platform-SRE: Set requests (not limits) for cirrus-streaming-updater in k8s - https://phabricator.wikimedia.org/T348350 (Gehel) p:Triage→Medium
[08:39:19] Data-Platform-SRE, Discovery-Search (Current work): Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (Gehel) p:Triage→High
[08:40:27] Data-Engineering, Data-Platform-SRE, SRE Observability, Data Engineering and Event Platform Team (Sprint 3): Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (Gehel) p:Triage→Medium
[08:41:18] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (Gehel) p:Triage→Medium
[08:42:34] Data-Platform-SRE, Data-Services, cloud-services-team: move cloudelastic behind cloudlb - https://phabricator.wikimedia.org/T346946 (Gehel) p:Triage→Low
[08:43:52] Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (Gehel) p:Triage→Medium
[08:47:44] Data-Platform-SRE, SRE-OnFire, Discovery-Search (Current work), Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (Gehel) p:Triage→High
[08:48:10] Data-Platform-SRE: Root cause Archiva outage from 2023-09-24 - https://phabricator.wikimedia.org/T347343 (BTullis) p:Triage→High
[08:48:16] Data-Platform-SRE, observability, Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (Gehel) p:Triage→High
[08:48:19] Data-Platform-SRE, Epic: [Epic] define a strategy around alerting for Data Platform SRE and implement it - https://phabricator.wikimedia.org/T345698 (Gehel) p:Triage→High
[08:48:30] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (Gehel) p:Triage→High
[08:48:35] Data-Platform-SRE, Foundational Technology Requests, Epic: Create a DSE Kubernetes cluster with support for persistent storage from Ceph - https://phabricator.wikimedia.org/T327267 (Gehel) p:High→Low
[08:48:53] Data-Platform-SRE, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (Gehel) p:High→Low
[08:50:13] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Event-Platform: [SPIKE] Investigate what happens to deployed Flink clusters if the k8s operator goes down? - https://phabricator.wikimedia.org/T346231 (Gehel) p:Triage→High
[08:50:33] Data-Platform-SRE: Refactor sre.elasticsearch.rolling-operation to use spicerack improvements - https://phabricator.wikimedia.org/T345880 (Gehel) p:Triage→Low
[08:52:31] Data-Engineering, Data-Platform-SRE, Product-Analytics: Conda analytics environments breakage - conflicting dependencies between r-base and other - https://phabricator.wikimedia.org/T343823 (Gehel) p:Triage→Low
[08:53:03] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Install jupyterhub separately from conda-analytics - https://phabricator.wikimedia.org/T321512 (Gehel) p:Triage→Low
[08:53:30] Data-Engineering, Data-Platform-SRE, Observability-Alerting: Explore the use of Airflow notifiers for more flexible DAG failure handling - https://phabricator.wikimedia.org/T343234 (Gehel) p:Triage→Low
[08:53:34] Data-Engineering, Data-Platform-SRE, Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (Gehel) p:Triage→Low
[08:55:31] Data-Engineering, Data-Platform-SRE: Misconfigured proxies on data-engineering hosts - https://phabricator.wikimedia.org/T326302 (Gehel) p:Triage→Low
[09:08:31] Data-Platform-SRE, Dumps-Generation, cloud-services-team, Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (BTullis) Open→Resolved
[09:09:20] elukey: I just +1ed your CR for eventstreams - Thanks so much <3
[09:10:16] Data-Engineering, Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (BTullis) p:Triage→High
[09:12:10] joal: <3
[09:15:28] Good morning btullis - Would you please deploy our new AQS druid config so that new data shows up?
[09:15:38] here's the patch btullis: Bump mediawiki_history_snapshot to 2023-08
[09:15:41] woops
[09:15:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/965059 here it is actually btullis
[09:16:06] I think I already did this and forgot to roll-restart aqs.
[09:16:31] I have a feeling that sguptas asked me last week and I merged it, but failed to deploy.
[09:16:35] hm, I don't think so - the value in file was not set to 2023_09
[09:16:58] OK, thanks. I will merge and deploy your patch. My memory must be faulty.
[09:18:02] Data-Engineering, API Platform: Update AQS API with September net new content data - https://phabricator.wikimedia.org/T348598 (JAllemandou) This is due to a forgotten procedure at the beginning of month. It's in the plan of Data Engineering to automate this procedure so that it becomes automatic that ne...
[09:18:27] Data-Engineering, API Platform, Data Engineering and Event Platform Team (Sprint 3): Update AQS API with September net new content data - https://phabricator.wikimedia.org/T348598 (JAllemandou) a:JAllemandou
[09:23:44] joal: This one I merged on Oct 03 : https://gerrit.wikimedia.org/r/c/operations/puppet/+/963050
[09:25:00] https://www.irccloud.com/pastebin/zBij7nNF/
[09:27:21] !log trigger rolling-restart of aqs services with `sudo cumin -b 2 -s 20 A:aqs 'systemctl restart aqs'`
[09:27:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:36:15] Data-Engineering, API Platform, Data Engineering and Event Platform Team (Sprint 3): Update AQS API with September net new content data - https://phabricator.wikimedia.org/T348598 (BTullis) Sincere apologies. The omission was totally mine. I merged this change last week: https://gerrit.wikimedia.org/...
[09:57:40] Data-Engineering, Data-Platform-SRE, DBA: Re-clone dbstore1007:s2 following a crash - https://phabricator.wikimedia.org/T343109 (BTullis) Great, thanks. > ...bringing it down will make basically 11 large wikis inaccessible in half of the world. Yeah, let's try to avoid doing that :-) I will check b...
[10:55:45] Data-Platform-SRE: [DataHub] Users are redirected to the wrong screen on logout and from certain urls. - https://phabricator.wikimedia.org/T347149 (BTullis) Thanks for the update @Stevemunene - It seems remarkable that nobody seems to have noticed this issue before, but it's good that they seem to be priorit...
[11:20:42] (SystemdUnitFailed) firing: (2) druid-broker.service Failed on druid1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:28:01] --^ New druid server being added to the cluster, adding a downtime to silence the alerts before the onboarding is done.
[11:28:17] Data-Platform-SRE, Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1ef45f2c-6b73-432c-9d22-7d378d8653d7) set by stevemunene@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with r...
[12:28:33] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[12:55:38] Quarry: bastion for quarry - https://phabricator.wikimedia.org/T348642 (rook)
[12:56:11] Quarry: bastion for quarry - https://phabricator.wikimedia.org/T348642 (rook) `quarry-bastion.quarry.eqiad1.wikimedia.cloud` deploying
[13:00:15] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) >>! In T348184#9238635, @SD0001 wrote: > we don't use Puppet at all for this project? Puppet is used* *in the usual confusing puppet ways. The following directories in the puppet repo will do things to quarry: ` ./modules/p...
[13:05:56] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file ./webrequest_text-phase1.json --execute --throttle 80000000 kafka-reassign-partitions --zookeeper conf1007.e...
[13:16:20] Data-Platform-SRE, Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[07-15] brokers - https://phabricator.wikimedia.org/T346425 (brouberol) Note: all partition reassignment logs and snippets were added to https://phabricator.wikimedia.org/T336044 by mistake, since the b...
[13:56:18] btullis: FYI I've prepped the changes in configuration that will need to be deployed after all kafka-jumbo100[1-6] brokers are emptied and before we can decom them: https://gerrit.wikimedia.org/r/c/operations/puppet/+/965159/, https://gerrit.wikimedia.org/r/c/analytics/refinery/+/965166/ &
[13:56:18] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/965164/
[14:20:23] Quarry: bastion for quarry - https://phabricator.wikimedia.org/T348642 (rook) Open→Resolved
[14:27:45] Quarry, Patch-For-Review: investigate quarry on k8s - https://phabricator.wikimedia.org/T301469 (rook) @SD0001 moving k8s discussion from T348184 to here to match ticket descriptions. I've deployed quarry-bastion.quarry.eqiad1.wikimedia.cloud in T348642 and installed kubectl on it ` KUBE_VERSION="v1.24....
[14:28:08] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: [SPIKE] Should we introduce static typing to Event Platform nodejs codebases? - https://phabricator.wikimedia.org/T345389 (tchin) If we do introduce something, we should use JSDoc3 and follow what's happening on this ticket T...
[14:48:01] Data-Platform-SRE, Cloud-VPS, SRE, cloud-services-team, and 2 others: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (Jclark-ctr) @taavi What vlan are these going to be I would like to verify with @cmooney that these can go into these racks before i physically move them.
[14:59:03] btullis: hm - I must have forgotten to rebase my repo - my bad sorry :S
[14:59:54] Data-Platform-SRE, Cloud-VPS, SRE, cloud-services-team, and 2 others: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (cmooney) Thanks @Jclark-ctr yes these can go in E4 or F4 no problem.
[15:01:11] Quarry: git-crypt for config.yaml files - https://phabricator.wikimedia.org/T348476 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/quarry/pull/27
[15:05:44] milimetric: wow 11y \o/
[15:52:38] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, and 2 others: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye
[15:52:46] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, and 2 others: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye executed with...
[16:05:10] (PS4) Joal: Add unique-devices Iceberg schemas and scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/964939 (https://phabricator.wikimedia.org/T347689)
[16:05:58] (CR) Joal: Add unique-devices Iceberg schemas and scripts (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/964939 (https://phabricator.wikimedia.org/T347689) (owner: Joal)
[16:28:34] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[16:39:10] (CR) Xcollazo: [C: +1] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/964939 (https://phabricator.wikimedia.org/T347689) (owner: Joal)
[16:43:33] Data-Platform-SRE: Consider using git-lfs for elastic plugins repo - https://phabricator.wikimedia.org/T344462 (bking) Open→Declined
[16:44:42] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye
[16:44:51] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye executed wit...
[16:48:59] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye
[16:49:04] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye executed wit...
[16:53:42] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye
[17:02:47] Data-Engineering, Data Engineering and Event Platform Team (Sprint 3): Update AQS API with September net new content data - https://phabricator.wikimedia.org/T348598 (BPirkle)
[17:35:21] Data-Engineering, Data-Platform-SRE, SRE Observability, Data Engineering and Event Platform Team (Sprint 3): Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (nshahquinn-wmf) >>! In T347430#9201371, @BTullis wrote: > Also, I believe th...
[17:56:11] Data-Engineering, Data-Engineering-Wikistats: Add Farsi/Persian to WikiStats interface languages - https://phabricator.wikimedia.org/T348674 (Arian_Ar)
[18:36:01] Data-Platform-SRE: Reboot apifeatureusage* hosts - https://phabricator.wikimedia.org/T348418 (RKemper)
[18:49:50] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye completed: -...
[18:53:40] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (Jclark-ctr)
[18:53:48] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (Jclark-ctr) Open→Resolved
[19:38:16] Data-Platform-SRE, Discovery-Search: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685 (bking)
[19:39:07] Data-Platform-SRE, Discovery-Search (Current work): Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (EBernhardson) a:EBernhardson
[19:41:49] Data-Platform-SRE, Discovery-Search: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685 (bking) Space usage alerts for the Search team added in [[ https://gerrit.wikimedia.org/r/c/operations/alerts/+/964934 | this PR ]]
[19:53:19] Data-Platform-SRE, Discovery-Search: Standardize/document Elastic snapshot configuration - https://phabricator.wikimedia.org/T348686 (bking)
[20:00:08] Data-Platform-SRE: Consider using git-lfs for elastic plugins repo - https://phabricator.wikimedia.org/T344462 (bking) Per today's discussion in the Wednesday Meeting™, we decided that this repo was not the best candidate for moving to git-lfs, as it does not currently use git-fat. Closing...
[20:17:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:18:22] Quarry, Patch-For-Review: investigate quarry on k8s - https://phabricator.wikimedia.org/T301469 (rook) I've updated the branch associated with this ticket. And it does seem to get quarry running in minikube. Suggesting that we are close to being able to go to k8s? Though we probably need T316958 before g...
[20:19:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:19:48] Quarry, Patch-For-Review: investigate quarry on k8s - https://phabricator.wikimedia.org/T301469 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/quarry/pull/28
[20:29:22] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[20:32:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:49:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:54:39] Data-Platform-SRE, Discovery-Search: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685 (bking)
[20:54:41] Data-Platform-SRE, Discovery-Search (Current work), Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (bking)
[21:02:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:04:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:16:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:17:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:27:51] Data-Platform-SRE: Reboot apifeatureusage* hosts - https://phabricator.wikimedia.org/T348418 (RKemper)
[21:31:39] Data-Platform-SRE: Reboot apifeatureusage* hosts - https://phabricator.wikimedia.org/T348418 (RKemper) Reboot complete. Systemd units are happy. We noticed a `Warning` in the output of the `logstash.service`. I'll create a new ticket to look into that.
[21:34:48] Data-Platform-SRE: Address illegal reflective access on apifeatureusage* - https://phabricator.wikimedia.org/T348696 (RKemper)
[21:35:13] Data-Platform-SRE: Address illegal reflective access on apifeatureusage* - https://phabricator.wikimedia.org/T348696 (RKemper)
[21:36:25] Data-Platform-SRE: Reboot apifeatureusage* hosts - https://phabricator.wikimedia.org/T348418 (RKemper) Created T348696 for the `Illegal Reflective Access` warning
[22:23:09] (PS1) Kimberly Sarabia: Refactor schema structure [schemas/event/secondary] - https://gerrit.wikimedia.org/r/965258 (https://phabricator.wikimedia.org/T346106)