[03:43:38] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[03:43:38] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[07:43:38] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[07:43:38] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[08:33:01] Data-Platform-SRE, Infrastructure-Foundations, SRE, vm-requests: eqiad: 1 VM request for WMDE Airflow - https://phabricator.wikimedia.org/T342424 (Stevemunene) a:Stevemunene
[08:54:57] Data-Platform-SRE, Infrastructure-Foundations, SRE, vm-requests: eqiad: 1 VM request for WMDE Airflow - https://phabricator.wikimedia.org/T342424 (Stevemunene) Verifying the cluster availability and resources via ` stevemunene@cumin1001:~$ sudo cookbook -d sre.ganeti.resource-report eqiad DRY-RU...
[08:58:36] I'm about to start the scheduled reboots of the stats servers in a few minutes, starting with stat1004 and proceeding to stat1009
[09:13:22] stat1004 reboot completed.
[09:17:21] proceeding to reboot stat1005
[09:23:33] PROBLEM - Check systemd state on an-worker1124 is CRITICAL: CRITICAL - degraded: The following units failed: user-runtime-dir@116.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:25:54] It looks like an-worker1124 is having some problems. Lots of CPU stuck messages e.g.
[09:26:00] https://www.irccloud.com/pastebin/3nOh0KAq/
[09:26:19] I will pre-emptively reboot it.
[09:27:15] !log rebooted an-worker1124 due to CPU lockups
[09:27:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:28:13] proceeding to reboot stat1006
[09:40:04] proceeding to reboot stat1007
[09:47:01] PROBLEM - Host an-worker1124 is DOWN: PING CRITICAL - Packet loss = 100%
[09:49:33] proceeding to reboot stat1008
[09:55:07] RECOVERY - Host an-worker1124 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[09:55:09] PROBLEM - Check systemd state on an-worker1124 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:56:41] RECOVERY - Check systemd state on an-worker1124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:03:30] I had to give an-worker1124 a power cycle from IPMI because it wasn't shutting down cleanly. Now it's OK.
[10:53:26] Data-Engineering: Check home/HDFS leftovers of tmtl.io contractors - https://phabricator.wikimedia.org/T340942 (BTullis) I've checked the user `mhay` and the only interesting data appears to be a checked out copy of https://gitlab.wikimedia.org/htriedman/stat-spark3 on stat1005 in `/home/mhay/stat-spark3` Th...
[10:54:50] Data-Engineering: Check home/HDFS leftovers of tmtl.io contractors - https://phabricator.wikimedia.org/T340942 (BTullis) Similar thing for `skyenet`, there are a few directories on stat1005.
` ====== stat1005 ====== total 12 drwxrwxr-x 7 37818 wikidev 4096 Apr 13 2022 old -rwxrwxr-x 1 37818 wikidev 676 Apr...
[10:57:25] Data-Engineering: Check home/HDFS leftovers of tmtl.io contractors - https://phabricator.wikimedia.org/T340942 (BTullis) tmlt-tmager only has a single working copy of that same repository on stat1005 ` ====== stat1005 ====== total 4 drwxrwxr-x 7 37826 wikidev 4096 May 16 2022 stat-spark3`
[10:59:17] Data-Engineering: Check home/HDFS leftovers of tmtl.io contractors - https://phabricator.wikimedia.org/T340942 (BTullis) `damiendf` user only has the same working copy, but this time it is on stat1008/ ` ====== stat1008 ====== total 4 drwxrwxr-x 8 37889 wikidev 4096 Jul 13 2022 stat-spark3 `
[11:02:25] Data-Engineering: Check home/HDFS leftovers of tmtl.io contractors - https://phabricator.wikimedia.org/T340942 (BTullis) `dpujol` had no files of interest, so I removed the home directories with: ` sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::master::st...
[11:04:22] Data-Engineering: Check home/HDFS leftovers of tmtl.io contractors - https://phabricator.wikimedia.org/T340942 (BTullis) Similarly, I have deleted the home directories of `dasm` since there were no files of interest.
[11:04:51] Data-Engineering: Check home/HDFS leftovers of tmtl.io contractors - https://phabricator.wikimedia.org/T340942 (BTullis)
[11:07:38] Data-Engineering: Check home/HDFS leftovers of aranyap - https://phabricator.wikimedia.org/T340945 (BTullis) The only files of interest are the following two directories on stat1006: ` ====== stat1006 ====== total 8 drwxrwxr-x 5 43621 wikidev 4096 Mar 13 19:29 centralnotice_analytics drwxrwxr-x 11 43621 wik...
[11:19:30] Data-Engineering: Check home/HDFS leftovers of appledora - https://phabricator.wikimedia.org/T340948 (BTullis) Open→Resolved a:BTullis Thanks @Isaac.
Removed the user's home directories with: ` sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::ma...
[11:22:45] Data-Engineering: Check home/HDFS leftovers of jminor - https://phabricator.wikimedia.org/T340978 (BTullis) Open→Resolved a:BTullis No files of interest. Executed: `sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::master::standby' 'rm -rf /home/...
[11:25:52] Data-Engineering: Check home/HDFS leftovers of ntsako - https://phabricator.wikimedia.org/T343189 (BTullis) There is quite a bit of data to consider here, across stats servers, HDFS, and Hive. I'll make some enquiries to find out who should be the primary contact for considering what to do with this. ` ====...
[11:39:13] Data-Engineering: Check home/HDFS leftovers of ntsako - https://phabricator.wikimedia.org/T343189 (BTullis) I'm going to be quite cautious here, because there is quite a lot of data and it looks like it might cross over with: {T342269} @WDoranWMF , @VirginiaPoundstone - Would I be right in thinking that you...
[11:42:09] Data-Engineering: Check home/HDFS leftovers of hshaath - https://phabricator.wikimedia.org/T335263 (BTullis) Open→Resolved a:BTullis No files of interest. Removed the home directories with `sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::master...
[11:43:38] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:43:38] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[11:44:40] Data-Engineering: Check home/HDFS leftovers of echetty - https://phabricator.wikimedia.org/T330834 (BTullis) I wonder if @WDoranWMF might have a view as to whether these files have any value, or could perhaps nominate someone to make that decision. Thanks.
[11:53:21] Data-Engineering: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (BTullis) I've executed the following commands: ` btullis@stat1005:/home/paramd$ sudo mkdir /home/dsaez/paramd-archive btullis@stat1005:/home/paramd$ sudo mv /home/paramd/* /home/dsaez/paramd-archive btullis@sta...
[12:02:15] Data-Platform-SRE, SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (karapayneWMDE) hello, apologies for the delay (was on holiday) public key is: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIN2rcD7HPK...
[12:32:52] Data-Engineering: Check home/HDFS leftovers of hghani - https://phabricator.wikimedia.org/T335264 (KCVelaga_WMF) Hi @BTullis @Hghani (re)joined WMF on the newly formed Movement Insights team, and his access has been reinstated T322145#8862574 So, I guess this task need not be done anymore.
[12:35:27] Data-Platform-SRE, SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (Stevemunene) >>! In T342546#9089720, @karapayneWMDE wrote: > hello, apologies for the delay (was on holiday) > > public key...
[12:35:37] Data-Platform-SRE, SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (Stevemunene)
[12:37:07] Data-Engineering: Check home/HDFS leftovers of ilooremeta - https://phabricator.wikimedia.org/T335265 (KCVelaga_WMF) @BTullis I checked the files. I made a backup of a few data files, and the rest that is required are on GitHub/GitLab - so everything can be removed.
[12:52:27] (PS1) Sharvaniharan: Updated documentation: Image recs schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/948556
[12:54:18] (CR) Sharvaniharan: "Hi Shay. Please review this minor documentation change and merge the patch. Thank you :)" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/948556 (owner: Sharvaniharan)
[13:02:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:52] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics-Platform-Planning, MW-1.40-notes (1.40.0-wmf.1; 2022-09-12): Generate $wgEventLoggingStreamNames from $wgEventStreams - https://phabricator.wikimedia.org/T303602 (phuedx)
[13:05:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:15:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:40:17] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2012.codfw.wmnet with OS bullseye
[13:48:04] Data-Platform-SRE, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (Stevemunene)
[13:50:10] Data-Platform-SRE, Infrastructure-Foundations, SRE, vm-requests: eqiad: 1 VM request for WMDE Airflow - https://phabricator.wikimedia.org/T342424 (Stevemunene) created the vm with `sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 100 --network analytics --os buster --cluster eqiad --g...
[13:51:22] Data-Engineering: Check home/HDFS leftovers of ilooremeta - https://phabricator.wikimedia.org/T335265 (BTullis) Open→Resolved a:BTullis Thanks @KCVelaga_WMF. I removed the home directories with `sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::m...
[13:51:50] Data-Platform-SRE, Infrastructure-Foundations, SRE, vm-requests: eqiad: 1 VM request for WMDE Airflow - https://phabricator.wikimedia.org/T342424 (Stevemunene) Open→Resolved
[13:51:57] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene)
[13:53:11] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye
[13:55:33] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene)
[13:57:28] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene) We are unblocked on T342546 , Working to merge the tasks listed as in progress and as ready to merge on the ticket.
[14:08:37] Data-Engineering: Check home/HDFS leftovers of neilpquinn-wmf - https://phabricator.wikimedia.org/T340524 (BTullis) Open→Resolved a:BTullis Great! Thanks @nshahquinn-wmf I have removed the posix home directories with: `sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::mast...
[14:17:37] Data-Platform-SRE, sre-alert-triage: Alert triage - https://phabricator.wikimedia.org/T342247 (bking) a:bking→None
[14:18:25] Data-Platform-SRE, sre-alert-triage: Alert triage - https://phabricator.wikimedia.org/T342247 (bking) @BTullis I unassigned this from myself as I'm not actively working on it. I'm guessing it can probably be closed, but leaving that up to you.
[14:19:14] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (bking) wdqs10[03-05] will be decommissioned soon, so we're going to skip those.
Work continues on the other hosts...
[14:20:12] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (bking) > Oh! I see that you've already done this in https://gerrit.wikimedia.org/r/c/operations/alerts/+/945640 but we haven't [[htt...
[14:23:47] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene)
[14:42:20] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye completed: - wdqs1016 (**WARN**) - Downtimed on Icinga/Alertman...
[14:46:21] Data-Platform-SRE, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (jbond)
[14:57:07] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2012.codfw.wmnet with OS bullseye completed: - wdqs2012 (**WARN**) - Downtimed on Icinga/Alertman...
[15:03:23] Data-Engineering: Check home/HDFS leftovers of hghani - https://phabricator.wikimedia.org/T335264 (BTullis) Open→Declined Thanks @KCVelaga_WMF for the update and welcome (back) @Hghani. I'll decline this ticket.
[15:08:31] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (pfischer) a:pfischer
[15:10:26] Data-Engineering: Check home/HDFS leftovers of bmansurov - https://phabricator.wikimedia.org/T320367 (BTullis) Thanks @fkaelin - I can certainly archive the HDFS user directory and the home directory from stat1007, but archiving the Hive database is a bit more involved. Would you be happy for me to delete th...
[15:12:11] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Retune enwiki_content shard settings - https://phabricator.wikimedia.org/T343820 (Gehel)
[15:13:04] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 (Gehel)
[15:16:42] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1093.eqiad.wmnet with OS bullseye
[15:17:07] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Notifications, Event-Platform: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 (phuedx)
[15:25:00] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Notifications, Event-Platform: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 (Urbanecm_WMF) Moving to Triaged on our end, but feel free to ping us if there's something we...
[15:32:42] (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/896326 (https://phabricator.wikimedia.org/T330766) (owner: Phuedx)
[15:33:25] (CR) Mforns: [V: +2 C: +2] "LGTM!" 
[analytics/refinery] - https://gerrit.wikimedia.org/r/893998 (https://phabricator.wikimedia.org/T329718) (owner: Phuedx)
[15:40:49] Data-Engineering, Data Engineering and Event Platform Team, MediaWiki-extensions-WikimediaEvents, Product-Analytics, and 4 others: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (phuedx) Being **bold**.
[15:43:38] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[15:43:38] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[15:55:28] Data-Engineering: Check home/HDFS leftovers of bmansurov - https://phabricator.wikimedia.org/T320367 (BTullis) Archived HDFS home directory ` btullis@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -mkdir /wmf/data/archive/user/bmansurov btullis@an-launcher1002:~$ sudo -u hdfs kerberos-run...
[15:55:41] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1093.eqiad.wmnet with OS bullseye completed: - an-worker1093 (**PASS**) - Downtimed on Icinga/Alertmanag...
[15:57:24] Data-Engineering: Check home/HDFS leftovers of bmansurov - https://phabricator.wikimedia.org/T320367 (fkaelin) Thanks @BTullis, please go ahead and delete the hive database.
[15:58:48] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1094.eqiad.wmnet with OS bullseye
[15:59:03] Data-Platform-SRE, Infrastructure-Foundations, SRE, vm-requests: codfw: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T343715 (bking) Open→Resolved
[15:59:09] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (bking)
[16:37:54] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1094.eqiad.wmnet with OS bullseye completed: - an-worker1094 (**PASS**) - Downtimed on Icinga/Alertmanag...
[16:43:26] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (pfischer) @elukey, @Joe, thank you for your feedback! I revisited the size estimations, here are the updated n...
[16:43:54] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (pfischer) a:pfischer→None
[17:03:03] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1095.eqiad.wmnet with OS bullseye
[17:42:00] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1095.eqiad.wmnet with OS bullseye completed: - an-worker1095 (**PASS**) - Downtimed on Icinga/Alertmanag...
[18:23:04] Data-Engineering, CirrusSearch, Discovery-Search (Current work), Patch-For-Review: Fix permissions in hdfs://analytics-hadoop/wmf/data/discovery - https://phabricator.wikimedia.org/T331580 (CodeReviewBot) ebernhardson opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge...
[18:29:06] (CR) Shay Nowick: [C: +1] "Thanks for documenting change" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/948556 (owner: Sharvaniharan)
[18:44:07] Data-Platform-SRE, Scap, Wikidata, Wikidata-Query-Service, Release-Engineering-Team (Priority Backlog 📥): wdqs: replace git-fat with git-lfs - https://phabricator.wikimedia.org/T316876 (RKemper) Patch was merged here: https://gerrit.wikimedia.org/r/947928
[18:44:19] Data-Engineering, CirrusSearch, Discovery-Search (Current work), Patch-For-Review: Fix permissions in hdfs://analytics-hadoop/wmf/data/discovery - https://phabricator.wikimedia.org/T331580 (EBernhardson) ignore the above, patch attached to wrong ticket.
[18:53:32] Data-Engineering, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Set data permission on new snapshot generation (discovery.wikibase_rdf) - https://phabricator.wikimedia.org/T342416 (EBernhardson) I looked into these, the attached patch should fix it but it leaves an open quest...
[18:54:47] Data-Engineering, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Set data permission on new snapshot generation (discovery.wikibase_rdf) - https://phabricator.wikimedia.org/T342416 (EBernhardson) It seems the CodeReviewBot doesn't update the ticket when changing the ticket in...
[18:55:46] Data-Engineering, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Set data permission on new snapshot generation (discovery.wikibase_rdf) - https://phabricator.wikimedia.org/T342416 (CodeReviewBot) ebernhardson updated https://gitlab.wikimedia.org/repos/data-engineering/airflow...
[19:01:29] Data-Engineering: Check home/HDFS leftovers of echetty - https://phabricator.wikimedia.org/T330834 (odimitrijevic) @BTullis These are good to be removed
[19:39:42] (SystemdUnitFailed) firing: ferm.service Failed on dse-k8s-worker1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:43:38] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:43:38] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[20:09:42] (SystemdUnitFailed) resolved: ferm.service Failed on dse-k8s-worker1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:09:04] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Ensure WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 (bking) Per pairing discussion with Ryan, we believe this work is complete. The actual migration work continues in T343124 .
[21:25:25] Data-Platform-SRE: Decommission wdqs10[03-05] - https://phabricator.wikimedia.org/T344198 (bking)
[21:48:04] Data-Platform-SRE, Observability-Alerting, observability: Create VictorOps config for new Data Platform SRE team - https://phabricator.wikimedia.org/T344202 (bking)
[22:10:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[22:15:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high.
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[22:28:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[22:33:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[23:43:39] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:43:39] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability