[00:31:42] (SystemdUnitFailed) firing: monitor_refine_eventlogging_legacy_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:21:42] (SystemdUnitFailed) firing: (2) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:33:35] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[03:33:35] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[05:21:42] (SystemdUnitFailed) firing: (2) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:01:42] (SystemdUnitFailed) firing: (3) hdfs-balancer.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:33:35] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[07:33:36] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[07:42:06] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (elukey) @pfischer it is not clear from the google spreadsheet what is the architecture that you have in mind,...
[07:52:16] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (Joe) I am uneasy suggesting one of the type of events without a clearer picture of what changes in what case....
[08:46:42] (SystemdUnitFailed) firing: (3) hdfs-balancer.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:03:20] Data-Engineering, Data-Platform-SRE: [opsweek] hdfs-balancer failure - https://phabricator.wikimedia.org/T344045 (BTullis)
[09:16:42] (SystemdUnitFailed) firing: (2) hdfs-balancer.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:32:00] Data-Engineering, Data-Platform-SRE: [opsweek] hdfs-balancer failure - https://phabricator.wikimedia.org/T344045 (BTullis) This is now working. Here's the last bit of the log from re-running the script on an-test-client1001. ` Aug 11 09:13:45 an-test-coord1001 kerberos-run-command[377134]: Aug 11, 2023...
[09:32:02] Data-Engineering, Data-Platform-SRE: [opsweek] hdfs-balancer failure - https://phabricator.wikimedia.org/T344045 (BTullis) Open→Resolved
[09:46:20] Data-Engineering: Check home/HDFS leftovers of cmacholan - https://phabricator.wikimedia.org/T330121 (BTullis) Open→Resolved a:BTullis Removed the user's home directories with: ` btullis@cumin1001:~$ sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hado...
[09:54:14] Data-Engineering: Check home/HDFS leftovers of ilooremeta - https://phabricator.wikimedia.org/T335265 (BTullis) There are some files on stat1005 that may be of interest, but nothing else. @KCVelaga_WMF what would you like us to do with these files? Can we remove them, or would you like them transferred to your o...
[09:57:11] Data-Engineering, Product-Analytics: Remove home/HDFS leftovers of xihua - https://phabricator.wikimedia.org/T337711 (BTullis) Open→Resolved a:BTullis I removed the home directories with: ` sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::ma...
[10:03:49] Data-Engineering: Check home/HDFS leftovers of toddleroux / ryanmax / afandian2 - https://phabricator.wikimedia.org/T325527 (BTullis) Removed home directories for `toddleroux` and `afandian2` with: ` sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::master::...
[10:04:05] Data-Engineering: Check home/HDFS leftovers of ryanmax - https://phabricator.wikimedia.org/T325527 (BTullis)
[10:04:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:42] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:07:56] Data-Engineering: Check home/HDFS leftovers of hghani - https://phabricator.wikimedia.org/T335264 (BTullis) There is some data belonging to this user on three of the stats boxes. @KCVelaga_WMF - would you be able to make an assessment of whether this should be retained please, or whether it is safe for us to...
[10:09:50] Data-Engineering: Check home/HDFS leftovers of ktsouroupidou - https://phabricator.wikimedia.org/T335012 (BTullis) Open→Resolved a:BTullis No data present. ` check-user-leftovers ktsouroupidou ====== stat1004 ====== total 0 ====== stat1005 ====== total 0 ====== stat1006 ====== total 0 ======...
[10:13:00] Data-Engineering: Check home/HDFS leftovers of echetty - https://phabricator.wikimedia.org/T330834 (BTullis) Pinging @odimitrijevic - Are you happy for me to remove these old files that were owned by Emil, or would you like us to review or archive them? Thanks.
[10:14:50] Data-Engineering: Check home/HDFS leftovers of akhatun - https://phabricator.wikimedia.org/T326157 (BTullis) Pinging @Gehel - Are you happy for us to remove these files and the hive database, or would you rather that we archive them somewhere? Thanks.
[10:15:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:42] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:39:42] Data-Engineering: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (BTullis) Hi @diego - I can look into getting this data for you. Let's start with the stat boxes; the majority of it is at `stat1005:/home/paramd` There's a total of 12 GB here, which includes 3 GB across two c...
[10:52:49] Data-Engineering: Check home/HDFS leftovers of neilpquinn-wmf - https://phabricator.wikimedia.org/T340524 (BTullis) Hi @nshahquinn-wmf - I've noticed that there are still some files on the stats servers owned by your previous account: `neilpquinn-wmf`. ` ====== stat1005 ====== total 140 drwxrwxrwx 5 12049 wi...
[11:02:18] Data-Engineering: Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234 (BTullis) Thanks @AndyRussG_volunteer for your input. I've started to look into this now. For reference, I've created a paste with the full output of what we look for, which correlates with what you mention....
[11:08:34] Data-Engineering: Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234 (BTullis) > It might be nice if the .ipynb and maybe the .hql files could be preserved somewhere for possible future reference? @XenoRyet - Do you agree that we should do this? Should I make a tarball of all...
[11:18:27] Data-Engineering: Check home/HDFS leftovers of bmansurov - https://phabricator.wikimedia.org/T320367 (BTullis) Pinging @fkaelin - Did you have a chance to look at these files: T320367#8429551 and work out what you would like us to do with them? We can transfer ownership, archive them, or delete them. You'll...
[11:22:55] Data-Engineering: Deploy an-test-launcher1002 as a Ganeti VM to test high-availability of scheduled jobs - https://phabricator.wikimedia.org/T288767 (BTullis) Open→Declined I'm declining this task in light of the fact that we have moved significantly further on our airflow migration and it is no long...
[11:31:42] (SystemdUnitFailed) resolved: monitor_refine_eventlogging_legacy_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:33:36] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:33:36] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[11:39:21] Data-Engineering: Check home/HDFS leftovers of akhatun - https://phabricator.wikimedia.org/T326157 (Gehel) We can just remove them. Anything important should already be under version control.
[12:12:30] Data-Engineering, Data-Platform-SRE, serviceops-radar, Patch-For-Review: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (JMeybohm) >>! In T343236#9079448, @BTullis wrote: > Maybe what we should do is: > * Move datahub t...
[12:32:50] (SystemdUnitFailed) firing: clean-confd-rundir.service Failed on kafka-jumbo1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:34:07] PROBLEM - Check systemd state on kafka-jumbo1007 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:46:03] RECOVERY - Check systemd state on kafka-jumbo1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:47:50] (SystemdUnitFailed) resolved: clean-confd-rundir.service Failed on kafka-jumbo1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:02:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:04:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:16:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:17:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:21:34] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2009.codfw.wmnet with OS bullseye
[14:21:44] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2008.codfw.wmnet with OS bullseye
[15:03:57] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2008.codfw.wmnet with OS bullseye executed with errors: - wdqs2008 (**FAIL**) - Downtimed on Icin...
[15:07:11] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2009.codfw.wmnet with OS bullseye executed with errors: - wdqs2009 (**FAIL**) - Downtimed on Icin...
[15:33:36] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[15:33:36] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[16:10:54] Data-Engineering: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (diego) Hi @BTullis > Would you like me to move this in bulk to a new directory within your home, such as: /home/dsaez/paramd-archive This sounds good and enough! Thanks
[16:58:47] Data-Engineering: Check home/HDFS leftovers of bmansurov - https://phabricator.wikimedia.org/T320367 (fkaelin) This went unnoticed, sorry. We can safely archive all this data. Thank you.
[17:22:19] Data-Engineering, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Set data permission on new snapshot generation (discovery.wikibase_rdf) - https://phabricator.wikimedia.org/T342416 (EBernhardson) a:EBernhardson
[18:42:59] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2010.codfw.wmnet with OS bullseye
[19:08:49] Data-Engineering, Event-Platform: Validation Error for eventlogging_WMDEBannerSizeIssue - https://phabricator.wikimedia.org/T344027 (KSarabia-WMF)
[19:30:36] Data-Engineering, Inuka-Team, KaiOS-Wikipedia-app, Product-Analytics: Correctly detect pageviews from Wikipedia KaiOS app - https://phabricator.wikimedia.org/T344071 (mpopov)
[19:33:17] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2010.codfw.wmnet with OS bullseye completed: - wdqs2010 (**WARN**) - Downtimed on Icinga/Alertman...
[19:33:51] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:33:51] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[19:44:29] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2011.codfw.wmnet with OS bullseye
[20:31:20] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2011.codfw.wmnet with OS bullseye executed with errors: - wdqs2011 (**FAIL**) - Downtimed on Icin...
[20:33:15] Data-Engineering: Check home/HDFS leftovers of neilpquinn-wmf - https://phabricator.wikimedia.org/T340524 (nshahquinn-wmf) @BTullis thanks for working on this! Everything you mentioned, on the stat servers and in Hive, is safe to delete.
[22:19:18] Data-Platform-SRE: Implement depool (source only) and keep-downtime options on data-transfer cookbook - https://phabricator.wikimedia.org/T340793 (bking) Based on recent cookbook runs, it appears that the "lvs_strategy=both" option is not working. Leaving this as a note to myself to look at it Monday.
[22:55:48] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (bking) Headed out for the weekend, we are exactly halfway: ` sudo cumin A:wdqs-all 'cat /etc/debian_version' 30 hosts will be targeted: wdqs[2007-2022].codfw.wmnet,wdqs[1003-1016].eqiad.wmnet =...
[23:33:51] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:33:51] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability