[00:13:23] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.29% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:23:28] RECOVERY - Check systemd state on an-presto1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:44] (SystemdUnitFailed) firing: (20) export_smart_data_dump.service Failed on an-conf1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:28] RECOVERY - Check systemd state on an-worker1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:30] RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:12] RECOVERY - Check systemd state on clouddb1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:48] RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:43] (SystemdUnitFailed) firing: (20) export_smart_data_dump.service Failed on an-conf1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:36:10] RECOVERY - Check systemd state on kafka-jumbo1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:04] RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:02] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:44] RECOVERY - Check systemd state on an-worker1109 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:02] RECOVERY - Check systemd state on an-worker1134 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:14] RECOVERY - Check systemd state on an-presto1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:43] (SystemdUnitFailed) firing: (20) export_smart_data_dump.service Failed on an-conf1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:41:24] RECOVERY - Check systemd state on kafka-jumbo1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:28] RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:56] RECOVERY - Check systemd state on an-presto1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:52] RECOVERY - Check systemd state on analytics1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:44] (SystemdUnitFailed) firing: (19) export_smart_data_dump.service Failed on an-conf1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:49:42] RECOVERY - Check systemd state on an-presto1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:43] (SystemdUnitFailed) firing: (19) export_smart_data_dump.service Failed on an-conf1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:54:44] (SystemdUnitFailed) firing: (14) export_smart_data_dump.service Failed on an-conf1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:59:42] RECOVERY - Check systemd state on an-worker1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:43] (SystemdUnitFailed) firing: (2) export_smart_data_dump.service Failed on an-conf1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:32:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:34:43] (SystemdUnitFailed) firing: (2) export_smart_data_dump.service Failed on an-conf1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:16] RECOVERY - Check systemd state on an-conf1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:43] (SystemdUnitFailed) firing: (2) export_smart_data_dump.service Failed on an-conf1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:23] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.288% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:49:43] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:23] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.288% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:27:41] (CR) Elukey: [V: +2 C: +2] Fix mismatched allocation error from fdopen/pclose to fdopen/fclose. This is to resolve a "mismatched-dealloc" error that blocked packaging [analytics/kafkatee] - https://gerrit.wikimedia.org/r/961174 (owner: Jgreen)
[08:50:55] Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (brouberol) The `_schemas` topic is actually legit and should not be removed, as it is where `Karapace` stores its data: https://github.com/Aiven-Open/karapace#backing-up-your-karapace. Although...
[08:54:28] Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (brouberol) >>! In T346887#9220603, @Ottomata wrote: > You can also probably delete ANY topic that has ksql in it. We've never used KSQL in prod. ` brouberol@kafka-jumbo1010:~$ for topic in $...
[09:01:53] Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (brouberol) The 3 remaining topics with RF=1 are empty: ` brouberol@kafka-jumbo1010:~$ kafka topics --describe | grep 'ReplicationFactor:1' Topic:faust-app-__assignor-__leader PartitionCount:1...
[09:02:44] Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (brouberol) ` brouberol@kafka-jumbo1010:~$ kafka topics --describe | grep 'ReplicationFactor:1' brouberol@kafka-jumbo1010:~$ ` We no longer have a topic with RF=1. We can now work on adding mo...
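Note on the RF=1 check in the 09:01 and 09:02 messages: it greps `kafka topics --describe` output for `ReplicationFactor:1`. The same filter could be sketched in Python as below, assuming the whitespace-separated `Key:Value` summary-line format visible in the truncated excerpt. The function name and the sample text are hypothetical and are not taken from the cluster.

```python
def topics_with_rf(describe_output, replication_factor=1):
    """Return topic names whose summary line reports the given replication factor."""
    matches = []
    for line in describe_output.splitlines():
        # Assumption: summary lines start with "Topic:"; per-partition
        # detail lines are indented and are skipped here.
        if not line.startswith("Topic:"):
            continue
        # Split each "Key:Value" token once, so values may contain colons.
        fields = dict(
            token.split(":", 1) for token in line.split() if ":" in token
        )
        if fields.get("ReplicationFactor") == str(replication_factor):
            matches.append(fields["Topic"])
    return matches


# Hypothetical sample, modelled on the format in the excerpt above.
SAMPLE = (
    "Topic:example-a\tPartitionCount:1\tReplicationFactor:1\tConfigs:\n"
    "\tTopic: example-a\tPartition: 0\tLeader: 1001\tReplicas: 1001\tIsr: 1001\n"
    "Topic:example-b\tPartitionCount:3\tReplicationFactor:3\tConfigs:\n"
)

if __name__ == "__main__":
    print(topics_with_rf(SAMPLE))
```

An empty result corresponds to the empty grep output in the 09:02 message.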
[09:10:51] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye
[09:38:07] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (Stevemunene) Hit a bit of a block with the reimage at the partitioning step, exploring the options to find the best way forward for `druid1008` {F41524833} {F41524831}
[09:38:15] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (Stevemunene)
[09:45:03] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 9 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (MSantos)
[09:46:32] Analytics, Data-Engineering, Data-Platform-SRE, SRE, Event-Platform: Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (Gehel) Declined→Open Re-opening after discussion with @brouberol, having better auto discovery is still interesting.
[09:46:50] Analytics, Analytics-Kanban, Data-Engineering, MediaWiki-extensions-EventLogging, and 2 others: Modern Event Platform: Stream Intake Service: Implementation: Deployment Pipeline - https://phabricator.wikimedia.org/T211247 (Gehel)
[09:47:08] Analytics, Data-Engineering, Data-Platform-SRE, SRE, Event-Platform: Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (Gehel) a:Ottomata→brouberol
[09:49:44] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:59:57] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors: - druid1008 (**FAIL**) - Downtimed on...
[10:22:22] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye
[10:34:23] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors: - druid1008 (**FAIL**) - Removed from...
[11:05:52] !log testing SAL and logging
[11:05:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:25:49] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye
[12:11:03] (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:11:25] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors: - druid1008 (**FAIL**) - Removed from...
[12:13:23] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.289% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:19:52] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye
[12:20:47] Data-Engineering, Data-Platform-SRE, Privacy Engineering, SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (BTullis) There has been an improvement, but it's still not working correctly. Here's a screenshot from the page with the preview contai...
[12:44:07] Data-Engineering, Data-Platform-SRE, Privacy Engineering, Patch-For-Review, SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (BTullis) Hi @SCampos-WMF I've tested the settings in https://gerrit.wikimedia.org/r/977057 manually, and they s...
[12:44:47] !log removing oozie configuration from core hadoop files with https://gerrit.wikimedia.org/r/c/operations/puppet/+/974647 for T341893
[12:44:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:44:50] T341893: [Data Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893
[13:39:09] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors: - druid1008 (**FAIL**) - Removed from...
[13:46:04] Analytics, Data-Engineering (Sprint 5), Event-Platform, Patch-For-Review, User-notice: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (REsquito-WMF) @Ottomata We have deployed our changes to prod. Is there any place or anyhow we ca...
[13:49:58] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:11:56] !log roll-restarting hadoop masters on test cluster for T341893
[14:12:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:12:03] T341893: [Data Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893
[14:27:00] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye
[14:27:13] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors: - druid1008 (**FAIL**) - Removed from...
[14:33:58] Data-Engineering, Data-Platform-SRE, Privacy Engineering, Patch-For-Review, SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (SCampos-WMF) Hey @Btullis, thanks for addressing this issue! I'll generate a ticket and share it with our technic...
[14:38:21] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (Stevemunene) We fixed a partman recipe issue that was causing some errors, then proceeded as expected with the expected options below then {F41525119} selected Yes from the image below {F41525854} The...
[14:38:45] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye
[14:44:23] btullis: with the monitor for kafka topics with RF=1 merged, and the stale topics deleted, all that's left to do is remove these old topics from the datahub data. You mentioned I needed to find a conda env in a stat box with datahub installed, is that right?
[14:46:28] or should I maybe create one with just datahub, in my home dir, and cleanup after myself?
[14:47:32] (I found /home/aqu/afdeb/usr/lib/airflow/envs/airflow_2.3.1_1/bin/datahub on stat1004)
[14:47:36] brouberol: Yes, you can do it on a stat box. Either a conda-analytics environment or a basic python venv will work.
[14:48:01] Hang on, I'll look out an example.
[14:49:46] Oh, this was the comment that I was thinking of, but it wasn't as useful as I had thought. It doesn't have the creation of the environment, just using it to do a manual ingestion. https://phabricator.wikimedia.org/T327884#8574080
[14:50:15] https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/DataHub#Manual_Ingestion_Example
[14:58:33] !log merging 974649: Remove all remaining references to oozie and clean up | https://gerrit.wikimedia.org/r/c/operations/puppet/+/974649 for T341893
[14:58:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:58:36] T341893: [Data Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893
[15:12:01] Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye completed: - druid1008 (**PASS**) - Removed from Puppet and...
[15:19:30] Afk for a bit
[16:11:03] (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:12:49] * brouberol back
[16:13:23] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.289% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:13:41] btullis: I'm struggling to get the datahub CLI to talk to the server. I'm getting various SSL related errors. I'm curious: did you ever get it to work?
[16:13:53] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye
[16:14:03] Oh yes, hang on. There is an environment variable that helps. Let me dig it out.
[16:15:03] 🙏
[16:15:48] I dug a bit on the box and found a .datahubenv file with
[16:15:48] gms:
[16:15:48] server: https://datahub-gms.discovery.wmnet:30443
[16:15:48] so I copied that
[16:16:04] Try this `REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt` and then your command.
[16:16:07] aaah
[16:16:14] perfect, thank you
[16:16:15] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/analytics/dags/datahub/ingestion/ingest_daily_dag.py#L60
[16:17:03] It's something to do with the conda-analytics environment not using the right system CA file by default.
[16:17:31] it worked!
[16:17:47] do we have that documented somewhere? If not, I'll make sure to do so
[16:18:34] I remember that we had this ticket back when it was `anaconda-wmf`, before it was `conda-analytics` https://phabricator.wikimedia.org/T306197
[16:18:50] I thought it was going to go away with conda-analytics, but it hasn't.
[16:20:40] I seem to have removed said documentation, thinking that it was fixed: https://wikitech.wikimedia.org/w/index.php?title=Data_Engineering/Systems/DataHub&diff=prev&oldid=2091885
[16:22:14] ack, thank you. I'll write a little something in our doc then
[16:22:37] I'm about to soft delete the topics from datahub, and if everything goes right, I'll hard delete them as well
[16:27:16] `Took 40.972 seconds to soft delete -1 versioned rows and 0 timeseries aspect rows for 1 entities.` jeez take your time datahub, no-one's in a rush or anything
[16:28:19] so deleting -1 rows means that it inserted one row?!?!?! :D
[16:29:09] hahaha
[16:29:28] up is down, down is up, what is true anymore?
[16:29:34] False
[16:30:40] * brouberol slow claps
[16:31:17] Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (brouberol) We can now delete these topics from datahub, from stat1004: ` (2023-05-05T16.44.55_milimetric) milimetric@stat1004:~$ export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt (...
[16:55:06] Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (brouberol) Open→Resolved
[17:06:22] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (cmooney) @robh @Jclark-ctr I kicked off the reimage of an-worker1160 again. I think the problem here wasn't actually an error on the DHCP config, but a problem we have...
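Note on the `REQUESTS_CA_BUNDLE` workaround from the 16:16 message: the fix is to set that variable before running the command, so that the Python `requests` library inside the conda environment trusts the system CA file. A minimal sketch of wrapping a command this way, assuming the CA path quoted in the chat exists on the host. The helper names are hypothetical, and the demo runs a harmless Python one-liner rather than the real `datahub` CLI.

```python
import os
import subprocess
import sys

# Path quoted in the chat; assumed to be present on the stat host.
SYSTEM_CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"


def env_with_system_ca(base_env=None, ca_bundle=SYSTEM_CA_BUNDLE):
    """Return a copy of the environment with REQUESTS_CA_BUNDLE set."""
    env = dict(os.environ if base_env is None else base_env)
    env["REQUESTS_CA_BUNDLE"] = ca_bundle
    return env


def run_with_system_ca(command, ca_bundle=SYSTEM_CA_BUNDLE):
    """Run `command` (a list of arguments) with the CA bundle override applied."""
    return subprocess.run(
        command,
        env=env_with_system_ca(ca_bundle=ca_bundle),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in for the real CLI call: echo the variable the child process sees.
    result = run_with_system_ca(
        [sys.executable, "-c", "import os; print(os.environ['REQUESTS_CA_BUNDLE'])"]
    )
    print(result.stdout.strip())
```

In an interactive shell, the equivalent is the `export REQUESTS_CA_BUNDLE=...` line shown in the 16:31 Phabricator comment, followed by the command.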
[17:49:58] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:04:44] Data-Engineering, Data Products: Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest - https://phabricator.wikimedia.org/T351909 (mforns)
[19:44:55] Data-Engineering, Data Products: Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest - https://phabricator.wikimedia.org/T351909 (gmodena) Just had a chat with @JAllemandou , this could be a good use case for {T349763}. > Compromise: If we ch...
[20:11:17] (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:13:23] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.289% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:49:59] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed