[00:15:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:21:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:22:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:24:10] (CR) Milimetric: [C: +2] "I of course don't know the details but the code looks good to me. The only thought I had was that we can make Iceberg tables now. 
I don" [analytics/refinery] - https://gerrit.wikimedia.org/r/927784 (owner: Nmaphophe) [00:25:16] (CR) Milimetric: [V: +2] GDI Equity Landscape Tables [analytics/refinery] - https://gerrit.wikimedia.org/r/927784 (owner: Nmaphophe) [00:25:33] (CR) Milimetric: [V: +2 C: +2] GDI Equity Landscape Tables [analytics/refinery] - https://gerrit.wikimedia.org/r/927784 (owner: Nmaphophe) [00:30:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:31:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:35:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:01:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:06:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:15:30] RECOVERY 
- Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:16:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:20:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:21:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:31:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:36:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:37:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service,produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:46:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:51:45] 
(SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:01:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:06:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:16:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:21:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:31:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:36:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:46:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:51:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:01:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:06:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:16:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:26:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:31:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:36:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:46:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:51:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:01:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:06:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:31:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:36:45] (SystemdUnitFailed) firing: 
(2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:46:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:51:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:01:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:06:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:21:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:31:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:36:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:01:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:06:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:16:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:21:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:26:45] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:31:45] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:34:44] (PuppetDisabled) resolved: Puppet disabled on analytics1069:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=analytics&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [06:36:45] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:46:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:51:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:16:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:21:45] (SystemdUnitFailed) firing: 
(2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:46:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:51:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:01:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:16:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:18:53] Hi joal I need some help with the monitor_refine_events service are you available? 
[08:19:34] Hi stevemunene - I just reran the failed refine job - I'm checking its log now [08:21:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:23:27] I will have a look at the failures of the produce_canary_events job that we're seeing too. [08:23:30] stevemunene: The rerun succeeded :) [08:25:26] stevemunene: the monitor failure was due to that failed instance - with the rerun, the monitor job should be fixed [08:26:26] joal: Steve mentioned that there is a refinery source version to be deployed, is this right? I couldn't see it on the etherpad. [08:26:58] hm, I don't know btullis - If it's not on the etherpad, it is not deployed :) [08:32:12] Thanks joal [08:33:14] I have just sent an email about the produce_canary_events failure. Steve, this is on you and me too. [08:35:12] joal: Could you check for me please? Do you have +2 rights on https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/928800 [08:38:02] m.ilimetric added a +2 to the eventgate-wikimedia patch last week to fix the canary alerts, but I definitely do not have the rights. [08:38:17] thanks btullis [08:39:51] btullis: I don't have +2 on that repo :( [08:42:04] stevemunene: there is a new refine-failure - do you wish we look at this together, and that I give you more context about those things? [08:42:17] joal: OK, good to know. I suspect that the Analytics group (https://gerrit.wikimedia.org/r/admin/groups/d34747bee94be39cff54b5fda1ae36b575107792,members) should have +2 rights, so I'll send a request. [08:42:26] ack btullis [08:42:29] thank you [08:43:05] In the meantime, I will make a patch for the canary events on that repo anyway. 
[08:43:51] We also have one failed airflow task [08:45:17] Yes joal shall we have a look at it [08:45:21] joal: I'll look at the airflow task and I'll try to write up clearly what I do in my email. [08:45:46] btullis, stevemunene : should we do a quick batcave to sync? [08:46:09] Yes, to the batcave [08:46:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:51:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:57:46] !log cleared airflow task for `projectview_geo.move_data_to_archive` [08:57:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:01:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:06:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:16:45] (SystemdUnitFailed) resolved: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:21:45] (SystemdUnitFailed) firing: (2) 
monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:30:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:45] (SystemdUnitFailed) resolved: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:36:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:37:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:38:17] stevemunene: monitor_refine_event succeeded :) We're all fixed in that regard :) [09:45:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:49:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:00:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:57] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, MW-1.41-notes (1.41.0-wmf.15; 2023-06-27), and 2 others: Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (dcausse) Revision create events... [10:01:46] maven learning circle starting in https://meet.google.com/ibf-ghno-gbm [10:02:45] btullis: ^ [10:02:55] joining! 
[10:05:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:21:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:31:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:36:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:42:52] gehel: Sorry, I had to decline the meeting because it clashed with my exam. I've finished now and I'm back at my desk. Hopefully I'll be able to join the next one. 
[10:46:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:51:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:01:59] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:05:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:21:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:36:44] I have created the following ticket about getting +2 rights on eventgate-wikimedia T340106 - Currently I've only requested +2 access for SREs but there's an open question about whether all Analytics group members should have it. [11:36:46] T340106: Allow members of ldap/ops group +2 rights to the eventgate-wikimedia repository - https://phabricator.wikimedia.org/T340106 [11:43:39] +1 nice :) [11:44:14] I filed the last 3 code reviews for varnishkafka and PKI - eqiad, drmds and esams [11:44:49] will probably do eqiad today/tomorrow, and the europe ones early next week [11:44:53] then we should be done :) [11:45:08] (IIUC all kafka clients will run on PKI after that) [11:45:40] I also filed a change to run cassandra on PKI, if Eric likes it I'll bother people to migrate aqs to it :) [11:45:59] (I'll test it on ml-cache first I promise, no weird settings to test) [11:46:28] ah no wait aqs is not on 4.x yet, so IIUC no hot reload [11:46:32] we'll wait :) [11:46:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:46:51] hopefully on 4.x puppet will update the cert, and cassandra will pick it up without restart [11:46:56] a dream [11:48:17] Cool, yes I've been following along with the cassandra change, but haven't added my +1 yet. Watched a few revisions go past. [11:49:19] Happy to give it a final check if you'd like me to now. [11:51:30] btullis: it is ready yes! 
[11:51:45] elukey: Looking now :) [11:51:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:09] <3 [12:00:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:41] elukey: Looks good. How was your upgrade to cassandra 4.1 on ml-cache? Did you upgrade with data in place? Wondering how feasible it is for AQS. [12:06:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:16:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:18:34] btullis: thanks! 
In my case we don't have data yet and the upgrade was very smooth, but IIUC Eric is going to migrate all the clusters to bullseye first and then upgrade [12:18:45] the procedure should be safe for data in place IIUC [12:21:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:31:11] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, MW-1.41-notes (1.41.0-wmf.15; 2023-06-27), and 2 others: Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (Ottomata) Yes! Thank you! [12:31:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:36:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:49:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units 
failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:45] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:00:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:47] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Update eventgate and eventstreams helm chart to use automatic kafka egress networkpolicies and envoy service mesh - https://phabricator.wikimedia.org/T335024 (Ottomata) a:Ottomata→tchin [13:14:01] !log deploying the new eventgate-wikimedia container to eventgate-main [13:14:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:16:57] !log move varnishkafka instances in eqiad to PKI - T337825 [13:16:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:17:00] T337825: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 [13:30:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:45] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:38:33] Quarry: Widespread 
puppet agent failures in project quarry - https://phabricator.wikimedia.org/T340114 (rook) [13:39:02] Data-Engineering, Event-Platform Value Stream: EventBus should set dt fields with greater precision than second - https://phabricator.wikimedia.org/T340067 (xcollazo) >>! In T340067#8954079, @Ottomata wrote: > @xcollazo this task should be pretty easy to do if you want to try your hand at some PHP! Are y... [13:41:28] Data-Engineering, Event-Platform Value Stream: EventBus should set dt fields with greater precision than second - https://phabricator.wikimedia.org/T340067 (Ottomata) Hah! Just trying to encourage everyone to feel a little autonomy when they want something done! :D [13:43:32] !log Remove analytics106[1-3] from the HDFS topology [13:43:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:47:25] !log run puppet on hadoop-masters [13:47:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:51:39] eqiad varnishkafkas restarted, all good! [14:01:14] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, MW-1.41-notes (1.41.0-wmf.15; 2023-06-27), and 2 others: Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (JAllemandou) >>! In T267648#8954... [14:02:44] !log running sre.hadoop.roll-restart-masters restart the Namenodes to completely remove any reference of analytics106[1-3] T317861 [14:02:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:02:47] T317861: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 [14:11:40] !log redeploying datahub to staging to try to get upgrade to 0.10.0 working. 
[14:11:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:26:18] Quarry: Widespread puppet agent failures in project quarry - https://phabricator.wikimedia.org/T340114 (rook) This appears to be a result of https://gerrit.wikimedia.org/g/operations/puppet/+/c117354fecaee25ae05153808192830f9db8bf76/modules/profile/manifests/quarry/base.pp as https://gerrit.wikimedia.org/r/a... [14:36:28] Data-Platform-SRE, Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (BTullis) I've now installed this on an-test-client1001. ` btullis@an-test-client1001:~$ sudo run-puppet-agent Info: Using configured environment 'production' Info: Retrieving pluginfacts Inf... [14:36:47] !log restarted airflow-webserver and airflow-scheduler on an-test-client1001 with version 2.6.1. [14:36:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:40:48] Data-Platform-SRE, Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (BTullis) Initial results from the web interface seem good. I will monitor for successful DAG runs. [14:49:13] stevemunene: There are some more failed airflow tasks. Do you want to look at them, or shall I? [14:52:21] We could pair on it btullis [14:52:50] OK, to the batcave. [14:54:52] Oh, these latest failures are all from the test cluster. That's probably a result of my testing version 2.6.1 of airflow. Still, let's have a look together. [15:07:44] !log clearing task for refine_webrequest_hourly_test_text hour 13:00 [15:07:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:11:15] I had to clear the state of all downstream tasks from the failed task, which was new to me. [15:18:18] !log cleared status for aqs_hourly.wait_for_webrequest run 13:00 and the downstream task on an-test-client1001. 
[15:18:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:23:57] (PS1) Elukey: druid: update the webrequest live's supervisor [analytics/refinery] - https://gerrit.wikimedia.org/r/932263 (https://phabricator.wikimedia.org/T340097) [15:25:23] (CR) Joal: [C: +1] ":)" [analytics/refinery] - https://gerrit.wikimedia.org/r/932263 (https://phabricator.wikimedia.org/T340097) (owner: Elukey) [15:25:39] <3 [15:28:30] (CR) Btullis: [C: +1] druid: update the webrequest live's supervisor [analytics/refinery] - https://gerrit.wikimedia.org/r/932263 (https://phabricator.wikimedia.org/T340097) (owner: Elukey) [15:29:37] Data-Engineering, Event-Platform Value Stream, serviceops: Flink k8s operator in staging sometimes will not sync changes to FlinkDeployments - https://phabricator.wikimedia.org/T340059 (JMeybohm) Did that happen in DSE as well? Are there logs (from the operator, k8s events etc.)? > If we try to do a... [15:33:58] Data-Engineering-Planning, DBA: Move Mediawiki QueryPages computation to Hadoop - https://phabricator.wikimedia.org/T309738 (Milimetric) Getting organized here, the pieces we need are thus: * AQS 2.0 endpoint will serve responses to /statistics/{wiki project}/{query}/{date as yyyy-mm} ([[ https://gerrit... 
[15:40:01] (CR) Elukey: [V: +2 C: +2] druid: update the webrequest live's supervisor [analytics/refinery] - https://gerrit.wikimedia.org/r/932263 (https://phabricator.wikimedia.org/T340097) (owner: Elukey) [15:50:12] !log update the webrequest_sampled_live druid kafka supervisor to add the https field - T340097 [15:50:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:50:15] T340097: Webrequests live data shows traffic without TLS on varnish for upload.w.o - https://phabricator.wikimedia.org/T340097 [15:52:29] webrequest live updated, super smooth [15:52:44] \o / [15:57:11] !log adding new bigtop-1.5 packages to apt.wikimedia.org for bullseye [15:57:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:01:38] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (BTullis) I've got another set of packages that I'm happy with now and an upgrade on an-test-worker1001 was... [16:12:13] btullis: As mforns spoke about visualizing nested types in superset, I found this (look for nested in the page) https://apache-superset.readthedocs.io/en/latest/installation.html [16:12:17] Could we try? [16:19:51] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (Stevemunene) /eqiad/B/ hosts to be decommissioned have been successfully Excluded from HDFS and YARN and removed from the HDFS topology, moving on to decommissi... [16:21:56] Data-Engineering, Event-Platform Value Stream, serviceops: Flink k8s operator in staging sometimes will not sync changes to FlinkDeployments - https://phabricator.wikimedia.org/T340059 (Ottomata) No, we've only seen it in staging. There are some suspicious events in both the flink-operator and the a... 
[16:28:39] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (JMeybohm) >>! In T338233#8927836, @gmodena wrote: >> Let's verify this with Search and SRE ServiceOps > > @JMeybohm @dcausse we'd l... [16:35:12] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (BTullis) The symlinks are present in the hive-hcatalog package, where we expect to find them: ` btullis@an... [16:44:28] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) I have re-enabled an-test-worker1001 and the nodemanager is running successfully. ` btullis@an-test-worker1001:~$ sudo run-puppet-agent Info: U... [16:45:13] joal: yes we can. [16:45:37] That'd be great :) [16:50:50] Data-Platform-SRE, Data Pipelines: Enable the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (BTullis) [16:52:57] Thanks a lot btullis :) [16:53:07] joal: https://gerrit.wikimedia.org/r/c/operations/puppet/+/932298 this is the simplest approach, but it applies to both superset-next and superset production uniformly. [16:53:45] How soon would you like it? I could push it out to superset-next now, if it would help. [16:54:38] btullis: let's do that! I'll test [16:56:47] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (BTullis) I've got the datahub-upgrade pod running. No conclusive results yet. ` btullis@deploy1002:~$ kubectl logs -f datahub-main-system-update-job-6qhzd datahub-system-update-job ERROR StatusLo... [16:57:22] joal: OK, doing it now. I'll disable puppet on the prod superset instance while we test. 
[17:00:35] https://www.irccloud.com/pastebin/OkKj5kOG/ [17:00:55] joal: Please commence testing on superset-next.wikimedia.org [17:01:06] ack btullis - will do [17:01:56] btullis: it's not perfect, but it does work! [17:02:19] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (BTullis) Network connection error, so probably a missing network policy. ` 2023-06-22 16:59:49.791 ERROR 1 --- [ main] c.l.g.factory.entity.EbeanServerFactory : Failed to connect to the... [17:03:32] milimetric, mforns - would you mind taking a quick look? [17:06:19] checking [17:10:45] I have quite some errors about not being able to get metadata :( [17:10:52] hm, btullis / joal: this doesn't seem ready, it's able to expand the event struct on some tables from the event database, but it fails to load the schema for navigationtiming. And query results still show just an array of all the values of the struct without the keys [17:11:43] yeah, I think Superset support for this is just not great, I wonder why it's such a hard problem... maybe we can send a patch for that config option [17:11:45] milimetric: result showed me multiple columns [17:12:37] joal: what query did you try, I did `select * from event.navigationtiming limit 10;` [17:12:41] I have to step away from the keyboard for a while. It's fine to (a) leave this as-is for a few days, (b) revert on superset-next or (c) roll out to superset production. Let me know what you think. [17:13:10] (a) is fine, but (b) and re-assess before (c) [17:13:36] (a) is fine for me :) [17:17:24] milimetric: batcave for a minute? [17:17:29] omw [17:49:07] Quarry, Patch-For-Review: Widespread puppet agent failures in project quarry - https://phabricator.wikimedia.org/T340114 (rook) Open→Resolved [18:23:37] Quarry, cloud-services-team (Kanban): quarry-nfs-1 went down; quarry is offline - https://phabricator.wikimedia.org/T302154 (Framawiki) >>! 
In T302154#8914904, @Andrew wrote: > Ah, sorry, I should've read back further in the task! Yes, that host can+should be deleted. thanks, deleted. [19:28:11] Data-Engineering, Data Pipelines: Update API with May Net New Content Data - https://phabricator.wikimedia.org/T339159 (Iflorez) Open→Resolved p:Triage→High [20:46:38] Data-Engineering, Event-Platform Value Stream: All eventgate clusters should be able to use remote schema repos - https://phabricator.wikimedia.org/T340166 (Ottomata) [21:22:34] Data-Engineering, Event-Platform Value Stream: All eventgate clusters should be able to use remote schema repos - https://phabricator.wikimedia.org/T340166 (Ottomata) [21:23:22] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (Ottomata) FYI Am trying to collect various things I do for 'parental leave transition' docs at https://wikitech.w... [21:46:26] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, MW-1.41-notes (1.41.0-wmf.15; 2023-06-27), and 2 others: Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (Ottomata) > would we make the ev...