[03:05:37] <wikibugs>	 Data-Engineering, AQS2.0, API Platform (Sprint 06), Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (EChukwukere-WMF) Smoke test couple of the Page views endpoints using the below code and it returned appropriate...
[08:18:15] <joal>	 Hey folks, I'm still with Naé today - I'll make sure to handle the deploy (we missed last week)
[09:06:23] <elukey>	 btullis, nfraison_ o/ added another 3 days of silence to ceph nodes (spam on #operations etc..)
[09:07:43] <nfraison_>	 ack
[09:22:05] <wikibugs>	 Analytics, Data-Engineering-Icebox: Chart data from analytics.wikimedia.org do not fully specify macOS or Windows versions. - https://phabricator.wikimedia.org/T269722 (Aklapper) a:mforns→None @mforns: Removing task assignee as this open task has been assigned for more than two years - See the em...
[09:22:27] <wikibugs>	 Analytics-Radar, Data-Engineering-Icebox, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (Aklapper) a:fkaelin→None @fkaelin: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task...
[09:22:31] <wikibugs>	 Data-Engineering, Product-Analytics: Analyze differences between checksum-based and revert-tag based reverts in mediawiki_history - https://phabricator.wikimedia.org/T266374 (Aklapper) a:nettrom_WMF→None @nettrom_WMF: Removing task assignee as this open task has been assigned for more than two ye...
[09:24:41] <wikibugs>	 (PS1) Urbanecm: Add fields to analytics/mediawiki/mentor_dashboard/visit [schemas/event/secondary] - https://gerrit.wikimedia.org/r/901539 (https://phabricator.wikimedia.org/T325117)
[09:35:42] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: Refine drops $schema field values - https://phabricator.wikimedia.org/T255818 (Aklapper) a:Ottomata→None @Ottomata: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on February 2...
[09:35:46] <wikibugs>	 Analytics, Data-Engineering-Planning, Event-Platform Value Stream: Refine event pipeline at this time refines data in hourly partitions without knowing if the partition is complete - https://phabricator.wikimedia.org/T252585 (Aklapper) a:Ottomata→None @Ottomata: Removing task assignee as thi...
[09:35:50] <wikibugs>	 Analytics, Data-Engineering-Planning, Event-Platform Value Stream: Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (Aklapper) a:Ottomata→None @Ottomata: Removing task assignee as this open task has been assigned for more than two years - See the email sent to...
[09:36:02] <wikibugs>	 Data-Engineering, Event-Platform Value Stream, MediaWiki-extensions-EventLogging: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (Aklapper) a:Ottomata→None @Ottomata: Removing task assignee as this open task has been assigned for m...
[09:46:51] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade db1108 to Bullseye - https://phabricator.wikimedia.org/T304492 (BTullis) Pausing this work whilst we wait for the replacement server.
[09:47:15] <wikibugs>	 Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (cmooney)
[09:48:35] <wikibugs>	 Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Migrate search_satisfaction.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329880 (pfischer) a:pfischer
[10:02:38] <joal>	 Hi team - I'm gonna start deploy - I'll use the deployment etherpad and tasks in ready-to-deploy
[10:04:31] <joal>	 I'll first deploy refinery, then Airflow (no change in refinery-source)
[10:06:25] <wikibugs>	 (CR) Joal: [V: +2 C: +2] "Merging for deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/896334 (owner: Jennifer Ebe)
[10:11:37] <wikibugs>	 (PS1) Joal: Add compute_mediawiki_history_reduced.hql in hql folder [analytics/refinery] - https://gerrit.wikimedia.org/r/901545
[10:13:26] <joal>	 !log Deploy refinery with sqoop
[10:13:27] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:13:34] <joal>	 !log Deploy refinery with scap sorry
[10:13:35] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:25:29] <joal>	 !log Pause pageview_actor airflow job during HDFS refinery deploy and alter table update
[10:25:30] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:26:16] <joal>	 !log Deploy refinery onto HDFS
[10:26:17] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:31:35] <nfraison_>	 !log deploy last changes on k8s dse cluster (dse-k8s-eqiad: flink-operator should watch rdf-streaming-updater, enable spark operator mutation webhook, Allow communication from spark pods to HDFS/Hive)
[10:31:36] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:41:22] <joal>	 !log Alter wmf.pageview_actor table adding referer_data field
[10:41:23] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:41:32] <joal>	 !log Unpause pageview_actor airflow dag
[10:41:33] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:49:18] <nfraison_>	 !log deployment last changes on k8s dse cluster failed due to certificate secret creation failure due to timeout contacting pki.discovery.wmnet
[10:49:20] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:50:34] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) Ah, we have a small issue with the hive package. The spec says that hive depends on `python`, which in b...
[11:01:17] <joal>	 !log Deploy analytics airflow code
[11:01:18] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:22:02] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis)
[12:22:05] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) As per feedback from @MoritzMuehlenhoff, I have added the `python-is-python3` package to the `profile::h...
[12:29:50] <wikibugs>	 Data-Engineering-Planning, MediaWiki-extensions-EventLogging, Metrics-Platform-Planning, Wikimedia-Hackathon-2023: Allow JavaScript errors to fail CI builds - https://phabricator.wikimedia.org/T318902 (kostajh) There are some helpful notes T301464#8704967 that apply to this task.
[12:50:10] <joal>	 Hi team - I have deployed Airflow code, new jobs work, but some old jobs are failing :(
[13:15:52] <joal>	 The problem comes from parameters passed to the job not having been interpreted as dates :(
[13:16:20] <joal>	 I need to dig a bit more into how to fix that :(
[13:21:16] <nfraison_>	 !log deploy last changes on k8s dse cluster (dse-k8s-eqiad: flink-operator should watch rdf-streaming-updater, enable spark operator mutation webhook, Allow communication from spark pods to HDFS/Hive)
[13:21:17] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:31:22] <btullis>	 joal: Anything I can do to help at all?
[13:32:42] <joal>	 btullis: I'm in meeting now, I'll be back to the issue after - I'll ping you
[13:36:12] <wikibugs>	 (CR) Snwachukwu: Copy add_partition hql script from Oozie to Hql folder. (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/900389 (https://phabricator.wikimedia.org/T330200) (owner: Snwachukwu)
[14:25:52] <SandraEbele>	 joal: I see that the code for simple skeinoperator was modified in the new change. This is probably the reason for the recent dag failures. The dags are the ones using archiveoperator which uses the simpleskeinoperator.
[14:26:29] <joal>	 thank you SandraEbele - I was about to look in that direction, you beat me to it :)
[14:27:21] <joal>	 I'm currently having issues with my local test environment - I have recreated a new conda environment, pip-installed, and when I pytest I get the error: No module named 'airflow'
[14:27:34] <joal>	 Has any of you seen that before?
[14:30:22] <joal>	 aqu, SandraEbele, jennifer_ebe - Any idea about my issue just above?
[14:31:17] <SandraEbele>	 Yes, I also saw it when merging my branch to the airflow main branch recently. Now we need to remove airflow as a parameter of the test function in the dag test file
[14:32:02] <joal>	 hm, not sure about what you mean SandraEbele :S
[14:33:51] <joal>	 It's as if airflow was not part of the dependencies installed on conda for the dags
[14:36:04] <SandraEbele>	 Oooh … I see. Sorry misunderstood you. Did you use the conda-environment.lock.yml to create your test conda environment?
[14:36:39] <joal>	 Nope I didn't - I created the conda-env from my own conda base env
[14:37:16] <joal>	 I guess that's the thing :)
[14:37:22] <SandraEbele>	 conda env create --name airflow-dags -f conda-environment.lock.yml
[14:37:42] <joal>	 SandraEbele: <3 I'll update the docs right now
[14:39:03] <SandraEbele>	 Yes the docs need to be updated. For now it's in the readme file of the airflow-dags repo.
[14:39:14] <joal>	 Ack!
[14:50:14] <wikibugs>	 Data-Engineering, Observability-Alerting: Migrate eventgate check_prometheus checks to alertmanager - https://phabricator.wikimedia.org/T309009 (Arnoldokoth) a:Arnoldokoth
[14:52:10] <joal>	 SandraEbele: I confirm your solution has worked for me - thanks a million :)
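A quick way to confirm that an environment built from `conda-environment.lock.yml` actually contains the packages the tests import is to probe for them before running pytest. The helper below is only a sketch: the function name `missing_modules` and the list of modules checked are assumptions for illustration, not something taken from the airflow-dags repo.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially installed packages
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical list: the modules the DAG tests are expected to import.
    print(missing_modules(["airflow", "pytest"]))
```

An empty list means both modules are importable from the active interpreter; a non-empty list points at an environment that was not created from the lock file.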
[14:56:49] <joal>	 aqu, SandraEbele, jennifer_ebe - if you have a minute: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/310
[14:57:49] <joal>	 I'll investigate the failed jobs when I get back from caring for the kids
[15:05:47] <joal>	 ebernhardson: Good morning :) For when you wake up, we have a broken test on glent_weekly.py (see https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/83452 for instance) - would you mind taking a look?
[15:06:36] <joal>	 ebernhardson: I'll also need some of your time if possible to talk about the changes you made to the SkeinOperator - we have jobs failing since I deployed them
[15:06:48] <joal>	 Gone for now, will be back in ~2h
[15:08:10] <aqu>	 joal: I would add the `script` parameter in this list: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/operators/hdfs.py#L16-19
[15:08:24] <wikibugs>	 Data-Engineering, AQS2.0, API Platform (Sprint 06), Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (JArguello-WMF) a:EChukwukere-WMF→BPirkle
[15:16:56] <ebernhardson>	 joal: hmm, i'll check it out. What about if we changed the repo settings to prevent this from happening in the future? I'm not sure what exactly needs to change, but in gerrit you can't merge a patch that breaks the test suite of its own repo
[15:36:13] <btullis>	 If anyone has a few minutes to look at this, I'd be grateful. https://gerrit.wikimedia.org/r/c/operations/puppet/+/901604 
[15:37:12] <btullis>	 It switches the spark network shuffler jars from the spark2 to spark3 version. It only affects the test cluster for now.
[15:41:45] <aqu>	 ebernhardson I've just done it. + squash at merge default to True.
[15:42:23] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:42:25] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:42:47] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1082 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:42:47] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:42:53] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1129 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:42:56] <ebernhardson>	 aqu: cool, thanks
[15:42:57] <icinga-wm>	 PROBLEM - Check systemd state on aqs1016 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:42:57] <icinga-wm>	 PROBLEM - Check systemd state on an-tool1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:43:01] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1097 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:43:15] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:43:58] <btullis>	 These errors above are acknowledged in #wikimedia-operations
[15:44:01] <icinga-wm>	 PROBLEM - Check systemd state on analytics1064 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:44:27] <icinga-wm>	 PROBLEM - Check systemd state on aqs1020 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:44:59] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:00] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1120 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:55] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1089 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:07] <icinga-wm>	 PROBLEM - Check systemd state on aqs1017 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:59] <icinga-wm>	 PROBLEM - Check systemd state on analytics1069 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:05] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:13] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:17] <icinga-wm>	 PROBLEM - Check systemd state on analytics1068 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:31] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1148 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:37] <icinga-wm>	 PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:23] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:47] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:31] <icinga-wm>	 PROBLEM - Check systemd state on aqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:31] <icinga-wm>	 PROBLEM - Check systemd state on analytics1066 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:31] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:50:13] <icinga-wm>	 RECOVERY - Check systemd state on analytics1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:50:29] <icinga-wm>	 RECOVERY - Check systemd state on analytics1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:50:43] <icinga-wm>	 RECOVERY - Check systemd state on aqs1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:50:43] <icinga-wm>	 RECOVERY - Check systemd state on aqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:50:55] <icinga-wm>	 RECOVERY - Check systemd state on aqs1017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:51:09] <icinga-wm>	 RECOVERY - Check systemd state on aqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:51:09] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:51:50] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:52:29] <icinga-wm>	 RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:54:09] <icinga-wm>	 RECOVERY - Check systemd state on an-tool1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:54:13] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1097 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:55:09] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:55:27] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:55:47] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:55:57] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:56:19] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:56:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:56:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:56:45] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1128 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:56:45] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:13] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:21] <icinga-wm>	 RECOVERY - Check systemd state on analytics1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:27] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:27] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:39] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:47] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1129 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:49] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1148 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:58:21] <icinga-wm>	 RECOVERY - Check systemd state on analytics1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:07:47] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (Stevemunene)
[16:09:46] <wikibugs>	 (PS2) Urbanecm: Add fields to analytics/mediawiki/mentor_dashboard/visit [schemas/event/secondary] - https://gerrit.wikimedia.org/r/901539 (https://phabricator.wikimedia.org/T325117)
[16:13:34] <wikibugs>	 Data-Engineering, AQS2.0, API Platform (Sprint 06), Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (JArguello-WMF) a:BPirkle→SGupta-WMF
[16:23:09] <wikibugs>	 Data-Engineering, AQS2.0, API Platform (Sprint 06), Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (JArguello-WMF)
[16:24:11] <wikibugs>	 Data-Engineering, AQS2.0, API Platform (Sprint 06), Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (JArguello-WMF) @SGupta-WMF Can you help us with your sign-off to mark this as resolved? Thanks!
[16:47:08] <wikibugs>	 Data-Engineering, Product-Analytics: Add log_search to monthly sqoop list - https://phabricator.wikimedia.org/T332621 (nettrom_WMF) I got curious and checked whether `log_search` is available on the standard replicas, and it isn't because it contains private data. T85756 exists to also make it available...
[17:00:25] <joal>	 ebernhardson: reading our PR about simpleSkeinOperator, it seems I merged the PR for mediacounts before I should have, right?
[17:00:38] <joal>	 reading YOUR, not our, sorry
[17:01:06] <ebernhardson>	 joal: hmm, looking
[17:01:33] <ebernhardson>	 joal: there have been a couple instances recently of patches that would have failed tests if they were rebased before merging, not just yours
[17:01:46] <joal>	 yes, right - 
[17:01:49] <ebernhardson>	 related to me adding the new fixture based testing, and when i added the new flake8 lints
[17:02:15] <ebernhardson>	 so far i've just been cleaning them up, they haven't been too hard to do
[17:02:15] <joal>	 it seems urgent that we merge that PR for the 'script' fix, and deploy it
[17:02:52] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:02:54] <ebernhardson>	 joal: hmm, i was writing  a complete fix but it is going to take a little more time. I'll submit a quick workaround for now so we can unblock everything
[17:03:21] <joal>	 ebernhardson: the SimpleSkein PR seems to be what we're missing - Am I missing something else?
[17:04:12] <ebernhardson>	 joal: i thought you were referring to the current test suite failure, where glent tries to template something and fails because i do an int(foo) cast on a jinja template instead of the final value
[17:04:27] <ebernhardson>	 (main branch failing CI)
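The failure mode described here, casting a template string to `int` before it has been rendered, can be reproduced without Airflow. In the sketch below, `render` is a deliberately tiny stand-in for Jinja rendering and `num_partitions` is a made-up parameter name; neither comes from the glent DAG.

```python
def render(template: str, context: dict) -> str:
    """Tiny stand-in for Jinja rendering: substitutes '{{ name }}' placeholders."""
    for key, value in context.items():
        template = template.replace("{{ " + key + " }}", str(value))
    return template


# At DAG-definition time the value is still the unrendered template text.
raw = "{{ num_partitions }}"

# Casting too early fails, because the placeholder is not a number yet.
try:
    int(raw)
    early_cast_error = None
except ValueError as err:
    early_cast_error = str(err)

# Casting after rendering works, because the final value is available.
late_value = int(render(raw, {"num_partitions": 8}))
```

The cast has to happen where the rendered value is available, that is at execution time, rather than where the DAG is defined.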
[17:06:09] <joal>	 Indeed there is this as well - I was more thinking the issue we're having with the SimpleSkein failing for us right now due to the 'script' parameter not being templated
[17:06:33] <joal>	 The main branch failing doesn't help either, but I can deal with that force-merging if needed :)
[17:07:28] <ebernhardson>	 joal: ahh, https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/307 should do the trick there, i hadn't noticed anyone else having that issue, but i ran into that while trying to deploy one of our dags yesterday
[17:08:05] <joal>	 that's exactly what I meant ebernhardson :) Are you ok for me to merge and deploy that?
[17:08:05] <ebernhardson>	 i suppose that patch mixes up a couple things...in gerrit i would have submitted 2 or 3 stacked patches but i'm still getting used to gitlab
[17:08:17] <ebernhardson>	 joal: ya, that one is ready afaik
[17:08:22] <joal>	 ack ebernhardson
[17:08:40] <joal>	 thanks a lot, I'm merging as we speak :)
[17:12:18] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:16:10] <joal>	 ebernhardson: I merged the PR, and there still are some failing tests (different from the glent one)
[17:16:52] <joal>	 One is from me (see https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/310)
[17:17:56] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:18:13] <joal>	 Ah crap - Actually, I can't submit anymore - we have blocked this with btullis - I remember it now
[17:18:16] <joal>	 hm
[17:18:27] <ebernhardson>	 joal: sec i have a patch that will fix CI i think
[17:18:35] <joal>	 ebernhardson: <3
[17:19:06] <wikibugs>	 Data-Engineering, IP Masking, Product-Analytics: Clarify definitions around anonymous and temporary editors - https://phabricator.wikimedia.org/T332205 (kzimmerman) p:Triage→High
[17:19:45] <ebernhardson>	 joal: i think https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/313 will do it, passes locally at least
[17:20:16] <ebernhardson>	 oh, i guess i need to rebase that one though :) sec
[17:21:13] <wikibugs>	 Data-Engineering, IP Masking, Product-Analytics: Clarify definitions around anonymous and temporary editors - https://phabricator.wikimedia.org/T332205 (Mayakp.wiki) a:Mayakp.wiki We will begin working on this in Q4FY22-23 with @jwang and @Milimetric
[17:27:28] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:28:21] <joal>	 ebernhardson: I'll merge as soon as it passes CI :)
[17:28:58] <joal>	 ebernhardson: thank you so much :)
[17:30:58] <ebernhardson>	 joal: np
[17:35:04] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:35:16] <joal>	 CI fixed! \o/ - I'm deploying this version, hopefully it'll fix our prod jobs
[17:36:06] <ebernhardson>	 joal: if not lemme know, if i broke something least i can do is try and fix it too :)
[17:37:40] <joal>	 ebernhardson: I should have been more careful about those changes - I understood the principles, but I had not really dug into them
[17:39:12] <joal>	 !log Deploy airflow, hopefully fixing HDFSArchiver jobs
[17:39:13] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:40:44] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:41:26] <ebernhardson>	 joal: out of curiosity, what's the error the jobs are giving?
[17:42:26] <joal>	 ebernhardson: The skein job fails, and in skein logs I could see that templated values were not interpreted
[17:44:09] <ebernhardson>	 ahh ok, hopefully that is fixed then. It's slightly curious that worked before, i initially came to look at the skein operators because templating wasn't working in the first place
[17:44:42] <ebernhardson>	 but maybe it did in certain circumstances
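Airflow only renders Jinja templates in the operator attributes named in the class's `template_fields`, so an attribute left off that list reaches the job as raw `{{ ... }}` text. That matches the symptom reported above, presumably with `script` missing from the list. The classes below are a plain-Python imitation of that mechanism under that assumption; `ArchiveLikeOperator`, its attributes and the date value are hypothetical, and this is not the SimpleSkeinOperator code.

```python
def render(template: str, context: dict) -> str:
    """Tiny stand-in for Jinja rendering: substitutes '{{ name }}' placeholders."""
    for key, value in context.items():
        template = template.replace("{{ " + key + " }}", str(value))
    return template


class ArchiveLikeOperator:
    # Only the attributes named here get rendered; 'script' is missing on purpose.
    template_fields = ("arguments",)

    def __init__(self, script: str, arguments: str):
        self.script = script
        self.arguments = arguments

    def render_template_fields(self, context: dict) -> None:
        for name in self.template_fields:
            setattr(self, name, render(getattr(self, name), context))


class FixedArchiveLikeOperator(ArchiveLikeOperator):
    # Adding 'script' to the list is the whole fix in this imitation.
    template_fields = ("script", "arguments")


context = {"ds": "2023-03-21"}  # made-up execution date

broken = ArchiveLikeOperator("archive.sh {{ ds }}", "--date {{ ds }}")
broken.render_template_fields(context)

fixed = FixedArchiveLikeOperator("archive.sh {{ ds }}", "--date {{ ds }}")
fixed.render_template_fields(context)
```

After rendering, `broken.script` still contains the placeholder while `fixed.script` carries the date.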
[17:47:17] <joal>	 I must confess I have not dug into those details ebernhardson - I just checked a rerun, it's working :) Thanks again for the quick turnaround
[17:48:31] <ebernhardson>	 glad it's working now at least :) 
[17:48:33] <joal>	 !log rerun failed airflow tasks
[17:48:34] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:48:42] <joal>	 and also thanks for the CI fix ebernhardson!
[17:50:45] <joal>	 ebernhardson: while I have your attention: thank you for the awesome work on adding the Skein templates - this will prevent us from unexpected issues in the future for sure :)
[17:52:44] <ebernhardson>	 joal: i sure hope so, they've been quite helpful to us in the past. Although they can be a bit verbose...when 100 tasks all have the exact same change i try to quickly page through the diff and verify only the thing i expected changed, but there have been times when a problem snuck in there but i didn't notice it among all the ones that look fine
[17:52:59] <ebernhardson>	 wish i could think of a reasonable way to collapse similar diffs
[17:53:40] <joal>	 not an easy problem !!
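One possible way to collapse similar diffs, offered only as a sketch and not something anyone in the channel implemented, is to reduce each task's diff to its added and removed lines and group tasks whose changes are identical. The task names and spec contents below are invented for illustration.

```python
import difflib
from collections import defaultdict


def changed_lines(old: str, new: str) -> tuple:
    """Return only the removed ('- ') and added ('+ ') lines between two texts."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return tuple(line for line in diff if line.startswith(("- ", "+ ")))


def group_by_change(before: dict, after: dict) -> dict:
    """Map each distinct change to the list of task names that share it."""
    groups = defaultdict(list)
    for task, old in before.items():
        groups[changed_lines(old, after[task])].append(task)
    return dict(groups)


# Hypothetical rendered task specs before and after a change.
before = {
    "task_a": "queue: default\nmemory: 2G",
    "task_b": "queue: default\nmemory: 2G",
    "task_c": "queue: default\nmemory: 2G",
}
after = {
    "task_a": "queue: production\nmemory: 2G",
    "task_b": "queue: production\nmemory: 2G",
    "task_c": "queue: production\nmemory: 4G",  # the change that snuck in
}

groups = group_by_change(before, after)
for change, tasks in groups.items():
    print(len(tasks), "task(s):", ", ".join(tasks))
    for line in change:
        print("   ", line)
```

With 100 tasks sharing one change, the large group needs reviewing once and any small group is the outlier worth a closer look.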
[18:07:27] <nfraison_>	 FYI first spark job running on K8S DSE cluster and reading from hdfs test cluster (relying on hadoop delegation token): https://phabricator.wikimedia.org/T331859#87158900 (dumb job from spark example jar).
[18:07:27] <nfraison_>	 Tomorrow will apply same FW rules in prod so we should be able to do the same from prod.
[18:14:47] <btullis>	 nfraison_: Excellent!
[18:23:05] <wikibugs>	 (CR) Joal: Copy add_partition hql script from Oozie to Hql folder. (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/900389 (https://phabricator.wikimedia.org/T330200) (owner: Snwachukwu)
[18:38:52] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by stevemunene@cumin1001 for host an-test-c...
[19:33:56] <wikibugs>	 Data-Engineering, IP Masking, Product-Analytics: Clarify definitions around anonymous and temporary editors - https://phabricator.wikimedia.org/T332205 (Mayakp.wiki)
[19:52:35] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by stevemunene@cumin1001 for host an-test-clien...
[20:09:32] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by stevemunene@cumin1001 for host an-test-c...
[20:10:38] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by stevemunene@cumin1001 for host an-test-clien...
[21:21:49] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by stevemunene@cumin1001 for host an-test-c...
[21:30:14] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by stevemunene@cumin1001 for host an-test-clien...