[00:10:40] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (tstarling) >>! In T333223#8804661, @Tchanders wrote: > Thanks @Ladsgroup . I'd be happy to go with this, but before we do, I'd like to hear from @tstarling and/or @daniel...
[00:18:40] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:20:13] (SystemdUnitFailed) firing: (11) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:31:36] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:13] (SystemdUnitFailed) firing: (11) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:20:14] (SystemdUnitFailed) firing: (10) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:20:13] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:37:34] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Marostegui) @Ladsgroup During this operation, replication codfw -> eqiad is still active, so as there are codfw masters involved (even...
[08:37:07] Analytics-Radar, Data-Engineering-Icebox, Machine-Learning-Team, Patch-For-Review: Upgrade ROCm to 5.4 - https://phabricator.wikimedia.org/T295661 (elukey)
[08:38:59] Analytics-Clusters: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (elukey)
[08:39:02] Analytics-Radar, Data-Engineering-Icebox, Machine-Learning-Team: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (elukey) Open→Declined New procedure found and documented in T295661
[08:42:23] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (elukey) Hi folks! In T295661 I updated the AMD ROCm stack for our GPU to 5.4, but I added support only for Bullseye. When you migrate stat100[5,8] it should be sufficient...
[08:42:28] btullis: o/ --^
[08:42:58] GPUs are working on k8s, so all tests went fine!
[08:53:12] nice!
[09:20:13] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:04:59] elukey: That's excellent!
[10:05:41] Are you still planning to move those remaining two GPUs from Hadoop to Lift Wing?
[10:06:05] btullis: yes yes when DCops has time :)
[10:07:39] I am also wondering if the 4 gpus on train wing would be more useful on lift wing, but probably if I ask dcops to move them again they will kill me :D
[10:07:56] the most pressing use case for the next fiscal will surely be serving
[12:14:19] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Ottomata) @tstarling do you have a preference for user_is_temp boolean vs user_type with value?
[12:17:35] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Ladsgroup) >>! In T333223#8810342, @Ottomata wrote: > @tstarling do you have a preference for user_is_temp boolean vs user_type with value? Maybe this comment would answ...
[12:33:31] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Ottomata) Right, but in that comment 'here' is in relation to the event schema we were designing. I think we are considering such a concept in MW core now?
[12:57:29] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (fgiunchedi)
[13:00:13] (SystemdUnitFailed) firing: (10) jupyter-stevemunene-singleuser-conda-analytics.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:00:59] Quarry, cloud-services-team (FY2022/2023-Q3): Consider moving Quarry to be an installation of a community supported analytics tool - https://phabricator.wikimedia.org/T169452 (rook)
[13:21:26] (PS1) Barakat Ajadi: CentralNoticeTiming: Remove CentralNoticeTiming schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/912852 (https://phabricator.wikimedia.org/T334550)
[13:50:59] (CR) Milimetric: Migrate unique devices druid loading queries to Airflow/SparkSQL (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096) (owner: Mforns)
[14:14:32] RECOVERY - IPMI Sensor Status on aqs2008 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:23:49] (CR) Milimetric: Migrate unique devices druid loading queries to Airflow/SparkSQL (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096) (owner: Mforns)
[14:34:46] (CR) Snwachukwu: "Overall Code LGTM." [analytics/refinery] - https://gerrit.wikimedia.org/r/911890 (https://phabricator.wikimedia.org/T334106) (owner: Mforns)
[14:49:08] PROBLEM - Host an-worker1147 is DOWN: PING CRITICAL - Packet loss = 100%
[14:50:44] RECOVERY - Host an-worker1147 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[14:50:58] PROBLEM - IPMI Sensor Status on an-worker1147 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:07:23] (CR) Mforns: Migrate unique devices druid loading queries to Airflow/SparkSQL (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096) (owner: Mforns)
[15:11:19] (CR) Mforns: Migrate queries for webrequest_sampled_128 to /hql (Airflow/Spark3) (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/911890 (https://phabricator.wikimedia.org/T334106) (owner: Mforns)
[15:33:21] Data-Engineering-Planning, Data Pipelines (Sprint 12): Delete empty tables unique_devices_*_wide_* - https://phabricator.wikimedia.org/T329978 (mforns)
[15:38:14] Heya milimetric :] I'm trying to delete a table from datahub (deprecated unique_devices_project_wide), but can not find the way. Should I be able to? Otherwise, do you know how to do it?
[15:40:25] ha... I see that even if it is deprecated and empty, datahub has synchronized the table 15 hours ago, maybe once it syncs again and sees that the tables are deleted, it will automatically remove the table from datahub?
[15:44:10] Data-Engineering-Planning, Data Pipelines (Sprint 12): Delete empty tables unique_devices_*_wide_* - https://phabricator.wikimedia.org/T329978 (mforns) I dropped the tables from Hive: ` wmf.unique_devices_project_wide_monthly wmf.unique_devices_project_wide_daily ` Also, checked that there's no reference...
[15:58:38] Data-Engineering-Planning: Data Engineering Pairing system - https://phabricator.wikimedia.org/T327790 (JArguello-WMF)
[16:10:46] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (JayCano) I believe it's still relevant here, it's out of scope for the work at hand. If we wanted to add a `user_type` attribute, we would need much more work and discuss...
[17:00:13] (SystemdUnitFailed) firing: (10) jupyter-stevemunene-singleuser-conda-analytics.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:01:00] @mforns you can delete but it's a manual process, there's docs on it. I'd say we should automate it more
[17:01:26] if it's not urgent I'd make a task to automate deletion and prioritize it. The manual way is not awesome
[17:01:36] milimetric: ok, no problem will do! thanks :]
[17:04:39] https://datahubproject.io/docs/api/tutorials/modifying-datasets/#delete-dataset
[18:24:36] Data-Engineering, Data Pipelines: [datahub] Implement automatic deletion of datasets with deleted data sources - https://phabricator.wikimedia.org/T335528 (mforns)
[19:53:53] Data-Engineering, Product-Analytics (Kanban): Product Analytics ETL Migration: Pilot (MediaSearch ETLs) - https://phabricator.wikimedia.org/T333208 (xcollazo)
[19:53:55] Data-Engineering-Planning, Data Pipelines, Epic: Support for Product Analytics Data Pipelines Migration to Airflow - https://phabricator.wikimedia.org/T332997 (xcollazo)
[21:00:13] (SystemdUnitFailed) firing: (10) jupyter-stevemunene-singleuser-conda-analytics.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed