[02:11:29] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 1.472% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:19:22] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:11:29] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 1.117% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:19:26] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:24:15] Data-Engineering: Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (Stevemunene)
[07:53:21] Data-Engineering: Check home/HDFS leftovers of hshaath - https://phabricator.wikimedia.org/T335263 (MoritzMuehlenhoff)
[07:54:50] Data-Engineering: Check home/HDFS leftovers of hghani - https://phabricator.wikimedia.org/T335264 (MoritzMuehlenhoff)
[07:55:37] Data-Engineering: Check home/HDFS leftovers of ilooremeta - https://phabricator.wikimedia.org/T335265 (MoritzMuehlenhoff)
[08:22:01] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 11): MegaRAID error on an-worker1110 - https://phabricator.wikimedia.org/T334832 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=07e44cdf-7e7b-43db-8dbf-58cadfec44f7) set by btullis@cumin1001 for 2 days, 0:00:00 on 1...
[08:55:22] (PS3) Aqu: Add unit tests on raw webrequest data loss reports job [analytics/refinery] - https://gerrit.wikimedia.org/r/908776 (https://phabricator.wikimedia.org/T332707)
[08:57:12] (PS4) Aqu: Add unit tests on raw webrequest data loss reports job [analytics/refinery] - https://gerrit.wikimedia.org/r/908776 (https://phabricator.wikimedia.org/T332707)
[08:58:28] (CR) Aqu: "Thanks for the review!" [analytics/refinery] - https://gerrit.wikimedia.org/r/908776 (https://phabricator.wikimedia.org/T332707) (owner: Aqu)
[09:01:30] (CR) Joal: [C: +1] "LGTM! Thanks for the changes" [analytics/refinery] - https://gerrit.wikimedia.org/r/908776 (https://phabricator.wikimedia.org/T332707) (owner: Aqu)
[09:39:50] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 11): MegaRAID error on an-worker1110 - https://phabricator.wikimedia.org/T334832 (BTullis) Upgrading firmware with the following command: ` btullis@an-worker1110:~$ sudo ./SAS-RAID_Firmware_700GG_LN_25.5.9.0001_A17.BIN Collecting inv...
[09:40:12] !log upgrading RAID controller firmware on an-worker1110 T334832
[09:40:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:40:14] T334832: MegaRAID error on an-worker1110 - https://phabricator.wikimedia.org/T334832
[09:48:05] (PS5) Aqu: Add unit tests on raw webrequest data loss reports job [analytics/refinery] - https://gerrit.wikimedia.org/r/908776 (https://phabricator.wikimedia.org/T332707)
[09:51:42] (CR) Aqu: [V: +2 C: +2] Add unit tests on raw webrequest data loss reports job [analytics/refinery] - https://gerrit.wikimedia.org/r/908776 (https://phabricator.wikimedia.org/T332707) (owner: Aqu)
[10:01:21] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 11): MegaRAID error on an-worker1110 - https://phabricator.wikimedia.org/T334832 (BTullis) After the reboot, things look a little better. ` btullis@an-worker1110:~$ sudo megacli -LDInfo -LAll -aAll | grep "Cache Policy:" Default Cache...
[10:01:37] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 11): MegaRAID error on an-worker1110 - https://phabricator.wikimedia.org/T334832 (BTullis) Open→Resolved
[10:11:29] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 0.9079% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:19:29] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:34:17] Data-Engineering, MediaWiki-extensions-EventLogging, MW-1.41-notes (1.41.0-wmf.1; 2023-03-20), Metrics-Platform-Planning (Metrics Platform Kanban): Value for performer.registration_dt should be a string, not an integer - https://phabricator.wikimedia.org/T331972 (phuedx) Open→Resolved Bei...
[12:47:41] (CR) Ottomata: [C: +2] Add event schema for ML classification change on current page state [schemas/event/primary] - https://gerrit.wikimedia.org/r/905965 (https://phabricator.wikimedia.org/T331401) (owner: AikoChou)
[12:48:12] (Merged) jenkins-bot: Add event schema for ML classification change on current page state [schemas/event/primary] - https://gerrit.wikimedia.org/r/905965 (https://phabricator.wikimedia.org/T331401) (owner: AikoChou)
[12:56:33] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 11): Q4 eventutilities-python should bundle java deps. - https://phabricator.wikimedia.org/T327251 (JArguello-WMF) Open→Resolved
[12:59:17] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12), Patch-For-Review: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (JArguello-WMF)
[12:59:20] Data-Engineering, Machine-Learning-Team, Research, Event-Platform Value Stream (Sprint 12), Patch-For-Review: Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (JArguello-WMF)
[12:59:30] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: mediawiki-event-enrichment: issue async requests from MapFunction context - https://phabricator.wikimedia.org/T332948 (JArguello-WMF)
[12:59:43] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12): Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (JArguello-WMF)
[12:59:54] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12): Event Driven Data Pipelines should be generated from a template - https://phabricator.wikimedia.org/T324980 (JArguello-WMF)
[13:00:43] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 12), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (JArguello-WMF)
[13:00:54] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 12), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (JArguello-WMF)
[13:16:42] Data-Engineering-Planning, Data Pipelines (Sprint 12): Setup config to allow lineage instrumentation - https://phabricator.wikimedia.org/T333004 (JArguello-WMF)
[13:16:44] Data-Engineering-Planning, Data Pipelines (Sprint 12): Deprecate old mobile datasets - https://phabricator.wikimedia.org/T329310 (JArguello-WMF)
[13:16:46] Data-Engineering-Planning, Data Pipelines (Sprint 12): Delete empty tables unique_devices_*_wide_* - https://phabricator.wikimedia.org/T329978 (JArguello-WMF)
[13:19:03] Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 12): 2 additional new wikis - https://phabricator.wikimedia.org/T332070 (JArguello-WMF)
[13:19:11] Data-Engineering-Planning, Data Pipelines (Sprint 12), Patch-For-Review: Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (JArguello-WMF)
[13:30:35] ottomata: Heya - let me know if you wish to discuss my CR on hdfs-sync - I didn't do exactly as we originally planned
[13:56:44] ottomata: sorry I hadn't seen your review - I sent a new patch with a version of the changes you suggested - let me know if that's what you expected
[14:16:43] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 0.6914% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[14:22:33] hello team! joal, yt? I have some questions about the webrequest sampled job that I'm testing now in spark3-sql... can you help me please? :]
[14:31:32] joal: hi just saw these IRCs (my client wasn't connected?) anyway, added more comments, looking good!
[14:59:00] mforns: I'm reviewing to learn and will merge soon, if that's ok
[14:59:25] (your druid airflow MR)
[15:07:26] Data-Engineering-Planning, Event-Platform Value Stream: Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (JArguello-WMF)
[15:07:32] milimetric: sure! thanks :]
[15:08:28] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 12): Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (JArguello-WMF)
[15:08:32] Data-Engineering-Planning, SRE-swift-storage, Event-Platform Value Stream (Sprint 12): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (JArguello-WMF)
[15:19:26] Data-Engineering, observability, Event-Platform Value Stream (Sprint 11): Produce requests to eventgate-logging-external in eqiad occasionally fail. - https://phabricator.wikimedia.org/T334510 (JArguello-WMF) Open→Resolved
[15:19:30] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 11), Patch-For-Review: Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (JArguello-WMF) Open→Resolved
[15:20:12] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:20:54] Data-Engineering, Patch-For-Review: Determine which team should own airflow1005/update contact info - https://phabricator.wikimedia.org/T334522 (bking)
[15:27:50] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (LSobanski)
[15:28:27] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (bking)
[17:40:52] joal: o/ am running Puppet Catalog Compiler and getting errors on your patch
[17:41:11] rather than doing that and then commenting, I can show you how to do that (unless you already know how?!)
[17:41:42] i do it this way: https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Catalog_compiler_local_run_(pcc_utility)
[17:41:54] gets me
[17:41:54] https://puppet-compiler.wmflabs.org/output/910761/40805/
[18:00:32] I'm gonna learn how to do that ottomata
[18:16:43] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 0.4745% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:39:18] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Ladsgroup) I've read the discussion in {T308017} and I think `user_is_temp` field on `user` table is the better option than: - user types field on user table: because as...
[18:51:50] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12): Use new PageUndeleteComplete hook to emit mediawiki.page_change undelete event - https://phabricator.wikimedia.org/T328308 (OwenRB) Open→Resolved a:OwenRB I think the above patch makes this resolved now?
[18:51:53] Data-Engineering-Planning, Event-Platform Value Stream, MediaWiki-Core-Hooks, MW-1.40-notes (1.40.0-wmf.26; 2023-03-06): Create PageUndeleteComplete hook, analogous to PageDeleteComplete - https://phabricator.wikimedia.org/T321412 (OwenRB)
[18:57:18] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 12): Use new PageUndeleteComplete hook to emit mediawiki.page_change undelete event - https://phabricator.wikimedia.org/T328308 (Ottomata) This change should be deployed with the [[ https://wikitech.wikimedia.org/wiki/Deployments/Train | depl...
[19:20:12] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:18:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:20:12] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:30:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:35:12] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:58:20] (PS1) Kimberly Sarabia: Creates web schema fragment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/911412 (https://phabricator.wikimedia.org/T335309)
[22:16:43] (DiskSpace) firing: Disk space an-test-worker1002:9100:/ 0.2585% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace