[00:53:28] (SystemdUnitFailed) firing: (20) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:15:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:28] (SystemdUnitFailed) firing: (20) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:18:31] (SystemdUnitFailed) firing: (19) jupyter-stevemunene-singleuser-conda-analytics.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:21:24] Data-Engineering, Event-Platform Value Stream: eventutilities-python EventProcessFunction throws NPE if user func returns None - https://phabricator.wikimedia.org/T335706 (gmodena) a:gmodena
[09:18:28] (SystemdUnitFailed) firing: (19) jupyter-stevemunene-singleuser-conda-analytics.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:31:29] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (BTullis) For the journal node (analytics1069) we already have some good documentation about the process here: https://wikitech.wikimedia.org/wi...
[10:36:09] Data-Engineering, Event-Platform Value Stream: eventutilities-python EventProcessFunction throws NPE if user func returns None - https://phabricator.wikimedia.org/T335706 (gmodena) Discussed with @Ottomata, and we opted for //option 2: We log and raise a more functional error message.// Currently our AP...
[10:38:16] Data-Engineering, Event-Platform Value Stream (Sprint 12): eventutilities-python EventProcessFunction throws NPE if user func returns None - https://phabricator.wikimedia.org/T335706 (gmodena)
[10:59:03] (PS1) Barakat Ajadi: CentralNoticeTiming: remove Central Notice Timing in refinery [analytics/refinery] - https://gerrit.wikimedia.org/r/917855 (https://phabricator.wikimedia.org/T334550)
[11:33:10] Data-Engineering-Planning, Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (DDeSouza) Thanks @Ottomata
[11:45:41] Data-Engineering, decommission-hardware: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (Marostegui)
[11:46:38] Data-Engineering, decommission-hardware: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (Marostegui) Open→Stalled This is stalled until the replacement for db1108, is productionized {T334055}
[11:46:53] Data-Engineering, Data-Persistence, Event-Platform Value Stream, IP Masking, Platform Engineering: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (Ladsgroup) While I definitely support bringing some order to this mess, I need to warn over a lot of conceptual complexities on...
[11:46:58] Data-Engineering, decommission-hardware: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (Marostegui)
[11:48:31] (CR) Mforns: "Queries look good to me! +1" [analytics/refinery] - https://gerrit.wikimedia.org/r/912360 (https://phabricator.wikimedia.org/T334105) (owner: Milimetric)
[11:48:35] (CR) Mforns: [C: +1] Adapt virtualpageview druid scripts to spark [analytics/refinery] - https://gerrit.wikimedia.org/r/912360 (https://phabricator.wikimedia.org/T334105) (owner: Milimetric)
[12:25:01] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: eventutilities-python EventProcessFunction throws NPE if user func returns None - https://phabricator.wikimedia.org/T335706 (gmodena)
[12:27:32] !log rebooting eventlog1003 for T325132
[12:27:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:30:50] Data-Engineering, Data-Persistence, Event-Platform Value Stream, IP Masking, Platform Engineering: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (Tgr) Leaving aside bot, which isn't really a user type, I think the meaningful types are normal user, system user, temp user, im...
[12:54:27] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure: MegaRAID error on an-worker1088 - https://phabricator.wikimedia.org/T336077 (BTullis)
[12:56:37] Data-Engineering, Data-Persistence, IP Masking, Anti-Harassment (AHaT Sprint 30: Avanton Gold Cone): Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Tchanders)
[12:58:57] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure: MegaRAID error on an-worker1088 - https://phabricator.wikimedia.org/T336077 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=19671b41-1b94-43a0-9de6-433c868243f3) set by btullis@cumin1001 for 1 day, 0:00:00 on 1 host(s)...
[12:59:28] !log upgrading SAS RAID controller firmware on an-worker1088 for T336077
[12:59:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:59:33] T336077: MegaRAID error on an-worker1088 - https://phabricator.wikimedia.org/T336077
[13:01:36] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure: MegaRAID error on an-worker1088 - https://phabricator.wikimedia.org/T336077 (BTullis) This was the battery status information from this controller. ` btullis@an-worker1088:~$ sudo megacli -AdpBbuCmd -aALL...
[13:02:11] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure: MegaRAID error on an-worker1088 - https://phabricator.wikimedia.org/T336077 (BTullis) This is the relevant command and output: ` btullis@an-worker1088:~$ sudo ./SAS-RAID_Firmware_700GG_LN_25.5.9.0001_A17.BIN Collecting inventory... .^C...
[13:02:30] !log rebooting an-worker1088 after firmware upgrade for T336077
[13:02:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:10:23] Data-Engineering, Data-Persistence, Event-Platform Value Stream, IP Masking, Platform Engineering: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (daniel) I agree that "bot" is not a user type. The unfortunate reality is that "bot" is a permission that *allows* a given user to...
[13:17:20] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure: MegaRAID error on an-worker1088 - https://phabricator.wikimedia.org/T336077 (BTullis)
[13:18:32] (SystemdUnitFailed) firing: (18) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:21:29] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure: MegaRAID error on an-worker1088 - https://phabricator.wikimedia.org/T336077 (BTullis) I've created a child ticket for #ops-eqiad to replace the battery for the RAID controller on this host.
[13:41:20] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure: MegaRAID error on an-worker1088 - https://phabricator.wikimedia.org/T336077 (Jclark-ctr)
[13:51:45] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure: MegaRAID error on an-worker1088 - https://phabricator.wikimedia.org/T336077 (BTullis) Open→Resolved a:BTullis Hyper-efficient work there from @Jclark-ctr. Many thanks. Battery replaced and the RAID error has gone. Here's the...
[13:51:46] mforns: re: druid doesn't this write to an intermediate JSON temp file anyway? https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-spark/src/main/scala/org/wikimedia/analytics/refinery/spark/connectors/DataFrameToDruid.scala#L216
[13:51:59] (also I thought the coalesce 6 thing was for the cassandra cluster)
[14:29:10] Data-Engineering-Planning, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (pfischer) @achou, search team started working on [[ https://phabricator.wikimedia.org/T325315 | ingesting redirects in their up...
[14:29:20] Data-Engineering, Event-Platform Value Stream (Sprint 12): Enable HA failover for flink-kubernetes-operator - https://phabricator.wikimedia.org/T336185 (Ottomata) It's a bug. https://issues.apache.org/jira/browse/FLINK-32041
[14:31:06] (CR) Btullis: [C: +2] "recheck due to new pipelines" [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/916483 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[14:36:28] Data-Engineering, Data-Persistence, Event-Platform Value Stream, IP Masking, Platform Engineering: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (Ottomata) Removing bot from the description. I think I put it in there because we have an [[ https://github.com/wikimedia/schem...
[14:36:37] Data-Engineering, Data-Persistence, Event-Platform Value Stream, IP Masking, Platform Engineering: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (Ottomata)
[15:02:50] Data-Engineering, Data-Persistence, Event-Platform Value Stream, IP Masking, Platform Engineering: MediaWiki user types - https://phabricator.wikimedia.org/T336176 (Tgr) It can change when a normal user is converted into a system user (`User::newSystemUser` with the `'steal'` flag) although i...
[15:11:59] (CR) Xcollazo: "This change is ready for review." (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/917404 (https://phabricator.wikimedia.org/T335305) (owner: Xcollazo)
[15:13:43] milimetric: yes
[15:14:14] milimetric: but it does not alter the number of partitions IIRC
[15:14:40] Oh! coalesce 6 was indeed for the cassandra cluster...
[15:15:15] milimetric: I can give you a +2 if you want to deploy now :-)
[15:19:55] (CR) Mforns: [C: +2] Adapt virtualpageview druid scripts to spark [analytics/refinery] - https://gerrit.wikimedia.org/r/912360 (https://phabricator.wikimedia.org/T334105) (owner: Milimetric)
[15:37:49] mforns: I see. I think if we don't need coalesce for performance on the Spark side, we shouldn't add it. Because lots of little files are fine when DataframeToDruid reads, and we can tune that to coalesce before writing the temp if we want. But I think it's fine for now, we can talk more as a group about it
[15:38:02] (also depends when you wanted to write the next version of DataframeToDruid)
[15:51:02] (PS2) Krinkle: CentralNoticeTiming: remove Central Notice Timing in refinery [analytics/refinery] - https://gerrit.wikimedia.org/r/917855 (https://phabricator.wikimedia.org/T334550) (owner: Barakat Ajadi)
[15:51:47] milimetric: We've got a stack of three refinery patches that could use guidance / deployment. starting from ^ by Barakat ( https://gerrit.wikimedia.org/r/c/analytics/refinery/+/917855 ), and then two below that by myself and Peter.
[15:52:21] I've added you as reviewer, but if I should add someone else / follow a different process, let me know!
[15:53:07] sorry I didn't see those before Krinkle, I'll review now
[16:28:34] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure: Fix hiveserver2 related errors on bullseye hadoop clients and workers - https://phabricator.wikimedia.org/T336281 (BTullis)
[16:28:54] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure: Fix hiveserver2 related errors on bullseye hadoop clients and workers - https://phabricator.wikimedia.org/T336281 (BTullis)
[16:29:00] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis)
[16:30:02] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): Fix hiveserver2 related errors on bullseye hadoop clients and workers - https://phabricator.wikimedia.org/T336281 (BTullis)
[16:30:23] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 12): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) I've made a subtask for {T336281} and I'll have a look at this now.
[16:37:03] (CR) Hashar: "I have asked Zuul to add this change to its "postmerge" pipeline which would cause it to run the jobs as if the change just got merged:" [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/916483 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[16:38:02] (PS4) Milimetric: Adapt virtualpageview druid scripts to spark [analytics/refinery] - https://gerrit.wikimedia.org/r/912360 (https://phabricator.wikimedia.org/T334105)
[16:38:58] (CR) Milimetric: "k, coalesced following a conversation with Jo. I kind of hand-wavy guessed at the right coalesce level, but with the data size and the " [analytics/refinery] - https://gerrit.wikimedia.org/r/912360 (https://phabricator.wikimedia.org/T334105) (owner: Milimetric)
[17:13:08] Data-Engineering, Product-Analytics (Kanban): Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (mpopov) a:Mayakp.wiki
[17:18:32] (SystemdUnitFailed) firing: (18) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:08:28] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:15:28] (CR) Jdlrobson: [C: -1] "Per Clare's suggestion would be great to pull out editattemptstep as this is not a web team schema." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/916625 (https://phabricator.wikimedia.org/T335309) (owner: Kimberly Sarabia)
[18:44:15] Data-Engineering-Planning, Event-Platform Value Stream, serviceops-radar: mediawiki-event-enrichment in k8s should use mwapi-async envoy listener for MW api requests - https://phabricator.wikimedia.org/T333575 (Ottomata)
[18:53:56] ping milimetric - wanna chat?
[19:14:37] hi joal
[19:14:47] is it too late? I was in a meeting
[19:15:05] Heya - I have another meeting in 15min, let's talk if you wish :)
[19:15:13] to the batcave!
[20:23:14] Data-Engineering-Planning, serviceops-radar, Event-Platform Value Stream (Sprint 12), Patch-For-Review: mediawiki-event-enrichment in k8s should use mwapi-async envoy listener for MW api requests - https://phabricator.wikimedia.org/T333575 (Ottomata)
[20:23:20] Data-Engineering-Planning, serviceops-radar, Event-Platform Value Stream (Sprint 12), Patch-For-Review: mediawiki-event-enrichment in k8s should use mwapi-async envoy listener for MW api requests - https://phabricator.wikimedia.org/T333575 (Ottomata) a:Ottomata
[20:50:58] (PS1) Urbanecm: [Growth] Personalized praise: Add database [schemas/event/secondary] - https://gerrit.wikimedia.org/r/917951 (https://phabricator.wikimedia.org/T325117)
[20:51:18] (PS2) Urbanecm: [Growth] Personalized praise: Add database [schemas/event/secondary] - https://gerrit.wikimedia.org/r/917951 (https://phabricator.wikimedia.org/T325117)
[21:33:28] Data-Engineering-Planning, serviceops-radar, Event-Platform Value Stream (Sprint 12): mediawiki-event-enrichment in k8s should use mwapi-async envoy listener for MW api requests - https://phabricator.wikimedia.org/T333575 (Ottomata) This is done.
[22:08:32] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:14:37] Data-Engineering, Product-Analytics (Kanban): Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (Mayakp.wiki) @Milimetric , thank you so much for simulating the User-Agent deprecation impact and sharing your results. Here's what I'm planning to do next:...
[23:26:35] (PS1) Kimberly Sarabia: Merge "CentralNoticeTiming: Remove CentralNoticeTiming schema" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/917966
[23:28:03] (Abandoned) Kimberly Sarabia: Merge "CentralNoticeTiming: Remove CentralNoticeTiming schema" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/917966 (owner: Kimberly Sarabia)