[05:27:18] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (Stevemunene) >>! In T317861#8959123, @elukey wrote: > @Stevemunene I still see the following from the hdfs topology: > > ` > Rack: /eqiad/default/rack > 10....
[05:31:27] Data-Platform-SRE, decommission-hardware: decommission analytics1061.eqiad.wmnet - https://phabricator.wikimedia.org/T339199 (Stevemunene)
[05:32:09] Data-Platform-SRE, decommission-hardware: decommission analytics1058.eqiad.wmnet - https://phabricator.wikimedia.org/T338227 (Stevemunene)
[05:32:46] Data-Platform-SRE, decommission-hardware: decommission analytics1059.eqiad.wmnet - https://phabricator.wikimedia.org/T338408 (Stevemunene)
[05:33:00] Data-Platform-SRE, decommission-hardware: decommission analytics1060.eqiad.wmnet - https://phabricator.wikimedia.org/T338409 (Stevemunene)
[05:33:55] Data-Platform-SRE, decommission-hardware: decommission analytics1062.eqiad.wmnet - https://phabricator.wikimedia.org/T339200 (Stevemunene)
[05:34:15] Data-Platform-SRE, decommission-hardware: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 (Stevemunene)
[06:27:00] !log Excluding analytics106[4-6] from HDFS and YARN as we decommission them
[06:27:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:28:26] !log run puppet on hadoop-masters
[06:28:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:34:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:07] PROBLEM - Hadoop NodeManager on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:36:45] (SystemdUnitFailed) firing: (4) refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:02:15] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure: Replace db1108 with db1208 - https://phabricator.wikimedia.org/T334055 (Marostegui) Can we please give this some priority? This is the only Buster host we still have.
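The default-rack entries that elukey quotes above come from the namenode's view of the network topology: a host that re-registers without topology data lands in /eqiad/default/rack. A sketch of how that state can be inspected from a Hadoop master follows; the sudo invocation and config path are assumptions and may differ on this cluster:

```bash
# Print the namenode's current datanode-to-rack mapping. Re-registered
# hosts with no net-topology entry show up under the default rack.
sudo -u hdfs hdfs dfsadmin -printTopology

# Cross-check which hosts are currently listed for exclusion.
cat /etc/hadoop/conf/hosts.exclude
```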
[07:06:45] (SystemdUnitFailed) firing: (6) refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:30:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:45] (SystemdUnitFailed) firing: (4) refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:17:56] Hi elukey, just as you had mentioned, despite being decommissioned and removed from the topology, analytics106[1-3] still seem to be active in the cluster (cc btullis). I am holding off on the hosts' decommissioning for now.
[08:18:26] https://usercontent.irccloud-cdn.com/file/S46QFYCL/61-63
[08:19:45] https://usercontent.irccloud-cdn.com/file/bKWdiGr5/image.png
[08:20:11] Data-Engineering: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (MoritzMuehlenhoff)
[08:21:37] stevemunene: it is weird, the net topology is updated and the namenodes were restarted
[08:27:23] stevemunene: how can we proceed? Do you have any suggestions?
[08:29:32] maybe I know
[08:30:05] stevemunene: see https://gerrit.wikimedia.org/r/c/operations/puppet/+/906017/2/hieradata/role/common/analytics_cluster/hadoop/master.yaml ?
[08:30:13] you added both the fqdn and the shortname
[08:30:28] I suspect that with analytics106[1-3] we added only the fqdn
[08:30:38] so the decom process didn't kick off for hdfs
[08:31:41] so one way to proceed could be to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/930582 temporarily
[08:31:52] Yes, for most of the hosts we have only used the fqdn, including analytics1058-1060
[08:32:17] stevemunene: yeah but IIRC hdfs takes only the shortname
[08:32:29] this is why in the docs we added both, I recall that I had a similar issue
[08:32:53] so, we can temporarily add the correct config, and then check if it works
[08:33:11] from https://saturncloud.io/blog/how-to-correctly-remove-nodes-in-hadoop/ it seems that we could avoid the namenode restarts
[08:33:20] with `hdfs dfsadmin -refreshNodes`
[08:34:00] Ahaa, thanks. I am currently in a training session; I shall do this once done and fully engaged
[08:35:01] sure no problem :)
[08:37:41] Morning all. I'm going to be upgrading presto in production to version 0.281 shortly. It should be fairly painless, but there's a chance of some instability whilst the services restart.
[08:39:16] Ack btullis
[08:43:08] stevemunene: elukey: I think it's because of this change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/930582 - this re-enables analytics106[1-3] because they hadn't actually been fully decommissioned.
[08:44:00] btullis: o/ but in theory they should have been; my theory is that we didn't add the shortnames to the hosts.exclude
[08:44:13] so when Steve restarted the namenodes, the datanodes didn't get excluded
[08:44:31] the change above has not been applied yet
[08:46:07] I like the `hdfs dfsadmin -refreshNodes` - that will save time. We could do the same as we did with YARN and automate this in the same way: https://github.com/wikimedia/operations-puppet/blob/production/modules/bigtop/manifests/hadoop/resourcemanager.pp#L15
[08:48:09] ah interesting, yes!
[08:48:19] it would save a lot of restarts
[09:01:18] OK, I see your theory about the shortname vs fqdn. I'm happy to accept that it might be right, but I'm not sure yet. When we did analytics1058, which is the one that Steve and I did whilst pairing on it, we definitely used only the FQDN: https://gerrit.wikimedia.org/r/c/operations/puppet/+/927667/
[09:02:25] We verified that it went into the decommissioning state with a screenshot here: https://phabricator.wikimedia.org/T317861#8908671
[09:03:56] Although we noted that it's interesting that even when it has completed the decommissioning phase and is 'decommissioned', it doesn't remove data from the host, so the green bar doesn't get any smaller. https://phabricator.wikimedia.org/T317861#8908799
[09:04:51] yeah but that makes sense, no point for the node in doing more things
[09:05:28] the fqdn vs short-name theory is the only one that I get; IIRC we got bitten in the past by something similar
[09:05:44] maybe fqdn and refresh nodes works
[09:06:21] btullis: an easy test could be to disable puppet on an-master1001, change the hosts.exclude file and refresh the nodes
[09:06:29] if they go into decom we are good
[09:06:49] (it seems fine for this use case to avoid other code reviews, but only imho)
[09:11:04] Yeah, I'm going to leave this for stevemunene I think. I've got my hands full with presto at the moment. Still, it is interesting. It's like the change to `hosts.exclude` got picked up automatically, without a restart of the namenodes.
[09:16:50] Yes, I shall have a look. Thanks btullis and elukey
[09:38:48] !log deploying presto version 0.281 to production
[09:38:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:40:57] !log Rerun failed druid-loading airflow jobs
[09:40:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:45:04] Data-Engineering, Infrastructure-Foundations: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (MoritzMuehlenhoff) Open→Resolved a: MoritzMuehlenhoff With the custom logrotate config the log rotation now works as expected: ` krb1001:/var/log/kerberos# ls -lha to...
[09:46:35] Data-Engineering, Infrastructure-Foundations, SRE: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (MoritzMuehlenhoff)
[09:46:46] (SystemdUnitFailed) firing: (19) presto-server.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:51:46] (SystemdUnitFailed) firing: (19) presto-server.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:52:16] (Abandoned) Aqu: Cleanup dependencies of core/pom.xml file [analytics/refinery/source] - https://gerrit.wikimedia.org/r/780886 (https://phabricator.wikimedia.org/T306193) (owner: Aqu)
[10:05:03] Hello joal, do you have time for a last check plz?
[10:05:03] https://gerrit.wikimedia.org/r/c/analytics/refinery/+/929723/
[10:05:03] I've already run the 3 requests in the context of filling a Hive table to make sure it works (I was not sending to Cassandra, though).
[10:05:43] Hi aqu - I'm in a meeting now, will review after :)
[10:07:20] Data-Platform-SRE, Data Pipelines (Sprint 14), Patch-For-Review: Upgrade Presto to release that aligns with Iceberg 1.2.1 - https://phabricator.wikimedia.org/T337335 (BTullis) This all looks fine now. The new version has been rolled out to all nodes. ` presto:wmf> SELECT node_id,node_version FROM sys...
[10:07:38] Data-Platform-SRE, Data Pipelines (Sprint 14), Patch-For-Review: Upgrade Presto to release that aligns with Iceberg 1.2.1 - https://phabricator.wikimedia.org/T337335 (BTullis)
[10:08:31] The presto 0.281 upgrade is all finished. I've tested a query against webrequest, but I'd be grateful if you could be on the lookout for any weirdness please.
[10:09:31] ack btullis - will try one or two dashboards in superset
[10:17:40] Data-Platform-SRE, Data Pipelines: Enable the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (BTullis) @JAllemandou - @Milimetric I've rolled out this experimental feature to superset-next.wikimedia.org - but in order to do so I've temporarily disabled puppet on...
[10:19:56] Data-Platform-SRE, Data Pipelines: Enable the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (JAllemandou) Thanks for doing this @BTullis - While super useful when it works, the feature is not stable enough to roll out to production. You can disable it on both cl...
[10:20:15] btullis: I ran some dashboards, nothing broke on me - feels good :)
[10:20:58] joal: Great, thanks for doing that.
[10:22:49] btullis: I queried with presto in general, CLI and sqllab. It looks good. But yeah, +1 on joseph's take on PRESTO_EXPAND_DATA. Not sure if it's superset-next or that flag, but sqllab is buggy on superset-next
[10:24:54] milimetric: OK, cool. Thanks for letting me know. I'll revert the change and then maybe you could check to see if the bugginess goes away with the feature flag disabled. OK?
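For reference, the manual test elukey proposed at 09:06 (which stevemunene runs below at 11:33) might look roughly like the following. The exclude-file path matches the one quoted later in the log; the disable-puppet wrapper and the exact sudo invocations are assumptions:

```bash
# Keep puppet from reverting the manual edit while testing (T317861).
sudo disable-puppet "testing hosts.exclude fqdn vs shortname - T317861"

# Add both the FQDN and the short name for each host, per the docs.
for n in 1061 1062 1063; do
  echo "analytics${n}.eqiad.wmnet" | sudo tee -a /etc/hadoop/conf/hosts.exclude
  echo "analytics${n}"             | sudo tee -a /etc/hadoop/conf/hosts.exclude
done

# Ask the namenode to re-read its include/exclude files; no restart needed.
sudo -u hdfs hdfs dfsadmin -refreshNodes

# The excluded hosts should now report as decommissioning.
sudo -u hdfs hdfs dfsadmin -report -decommissioning
```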
[10:25:11] k
[10:37:38] (PS1) Btullis: Update the datahub-upgrade image to include the entity-registry [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/932871 (https://phabricator.wikimedia.org/T329514)
[10:50:22] Data-Platform-SRE, Data Pipelines, Patch-For-Review: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (BTullis)
[10:51:45] PROBLEM - puppet last run on an-tool1010 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:51:59] Data-Platform-SRE, Data Pipelines, Patch-For-Review: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (BTullis) Open→Resolved
[10:57:13] RECOVERY - puppet last run on an-tool1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:02:27] (CR) CI reject: [V: -1] Update the datahub-upgrade image to include the entity-registry [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/932871 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[11:04:09] (CR) Btullis: "recheck" [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/932871 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[11:26:52] (CR) Btullis: [C: +2] Update the datahub-upgrade image to include the entity-registry [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/932871 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[11:26:56] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (BTullis) This is a bit confusing. The [[https://github.com/datahub-project/datahub/blob/master/docker/datahub-upgrade/Dockerfile#L33|Dockerfile]] for the datahub-upgrade container shows the file b...
[11:29:34] Data-Platform-SRE, Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (BTullis)
[11:33:08] o/ elukey btullis - about to try the hosts.exclude fqdn vs short-name theory on an-master1001
[11:34:13] stevemunene: ack, thanks. Make sure you keep a record of your findings.
[11:34:59] sure :)
[11:35:36] !log disable puppet on an-master1001.eqiad.wmnet
[11:35:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:39:56] !log running hdfs dfsadmin -refreshNodes to pick up analytics106[1-3] from hosts.exclude
[11:39:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:44:06] (Merged) jenkins-bot: Update the datahub-upgrade image to include the entity-registry [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/932871 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[11:56:06] After adding the hosts' fqdn and shortname and running `hdfs dfsadmin -refreshNodes`, the nodes are visible as decommissioning on the HDFS NameNode Status Interface https://usercontent.irccloud-cdn.com/file/nsfOJ48x/decommissioning
[11:57:52] The HDFS under-replicated blocks graph also shows a spike after this https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=41
[12:00:26] Does it do the same if you don't add the shortname?
[12:11:33] (CR) Aqu: [C: +2] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/931959 (https://phabricator.wikimedia.org/T329310) (owner: Mforns)
[12:13:11] (PS9) Nick Ifeajika: fix the metric query. Change the final write query to conform to the structure expected by the Cassandra table. [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059)
[12:14:15] btullis: ideally it should, considering that is what we had initially and the decommission went through.
[12:29:44] I also had the thought that it could've been due to this change https://gerrit.wikimedia.org/r/c/operations/puppet/+/930582 as Ben had mentioned, since it was merged before the hosts were physically decommissioned. Which raises the question of where they were referenced in order to be added back to the cluster, since they are not part of the topology. I am going to create a patch with the fqdn and short name of the hosts currently decommissioning. Then observe whether the same happens.
[12:31:18] Is it advisable to stop the currently decommissioning nodes by re-enabling puppet? i.e. hosts.exclude will be overwritten once I re-enable
[12:40:05] stevemunene: https://gerrit.wikimedia.org/r/c/operations/puppet/+/930582 was merged this morning, right? IIUC the hosts.exclude file containing analytics106[1-3] was already present on disk when the namenodes were restarted the last time
[12:40:46] Yes and yes
[12:41:02] so their decom should have kicked off
[12:41:08] when the namenodes were restarted
[12:41:10] in theory :)
[12:41:22] anyway, I think that Ben's suggestion about adding refresh to puppet for hdfs is great
[12:41:30] so we will not need to restart the namenodes
[12:41:35] and puppet will take care of everything
[12:42:02] I didn't see a SAL entry for the namenodes being restarted this morning.
[12:42:26] Namenodes have not been restarted today
[12:42:51] If it were done by the cookbook, I would expect it to log to operations automatically.
[12:43:02] the decomm starts immediately once you merge and run puppet
[12:44:02] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (Stevemunene) During the decommissioning of analytics106[1-3], we noticed that even after excluding the hosts from yarn and hdfs. Then moving on to the next step...
[12:45:40] I don't believe that's true. They are excluded from YARN as soon as puppet is run, but they are not excluded from HDFS until the namenodes are restarted (or the new dfsadmin command is executed, which we didn't know about until today).
[12:50:30] Sorry, let me re-read everything above. Maybe I'm totally wrong
[12:55:32] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, MW-1.41-notes (1.41.0-wmf.15; 2023-06-27), and 2 others: Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (Ottomata) The ideal thing to do...
[12:57:47] (CR) Joal: [C: +1] "LGTM! Thanks Antoine :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/929723 (https://phabricator.wikimedia.org/T338033) (owner: Aqu)
[12:58:32] I may have misread or misinterpreted some information/charts, so here is some more background on my statement. Immediately after merging the CR to remove analytics106[4-6], I noticed the change on the HDFS Under Replicated Blocks graph here https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=41&from=1687759200000&to=now . The status on yarn and on the HDFS NameNode Status Interface also reflected the 3 being in decommissioning state. This is without a restart on any infra.
[12:59:23] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), MW-1.41-notes (1.41.0-wmf.12; 2023-06-06): Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (JArguello-WMF) Open→Resolved
[12:59:25] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (JArguello-WMF) Open→Resolved
[12:59:27] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: Fix eventutillites_python stream_manager error_sink configuration - https://phabricator.wikimedia.org/T335591 (JArguello-WMF) Open→Resolved
[12:59:30] Data-Engineering-Planning, Event-Platform Value Stream, Epic: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (JArguello-WMF)
[12:59:32] Data-Engineering-Planning, Epic, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (JArguello-WMF) Open→Resolved
[13:02:46] stevemunene: Ah, OK. That is interesting. What if we do another test then? I propose that we add `analytics1067.eqiad.wmnet` to `/etc/hadoop/conf/hosts.exclude` on an-master1001 and do nothing else apart from watch the logs.
[13:03:33] See if the namenode picks up changes to this file without requiring the `hdfs dfsadmin -refreshNodes` command.
[13:05:36] sure, let's do that. Are you available to pair or should I proceed?
[13:05:52] Let's do it together.
[13:06:01] To the batcave!
[13:06:48] Actually, Okta is giving me grief today. Let's huddle.
[13:06:57] Even better
[13:22:32] stevemunene: the hdfs decom doesn't start automatically, you need to restart the namenodes
[13:22:57] the only thing that starts automatically is the yarn decom, because of the exec that Ben pointed out earlier
[13:23:26] the namenodes have this:
[13:23:27] Active: active (running) since Thu 2023-06-22 14:10:33 UTC; 3 days ago
[13:23:42] so I assumed they were restarted on Friday
[13:24:07] https://sal.toolforge.org/log/BSJq44gBxE1_1c7sJ7eO
[13:24:12] yes exactly
[13:25:07] Actually elukey, we've just found out why a restart wasn't needed. Nicolas already added an automation for `hdfs dfsadmin -refreshNodes`: https://gerrit.wikimedia.org/r/c/operations/puppet/+/893999/12/modules/bigtop/manifests/hadoop/namenode.pp
[13:25:53] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (gmodena) >>! In T338233#8956488, @JMeybohm wrote: >>>! In T338233#8927836, @gmodena wrote: >>> Let's verify this with Search and SRE...
[13:26:04] I +1'd it, but I had clearly forgotten that it existed. I thought that it was only YARN that had been done.
[13:26:18] mmmmm
[13:26:36] did it work when the change was merged? We should find the entry in puppet's log then
[13:27:01] it is a little weird that we see `-fs hdfs://${::fqdn}:8020` though
[13:27:04] it shouldn't be needed
[13:27:55] True. It shouldn't be needed. Here's what happened when I re-enabled puppet after our test today, removing the manual additions to `hosts.exclude`
[13:27:59] https://www.irccloud.com/pastebin/uUkTmpTG/
[13:29:12] Here is this morning's puppet run on an-master1001: https://puppetboard.wikimedia.org/report/an-master1001.eqiad.wmnet/380f045081ae46083dde6bed2790420e24962685
[13:29:29] https://usercontent.irccloud-cdn.com/file/zxQDdcRw/image.png
[13:30:24] Anyway, at least the mystery is solved. It would have been nice if Nicolas could have updated the docs as well, but never mind :-)
[13:32:10] btullis: I am not 100% clear on what the issue was then; the decom for 61->63 should have started after https://gerrit.wikimedia.org/r/c/operations/puppet/+/930580, no?
[13:32:29] is it the short name vs fqdn or something else?
[13:33:11] elukey: it did and was done
[13:33:56] stevemunene: and why were 1061->63 re-added in the default rack?
[13:34:28] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=41&from=1687132800000&to=1687391999000
[13:34:33] elukey: The issue was that Steve re-added 61-63 back into the cluster this morning. It was due to a chain of patches in gerrit. Nothing to do with the shortname/fqdn. Yes, those hosts had been removed from the topology, so when they were re-added they went into the default rack.
[13:36:12] So this https://gerrit.wikimedia.org/r/c/operations/puppet/+/930582 not only removed 64-66 but also re-added 61-63
[13:36:14] btullis: ok I think now I get it, the datanodes were still up on 61->63 so removing them from the exclude file caused them to be re-added
[13:36:31] Also, we're overloading the word 'decom' here. There is the 'decommissioning' state in HDFS, the 'decommissioned' state, the 'decommission' cookbook, etc.
[13:36:31] this is something I didn't expect
[13:37:26] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (JMeybohm) >>! In T338233#8963918, @gmodena wrote: >>>! In T338233#8956488, @JMeybohm wrote: >>>>! In T338233#8927836, @gmodena wrote...
[13:38:03] That's right. Steve chose to leave the hosts running after they had entered the 'decommissioned' state in HDFS. At this point he went on to 'decom' the next batch of hosts, but because the cookbook hadn't been run against them, they just rejoined the cluster without any topology information present.
[13:38:45] * elukey nods
[13:38:50] okok this was the missing bit
[13:38:53] thanks for the explanation
[13:39:15] let's update the docs asap then :)
[13:40:24] All good, it took a bit of pairing to find out what it was. If I'd paid more attention three months ago, perhaps it wouldn't have happened :-}
[13:42:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:52:01] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:57:23] Data-Platform-SRE, Data Pipelines (Sprint 14): Upgrade Presto to release that aligns with Iceberg 1.2.1 - https://phabricator.wikimedia.org/T337335 (xcollazo) Confirmed iceberg production table `referrer_daily` is working as expected on prod Presto instance: ` presto:xcollazo_iceberg> select sum(num_ref...
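Back on the HDFS thread: the automation btullis found in namenode.pp re-runs `hdfs dfsadmin -refreshNodes` whenever puppet changes the exclude file, which is why no namenode restart was needed. The actual Puppet resource is in the linked change; the loop below is only a conceptual sketch of the same trigger-on-change behaviour:

```bash
# Conceptual equivalent of the Puppet exec subscribed to hosts.exclude:
# refresh the namenode's node lists only when the file actually changes.
EXCLUDE=/etc/hadoop/conf/hosts.exclude
last=$(md5sum "$EXCLUDE")
while sleep 30; do
  cur=$(md5sum "$EXCLUDE")
  if [ "$cur" != "$last" ]; then
    sudo -u hdfs hdfs dfsadmin -refreshNodes && last="$cur"
  fi
done
```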
[14:00:59] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (BTullis) The `datahub-upgrade` job is almost working now, but not quite. It looks like there is still an issue with which host it's using to try to configure the schema registry. Plus, we still...
[14:06:01] !log move varnishkafka instances in esams to pki - T337825
[14:06:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:06:04] T337825: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825
[14:37:25] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (gmodena) > I'd assume this is not specific to mw-page-content-change-enrich but rather generic to all flink apps deployed using the...
[14:37:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:39:58] varnishkafka completely on pki!
[14:44:01] RECOVERY - Hadoop NodeManager on analytics1066 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:45:14] Data-Engineering, Data-Platform-SRE, SRE, Traffic: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (elukey) All varnishkafkas on PKI! Remaining steps: * clean up the old certificate from puppet private and puppet CA.
[14:47:06] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mediawiki-event-enrichment: changes to test image seem to be ignored in CI - https://phabricator.wikimedia.org/T340195 (CodeReviewBot) gmodena updated https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_reques...
[14:51:41] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (Ottomata) MW enrichment runs active/active single compute, and there are no downstream applications to 'depool'. If mw enrichment...
[14:56:49] elukey: awesome. Thanks so much.
[14:58:39] Data-Platform-SRE, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (dancy)
[15:03:05] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (gmodena) @Ottomata ack. Just wanted to validate and have it documented. I added a comment to [[ https://wikitech.wikimedia.org/wiki/...
[15:17:53] Data-Engineering-Planning, Event-Platform Value Stream, Discovery-Search (Current work), Epic: Flink Operations - https://phabricator.wikimedia.org/T328561 (Gehel)
[15:23:45] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, MW-1.41-notes (1.41.0-wmf.15; 2023-06-27), and 2 others: Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (Ottomata) Thank you! > ( Side...
[15:32:53] (CR) Joal: "Two little things :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059) (owner: Nick Ifeajika)
[15:51:35] Data-Platform-SRE, Discovery-Search (Current work): Create Turnilo/Superset dashboards for WDQS - https://phabricator.wikimedia.org/T338159 (Gehel)
[15:51:49] Data-Platform-SRE, Discovery-Search (Current work): Create Turnilo/Superset dashboards for WDQS - https://phabricator.wikimedia.org/T338159 (bking) Per today's triage/sprint planning meeting, we need to review this one... I believe we have a saved query that does most of what we do. Let's say we're about...
[15:57:36] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, MW-1.41-notes (1.41.0-wmf.15; 2023-06-27), and 2 others: Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (JAllemandou) > The ideal thing t...
[15:59:41] Analytics, Article-Recommendation, SRE: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (MatthewVernon)
[16:00:12] Analytics-Radar, Data-Engineering-Icebox, Recommendation-API, SRE, SRE-swift-storage: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (MatthewVernon) Open→Resolved a: MatthewVernon Done (though we might want to think about refactori...
[16:46:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:51:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:55:43] Data-Engineering, Product-Analytics, Wmfdata-Python: Enable wmfdata-py to access MariaDB replicas on the cluster - https://phabricator.wikimedia.org/T340467 (mpopov)
[16:59:07] Data-Engineering, Product-Analytics, Wmfdata-Python: Let user specify cnf to use when connecting to MariaDB - https://phabricator.wikimedia.org/T340469 (mpopov)
[17:09:33] Data-Engineering, Product-Analytics, Wmfdata-Python: Retrieve host & port info when connecting to MariaDB replicas on the cluster - https://phabricator.wikimedia.org/T340472 (mpopov)
[17:10:07] Data-Engineering, Product-Analytics, Wmfdata-Python: Enable wmfdata-py to access MariaDB replicas on the cluster - https://phabricator.wikimedia.org/T340467 (mpopov)
[17:18:40] Data-Engineering-Planning, Event-Platform Value Stream, MediaWiki-Vagrant: EventBus should not blackhole undeclared streams - https://phabricator.wikimedia.org/T329480 (Ottomata) > if EventStreamConfig is not enabled ($wgEventStreams is undefined) I think you must have the EventStreamConfig extensio...
[17:35:09] Data-Engineering-Planning, Event-Platform Value Stream, MediaWiki-Vagrant: EventBus should not blackhole undeclared streams - https://phabricator.wikimedia.org/T329480 (Ottomata) > Perhaps...should EventStreamConfig extension.json set the default value of config.EventStreams to null? Hm, no I don't t...
[17:39:14] Data-Engineering-Planning, Event-Platform Value Stream, MediaWiki-Vagrant: EventBus should not blackhole undeclared streams - https://phabricator.wikimedia.org/T329480 (Tgr) Thanks, that's indeed the case. I probably just didn't realize that EventStreamConfig is a separate extension (it was installed...
[17:48:30] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: All eventgate clusters should be able to use remote schema repos - https://phabricator.wikimedia.org/T340166 (Ottomata)
[17:48:41] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: All eventgate clusters should be able to use remote schema repos - https://phabricator.wikimedia.org/T340166 (Ottomata) a: Ottomata
[17:49:04] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: All eventgate clusters should be able to use remote schema repos - https://phabricator.wikimedia.org/T340166 (Ottomata)
[18:08:12] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: All eventgate clusters should be able to use remote schema repos - https://phabricator.wikimedia.org/T340166 (Ottomata)
[18:49:53] Data-Engineering, Data-Platform-SRE: Superset permissions for nshahquinn-wmf - https://phabricator.wikimedia.org/T339385 (nshahquinn-wmf) Open→Resolved a: Stevemunene→nshahquinn-wmf Since there is no API for permissions, I just did it manually. It only took about 20 minutes—not bad at all.
[18:50:06] Data-Engineering, Data-Platform-SRE: Superset permissions for nshahquinn-wmf - https://phabricator.wikimedia.org/T339385 (nshahquinn-wmf)
[18:51:00] Data-Platform-SRE, Discovery-Search (Current work): Create Turnilo/Superset dashboards for identifying users w/ excessive WDQS queries - https://phabricator.wikimedia.org/T338159 (RKemper)
[18:51:16] Data-Platform-SRE, Discovery-Search (Current work): Create Turnilo/Superset dashboards for identifying users w/ excessive WDQS queries - https://phabricator.wikimedia.org/T338159 (RKemper)
[18:52:54] Data-Platform-SRE: Reboot buster query service hosts (wdqs/wcqs) to apply java8 sec upgrades - https://phabricator.wikimedia.org/T340482 (RKemper)
[19:06:56] Data-Engineering, Event-Platform Value Stream: EventBus should set dt fields with greater precision than second - https://phabricator.wikimedia.org/T340067 (Ottomata) It looks like doing this for any timestamps that are provided by MediaWiki to EventBus will be more difficult, as they are given as MW for...
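Some context for the precision problem Ottomata describes in T340067: MediaWiki's internal timestamp format (TS_MW) is a 14-digit string with second resolution, whereas Event Platform `dt` fields are ISO-8601 and can carry milliseconds. A quick illustration with GNU date (`%3N` is a GNU coreutils extension; the example values are illustrative):

```bash
# MediaWiki TS_MW style: second resolution only.
date -u +%Y%m%d%H%M%S            # e.g. 20230626190656

# Event Platform dt style: ISO-8601 with millisecond precision.
date -u +%Y-%m-%dT%H:%M:%S.%3NZ  # e.g. 2023-06-26T19:06:56.123Z
```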
[19:25:04] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: EventBus should set dt fields with greater precision than second - https://phabricator.wikimedia.org/T340067 (Ottomata)
[19:25:08] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: EventBus should set dt fields with greater precision than second - https://phabricator.wikimedia.org/T340067 (Ottomata) a: Ottomata
[19:25:49] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: EventBus should set dt fields with greater precision than second - https://phabricator.wikimedia.org/T340067 (Ottomata)
[19:26:55] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: EventBus should set dt fields with greater precision than second - https://phabricator.wikimedia.org/T340067 (Ottomata) See also: {T329594}
[20:12:13] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Use ECS logging fields when adding extra info to mediawiki-event-enrichment - https://phabricator.wikimedia.org/T337399 (Ottomata) a: tchin→gmodena @gmodena did this in https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-...
[20:16:32] Data-Engineering-Planning, Event-Platform Value Stream: Flink Restart Strategy for Enrichment Service - https://phabricator.wikimedia.org/T325359 (Ottomata) @gmodena I think we can close this?
[20:18:32] Data-Engineering, Event-Platform Value Stream: Drop GuidedTour* tables - https://phabricator.wikimedia.org/T317460 (Ottomata)
[20:19:37] Data-Engineering-Planning, Event-Platform Value Stream: [Shared Event Platform][NEEDS GROOMING] We should standardize Flink app config for yarn (development) deployments - https://phabricator.wikimedia.org/T311070 (Ottomata) Open→Declined Not doing for Yarn. Handled by helm charts in prod.
[20:22:36] Analytics-Radar, Data-Engineering, Event-Platform Value Stream, WMF-JobQueue, and 3 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (Ottomata)
[20:22:44] Analytics-Radar, Data-Engineering-Planning, Event-Platform Value Stream, Platform Team Workboards (MW Expedition): Decouple EventBus and EventFactory - https://phabricator.wikimedia.org/T292121 (Ottomata) Open→Declined Declining this task, as EventFactory has been deprecated.
[20:26:58] (PS10) Nick Ifeajika: fix the metric query. Remove duplicates. [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059)
[20:39:57] Data-Engineering, Data-Platform-SRE, Discovery-Search, Event-Platform Value Stream: Set up multi DC Kafka stretch cluster - https://phabricator.wikimedia.org/T340492 (Ottomata)
[20:42:18] Data-Engineering, Event-Platform Value Stream: EventStreamCatalog should not remove user specified options in CREATE TABLE statements - https://phabricator.wikimedia.org/T331542 (Ottomata)
[20:42:52] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: Event Catalog: Standardize Options Handling - https://phabricator.wikimedia.org/T333795 (Ottomata)
[20:43:06] Data-Engineering-Planning, Event-Platform Value Stream, Epic: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (Ottomata)
[20:51:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:00:36] Data-Platform-SRE: Restart buster query service hosts (wdqs/wcqs) to apply java8 sec upgrades - https://phabricator.wikimedia.org/T340482 (RKemper)
[21:48:12] (PS11) Nick Ifeajika: fix the metric query. Strip TLDs from domain projects. [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059)
[22:07:58] Data-Engineering, Data-Persistence, Research: Create keyspace and table for Knowledge Gaps - https://phabricator.wikimedia.org/T340494 (Milimetric)
[22:16:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:21:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:55:23] Data-Platform-SRE, SRE, ops-eqiad: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (Jclark-ctr) @BTullis I would like to take care of this tomorrow; when would be a good time for you to do this?
[23:21:53] Data-Platform-SRE, SRE, ops-eqiad: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=36858c2c-bae0-4a63-9ac9-19916c27613e) set by btullis@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their se...
[23:22:28] !log shutting down an-worker1092 in preparation for RAID controller battery replacement
[23:22:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[23:23:52] Data-Platform-SRE, SRE, ops-eqiad: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (BTullis) Hi @Jclark-ctr - Many thanks. I've shut down the machine ready for you, so you can replace it whenever is convenient. Feel free to boot the host again when finishe...
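The shutdown logged above would typically stop the Hadoop daemons on the worker before powering it off. A sketch of that sequence, assuming the Bigtop unit names: hadoop-yarn-nodemanager.service appears in the alerts earlier in this log, while the datanode unit name is an assumption:

```bash
# Stop the YARN and HDFS daemons cleanly, then power the host off
# for the RAID controller battery swap.
sudo systemctl stop hadoop-yarn-nodemanager.service
sudo systemctl stop hadoop-hdfs-datanode.service
sudo shutdown -h now
```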
[23:53:17] Data-Engineering, Data-Persistence, Research: Create keyspace and table for Knowledge Gaps - https://phabricator.wikimedia.org/T340494 (Eevans) p: Triage→Medium a: Eevans
[23:58:48] Data-Engineering, Data-Persistence, Research: Create keyspace and table for Knowledge Gaps - https://phabricator.wikimedia.org/T340494 (Eevans) Ok, this has been created using: `lang=sql CREATE KEYSPACE "local_group_default_T_knowledge_gap_by_category" WITH replication = {'class': 'NetworkTopologySt...
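The CREATE KEYSPACE statement above is truncated in the log; the general shape of the NetworkTopologyStrategy definition it uses is shown below. The datacenter names and replication factors here are illustrative assumptions, not the values actually applied:

```bash
# Illustrative only: datacenter names and replication factors are guesses.
cqlsh -e "
CREATE KEYSPACE \"local_group_default_T_knowledge_gap_by_category\"
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'eqiad': 3,
    'codfw': 3
  };"
```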