[00:34:59] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: jupyterhub-conda.service crashloop on an-test-client1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:34:59] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: jupyterhub-conda.service crashloop on an-test-client1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:12:54] <wikibugs>	 (PS8) Peter Fischer: cirrussearch: move to development [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315)
[08:27:29] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol) I added documentation about how to roll restart/reboot OS daemons/hosts in https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/DataHub...
[08:28:25] <brouberol>	 btullis: if you have 2 minutes, https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/956916 is ready for a final review, now that we can reach the opensearch/logstash API from the cumin hosts. Thanks!
[08:34:59] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: jupyterhub-conda.service crashloop on an-test-client1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:36:20] <elukey>	 hello folks!
[08:36:25] <elukey>	 eventstreams runs on stretch 
[08:36:28] * elukey cries in a corner
[08:36:35] <joal>	 good morning elukey :)
[08:36:39] <elukey>	 I am trying to upgrade it but nodejs returns me horrors
[08:36:40] <btullis>	 Oh, crumbs.
[08:36:41] <joal>	 mwarf
[08:38:14] <btullis>	 elukey: what sort of horrors? Failing unit tests kind of thing?
[08:38:49] <elukey>	 btullis: npm horrors, I solved some removing the usage of locally installed librdkafka (as opposed to build its own)
[08:39:03] <elukey>	 trying to solve them, I'll send a code change
[08:39:08] <elukey>	 my aim was just to add some logs :D
[08:39:41] <btullis>	 Found a can of worms, eh? :-)
[08:40:03] * elukey nods with sadness
[08:40:15] <elukey>	 also we run node10 everywhere
[08:44:05] <elukey>	 npm ERR! Unsupported URL Type "npm:": npm:vue-loader@^16.1.0
[08:46:05] <moritzm>	 FYI, rebooting AQS nodes in eqiad
[08:48:00] <btullis>	 ack: Thanks moritzm 
[08:49:52] <elukey>	 "npm WARN npm npm does not support Node.js v10.24.0" is also nice
[08:55:47] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineer...
[09:35:34] <stevemunene>	 Hi btullis given the recent alerts from DFS would it be safe to proceed with the upgrades at this time ? only 11 hosts remain an-worker11[37-48] cc brouberol 
[09:40:03] <btullis>	 Hi stevemunene and welcome back. Yes, I think it's safe to carry on with the upgrades.
[09:41:10] <btullis>	 stevemunene: I +1d your datahub oidc change. Do you think we should try that deployment today, or would you feel more comfortable scheduling it for next week?
[09:43:18] <wikibugs>	 Data-Engineering, Discovery-Search, serviceops, Event-Platform: Improve the flink-app chart to provide more useful defaults - https://phabricator.wikimedia.org/T346315 (dcausse)
[09:43:57] <stevemunene>	 Ack, thanks btullis ! We are yet to notify the team on the upcoming changes to datahub so I think it would be best to send a notification today and attempt the upgrade on Monday. 
[09:44:36] <elukey>	 folks I am going to remove the mediawiki.revision-score stream from ES
[09:44:45] <elukey>	 (not touching docker imgs etc..)
[09:45:44] <btullis>	 elukey: Ack. That's this one, right? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/956775
[09:45:58] <btullis>	 stevemunene: Good stuff. 
[09:46:29] <elukey>	 correct
[09:48:05] <wikibugs>	 Data-Engineering, Discovery-Search, serviceops, Event-Platform: Improve the flink-app chart to provide more useful defaults - https://phabricator.wikimedia.org/T346315 (dcausse)
[09:50:40] <wikibugs>	 (PS1) Joal: [TEST] Add logging to test refine race condition [analytics/refinery/source] - https://gerrit.wikimedia.org/r/957683
[09:58:06] <wikibugs>	 Data-Platform-SRE, Research, WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (awight) Missing metric: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/89
[10:21:06] <moritzm>	 AQS reboots in eqiad are completed
[10:27:28] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1137.eqiad.wmnet with OS bullseye
[10:54:45] <brouberol>	 When would be a good time to test out the rolling restart/reboot cookbook on datahub search?
[10:58:12] <btullis>	 brouberol: I'd say any time today, really. You might just want to keep an eye on https://datahub.wikimedia.org to be sure that there's no breakage to the service that depends on it and https://alerts.wikimedia.org to be on the lookout for any stray alerts. Should be very low risk though.
[11:01:35] <brouberol>	 👍 thanks, will do. I'll try a rolling restart then
[11:08:20] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1137.eqiad.wmnet with OS bullseye completed: - an-worker1137 (**PASS**)   - Downtimed on Icinga/Alertm...
[11:10:12] <wikibugs>	 Data-Platform-SRE: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 (BTullis) I've done a bit of investigation, but don't have many useful clues yet. I manually checked all five of the journalnode logs on hosts (that's `an-worker[1078,1080,1090...
[11:13:20] <wikibugs>	 Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol) In progress→Resolved Rolling restart worked on `datahubsearch`:  ` brouberol@cumin2002:~$ sudo cookbook sre.opensearch.roll-restart-reboot --reason 'Rolling...
[11:14:01] <brouberol>	 btullis: we still have to reboot the datahubsearch hosts for a kernel update?
[11:14:53] <wikibugs>	 Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol) Resolved→Open Oops, I change the status by mistake. I still have to test the reboot action.
[11:15:44] <brouberol>	 (if so, would you happen to have a link to the phab ticket?)
[11:16:50] <btullis>	 brouberol: No, I already rebooted them for https://phabricator.wikimedia.org/T344587 (I guess you still can't see it? - I'll see if I can add you to the ACL in phab, or if it needs its own ticket for the records)
[11:17:19] <brouberol>	 I can't see it indeed. I'm missing the security ACL
[11:18:31] <brouberol>	 should you need to reboot them, I have successfully tested the sre.opensearch.roll-restart-reboot cookbook for restart, but not for reboot. I was planning to roll reboot them as well, to make sure the cookbook worked 100%. Would that work for you?
[11:19:43] <btullis>	 That works for me, feel free to reboot them with the cookbook at your leisure. :-)
[11:20:24] <brouberol>	 on it!
[11:22:46] <btullis>	 brouberol: There's a special procedure for requesting access to the security ACL in phabricator. I think it's probably best to follow that for record-keeping: https://www.mediawiki.org/wiki/Security/SOP/Access_to_Phabricator_Security_Issues
[11:23:36] <brouberol>	 oh right. I'll do this once the reboots are done. Thanks!
[11:38:54] <wikibugs>	 Data-Platform-SRE: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 (BTullis) I think that the only way forward I can think of at the moment is to try scheduling some downtime for krb1001 and either:  - powering it off for a while or - stopping...
[11:39:04] <wikibugs>	 Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol) Open→Resolved Rolling reboot worked on `datahubsearch`: ` brouberol@cumin2002:~$ sudo cookbook sre.opensearch.roll-restart-reboot --reason 'Rolling reboot by...
[11:40:35] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) I've now added conda-analytics version 0.0.20 to our apt repository. ` btullis@apt1001:~$ w...
[11:42:28] <btullis>	 !log deploying conda-analytics version 0.0.20 to the test cluster for T337258
[11:42:31] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:42:31] <stashbot>	 T337258: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258
[12:08:50] <wikibugs>	 Data-Platform-SRE: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 (MoritzMuehlenhoff) >>! In T346135#9166267, @BTullis wrote: > I think that the only way forward I can think of at the moment is to try scheduling some downtime for krb1001 and...
[12:45:14] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) It's still not right, unfortunately. When I try to clone the environment I get the followin...
[13:10:59] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye
[13:12:37] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1139.eqiad.wmnet with OS bullseye
[13:21:51] <wikibugs>	 Data-Platform-SRE: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (BTullis) I decided to pause work on the GitLab-CI based build of spark itself. I got the build to work with blubber/buildkit, but I realised that I was mis...
[13:25:05] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineer...
[13:45:53] * brouberol is taking a break
[13:57:39] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye executed with errors: - an-worker1138 (**FAIL**)   - Downtimed on Ic...
[14:13:33] <stevemunene>	 !log powercycle an-worker1138, investigating failures related to reimage T332570
[14:13:35] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:13:36] <stashbot>	 T332570: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570
[14:29:30] <wikibugs>	 Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914 (bking)
[14:34:25] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (Stevemunene) an-worker1138 is currently facing an error {F37721279} Did a powercycle in order to access the terminal, however the host does not accept the root pw. First thought was to check the partitions from...
[14:36:34] <wikibugs>	 Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: Rename usages of whitelist to allowlist in query service rdf repo - https://phabricator.wikimedia.org/T344284 (bking) We successfully deployed this yesterday; moving to "Done" on the workbo...
[14:45:50] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, SRE, vm-requests: 1 codfw VM requested for search-loader - https://phabricator.wikimedia.org/T346272 (bking)
[14:46:05] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, SRE, vm-requests: 1 codfw VM requested for search-loader - https://phabricator.wikimedia.org/T346272 (bking) Thanks Moritz...closing on our board.
[14:55:30] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host search-loader2002.codfw.wmnet with OS bullseye
[15:02:03] <brouberol>	 Does anyone know what procedure we follow to reassign partitions when we provision new kafka brokers as replacement for other, older brokers? Ultimately, each partition currently assigned to the "old" brokers will need to be reassigned to the new one, but do we have tooling around this?
[15:03:22] <wikibugs>	 Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914 (dcausse) a:dcausse
[15:04:23] <wikibugs>	 Quarry: Missing columns for table `externallinks` in Quarry - https://phabricator.wikimedia.org/T346347 (Bdijkstra)
[15:06:08] <claime>	 brouberol: elukey will know :)
[15:06:30] * claime shamelessly throws poor e.lukey under the kafka bus
[15:07:04] <brouberol>	 thanks! 
[15:10:08] <dr0ptp4kt>	 irccloud / libera / sasl stuff went boom, so i'm sorry if this is in the backscroll for others but not me. are others able to ssh stat1006.eqiad.wmnet via bast1003.wikimedia.org right now? i'm getting the following: channel 0: open failed: connect failed: Name or service not known
[15:10:08] <dr0ptp4kt>	 stdio forwarding failed
[15:10:08] <dr0ptp4kt>	 kex_exchange_identification: Connection closed by remote host
[15:10:08] <dr0ptp4kt>	 Connection closed by UNKNOWN port 65535
[15:12:18] <dr0ptp4kt>	 i haven't changed anything in ~/.ssh/config (i have the customary ProxyJump bast rules in there). and i could try rebooting my machine in case there's other strangeness. but i can definitely ssh directly to bast1003.wikimedia.org
[15:12:58] <brouberol>	 stevemunene: are there any steps additional to a) assigning the `role(kafka::jumbo::broker)` role to the new brokers and b) re-assigning the partitions onto the new brokers? Something involving, say, kerberos keytabs, or anything else? 
[15:15:07] <wikibugs>	 Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host search-loader1002.eqiad.wmnet with OS bullseye
[15:16:21] <wikibugs>	 Data-Platform-SRE: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (brouberol) a:brouberol
[15:16:38] <dr0ptp4kt>	 btullis, not sure if i have this fully correct, can you check? clouddb1021.eqiad.wmnet necessarily exposes itself to the production analytics infrastructure via https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/network/data/data.yaml$113 plus https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/base/templates/firewall/defs.erb$36 ; 
[15:16:38] <dr0ptp4kt>	 https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/network/manifests/constants.pp$109 upon which defs.erb depends requires slice_network_constants() at https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/network/lib/puppet/parser/functions/slice_network_constants.rb$81 , and so a regex match on _analytics_ bearing lines is what yields the results from 
[15:16:38] <dr0ptp4kt>	 network/data/data.yaml that get concatenated into the $ANALYTICS_NETWORKS value in defs.erb which is ultimately used by https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/manifests/wmcs/db/wikireplicas/dedicated/analytics.pp . but, when i've tried a basic telnet clouddb1021.eqiad.wmnet from a stat#### box that seems to be in the permitted source nets, it just stalls out (i can still ping 
[15:16:38] <dr0ptp4kt>	 the hosts). i more generally need to get access to an-launcher1002 and so will request analytics-admin , but meantime, i was wondering if you had a quick (probably obvious, i'm sorry) read on why no tcp connectivity? unless there's an egress ferm blocking me, not sure what's up...maybe i need to check the stat boxen ferms, but figured i'd ask here first
[15:17:23] <dr0ptp4kt>	 i mean to say telnet clouddb1021.eqiad.wmnet 3311 btw
[15:17:57] <dr0ptp4kt>	 and as up above i can't currently jump from bast1003.wikimedia.org so i'd do a bit more troubleshooting, but i figured meantime i could also ask as it's something i've been scratching my head about
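The `telnet clouddb1021.eqiad.wmnet 3311` test described above can be approximated with a short script when telnet is not installed. This is only a minimal sketch: the helper name `can_connect` and the 3-second timeout are assumptions made for illustration, not part of any tooling mentioned in this channel.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it only tests that the TCP
    handshake completes, not that the service behind the port is healthy.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers refused connections, timeouts (the "stalls out"
        # case), and name-resolution failures.
        return False
```

Calling `can_connect("clouddb1021.eqiad.wmnet", 3311)` from a stat host would return `False` on a stalled or refused connection rather than hanging, but it cannot by itself tell a firewall drop apart from a slow network.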
[15:18:29] <btullis>	 dr0ptp4kt: In a meeting right now. Will help asap afterwards.
[15:18:36] * dr0ptp4kt thanks btullis !
[15:18:42] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:19:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:21:22] <stevemunene>	 brouberol: I might not have the right info for this, I think elukey might best answer this
[15:21:45] <brouberol>	 gotcha, thanks
[15:22:36] <joal>	 elukey is not under the kafka-bus, it's the bus-driver ;)
[15:23:18] * brouberol honks 
[15:23:49] * joal just realized how rude he is in English - HE is the bus-driver (and brouberol will soon replace him I guess :D)
[15:26:03] <wikibugs>	 Data-Engineering, Event-Platform, Wikimedia-production-error: Error: Call to a member function exists() on null (via EventBus PageChangeEventSerializer) - https://phabricator.wikimedia.org/T346355 (Krinkle)
[15:28:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:29:46] <btullis>	 dr0ptp4kt: I can get to port 3311 on clouddb1021 from stat1006.
[15:29:53] <btullis>	 https://www.irccloud.com/pastebin/S71BA1Mi/
[15:30:46] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:41] <dr0ptp4kt>	 btullis now I'm really baffled. maybe it has been a pobcak...rebooting laptop, will try again assuming I can get back into the server
[15:33:36] <dr0ptp4kt>	 btullis: does my read of the ferm wireup look about okay, in any case?
[15:33:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:35:31] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform, Patch-For-Review: eventutilities-python: Gitlab CI pipeline should use memory optimized runners. - https://phabricator.wikimedia.org/T346084 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/dat...
[15:36:28] <wikibugs>	 Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host search-loader2002.codfw.wmnet with OS bullseye completed: - search-loader2002 (**PASS**)   - Removed fr...
[15:37:29] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform, Patch-For-Review: eventutilities-python: Gitlab CI pipeline should use memory optimized runners. - https://phabricator.wikimedia.org/T346084 (gmodena) a:gmodena
[15:37:59] <btullis>	 dr0ptp4kt: Yes, I think you're right about the way that definition of `$ANALYTICS_NETWORKS` gets populated. 
[15:38:31] <dr0ptp4kt>	 btullis: now it's working! both ssh'ing in (that was a pobcak) and connecting to port 3311 (that was not a pobcak, it's just magically working now). thanks for the confirmation on the $ANALYTICS_NETWORKS, it helps to have a puppet pro friend to confirm things
[15:38:43] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: [BUG] eventutilites-python: fix type checking CI job - https://phabricator.wikimedia.org/T346085 (gmodena) a:gmodena
[15:39:30] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform, Patch-For-Review: eventutilities-python: Gitlab CI pipeline should use memory optimized runners. - https://phabricator.wikimedia.org/T346084 (gmodena)
[15:39:31] <btullis>	 brouberol: There is also a `kafka_clusters` hieradata structure here that will need to be updated: https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common.yaml#L756-L789
[15:40:02] <btullis>	 dr0ptp4kt: +1 to magic solutions :-)
[15:40:07] <brouberol>	 yep, I had that on my radar as well, inferred from what we did for hadoop
[15:40:10] <brouberol>	 👍
[15:48:06] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host search-loader1002.eqiad.wmnet with OS bullseye completed: - search-loader1002 (**...
[15:48:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:48:46] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:48] <btullis>	 brouberol: There are also some other places where lists of brokers have been configured by hand. e.g. in our deployment-charts: https://gerrit.wikimedia.org/g/operations/deployment-charts/+/dd8b66925c6b1224b135379d9c097486129796f5/helmfile.d/services/eventgate-analytics/values.yaml
[15:50:20] <brouberol>	 ah, good to know, thanks!
[15:51:14] <brouberol>	 so I'm guessing eventgate-analytics will need to be redeployed w/ the new config as well, before we reassign the partitions
[15:51:23] <btullis>	 There has been some work to get rid of this duplication (T253058) but it's not 100% complete.
[15:51:35] <btullis>	 T253058
[15:51:35] <stashbot>	 T253058: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058
[15:52:08] <elukey>	 brouberol: o/ We have some docs about rebalancing partitions, see https://wikitech.wikimedia.org/wiki/Kafka/Administration#Rebalance_topic_partitions_to_new_brokers, so far I have been using topicmappr (https://github.com/DataDog/kafka-kit/wiki/Topicmappr-Usage-Guide) but I never got to debianize it etc.. there are various strategies to move partitions around, you can do it later on when the 
[15:52:14] <elukey>	 new brokers are in place
[15:52:20] <btullis>	 Codesearch is a handy tool by the way: https://codesearch.wmcloud.org/search/?q=kafka-jumbo&files=&excludeFiles=&repos=
[15:52:36] <elukey>	 if you want to experience some debianization we could start with packaging topic mappr :)
[15:53:05] <elukey>	 so far all my work has been ad hoc and not generalized, didn't find the time to do it 
[15:53:05] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1139.eqiad.wmnet with OS bullseye completed: - an-worker1139 (**PASS**)   - Downtimed on Icinga/Alertm...
[15:53:11] <btullis>	 I think brouberol might have been one of the authors of topicmappr, if I'm remembering correctly (or the main author?)
[15:53:24] <elukey>	 wow really?
[15:53:41] <btullis>	 Don't quote me on that :-)
[15:53:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:53:47] <elukey>	 then I have a lot of questions :)
[15:54:04] <brouberol>	 well, yes and no
[15:54:53] <brouberol>	 jamie alquiza did _most_ of the writing, but I spent 4 years working with jamie, so this is our design, yep
[15:55:01] <elukey>	 niceee
[15:55:35] <brouberol>	 haha, feel free to book time on my calendar, whether tomorrow or next week, I'm more than happy to see if/how we could use it here!
[15:55:36] <elukey>	 well we need to establish a formal process, I was thinking to debianize topic mappr and https://github.com/tarvip/kafkakit-prometheus-metricsfetcher
[15:56:06] <brouberol>	 I had the exact same thought as well. The distribution wasn't super formalized in Github. We used to produce our own internal deb file as well
[15:56:11] <elukey>	 so far I tried various attempts that worked, but I am always confused about when to use rebuild/rebalance/etc.
[15:56:37] <elukey>	 yeah then we distribute it on all kafka nodes etc..
[15:57:29] <elukey>	 so far I've used rebuild once to move partitions to new brokers, and rebalance to move some high traffic topic on some kafka brokers with few partitions
[15:57:39] <elukey>	 both times on kafka main, but didn't reach a perfect result
[15:57:43] <elukey>	 jumbo is different
[15:58:41] <elukey>	 for example, I am not 100% sure if having the same number of partitions on all brokers is something to strive for or not, compared to having a good balance of space used on disk across brokers etc..
[15:58:44] <btullis>	 There is some brand-new tooling for building debs from GitLab-CI. I'm looking forward to taking it for a spin: https://gitlab.wikimedia.org/repos/sre/wmf-debci
[15:59:01] <brouberol>	 ah, yes. There are historical reasons for rebuild being in there, but most of the time, we want to use rebalance, as it minimizes partition moves. Rebuild does a _full_ partition reassignment cluster wide, meaning that you need each broker to be at <50% disk usage
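To illustrate why a rebalance moves fewer partitions than a full rebuild, here is a toy sketch that balances replica counts by moving one replica at a time from the busiest broker to the idlest one. It is not topicmappr's algorithm (which, per the discussion above, also has to account for disk usage); the function name `rebalance`, the count-based balancing, and the input shape are all assumptions made for illustration.

```python
from collections import Counter


def rebalance(assignment, brokers):
    """Greedy, count-based rebalance (illustrative only, not topicmappr).

    assignment: dict mapping partition id -> list of broker ids (replicas).
    brokers: list of all broker ids, including new, empty ones.
    Assumes every replica in `assignment` lives on a broker in `brokers`.
    Returns (new_assignment, number_of_replica_moves).
    """
    new = {p: list(replicas) for p, replicas in assignment.items()}
    load = Counter({b: 0 for b in brokers})
    for replicas in new.values():
        for b in replicas:
            load[b] += 1

    moves = 0
    while True:
        busiest = max(brokers, key=lambda b: load[b])
        idlest = min(brokers, key=lambda b: load[b])
        if load[busiest] - load[idlest] <= 1:
            break  # replica counts are as even as they can get
        for replicas in new.values():
            # Never place two replicas of one partition on the same broker.
            if busiest in replicas and idlest not in replicas:
                replicas[replicas.index(busiest)] = idlest
                load[busiest] -= 1
                load[idlest] += 1
                moves += 1
                break
        else:
            break  # no legal move left
    return new, moves
```

With six single-replica partitions split evenly across brokers 1 and 2 and an empty broker 3 added, this sketch moves only two replicas to even things out, whereas recomputing the whole map from scratch could relocate most of them.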
[15:59:31] <brouberol>	 do you want to chat/meet about it real quick?
[15:59:34] <elukey>	 more invasive yes
[15:59:54] <elukey>	 I am a little short on time today, tomorrow or next week we can definitely meet if you want
[15:59:59] <brouberol>	 for sure
[16:00:22] <elukey>	 really happy about your experience, I am pushing for more work on kafka (we are still running 1.1 sigh)
[16:01:11] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:21] <btullis>	 If anyone is available for a quick review of this small update to conda-analytics, I'd be grateful: https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/33
[16:03:19] <brouberol>	 elukey: happy to be of help!
[16:03:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:09:07] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (BTullis) @Stevemunene I have uploaded a new patchset to https://gerrit.wikimedia.org/r/c/operations/puppet/+/949001 to fix the CI issue. It was a [[https://gerrit.wikimedia.org/r/c/...
[17:15:23] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (bking)
[17:16:06] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (bking)
[17:16:10] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Consider migrating search-loader into Kubernetes - https://phabricator.wikimedia.org/T346189 (bking)
[17:16:14] <wikibugs>	 Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (bking)
[18:27:07] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work): Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (bking)
[18:28:53] <xcollazo>	 !log Deployed latest DAGs to analytics Airflow instance T340861
[18:28:56] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:28:56] <stashbot>	 T340861: Implement a backfill job for the dumps hourly table - https://phabricator.wikimedia.org/T340861
[19:00:11] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform, Patch-For-Review: [BUG] eventutilites-python: fix type checking CI job - https://phabricator.wikimedia.org/T346085 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/eventutiliti...
[19:02:54] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (gmodena)
[20:18:21] <wikibugs>	 Data-Platform-SRE: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 (BTullis) OK, I will schedule some time to do that next week.
[20:34:23] <wikibugs>	 Data-Platform-SRE, SRE, SRE-Access-Requests, Patch-For-Review: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. - https://phabricator.wikimedia.org/T345726 (BTullis) >>! In T345726#9158579, @RLazarus wrote: > Hi @joanna_borun -- does this need Infrastructure F...
[20:38:52] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineer...
[21:11:25] <wikibugs>	 Data-Engineering: Unable to spawn new environments - https://phabricator.wikimedia.org/T346397 (Iflorez)
[21:28:31] <wikibugs>	 Data-Engineering: Unable to spawn new environments - https://phabricator.wikimedia.org/T346397 (BTullis) Hi @iflorez,  Could you try the following please?  * ssh into stat1006 and try the command: `conda-analytics-list`  I have tried this command as your user and it seems to work for me: ` iflorez@stat1006:~...
[21:31:30] <btullis>	 !log deploying conda-analytics version 0.0.21 to hadoop-test for T337258
[21:31:32] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:31:33] <stashbot>	 T337258: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258
[21:39:13] <jinxer-wm>	 (DiskSpace) firing: Disk space an-test-ui1001:9100:/ 0.9904% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:40:00] <btullis>	 !log executed apt-get clean on hadoop-test
[21:40:02] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:44:13] <jinxer-wm>	 (DiskSpace) resolved: Disk space an-test-ui1001:9100:/ 0.9903% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-ui1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:51:54] <wikibugs>	 Data-Engineering: Unable to spawn new environments - https://phabricator.wikimedia.org/T346397 (BTullis) >>! In T346397#9168439, @BTullis wrote: > I have a strong hunch what the problem is here and I think that it is the sqlalchemy package. > {F37724171,width=90%} >  > I just saw a failure to start jupyterhu...
[21:55:33] <wikibugs>	 Data-Platform-SRE: Implement depool (source only) and keep-downtime options on data-transfer cookbook - https://phabricator.wikimedia.org/T340793 (RKemper)
[21:55:36] <wikibugs>	 Data-Platform-SRE, Release-Engineering-Team, Scap: "scap deploy"'s config-deploy should check for broken symlinks - https://phabricator.wikimedia.org/T342162 (RKemper)
[21:55:53] <wikibugs>	 Data-Platform-SRE: Implement depool (source only) and keep-downtime options on data-transfer cookbook - https://phabricator.wikimedia.org/T340793 (RKemper) Removed subtask because I think the scap ticket is not directly related to this one.
[22:16:29] <wikibugs>	 Data-Engineering, Beta-Cluster-Infrastructure: Beta logstash filled with kafka errors - https://phabricator.wikimedia.org/T346402 (colewhite) It appears there is some problem with the kafka-jumbo nodes in deployment prep.   Logstash appears functional.  Adding Data Engineering for the kafka-jumbo nodes.
[22:17:31] <wikibugs>	 Data-Engineering, Beta-Cluster-Infrastructure: Many kafka errors in beta/deployment-prep - https://phabricator.wikimedia.org/T346402 (colewhite)
[22:19:28] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) Hmm. Not the best result. Firstly, it complains of a missing library when trying to list en...