[01:15:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:18:42] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:49:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [04:48:36] 10Data-Engineering, 10cloud-services-team: dbproxy1018 - https://phabricator.wikimedia.org/T346012 (10Marostegui) [04:48:52] 10Data-Engineering, 10cloud-services-team: dbproxy1018 alert for two instances down - https://phabricator.wikimedia.org/T346012 (10Marostegui) [06:49:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [07:34:04] Good morning team! [07:45:30] 10Data-Platform-SRE, 10Observability-Metrics, 10SRE, 10superset.wikimedia.org: statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761 (10fgiunchedi) The statsd-exporter part of this work is happening in {T345790} because we need to make graphite failovers simpler. Technically... [08:25:32] brouberol: Hey! How is this new week going? [08:25:42] Morning brouberol o/ [08:25:47] Do you need some pairing on anything? 
I have ~1h where I'm available [08:26:20] If you want to go over the plan for that cookbook for example, or if there is anything blocked in your onboarding checklist [08:28:26] FYI puppet is in a weird state on an-test-master1002.eqiad.wmnet, one run makes some changes, the next one fails, and so on (see https://puppetboard.wikimedia.org/node/an-test-master1002.eqiad.wmnet ) [08:34:42] (SystemdUnitFailed) firing: kube-controller-manager.service Failed on dse-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:36:31] volans: Thanks. I will check it out. [08:36:42] thank you :) [08:39:42] (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on dse-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:45:38] ^ I'm not quite sure what's happening with this kube-controller-manager - Looks like there has been a change today and puppet has restarted it, but I'm not sure of the rest of the timings. [08:50:17] Everything's going well! I'm more than happy to pair after my 1/1 with hashar, in 15 min [08:50:38] I've started to hit issues with Curator not working with opensearch [08:51:17] Oh, I'm not really surprised :/ [08:51:24] Do you have a dependency on curator?
[08:51:44] the ElasticsearchClient object has, yes [08:51:46] we have our SRE sync in 10', let's see then [08:53:42] (SystemdUnitFailed) firing: kube-controller-manager.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:58:42] (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:00:56] I'll be there in 1 min [09:22:07] 10Data-Platform-SRE, 10Discovery-Search, 10Elasticsearch: cleanup the custom elasticsearch_${version}@ systemd unit in favor of an override configuration - https://phabricator.wikimedia.org/T218315 (10Gehel) [10:00:21] explanations about why we're stopping the investigation of adding official OpenSearch support in spicerack for now: https://phabricator.wikimedia.org/T345900#9155620 [10:22:17] 10Data-Platform-SRE, 10Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (10brouberol) [10:28:02] I've created https://gerrit.wikimedia.org/r/c/operations/puppet/+/956383 to stop the puppet errors on an-master1002. [10:48:59] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [11:39:32] qq about gerrit: if I submit multiple patches via multiple calls to `git review -w`, is there a way to see the combined diff of all patches stacked on each other? [11:41:15] ah, I think I needed to select the diff base as `Base -> patchset n` [11:41:52] Hmm. I don't think so. 
Are they in the same branch? Usually when I have multiple patches in the same branch, a single `git review` will make a 'chain of patches' that you can step through, but they're still not shown concurrently. [11:43:20] Oh right, so you're still talking about a single patch, is that correct? But different versions of the patch, show up as different patchsets. So you can see how a certain patch evolved over time. [11:45:32] yes exactly! I keep `commit --amend`-ing my commit (as I believe this is "the gerrit way"), meaning I keep adding to the same patch, via stacking patchsets [11:46:10] and the gerrit UI was showing me the diff between patchset n and n-1 by default, and switching `patchset n-1` with `Base` allowed me to see the "whole" diff [11:46:20] Then yes, you've got it. [11:46:54] One gerrit feature that it took me a while to get the hang of was the chained patches. So you can do something like: [11:47:09] * btullis 1)Create branch [11:47:15] Oops. [11:47:39] when viewing the diff between patchsets, say /3..6 you also see the comments related to those, it's pretty useful and practical to navigate to reply to comments [11:49:01] gotcha, thanks! [11:49:05] 1) Create branch 2) Deploy change to single host 3) Deploy to all hosts - These three patches can be on the same branch and sent with a single `git review` - gerrit will show the relation chain, allowing a reviewer to see the three together. [11:49:51] If you need to rebase your branch, it will update all linked changes. [12:03:31] 10Data-Platform-SRE, 10Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (10brouberol) Alright, I seem to be getting working results, at least in dry-run mode: https://phabricator.wikimedia.org/P52407 ! 
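A minimal local sketch of the amend-based "one change, many patchsets" workflow discussed above (repo, file name, and commit messages are illustrative; no gerrit remote is involved):

```shell
# Local demo: amending a commit keeps a single change, which gerrit would
# show as successive patchsets of that change. All names are made up.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.org
git config user.name demo
echo v1 > cookbook.py
git add cookbook.py
git commit -q -m "cookbook: rolling restart of datahubsearch"
# Address review comments by amending, not by adding a new commit:
echo v2 > cookbook.py
git commit -qa --amend --no-edit
# Still exactly one commit; 'git review' would upload it as patchset 2.
git log --oneline
```

Stacking *separate* commits on the same branch instead (without `--amend`) is what produces the relation chain of distinct changes described above.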
[12:05:03] 10Data-Platform-SRE, 10Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (10brouberol) Note: due to the reasons explained [[ https://phabricator.wikimedia.org/T345900#9155620 | here ]], we're swapping using the official `opens... [12:06:00] btullis: I'm not sure what you mean by Deploy change to single/all host(s) [12:09:53] oh, these were some "imaginary" patches for illustration purposes, right? [12:22:50] and so, in that scenario, each commit can be rebased independently, so that the final review shows a chain of linked changes, each of them self-contained. Did I get that right? [12:43:48] Now that the opensearch cookbook is ready for review, I was thinking about starting to look into https://phabricator.wikimedia.org/T343762 (hadoop worker provisioning). Would anyone be able to assist/pair to help me get started? Thanks! [12:43:50] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Rename usages of whitelist to allowlist in query service rdf repo - https://phabricator.wikimedia.org/T344284 (10dcausse) a:03dcausse [12:55:59] brouberol: Yes, sure thing. We have some docs on adding new hadoop workers, but it's always worth reviewing to see if anything needs to be updated. https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Standard_Worker_Installation [13:00:18] So you should already be able to SSH to the new hosts and you'll be able to see from the `motd` that they are currently in the `insetup::data_engineering` role. [13:00:20] https://usercontent.irccloud-cdn.com/file/u84vGdYY/image.png [13:01:36] This role is configured here: https://github.com/wikimedia/operations-puppet/blob/production/manifests/site.pp#L137-L140 [13:02:20] So one of the steps required is a modification of that site.pp file to add these hosts to the `analytics_cluster::hadoop::worker` role instead.
[13:03:02] They can be done in bulk, or individually. However, there are a few steps we need to do first. [13:04:29] 1) On each host, we need to run this manual journalnode creation script linked from here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Standard_Worker_Installation (We haven't got around to automating it yet) [13:07:21] 2) We need to run the `hadoop-init-workers` cookbook for each host. This sets up the RAID controller with the 12 x RAID0 volumes, formats each of them as an ext4 volume, then mounts them. Again, it's something that we *could* have done automatically during installation, but it hasn't been a priority. At least this has a cookbook. [13:07:56] Thanks! I have a couple of 1/1 and I'll have a look right after [13:08:41] 3) We need to create some kerberos keytabs for each of the hosts and make them available via our secret repository in puppet: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Kerberos/Administration#Create_a_keytab_for_a_service [13:09:09] We can look at this step together. [13:11:24] Here is a similar ticket that I worked on: https://phabricator.wikimedia.org/T275767 although there have been a few changes since then. [13:22:49] 10Data-Engineering, 10cloud-services-team: dbproxy1018 alert for two instances down - https://phabricator.wikimedia.org/T346012 (10Marostegui) [13:25:02] 10Data-Platform-SRE, 10Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (10bking) This is complete; closing. [13:27:38] 10Data-Platform-SRE, 10Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (10bking) Reopening as the following hosts are still on Buster: `apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet,search-loader2001.codfw.wmnet,search-loader1001...
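The three preparation steps above can be summarised as a printable plan (a sketch only: the host expression is from the ticket, the cookbook name follows the chat, and the journalnode/keytab steps are left as pointers to the wiki pages since their exact commands aren't in this log):

```shell
# Print the preparation plan for the new workers; nothing is executed here.
# Names are taken from the chat/ticket and should be verified on wikitech.
set -euo pipefail
HOSTS='an-worker11[49-56].eqiad.wmnet'
plan() { printf '%s\n' "$*"; }

# 1) run the manual journalnode creation script on each host
#    (see the Standard_Worker_Installation wiki section; not yet automated)
# 2) set up 12 x RAID0 volumes, format each as ext4, and mount them:
plan "sudo cookbook sre.hadoop-init-workers '${HOSTS}'"
# 3) create kerberos keytabs per host and commit them to the puppet
#    secret repository (see the Kerberos/Administration wiki section)
```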
[13:37:58] 10Data-Platform-SRE, 10Discovery-Search: Migrate apifeatureusage/search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10bking) [13:38:29] 10Data-Platform-SRE, 10Discovery-Search: Migrate apifeatureusage/search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10bking) [13:38:31] 10Data-Platform-SRE, 10Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (10bking) [13:47:24] 10Data-Platform-SRE, 10Discovery-Search: Migrate apifeatureusage/search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10bking) @EBernhardson @dcausse as far as replacing these hosts: - Is there a way to test these hosts ahead of time? I didn't see apifeatureusage or search-loa... [14:12:18] 10Data-Platform-SRE, 10Discovery-Search: Migrate apifeatureusage/search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10dcausse) apifeatureusage are logstash hosts so it might be better to ask the o11y team for advice here; regarding the search-loader hosts, @EBernhardson might kn... [14:33:38] 10Data-Platform-SRE, 10Discovery-Search: Migrate apifeatureusage/search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10EBernhardson) Indeed the mjolnir changes should be straightforward. All their state is stored in kafka, if they get restarted they pick back up after the last... [14:57:23] btullis: I have run through the configuration of an-worker1149.eqiad.wmnet, but it seems it's missing the 12 attached disks.
lsblk only shows /dev/sda* [14:57:31] 10Data-Engineering, 10CirrusSearch, 10Discovery-Search: [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046 (10pfischer) [14:58:29] 10Data-Engineering, 10CirrusSearch, 10Discovery-Search: [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046 (10pfischer) [14:59:11] 10Data-Platform-SRE: Troubleshoot rdf-streaming-updater/dse-k8s cluster - https://phabricator.wikimedia.org/T346048 (10bking) [15:02:22] 10Data-Platform-SRE: Bring Hadoop workers an-worker11[49-56] into service - https://phabricator.wikimedia.org/T343762 (10brouberol) a:03brouberol [15:06:18] (ls /dev/sd* only shows `/dev/sda /dev/sda1 /dev/sda2 /dev/sda5`) [15:06:35] (03CR) 10Mforns: "One last annoying comment here, and optional as well, I'm ok with this merged as is, too:" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: 10Clare Ming) [15:06:42] (03CR) 10Mforns: [C: 03+1] Add Metrics Platform fragments by platform only [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: 10Clare Ming) [15:07:33] 10Data-Platform-SRE, 10SRE, 10ops-codfw: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (10Jhancock.wm) p:05Triage→03Medium [15:19:36] 10Data-Platform-SRE: Find/fix logstash logging for rdf-streaming-updater - https://phabricator.wikimedia.org/T345668 (10Gehel) [15:21:08] 10Data-Platform-SRE: Migrate apifeatureusage/search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10Gehel) [15:21:57] 10Data-Engineering, 10CirrusSearch, 10Discovery-Search (Current work): [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046 (10Gehel) [15:22:00] 10Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - 
https://phabricator.wikimedia.org/T346039 (10bking) [15:24:16] 10Data-Platform-SRE, 10Discovery-Search: Migrate apifeatureusage hosts to Bullseye or later - https://phabricator.wikimedia.org/T346053 (10bking) [15:24:41] btullis: Heya - Is there any plan on stopping oozie for real? [15:24:58] brouberol: Hmm. It looks like the disks aren't yet presented to the operating system. There needs to be a hardware RAID0 volume for each of the disks, before the O/S can see them. I thought that the cookbook did that, but it looks like it doesn't. I'll try to find it. [15:25:46] joal: Yes there is. https://phabricator.wikimedia.org/T341893 [15:27:54] 10Data-Platform-SRE, 10Discovery-Search (Current work): Restore dse-k8s' rdf-streaming-updater from savepoint/improve bootstrapping process - https://phabricator.wikimedia.org/T345957 (10Gehel) a:03bking [15:28:24] btullis: any info on prioritization of this on your side? [15:29:15] I think it's driven by requests from you :-) [15:30:48] ack! thanks btullis [15:32:37] I think we had discussed doing Hue first (T341895) but now that ticket is blocked by ongoing discussions. I've no problem with bumping Oozie up the list. [15:32:37] T341895: Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895 [15:33:08] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Epic, 10Event-Platform, 10Patch-For-Review: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (10lbowmaker) [15:33:26] understood btullis - Thank you for the explanation :) [15:34:21] brouberol: It looks like there is a `megacli` command that we need to run `megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0` [15:36:14] Perhaps it used to be DC Ops who ran this command when initially setting up the server, but for some reason it hasn't been run on these.
It's mentioned here: https://phabricator.wikimedia.org/T290805#7352410 & it's mentioned under a different section here: https://wikitech.wikimedia.org/wiki/MegaCli#Replace_individual_disks_in_JBOD [15:37:19] Maybe we should add a reference to it in https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Standard_Worker_Installation in case we need to refer to it again. [15:37:25] let me try [15:37:42] is that safe to run on an-worker1149 ? [15:38:28] 10Data-Engineering, 10Data Pipelines (Sprint 14), 10Data Products (Sprint 00), 10Google-Chrome-User-Agent-Deprecation, 10Product-Analytics (Kanban): [SPIKE] Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (10mforns) Here's a spreadsheet with an analy... [15:41:21] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team: Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10lbowmaker) [15:41:32] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 2): Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10lbowmaker) [15:43:02] > is that safe to run on an-worker1149 ? [15:43:02] Yes, it's not in service so you can't really do any harm. There are lots of read-only `megacli` queries here, if you would like to interrogate the controller before running something that writes configuration. https://wikitech.wikimedia.org/wiki/MegaCli#Disk_status [15:44:21] e.g. `megacli -PDList -aall` will show you everything about the physical disks present. `megacli -LDInfo -Lall -a0` will show you the logical drives, of which there is probably only one at the moment. [15:44:57] thanks, let me keep notes about that. The command ran successfully, and thus the cookbook ran successfully as well [15:45:47] Great. Sorry the info wasn't complete.
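The inspect-then-write sequence above can be sketched with a dry-run guard (an assumption-laden sketch: flag spellings follow the chat and the MegaCli wiki page, `megacli` must exist on the real host, and with DRY_RUN=1, the default here, commands are only printed, never executed):

```shell
# Inspect the RAID controller read-only first, then create one RAID0
# logical drive per disk. DRY_RUN=1 (default) only prints the commands.
set -euo pipefail
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "DRY-RUN: $*"; else sudo "$@"; fi
}

run megacli -PDList -aall            # read-only: physical disks and state
run megacli -LDInfo -Lall -a0        # read-only: existing logical drives
# Write step (WB=write-back, RA=read-ahead), as run on an-worker1149:
run megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
```

Running with `DRY_RUN=0` on a host that is not yet in service matches the "you can't really do any harm" guidance above; on an in-service host, stick to the read-only queries.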
[15:46:28] 10Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10MoritzMuehlenhoff) >>! In T346039#9156442, @bking wrote: > @EBernhardson @dcausse as far as replacing these hosts: > > - Is there a way to test these hosts ahead of time? I didn't see apifeatureu... [15:47:25] I ran the megacli cmd on all remaining 7 hosts via cumin [15:48:58] 10Data-Platform-SRE: Bring Hadoop workers an-worker11[49-56] into service - https://phabricator.wikimedia.org/T343762 (10brouberol) The disks were not mounted on these 8 hosts. I ran a `megacli` command on all hosts to fix the situation: ` brouberol@cumin1001:~$ sudo cumin 'an-worker11[49-56].eqiad.wmnet'... [15:52:40] 10Data-Platform-SRE: Bring Hadoop workers an-worker11[49-56] into service - https://phabricator.wikimedia.org/T343762 (10brouberol) **`an-worker1149.eqiad.wmnet` ** [x] Setup journal node [x] Create kerberos keytabs [x] Commit kerberos keytabs in puppet [x] Run `sre.hadoop-init-workers` cookbook **`an-worker115... [15:57:18] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) I've prepared a patch to production-images that will allow us to build multiple versions of spark. Before I merge it, I'm ju... [16:13:42] I'm out for the day! Enjoy your day/evening! [16:22:03] (03CR) 10DCausse: "this patch was made at a time when we fetched the data very early in the pipeline, now that we're fetching the content late we've lost tra" [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: 10DCausse) [16:33:35] 10Data-Platform-SRE: Troubleshoot rdf-streaming-updater/dse-k8s cluster - https://phabricator.wikimedia.org/T346048 (10bking) Update: After some consultation in #wikimedia-k8s-sig , this doesn't seem to be a DNS issue.
So it's most likely firewall rules...will continue troubleshooting and get back. [16:59:16] 10Data-Engineering, 10Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (10mfossati) [18:31:29] (03CR) 10Gmodena: [C: 03+1] "LGTM. Left a (non blocking) comment re possible approaches for iterative stream/schema development." [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: 10DCausse) [18:46:26] 10Data-Platform-SRE, 10Patch-For-Review: Service implementation for wdqs202[3-5].codfw.wmnet - https://phabricator.wikimedia.org/T345475 (10RKemper) a:03RKemper [18:49:14] 10Data-Platform-SRE: Service implementation for wdqs101[4,5,6] - https://phabricator.wikimedia.org/T314890 (10Gehel) [18:50:40] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: eventutilities-python: Gitlab CI pipeline should use memory optimized runners. - https://phabricator.wikimedia.org/T346084 (10gmodena) [18:57:06] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [BUG] eventutilites-python: fix type checking CI job - https://phabricator.wikimedia.org/T346085 (10gmodena) [19:45:16] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [BUG] eventutilites-python: fix type checking CI job - https://phabricator.wikimedia.org/T346085 (10Antoine_Quhen) For Airflow dags, we are using trusted-runners provided by rel-eng. You may want to build an adhoc pyflink imag... [19:54:26] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: eventutilities-python: Gitlab CI pipeline should use memory optimized runners. - https://phabricator.wikimedia.org/T346084 (10Antoine_Quhen) For Airflow dags, we are using trusted-runners provided by rel-eng. You may want to b... 
[20:59:15] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 2): Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10lbowmaker) @JAllemandou - ok to deprecate now? cc - @Ahoelzl [21:21:54] 10Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (10BTullis) 05Open→03Resolved Great! Thanks for confirming @OSefu-WMF. [21:28:10] 10Data-Engineering, 10CirrusSearch, 10Data Engineering and Event Platform Team, 10Discovery-Search (Current work): [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046 (10lbowmaker) [21:37:48] 10Data-Platform-SRE, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. - https://phabricator.wikimedia.org/T345726 (10RLazarus) a:03joanna_borun Hi @joanna_borun -- does this need Infrastructure Foundations approval? [22:55:36] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 2): Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10Ahoelzl) I suggest the following steps before turning off: [ ] Check logs, that no jobs have been running on the system [ ] Send email...