[00:00:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:49] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:04:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service,produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:37:48] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:02:48] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:07:48] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:27:48] (SystemdUnitFailed) firing: (3) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:34:16] 10Quarry: Move away from nfs? - https://phabricator.wikimedia.org/T349690 (10SD0001) Would be good to consolidate discussion in {T178520} - maybe we could switch directly to object storage. [05:34:12] 10Quarry: Move away from nfs? - https://phabricator.wikimedia.org/T349690 (10rook) >>! In T349690#9282576, @SD0001 wrote: > Would be good to consolidate discussion in {T178520} - maybe we could switch directly to object storage. Oh look at that, been a good idea for some time. 
[05:35:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [06:02:49] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:07:49] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:05:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [07:23:43] \me waves good morning! [07:24:01] oops, first character, first typo. brb, coffee time [07:39:26] o/ goodmorning brouberol [07:44:56] brouberol: I notice from backlog that you had an issue with a cookbook, was it solved? Was it the decommissioning one? [07:58:30] yes, the issue was solved by elukey. It was a missing backslash that caused `/tmp/reuse-parts` to not be deployed in the busybox in-memory FS of the debian installer, meaning that the partition layout had issues. It was fixed, and we were able to reimage the host. 
I'm moving onto the 2/3 [08:00:40] it was also caused by me, so I'd need to take some blame for the time lost :( [08:00:53] no blame to assign really. It happens [08:01:14] and I got to learn a ton by seeing what the non-happy path was [08:01:48] ack, thanks both! [08:02:07] and semi-tongue-in-cheek-semi-serious, as my dad manages a datacenter in Paris, it gave me talking points for the holiday dinners [08:02:14] so all good [08:03:16] rotfl [08:04:05] I see it was a reimage, what I wanted to say about the decommissioning cookbook is that it's idempotent, so you can re-run it at will and if it doesn't run then it means we need to patch it to make sure it can run [08:05:15] oh, yes. I think what happened there is that the cumin host rebooted (see #w-security), meaning I lost my screen altogether, and had to run various sub-cookbooks to fallback to the intended state [08:06:11] (I'm starting the re-imaging of kafka-jumbo1008) [08:06:15] !log sudo cookbook sre.hosts.reimage --os bullseye -t T348495 kafka-jumbo1008 [08:06:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:06:18] T348495: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 [08:07:22] 10Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host kafka-jumbo1008.eqiad.wmnet with OS bullseye [08:16:05] elukey: I'm running the reimage cookbook and I confirm that the partman step now works top notch [08:16:44] yay [08:20:40] volans: if you want to laught at my expense: https://gerrit.wikimedia.org/r/c/operations/puppet/+/968659 (I applied a fix that Moritz added to reuse-parts.cfg, and I failed to copy it in the right way) [08:20:56] (so /lib/partman/display.d/70reuse-parts; [08:21:04] ended up without execute perms) [08:21:13] *laugh [08:23:19] eheheh, yeah those partman bash 
scripts are the worst in terms of maintainability and it's super easy to make mistakes... in particular the gigantic switch case for the recipes [08:24:09] but it was nice to debug d-i, I understood more things about what we do with partman [08:25:25] speaking of the backslashes in the giant switch case. Could we, say, store the hostname patterns -> partman recipes association in a config file, and generate the bash script via puppet, to avoid these typos? [08:26:01] Morning all. [08:28:00] morning! [08:34:15] morning [08:38:06] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) Updating the comment to show that we are starting with druid1006 switch with druid1011. output from the zookeper leader `druid1005` before start ` stevemunene@druid1005:~$ ech... [08:40:18] 10Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host kafka-jumbo1008.eqiad.wmnet with OS bullseye completed: - kafka-jumbo1008 (**PASS**) - Downtimed on... [08:48:54] !log sudo cookbook sre.hosts.reimage --os bullseye -t T348495 kafka-jumbo1009 [08:49:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:49:00] T348495: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 [08:49:16] Heads up, about to start replacing the servers on the ZooKeeper cluster for `druid-public-eqiad` cluster, colocated on druid hosts with newer bullseye hosts for T336042. 
https://wikitech.wikimedia.org/wiki/Zookeeper [08:49:17] T336042: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 [08:51:24] 10Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host kafka-jumbo1009.eqiad.wmnet with OS bullseye [09:18:54] !log stop zookeper on druid1006 T336042 [09:18:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:18:57] T336042: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 [09:19:56] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) Disabled puppet on all the `druid::public::worker` hosts. proceeding to disable zookeper on `druid1006` ` stevemunene@druid1006:~$ sudo systemctl stop zookeeper stevemunene@dr... [09:23:19] 10Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host kafka-jumbo1009.eqiad.wmnet with OS bullseye completed: - kafka-jumbo1009 (**PASS**) - Downtimed on... [09:26:33] 10Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (10brouberol) ` brouberol@cumin1001:~$ sudo cumin A:kafka-jumbo 'grep -i version /etc/os-release' 9 hosts will be targeted: kafka-jumbo[1007-1015].eqiad.wmnet OK to proceed on 9 hosts? Enter the number... 
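(Editor's sketch, picking up the 08:25:25 suggestion of keeping the hostname pattern -> partman recipe association in a config file and generating the shell `case` statement from it. This is only an illustration in Python rather than a Puppet template; the variable name `hostname`, the patterns, and the recipe file names below are hypothetical and are not taken from the real puppet repository.)

```python
import shlex


def render_case(var: str, recipes: dict[str, str]) -> str:
    """Render a shell `case` statement mapping hostname globs to recipe names.

    `recipes` maps a shell glob pattern to a recipe file name. Every branch
    is emitted by the same template line, so a forgotten backslash or `;;`
    cannot creep into a single entry.
    """
    lines = [f'case "${var}" in']
    for pattern, recipe in recipes.items():
        # shlex.quote protects recipe names containing unusual characters.
        lines.append(f"    {pattern}) echo {shlex.quote(recipe)} ;;")
    lines.append("esac")
    return "\n".join(lines)


# Hypothetical mapping, for illustration only.
EXAMPLE_RECIPES = {
    "kafka-jumbo10*": "reuse-parts.cfg",
    "druid10*": "standard.cfg",
}

if __name__ == "__main__":
    print(render_case("hostname", EXAMPLE_RECIPES))
```

(The mapping could equally live in a YAML or Hiera file; the point is only that the shell script becomes generated output rather than hand-edited text.)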
[09:26:46] 10Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (10brouberol) 05In progress→03Resolved [09:26:48] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) [09:26:50] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10brouberol) [09:36:02] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10brouberol) [09:40:10] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10brouberol) [09:40:25] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10brouberol) [09:47:54] 10Data-Platform-SRE, 10Infrastructure-Foundations, 10Product-Analytics, 10SRE, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) [09:49:44] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) Merged, running puppet on `druid1011` status on zk leader ` stevemunene@druid1005:~$ echo mntr | nc localhost 2181 zk_version 3.4.13-2--1, built on Tue, 04 Jun 2019 21:22:04 -... 
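(Editor's sketch: the `echo mntr | nc localhost 2181` check used above prints one whitespace-separated key/value pair per line, and the `zk_server_state` key tells whether the node is currently the leader or a follower. A minimal Python parser for that output could look like the following; the sample text is made up for illustration and is not the truncated output from the ticket.)

```python
def parse_mntr(output: str) -> dict[str, str]:
    """Parse the key/value lines printed by ZooKeeper's `mntr` command."""
    stats = {}
    for line in output.splitlines():
        parts = line.split(None, 1)  # key, then the rest of the line
        if len(parts) == 2:
            stats[parts[0]] = parts[1].strip()
    return stats


def is_leader(stats: dict[str, str]) -> bool:
    """Return True if the parsed stats describe the current leader."""
    return stats.get("zk_server_state") == "leader"


# Hypothetical sample output, for illustration only.
SAMPLE_MNTR = (
    "zk_version\t3.4.13-example, built on some date\n"
    "zk_server_state\tleader\n"
    "zk_followers\t2\n"
)

if __name__ == "__main__":
    stats = parse_mntr(SAMPLE_MNTR)
    print(stats["zk_server_state"], is_leader(stats))
```

(Running the same check against each member before and after a restart is one way to confirm which host holds leadership at each step.)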
[10:07:49] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:17] !log restart zookeper leader to pick up new host druid1011 T336042 [10:18:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:18:21] T336042: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 [11:27:10] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) Restarted the ZK cluster leader after running puppet on the host to update the config. `druid1011` is yet to join the zk cluster due to constant timeouts. ` 2023-10-26 11:23:3... [11:36:49] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:37:49] (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:58:06] (03CR) 10Michael Große: "This change is ready for review." 
[analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969103 (owner: 10Michael Große) [12:02:49] (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:03:50] btullis I was wondering how we could test https://gerrit.wikimedia.org/r/c/operations/puppet/+/968612/9 without merging to production, nor using puppet environments, and I'm coming up short [12:05:33] We can create a file called `hieradata/hosts/an-test-client1002.yaml` and in there define: `profile::airflow::manage_skein_certificate: true` [12:05:57] On the one hand, 968612 is a no-op fleet-wide and 968613 would only have an impact on a single test machine, but it does feel weird that we'd need to merge something to production to test something that might not work out [12:06:28] Oh I see that's what you've done :-) [12:06:36] yep, this is what I'm doing in https://gerrit.wikimedia.org/r/c/operations/puppet/+/968613/4, but this would still require https://gerrit.wikimedia.org/r/c/operations/puppet/+/968612/9 to be merged first, right? [12:07:06] even though we might come to the conclusion that managing the skein crt with our PKI does not work, for $reasons [12:07:36] and I guess that we could revert both if that's the case, so this might be a moot point [12:07:49] (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:10:49] Yes, I'm afraid so. That is the nature of things here. 
The only real alternative that you have is to create one or more VMs in WMCS: See: https://wikitech.wikimedia.org/wiki/Help:Puppet#Apply_a_puppet_role_that_has_not_been_merged_into_operations/puppet.git_yet [12:12:17] There is also Pontoon, which is intended to address some of these quandaries: https://wikitech.wikimedia.org/wiki/Puppet/Pontoon [12:14:51] I think that could work in theory, but because we'd need to figure out if submitting spark jobs from airflow via skein would work with a profile::PKI-generated certificate, we'd need to recreate a whole airflow/spark environment in WMCS, wouldn't we? [12:15:05] ...but once again, it's outside of the production realm. Inside the production realm, we only run puppet against the production branch. So yes, we might end up reverting both CRs if skein doesn't like it, but at least the boolean allows it to hit only the test cluster. [12:15:10] so maybe deploying these changes to our production test instance might [12:15:19] *might be the simplest way forward [12:15:40] wdyt? [12:16:34] ^ agree. Yes, one could argue that it would be worthwhile building an airflow/spark pipeline env in WMCS would be worth it, but it's a lot of work for this small change. [12:16:35] (03CR) 10Michael Große: "This change is ready for review." [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969107 (https://phabricator.wikimedia.org/T348644) (owner: 10Michael Große) [12:17:04] (that sentence was jumbled, but I think the idea was there) [12:17:21] :D the message got across just gine [12:17:23] *fine [12:27:25] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ca92fb50-91c0-4832-a18d-b71b3e5cae7d) set by stevemunene@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their service... [12:31:12] alright, I'm going to set both these PRs as reviewable then. 
At your own convenience, of course [12:50:25] btullis: Heya - question for you - do you know if we have a grafana test instance? [12:50:47] btullis: I have had the funny/crazy idea to see if we could plug in presto into grafana [13:18:10] joal: We do have this: https://grafana-next.wikimedia.org/ but it's mainly used for testing version upgrades to Grafana, I think. https://wikitech.wikimedia.org/wiki/Grafana#Version_upgrade [13:18:27] I'd love to hear your crazy idea. Do you want to jump on a call? [13:18:37] sure btullis - batcave! [13:25:49] btullis we're in pairing if you wanna join [13:26:24] Be there in 5 [13:33:31] btullis I think we're winding down, but ping us if you wanna get back together [13:39:26] Ping! [13:40:04] pong! [14:02:49] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:05:04] I noticed that an-test-master100[12] have a broken dpkg state, is that a known issue? conda-analytics depends on mariadb-dev which transitively depends on mysql-common which is unavailable [14:05:52] Oh, that's my fault. Sorry. I will try to revert. I was testing with a pre-release version of conda-analytics. 
[14:07:01] ah, ok [14:07:56] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add link to Grafana dashboard where data is used [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969103 (owner: 10Michael Große) [14:08:31] (03Merged) 10jenkins-bot: Add link to Grafana dashboard where data is used [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969103 (owner: 10Michael Große) [14:09:18] (03PS1) 10Lucas Werkmeister (WMDE): Add link to Grafana dashboard where data is used [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969146 [14:09:23] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add link to Grafana dashboard where data is used [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969146 (owner: 10Lucas Werkmeister (WMDE)) [14:10:29] (03Merged) 10jenkins-bot: Add link to Grafana dashboard where data is used [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969146 (owner: 10Lucas Werkmeister (WMDE)) [14:11:37] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix link to Grafana dashboard [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969107 (https://phabricator.wikimedia.org/T348644) (owner: 10Michael Große) [14:12:34] (03Merged) 10jenkins-bot: Fix link to Grafana dashboard [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969107 (https://phabricator.wikimedia.org/T348644) (owner: 10Michael Große) [14:21:01] (03PS1) 10Lucas Werkmeister (WMDE): Fix link to Grafana dashboard [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969147 (https://phabricator.wikimedia.org/T348644) [14:21:11] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix link to Grafana dashboard [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969147 (https://phabricator.wikimedia.org/T348644) (owner: 10Lucas Werkmeister (WMDE)) [14:22:59] (03Merged) 10jenkins-bot: Fix link to Grafana dashboard [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969147 (https://phabricator.wikimedia.org/T348644) (owner: 10Lucas 
Werkmeister (WMDE)) [14:23:11] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10serviceops, 10Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (10dcausse) [14:23:31] (03PS1) 10Lucas Werkmeister (WMDE): README: Clarify development process [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969127 [14:26:18] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10serviceops, 10Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (10dcausse) {F40321464} [14:29:06] (03CR) 10Michael Große: [C: 03+2] README: Clarify development process [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969127 (owner: 10Lucas Werkmeister (WMDE)) [14:30:02] (03Merged) 10jenkins-bot: README: Clarify development process [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969127 (owner: 10Lucas Werkmeister (WMDE)) [14:33:38] (03PS1) 10Lucas Werkmeister (WMDE): README: Clarify development process [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969148 [14:33:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] README: Clarify development process [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969148 (owner: 10Lucas Werkmeister (WMDE)) [14:34:14] (03Merged) 10jenkins-bot: README: Clarify development process [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/969148 (owner: 10Lucas Werkmeister (WMDE)) [14:46:35] 10Data-Engineering, 10CX-cxserver, 10Citoid, 10Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10MSantos) @Jdforrester-WMF here's a few questions that I have: **Recommendation API ownership** The former #product-inf... 
[15:06:12] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10serviceops, 10Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (10Joe) The problem can also be that we have one component in front of the service (envoyproxy)... [15:21:39] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) >>! In T336042#9283701, @Stevemunene wrote: > Restarted the ZK cluster leader after running puppet on the host to update the config. > `druid1011` is yet to join the zk cluster... [15:29:26] !log stop zookeper on druid1005 current leader for the `druid-public-eqiad` this will trigger the election of a new leader T336042 [15:29:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:29:29] T336042: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 [15:31:01] 10Data-Platform-SRE, 10Cloud-VPS, 10SRE, 10cloud-services-team, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10VRiley-WMF) New locations are as follows cloudvirt-wdqs1001 - E 4. U 18. port 35. CableID 70824500012 cloudvirt-wdqs1002 - F 4. U 19. port 35. CableID 20... [15:44:07] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10xcollazo) We had an OpsWeek issue ([[ https://lists.wikimedia.org/hyperkitty/list/data-engineering-alerts@lists.wikimedia... [16:16:02] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) Zookeper stopped on `druid1005`, `druid1011` is now the new leader. ` stevemunene@druid1011:~$ echo mntr | nc localhost 2181 zk_version 3.4.13-6--1, built on Sun, 07 Feb 2021... 
[16:18:25] !log roll-restart druid public workers to pick up new zookeeper hosts. T336042 [16:18:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:18:28] T336042: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 [16:24:05] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Metrics, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10BTullis) I've made a small [[https://gerrit.wikimedia.org/r/969143|pu... [17:10:33] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work), 10Epic: Determine and control cirrus streaming updater's usage of MWAPI resources - https://phabricator.wikimedia.org/T349848 (10bking) [17:11:10] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work), 10Epic: Determine and control cirrus streaming updater's usage of MWAPI resources - https://phabricator.wikimedia.org/T349848 (10bking) [17:13:13] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work), 10Epic: Determine and control cirrus streaming updater's usage of MWAPI resources - https://phabricator.wikimedia.org/T349848 (10bking) [17:22:34] 10Data-Platform-SRE, 10Cloud-VPS, 10SRE, 10cloud-services-team, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10cmooney) >>! In T346948#9284698, @VRiley-WMF wrote: > New locations are as follows > > cloudvirt-wdqs1001 - E 4. U 18. port 35. CableID 70824500012 > > c... 
[17:27:29] 10Data-Platform-SRE, 10Discovery-Search (Current work): Create dashboards/alerts for new Search Update Pipeline - https://phabricator.wikimedia.org/T349772 (10bking) p:05Triage→03Medium a:03bking [17:32:49] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:47:49] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:52:14] 10Data-Platform-SRE, 10Cloud-VPS, 10SRE, 10cloud-services-team, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10taavi) >>! In T346948#9284698, @VRiley-WMF wrote: > cloudvirt-wdqs1002 - F 4. U 19. port 35. CableID 20220058 Thanks! I'm getting a duplicate cable ID ale... [17:56:17] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10xcollazo) TL;DR: Confirmed that Spark 3.4.1 works as well! :tada: (I also wanted to check Spark 3.3.3 on top of our Spark 3.3.2 shuf... [18:05:21] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10xcollazo) TL;DR: Confirmed that Spark 3.4.1 also works against Shuffler from Spark 3.3.2 :tada:. This is good since folks that want b... 
[18:15:21] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10xcollazo) @BTullis : > Maybe there is a neat solution to automate keeping a set of assembly files up-to-date, which match the shuffl... [18:48:42] 10Data-Platform-SRE, 10Discovery-Search (Current work): Create dashboards/alerts for new Search Update Pipeline - https://phabricator.wikimedia.org/T349772 (10RKemper) [18:49:07] 10Data-Platform-SRE, 10Discovery-Search (Current work): Create dashboards/alerts for new Search Update Pipeline - https://phabricator.wikimedia.org/T349772 (10RKemper) [18:50:09] 10Data-Engineering, 10CX-cxserver, 10Citoid, 10Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10gmodena) **Data Engineering ownerhship** `eventgate` is missing from your list, but has WIP to target node 18 https://... [18:51:45] 10Data-Engineering, 10CX-cxserver, 10Citoid, 10Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10gmodena) Also, `mediawiki/services/similar-users` is a Python service. [18:57:57] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Krinkle) >>! In T266798#9277763, @Ottomata wrote: > cc @Krinkle in c... 
[19:30:55] 10Data-Platform-SRE, 10Cloud-VPS, 10SRE, 10cloud-services-team, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [19:40:04] 10Data-Engineering, 10EventStreams, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Jdforrester-WMF) [19:40:12] 10Data-Engineering, 10CX-cxserver, 10Citoid, 10Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10Jdforrester-WMF) [19:42:08] 10Data-Platform-SRE, 10Cloud-VPS, 10SRE, 10cloud-services-team, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm executed with... [19:47:57] 10Data-Engineering, 10CX-cxserver, 10Citoid, 10Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10Jdforrester-WMF) [19:48:09] 10Data-Engineering, 10CX-cxserver, 10Citoid, 10Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10Jdforrester-WMF) >>! In T349118#9284438, @MSantos wrote: > @Jdforrester-WMF here's a few questions that I have: > > **... 
[20:00:01] 10Data-Platform-SRE, 10Cloud-VPS, 10SRE, 10cloud-services-team, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [20:05:57] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10CodeReviewBot) xcollazo opened https://gitlab.wikimedia.org/repos/da... [20:45:50] 10Data-Platform-SRE, 10Cloud-VPS, 10SRE, 10cloud-services-team, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm completed: - c... [21:47:49] (SystemdUnitFailed) firing: monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:02:47] 10Data-Engineering, 10DC-Ops, 10ops-eqiad: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10RobH) [23:03:00] 10Data-Engineering, 10DC-Ops, 10ops-eqiad: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10RobH) [23:47:49] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed