[00:01:12] Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (colewhite) Related: {T255864}
[00:16:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:43] (SystemdUnitFailed) resolved: monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:12:43] Data-Platform-SRE, Patch-For-Review: Service implementation for wdqs202[3-5].codfw.wmnet - https://phabricator.wikimedia.org/T345475 (RKemper)
[06:52:17] FYI, I'm rebooting one of the Kerberos servers in a few minutes, there should be no impact since all Kerberos clients are configured to transparently fall back to the second server
[07:03:11] the KDC is back up
[07:14:06] Thanks moritzm :)
[08:06:25] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene)
[08:31:54] (CR) Btullis: Increase the max kafka message size for gobblin (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/954968 (https://phabricator.wikimedia.org/T307959) (owner: Btullis)
[08:56:41] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: eventutilities-python: cookicutter template example should be updated - https://phabricator.wikimedia.org/T345390 (gmodena)
[08:57:10] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: Document the onboarding journey on Event Platfrom - https://phabricator.wikimedia.org/T345193 (gmodena)
[08:58:11] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: [SPIKE] Should we enable compression on kafka jumbo? - https://phabricator.wikimedia.org/T345657 (gmodena)
[08:58:13] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: Enum with an entry of `null` should fail jsonschema-tools validation - https://phabricator.wikimedia.org/T344511 (gmodena)
[08:59:40] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: [SPIKE] Should we enable compression on kafka jumbo? - https://phabricator.wikimedia.org/T345657 (gmodena)
[09:00:02] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1133.eqiad.wmnet with OS bullseye
[09:24:25] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1134.eqiad.wmnet with OS bullseye
[09:36:44] mforns: If I'm not mistaken, you have not added a note to the airflow task you manually set to succeeded. I don't think you have either logged the operations you did for the job to restart - can you confirm, and possibly update/log even if not at the exact date?
[09:38:16] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1133.eqiad.wmnet with OS bullseye completed: - an-worker1133 (**PASS**) - Downtimed on Icinga/Alertm...
[10:03:59] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1134.eqiad.wmnet with OS bullseye completed: - an-worker1134 (**PASS**) - Downtimed on Icinga/Alertm...
[10:23:01] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (Stevemunene) Seeing some HDFS corrupt blocks from 2023-09-07 10:03 UTC on [[ https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=39&from=1694080200938&to=1694082000938 | grafana ]]. Did a quick c...
[11:56:34] Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol) As requested by @Gehel, I pushed a boilerplate patch to gerrit, to make sure my accesses were working correctly. I've marked the change request...
[12:03:48] Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (BTullis) >>! In T344798#9149149, @brouberol wrote: > As requested by @Gehel, I pushed a boilerplate patch to gerrit, to make sure my accesses were working...
[12:37:36] Hey mforns - I just sent https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/488 - could you please review when you have a minute?
[12:42:45] (PS2) Joal: Remove special KaiOS App checks from pageview def [analytics/refinery/source] - https://gerrit.wikimedia.org/r/953198 (https://phabricator.wikimedia.org/T344071)
[12:52:38] (PS3) Joal: Remove special KaiOS App checks from pageview def [analytics/refinery/source] - https://gerrit.wikimedia.org/r/953198 (https://phabricator.wikimedia.org/T344071)
[12:52:42] (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on an-worker1147:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:53:26] (CR) Joal: [C: +2] "Merging for later deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/953198 (https://phabricator.wikimedia.org/T344071) (owner: Joal)
[12:54:07] btullis: just to let you know, I've made progress on the opensearch restart/reboot cookbook, leveraging the SREBatchBase/SREBatchRunnerBase base classes. I have what feels like a usable solution, but I'll need to get ssh access to the boxes to dig around and test things out, as I'm sure nothing would work at the moment
[12:54:16] ^ looking at worker1147
[12:54:42] btullis: Shall I remove the gobblin patch from the deploy etherpad? it seems there still is some discussion
[12:54:43] brouberol: Excellent!
[12:55:00] joal: Oh yes please, sorry I forgot that I added it.
[12:55:10] no prob btullis - doing
[12:55:53] joal: Are you happy for me to abandon the gobblin config patch? I think it's you with whom I was hoping to have the discussion :-)
[12:56:15] hm, actually the entire patch is not to abandon, only one of the lines to remove
[12:56:38] And, I'm still wondering about the global impact of using 10Mb batches instead of 1Mb batches when reading
[12:57:42] (SystemdUnitFailed) resolved: hadoop-yarn-nodemanager.service Failed on an-worker1147:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:57:43] Right, so I wasn't sure about the value of increasing that value from 1 MB to 10 MB at all.
[13:01:15] an-worker1147 ran out of heap space.
[13:01:18] 2023-09-07 12:49:56,419 WARN org.sparkproject.io.netty.channel.AbstractChannelHandlerContext: An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
[13:01:49] puppet has automatically restarted the service.
[13:04:50] (Merged) jenkins-bot: Remove special KaiOS App checks from pageview def [analytics/refinery/source] - https://gerrit.wikimedia.org/r/953198 (https://phabricator.wikimedia.org/T344071) (owner: Joal)
[13:13:27] btullis: about the access for brouberol (T345633), milimetric asked if `analytics-admins` is required as well. I suspect that `ops` would cover that already.
[13:13:59] but I suspect you know more than I do. Could you reply on task if we're missing something or not?
[13:14:00] Thanks!
[13:14:32] Correct, `ops` already covers that. I remember the same things being asked on my access ticket.
[13:15:18] https://phabricator.wikimedia.org/T285754#7184200
[13:15:36] I will reply on the task.
[13:24:55] Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (Volans) @BTullis For context the `sre.elasticsearch.rolling-operation` is currently not using the existing framework for batch actions and IMHO it should...
[13:32:32] Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol) ⚠ This is currently being implemented, meaning it's missing aliases, selectors, service names etc, that I was hoping to glean after I get SSH a...
[13:40:30] joal, yes... I forgot to log operations to irc, sorry. Re. Airflow task comments, we added a couple, but I think I forgot the mediawiki_history_reduced one... sorry again. Will do those
[13:40:37] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1135.eqiad.wmnet with OS bullseye
[13:43:01] Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (BTullis) >>! In T344798#9149548, @Volans wrote: > @BTullis For context the `sre.elasticsearch.rolling-operation` is currently not using the existing fram...
[13:43:05] !log (actual timestamp: 2023-09-06, 19:10:29 UTC) cleared airflow task mediawiki_history_reduced.check_mediawiki_history_reduced_error_folder (and subsequent tasks) for snapshot=2023-08. This was due to false positive errors having been generated by the checker.
[13:43:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:56:09] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1136.eqiad.wmnet with OS bullseye
[14:06:55] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene) @BTullis With the upcoming elevation of the `analytics-wmde` user to a systemwide user across nodes (airflow, stat100x, hadoop worker nodes, etc..) and membership of `a...
[14:18:56] * brouberol on a small break
[14:19:07] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1135.eqiad.wmnet with OS bullseye completed: - an-worker1135 (**PASS**) - Downtimed on Icinga/Alertm...
[14:25:43] Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol) Also, qq for @BTullis: what does: "depooling" mean? I see on https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Server_actions that it "remov...
[14:29:42] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (Jclark-ctr)
[14:31:20] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (Jclark-ctr)
[14:37:50] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1136.eqiad.wmnet with OS bullseye completed: - an-worker1136 (**PASS**) - Downtimed on Icinga/Alertm...
[16:07:13] qq: how can we edit the content generated by a mediawiki template? https://wikitech.wikimedia.org/wiki/SRE/Production_access#Setting_up_your_SSH_config lists out bast6002.wikimedia.org, but according to vgutierrez, it has been replaced by 6003. (Also, I'm asking here, for lack of a better channel that I can think of)
[16:10:48] brouberol: I don't think that bit of content is generated by a template, is it? I could be wrong. I think it's just part of the normal page content, so you can edit it with either of these buttons.
[16:10:51] https://usercontent.irccloud-cdn.com/file/gO7L02Qg/image.png
[16:11:34] Oh, I see what you mean.
[16:11:36] https://usercontent.irccloud-cdn.com/file/BsJn8nsw/image.png
[16:11:54] ^ exactly
[16:12:06] Click on the blue link there: https://wikitech.wikimedia.org/wiki/Template:BastionMap
[16:12:55] Looks like the VisualEditor isn't available for this one, so it's a source edit only.
[16:16:19] nice, it worked. Thank you!
[16:16:45] Great! Thank *you* for your contribution :-)
[16:17:12] indeed, thanks :)
[16:20:28] (PS3) Aqu: WIP: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862)
[16:55:05] !log restarting the aqs service on all aqs* servers in batches to pick up new MW_history snapshot.
[16:55:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:16:04] (PS1) Joal: Make refine SchemaLoader main function thread safe [analytics/refinery/source] - https://gerrit.wikimedia.org/r/955816 (https://phabricator.wikimedia.org/T344616)
[17:16:36] thanks mforns for the log (even pasttime :)
[17:17:12] no problemo, thanks for the reminder!
[17:24:55] also mforns: It seems you or Clare have rerun the failed refined jobs but not sent related emails :)
[17:28:47] Data-Platform-SRE, Discovery-Search, Infrastructure-Foundations: Unable to provision Ganeti VMs in CODFW - https://phabricator.wikimedia.org/T345754 (bking) Open→Resolved a:bking CODFW VM builds are complete; closing...
[17:42:16] Data-Platform-SRE, Discovery-Search (Current work): Unmanaged envoyproxy installation on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T341042 (bking) @JMeybohm these hosts have been reimaged, are you still seeing their envoy proxy as unmanaged? Envoy configuration is included [[ https://gith...
[17:55:52] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (bking) bking@cumin1001:~$ sudo cumin A:wdqs-all 'cat /etc/debian_version' ===== NODE GROUP ===== (1) wdqs1003.eqiad.wmnet ----- OUTPUT of 'cat /etc/debian_version' ----- 10.13 ===== NODE GROUP ===...
[17:56:08] joal: oh! ok
[18:08:41] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Discovery-Search (Current work), and 2 others: Rollout Elasticsearch extra plugins package and restart cluster to apply - https://phabricator.wikimedia.org/T344366 (bking) The reindex mentioned in the above comment was...
[18:36:55] (CR) Gmodena: [C: +1] "LGTM. The returned object consumed by a parallel collection, so it make sense to synchronize the block." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/955816 (https://phabricator.wikimedia.org/T344616) (owner: Joal)
[18:59:57] Data-Platform-SRE, Patch-For-Review: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (Gehel) The generated unit file contains jar / war path that have globs. We expect those to be expanded by sh -c '...', but this does not seem to be the case. ` [Unit] Description=Quer...
[19:13:39] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye
[19:40:08] Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (bking) @brouberol "depooling" is more accurately defined as "remove host from load balancer pool". confctl has [[ https://wikitech.wikimedia.org/wiki/Con...
[20:00:36] Data-Platform-SRE: Refactor sre.elasticsearch.rolling-operation to use spicerack improvements - https://phabricator.wikimedia.org/T345880 (bking)
[20:00:57] Data-Platform-SRE: Refactor sre.elasticsearch.rolling-operation to use spicerack improvements - https://phabricator.wikimedia.org/T345880 (bking)
[20:23:33] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye completed: - wdqs1016 (**WARN**) - Downtimed on Icinga/Alertman...