[00:01:12] <wikibugs>	 Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (colewhite) Related: {T255864}
[00:16:16] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:43] <jinxer-wm>	 (SystemdUnitFailed) resolved: monitor_refine_eventlogging_analytics.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:12:43] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Service implementation for wdqs202[3-5].codfw.wmnet - https://phabricator.wikimedia.org/T345475 (RKemper)
[06:52:17] <moritzm>	 FYI, I'm rebooting one of the Kerberos servers in a few minutes, there should be no impact since all Kerberos clients are configured to transparently fall back to the second server
[07:03:11] <moritzm>	 the KDC is back up
[07:14:06] <joal>	 Thanks moritzm :)
[08:06:25] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene)
[08:31:54] <wikibugs>	 (CR) Btullis: Increase the max kafka message size for gobblin (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/954968 (https://phabricator.wikimedia.org/T307959) (owner: Btullis)
[08:56:41] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: eventutilities-python: cookicutter template example should be updated - https://phabricator.wikimedia.org/T345390 (gmodena)
[08:57:10] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: Document the onboarding journey on Event Platfrom - https://phabricator.wikimedia.org/T345193 (gmodena)
[08:58:11] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: [SPIKE] Should we enable compression on kafka jumbo? - https://phabricator.wikimedia.org/T345657 (gmodena)
[08:58:13] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: Enum with an entry of `null` should fail jsonschema-tools validation - https://phabricator.wikimedia.org/T344511 (gmodena)
[08:59:40] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: [SPIKE] Should we enable compression on kafka jumbo? - https://phabricator.wikimedia.org/T345657 (gmodena)
[09:00:02] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1133.eqiad.wmnet with OS bullseye
[09:24:25] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1134.eqiad.wmnet with OS bullseye
[09:36:44] <joal>	 mforns: If I'm not mistaken, you have not added a note to the airflow task you manually set to succeeded. I don't think you have logged the operations you did to restart the job either - can you confirm, and possibly update/log even if not at the exact date?
[09:38:16] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1133.eqiad.wmnet with OS bullseye completed: - an-worker1133 (**PASS**)   - Downtimed on Icinga/Alertm...
[10:03:59] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1134.eqiad.wmnet with OS bullseye completed: - an-worker1134 (**PASS**)   - Downtimed on Icinga/Alertm...
[10:23:01] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (Stevemunene) Seeing some HDFS corrupt blocks from 2023-09-07 10:03 UTC on [[ https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=39&from=1694080200938&to=1694082000938 | grafana ]]. Did a quick c...
[11:56:34] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol) As requested by @Gehel, I pushed a boilerplate patch to gerrit, to make sure my accesses were working correctly. I've marked the change request...
[12:03:48] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (BTullis) >>! In T344798#9149149, @brouberol wrote: > As requested by @Gehel, I pushed a boilerplate patch to gerrit, to make sure my accesses were working...
[12:37:36] <joal>	 Hey mforns - I just sent https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/488 - could you please review when you have a minute?
[12:42:45] <wikibugs>	 (PS2) Joal: Remove special KaiOS App checks from pageview def [analytics/refinery/source] - https://gerrit.wikimedia.org/r/953198 (https://phabricator.wikimedia.org/T344071)
[12:52:38] <wikibugs>	 (PS3) Joal: Remove special KaiOS App checks from pageview def [analytics/refinery/source] - https://gerrit.wikimedia.org/r/953198 (https://phabricator.wikimedia.org/T344071)
[12:52:42] <jinxer-wm>	 (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on an-worker1147:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:53:26] <wikibugs>	 (CR) Joal: [C: +2] "Merging for later deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/953198 (https://phabricator.wikimedia.org/T344071) (owner: Joal)
[12:54:07] <brouberol>	 btullis: just to let you know, I've made progress on the opensearch restart/reboot cookbook, leveraging the SREBatchBase/SREBatchRunnerBase base classes. I have what feels like a usable solution, but I'll need to get ssh access to the boxes to dig around and test things out, as I'm sure nothing would work at the moment
[12:54:16] <btullis>	 ^ looking at worker1147
[12:54:42] <joal>	 btullis: Shall I remove the gobblin patch from the deploy etherpad? it seems there still is some discussion
[12:54:43] <btullis>	 brouberol: Excellent! 
[12:55:00] <btullis>	 joal: Oh yes please, sorry I forgot that I added it.
[12:55:10] <joal>	 no prob btullis - doing
[12:55:53] <btullis>	 joal: Are you happy for me to abandon the gobblin config patch? I think it's you with whom I was hoping to have the discussion :-)
[12:56:15] <joal>	 hm, actually the whole patch shouldn't be abandoned, only one of the lines needs to be removed
[12:56:38] <joal>	 And, I'm still wondering about the global impact of using 10 MB batches instead of 1 MB batches when reading
[12:57:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: hadoop-yarn-nodemanager.service Failed on an-worker1147:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:57:43] <btullis>	 Right, so I wasn't sure about the value of increasing that value from 1 MB to 10 MB at all.
[13:01:15] <btullis>	 an-worker1147 ran out of heap space.
[13:01:18] <btullis>	 2023-09-07 12:49:56,419 WARN org.sparkproject.io.netty.channel.AbstractChannelHandlerContext: An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
[13:01:49] <btullis>	 puppet has automatically restarted the service.
[13:04:50] <wikibugs>	 (Merged) jenkins-bot: Remove special KaiOS App checks from pageview def [analytics/refinery/source] - https://gerrit.wikimedia.org/r/953198 (https://phabricator.wikimedia.org/T344071) (owner: Joal)
[13:13:27] <gehel>	 btullis: about the access for brouberol (T345633), milimetric asked if `analytics-admins` is required as well. I suspect that `ops` would cover that already.
[13:13:59] <gehel>	 but I suspect you know more than I do. Could you reply on task if we're missing something or not?
[13:14:00] <gehel>	 Thanks!
[13:14:32] <btullis>	 Correct, `ops` already covers that. I remember the same things being asked on my access ticket.
[13:15:18] <btullis>	 https://phabricator.wikimedia.org/T285754#7184200
[13:15:36] <btullis>	 I will reply on the task.
[13:24:55] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (Volans) @BTullis  For context the `sre.elasticsearch.rolling-operation` is currently not using the existing framework for batch actions and IMHO it should...
[13:32:32] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol) ⚠ This is currently being implemented, meaning it's missing aliases, selectors, service names etc, that I was hoping to glean after I get SSH a...
[13:40:30] <mforns>	 joal, yes... I forgot to log operations to irc, sorry. Re. Airflow task comments, we added a couple, but I think I forgot the mediawiki_history_reduced one... sorry again. Will do those
[13:40:37] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1135.eqiad.wmnet with OS bullseye
[13:43:01] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (BTullis) >>! In T344798#9149548, @Volans wrote: > @BTullis  For context the `sre.elasticsearch.rolling-operation` is currently not using the existing fram...
[13:43:05] <mforns>	 !log (actual timestamp: 2023-09-06, 19:10:29 UTC) cleared airflow task mediawiki_history_reduced.check_mediawiki_history_reduced_error_folder (and subsequent tasks) for snapshot=2023-08. This was due to false positive errors having been generated by the checker.
[13:43:07] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:56:09] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1136.eqiad.wmnet with OS bullseye
[14:06:55] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene) @BTullis With the upcoming elevation of the `analytics-wmde` user to a systemwide user across nodes (airflow, stat100x, hadoop worker nodes, etc..) and membership of `a...
[14:18:56] * brouberol on a small break
[14:19:07] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1135.eqiad.wmnet with OS bullseye completed: - an-worker1135 (**PASS**)   - Downtimed on Icinga/Alertm...
[14:25:43] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol) Also, qq for @BTullis: what does: "depooling" mean? I see on https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Server_actions that it "remov...
[14:29:42] <wikibugs>	 Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (Jclark-ctr)
[14:31:20] <wikibugs>	 Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (Jclark-ctr)
[14:37:50] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1136.eqiad.wmnet with OS bullseye completed: - an-worker1136 (**PASS**)   - Downtimed on Icinga/Alertm...
[16:07:13] <brouberol>	 qq: how can we edit the content generated by a mediawiki template? https://wikitech.wikimedia.org/wiki/SRE/Production_access#Setting_up_your_SSH_config lists out bast6002.wikimedia.org, but according to vgutierrez, it has been replaced by 6003. (Also, I'm asking here, for lack of a better channel that I can think of)
[16:10:48] <btullis>	 brouberol: I don't think that bit of content is generated by a template, is it? I could be wrong. I think it's just part of the normal page content, so you can edit it with either of these buttons.
[16:10:51] <btullis>	 https://usercontent.irccloud-cdn.com/file/gO7L02Qg/image.png
[16:11:34] <btullis>	 Oh, I see what you mean.
[16:11:36] <btullis>	 https://usercontent.irccloud-cdn.com/file/BsJn8nsw/image.png
[16:11:54] <brouberol>	 ^ exactly
[16:12:06] <btullis>	 Click on the blue link there: https://wikitech.wikimedia.org/wiki/Template:BastionMap
[16:12:55] <btullis>	 Looks like the VisualEditor isn't available for this one, so it's a source edit only.
[16:16:19] <brouberol>	 nice, it worked. Thank you!
[16:16:45] <btullis>	 Great! Thank *you* for your contribution :-)
[16:17:12] <vgutierrez>	 indeed, thanks :)
[16:20:28] <wikibugs>	 (PS3) Aqu: WIP: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862)
[16:55:05] <btullis>	 !log restarting the aqs service on all aqs* servers in batches to pick up new MW_history snapshot.
[16:55:06] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:16:04] <wikibugs>	 (PS1) Joal: Make refine SchemaLoader main function thread safe [analytics/refinery/source] - https://gerrit.wikimedia.org/r/955816 (https://phabricator.wikimedia.org/T344616)
[17:16:36] <joal>	 thanks mforns for the log (even if after the fact :)
[17:17:12] <mforns>	 no problemo, thanks for the reminder!
[17:24:55] <joal>	 also mforns: It seems you or Clare have rerun the failed refined jobs but not sent related emails :)
[17:28:47] <wikibugs>	 Data-Platform-SRE, Discovery-Search, Infrastructure-Foundations: Unable to provision Ganeti VMs in CODFW - https://phabricator.wikimedia.org/T345754 (bking) Open→Resolved a:bking CODFW VM builds are complete; closing...
[17:42:16] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work): Unmanaged envoyproxy installation on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T341042 (bking) @JMeybohm these hosts have been reimaged, are you still seeing their envoy proxy as unmanaged? Envoy configuration is included [[ https://gith...
[17:55:52] <wikibugs>	 Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (bking) bking@cumin1001:~$ sudo cumin A:wdqs-all 'cat /etc/debian_version' ===== NODE GROUP ===== (1) wdqs1003.eqiad.wmnet ----- OUTPUT of 'cat /etc/debian_version' ----- 10.13 ===== NODE GROUP ===...
[17:56:08] <mforns>	 joal: oh! ok
[18:08:41] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Discovery-Search (Current work), and 2 others: Rollout Elasticsearch extra plugins package and restart cluster to apply - https://phabricator.wikimedia.org/T344366 (bking) The reindex mentioned in the above comment was...
[18:36:55] <wikibugs>	 (CR) Gmodena: [C: +1] "LGTM. The returned object consumed by a parallel collection, so it make sense to synchronize the block." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/955816 (https://phabricator.wikimedia.org/T344616) (owner: Joal)
[18:59:57] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (Gehel) The generated unit file contains jar / war path that have globs. We expect those to be expanded by sh -c '...', but this does not seem to be the case.   ` [Unit] Description=Quer...
[19:13:39] <wikibugs>	 Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye
[19:40:08] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (bking) @brouberol "depooling" is more accurately defined as "remove host from load balancer pool".  confctl has [[ https://wikitech.wikimedia.org/wiki/Con...
[20:00:36] <wikibugs>	 Data-Platform-SRE: Refactor sre.elasticsearch.rolling-operation to use spicerack improvements - https://phabricator.wikimedia.org/T345880 (bking)
[20:00:57] <wikibugs>	 Data-Platform-SRE: Refactor sre.elasticsearch.rolling-operation to use spicerack improvements - https://phabricator.wikimedia.org/T345880 (bking)
[20:23:33] <wikibugs>	 Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye completed: - wdqs1016 (**WARN**)   - Downtimed on Icinga/Alertman...