[10:41:58] <dcausse>	 errand+lunch
[11:53:51] <gehel>	 lunch
[14:04:57] <dcausse>	 clearing airflow tasks from the cli but it's tedious since dag runs do not yet exist... I barely remember airflow 1 happily setting the state to success even if the dag run did not yet exist
[14:06:53] <dcausse>	 this dag uses depends_on_past so new runs are only created once previous one is done
[14:13:58] <inflatador>	 <o/
[14:15:02] <dcausse>	 o/
[15:07:02] <inflatador>	 pfischer I partitioned the topics as requested yesterday T354595
[15:07:02] <stashbot>	 T354595: SUP: Production TODOs - https://phabricator.wikimedia.org/T354595
[15:07:18] <inflatador>	 will start on the helmfile stuff shortly
[15:37:03] <Trey314159>	 Anyone mind cancelling our retro today? Mark & Tajh & Selena are going to talk about annual planning during their office hours, and I want to listen.
[15:43:55] <dcausse>	 fine by me
[15:58:02] <inflatador>	 OK
[16:00:09] <ebernhardson>	 \o
[16:00:21] <pfischer>	 o/
[16:01:18] <ebernhardson>	 dcausse: yes there is some annoyance with airflow and the number of active runs, i often set the newest runs to fail and then clear them so it starts the earlier ones i reset
[16:01:54] <gehel>	 Trey314159: no objection to canceling retro, let's see who shows up
[16:03:35] <inflatador>	 gehel I think most of us are in the Mark/Tahj mtg
[16:04:28] <gehel>	 retrospective is officially canceled, either join the strategy meeting or get that time back!
[16:59:32] <inflatador>	 workout, back in ~40
[17:18:37] <pfischer>	 dcausse: do you think we could give the non-blocking JSON processor a try (https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/99)?
[17:19:19] <dcausse>	 pfischer: sure, but probably not before tomorrow
[17:20:56] <pfischer>	 Alright, since the change in gitlab settings, the MR is blocked while there are unresolved threads. So want to make sure there’s no open question left.
[17:21:57] <dcausse>	 pfischer: is it urgent?
[17:22:05] <pfischer>	 No
[17:22:17] <pfischer>	 No worries, it can wait.
[17:23:00] <dcausse>	 ok, np, but feel free to force push it tho, I can take a look after the fact if you prefer
[17:23:42] <pfischer>	 No it’s fine. Take your time. It’s just me being impatient. 🙊
[17:23:50] <dcausse>	 ok :)
[17:43:43] <dcausse>	 I think I got all the dags cleared but very possible I missed something... will check alerts tomorrow
[17:44:07] <ebernhardson>	 dcausse: thanks! It's always a bit tedious going through the airflow failures
[17:44:34] <dcausse>	 yes, I wish there was a more automated way but no :/
[18:17:47] <inflatador>	 sorry, been back
[18:30:01] <inflatador>	 re: Airflow failures, would implementing T350499 help? If there's anything else we can do to reduce toil on that LMK
[18:30:02] <stashbot>	 T350499: Search Platform Airflow jobs: Identify dependencies and configure alerts - https://phabricator.wikimedia.org/T350499
[18:31:31] <ebernhardson>	 inflatador: no, the tedious bit is that something goes wrong with event ingestion, then a few hours of multiple different tasks fail, then downstream tasks of those tasks fail, and further downstream.  All of those have to be looked at and reset or bypassed in some way, depending on what happened.
[18:32:41] <ebernhardson>	 the initial ingestion tasks often need looking at the logs to see what went wrong (both dc events missing, or just inactive dc). Depending on which way it failed determines what needs to be done. Often involving `show partitions ...` from a hive console and poking through whats available to see if it's still missing
[18:33:25] <inflatador>	 ebernhardson gotcha, so this like "it doesn't happen in a predictable way, so automating it is hard"?
[18:33:40] <inflatador>	 or maybe it doesn't happen often enough to bother?
[18:33:52] <ebernhardson>	 and then, for extra funsies, an airflow dag might be limited to two concurrent dag runs, but there are already two dag runs waiting around. So clearing the old tasks doesn't actually get them started. In addition to clearing them the newest dag runs need to first be canceled, to open up space to run the old ones, then they have to be cleared so that they can run as expected
[18:34:54] * inflatador starts to get a picture
[18:35:18] <ebernhardson>	 probably some automation could be put together, but indeed it hasn't been frequent enough and it's not always the exact same fix
[18:35:54] <inflatador>	 got it, I can close that ticket if it's not worth it ATM
[18:37:47] <ebernhardson>	 inflatador: it could still be useful, but it's hard to say :P That reads as a slightly different area, where for example we expect to recieve data from other teams and process it.  If their thing fails then our thing that ingests it also fails.
[18:38:09] <ebernhardson>	 At least on our side we get SLA miss emails and failure emails, i'm guessing the other side gets similar so there is some alerting going on
[18:39:57] <inflatador>	 ACK, is it just us and Data Platform/Data Eng that uses Airflow, or are there other teams?
[18:40:39] <inflatador>	 (just thinking about this in terms of the alerts review)
[18:44:10] <ebernhardson>	 inflatador: i would have to review closer, but primarily we ingest kafka events that were copied to hdfs. This is a generic process that probably has decent monitoring. The other bit we ingest is weighted tags from analytics platform eng for image suggestions
[18:44:48] <dcausse>	 dinner
[18:46:01] <inflatador>	 np, looks like the larger data eng team is the primary consumer of Airflow based on wikitech https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow
[18:47:02] <inflatador>	 https://wikitech.wikimedia.org/wiki/Search/WeightedTags has some good info too
[18:47:58] <inflatador>	 lunch, back in time for pairing
[19:08:37] <ryankemper>	 Might be 5m late to pairing
[19:23:28] <inflatador>	 back
[20:02:21] <inflatador>	 appointment, back in ~90
[20:51:32] <ebernhardson>	 hmm, since adding the rest of the wikis the cirrus-streaming-updater grafana dashboard now shows durations for all the wikis it processes, not sure what we should do with that. Perhaps limit to top-N by request duration?
[21:06:20] <ebernhardson>	 i dunno, maybe we leave them. It's probably not hurting much, but it means the graphs are just a rainbow of lines :)
[21:42:04] <inflatador>	 LOL
[21:42:06] <inflatador>	 back
[22:28:30] <inflatador>	 ryankemper or anyone else, quick patch to migrate an elastic host to puppet 7: https://gerrit.wikimedia.org/r/c/operations/puppet/+/991674
[22:29:14] <inflatador>	 we have a few Puppet 7 elastic hosts already, but they're freshly reimaged hosts, wanted to see if we could migrate an existing host (via the puppet migration cookbook)
[22:29:24] <ryankemper>	 inflatador: +1'd
[22:29:43] <inflatador>	 ryankemper excellent, thanks
[22:31:31] <ryankemper>	 meanwhile, /me is trying to figure out what's wrong with the microsite (ex: https://query-full-experimental.wikidata.org/) getting a 502 bad gateway from varnish. I thought it might be the URLs being wrong (fixed in https://gerrit.wikimedia.org/r/c/wikidata/query/gui-deploy/+/991671/1/sites/full-experimental/custom-config.json) but still getting the 502
[22:32:49] <inflatador>	 interesting...let me finish puppet-merging and we can work on it if you want
[23:28:33] <ryankemper>	 Okay, looks like the issue is that miscweb's `/etc/ssl/localcerts/webserver-misc-apps.discovery.wmnet.crt` (`modules/secret/secrets/certificates/certificate.manifests.d/webserver_misc_apps.certs.yaml` in the puppet private repo) needs the new URLs in the alt_name. It also still has query-preview.wikidata.org so we should remove that as well. I'll run through the cergen steps to get a new cert up