[08:00:00] dcausse: could you join the SDAW triage meeting? There might be questions about snippets or other search-related things.
[08:00:20] You should have received an invite (6:15pm later today)
[08:19:02] gehel: sure
[08:20:06] Thanks
[11:02:47] lunch
[11:04:21] gehel: for when/if you have time https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/887745
[11:27:45] lunch
[13:27:42] being bold and self-merging https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater/-/merge_requests/1
[14:03:19] o/
[14:48:54] inflatador: Hey, got a moment to talk switchover?
[14:49:08] claime for you? Always! :)
[14:49:12] :D
[14:49:38] So what would be the services to exclude?
[14:50:50] claime looking at pybal ( https://config-master.wikimedia.org/pybal/codfw/ ) it would be all services with wcqs or wdqs in the name, LMK if you need more info on this
[14:51:50] I'll probably make a phab task about it for tracking, but I'm trying to get a sense of what would/could happen with mw and all other services being served only from codfw, and these services staying put
[14:52:53] I don't think any of these services are part of the mw stack, but maybe gehel or someone else could confirm?
[14:53:51] the main concern is cascading failures... we had this happen a few months back when we depooled a DC and forgot to repool, let me see if I can find it
[14:55:03] There should be no impact in letting WDQS run on eqiad during the switchover. That's what we did during the last switchover (after seeing that they were overloaded on codfw alone). We have more servers coming up so that we have enough capacity to handle the load on a single DC, but those servers are not there yet.
[14:56:44] WDQS is getting updates from Wikidata on both DCs, so having it stay up isn't an issue. The WDQS internal cluster is read by MW (and that cluster should have enough capacity for single-DC operations). The wdqs public clusters are read directly (no interaction with MW), so there should be no issues there.
[14:57:17] I'm not sure I'm super clear in the above reply. Please ask for clarification as needed, or let's jump on a meet if needed.
[14:57:32] I can't find the incident report, but gehel pretty much covered it
[14:57:36] (thanks!)
[14:58:30] I think a meet is not a bad idea, I'll create a room real quick
[15:08:37] inflatador: seems like you're busy, should we cancel for today?
[15:08:58] dcausse oops! Yeah, let's cancel
[15:09:01] sorry about that ;(
[15:09:03] np!
[15:57:02] hmm, looks like puppet isn't happy with the new an-airflow server. we may have to change its profile/class temporarily so it can at least set up SSH
[15:57:18] doh, i missed something :(
[15:57:42] probably not your fault. Errors are related to kerberos/keytab
[17:07:27] workout, back in ~40
[17:45:42] back
[17:46:22] ebernhardson looks like puppet finished on airflow1005 if you wanna give it a spin
[17:47:03] \o/ will check it out after mtgs
[18:59:41] lunch, back in ~40
[19:01:40] hmm, bumpversion / gitlab ci works backwards from what i expected. I asked for a patch bump against 2.1.0, so it released 2.2.0 and marked the next version as 2.2.1
[19:06:50] ryankemper/inflatador: I've forwarded an SRE Onboarding Q&A session. You should have been onboarded by now, but in case you have more questions (or maybe answers?) feel free to join. Or skip.
[19:07:06] ack
[19:54:33] sorry, been back
[20:03:20] meh, our consumer lag graphs for mjolnir are still not right :( Before my current thing started, it was reporting consumer lag varying up to ~80k across partitions. As soon as i start producing to the topic, all the lags drop to 0
[20:06:43] inflatador: one thing i'm not sure of with airflow: antoine reported in the ticket that they're in the process of deploying airflow 2.5 and switching from mariadb to postgresql. It's not clear what their timeline is and if we should be looking to create our database in postgresql instead of mariadb
[20:06:51] re https://phabricator.wikimedia.org/T326193
[20:07:21] i'm guessing no and proceed with mariadb, they will have a migration process anyways?
[20:12:21] ebernhardson ACK, I saw that comment and wasn't sure myself. I guess we can just use the same MariaDB instance we currently use for airflow1001? Apologies, as I should've looked at that step beforehand
[20:12:42] inflatador: the same instance, but i think it needs a new database
[20:15:05] ebernhardson OK, will take a look. It seems odd that we'd have a separate database for 1005 if we're going to replace 1001, but I don't know anything about airflow
[20:15:24] the thing is both will be running at the same time and i don't think they should be working from the same database (grouping of tables)
[20:15:40] so same database server, but a differently named database inside the server
[20:16:25] essentially the two instances shouldn't share state, and would probably break if they tried to
[20:17:23] OK, I can do that
[20:18:50] LMK if you have a preferred name for the new DB. current one is called `search_airflow`
[20:19:05] i'm terrible at naming things... search_airflow_2? :P
[20:19:54] i wonder how the db naming is configured in puppet, there might be an existing convention to follow
[20:24:14] for the platform eng team theirs is called `airflow_platform_eng_v2`, so if we followed that convention we would be `airflow_search_v2`
[20:25:08] I'm reaching out to otto-mata about general workflow. At my old job it was a CR and review any time a human did updates/creates/etc on a DB
[20:25:24] probably overkill for creating a DB that no one else uses, but I don't want to be that guy
[20:25:32] :)
[20:26:38] Don't worry, we found all sorts of ways to break the DB using APIs and indirection ;)
[20:46:40] OK, so Ben and Andrew have advised us to wait and talk to Antoine and Steve to get a handle on the Postgres situation. I can comment on the above Phab ticket
[22:01:15] kk
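
On the switchover discussion (14:49-14:58): a minimal sketch, in Python, of pulling the codfw pybal service list and filtering for the wcqs/wdqs services to exclude. The URL is the one linked in the log; the assumption that service names appear as link targets in a plain directory listing is mine, so the regex may need adjusting against the real page.

    # Minimal sketch: list codfw pybal services whose names contain
    # "wcqs" or "wdqs", i.e. the services to leave pooled in eqiad during
    # the switchover. Assumes config-master serves a plain directory
    # listing with service names as link targets (unverified).
    import re
    import requests

    PYBAL_URL = "https://config-master.wikimedia.org/pybal/codfw/"

    def query_service_names() -> list[str]:
        html = requests.get(PYBAL_URL, timeout=10).text
        names = set(re.findall(r'href="([^"?/]+)/?"', html))
        return sorted(n for n in names if "wcqs" in n or "wdqs" in n)

    if __name__ == "__main__":
        for name in query_service_names():
            print(name)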
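
On the mjolnir consumer-lag graphs (20:03): a minimal sketch of computing per-partition lag directly from Kafka as a cross-check against the dashboards, taking lag as end offset minus the group's committed offset. The broker, topic, and group names are placeholders, not the real mjolnir ones.

    # Minimal sketch: per-partition consumer lag = end offset minus the
    # group's committed offset. Broker/topic/group names are placeholders.
    from kafka import KafkaConsumer, TopicPartition

    BROKERS = "kafka.example:9092"      # placeholder
    TOPIC = "example-mjolnir-topic"     # placeholder
    GROUP = "example-consumer-group"    # placeholder

    consumer = KafkaConsumer(
        bootstrap_servers=BROKERS, group_id=GROUP, enable_auto_commit=False
    )
    partitions = [
        TopicPartition(TOPIC, p)
        for p in sorted(consumer.partitions_for_topic(TOPIC))
    ]
    end_offsets = consumer.end_offsets(partitions)
    for tp in partitions:
        committed = consumer.committed(tp) or 0  # None if never committed
        print(f"partition={tp.partition} lag={end_offsets[tp] - committed}")
    consumer.close()

If this direct computation and the dashboard disagree (e.g. lags collapsing to 0 the moment the producer starts), that would point at the graphing pipeline rather than the consumers, though the log doesn't settle which it was.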
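
And on the second Airflow database (20:12-20:24): a minimal sketch of creating it on the existing MariaDB instance under the name settled on above, airflow_search_v2, so that airflow1001 and airflow1005 never share state. Host, credentials, and the grantee user are placeholders, and per the end of the log the actual change was deferred pending the Postgres discussion and review.

    # Minimal sketch: a separate database on the same MariaDB server so
    # the two Airflow instances don't share tables. Connection details
    # are placeholders; assumes the grantee user already exists.
    import pymysql

    conn = pymysql.connect(host="db.example", user="admin", password="...")
    try:
        with conn.cursor() as cur:
            cur.execute("CREATE DATABASE IF NOT EXISTS airflow_search_v2")
            cur.execute(
                "GRANT ALL PRIVILEGES ON airflow_search_v2.* "
                "TO 'airflow'@'%'"  # placeholder user
            )
        conn.commit()
    finally:
        conn.close()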