[07:26:37] ryankemper: I'm scheduling our next ITC for June 21. Please fill in your answers before that. (I'll probably be out on June 20, recovering from the offsite)
[09:03:41] Another run to the hospital with Oscar + lunch. Back this afternoon
[10:14:20] lunch
[13:09:04] greeetings
[13:37:01] ah darn, forgot to ask ebernhardson at the wed meeting about the clickthrough queries we talked about on slack: https://wikimedia.slack.com/archives/G01TF2HGG8J/p1654616892436129
[13:47:45] dropping off kids, back in ~15
[14:01:18] back
[14:10:50] heads-up, will start reimaging CODFW to bullseye shortly
[14:15:27] rebooting for updates, back in a few
[14:32:15] \o
[14:34:34] o/
[15:01:09] well, that took longer than expected
[15:01:09] OK, I'm starting the codfw bullseye reimage now
[15:23:01] BTW, it seems the reimage operations aren't actually running in parallel
[15:23:37] it's depooling everything at the same time, but only reimaging one host at a time
[15:25:45] hmm
[15:26:49] inflatador: makes sense, the code says: `for node in nodeset: reimage`
[15:27:01] so it walks through the nodes one at a time, running the reimage cookbook
[15:27:13] we could parallelize it in a few ways, but might want to check with volans
[15:29:28] * volans busy, can't read backlog now
[15:30:31] ACK, I think we'll let it go this way for now. I'm guessing there could be dragons involved with maintaining the state of multiple reimages at once
[15:46:42] oh that's evil. Restarted my machine expecting chrome to reopen all my tabs; instead it gives a prompt that it needs an update, then loses all my tabs :S
[15:49:14] (╯°□°)╯︵ ┻━┻
[15:50:10] ebernhardson: even in history -> recently closed?
[15:50:21] it happened to me a long time ago, but never since
[15:50:49] volans: oh cool, i didn't know you could restore whole windows from there
[15:51:03] much happier now :)
[15:51:53] :)
[16:01:22] whoo hoo, first reimage to bullseye in prod!
[16:01:50] \o/
[16:12:39] elastic2036 (HP) passed, elastic2053 (Dell) failed
[16:15:16] elastic2054.codfw.wmnet (Dell) failed its reimage too
[16:16:15] probably the same outdated firmware issues we saw in Cloudelastic
[16:17:46] (would be nice if those updates were part of the reimage process, but I have no idea if that's feasible)
[16:19:55] :(
[16:20:27] on the positive side, the new extension I installed does seem to be handling my calendar
[16:22:51] https://phabricator.wikimedia.org/T309343#7971329 for reference
[16:31:07] * ebernhardson wonders what WikimediaApiPortalOAuth does
[16:32:09] hoping it resolves the issue with letting api clients get oauth tokens for wcqs, but i'm going to guess it's intentionally limited in a way that doesn't work...
[16:36:18] inflatador: hmm yeah, will we need to wait for dc-ops to upgrade the firmware
[16:36:21] or is there some way we can do it on our end
[16:40:04] not sure myself; places I've worked previously rolled that stuff in with the reimaging. I'm going to ask in #wikimedia-dcops
[16:51:20] reading the dcops chat, looks like someone typo'd wdqs https://phabricator.wikimedia.org/T307138 and it's causing some issues (just netbox, nothing impacting end users). Still, that's an easy typo to make
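[Editor's note: a minimal sketch of the parallelization idea from the 15:26-15:30 exchange above, where the cookbook currently walks `for node in nodeset` one host at a time. This is not the actual spicerack/cookbook code; `reimage_host()`, `reimage_all()`, and the `max_parallel` bound are hypothetical stand-ins.]

```python
# Hypothetical illustration only -- not the real reimage cookbook.
from concurrent.futures import ThreadPoolExecutor, as_completed


def reimage_host(host: str) -> str:
    """Stand-in for invoking the single-host reimage cookbook on one node."""
    # In reality this would call (or shell out to) the existing reimage cookbook.
    return f"{host}: reimage finished"


def reimage_all(nodeset: list[str], max_parallel: int = 3) -> None:
    """Run reimages with bounded parallelism instead of strictly one at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(reimage_host, host): host for host in nodeset}
        for future in as_completed(futures):
            host = futures[future]
            try:
                print(future.result())
            except Exception as exc:
                # Surface per-host failures without aborting the remaining hosts.
                print(f"{host}: reimage failed: {exc}")


if __name__ == "__main__":
    reimage_all(["elastic2036.codfw.wmnet", "elastic2053.codfw.wmnet", "elastic2054.codfw.wmnet"])
```

[A small pool keeps most of the sequential behaviour (and avoids the state-tracking "dragons" mentioned at 15:30) while still overlapping the long per-host waits.]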
[16:53:57] hey folks, just checking in about https://phabricator.wikimedia.org/T310216
[16:54:32] it caused a lot of log spam yesterday but as i understand it now it should only apply to commons and not generate _more_ log spam when all wikis are promoted. is that accurate?
[16:54:40] (commonswiki being in group1)
[16:54:52] dduvall: same as yesterday, one out of 30 shards is missing in one of the clusters, so one out of 30 writes fails. And then a background process that asserts the index is correct keeps noticing those 1 in 30 are missing and tries to fix them (causing more write fails)
[16:55:09] we can turn it off if it's bothering people, but if we leave it on the other 29 shards keep getting writes until we fix the underlying issue
[16:55:28] i see. thank you
[16:55:42] i think the log spam is tolerable if it's only going to be at the rate i saw yesterday
[16:56:04] what i was mainly concerned about was a drastic increase in that log spam upon all-wiki promotion
[16:56:07] dduvall: it won't be strictly limited to commonswiki, file uploads anywhere can trigger a commonswiki write. It will always come from job runners though
[16:56:20] but if that shouldn't happen, then i'm ok to ignore the log spam until you all fix the underlying issue
[16:56:47] it's in progress, i know they were working on testing the snapshot/restore functionality yesterday
[16:56:56] ah i see. ok. i'm slightly concerned about an increase then but i can feel it out as we progress
[16:56:56] (will snapshot from a different cluster and restore to the bad one)
[16:57:19] the log messages came through on jsonTruncated as well, because they are too large
[16:57:52] yea i saw that, maybe we should truncate them first? The general idea is we report the full error from elastic, but the full error from elastic includes the update script, which includes the full page content multiple times
[16:57:54] so if there's going to be a huge influx of messages i guess it's possible it will be a lot for logstash to handle
[16:58:05] right
[16:58:12] it would be fantastic if you could truncate them
[16:58:20] i'll check, shouldn't be too hard
[16:58:26] right on. thank you
[17:00:12] dduvall: any idea what the cutoff is, 4k, 8k?
[17:02:00] looks like 32k
[17:02:09] just from a `wc -c`
[17:02:29] ok, maybe like 20k for this one message and make sure there's room. Probably we don't even need that much, but depends on why it failed
[17:04:37] it's a lot of data coming through in the aggregate
[17:05:09] just during the window between deployment and rollback there were ~27k log messages that came through
[17:05:31] 27k * 32k is ~900M
[17:06:35] oh wait, sorry. that's a 24 hour period
[17:07:36] hmm, can probably be 4k. I suppose i'm suspicious that sometime we will have a bad transformation to create the script and it would be useful to see what that was, but i can't remember that ever happening
[17:10:22] works for me. you can always increase it again later if you think it'll be helpful, but i very much appreciate the mitigation of logspam
[17:10:28] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/804396 probably does what's necessary
[17:10:47] the most important aspect for us is that the messages are parseable and come through with the right channel and metadata, so we can set filters
[17:31:02] bah, i don't write enough php to remember our spacing guidelines :P sec
[17:33:56] i had that experience many times over during my PET expedition :)
[17:41:48] OK, I updated the NIC firmware on elastic2053.codfw.wmnet and started the reimage, we'll see how it goes
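[Editor's note: a language-agnostic sketch of the truncation mitigation discussed above. The actual change is the CirrusSearch (PHP) patch linked at 17:10:28; the `truncate_error()` helper, `log_write_failure()`, and the field names here are illustrative assumptions, not the real code.]

```python
# Illustrative only: the real mitigation is in the CirrusSearch PHP patch linked
# above (change 804396). This shows the shape of the idea: cap the raw
# Elasticsearch error (which can embed the update script and full page content
# several times) before it reaches the logging pipeline.
import logging

MAX_ERROR_LENGTH = 4096  # the ~4k cutoff settled on above; messages near 32k were hitting jsonTruncated


def truncate_error(error: str, limit: int = MAX_ERROR_LENGTH) -> str:
    """Trim an oversized error string, noting how much was dropped."""
    if len(error) <= limit:
        return error
    return error[:limit] + f" ... [truncated {len(error) - limit} chars]"


def log_write_failure(logger: logging.Logger, page_id: int, raw_error: str) -> None:
    """Emit a structured, filterable log entry with a bounded error payload."""
    logger.warning(
        "Elasticsearch write failed",
        extra={"page_id": page_id, "elastic_error": truncate_error(raw_error)},
    )
```

[Truncating at the producer side keeps the channel and metadata intact, which is what the 17:10:47 message says matters most for setting log filters.]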
[17:42:12] wow, cindy the browser bot is still a thing? memories
[17:42:28] CI still doesn't want to run all our dependencies :P
[17:42:39] haha, fair enough
[17:43:04] i just need language wikis, commonswiki, wikibase, sister wikis, elasticsearch, a real job queue, and probably a few more things :P
[17:43:06] the promise of GitLab will Soon™️** be realized
[17:43:21] oh that's it?
[17:43:29] :P
[17:43:38] yea
[17:43:40] someday :P
[17:44:08] it's terribly fragile though... as i find time i've been trying to move it from vagrant to mwcli
[17:44:51] i'd really like to get a k8s cluster made available to gitlab ci
[17:45:13] cindy is also re-running now, all passing so far. It just doesn't reset properly sometimes depending on what the last patch was
[17:45:14] not a promise, just a dream
[17:45:37] ah i see. yeah it commented on patchset 1
[17:56:15] inflatador: heh yeah I've seen `wqds` as a typo so many times, very easy to do
[18:03:10] * ebernhardson chuckles at the 'Aer Lingus: Cheap Flights' headline after looking at what the nonstop SFO<->DUB roundtrip costs
[18:03:54] might be the most expensive overseas flight i've taken by 30-40%
[18:36:26] inflatador: pairing https://meet.google.com/eki-rafx-cx
[18:56:41] (ah saw you went to lunch. g.ehel and I are getting the updated wmf-elasticsearch-search-plugin in place on relforge/cloudelastic/codfw)
[19:05:48] * ebernhardson isn't really sure why cloudelastic keeps complaining, its memory situation does look bad but the only thing that changed is the missing shard
[19:06:19] i imagine it'll stop complaining when we restart for the s3 plugin
[19:37:27] ebernhardson: do you know which group is needed to access search logs in analytics? Context: T310021 (Bruno is the PhD student working with Ricardo on Search metrics)
[19:37:28] T310021: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021
[19:37:45] ebernhardson: is analytics-privatedata-users enough?
[19:38:52] gehel: yea that's everything
[19:39:29] * ebernhardson somehow didn't get the ping for that meeting, oh well you don't normally actually need me :P
[20:46:23] huh, /me is not a fan of the javascript tests that reject trailing commas. Everything should have trailing commas :P
[21:00:15] ebernhardson: but then I wouldn't get to do gross stuff with templates like this ;P https://github.com/inflatador/kafka-kraft/blob/main/templates/server.properties.j2#L32
[21:00:51] lol
[21:03:02] think of all the wasted effort that would be wasted!
[21:20:06] but wasting effort is so much fun, i previously ported a new json parser into hhvm (our old php executor) because the license on the builtin json parser said the parser could only be used for good, not evil, and we couldn't guarantee that :P
[21:20:39] (but funnily facebook was fine with it, lol)
[21:20:51] LOL, I guess we need an 'evil bit' for PHP