[10:04:30] lunch
[13:12:38] greetings!
[13:15:04] o/
[13:41:49] looks like we had another BG issue this weekend?
[13:45:35] inflatador: I saw a couple alerts from icinga, did we page this time?
[13:48:01] Yeah, looks like the general SRE team got paged, but I didn't. Can you forward me the icinga alert emails if you got some? I don't see any in my inbox
[13:49:39] SRE chatter is from ~1430 UTC yesterday, so guessing alerts will be slightly earlier than that
[13:49:47] inflatador: I only get those from my irc bouncer, not via e-mail
[13:50:18] dcausse ah OK, is it just in #wikimedia-operations then or somewhere else?
[13:50:33] yes only this channel generally
[13:50:47] now I see it, somehow I missed the ping from Luca :/
[13:51:35] oh and something I forgot is that we are certainly still routing traffic to codfw only
[13:51:41] we might want to re-enable eqiad
[13:52:25] oops, that's not good, will take a look at that ASAP
[13:52:57] unless someone wants to do the last steps of T302494 soon (which requires depooling the DC during the operation)
[13:52:58] T302494: The WDQS Streaming Updater should use S3 to access thanos-swift instead of the native swift protocol - https://phabricator.wikimedia.org/T302494
[13:55:23] * inflatador checks
[14:00:09] dcausse you want to finish off these steps now? I don't know how to do any of the steps under "Migrate flink HA storage from swift to s3" but happy to help if you think it's doable
[14:01:49] Migrate flink HA storage from swift to s3? ...curious!
[14:01:59] OHH sorry
[14:02:05] reading one more above...s3 proto. got it!
[14:02:27] ottomata: yes moving away from the swift client which is not great :/
[14:02:39] yeah, just using the S3 plugin, apparently the flink swift plugin is abandonware?
[14:02:49] yes
[14:02:51] makes sense
[14:03:15] inflatador: this task is nice to learn how to operate flink on the k8s cluster
[14:03:28] but does not have to be done right now
[14:05:30] if you guys are busy with elastic 6.8 this can certainly wait
[14:05:59] dcausse completely up to you. If you think it's doable without another incident, I'm at https://meet.google.com/skb-ihdv-bje . I'd just as soon get us back to full capacity for wcqs
[14:52:54] I repooled wdqs services in eqiad, that should hopefully help things out a bit
[14:53:04] inflatador: thanks!
[15:42:54] ebernhardson: if you have a minute: https://gerrit.wikimedia.org/r/c/wikimedia/discovery/analytics/+/773810 (should be fairly trivial)
[15:43:12] looking
[15:43:47] quick break, back in ~15-20
[15:43:50] dcausse: seems fine, but what changed that caused this to be needed?
[15:43:57] * ebernhardson also has no clue what effect it has :P
[15:44:13] ebernhardson: it was wrong from the beginning :)
[15:44:24] dcausse: ahh, well ok then :)
[15:44:38] will scap this now
[16:12:27] back
[16:15:29] dinner
[17:30:01] lunch/errands, back in ~1h
[17:53:43] hmm, I wasn't previously intending to use the es68 branch, but since they dramatically improved the BC handling by aliasing _doc, I'm thinking I'll move patches over to that side and merge back to master when prod is ready
[18:45:48] back
[18:46:08] however, I forgot I have a telehealth appt in ~15, so I won't be back for long
[19:59:43] back
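A quick aside on the T302494 / "Migrate flink HA storage from swift to s3" discussion above: since the point of the task is to reach thanos-swift over its S3-compatible API instead of the native swift protocol, one low-risk pre-flight check is to round-trip a test object over that API. The sketch below assumes boto3 and uses made-up placeholders for the endpoint, bucket and credentials; the actual Flink-side change is swapping the filesystem plugin and its configuration, not anything in this snippet.

```python
#!/usr/bin/env python3
"""Sanity-check S3-protocol access to a Swift cluster's S3-compatible API.

Endpoint, bucket and credentials are illustrative placeholders, not the
production thanos-swift values.
"""
import boto3
from botocore.client import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://thanos-swift.example.org",  # placeholder endpoint
    aws_access_key_id="EXAMPLE_ACCESS_KEY",            # placeholder credentials
    aws_secret_access_key="EXAMPLE_SECRET_KEY",
    # Swift's S3 middleware generally expects path-style addressing
    # (https://host/bucket/key) rather than virtual-hosted bucket names.
    config=Config(s3={"addressing_style": "path"}),
)

BUCKET = "rdf-streaming-updater"  # placeholder bucket name

# Write a tiny object, list it back, then clean up. If this round-trip works,
# the S3-protocol path to the object store is usable for Flink HA/checkpoints.
s3.put_object(Bucket=BUCKET, Key="s3-smoke-test/hello", Body=b"hello")
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix="s3-smoke-test/").get("Contents", []):
    print(obj["Key"], obj["Size"])
s3.delete_object(Bucket=BUCKET, Key="s3-smoke-test/hello")
```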
[21:48:13] Some thoughts after the last wdqs incident:
[21:49:29] - It'd be really nice to have an "agent" that could auto-restart blazegraph if a wdqs host fails to report metrics for >10 minutes. We frequently have hosts that drop offline for long periods of time, and it would definitely help with that; it might even have caught this incident before the page (a rough sketch of such an agent follows after this list)
[21:50:36] - Eqiad was depooled at the time of the incident. This is yet another reminder that WDQS can't be trusted to run very well in a single datacenter with our current capacity. We should look at scaling up a lot come this year's annual planning, probably doubling capacity
[21:54:16] - Deadlock tends to correlate with higher load. For example, here's where `wdqs2002` drops offline (this was a day before the incident): https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=1648256074143&to=1648259750421&viewPanel=12. Note that the point where it falls into deadlock coincides with net load on WDQS doubling
[21:56:35] (For reference, wdqs2002 dropped offline at `01:05`, which is right when the load on wdqs2002 starts shedding)
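A minimal sketch of what such a watchdog agent could look like, assuming it runs on each wdqs host, asks Prometheus whether the host has reported any metrics over the last 10 minutes, and restarts the Blazegraph unit if not. The Prometheus URL, instance label and systemd unit name are placeholders, and a real version would want logging/alerting on top of this:

```python
#!/usr/bin/env python3
"""Watchdog sketch: restart Blazegraph if this host stops reporting metrics.

The Prometheus endpoint, instance label and unit name are placeholders.
"""
import subprocess
import time

import requests

PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # placeholder URL
INSTANCE = "wdqs1004:9193"      # placeholder instance label for this host
UNIT = "wdqs-blazegraph"        # placeholder systemd unit name
CHECK_EVERY = 60                # seconds between checks
COOLDOWN = 30 * 60              # don't restart more than once per 30 minutes


def reported_recently() -> bool:
    """True if the host reported metrics at any point in the last 10 minutes."""
    query = f'max_over_time(up{{instance="{INSTANCE}"}}[10m])'
    resp = requests.get(PROMETHEUS, params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # An absent series counts the same as 10 minutes of failed scrapes.
    return bool(result) and float(result[0]["value"][1]) > 0


def main() -> None:
    last_restart = 0.0
    while True:
        try:
            healthy = reported_recently()
        except requests.RequestException:
            healthy = True  # can't reach Prometheus; don't restart on that alone
        if not healthy and time.time() - last_restart > COOLDOWN:
            subprocess.run(["systemctl", "restart", UNIT], check=False)
            last_restart = time.time()
        time.sleep(CHECK_EVERY)


if __name__ == "__main__":
    main()
```

Using `max_over_time(up[10m])` means a single missed scrape doesn't trigger anything; only a full 10-minute gap does, and the cooldown keeps the agent from flapping the service if a restart doesn't fix the underlying problem.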