[09:07:12] serviceops: kafka2001 seems out of sync with the rest of the cluster - https://phabricator.wikimedia.org/T370574#10001587 (elukey) Some recovery happened, but I still see this on kafka-main2001 (after the restart): ` [2024-07-22 08:06:44,113] ERROR [ReplicaFetcher replicaId=2001, leaderId=2005, fetcherId=3]...
[09:11:07] serviceops: kafka2001 seems out of sync with the rest of the cluster - https://phabricator.wikimedia.org/T370574#10001600 (elukey) ` -rw-r--r-- 1 kafka kafka 1073735478 Jul 18 03:28 00000000025151747941.log -rw-r--r-- 1 kafka kafka 1489260 Jul 18 03:28 00000000025151747941.timeindex -rw-r--r-- 1 kafka kaf...
[09:47:44] serviceops, Discovery-Search (Current work): Latency issues in search elastic clusters 2024-07-22 since 05:00 - https://phabricator.wikimedia.org/T370621 (dcausse) NEW
[09:47:48] claime: ^
[09:50:45] dcausse: thanks! do you know of a url query pattern we could try and match?
[09:50:55] sure, lemme find this
[09:53:21] claime: should have a query string with list=geosearch (or perhaps generator=geosearch)
[09:53:54] serviceops, Discovery-Search (Current work): Latency issues in search elastic clusters 2024-07-22 since 05:00 - https://phabricator.wikimedia.org/T370621#10001768 (Clement_Goubert) This lines up with backpressure on `mw-api-ext` in `eqiad` starting at 0600 {F56593700} as well as full poolcounter queues {...
[09:54:38] dcausse: https://w.wiki/Aj3E
[09:55:03] claime: seems like it, thanks for digging into this
[09:55:34] dcausse: I've found the offending UA
[09:55:40] nice!
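For readers following along, the kind of query-string matching described above could be sketched roughly like this; the actual investigation used the dashboard linked at 09:54, and the log path and tab-separated layout (user agent in the last field) below are assumptions, not the real WMF log format:

  # Hypothetical sketch only: path and field layout are placeholders.
  # Tally the user agents behind requests that hit the geosearch API.
  zgrep -E 'list=geosearch|generator=geosearch' /path/to/sampled-access.log.gz \
    | awk -F'\t' '{print $NF}' \
    | sort | uniq -c | sort -rn | head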
[10:54:59] serviceops, CirrusSearch, GeoData, Discovery-Search (Current work), Patch-For-Review: Latency issues in search elastic clusters 2024-07-22 since 05:00 - https://phabricator.wikimedia.org/T370621#10001905 (dcausse)
[12:28:15] serviceops: kafka2001 seems out of sync with the rest of the cluster - https://phabricator.wikimedia.org/T370574#10002106 (elukey) My proposal: * stop kafka on kafka-main2001 * mv /srv/kafka/data/eqiad.resource-purge-3 /srv/kafka/backup/ * cleanup zookeeper (not needed afaics) * start kafka on 2001 In the...
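For context, the proposal above amounts to something like the following sketch. It is not the exact procedure that was run: the systemd unit name, the sudo usage, and the stock kafka-topics.sh invocation are assumptions (older brokers need --zookeeper instead of --bootstrap-server, and WMF hosts have their own kafka wrapper):

  # Sketch of the proposed cleanup on kafka-main2001; unit name is an assumption.
  sudo systemctl stop kafka
  # Move the corrupted topic-partition directory aside rather than deleting it.
  sudo mv /srv/kafka/data/eqiad.resource-purge-3 /srv/kafka/backup/
  sudo systemctl start kafka
  # On restart the broker should re-fetch the partition from the current leader.
  # Once it has caught up, confirm nothing is left under-replicated:
  kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions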
[12:40:35] Reporting it just because I didn't find a task already for it, but I see that rdb1014 is reported down since 22h
[12:50:48] serviceops, Data Products, Data-Platform-SRE, Dumps-Generation, and 2 others: Migrate current-generation dumps to run from our containerized images - https://phabricator.wikimedia.org/T352650#10002138 (Milimetric) Sorry to ask this very basic question, but I found a bunch of others didn't know: h...
[12:55:41] serviceops, Data Products, Data-Platform-SRE, Dumps-Generation, and 2 others: Migrate current-generation dumps to run from our containerized images - https://phabricator.wikimedia.org/T352650#10002166 (Ottomata) IIUC, the PHP 8 issue will be the same with containerized MW. I also don't know exac...
[13:05:56] volans: huh, thanks. I guess we don't have these hosts alerting when down in am yet
[13:06:03] I'll have a look
[13:06:29] that was in icinga indeed
[13:06:40] thx
[13:24:07] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T370633 (Clement_Goubert) NEW p:Triage→High
[13:25:37] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T370633#10002243 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9bfde8c2-0f71-4de8-908c-ff3a74fdbe71) set by cgoubert@cumin1002 for 7 da...
[13:29:00] serviceops, Wikimedia-production-error: Misbehaving mw-api-ext pods serving 5xx - https://phabricator.wikimedia.org/T370425#10002255 (jijiki) Interesting! >>! In T370425#9996695, @Scott_French wrote: > In both cases, workers start failing with SIGILL at the start of badness, e.g. (from `mw-api-ext.eqia...
[13:36:27] serviceops, SRE, Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10002302 (Clement_Goubert)
[13:41:11] serviceops, SRE, Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10002337 (Clement_Goubert)
[13:52:55] serviceops, CirrusSearch, GeoData, Discovery-Search (Current work), Patch-For-Review: Latency issues in search elastic clusters 2024-07-22 since 05:00 - https://phabricator.wikimedia.org/T370621#10002411 (Gehel) p:Triage→High
[14:06:45] serviceops, CirrusSearch, GeoData, Discovery-Search (Current work), Patch-For-Review: Latency issues in search elastic clusters 2024-07-22 since 05:00 - https://phabricator.wikimedia.org/T370621#10002471 (bking) This reminded me of T365814, which discusses testing the performance governor on...
[14:12:01] serviceops, Wikimedia-production-error: Misbehaving mw-api-ext pods serving 5xx - https://phabricator.wikimedia.org/T370425#10002480 (Joe) The `SIGILL` thing happened on bare metal as well, albeit quite rarely. We never properly tracked down what happened, but it seemed to have some relation to accessing...
[14:55:24] serviceops, MW-on-K8s, SRE: Update wikitech documentation - https://phabricator.wikimedia.org/T370646 (Clement_Goubert) NEW
[15:00:59] serviceops, Data Products, Data-Platform-SRE, Dumps-Generation, and 2 others: Migrate current-generation dumps to run from our containerized images - https://phabricator.wikimedia.org/T352650#10002764 (Joe) >>! In T352650#10002138, @Milimetric wrote: > Sorry to ask this very basic question, but I...
[15:02:18] serviceops, MW-on-K8s, SRE: Update Parsoid wikitech documentation following mw-on-k8s migration - https://phabricator.wikimedia.org/T370646#10002774 (Aklapper)
[15:10:52] serviceops, CirrusSearch, GeoData, Discovery-Search (Current work), Patch-For-Review: Latency issues in search elastic clusters 2024-07-22 since 05:00 - https://phabricator.wikimedia.org/T370621#10002860 (Gehel) Increase in traffic has been identified as coming from a single bot, which has be...
[15:35:33] serviceops, CirrusSearch, GeoData, Discovery-Search (Current work), Patch-For-Review: Latency issues in search elastic clusters 2024-07-22 since 05:00 - https://phabricator.wikimedia.org/T370621#10003022 (Gehel)
[15:42:21] hello folks
[15:42:27] about the kafka-main2001 issue (https://phabricator.wikimedia.org/T370574)
[15:42:35] serviceops: kafka-main2001 seems out of sync with the rest of the cluster - https://phabricator.wikimedia.org/T370574#10003118 (elukey)
[15:42:42] I see two options
[15:43:03] 1) wait until thursday so the segments are compacted/cleaned up
[15:43:50] for the time being we'd keep codfw in a sort-of inconsistent state, to be recovered once the corrupted segments/metadata are gone
[15:44:04] 2) manually clean up (stop kafka, cleanup, start and see how it goes)
[15:44:26] quicker but more potential side effects
[15:44:36] serviceops: kafka-main2001 seems out of sync with the rest of the cluster - https://phabricator.wikimedia.org/T370574#10003148 (Ottomata) > cleanup zookeeper (not needed afaics) If you like, you could verify your proposal on the kafka-test cluster with before you do kafka-main
[15:46:08] This is a good suggestion ottomata --^
[15:46:12] lemme break kafka-main
[15:46:15] err kafka-test
[15:54:30] it worked :)
[15:55:16] serviceops: kafka-main2001 seems out of sync with the rest of the cluster - https://phabricator.wikimedia.org/T370574#10003226 (elukey) >>! In T370574#10003148, @Ottomata wrote: >> cleanup zookeeper (not needed afaics) > > If you like, you could verify your proposal on the kafka-test cluster with before you...
[15:57:26] all right I think we can do it
[15:57:36] elukey: iirc if we lose resource-purge it's no big deal
[15:57:52] what would the potential side-effects of the manual cleanup be?
[15:58:33] claime: in theory none from what I tested on kafka-test1006, in practice kafka may complain about the missing data etc..
[15:58:46] but it seems it was more when zookeeper was heavily used
[15:58:54] right now our version seems to handle the failure nicely
[15:59:04] namely it restarts fetching the whole data
[15:59:50] ^ is what I would expect. we have done full broker migrations in basically the same way: turn off old broker, spin up broker on new host with same broker_id. kafka then auto-syncs everything from leaders
[16:00:04] This is essentially that but for one toppar
[16:01:59] {{done}}
[16:02:17] elukey: <3
[16:03:31] so far no horrors
[16:03:37] keeping an eye on metrics
[16:04:53] <3
[16:06:50] nemo-yiannis: just saw your ping re: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1054512 (and its revert).
[16:07:09] should we retry that one?
[16:08:36] akosiaris: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1054895 we went through with it the next day
[16:09:00] ah, cool, thanks
[16:11:20] kafka-main codfw back to normal
[16:11:24] thanks ottomata!
[16:13:40] serviceops: kafka-main2001 seems out of sync with the rest of the cluster - https://phabricator.wikimedia.org/T370574#10003382 (elukey) Open→Resolved a:elukey All recovered!
[16:34:18] Hey folks! I'm working on improving Citoid's behaviour when we're unable to access the URI that the user provides us (T370432) – one thing I'd really like to do is to run modified instances and web browsers originating from the same IPs as our Citoid/Zotero production instances. I was thinking a relatively straightforward way to do so would be to use `ssh -D` to run a SOCKS5 proxy over that way – is that a reasonable approach, and if so, who should I talk to next?
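For context, this is roughly the setup being described (hostnames and port below are placeholders, not real infrastructure): a dynamic SSH tunnel acting as a local SOCKS5 proxy, with a client pointed at it so outbound requests appear to come from the remote host's IP. As the replies below explain, this does not apply here, because the k8s-hosted Citoid/Zotero pods have no sshd to connect to.

  # Hypothetical sketch; bastion.example.org and port 1080 are placeholders.
  # Open a SOCKS5 proxy on localhost:1080 that tunnels through the remote host.
  ssh -f -N -D 1080 bastion.example.org
  # Send a test request through the tunnel; --socks5-hostname resolves DNS remotely.
  curl --socks5-hostname localhost:1080 -sI 'https://example.com/some/article'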
[16:35:38] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes, Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T370672 (Clement_Goubert) NEW
[16:36:02] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes, Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T370672#10003561 (Clement_Goubert) p:Triage→Low
[16:42:50] zip: where do you envision running ssh -D on and what would your target machine be? With Citoid/Zotero running on k8s, there is no SSH server to connect to.
[16:44:42] It did not occur to me that that would be the case
[16:45:08] well, I suppose if you're running k8s you probably don't want someone with shell access to the host machine
[16:46:18] I don't know what IPs it's running off at all yet. I'm still relatively new here and so I'm kind of bouncing around wikitech and not really finding where I'm meant to be looking
[16:46:47] yes, access to the host machines for k8s is reserved to SREs, following principle of least privilege. And even when one does have access, they need elevated privileges (root actually) to be able to enter the network namespace of the pod running citoid/zotero and thus have the same IP address
[16:47:49] ah. is it possible for pods to have network namespaces separate from the host machines they're on? this is some deep magic I've never needed to touch
[16:48:40] the ip addresses aren't a secret or anything btw, they are documented in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/network/data/data.yaml#125
[16:49:36] ooh, neat
[16:50:06] zip: it's the default for docker/k8s and generally containers to have their own networking namespace. it can be overridden but this is reserved for very specific workloads
[16:50:35] I guess at my last place we let GKE handle that and I never had to think about it
[16:50:37] those specific workloads are "building" the platform btw, not "using" it.
[16:51:22] although I suppose actually as far as we had IPs at all we had to connect them to load balancers and thence to service definitions
[16:51:42] which doesn't really touch at all on what IP would be seen by an external service if you connect outwards
[16:51:55] yes
[16:53:10] I am still not clear on what exactly it is you want to do tbh.
[16:53:46] I understand debugging Citoid/Zotero, but what do you want to proxy for?
[16:54:08] basically firewall debugging
[16:54:28] which is to say that we suspect that there's certain sites blocking us by IP
[16:55:13] so I wanted to do a comparison between running queries from my local network and running from a WMF range
[16:56:35] in any case it sounds like live tinkering is probably out.
[16:57:11] As for Zotero I will go and see if it's outrageously painful to patch it to actually log properly (apparently we don't because it just logs junk like lots of exception backtraces in plaintext)
[16:57:25] oh, that would be nice, thanks
[16:57:41] yeahhh seeing stuff that's not observable makes me itchy
[16:58:47] fwiw, my understanding is that the field has changed a lot in the last few years. It's no longer just IPs that matter, but IP+User-Agent at the very least, with many sites adding the ability to run javascript (e.g. cloudflare CDN, captcha, recaptcha etc)
[16:59:01] as for follow-up questions. I suppose temporarily running a pod that's just an opensshd instance or a socks proxy is out of the question?
[17:00:16] you suppose correctly, but to rule out the IP angle, you can use url-downloader https://wikitech.wikimedia.org/wiki/Url-downloader and use it as a standard forward HTTP proxy. That's the same thing that Zotero/Citoid use in production
[17:01:04] * bd808 was about to ask if url-downloader or webproxy would be the gateway there
[17:01:08] I wouldn't be surprised if you find that you need to mimic Citoid/Zotero in a lot of other angles too
[17:01:22] like HTTP headers etc.
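A minimal sketch of the comparison being suggested, under stated assumptions: the url-downloader host/port used here are an assumption (check the wikitech page linked above for the actual address), and the User-Agent and target URL are placeholders rather than Citoid's real values.

  # Compare a direct request with one routed through the forward HTTP proxy
  # that Citoid/Zotero use in production. Host, port, UA and URL are placeholders.
  curl -sI -A 'citoid-debug-test (placeholder UA)' 'https://example.com/article'
  curl -sI -A 'citoid-debug-test (placeholder UA)' \
    -x 'http://url-downloader.wikimedia.org:8080' 'https://example.com/article'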
[17:02:07] that sounds perfect
[17:02:18] if I can keep my feedback loop nice and tight and experiment as I go, the first part of this will go a lot quicker
[17:05:30] this sounds like more than enough to get me unstuck, thank you very much for the help!
[17:30:49] akosiaris: want me to fix this and arm it? <+jinxer-wm> FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100
[17:31:04] mutante: nah, I'll finish it this week
[17:31:08] ok
[17:31:09] it expired I see
[17:31:26] * akosiaris resubmitting
[17:31:37] at one point we changed those passphrases to a single one
[17:31:45] then _maybe_ new keys were added
[17:32:05] but most of the 19 keys should all be loaded with the same phrase and it's in pwstore
[20:05:31] serviceops, Citoid, VisualEditor, VisualEditor-MediaWiki-References, and 2 others: Register Citoid as a "friendly bot" with Cloudflare - https://phabricator.wikimedia.org/T370118#10004682 (VPuffetMichel)
[20:06:24] serviceops, Citoid, VisualEditor, VisualEditor-MediaWiki-References, and 2 others: Register Citoid as a "friendly bot" with Cloudflare - https://phabricator.wikimedia.org/T370118#10004687 (VPuffetMichel) Hi there, Is there someone in #serviceops #sre who can help us with this?
[20:20:39] serviceops, Citoid, VisualEditor, VisualEditor-MediaWiki-References, and 2 others: Register Citoid as a "friendly bot" with Cloudflare - https://phabricator.wikimedia.org/T370118#10004751 (RLazarus) Someone in #serviceops probably knows the answer to this but I don't, at least not confidently. He...
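Circling back to the 17:30 KeyholderUnarmed exchange: clearing that alert is normally a short manual step on the deploy host, sketched below. The exact invocation may differ in this environment; the passphrase comes from pwstore, as noted above.

  # Sketch, assuming the standard keyholder CLI on deploy1003.
  sudo keyholder status   # shows which agent keys are armed and which are not
  sudo keyholder arm      # prompts for the passphrase(s) and arms the keys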