[08:04:19] fyi there has been some sabotage overnight on long-distance optical fibers in France, and our drmrs<->eqiad link has been down since about that time
[08:05:01] XioNoX: ah, no good
[08:05:19] should we do something to mitigate potential issues?
[08:05:34] backup link took over as expected, so nothing to do
[08:06:01] just be vigilant if we see signs of issues over there
[08:12:25] tnx
[08:17:14] The latency from home to bast6003 is ~175ms so my packets are taking the scenic route
[08:17:49] and seeing ~5% packet loss
[08:20:07] so it might be useful to edit the geo-maps to redirect FR users to ams instead, dunno what people think
[08:21:19] basically most paths between the north and south of france are cut
[08:22:34] so we should send the north of france to esams and the south to drmrs, ideally?
[08:22:46] I know we can't right now :)
[08:22:51] yeah exactly
[08:23:04] there are more people above the cuts than below :)
[08:23:35] and we can't depool, otherwise we will send north africa through the cuts
[08:23:37] Naive q: can we route with that geographic granularity?
[08:23:47] what about some simple
[08:23:55] https://www.irccloud.com/pastebin/9epUPrRq/
[08:24:01] too naive?
[08:24:11] fabfur: yeah that's what I had in mind
[08:24:33] we can see an increase of NEL reports for france
[08:24:40] but under the alerting threshold
[08:26:52] A question about alerting: if I add role membership etc. for a host (e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1057205) without the hw being in place and installed, will that lead to alerts? If so, is there a way to put in a long-term (as in days, maybe two weeks) silence for such a host?
[08:27:41] brouberol: depends on the maxmind geoIP DB. For the US we can split by states for example, same with Brazil or Russia, but dunno if they have this granularity for France
[08:27:56] klausman: site.pp is safe: if it doesn't match any host, nothing happens; the rest depends on how the data is used and by whom
[08:28:00] XioNoX: understood, thanks
[08:28:27] volans: ack, thanks
[08:29:58] fabfur: can you send a CR?
[08:30:00] as for downtime, icinga doesn't allow downtiming something it doesn't know exists. Alertmanager silences are regexes, so they're unrelated to the host and could be put in place. But again, depending on how the data is used it might alert on something else unrelated to the host (like pybal, because the host is in etcd but not reachable, etc...)
[08:30:21] XioNoX: sure, I was preparing it, is there some phab ticket already opened for this?
[08:30:32] fabfur: nop
[08:31:32] ack, I'll create one and maybe you could add some details if you want
[08:32:00] should it be a private one?
[08:33:18] nothing secret
[08:33:32] fabfur: I can take care of it, one sec
[08:33:49] https://phabricator.wikimedia.org/T371216
[08:34:00] really quick and dirty, add/edit as you wish
[08:34:08] I'll prepare the very simple CR
[08:35:49] https://gerrit.wikimedia.org/r/c/operations/dns/+/1057812
[08:37:06] fabfur: I updated the task description
[08:37:12] tnx
[08:38:42] fabfur: +1
[08:39:15] anyone have any objection to merging this and running authdns-update now?
[08:41:47] gone in 3 ... 2 ... 1
[08:42:06] merged
[08:42:24] authdns-update run
[08:44:01] done
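(Editor's note for context: the pastebin and the merged operations/dns change aren't reproduced in the log, but the change being discussed is a country-level override in the geodns map. Below is a minimal sketch of what such a mapping looks like in gdnsd's geoip plugin config, paraphrased from the gdnsd wiki page linked a few messages later; the datacenter names are Wikimedia's edge sites, but the map name, database path and surrounding file layout are illustrative assumptions, not the contents of gerrit 1057812.)

```
# Sketch only -- not the actual operations/dns change.
maps => {
    generic-map => {
        geoip2_db => /usr/share/GeoIP/GeoIP2-City.mmdb,   # path is an assumption
        datacenters => [eqiad, codfw, esams, ulsfo, eqsin, drmrs],
        map => {
            EU => {
                # value is an ordered preference list of datacenters; send FR
                # to esams instead of drmrs while the north<->south paths are cut
                FR => [esams, eqiad, codfw],
            },
            default => [eqiad, codfw],
        },
    },
}
```

(Per the linked gdnsd docs, the map can nest one level deeper, continent -> country -> subdivision, which is what the "109 subdivisions" option mentioned below would use.)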
[08:57:26] fwiw, there are 109 subdivisions of France in the maxmind db (this might include larger/smaller divisions) and they could be used to map the traffic, if deemed useful/necessary
[08:57:48] the related gdnsd docs (assuming they're up to date ;) ) are https://github.com/gdnsd/gdnsd/wiki/GdnsdPluginGeoip#geoip2-location-data-hierarchy
[09:14:12] fabfur: thx, NEL shows a clear improvement
[09:14:26] thx to you for noticing this!
[09:16:02] volans: thx, I don't think it's worth the hassle, but it's good to know it's a possibility
[09:22:42] ack
[12:08:33] so re recommendation-api: "fault filter abort" is actually from envoy
[12:08:47] not clear if the service gives a more useful 503
[12:08:51] hnowlan: you might know, do we think it's having user impact? I suspect not, correct?
[12:09:03] I think it is but it's a handful of RPS
[12:09:16] honestly I have no idea how much this service is used
[12:09:27] aiui the liftwing apis for this are the Proper way
[12:10:23] service logs are empty save for a startup finished message
[12:10:30] a service that has an error rate of 20% on a good day is not something to worry *too* much about when it gets worse though
[12:10:37] so we should silence the alerts
[12:10:53] +1
[12:11:10] outside of the codebase there's just so little to know about the service when it's so terse with outputs and there are no docs
[12:11:46] yeah
[12:11:48] silencing it
[12:12:09] ty!
[12:12:40] there are two app-level errors
[12:12:42] TypeError: Cannot read properties of undefined (reading 'pages')
[12:13:20] but that is not really in keeping with the frequency of errors served
[12:13:45] silenced for 1d
[12:14:07] if we are going to be roping in SWEs, I assume an incident doc would be handy?
[12:15:00] yeah couldn't hurt
[12:15:12] looped research in
[12:15:14] ok, on it
[12:16:22] hnowlan: thanks! on slack I assume? mind adding me to the thread?
[12:17:48] yep sure
[12:18:02] ty <3
[12:38:59] kinda at a loss. If internal API or m2-master (which it directly connects to!!) were down we'd see much scarier things
[12:44:46] the old recommendation api (yes there is a newer one on liftwing, the codebase is being revamped but it is currently ~7y old as well) should be used by the Android apps via Restbase
[12:45:01] Seems only the android app uses recommendation-api ... damnit, too slow :D
[12:45:25] https://en.wikipedia.org/api/rest_v1/#/Recommendation
[12:45:37] so not ideal but not too noticeable
[12:45:55] but I am pretty sure those are all abandoned or not really used features, as you were saying the failure rate is already high
[12:46:27] is the service down now?
[12:48:26] flapping but mostly yeah
[12:48:44] lovely
[12:48:53] we're at about 70% errors :[ https://grafana.wikimedia.org/goto/SHf_gm9IR?orgId=1
[12:48:54] also the service has been abandoned/unowned for ages
[12:49:12] I think all signs point to setting page: false
[12:49:22] I agree yes
[12:49:33] we can add a comment in service.yaml explaining why
[12:50:41] oh, page isn't set to true. These are probedown errors
[12:53:24] 25-30% standard failure rate is absurd.
[12:53:37] it's all hidden behind RESTBase caching at best I think
[12:54:08] also, it's weird cause I see just the android app using it, but ... apparently it's used against commons and wikidata?
[12:54:37] which... is a bit weird, but maybe I am missing something
[12:57:05] if it's abandoned and unowned, is there a task to rip it out of the android app?
[12:57:25] no
[12:57:33] quite the contrary in fact
[12:57:37] there's a task to get it owned
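(Editor's note: the "silencing it" / "silenced for 1d" step above would normally be placed either in the karma UI or with amtool against alertmanager. A hedged sketch follows, assuming the firing alert is the probedown one discussed later and that it carries a service-like label; the alertmanager URL and label names here are assumptions, so copy the real ones from the firing alert first.)

```bash
# Sketch only: alert/label names are assumptions, check the firing alert's labels.
amtool silence add \
    --alertmanager.url=http://alertmanager.example:9093 \
    --duration=1d \
    --author="$USER" \
    --comment='recommendation-api known-flaky, ownership being sorted out' \
    alertname=ProbeDown job=~'.*recommendation.*'
```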
[12:58:15] brouberol: XioNoX: MaxMind actually does provide subdivision data in many countries, but their SLA for its accuracy is much much lower outside the USA, so we don't use it
[12:58:57] it's... recovering?
[12:59:33] some day when we maybe have per-ipblock mapping in probenet, we could potentially turn up the sample rate here, and push a new and correct map for FR quickly
[13:01:40] can someone tell me why, with a 30% failure rate at https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=restbase&var-origin_instance=All&var-destination=recommendation-api&from=now-7d&to=now&viewPanel=16, the probe never fired before?
[13:03:49] the probe is for robots.txt, maybe there's a threshold of wreckedness required for *that* to fail rather than the API requests
[13:06:36] cdanis: ack thanks
[13:07:16] curling robots.txt works atm for example, but the pods themselves are still serving 503s every now and then *shrug*
[13:08:49] ~30%?
[13:09:03] it's been like that for 6+ months from what I see?
[13:09:37] I couldn't get further back than 6 months either
[13:14:03] yeah makes no sense
[13:15:16] I am this close to excluding it from the probedown alert tbh
[13:15:28] I'd cosign that
[13:15:31] it's been at 30% for 6 months and no one has cared
[13:18:30] I am clearly not a neutral party in this rn, but +1 :D
[13:19:02] kamila_: is help needed with the NEL pages btw? I'm just catching up
[13:19:34] cdanis: I have no idea what to do with them, so if you're ahead of me, then yes :D
[13:19:52] they're just RU and flappy and I don't know of anything on our side that would be causing them, was about to ping traffic
[13:19:54] the answer in this case is probably also "silence it" but I'll look a bit
[13:20:00] mhm
[13:20:57] Yeah, I was looking too and I think silencing makes sense. It's not from any one ISP if I'm reading it correctly, all going to text-lb.esams.
[13:21:37] eoghan: yeah, and not just that, but the distribution of ISPs in the spikes at a glance matches the distribution in the background nouse
[13:21:45] s/nouse/noise/
[13:23:09] FWIW https://www.thousandeyes.com/outages/ is also showing some issues in the AMS area
[13:24:02] Can we silence by country? I know we can silence the NELByCountryHigh, but dunno if we want to silence the main one for too long.
[13:38:41] eoghan: I think it should be safe to do that yes
[13:39:29] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1057874 for recommendation-api probes
[13:42:45] akosiaris: +1'd w
[13:42:50] *with inline comment
[13:42:52] thanks
[13:44:16] cdanis: can we silence by country?
[13:44:38] sukhe: I don't know that we've ever done it, but that was certainly the intent behind the per-country metric+alert
[13:44:52] I am failing at karma again (the alertmanager one)
[13:44:54] as eoghan said, silencing the main alert is also required
[13:44:57] yeah I'm taking a look
[13:45:42] I don't think we've ever fine-tuned the per-country thresholds, so they might need to be double checked if we silence the main one and rely on the per-country ones
[13:47:10] there is an approximately 0% chance that it's something dumb like "ru.w.o uses recommendation-api", right? :D
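(Editor's note: on the "can we silence by country?" question above, per-country silencing means matching on whatever country label the per-country NEL alert carries, plus a separate silence on the main aggregate alert as eoghan notes. A hedged sketch, again with amtool; the label name `country`, the alertmanager URL and the exact name of the main alert are assumptions to be verified in karma.)

```bash
# Sketch only: verify the real label and alert names in karma before using.
amtool silence add --alertmanager.url=http://alertmanager.example:9093 \
    --duration=4h --author="$USER" \
    --comment='RU NEL spikes, nothing actionable on our side' \
    alertname=NELByCountryHigh country=RU
# the main aggregate NEL alert (whatever its exact name is) would need its own,
# likely shorter, silence placed the same way.
```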
[13:47:56] kamila_: yes :)
[13:48:14] good, just checking :D
[13:49:03] I think I've silenced the main alert, but I can't be sure because the alert had already fallen out of karma's short-term memory
[13:49:20] I do think you can get alertmanager labels for a page from the annotations in the victorops web UI, maybe
[13:49:27] I'll look at that after/during my meeting
[13:49:32] thanks <3
[13:49:33] thanks cdanis <3
[13:50:36] is anyone familiar with one of these BGP looking glass thingies? I'd be curious to see if we can see BGP flapping or something
[13:51:07] I recommend https://ioda.inetintel.cc.gatech.edu/country/RU
[13:51:13] (not just BGP)
[13:53:22] thanks sukhe, appreciated!
[13:54:36] fabfur: puppet-merging yours as well? Added mszabo to ldap_only_users (5ae9024b44)
[13:54:55] ah sorry, completely forgot to finalize the merge
[13:55:05] merge if you can, otherwise I'll do it
[13:55:10] {{done}}
[13:55:16] tnx!
[14:31:09] I was looking at that patch actually and had a PEBKAC: I thought the "run puppet compiler" button was a link to some results and not a button, and was shocked when it gave a run notification
[14:32:06] mszabo: yeah, it runs PCC https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler
[14:57:22] _joe_: sukhe https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052791 Rewrite /beacon/event -> EventLogging rest handler is ready for review. On its own it won't do anything until we remove the varnish handling of /beacon/event
[14:58:05] ottomata: I won't pretend to review this bit but happy to do the varnish part and rollout :)
[14:58:47] <_joe_> ottomata: oh only for mediawiki.org?
[14:59:52] _joe_: yes.
[15:00:53] we are ONLY doing this to support MediaWikiPingback events. These are sent by mediawiki core code hardcoded to mediawiki.org/beacon/event. And we need to support old 3rd-party installed MW versions for like... 5 years
[16:26:27] _joe_: thanks for +1. Should I feel confident merging that myself? I'm happy if you are pretty sure it isn't going to break things
[16:26:42] <_joe_> ottomata: so the way it should work is
[16:26:57] <_joe_> you check for when the mw infra deployment windows are
[16:27:22] <_joe_> you merge the change, run puppet on the deployment host, then you do a scap deployment
[16:30:42] The scap deployment in question: https://wikitech.wikimedia.org/wiki/MediaWiki_On_Kubernetes#The_scap_way without --k8s-only so it also deploys to jobrunners, but it doesn't rebuild the image for no reason
[16:31:19] Hmm, actually you can do it with --k8s-only, it'll be deployed to jobrunners by puppet
[16:39:34] <_joe_> it's irrelevant to jobrunners, too
[16:39:45] yep
[17:20:40] Hm, okay! _joe_ can I just schedule this then and be around for a mw infra window and someone else will do it? i've rarely done full MW deploys: i usually just do individual files.
[17:21:09] <_joe_> ottomata: sync-file does a full sync now :)
[17:31:11] on-callers: we just upgraded cp4052 in prod to ATS 9.2.5 up from ATS 9.2.1. no issues expected as such but please note in case we get paged (or otherwise)
[17:31:54] actually, from 9.1.4 so even worse in a way :)
[17:32:10] (9.2.1 was the older build we had on a canary host but never rolled it out)
[17:32:29] _joe_: okay so what I'm hearing is I should schedule it in a mw infra window and do it myself, ya?
[17:33:56] <_joe_> ottomata: yes, you don't need to modify the calendar, just give a heads up in #serviceops I guess
[17:34:08] okay, i'll add it for tomorrow anyway just in case
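(Editor's note: the rollout _joe_ outlines above, written out as commands. This is a sketch under assumptions: the deployment-host name, the puppet wrapper and the exact scap invocation are illustrative; the canonical procedure is the wikitech page linked at 16:30:42.)

```bash
# Sketch only, not the documented procedure:
ssh deploy1002.eqiad.wmnet   # MW deployment host -- hostname is an assumption
sudo run-puppet-agent        # pick up the merged puppet change on the deploy host first
scap sync-world 'Route mediawiki.org/beacon/event to the EventLogging rest handler'
# per the discussion above, adding --k8s-only is also fine here: jobrunners pick
# the change up via puppet anyway, and the image isn't rebuilt needlessly either way.
```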
[17:47:58] greetings!
[17:47:58] we're about 8 weeks away from the September 2024 DC switchover [0] (eqiad to codfw): services and traffic will be depooled in eqiad on the 24th, MediaWiki will switch on the 25th, and eqiad will be repooled for (active/active) services and traffic on 2 October. all actions will target 15:00 UTC.
[17:47:58] if you have tasks related to supporting the switchover, please file them under [1]. thanks!
[17:47:58] [0] https://wikitech.wikimedia.org/wiki/Switch_Datacenter
[17:47:58] [1] https://phabricator.wikimedia.org/T370962
[17:48:20] it's that time of the year again!