[06:46:56] (EdgeTrafficDrop) firing: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:56:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:57:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:02:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[08:38:57] Domains, Traffic, DNS, SRE, and 2 others: Project Unseen campaign URL redirect - https://phabricator.wikimedia.org/T300398 (Ladsgroup)
[08:41:17] Traffic, SRE: Serve redirect wikimediastatus.net --> www.wikimediastatus.net - https://phabricator.wikimedia.org/T300161 (Ladsgroup) p:Triage→Medium Feel free to change priority.
[08:47:28] Domains, Traffic, DNS, SRE, and 2 others: Project Unseen campaign URL redirect - https://phabricator.wikimedia.org/T300398 (Ladsgroup) p:Triage→High Given the time-pressure.
[09:55:47] Traffic, SRE, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5011.eqsin.wmnet with OS buster
[10:02:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5011:9331 is unreachable - https://alerts.wikimedia.org
[10:03:08] ^^ expected (host being reimaged)
[10:11:04] hello! I'm looking for reviews on https://gerrit.wikimedia.org/r/c/operations/puppet/+/758063
[10:12:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp5011:9331 is unreachable - https://alerts.wikimedia.org
[10:13:58] Traffic, SRE: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (MMandere) Open→In progress
[10:44:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5011:9331 is unreachable - https://alerts.wikimedia.org
[10:49:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp5011:9331 is unreachable - https://alerts.wikimedia.org
[10:57:15] Traffic, SRE, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5011.eqsin.wmnet with OS buster completed: - cp5011 (**WARN*...
[10:59:19] Traffic, SRE, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (Vgutierrez)
[11:53:03] vgutierrez ema: for when you have time, can you review patches in https://phabricator.wikimedia.org/T300398?
[11:58:18] Traffic, ops-ulsfo: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (Vgutierrez)
[11:59:09] Amir1: looking
[11:59:17] thanks
[11:59:42] that reminded me of https://en.wikipedia.org/wiki/Unseen_University
[12:00:22] Amir1: https://wikimediafoundation.org/participate/unseen/ currently returns a 404, is that expected?
[12:00:43] yup
[12:00:46] ack
[12:02:43] +1ed both of them
[12:19:32] bblack, sukhe: you're probably more suited than me for reviewing https://gerrit.wikimedia.org/r/c/operations/puppet/+/758063 /cc taavi
[12:41:02] vgutierrez: oh yeah indeed, I will get to it today
[13:08:05] Domains, Traffic, DNS, SRE, and 4 others: Project Unseen campaign URL redirect - https://phabricator.wikimedia.org/T300398 (Ladsgroup) Open→Resolved a:Ladsgroup It'll take a bit but it will be there. Ping me if it doesn't work.
[13:11:01] thanks
[13:37:38] Traffic, SRE: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (Ladsgroup)
[13:38:15] Traffic, SRE, Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (Ladsgroup)
[13:39:33] Traffic, SRE, Patch-For-Review: Create Ganeti VMs for durum in drmrs - https://phabricator.wikimedia.org/T300158 (ssingh)
[14:11:09] I took a peek at it and added some feedback!
[14:19:49] and I replied to your comment!
[14:25:26] can we just switch the mode to 444 instead? I'm not sure why it's 440 now (maybe one of the uses of pdns-(rec|auth) has secrets in config?)
[14:25:46] letting the daemon own its own config is kind of a dark pattern for sec
[14:25:50] taavi: ^
[14:27:49] bblack: what about making root own it but changing the group (and not giving group write access)? it has a mysql password and a management api key which is why I'm not a huge fan of 0444
[14:27:58] Traffic, SRE: Serve redirect wikimediastatus.net --> www.wikimediastatus.net - https://phabricator.wikimedia.org/T300161 (CDanis) After discussing with @BBlack and @Vgutierrez it seems that this isn't a good use case for ncredir as ncredir only supports dns-01 challenges. So we need to find some other e...
[14:28:33] taavi: well, in your use-case it does, in our use-case it doesn't
[14:28:49] I'd make the argument that ideally, such things should not be in config files, but it's also possible there's no other viable solution in your use-case
[14:29:36] (we could also just parameterize some aspect of this for the variant cases, too)
[14:30:23] oh I see now, I was still thinking that part of the patch applied to recdns as well, which it clearly does not.
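A minimal sketch of the root-owned, group-readable arrangement bblack floats at 14:27, assuming a 'pdns' group exists for the service user (the group name and path here are assumptions, not from the patch):

    # Sketch: root owns the config; the daemon's group may read it but not write it.
    chown root:pdns /etc/powerdns/pdns.conf
    chmod 0440 /etc/powerdns/pdns.conf
    ls -l /etc/powerdns/pdns.conf    # expect: -r--r----- 1 root pdns ... pdns.conf

The daemon can still read the mysql password and API key at startup, but a compromised runtime user can no longer rewrite its own config file.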
[14:30:38] yeah, the auth server part is wmcs specific but the recdns part is shared
[14:30:45] the point about how things ideally-should-be still applies, but yeah at least it doesn't affect our use of pdns-rec for the other use-cases
[14:31:07] hmmm
[14:31:22] either way you solve this, there's still kind of an ugly sec problem at the bottom
[14:32:21] if the daemon has to read an api key and a mysql password as the runtime user it eventually runs as, then even with some 440+group-based solution, any remote unprivileged execution compromise of the daemon also leaks those passwords/keys to the attacker potentially.
[14:33:46] (this is a common problem in many setups, so I guess it's not the end of the world, just not ideal)
[14:33:50] (and something to be aware of)
[14:35:01] I wonder what changed in either the new upstream or bullseye packaging that it no longer reads it as root before privdrop (or whatever it was doing before to get around this)
[14:36:48] in bullseye, the systemd service unit file sets User=pdns/Group=pdns while buster does not
[14:38:13] (if you want to compare them yourself, cloudservices2003-dev.wikimedia.org is running bullseye and cloudservices2002-dev.wikimedia.org is running buster)
[14:59:30] volans: we're trying to launch a durum instance in drmrs and the cookbook seems to have been stuck after successfully adding the dns entry
[15:00:31] not sure if you're available to check it out
[15:02:04] mmandere: hey, sorry, in a meeting right now; which cookbook are you running? from which cumin host?
[15:03:27] volans: `ganeti.makevm` in `cumin1001`
[15:03:38] checking
[15:04:00] volans: thank you
[15:06:28] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T300525 (AlexisJazz)
[15:06:58] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T300525 (AlexisJazz)
[15:17:01] taavi: in any case, stepping out to a meta-level, there are two perspectives I can come at that code review from: The narrower "Does it look likely to negatively impact traffic's uses of pdns-recursor?" (to which I think the answer is clearly no, so we're good), or the broader one we're discussing above.
[15:17:51] and I don't want to seem like I'm stepping on your toes about the design of something we really have no hand in, either, so take my objections with a grain of salt and do what you need to do for your use-case
[15:17:57] Traffic, Beta-Cluster-Infrastructure, Release-Engineering-Team, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T300525 (RhinosF1)
[15:18:15] so long as I've told you my concerns, what you do with them is up to you, and I'm happy to +1 it for "traffic's pdns-rec is unimpacted by this" :)
[15:21:06] vgutierrez: yeah, works for me now
[15:21:17] I was getting Request from 109.149.246.224 via deployment-cache-text06 deployment-cache-text06, Varnish XID 8599790
[15:21:17] Error: 503, Backend fetch failed at Mon, 31 Jan 2022 15:16:21 GMT a moment ago
[15:23:03] ema fixed it, that's why
[15:23:17] yup, I was seeing the puppet changes on the remap rules at ats-backend
[15:23:31] thx ema <3
[15:23:37] Ty :)
[15:24:59] you're welcome!
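For the buster/bullseye comparison suggested at 14:36-14:38 above, one quick check is to diff the packaged unit files; a sketch, assuming the unit is named pdns.service as in Debian's pdns-server packaging:

    # On each host, print the unit file and pick out the privilege-drop settings:
    systemctl cat pdns.service | grep -E '^(User|Group)='
    # bullseye (cloudservices2003-dev): User=pdns / Group=pdns, so the config is
    #   opened as the pdns user from the start
    # buster (cloudservices2002-dev): no User=/Group= lines, so the daemon starts
    #   as root, reads the config, then drops privileges itself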
[15:27:33] bblack: yeah. I don't think my patch makes it much worse than what it currently is (the daemon reading it as root and dropping privileges later), so I don't want to block an os upgrade on that
[15:28:39] yeah, if it keeps the keys+pass in memory in the existing setup, it's no different
[15:29:02] (which I guess it must be, if it's able to reconnect those on failure without a daemon restart)
[15:29:55] the only new thing is that with the ownership change, in theory the unprivileged-exec attacker can chmod the config and change its contents, possibly leading to some other step
[15:30:12] (but puppet would reset the config in the long run, so it's hard to say this would be a good attack-persistence step)
[15:47:56] bblack: can you please add the "traffic is unimpacted" bit to the gerrit patchset?
[15:48:34] sure
[15:49:34] mmandere: meeting finished, I'm looking more in depth
[15:51:26] taavi: lol, I guess andrew isn't in here :)
[15:52:14] volans: ack
[15:53:24] mmandere: so, it's stuck getting info from the ganeti API (RAPI) on drmrs
[15:53:53] I just tried to kill the stuck tcp connection and I'm tailing the logs to see if it goes through or not
[15:55:11] volans: ok... following
[15:55:37] what I did to find it out
[15:55:52] I can summarize it in a task later if needed
[15:56:11] but it seems still stuck, checking by contacting the API directly
[15:59:44] understood
[16:00:43] so, the API seems to be responding, I can get responses from other endpoints, I can get the answer for /2/features for example, but /2/info is the one stuck, checking the ganeti side now
[16:06:20] netops, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) Open→Resolved
[16:06:30] netops, Infrastructure-Foundations, SRE, Traffic-Icebox, Epic: Capacity planning for (& optimization of) transport backhaul vs edge egress - https://phabricator.wikimedia.org/T263275 (JAllemandou)
[16:18:28] mmandere: it seems to be a network issue, let's see if XioNoX or topranks can help here
[16:19:06] ok
[16:19:19] I'm trying to grok the above messages.
[16:19:22] what's up?
[16:19:39] Wondering how getting data for /2/features works if the network transport is what's broken?
[16:19:53] recap: I can connect to the Ganeti API on drmrs from cumin1001 and get some replies, but not others. /2/info seems to get stuck and the difference is that it's a larger answer, might be an MTU or ICMP filtering issue?
[16:20:02] the api works fine locally on ganeti6001
[16:20:19] volans: the telxius link is down
[16:20:21] hmm ok, sounds like you've done half the troubleshooting already, that might well be it (mtu)
[16:20:27] and so we're going through the Telia tunnel
[16:20:49] and that reminds me that we didn't ask for a higher MTU on the drmrs side
[16:22:43] topranks: I had a $fun path mtu discovery rabbit hole issue at $JOB-2, so when I start seeing weird behaviour I always ask myself if it can be that :D
[16:23:50] heh yep, it's one of the first things I jump to also.... and same reason, burnt into my brain from quirky issues past
[16:24:25] all that to say that I might be biased in my analysis :D
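The size-dependent hang volans describes is the classic path-MTU signature: small replies fit in one packet, large ones need full-size packets that silently die in the tunnel. A sketch of how one might confirm it; the hostname, port, and probe sizes are assumptions for illustration (5080 is Ganeti RAPI's default port):

    # Small RAPI replies arrive, large ones hang -- consistent with a path-MTU problem:
    curl -sk https://ganeti6001.drmrs.wmnet:5080/2/features   # small answer, returns
    curl -sk https://ganeti6001.drmrs.wmnet:5080/2/info       # large answer, stalls
    # Probe with unfragmentable pings to find where full-size packets stop fitting:
    ping -c1 -M do -s 1472 ganeti6001.drmrs.wmnet   # 1500-byte packet: fails over the tunnel
    ping -c1 -M do -s 1372 ganeti6001.drmrs.wmnet   # smaller probe fits a reduced tunnel MTU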
[16:26:14] Email to Telia sent
[16:26:29] XioNoX: so the solution is... to wait?
[16:26:32] well, to ARELION
[16:26:48] volans: if it's not urgent, ideally yes
[16:26:53] mmandere: ^^^
[16:28:21] XioNoX: we are ok waiting
[16:28:33] whatever is faster between Telia bumping their MTU, Telxius fixing their fibercut 18km from the shore, or GTT making sense of their setup
[16:28:43] my bet is on #1
[16:28:52] lol
[16:29:27] :D :D
[16:29:28] mmandere: at this point I think it's safe to ctrl+c the cookbook, it shoul drollback the dns changes done so far and then exit cleanly
[16:29:49] worst case we can kill the TCP connection if that's not enough to trigger the rollback
[16:29:54] volans: ack, and thank you for helping
[16:30:04] "Drollback" :)
[16:47:50] bblack: ready for me to merge that pdns/bullseye change? Should be a no-op but I'm very afraid of causing a dns outage
[16:47:51] https://gerrit.wikimedia.org/r/c/operations/puppet/+/758063
[16:48:40] (sukhe, same question)
[16:49:51] andrewbogott: thanks, I am comfortable with the change for the DoH part. I am heading out to lunch, maybe we can merge it in a bit if that's OK and if you require us to be around?
[16:50:20] sukhe: yep, please ping me when you're back from lunch :)
[16:51:51] thank you
[16:52:46] I'm in
[16:53:02] but I'd advise puppet-disabling the prod dns recursors and then testing at least one before unleashing it
[16:53:50] (prod dns recursors being cumin's "A:dns-rec")
[17:05:48] mmandere, XioNoX: I see Telia have come back already to say they've increased that MTU. Quite amazing turnaround, I gotta say, fair play to them.
[17:06:02] indeed!
[17:06:56] I configured our side as well, mmandere can you give it another try?
[17:07:13] XioNoX: I can confirm it works!
[17:07:19] nice!
[17:07:33] I got a 6221-byte string back
[17:08:19] XioNoX: what was the MTU value and on which side was it wrong?
[17:09:09] volans: it was the default MTU on the drmrs<->telia transit side, on which we have a backup GRE tunnel between drmrs and esams
[17:10:25] ack
[17:21:08] volans XioNoX topranks thanks for helping. We'll try to relaunch the vm creation (tomorrow, as it is a little late on my end) and will let you know if faced with any challenges
[17:21:40] anytime :)
[17:36:33] andrewbogott: I am here if we want to try it out
[17:36:42] Let's do it!
[17:37:01] First going to take Brandon's advice and disable puppet on the prod recursors
[17:37:04] yep
[17:37:05] after I remember how to do that with cumin...
[17:37:59] sudo cumin A:foo "disable-puppet 'message here'"
[17:38:03] iirc
[17:38:08] cumin A:dns-rec "disable-puppet"
[17:38:19] andrewbogott: happy to do this part
[17:38:44] sure, go ahead
[17:38:44] my wifi is flaky anyway
[17:38:48] thanks
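The disable/canary/re-enable flow being set up here looks roughly like the following sketch. The reason strings and the canary host are placeholders, disable-puppet does want a reason argument as in sukhe's example above, and the --enable behavior of run-puppet-agent (re-enable with the matching reason, then run) is an assumption about the WMF wrapper scripts:

    # Disable puppet across the recursor fleet, giving a reason:
    sudo cumin 'A:dns-rec' "disable-puppet 'pdns bullseye change - <your name>'"
    # Re-enable and run on a single canary first, then spot-check resolution:
    sudo cumin 'dns1002.wikimedia.org' "run-puppet-agent --enable 'pdns bullseye change - <your name>'"
    dig @dns1002.wikimedia.org en.wikipedia.org +short
    # If the run was a no-op and dig still answers, re-enable the rest of the fleet:
    sudo cumin 'A:dns-rec' "enable-puppet 'pdns bullseye change - <your name>'"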
[17:39:58] I suspect that this network is infected with a bot that does something expensive every 30 minutes :/
[17:41:44] ok
[17:41:46] merging change
[17:42:14] sounds good!
[17:43:05] getting the easy and less critical one out of the way first (doh)
[17:44:50] (which was a NOOP anyway but yeah)
[17:44:51] moving on
[17:46:38] in theory it's all noops!
[17:46:42] ha
[17:46:47] andrewbogott: can you please test cloudservices?
[17:47:36] yep, it was a no-op on the buster boxes.
[17:47:36] So all good on my end.
[17:49:05] looks good
[17:49:16] thanks for patch taavi!
[17:49:18] *the
[17:49:49] hmm, I'm still seeing "Jan 31 17:42:35 cloudservices2003-dev pdns_server[2657109]: Jan 31 17:42:35 Unable to open /etc/powerdns/pdns.conf" which is what the file ownership change was supposed to fix
[17:50:11] ah, nope
[17:50:16] it's giving a different error now
[17:50:22] what's the error?
[17:50:23] Jan 31 17:50:16 cloudservices2003-dev pdns_server[2665964]: Fatal error: Trying to set unknown parameter 'default-soa-name'
[17:50:27] ah
[17:50:31] that seems like an actual config issue that we can fix
[17:50:53] yeah
[17:50:56] deprecated config option
[17:51:00] https://doc.powerdns.com/authoritative/settings.html#default-soa-name
[17:52:44] we probably need default-soa-content
[17:54:38] taavi: there's still a real grab-bag of ownership/permissions in /etc/powerdns -- I fear that's going to trip us up
[17:54:46] but I'm not far enough in to know how/if yet
[17:56:55] is anyone preparing the patch for the default-soa-name issue or should I?
[17:56:58] andrewbogott: in what way?
[17:57:38] sukhe: If pdns had trouble reading the .conf file, surely it will have the same problem reading any other file in there
[17:57:52] I haven't looked at the soa-name thing yet
[17:59:35] I think the error taavi was referencing was an earlier one, unless I am mistaken
[17:59:52] because updating default-soa-name does fix the issue
[18:00:13] also I haven't followed the historic discussion and I have never looked at pdns so maybe I should step back :)
[18:00:59] Eh, I'm just guessing. If it works, it works
[18:04:00] default-soa-name was deprecated in 4.2 but the replacement key wasn't added until 4.4? That's not how deprecation works :(
[18:04:16] yeah, also the documentation doesn't really talk about the replacement :)
[18:05:26] since the permissions issues seem to be resolved, I am going to re-enable puppet on A:dns-rec unless there are objections
[18:05:31] I will wait for a +1 on this ^
[18:05:52] sukhe: did you test a canary host first?
[18:06:32] yeah, test one first please
[18:06:39] Traffic, SRE: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (Esanders) This appears to be affecting Patch demo instances too: https://github.com/MatmaRex/patchdemo/issues/422
[18:06:44] if we break A:dns-rec all at the same time, everything will probably die
[18:06:54] I am going to wait for someone else to test it as well then :)
[18:08:35] bblack: is this role(dnsbox) or something else?
[18:09:25] yeah, A:dns-rec is role(dnsbox)
[18:10:03] I can go try one
[18:10:27] looks like puppet is already enabled on dns1002...
[18:10:39] ok
[18:10:42] maybe that's sukhe
[18:11:27] puppet is happy there but bblack if you want to do a dig test that would be reassuring
[18:12:39] yeah it applied as a no-op back
[18:12:45] no need to check anything else
[18:13:08] ok -- sukhe go ahead and re-enable puppet then
[18:13:48] [done]
[18:14:00] thanks all!
[18:14:15] There's another config change coming up but it should be trivial
[18:14:31] some variation on https://gerrit.wikimedia.org/r/c/operations/puppet/+/758540
[18:14:54] yeah, that default-soa-name thing is :/
[18:14:57] sorry, that was me
[18:15:00] there's no version in which the two overlap
[18:15:01] (er, dns1002)
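For reference, the rename in question: pdns 4.4 dropped default-soa-name and folded the whole default SOA into default-soa-content, which takes a full record in one string ('@' is substituted with the zone name). A hypothetical before/after, with the server name invented for illustration:

    # pdns.conf under buster's pdns 4.1 -- the setting removed in 4.4:
    default-soa-name=ns0.example.wmcloud.org
    # pdns.conf under bullseye's pdns 4.4 -- one setting carries the full SOA:
    default-soa-content=ns0.example.wmcloud.org hostmaster.@ 0 10800 3600 604800 3600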
[18:15:13] hm, I don't love it if that template is getting used on non-openstack hosts...
[18:15:26] will need to put some less-hard-coded value there
[18:15:26] I don't think anyone uses pdns-auth except you
[18:15:31] oh good :)
[18:15:31] I think
[18:15:38] ok, then we just need to switch based on OS
[18:16:26] yeah, in our main puppet repo, the only string matches for pdns_server are itself and:
[18:16:30] modules/profile/manifests/openstack/base/pdns/auth/service.pp: class { '::pdns_server':
[18:16:59] a lot of these cases do scare me in the general case though
[18:17:24] WMF's various SRE teams/subteams are getting too big in some cases to be having shared infra code
[18:18:02] one team says "Oh I'll make this *simple* change to component foo, it seems trivial and our use of it is non-critical" and then it completely breaks something they never heard of that was a dependency for Everything.
[18:18:09] it's getting harder to keep tabs on those kinds of things
[18:18:43] don't ask me what the reasonable answer is. it's also not great to be duplicating efforts pointlessly :)
[18:19:41] bblack: you are describing the reason I pinged all of you before merging that obviously-no-op change
[18:19:53] Because 'obviously' is not always so obvious
[18:28:08] yeah, it just doesn't scale well into the future and over the long term :)
[18:28:24] it will increasingly become non-obvious in many cases, and we don't want the burden of close coordination on every minor change
[18:54:15] Traffic, Performance-Team, SRE, serviceops: Decide on details of progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (Krinkle) a:Krinkle→None
[20:02:29] Traffic, SRE: Serve redirect wikimediastatus.net --> www.wikimediastatus.net - https://phabricator.wikimedia.org/T300161 (CDanis) Open→Resolved a:CDanis @Volans made the suggestion of using wikitech-static. Given that status.wikipedia.org is currently served from there, this seems quite reas...
[22:37:02] hey traffic, search team's got a small patch up that changes pybal to use a UA that indicates it's pybal (when running checks) rather than the generic twisted one. anyone have time to take a look at some point? https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/743222
[22:38:12] `wikidata/query/deploy`
[22:38:53] `user@Rs-MacBook-Pro ~/wmf/wikidata/query/deploy [master]%`
[22:39:08] `wikidata/query/rdf`
[22:42:51] https://integration.wikimedia.org/ci/blue/organizations/jenkins/wikidata-query-rdf-maven-release-docker/detail/wikidata-query-rdf-maven-release-docker/85/pipeline
[22:43:03] 0.3.101
[22:43:43] `pull latest` -> `check jenkins build to get version # (e.g. 0.3.101)` -> `./deploy-prepare.sh '0.3.101'`
[22:45:05] lol sorry, ignore the last 4 messages, wrong channel :P first message is real tho
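At the HTTP level, the pybal change mentioned at 22:37 just makes the health checks self-identifying in backend access logs; twisted's HTTPClientFactory otherwise sends a generic default agent (e.g. "Twisted PageGetter"). A rough sketch, with the header value and URL invented here (the real string is in the linked gerrit change):

    # Before: backend logs show the generic twisted default user-agent.
    # After: a recognizable UA, so monitor traffic can be filtered out of logs:
    curl -s -o /dev/null -H 'User-Agent: PyBal' http://backend.example.org/health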