[06:09:24] <_joe_> I can't seem to resolve puppet-compiler.wmflabs.org via dns
[06:10:04] <_joe_> is this a known problem?
[06:10:32] <_joe_> marostegui: can you?
[06:16:21] it works for me
[06:17:55] <_joe_> works for me now too :P
[07:04:52] \o/
[07:07:31] <_joe_> happy?
[07:07:54] I am on call XD
[07:08:41] <_joe_> !incidents
[07:08:41] No incidents occurred in the past 24 hours for team SRE
[07:30:40] well, that is technically true, I guess
[07:31:29] last one was 30h ago
[07:41:17] _joe_: could it be similar to https://phabricator.wikimedia.org/T316476 ?
[07:42:22] mmm so sirenbot says jbond is oncall, but the calendar says it is topranks?
[07:42:40] XioNoX: I wonder if that can be related to https://phabricator.wikimedia.org/T315955#8188684 and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/6fc0029d48e76df489c91b9dcaf4e87f84e74890%5E%21/#F0
[07:42:50] marostegui: there is an override, I followed that
[07:43:00] jynus: but then who's the one?
[07:43:24] taavi: sounds very possible yeah
[07:43:38] marostegui: you can see it at: https://portal.victorops.com/dash/wikimedia#/team/team-ra3ayi0mHc3Nr6qu/rotations
[07:43:54] jynus: yeah, that's the one I am seeing
[07:44:01] But then the bot needs fixing?
[07:45:08] why do you say that? because the calendar? that I believe is a separate thing
[07:45:27] So per the calendar Cathal is the one oncall today with me, right?
[07:45:41] taavi: can you comment on the tasks? I can do it but the credit goes to you :)
[07:46:10] marostegui: oh, I see, probably because the override starts in 15 minutes
[07:46:20] see calendar in 15
[07:46:21] Ah!
[07:46:25] XioNoX: https://gerrit.wikimedia.org/r/c/operations/dns/+/827446/
[07:46:31] and I'll comment on both
[07:46:37] Yeah, jynus that will probably explain it
[07:46:50] garbage in, garbage out :-D
[07:46:50] _joe_: feature request for !incidents: say when was the last incident
[07:54:12] <_joe_> marostegui: sirenbot polls the victorops api for who's oncall now, so I would assume it's accurate re: who gets paged
[07:55:07] yeah, it is probably explained by what jynus said, the override starts in a bit
[07:57:25] jynus _joe_ so there's no override for Business hours emea
[07:57:35] So I am alone
[07:57:42] <_joe_> marostegui: wait, wat
[07:57:54] So on Business Hours EMEA
[07:57:55] This rotation is being used in the following escalation policies
[07:58:05] It's me and john, who is out this week I reckon
[07:58:33] <_joe_> marostegui: do you know if there's a planned override?
[07:58:33] and topranks will take over from 8am utc
[07:58:34] unless it will change in 2 minutes in victorops?
[07:58:45] it will
[07:58:51] I would have expected victorops to show that too
[07:58:52] Interesting
[07:59:02] it does at https://portal.victorops.com/dash/wikimedia#/team/team-ra3ayi0mHc3Nr6qu/scheduled-overrides
[07:59:16] just override should have been set since 0h to avoid these issues
[08:00:08] But not on https://portal.victorops.com/dash/wikimedia#/team/team-ra3ayi0mHc3Nr6qu/rotations which is what I would have expected to get shown too
[08:00:08] e.g. it is not unusual to be alone (but just for 1-2h or so)- your working hours don't have to match exactly that of your mate 0:-D
[08:00:18] jynus: I know that, I have been oncall before ;)
[08:00:52] marostegui: I won't try to convince you victorops ui has some sense :-/
[08:01:05] yeah XD
[08:01:41] So I shouldn't expect to see the change on https://portal.victorops.com/dash/wikimedia#/team/team-ra3ayi0mHc3Nr6qu/rotations or should I?
[08:02:10] Klaxon.wikimedia.org has updated so I guess if the bot is refreshed it will
[08:02:23] marostegui: hey yep I am definitely on call this week - swapped with j.bond
[08:02:53] thanks topranks - we have so many sources of information that it is hard to know which ones we should trust or check!
[08:02:56] topranks: you started the override slightly late today I think so it showed John for the first hour
[08:03:20] But interestingly, in victorops for EMEA hours, it still shows john
[08:03:23] I am so confused
[08:03:26] haha
[08:03:38] Leo had set it up originally... it's likely as I am in Ireland and thus 1 hour behind CEST
[08:03:46] marostegui: https://calendar.google.com/calendar/u/0/embed?src=kpcomsk13n79pcni0bnibndold4n15dj@import.calendar.google.com might be less confusing
[08:03:55] RhinosF1: That's the one I use yes
[08:04:08] But ultimately I would have trusted victorops emea hours
[08:05:01] I'm happy to have it start an hour earlier if we can make the change.
[08:05:30] topranks: it is ok, no problem
[08:05:40] topranks: it's correct for the rest of the week
[08:05:49] And correct as of 5 minutes ago
[08:05:58] ok cool thanks
[08:10:32] I am switching m2 master in 20 minutes
[08:10:39] https://phabricator.wikimedia.org/T316202
[08:52:27] I'm gonna resume my https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785 experiment in cp6016 shortly, this triggered T316337 on Friday but hopefully it won't happen again :)
[08:52:28] T316337: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337
[08:53:00] so if you hit WMF via drmrs please let me know if you experience any weird behaviour
[15:17:09] sobanski: do I make sense in https://phabricator.wikimedia.org/T316560#8194642
[15:18:38] * RhinosF1 doesn't mind doing it if that's the correct docs link
[15:19:12] Do you mean adding a section to the /Runbooks page pointing at the actual documentation?
[15:23:51] sobanski: yes
[15:24:03] Like is done for the alerts already on there
[15:59:51] about to start a retrospective on 2 million errors we generated during the swift incident
[16:00:02] you should have an invite in your inbox
[17:26:07] Thanks rzl mutante
[17:26:11] Re the runbook
[17:27:49] 👍
[19:28:25] Hello team, could you please help me to clone the rancid repository? According to the Wiki 'git clone ssh://netmon1002.wikimedia.org:/var/lib/rancid/core/ rancid-configs' and 'git clone ssh://netmon1003.wikimedia.org:/var/lib/rancid/core/ rancid-configs' should clone the repository to your local machine.
[19:28:25] I'm not sure if I'm unable to do it due to a bad Git/SSH configuration or if there's something else...
[19:28:59] denisse|m: i'll try, one sec
[19:29:35] fatal: '/var/lib/rancid/core/' does not appear to be a git repository
[19:30:06] Thanks Daniel, topranks and I are having the same issue...
[19:30:24] root@netmon1002:/var/lib/rancid/core# git status
[19:30:24] On branch master
[19:34:48] Thanks Daniel, I'm not sure if we may need to add our users to an ACL to be able to clone it with our regular accounts...
[19:35:09] denisse|m: it's as if "git init" never ran but locally it thinks it is a normal repo.. that's .. weird
[19:35:24] the login over ssh works
[19:36:45] On netmon1003 the git repository seems to be working as expected, the latest commit is from August 29...
[19:37:11] I get the same error from 1002 and 1003 when trying to clone
[19:37:48] Me too, I also get the error when cloning from 2001.
[19:38:40] yea, it's the same on all of them.
[19:38:58] denisse|m: it's because the regular non-root user can't switch into the "core" directory
[19:39:08] [netmon1002:~] $ cd /var/lib/rancid/core
[19:39:08] -bash: cd: /var/lib/rancid/core: Permission denied
[19:39:45] was the ownership different before?
[19:40:06] It's the same in all the instances, let me check previous commits...
[19:46:58] The permissions haven't changed in a while and they seem to be right for me (rwxrwxr-x).
[19:50:46] Okay, I found the issue.
[19:51:23] The parent directory didn't have 'x' permission, therefore we couldn't 'cd' into it.
[19:54:30] denisse|m: this fixes it:
[19:54:34] [netmon1002:/var/lib] $ sudo chown rancid:wikidev rancid
[19:54:58] if you let the wikidev group own the parent dir
[19:55:09] since all the shell users are in that group
[19:55:29] well, looks like you got it already
[19:55:39] Hi SRE folks, does anyone know if cache parameters for donate.wikimedia.org have changed lately?
[19:55:51] mutante: Ah, that's a good idea. I modified permissions of '/var/lib/rancid/' to be 755 but adding the group and letting it be 750 seems like a better idea, thank you.
[19:55:58] I think it must have varied on the GeoIP cookie at one point
[19:56:08] Possibly set '/var/lib/rancid/' to be rancid:wikidev.
[19:56:12] But perhaps now it doesn't?
[19:56:20] denisse|m: ACK, I just let puppet revert my live hack
[19:56:30] (on 1002)
[19:56:34] We're seeing a large number of people getting the "can't detect your country" message
[19:56:46] even when there is a valid GeoIP cookie being set by varnish
[19:57:00] https://phabricator.wikimedia.org/T316578
[19:57:32] There are currently fundraising emails out in donors' inboxes with links to donatewiki, so we
[19:57:42] 're trying to figure this out as soon as we can
[19:57:43] https://donate.wikimedia.org
[19:57:52] <_joe_> bblack: ^^
[19:58:04] The 'wikidev' group works and it's without a doubt a better approach, thanks for pointing that out, Daniel. I'll update my patch.
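The permission problem above (the target directory was rwxrwxr-x, but its parent lacked 'x' for regular users) can be reproduced with a short Python sketch. This is illustrative only and not something that was run on the netmon hosts: the function names and the numeric uid/gid values are made up, and only the classic owner/group/other mode bits are checked, so ACLs and root are ignored.

```python
import os
import stat
from pathlib import Path
from typing import Optional, Set


def can_search(mode: int, owner_uid: int, owner_gid: int,
               uid: int, gids: Set[int]) -> bool:
    """True if a non-root user may traverse a directory with these mode bits.

    Only the classic owner/group/other bits are checked; ACLs are ignored.
    """
    if uid == owner_uid:
        return bool(mode & stat.S_IXUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IXGRP)
    return bool(mode & stat.S_IXOTH)


def first_blocked_dir(path: str, uid: int, gids: Set[int]) -> Optional[str]:
    """Walk from / down to `path`; return the first directory the user cannot enter."""
    target = Path(path)
    for directory in [*reversed(target.parents), target]:
        st = os.stat(directory)
        if not can_search(st.st_mode, st.st_uid, st.st_gid, uid, gids):
            return str(directory)
    return None


# Hypothetical IDs: rancid is uid 500, wikidev is gid 600, the caller is uid 1000.
# Parent owned by rancid:rancid with mode 0750, caller not in the owning group: blocked.
print(can_search(0o750, 500, 500, uid=1000, gids={600}))
# Parent owned by rancid:wikidev with mode 0750, caller in wikidev: allowed.
print(can_search(0o750, 500, 600, uid=1000, gids={600}))
```

Calling `first_blocked_dir("/var/lib/rancid/core", uid, gids)` with a regular user's IDs would point at the parent directory rather than at `core` itself, which matches why checking only the permissions of `core` looked fine.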
[19:58:23] The code that uses the GeoIP cookie hasn't changed since at least March
[19:59:18] and this problem just started today
[20:00:20] <_joe_> ejegg: Special:FundraiserRedirector reports a cache pass
[20:00:29] <_joe_> meaning it's not cached at the edge AFAICT
[20:00:48] oh ok, good to know
[20:01:05] <_joe_> also sends
[20:01:07] <_joe_> vary: Accept-Encoding,X-Forwarded-Proto,Cookie
[20:01:24] that's right, I would have seen a cache report at the bottom of the page that got redirected TO...
[20:01:39] I should have been looking at the headers on the previous request
[20:01:40] sorry
[20:01:50] I see the comment "seems to work when logged in to donatewiki. But not when logged out" and today was "trafficserver: Hide non session cookies during cache lookup". can there be a relation?
[20:02:10] yeah if something changed today, that's the likely culprit
[20:02:12] <_joe_> mutante: very possible as that is mangling the request
[20:02:16] <_joe_> yeah
[20:02:19] we can revert
[20:02:19] then see https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785
[20:02:30] it's the same one that was involved in phab login issues, I think
[20:02:47] <_joe_> yes
[20:02:51] <_joe_> that would be it
[20:03:01] ooh, sounds worth a try!
[20:03:03] <_joe_> that hides the geoip cookie from mediawiki
[20:03:20] <_joe_> which is likely to cause other problems too
[20:03:20] definitely would cause our issue
[20:03:22] it wasn't meant to
[20:03:51] <_joe_> bblack: well not if you're logged in
[20:03:59] but I think maybe we're missing something about ATS control flow here
[20:04:33] either way, let's start with a revert?
[20:04:37] <_joe_> yes
[20:04:37] yes please!
[20:04:38] yes please :)
[20:04:42] <_joe_> that sounds sensible
[20:09:54] what the patch was *meant* to do, was hide the cookies from ATS's internal cache lookup in , and then restore the cookie (if it was hidden) before sending any backend origin (e.g. mediawiki) req.
[20:10:20] hmmmm
[20:10:32] thx so much bblack _joe_ ejegg greg-g :)
[20:10:52] but now I'm left wondering if perhaps donate's "Vary:Cookie" has a non-standard meaning in our world
[20:10:52] thx also mutante! :) :)
[20:11:22] the revert is deployed right? (I didn't see a !log in -operations) my donatewiki page is loading correctly
[20:11:33] part of our poorly-documented (maybe undocumented) assumptions for years has been that Vary:Cookie in our world is only used for MW sessions/tokens.
[20:11:52] <_joe_> greg-g: I don't know if brandon is forcing a puppet run
[20:11:53] greg-g: it's merged, but not pushed to all
[20:12:18] <_joe_> but I am still seeing the error from marseille, yes
[20:12:20] bblack: just fwiw this has a significant impact on a current e-mail donation campaign... since the donation URLs went out in e-mails, we can't change that
[20:12:23] ack
[20:12:27] I imagine donatewiki doesn't use MW sessions in the usual sense, so perhaps it's outputting Vary:Cookie intending to mean "vary-on-geoip-cookie"?
[20:12:33] private browser it failed still
[20:12:41] <_joe_> it does both bblack
[20:13:06] <_joe_> and I am not sure it's just donatewiki, I'll check tomorrow
[20:13:30] yeah well, this probably is just the most-obvious and recent way in which that's misunderstood, then.
[20:13:32] geoIP cookies are used by a lot of stuff on wikis
[20:13:40] <_joe_> bblack: should we force a puppet run, given this is affecting an ongoing campaign?
[20:13:40] for example for banners
[20:13:59] however i didn't see any issues with CentralNotice's ability to read the cookie in
[20:14:16] <_joe_> AndyRussG: to be clear, the patch as I would have understood it would not in any way cause what we're seeing
[20:14:18] _joe_: it's running now. it's been many hours, though.
[20:14:20] greg-g: the "!log" line confirmations have been removed recently
[20:15:07] <_joe_> bblack: there is no page that uses cache involved in the process, apart from the final landing page that is country specific
[20:15:18] _joe_: ok that's important to know. yeah I'm also not fully understanding the details of how this could be going, just seems like the most likely thing to revert
[20:15:20] <_joe_> so if this is the problem, a simple puppet run should solve the issue
[20:15:49] <_joe_> I am now seeing the correct page FWIW
[20:16:03] <_joe_> so I would say the issue was indeed this patch; I'm not 100% sure *how*
[20:16:09] but even our legacy varnish code makes strong assumptions that, in our WMF/MW world here, Vary:Cookie only has meaning for Session|Token cookies from MW.
[20:16:17] <_joe_> but given it's past 10 pm, I'll go back to my movie :)
[20:16:27] _joe_: bblack yayyy also working for me now!
[20:16:32] thanks much for responding to me during a moive, _joe_ !
[20:16:34] if some things are using it otherwise, we probably have other corner cases to go chasing down
[20:16:35] movie*
[20:17:08] (none of which would be urgent, as they've been working however they're working now for a very long time, but still)
[20:17:11] and just to be explicit: we are confirming that the page is loading properly now as well. Thanks one more time :)
[20:18:06] the geoip cookie is (by my historical understanding) for the UA to consume, not the backend
[20:18:29] bblack: pls don't hesitate to let fr-tech know if we can help support digging into this corner case ofc, thx so much once again :)
[20:18:32] (for anonymous traffic anyways. all bets are off for otherwise-sessioned requests which are fully uncached)
[20:19:34] <_joe_> bblack: that's how centralnotice works IIRC
[20:19:37] or for that matter, other cases where vary wouldn't matter (like uncacheable outputs)
[20:19:50] <_joe_> and yes, this is one uncacheable output
[20:20:01] if it's uncacheable, then Vary is pointless
[20:20:06] <_joe_> but Special:FundraiserRedirector makes use of the geoip cookie
[20:20:17] <_joe_> so it needs to be preserved in the request
[20:20:37] _joe_ correct about CentralNotice. It's read in via JS, and in fact the JS interface provided by CN is used by other stuff
[20:20:44] <_joe_> and IIRC there were a couple other extensions using it
[20:21:34] it's ok if extensions use random cookies (including GeoIP) in an uncacheable-response case, or when there's a vary:cookie + session-cookie (which makes it uncacheable anyways)
[20:22:04] the nit we're picking is whether there's a *cacheable* output here, that's trying to Vary:Cookie on the GeoIP cookie
[20:22:12] (should be spun off elsewhere really... https://phabricator.wikimedia.org/T102848)
[20:23:10] bblack: for the donate wiki code, if it's helpful I think we could consider switching to only consume geoIP client-side
[20:23:21] either that or we just have some true unintentional bug in that patch, in our code or implicitly in how we understand ATS mechanics
[20:24:26] the intended mechanics of the patch only hide the cookie from cache lookup code, not from origin requests
[20:25:59] oh wait, I think I know what happened. We may have yet-again forgotten that there can be several cookie headers?
[20:26:25] IIRC one version of HTTP said you can't do that, then another did. I think we have to assume at least some UAs do it some of the time?
[20:26:40] s/then another did/then another said you can/
[20:26:53] denisse|m: I am thinking maybe that was broken since 2282e81065b which added "managehome => true" with 'user { 'rancid' and the home is /var/lib/rancid. except this change was made in 2014
[20:27:32] so maybe our cookie save/restore code was saving the first cookie, then wiping them all, then only restoring the first cookie
[20:27:46] in Varnish I think it's transparent, in ATS maybe not
[20:29:10] I donno, we'll dig more tomorrow when vg is online
[20:31:03] denisse|m: reading the commit message from 2014.. "get rid of generic::systemuser ..and replace it with the regular puppet user type again to remove a layer of abstraction that isn't really needed." but great that now in 2022 we are using the new abstraction :)
[20:31:36] and switching the other direction again
[21:06:11] bblack and mutante sorry if I muddied the waters with my first question here - there is nothing cache-able using the GeoIP cookie server-side on donatewiki - only the uncached FundraiserRedirector
[21:09:08] I wasn't thinking clearly, and I saw the cachereport in the script tag at the bottom of the page that FundraiserRedirector had redirected me to
[21:10:02] which of course applied to that page and not the FundraiserRedirector page which just returns 301 + Location header
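The multiple-cookie-header hypothesis raised at 20:25:59 and 20:27:32 was left unconfirmed in the channel, but the failure mode it describes is easy to model. The Python sketch below is not the ATS patch and uses no ATS API: headers are a plain list of (name, value) pairs, the function names are invented for illustration, and the cookie values are placeholders.

```python
from typing import List, Tuple

Headers = List[Tuple[str, str]]


def hide_cookies(headers: Headers) -> Tuple[Headers, List[str]]:
    """Strip every Cookie field; return the stripped headers plus all saved values."""
    saved = [value for name, value in headers if name.lower() == "cookie"]
    stripped = [(name, value) for name, value in headers if name.lower() != "cookie"]
    return stripped, saved


def restore_first_only(headers: Headers, saved: List[str]) -> Headers:
    """Suspected failure mode: only the first saved Cookie field comes back."""
    return headers + [("Cookie", saved[0])] if saved else list(headers)


def restore_all(headers: Headers, saved: List[str]) -> Headers:
    """Restore every saved value, folded into a single Cookie field."""
    return headers + [("Cookie", "; ".join(saved))] if saved else list(headers)


# Placeholder request in which the client sent two separate Cookie fields.
request = [
    ("Host", "donate.wikimedia.org"),
    ("Cookie", "session=abc"),
    ("Cookie", "GeoIP=XX"),
]

stripped, saved = hide_cookies(request)   # what the cache lookup would see
print(restore_first_only(stripped, saved))  # GeoIP never reaches the origin
print(restore_all(stripped, saved))         # both cookies reach the origin
```

If the save step keeps only the first field while the wipe step removes them all, any cookie carried in a later field is silently lost before the origin request, which would be consistent with the "can't detect your country" symptom. Joining with "; " mirrors how HTTP/2 specifies that split Cookie fields are recombined.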