[10:03:02] I created an incident report of yesterday's row D partial outage for review, let me know if you have any questions: https://wikitech.wikimedia.org/wiki/Incidents/2022-10-06_eqiad_row_D_networking then it will go through the onfire pipeline
[10:03:25] thanks
[11:40:37] topranks: thanks for T304501!
[11:40:37] T304501: IPv6 BFD Sessions Failing from Bird (Anycast VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501
[11:41:08] I am secretly hoping this fixes the flapping BFD session that we see for doh1001 :)
[11:43:06] sukhe: no probs!
[11:43:25] I think it's unlikely to fix that, but it's not impossible
[11:44:07] From the reading I did JunOS should work in the default 'automatic' mode, so it seems to me there is likely some edge-case bug where it sometimes doesn't evaluate which mode to use correctly
[11:44:34] not at all impossible that that bug is related to the flapping we see, but it could well be something else
[11:55:50] ok!
[12:07:28] Are the dynamic pages under noc. failing due to a known issue/maintenance, or is it unexpected? I cannot see config files or databases
[12:17:15] <_joe_> jynus: I guess I made some mistake
[12:17:21] <_joe_> sorry I just noticed
[12:20:06] <_joe_> jynus: hotfixed, will puppetize the fix asap, although I'm about to eat lunch
[12:20:07] oh, it works now?
[12:20:32] thank you, no biggie, just thought I'd report it in case it was not expected
[12:21:26] <_joe_> https://gerrit.wikimedia.org/r/c/operations/puppet/+/840117 the fix :P
[12:21:58] go have lunch, it can wait
[13:50:32] topranks: yeah that's really cool. FWIW, our haproxy has the ability to set TOS/DSCP at its level on our outbound packets, so we could make L7 decisions about that at any of our layers, and communicate it back to haproxy for interacting with router-level QoS
[13:51:14] bblack: oh cool that is good to know
[13:52:11] (and then instead of explicitly using arbitrary limits to try to keep from saturating transit, we could for example identify the high-pressure stuff (like an image that's being hotlinked) and just mark it lower prio to be the best candidate for dropped packets.)
[13:52:12] myself and Arzhel discussed before (we didn't know) how many applications might support marking packets directly, and we weren't too confident
[13:52:57] yeah being able to use richer information from higher layers, and use that to mark packets so the routers can see it would be really great
[13:53:37] we'd been assuming some lower-level iptables stuff could do it, but that is fairly broad in terms of what you can do
[13:54:05] your example of a hotlinked image is exactly what we couldn't classify at that layer
[13:54:17] we could make other generic priority decisions broadly, too (like marking all the upload-lb packets as lower prio than the text packets, so that we're more likely to have the api/text bits working when images saturate in general)
[13:54:49] admittedly, that could be done by-source-IP today, but it's simpler config-wise to just mark them at the origin
[13:55:41] yes definitely. and the switches in particular have small amounts of TCAM for access-lists, so large ACLs matching a lot of different single IPs is problematic there
[13:56:26] when you say "communicating back to haproxy" is there a mechanism in existence for that? or something you had in mind?
[14:03:58] topranks: what I mean is, even if only the application layer, or our deeper cache layers (ats-be, varnish-fe) knew how to classify the egress traffic, it could set a response header like "X-Set-DSCP: 123" or whatever for haproxy to parse and use.
[14:05:15] (that might even be a better model than adding more tricky L7 logic in haproxy itself on the request side)
[14:08:44] of course yep HTTP header is a very good way to indicate that. makes sense.
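To illustrate the idea, a minimal haproxy sketch of that header-driven marking. Assumptions: the "X-Set-DSCP" header name is just the example from the conversation above and the DSCP classes are illustrative; haproxy's set-tos action takes a static value per rule, so there is one rule per traffic class, and the TOS byte is the DSCP value shifted left by two bits.

    # map an internal response header set by deeper layers onto the TOS byte
    # of the packets haproxy sends back to the client
    http-response set-tos 0x28 if { res.hdr(X-Set-DSCP) -m str 10 }   # DSCP 10 (AF11): low-prio bulk, e.g. hotlinked media
    http-response set-tos 0x68 if { res.hdr(X-Set-DSCP) -m str 26 }   # DSCP 26 (AF31): higher-prio text/api responses
    # keep the hint internal, strip it before the response leaves
    http-response del-header X-Set-DSCP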
[16:25:56] http://ratfactor.com/rss-club/school-vs-wikipedia <- nice pro-wikipedia rant I picked up via HN today :)
[16:29:34] wow, a pro-Wikipedia article on HN. we may finally see the Loch Ness monster next
[16:29:56] <_joe_> sukhe: I'm actually disturbed
[16:29:56] I am waiting for the eventual comment of, "why do they need an SRE team? I can run Wikipedia all by myself"
[16:30:34] nice article though
[16:30:44] _joe_: to see a pro-Wikipedia article on HN? :P
[16:31:00] <_joe_> sukhe: more broadly, something I agree with
[16:31:04] hah
[19:08:22] apergos, still around today? I'm trying to understand what's happening with https://phabricator.wikimedia.org/T319269
[19:23:28] hey, andrewbogott yep
[19:23:44] JustHannah: ^^ you're following this conversation too right?
[19:23:55] so there are two things, right
[19:23:59] Yeah, we were discussing in PM. I think the root issue is this https://gerrit.wikimedia.org/r/c/operations/puppet/+/840213
[19:24:05] 1) downloads continued to go to labstore1006
[19:24:11] so right now I'm just trying to find the right magic rsync command to get 1001 caught up.
[19:24:39] 2) once that stuff is properly rsynced around to clouddumps1001/2, then clouddumps1001 should be the place where they get downloaded
[19:24:48] ok um I have that around
[19:25:35] /usr/bin/rsync -a --bwlimit=160000 labstore1006.wikimedia.org::data/xmldatadumps/public/other/enterprise_html/runs /srv/dumps/xmldatadumps/public/other/enterprise_html/ I think this should do it
[19:25:46] from clouddumps1001 if the ferm stuff is sorted out
[19:25:54] yeah I'm here
[19:26:02] labstore1006 should not be part of this story, it's about to be decom'd
[19:26:18] I mean, after today :)
[19:26:26] So I can run that command on cloudstore1001 right? Trying...
[19:26:27] after today, yup
[19:26:41] yes, if you've adjusted ferm to permit the connection
[19:27:02] that's where we got stopped, and then noticed that labstore1006/7 had been set to spare and then decided to call you :-D
[19:28:53] * andrewbogott hacking around with the firewall...
[19:29:05] ah so it wasn't fixed up yet, I see.....
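For context, the ferm adjustment being poked at here would be roughly of this shape. This is only a sketch: it assumes rsyncd's standard tcp/873 and the clouddumps hostnames mentioned above, the real rules are puppetized, and an IPv6 rule would additionally need @resolve((...), AAAA).

    # on the labstore1006 side: allow the rsync daemon from the new clouddumps hosts
    domain ip table filter chain INPUT {
        proto tcp dport 873 saddr @resolve((clouddumps1001.wikimedia.org clouddumps1002.wikimedia.org)) ACCEPT;
    }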
[19:31:51] * andrewbogott wishes rsync would tell me if it's doing something or not
[19:32:14] yeah, just gotta wait for it to time out
[19:35:49] * andrewbogott adds --progress
[19:35:54] I think it's actually copying now
[19:37:06] you can check if you see the directories, um
[19:37:37] /srv/dumps/xmldatadumps/public/other/enterprise_html/runs/20220920 and 20221001
[19:37:45] on clouddumps1001 I mean
[19:37:50] apergos: can I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/840213?
[19:37:53] then do the same rsync on 1002 and we're done
[19:38:06] JustHannah: ^^ what do you think?
[19:40:25] sorry are the dirs there now???
[19:41:19] the patch looks good!
[19:42:03] they are in progress still
[19:42:07] JustHannah: it's copying things but hasn't gotten to October yet
[19:42:14] clouddumps1001 is happening, 1002 is not yet started
[19:42:23] you can up the bw limit if you want it to go faster
[19:43:49] would you be willing to set the same value to false on the corresponding labstore1006 host?
[19:44:03] I know it's not being used but just for niceness
[19:44:18] (in the pending patch)
[19:44:41] apergos: you maybe aren't seeing the latest patchset
[19:45:58] oh, lemme reload, sorry
[19:46:14] yep fine
[19:47:22] is it only that enterprise subdir that's affected by is_primary_server? Or are there a bunch of other things that I need to sync over?
[19:50:52] I honestly don't remember, best to do a grep
[19:52:47] Hello Andrew, sorry we just noticed the primary should be 1002 and not 1001
[19:54:01] dumps_dist_active_web is currently set to clouddumps1002.wikimedia.org and this is where it should pull from during rsync
[19:54:55] progress will just be "per file" but adding "v" should add more detail
[19:58:48] JustHannah: It'll be in both places eventually but I'll put a rush on 1002
[19:59:18] 'v' doesn't just get the version string?
[20:00:09] V is for version, v is for verbose output
[20:00:50] rsync -avp
[20:02:31] if both are to be the active web servers there may be pieces of the manifests that need to be redone a bit
[20:08:12] if we are going with the current setup the patch should be on clouddumps1002 and not 1001
[20:10:34] apergos: my understanding was that the two hosts were configured to provide all the same services and the only difference is which clients are pointed at them... is that wrong?
[20:10:41] had to do something similar when we temporarily added multiple gerrit/phab/gitlab hosts. before we just had one active and one passive host and they rsynced, then we had to turn the "dest_hosts" into an array and put some "for each" code around it to sync to more than one.
[20:11:08] sounds like you might have the same situation here
[20:11:12] both hosts have a web server on them but only one is serving
[20:15:57] and right now we download from wme to the web server and rsync that copy across to any other hosts
[20:17:48] andrewbogott: ^^
[20:18:24] right, I was thinking clientwise. When you said 'if both are to be the active web servers...' what were you responding to?
[20:19:10] if both clouddumps1001 and 1002 are to be active (i.e. dumps.wikimedia.org is to point to a pool of both) then some things might need to be reworked in the manifests
[20:19:17] you said um
[20:19:30] my understanding was that the two hosts were configured to provide all the same services and the only difference is which clients are pointed at them
[20:19:48] and uh while they both have the web server running on them only one is the actual active web server
[20:19:54] is that more clear?
[20:20:20] So they are both set up to be web servers. At any point we could redirect dumps.wikimedia.org to point to the other and everything would keep working just the same.
[20:20:42] uh well there are some vars you probably need to switch in the hieradata yaml files too
[20:26:11] JustHannah: the 20221001 dump should now be available on cloudstore1002
[20:33:13] confirmed! thank you!
[20:36:03] the 20220920 one as well on both clouddumps, yes?
[20:36:26] andrewbogott: ^^
[20:37:24] yes!
[20:38:13] the 20220920 dump is partial, it's still syncing
[20:38:20] okey dokey
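Pulling those flags together, the full invocation would look something like this: a sketch based on the command quoted earlier in the conversation, with -v added for per-file verbose output and --progress for per-file transfer progress.

    /usr/bin/rsync -av --progress --bwlimit=160000 \
        labstore1006.wikimedia.org::data/xmldatadumps/public/other/enterprise_html/runs \
        /srv/dumps/xmldatadumps/public/other/enterprise_html/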
[20:38:26] do we have any conventions around rsyslog & logrotate, e.g. applications log to their own directory and are rotated by logrotate?
[20:47:09] yea, but every service has its own puppetized logrotate.conf with different contents
[20:47:46] but it does seem to be standard to use logrotate and log to its own dir like you said in the example
[21:46:04] mutante: thanks
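A minimal logrotate sketch of that pattern, i.e. an application writing to its own directory under /var/log with a per-service rotation snippet; the service name, path, and retention values are purely illustrative, not an existing puppetized config.

    /var/log/myservice/*.log {
        daily
        rotate 14
        compress
        delaycompress
        missingok
        notifempty
    }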