[01:10:11] https://www.irccloud.com/pastebin/tzzILsif
[01:10:30] Request from via cp1084 cp1084, Varnish XID 183020208
[01:10:30] Upstream caches: cp1084 int
[01:10:30] Error: 429, Too Many Requests at Fri, 15 Jul 2022 01:05:54 GMT
[01:11:31] I'm getting this message rn
[01:22:02] albertoleoncio: yea, there are a couple tickets about 429 errors. it can happen for example on pages that have a lot of thumbnails. https://phabricator.wikimedia.org/search/query/h8t6nOpM1trz/ should be a saved query to find open tickets with "429" in the title.
[08:21:31] does (should) pybal handle failed servers? i.e. if I have a bunch of pooled servers and one of them goes pop, should I expect pybal to notice and stop trying to send traffic to it (after a while)?
[08:34:36] Emperor: if the probes work then servers should auto depool I believe but there's also a threshold for servers marked down but not auto depool
[08:34:43] Icinga should alert though if that happens
[08:38:08] A very quick check of the docs does not help in finding how to actually set probes
[08:40:13] I found https://github.com/wikimedia/puppet/blob/production/hieradata/common/service.yaml
[10:14:02] Emperor: pybal healthchecks each server quite often.. and if you have an IdleConnection monitor configured as soon as the server closes that socket pybal will depool the server
[14:58:53] would anyone be able to help with a +2 for a puppet patch to make beta CI not cry until monday
[15:01:33] RhinosF1 I can look
[15:01:57] inflatador: https://gerrit.wikimedia.org/r/c/operations/puppet/+/814134
[15:04:21] RhinosF1 +2'd
[15:04:36] inflatador: you did V+2 not C+2 I think
[15:04:51] will need merge and then I'll make sure we check working
[15:05:52] RhinosF1 Just added code review +2 , is that what you needed?
(Sorry, still kinda new at the gerrit thing)
[15:06:24] inflatador: press submit too and you'll need to puppet-merge in prod (doesn't touch anything but beta)
[15:09:55] ACK
[15:10:55] RhinosF1 OK, merged/puppet-merged. LMK if it looks right
[15:12:48] ack
[15:20:51] inflatador: can you do https://gerrit.wikimedia.org/r/814135
[15:22:56] RhinosF1 looking
[15:24:02] OK, merged/puppet-merged
[15:24:10] ty
[15:25:21] My blogpost about the new Marseille POP is out! https://techblog.wikimedia.org/2022/07/15/building-dreamers-how-and-why-we-opened-a-datacenter-in-france/
[15:25:50] i like dreamers as a name for it
[15:27:35] XioNoX: <3
[15:29:08] very nice! Side question, what program do y'all prefer for making diagrams?
[15:29:38] RhinosF1: I think this nickname is from rzl
[15:29:53] inflatador: I like diagram.net
[15:30:32] inflatador: This is a useful page: https://wikitech.wikimedia.org/wiki/Performance/Runbook/diagrams.net_conventions - I also like Libreoffice Draw.
[15:31:22] XioNoX: nice blog!
[15:31:36] diagram.net gives a cloudflare access denied page ;(
[15:31:52] btullis thanks, bookmarked
[15:32:34] inflatador: oops, diagrams.net not diagram.net
[15:33:20] Ah! Much better
[15:35:06] excellent blog post!
[15:35:13] inflatador: testing now, might take ~10 minutes
[15:35:48] very nice post XioNoX
[15:42:42] I agree, that's excellent XioNoX
[15:44:57] inflatador: that worked
[15:45:13] {◕ ◡ ◕}
[15:46:10] Have a good weekend!
[16:00:45] "My blogpost about the new..." <- XioNoX: amazing article!! Thank you very much for sharing your knowledge.
[16:01:20] * denisse|m wonders if we could open a data center for LATAM visitors
[16:29:56] denisse|m: I'm pretty sure that the Traffic team has discussed the need for a LATAM caching center before. When you look at the map on https://wikitech.wikimedia.org/wiki/Global_traffic_routing you can see that South America and Africa are both less served than other areas in the current edge node spread.
[16:30:52] On average we have added a new POP every 3 years since I got here. The constraints are money and people to do the work as always.
[16:37:08] yes :)
[16:37:46] bd808: Hopefully we could have a datacenter in South America someday. <3
[16:38:16] the most-frequently cited locations for future expansions that haven't happened yet are basically South America (probably close to Brazil-ish) and somewhere near the northwest-ish of India, in terms of what would bring the most benefits.
[16:38:52] a lot of other factors go in, about how the internet works and legalities, etc
[16:40:18] I agree, I would love to have a datacenter in Mexico (my home country) but Brasil is definitely a better choice as there may not be so much difference between serving visitors from Mexico and from the US.
[16:40:47] shortly before Pandemic Times began, our conversations were around the LATAM one being the priority in terms of reaching users. Marseille ended up taking priority for other reasons related more to resiliency (Amsterdam serves all of EMEA and is very overloaded / single-point-of-failure)
[16:41:36] Mexico hopefully gets somewhat-reasonable times from our Dallas, TX edge!
[16:42:28] (or san francisco might be better, if you're on the far left side of MX)
[16:47:53] Yes, looking at the request I notice I'm being served from ulsfo and I'm based in Guadalajara, Mexico.
[16:48:25] Thanks for sharing, the new data center is very exciting!! <3
[16:52:45] does anyone ever use the query function in puppetboard? https://puppetboard.wikimedia.org/query ? I keep getting bad request, I'm guessing my query isn't formatted properly
[16:54:18] trying examples listed at https://puppet.com/docs/puppetdb/7/api/query/examples-pql.html with no luck so far
[16:59:05] inflatador: no pql support (IIRC it was disabled because if enabled it could allow querying for private data that we don't want to expose via web). You can use puppetdb query syntax though
[16:59:09] what are you trying to achieve?
[16:59:36] also only a few API endpoints are allowed in the web UI
[17:00:08] volans we're looking for elastic hosts that have 10G nics, but are connected via the on-board 1G NICs
[17:01:16] from facter, I think we'd look for the presence of eno1 with speed !=-1 (plugged in) and then the presence of external NICs. Maybe won't get us everything, but close
[17:01:51] is this a one off? or something that you need to integrate into something else?
[17:01:57] one-off
[17:03:09] is this the puppetdbquery syntax? If so, should be able to work thru it https://forge.puppet.com/modules/dalen/puppetdbquery
[17:03:51] no, it's the syntax of the APIs, besides PQL, things like ["=", "certname", "sretest1001.eqiad.wmnet"]
[17:03:56] it's quite horrible tbh
[17:04:10] ;P
[17:04:17] https://puppet.com/docs/puppetdb/5.2/api/query/tutorial.html
[17:04:45] from the cumin hosts if you want you can query puppetdb via curl and then play with the json, but I think there are better alternatives, give me a sec
[17:05:17] totally open to suggestions
[17:21:18] inflatador: this might help you https://phabricator.wikimedia.org/P8744
[17:23:27] for oddball queries I've found it easier to do my own stuff with `jq` over the catalogs in question rather than use the PuppetDB query syntax
[17:23:50] (although I have sometimes managed to do the latter -- but usually not from the puppetboard query interface, but rather with a curl to localhost on puppetdb1002)
[17:28:29] with something like this you can easily find at which speed the hosts are running:
[17:28:33] sudo cumin 'A:elastic and A:codfw' 'facter -p -j net_driver | jq ".net_driver[] | select(.speed != -1) | .speed"'
[17:29:55] then with this you can easily get the names of the ifaces and from that easily deduce those that have additional NICs:
[17:29:58] sudo cumin 'A:elastic and A:codfw and P{elastic2044.codfw.wmnet}' 'facter -p -j net_driver | jq ".net_driver | keys"'
[17:30:38] inflatador: you can run the latter on just the subset of the
first run, just copy-pasting the list of hosts you're interested in
[17:31:21] like in codfw it doesn't seem to be any, all 10 hosts running a 1000 have eno[1-4] as NICs only
[17:33:39] thanks volans , looks pretty clean!
[20:00:41] cdanis: btw I meant to say, thanks for splitting the klaxon change into so many small commits, made the reviews really easy
[20:08:48] `git add -p` is my friend
[20:39:28] PSA: https://klaxon.wikimedia.org/ now displays the current business hours oncallers (if any)
[20:43:35] cdanis: I'm either blind or it's not showing
[20:43:50] right under the red "wake up an SRE" button
[20:43:53] ^^
[20:44:26] that, and additionally, it will not show up when there's no dedicated oncall person
[20:44:33] I don't see anything
[20:45:08] not the button in the upper right, the one further down the page
[20:45:22] oops yes, that
[20:45:23] Oh ye
[20:45:26] hm, perhaps I'll make it appear both places
[20:59:17] https://i.imgur.com/kDaRrga.png
[21:00:07] Looks good
[21:00:18] if you didn't want to show it twice, the alternative is to place it below "recent alerts" -- that's further from the relevant button, but connects it with the other status display
[21:00:21] either way works
[21:00:39] I don't mind showing it twice; I think most of the time the second instance is below the fold anyway
[21:00:43] nod
[21:02:13] shipped as soon as puppet runs on alert1001
[21:55:13] Looks good cdanis
[21:55:19] Have a good weekend everyone!
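
Note on the 17:28-17:31 NIC question: the two jq filters from the cumin commands can be mirrored in Python for anyone who saves the `facter -p -j net_driver` JSON and wants to post-process it offline. This is only a rough sketch. The sample data and the interface name `enp59s0f0` are made up, the helper names are hypothetical, and the only things taken from the chat are the `net_driver` fact, its per-interface `speed` field, the `-1` "no link" value, and the `eno` prefix for on-board ports.

```python
import json

# Hypothetical sample of `facter -p -j net_driver` output. The real fact may
# carry more fields per interface; only `speed` is taken from the chat.
SAMPLE = json.loads("""
{
  "net_driver": {
    "eno1": {"speed": 1000},
    "eno2": {"speed": -1},
    "enp59s0f0": {"speed": -1}
  }
}
""")


def linked_speeds(facts):
    """Mirror of the jq filter: .net_driver[] | select(.speed != -1) | .speed"""
    return [nic["speed"] for nic in facts["net_driver"].values()
            if nic["speed"] != -1]


def interface_names(facts):
    """Mirror of the jq filter: .net_driver | keys (jq returns sorted keys)."""
    return sorted(facts["net_driver"])


def on_onboard_only(facts, onboard_prefix="eno"):
    """True if the host has an extra (non on-board) NIC, yet every interface
    that reports a link speed is an on-board one."""
    nics = facts["net_driver"]
    has_extra = any(not name.startswith(onboard_prefix) for name in nics)
    linked = [name for name, nic in nics.items() if nic["speed"] != -1]
    return (has_extra and bool(linked)
            and all(name.startswith(onboard_prefix) for name in linked))


if __name__ == "__main__":
    print(linked_speeds(SAMPLE))
    print(interface_names(SAMPLE))
    print(on_onboard_only(SAMPLE))
```

`on_onboard_only` just combines the two steps described in the chat (find linked interfaces, then check for additional NICs); whether the `eno` prefix reliably identifies on-board ports on every host is an assumption that would need checking against real facts.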