[09:50:25] Lunch
[10:14:42] lunch 2
[11:02:04] lunch&relocation
[13:09:57] Errand, back in a few
[13:54:21] volans: I'm not super sure what "In DRY-RUN mode cookbooks don't log" from your comment means: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/723214/comment/5a4533f7_7765343e/ . Should I stop logging anything on screen in dry-run? From what I see, other spicerack modules log at debug level in dry-run
[13:55:11] zpapierski: :facepalm: not sure how that came out
[13:55:18] I meant a completely different thing, my bad
[13:55:32] ahh now I know, I meant !log
[13:55:36] they don't !log to SAL
[13:55:46] ah, not really a worry for spicerack then
[13:56:11] no, just that because they are not !logging and there is no public trace of them or announcement on IRC, they should not modify anything
[13:56:25] yes, dry-run includes debug level logging too
[13:56:25] ok, cool - thx
[13:56:33] so in dry-run debug messages will go to the console
[13:56:55] as for the general approach, if you have something that does:
[13:56:58] - modify something
[13:57:02] - check that it got modified
[13:57:26] it would be nice to not fail in dry-run mode by raising an exception, so that an eventual cookbook using that feature will continue its dry-run
[13:57:30] without failing there
[13:57:43] so you can either return earlier (before the modification)
[13:57:55] or not raise if in dry-run (usually self._dry_run)
[13:58:10] that's up to you and how things are developed there
[13:58:36] zpapierski: to add to one of ge.hel's latest comments, if you want to have explicit parameters you can use this syntax:
[13:59:09] def foo(self, *, a, b):
[13:59:17] or
[13:59:27] def foo(self, *, a=1, b=False): # if you have default values
[14:00:00] that forces keyword arguments; calling foo(5) would raise TypeError: foo() takes 0 positional arguments but 1 was given
[14:01:49] huh, nice trick, thanks
[14:03:30] volans: another thing, this comment: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/723214/11/spicerack/tests/unit/test_kafka.py#69 . I actually test for the message in L74, but your comment suggests some better way of doing that, can you elaborate?
[14:03:42] it's actually part of the grammar, see "parameter" in https://docs.python.org/3/glossary.html and look for "keyword-only"
[14:03:47] there was also a PEP at the time
[14:04:21] zpapierski: yes, pytest.raises(KafkaError, match="foo") will do that for you
[14:04:37] https://docs.pytest.org/en/6.2.x/reference.html#pytest.raises
[14:04:44] can be a regex
[14:04:46] ah, much nicer, thanks :)
[14:05:25] sorry, I missed the assert later, otherwise I could have been more clear in my comment :)
[14:11:22] no problem
[14:28:02] gehel: when/if you're back, I don't understand this comment: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/723214/11/spicerack/kafka.py#50 - can you clear it up a bit for me?
[14:29:57] zpapierski: it looks like there are 2 different levels of abstraction in the current Kafka class.
[14:30:24] ahh, I understand now
[14:30:43] I think - you're having issues with passing KafkaConsumer around?
[14:30:46] Most of the private methods take the same 3 parameters (consumer, topics, site_name), which makes me wonder if there isn't a missing abstraction: the connection to a specific kafka cluster
[14:31:23] probably not kafka cluster, it seems to be a slightly higher level abstraction, not sure how to name it.
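(A self-contained sketch pulling together the two tips from earlier in this exchange: keyword-only parameters and pytest.raises with match. foo and KafkaError here are illustrative stand-ins, not the actual spicerack API:)

    import pytest

    class KafkaError(Exception):  # stand-in for the module's real exception
        pass

    def foo(*, a=1, b=False):  # the bare '*' makes a and b keyword-only
        if b:
            raise KafkaError(f"bad value for a: {a}")
        return a

    def test_foo():
        with pytest.raises(TypeError):  # foo(5): takes 0 positional arguments but 1 was given
            foo(5)
        with pytest.raises(KafkaError, match="bad value"):  # match is treated as a regex
            foo(a=5, b=True)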
[14:31:42] I think all 3 parameters look like configuration to me
[14:32:20] not really, though - two different consumers and sites can participate
[14:32:47] yes, but that would map to 2 different instances of that lower level abstraction
[14:33:33] I'm not sure how that can be a configuration, though
[14:34:07] let me see if I can hack some code to show better what I'm thinking
[14:48:12] zpapierski: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/724443
[14:48:23] And now I see why what I meant does not make sense
[14:50:59] actually, it might still make sense
[14:52:26] I have no idea how to rename that KafkaClient class, which might be an indication that this is pure fabrication and a bad idea
[14:53:46] now I understand (I was thinking you meant something completely different)
[14:54:44] the practical benefit I could see with coupling the kafka consumer and the site name is to always manipulate site-agnostic topic names
[14:54:57] and remove the dangerous line: local_topic = self._get_topic_name(tp.topic[tp.topic.index(".") + 1:])
[14:57:11] I'm not entirely convinced this was a good idea. But having those topics be prefixed with a site is confusing and probably requires some kind of abstraction. Not sure I have the right one here.
[14:57:12] honestly, I'm not a fan of the KafkaClient approach (it doesn't really give you an abstraction, outside handling of a kafka consumer is still necessary), but the coupling of kafka consumer and site_name can probably be done differently, with a data class
[14:58:30] feel free to ignore my comment completely if it gets in the way of getting things done!
[14:58:55] meeting time for me :/
[14:59:21] wait what? gehel and me getting in the way of getting things done... that never happened!™ :-P
[14:59:30] [citation needed]
[15:02:07] * gehel has always encouraged volans to ignore whatever gehel said
[15:02:12] I'll finish up the rest of the comments and get back to that, to see if I have a better idea
[15:06:28] hmm, am I blind or does the timedelta object in Python ignore the existence of milliseconds?
[15:07:08] from the doc: A duration expressing the difference between two date, time, or datetime instances to microsecond resolution.
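(For illustration, one shape the data class suggested above could take. SiteConsumer and its fields are hypothetical names, and kafka-python's KafkaConsumer is an assumption about the consumer type being passed around:)

    from dataclasses import dataclass

    from kafka import KafkaConsumer  # kafka-python; assumed underlying client

    @dataclass(frozen=True)
    class SiteConsumer:
        """Couples a consumer with the site whose prefix its topics carry."""
        consumer: KafkaConsumer
        site: str

        def prefixed(self, topic: str) -> str:
            # Callers deal only in site-agnostic names; prefixing happens here,
            # once, instead of via string surgery on tp.topic as quoted above.
            return f"{self.site}.{topic}"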
[15:07:22] so millis should be supported
[15:07:44] https://docs.python.org/3/library/datetime.html#datetime.timedelta
[15:07:55] https://docs.python.org/3/library/datetime.html#datetime.timedelta
[15:07:59] milliseconds
[15:08:14] but if you have to put 20000 milliseconds, just put 20 seconds :D
[15:09:06] ah, ok - I have no idea why, reading the doc, I thought they go straight from seconds to microseconds
[15:09:22] I must be getting tired
[15:09:39] because the docs have the params ordered in a weird way
[15:09:50] timedelta(days=0, seconds=0, microseconds=0, milliseconds=0, minutes=0, hours=0, weeks=0)
[15:09:59] nope, I was right
[15:10:14] I don't need to specify milliseconds, I need to get milliseconds
[15:11:14] instance attributes are: days, seconds, microseconds
[15:11:33] a.total_seconds() * 1000
[15:11:35] it's a float
[15:11:42] cast to int() if needed
[15:11:58] not sure if that helps or makes it more readable
[15:12:19] this I knew, but it feels weird to replace 20000 with timedelta(seconds=20).seconds * 1000
[15:12:36] or total_seconds(), but you know what I mean
[15:12:38] wasn't me that commented :D
[15:12:51] I know, I'm just thinking out loud
[15:13:09] and finding some builtin python stuff weird
[15:14:12] time and calendar management has always been a pain, there are numerous 3rd party libraries that tried to solve this over the years
[15:14:16] none made it to the stdlib
[15:14:29] and everyone tries to solve the problem from a different angle
[15:14:30] gehel: the right one is MirrorMaker 2
[15:14:33] Java had a similar issue, joda.time was basically the way to go for years
[15:14:39] re topic prefixes
[15:14:46] and then they basically snatched it up into the JDK
[15:14:48] it wasn't available at the time
[15:15:03] and our kafka cluster mirroring model is DC <-> DC
[15:15:10] so prefixes were the only way to avoid infinite replication
[15:15:24] only eqiad .* topics -> codfw cluster, and vice versa
[15:15:31] MirrorMaker 2 abstracts this stuff away
[15:15:54] it also handles offset translation between clusters, which IIUC is what this cookbook is mostly for
[15:16:32] https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0#KIP382:MirrorMaker2.0-ProposedChanges
[15:16:44] great stuff
[15:16:51] some of that spicerack patch is about that, yeah
[15:17:30] if it could precisely map offsets, we wouldn't need to have a DELTA for the translation
[15:17:58] zpapierski: we should really upgrade kafka and mirror maker at some point
[15:18:02] this is the first time anyone has cared tho
[15:18:08] and it would be a fair amount of work
[15:18:15] would <3 if yall would file a ticket asking for it :)
[15:19:05] > Finally, an offset sync topic encodes cluster-to-cluster offset mappings for each topic-partition being replicated.
[15:19:24] > A utility class RemoteClusterUtils will leverage the internal topics described above to assist in computing reachability, inter-cluster lag, and offset translation. It won't be possible to directly translate any given offset, since not all offsets will be captured in the checkpoint stream. But for a given consumer group, it will be possible to find high water marks that consumers can seek() to. This is useful for inter-cluster consumer migration, failover, etc.
[15:22:06] looks interesting, it would simplify this somewhat, but I'm guessing that probably won't be enough to trigger the update - what do you actually need to make that happen, the proper way?
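(Circling back to the timedelta detour above, a quick sanity check of both points made there: the constructor accepts milliseconds, and getting millis out goes through total_seconds():)

    from datetime import timedelta

    # the constructor accepts milliseconds, it just normalizes them away:
    assert timedelta(milliseconds=1500) == timedelta(seconds=1, microseconds=500_000)

    # instance attributes are only days/seconds/microseconds, so to *get*
    # millis out of an arbitrary duration, go through total_seconds() (a float):
    d = timedelta(seconds=20)
    assert int(d.total_seconds() * 1000) == 20000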
[15:22:18] my understanding is that MM2 and offset mappings would allow for active/passive deployments of streaming processors
[15:22:35] this is not our case
[15:23:29] it would work, though - if the consumer group (I guess identified by the same name?) would have a specific offset set based on the original one in the other DC, we could just replicate it throughout the other consumer groups
[15:23:55] \o
[15:24:02] and since MM2 is doing that, that offset would've matched the original one
[15:24:04] o/
[15:24:07] o/
[15:24:57] I think you'd still have to do some manual reassigning of your consumer group offsets
[15:24:58] this works only if you have a single streaming processor
[15:25:06] but, you'd have built-in tooling for the translation
[15:26:00] a single stream processor gives you stronger consistency, but 2 points were missing when we discussed this:
[15:26:14] 1/ MM2 and offsets replication
[15:26:21] 2/ a failover strategy
[16:55:20] looking at the duckduckgo research, from their numbers I get they serve ~730 searches/sec, and if proportional to the traffic we receive (and assuming equivalent behaviors) that could mean that google is serving 73 times more (~53k searches/sec)
[17:16:01] dinner
[18:05:24] dcausse: That seems low. Looks like we're doing around 500 full text searches per sec, so we're in the same ball park as DDG? Or am I misunderstanding what you're saying?
[18:11:38] Oh, that's only the searches relevant to Wikipedia. Makes more sense. Still, we're not doing too bad if we can compare ourselves to DDG!
[18:12:12] gehel: depends how we look at it, i think we are something like 100 UI reqs/s and 400 automated reqs/s for fulltext
[18:12:37] ddg and google tend to ban the automated reqs :)
[18:13:54] yeah, never a perfect comparison. Still, I'm quite happy to be in the same ball park as DDG!
[18:14:18] but indeed, 730 is lower than i would have guessed. And not that far from our own infra. Reminds me of meeting an algolia engineer, they sell search/autocomplete as a service and he thought that their entire infra still wouldn't be able to serve our search numbers
[18:14:59] Wow! We're doing something right! We need to keep those numbers to brag about them at least a little bit!
[18:16:28] :)
[18:32:02] gehel: ebernhardson: IRC or google meet to go over the wcqs patches?
[18:32:09] I'm in the meeting but gonna grab water real quick, back in a min
[18:32:17] I'm in meet
[18:32:23] coming, sec
[18:32:27] sorry, got distracted by some budget stuff :(
[18:56:04] i suppose for random fun numbers, hive reports 935,695,381 full_text queries for the month of july. Call it 10B/year
[19:02:54] my brain can't make any sense of requests per year.
[19:03:11] 10B looks like a big number, but a year is a long time
[19:07:29] DDG claims 23B/y, algolia claims 1.5T/year! but they say "search operations", not searches :)
[19:24:13] hmm, 1.5T is a pretty big number, works out to ~50k/sec. Wonder what they consider a search operation
[19:31:20] i suppose there's some unexpected other value in the count, we did 200M GeoData_spatial_search requests in july, something must be using it more
[19:45:02] ebernhardson: Just merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/717630 and running puppet on `miscweb*` now...will there be a way for me to verify the UI is up properly, without the trafficserver stuff in place?
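(The back-of-the-envelope rates traded above, worked out; all figures are the ones quoted in the chat, not new measurements:)

    SECONDS_PER_YEAR = 365 * 24 * 3600     # ~31.5M

    full_text_july = 935_695_381           # hive count for july
    print(full_text_july * 12 / 1e9)       # ~11.2 -> "call it 10B/year"

    print(23e9 / SECONDS_PER_YEAR)         # DDG's 23B/y -> ~730 searches/sec

    print(1.5e12 / SECONDS_PER_YEAR)       # algolia's 1.5T/y -> ~47.6k ops/sec, i.e. ~50k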
[19:45:57] I feel like I'd have to start mucking with the resolv.conf on miscweb so I could curl `commons-query.wikimedia.org`, which would be a bit gross
[19:46:20] (it should be fine if we don't have a way to check the UI at this stage, but just wondering)
[19:47:51] ryankemper: curl can do it, sec
[19:48:39] ryankemper: something like: curl --resolve 'commons-query.wikimedia.org:443:localhost' --insecure https://commons-query.wikimedia.org/
[19:49:12] ebernhardson: oh awesome, makes sense that a million people would have had this exact same problem before :P I'll give it a spin
[19:49:37] https://www.irccloud.com/pastebin/6RlXDNP8/
[19:50:03] Looks like it can't resolve...I was assuming that the `--resolve` was telling it to mock the resolution or something, guess it's time to read the manpage
[19:50:16] hmm, --resolve should do the trick. Sec, lemme find an actual line i've used before in my bash history
[19:51:28] this works on wdqs1010, maybe it doesn't like 'localhost' and wants an IP to resolve to: curl --resolve 'query-preview.wikidata.org:443:127.0.0.1' --output - https://query-preview.wikidata.org/readiness-probe
[19:52:08] > Provide a custom address for a specific host and port pair. Using this, you can make the curl requests(s) use a specified address and prevent the otherwise normally resolved address to be used. Consider it a sort of /etc/hosts alternative provided on the command line.
[19:52:18] Yeah, so it does do what I was assuming it did...I'll try the 127.0.0.1
[19:53:30] ebernhardson: nice, that works
[19:55:04] ebernhardson: okay so just for an update, gonna eat some food real quick and then deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/720801, then around 1:30 PT legoktm and I are pairing on our respective trafficserver backend changes, so after that we should be able to verify that everything looks right before doing the final DNS work
[19:55:32] awesome
[21:04:31] ryankemper: i just went the easy route and registered the oauth under my acct, will need a puppet private patch similar to the labs/private patch with the new values (this should be a private paste...at least i hope :): https://phabricator.wikimedia.org/P17341
[21:21:31] ebernhardson: I can't see it from a private browser so seems private enough to me! :P (ofc I didn't try logging in with a different account)
[21:21:51] ebernhardson: nginx patch and the trafficserver stuff is merged, so we should be able to do our internal testing now
[21:23:22] cool, looking to see what works
[21:27:40] getting the site from webserver-misc-apps.discovery.wmnet, same from wcqs1001. Hitting the public side with a crafted request is still getting 502 from ATS, although maybe I have to wait on puppet there
[21:54:28] hmm, no, i see from your message in -ops that the puppet run on trafficserver is complete. hmm.
[22:05:06] ryankemper: we must have missed something related to *.discovery.wmnet, it doesn't resolve to the local svc endpoint. Looking into how that's supposed to work
[22:09:34] is the service pooled in conftool?
[22:10:29] {"eqiad": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=wcqs"}
[22:10:29] {"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=wcqs"}
[22:10:40] ebernhardson: ^
[22:11:19] legoktm: ahh, that would do it. Could you pool wcqs1001?
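(As an aside, what curl --resolve does above can be reproduced in plain Python if curl isn't handy: connect to a fixed IP but present the real hostname via SNI and the Host header. A sketch only; verification is disabled here to mirror --insecure for a not-yet-live cert:)

    import socket
    import ssl

    HOST = "commons-query.wikimedia.org"  # name sent via SNI and the Host header
    ADDR = "127.0.0.1"                    # where the TCP connection actually goes

    ctx = ssl.create_default_context()
    ctx.check_hostname = False            # the --insecure part
    ctx.verify_mode = ssl.CERT_NONE

    with socket.create_connection((ADDR, 443)) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            tls.sendall(f"GET / HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n".encode())
            print(tls.recv(4096).decode(errors="replace"))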
[22:13:24] AIUI for discovery DNS I can just pool eqiad and/or codfw
[22:13:25] and then at the service level I can pool individual hosts
[22:13:55] nothing actually works yet (no proper data load), should be safe to pool them all for now
[22:14:13] all 6 wcqs servers are pooled for service=wcqs
[22:14:37] ebernhardson: just to clarify, you want me to pool eqiad and codfw?
[22:14:51] legoktm: yes, please
[22:15:34] wcqs/eqiad: pooled changed False => True
[22:15:34] wcqs/codfw: pooled changed False => True
[22:16:31] legoktm: thanks, checking
[22:16:38] (for future reference, you can see pool state for discovery DNS at https://config-master.wikimedia.org/discovery/discovery-basic.yaml and individual servers at e.g. https://config-master.wikimedia.org/pybal/eqiad/wcqs)
[22:17:23] that's nifty, thanks
[22:18:18] hmm, ATS still says cannot find server though, can wait a few before checking
[22:20:58] what's the URL? and what if you hit the discovery.wmnet hostname directly?
[22:24:50] legoktm: wcqs.discovery.wmnet doesn't resolve. Public DNS isn't live yet, i'm testing with: curl --resolve 'commons-query.wikimedia.org:443:198.35.26.96' https://commons-query.wikimedia.org/
[22:25:00] mostly that just tells curl to resolve it to text-lb in ulsfo
[22:25:43] i'm checking over the stuff from the wikitech dns/discovery docs, the list isn't too long but I might have missed something (maybe the authdns-update at the end?)
[22:26:29] there's no undeployed ops/dns changes
[22:28:21] right, only internal endpoints are supposed to have dns right now. Once it all works we'll put the public dns up.
[22:29:07] i'll just see what you did for toolhub that we didn't, probably something :)
[22:29:12] oh, I think I see what's missing
[22:29:42] https://gerrit.wikimedia.org/g/operations/dns/+/e1e708d9116bcbf09ce161f3e6d1f03ff6be2420/templates/wmnet#602 no entry for wcqs
[22:31:13] https://wikitech.wikimedia.org/wiki/LVS#Add_the_dns_discovery_record is the docs
[22:34:13] good find, i think https://gerrit.wikimedia.org/r/c/operations/dns/+/724520 should cover it
[22:35:06] (back now)
[22:36:15] cool, I can leave the DNS change for you then? :)
[22:36:21] legoktm: yes, thanks for your help!
[22:36:47] anytime :)
[22:38:37] ebernhardson: we'll need to circle back and do https://wikitech.wikimedia.org/wiki/LVS#Add_monitoring_to_the_load-balanced_services (basically change state to `production` and run puppet), since that step is right before https://wikitech.wikimedia.org/wiki/LVS#Add_the_dns_discovery_record which is what we're doing rn
[22:42:33] ryankemper: ahh, ok let's do that
[22:44:38] it shouldn't be a blocker on this dns part though
[22:44:57] also actually we need to go to `monitoring_setup` first and then `production`
[22:46:07] I'll be back in ~20 mins to deploy the dns change, handyman just showed up and I need to show em what needs fixing
[22:47:53] kk
[23:09:48] okay back
[23:10:49] alrighty, sounds like monitoring is up next
[23:11:11] i think that just enables pinging the endpoint we configured, those should all be returning 200.
will find out :)
[23:12:00] merging now, will run this procedure https://wikitech.wikimedia.org/wiki/LVS#For_both_active/active_and_active/passive
[23:17:40] ebernhardson: https://phabricator.wikimedia.org/T282117#7386402 must be missing something
[23:17:54] hmm
[23:18:14] ebernhardson: weirdly though, when I grep for `disc-wdqs` I only see the `utils/mock_etc/discovery-geo-resources`, and we do have a corresponding entry for wcqs
[23:19:11] oh, and note that I moved the ordering of `wcqs` to put it right by the `wdqs` entries in `templates/wmnet`, but I'd be shocked if the ordering mattered in that way
[23:19:35] ryankemper: i suppose those mocks are supposed to be mimicking something that comes from etcd, probably a related step in there somewhere
[23:19:59] That must be it... lemme make sure that pybal looks how it should
[23:21:15] This looks right, here's echostore vs ours:
[23:21:17] https://www.irccloud.com/pastebin/Bm8o7YGL/
[23:21:46] and wdqs for completeness' sake https://www.irccloud.com/pastebin/MGBiWF6Q/
[23:21:47] ryankemper: reading the lvs page, we might just need a few more steps. it has 'Make the service page, add discovery resources' and under that 'Change the state of your service to production', so we might have to move the lvs part forward first
[23:22:48] ebernhardson: yeah, it must be the `production` part, since that has the step of running puppet on auth-dns servers
[23:22:50] okay, that makes sense
[23:23:15] Okay, I'll knock the add monitoring step out first
[23:31:04] ebernhardson: https://gerrit.wikimedia.org/r/c/operations/puppet/+/724533
[23:31:38] ah, I totally ran PCC on the wrong selector since no change was detected
[23:33:50] Actually, for https://gerrit.wikimedia.org/r/c/operations/puppet/+/723254 it was `lvs2021.codfw.wmnet`, which is an `lvs::balancer` that had a change, so not sure there; anyway, this change doesn't actually need PCC
[23:34:06] `lvs2010`*
[23:57:02] State change into `monitoring_setup` looks good, next up `production`: https://gerrit.wikimedia.org/r/c/operations/puppet/+/724536