[07:31:00] gehel: I think I fixed the issue with dragging tabs out of Chrome and probably Chromium, at least for me - Chrome has a setting "Use system title bar and borders", which apparently needs to be turned off [07:39:47] Nice ! [07:42:31] hello, FYI prometheus can't talk to jmx_wdqs_streaming_updater anymore, expected ? [07:45:37] godog: for all servers ? [07:46:15] No, not expected. But we are doing a data reload, so there might be unexpected consequences [07:47:14] that's correct gehel [07:47:39] ah ok, when did the data reload start? this has been going on for a while [07:48:28] since october 1st to be exact, I'm looking at https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1&from=1633033325113&to=1634111279957 [07:50:21] anyways I'll file a task [07:51:02] seems it's only a few servers [07:51:38] ah I think I see what's going on, wdqs-streaming-updater vs wdqs-updater [07:51:44] wdqs1009 and wdqs2008 are doing the data reload. This **should** not prevent prometheus from collecting data [07:52:13] Oh, yes, the updater is down on those, but blazegraph is up, so the updater metrics will not be collected [07:52:55] The others are wcqs, those are a new service, not fully configured yet [07:53:02] looks like everything is expected [07:54:08] ah I see, that explains it indeed [07:54:31] thanks gehel ! [07:55:11] thanks to you for checking! [08:35:52] sigh... ovh... [08:43:46] I'm sure I've asked, but why not irccloud? [08:46:23] I don't know... I probably did not know it was available at the time [08:47:23] or perhaps it was not free (paid by wmf) [08:48:35] I usually connect to the irc server directly when ovh is down but since the libera.chat transition I guess I was too lazy to set this up [08:49:09] and did not expect the downtime to be so long, 1h30... [08:51:31] I think element/matrix bridging is another option, I haven't tried it though [08:51:44] and I think it's only for some wmf channels [08:53:12] also, in irccloud you'll get to read those code snippets we've been sometimes sending directly on your screen :) [08:59:15] ah interesting but I'm not ready to give up on weechat :) [09:25:13] now that's devotion :) [09:50:39] zpapierski: Emmanuel has some issues with his MacBook. I know nothing about Mac, but maybe you can help ? [09:50:55] of course [09:51:12] I was waiting for him to appear, actually, to sync up [09:52:04] I don't think he has irccloud yet, maybe you can help him get that set up as well [09:52:20] Having a backlog of conversation would help ! [09:53:34] actually, let me send a request to techsupport for irccloud access [09:54:50] done (irccloud) [09:55:54] I've sent Emmanuel a link to meet, can't seem to find him here or on Slack [09:59:20] He seems to be offline on Slack as well. We probably need to figure out a bit more how his connectivity is working :/ [09:59:30] yep [10:02:09] filed T293195, will probably move it to current work without waiting for sprint planning next week [10:02:09] T293195: Add MCR slot information to revision-create events - https://phabricator.wikimedia.org/T293195 [10:02:24] related to wcqs [10:02:28] lunch [10:02:44] lunch too [10:02:59] and I'm on a diet, so I'm staying here [10:30:17] gehel: let me know once you're here - we could take care of that cookbook [10:30:46] Will do [10:44:59] break [10:48:10] Emmanuel: we have an unmeeting this evening. You can ask David or Zbyszko to tell you what it is. [10:49:22] zpapierski told me already [11:05:01] WDQS just paged, anyone around by any chance?
[11:20:37] I'm in a restaurant with France. No laptop, only smartphone [11:20:51] dcausse, zpapierski around by any chance ? [11:21:51] Oh, looks like it recovered already (looking at icinga in -operations) [11:36:44] Sorry, was away from computer/phone [11:37:50] Do only SREs have pool/depool permissions (for the future)? [12:13:53] zpapierski: you do have depool rights on wdqs: [12:13:55] gehel@wdqs1003:~$ sudo -u zpapierski sudo -l | grep pool [12:13:56] (root) NOPASSWD: /usr/local/bin/depool [12:13:56] (root) NOPASSWD: /usr/local/bin/pool [12:14:12] huh? interesting, I remember trying to depool and failing [12:14:20] maybe that changed [12:14:30] or I did something wrong [12:14:33] or didn't sudo [12:14:38] (don't remember) [12:14:49] anyway - gehel , you're back and ready? [12:15:02] you do need `sudo -i`, the keys are in the root home dir [12:15:08] `sudo -i depool` [12:15:57] also this wouldn't have helped in this instance, servers were marked down but still pooled, probably due to depool limits in pybal [12:16:04] I definitely didn't do that [12:16:37] blazegraph was probably unresponsive and pybal was trying to depool them but failing because too many were unresponsive [12:17:02] it recovered after 2 minutes [12:17:20] not sure what that was. Probably an update batch that was too large and blocking the reads for too long? [12:17:39] hopefully this is going to work better with the streaming updater [12:17:45] zpapierski: yes, I'm around [12:18:14] it should; currently batches aren't super comparable, patches from the streaming updater should be more uniform (due to them being already reconciled) [12:18:31] and of course, the update process is way more straightforward for blazegraph [12:18:52] * gehel keeps his fingers crossed [12:19:23] I'm sure it will be better, I'm wondering how it will affect the long-term stability of it all [12:20:59] gehel: I'm having my meal in 10 min, should be free by 3PM CEST - is it ok for the cookbook test? [12:21:15] sure [12:21:28] great, I'll ping you then, then [12:22:09] ryankemper: reminder: we have our ITC scheduled for this evening (my time) [12:22:16] dcausse: same thing for tomorrow [12:27:18] zpapierski: I have a meeting at 15:30-16:00, but we can get started and I can join you back afterward [12:27:51] sure, I hoped we'd be done before 15:30, actually [12:57:37] gehel: I'm ready, are you? [12:57:40] if so - meet.google.com/sty-biak-qgr [13:14:53] volans: how can you run a cookbook in dry run mode? doc (https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks) says -d, but it doesn't seem to work [13:15:14] zpapierski: that's a cookbook global option [13:15:17] see sudo cookbook -h [13:15:34] so cookbook -d sre.foo.bar --cookbook-specific-options [13:15:58] ok, I think we know where we made a mistake, thanks [13:24:13] reload of main dump completed on wdqs2008, starting lexeme reload with `/srv/deployment/wdqs/wdqs/loadData.sh -n wdq -d /srv/wdqs/lex-munged/` [13:24:23] thanks! [13:24:52] wdqs1009 still loading: chunk 940 / 982 [13:41:10] lexeme data reload completed on wdqs2008, still some cleanup required [13:41:29] gehel: what kind of cleanup? [13:41:55] restarting the updater, touching /srv/wdqs/data_reload, etc... [13:42:23] gehel: ok, give me a sec to double check everything before moving forward [13:43:28] sure [13:52:33] gehel: I think we were almost there - doc says that netbox_server requires "the hostname (not FQDN)" [13:52:41] so, e.g. wdqs1001? [13:52:56] hm.. interesting, consumer offsets have disappeared in kafka [13:53:16] hmm, was it us?
[13:53:20] ottomata: is there a process that purges unused consumer group offsets? [13:53:35] it shouldn't be us, we were unable to instantiate the kafka module [13:53:45] I think they are only kept for some time (a week?) [13:53:56] I set up the consumer offsets for wdqs1009 & wdqs2008 last week in preparation for the updater start [13:54:03] which is probably lower than the time it took us for the data reload [13:54:06] ah that would explain it [13:54:15] or we messed up something with our tests with zpapierski [13:54:36] I don't think so, like I said we never instantiated the kafka module [13:54:43] resetting them [13:54:44] we probably need to reconfigure the updater on wdqs2008 to use the new stream [13:55:04] dcausse: i think they have the default retention policy [13:55:06] 7 days? [13:55:20] meeting in 5', but I can move forward on finishing the data reload on wdqs2008 in ~30' [13:55:21] ottomata: that would explain what I see [13:55:22] but yeah they should def be longer [13:55:26] i think we can expand it [13:55:29] and probably should [13:55:42] ok, thanks! [13:56:54] gehel: I think wdqs2008 is configured to use the streaming updater (double checking) [14:02:26] yes the systemd unit is running: /srv/deployment/wdqs/wdqs/runStreamingUpdater.sh -- --brokers kafka-main2001.codfw.wmnet:9092,kafka-main2002.codfw.wmnet:9092,kafka-main2003.codfw.wmnet:9092,kafka-main2004.codfw.wmnet:9092,kafka-main2005.codfw.wmnet:9092 --consumerGroup wdqs2008 --topic codfw.rdf-streaming-updater.mutation [14:02:30] so should be good [14:03:26] starting it there [14:04:21] gehel: I have a hack for the data_transfer cookbook, if you have time before office hours; if not, we can do it tomorrow [14:04:46] zpapierski: probably tomorrow, a few things I still absolutely need to finish today [14:04:51] sure [14:05:32] volans: is there a version or something similar to netbox_server that accepts a FQDN? [14:06:19] zpapierski: what do you need to do? [14:06:30] or better, what do you have, what do you want to get :) [14:06:49] get that site thing you recommended here: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/727021/comment/306aa092_7987ada4/ [14:07:09] I have a FQDN and I can do a per dot split of course, I'm just wondering if I need to [14:07:42] yes, netbox works with short hostnames, not FQDNs [14:09:13] so I have a FQDN and I need the site - if I need to do a per dot split of the FQDN I might as well go with split[1] and be done with it :) [14:09:34] yep, host.split('.')[0] is what you want [14:09:42] (unless we have dots in short hostnames, then I'm not sure what to do) [14:09:52] you can't, that would be a subdomain [14:10:00] right [14:10:10] so why not go with host.split('.')[1] for site? [14:10:19] is there any scenario where that won't work? [14:10:22] because we have *.wikimedia.org hosts [14:10:31] all hosts with public IPs [14:10:40] I see, but that's not really our case [14:11:30] netbox is the source of truth for the physical location of hosts, that said that would in practice work too, not sure how future-proof [14:12:13] lag is starting to catch up on wdqs2008 [14:12:14] ok, I will use netbox, I actually already have host.split(".")[0] there anyway for the consumer name [14:13:16] cool, it should catch up by tomorrow, right? [14:14:43] yes I guess at least 24h to catch up on the 15 days of updates [14:15:26] we don't really need to wait for it to catch up before the data transfer, do we?
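(For reference, a minimal sketch of the FQDN handling discussed above, in plain Python. The hostnames are illustrative, and this only shows the string logic, not the actual Spicerack netbox_server call; it's meant to show why volans steers the cookbook toward Netbox for the site rather than a split-based heuristic.)

```python
# Sketch of the cookbook discussion above: Netbox wants the short hostname,
# and split('.')[1] is not a safe way to derive the site.

def short_hostname(fqdn: str) -> str:
    """Netbox works with short hostnames, not FQDNs."""
    return fqdn.split('.')[0]

def naive_site(fqdn: str) -> str:
    """Fragile: assumes <host>.<site>.wmnet, which hosts with
    public IPs (*.wikimedia.org) do not follow."""
    return fqdn.split('.')[1]

print(short_hostname('wdqs1009.eqiad.wmnet'))  # wdqs1009
print(naive_site('wdqs1009.eqiad.wmnet'))      # eqiad -- happens to work
print(naive_site('ns0.wikimedia.org'))         # wikimedia -- not a site!
```

The split heuristic works in practice for internal .wmnet hosts, but Netbox as the source of truth for physical location is the future-proof option.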
[14:16:40] since you need to have an up-to-date journal before repooling I guess it's better, but not strictly required indeed [14:22:58] dcausse: where are we on the metrics / alerting for the streaming updater? [14:23:09] T276467 [14:23:09] T276467: Ensure we have proper monitoring / alerting on the new Flink based WDQS Streaming Updater - https://phabricator.wikimedia.org/T276467 [14:23:17] gehel: it's done [14:23:36] flink is alerting on wikimedia-operations [14:23:59] and we use the same metric for individual wdqs server lag [14:24:07] so same alert [14:24:48] dashboards are https://grafana-rw.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater & https://grafana-rw.wikimedia.org/d/gCFgfpG7k/flink-session-cluster?orgId=1 [14:38:50] dcausse: can I restart the updater on wdqs2008? and clean up the dumps and munged files? [14:39:07] gehel: I did it already (not the cleanup) [14:39:15] Oh cool! Thanks! [14:39:41] Happy enough with the current state? Can I clean up the dumps + munged? [14:44:20] Emmanuel: how is your day going? Need help with anything? [14:44:31] Did you manage to resurrect your laptop? [14:45:49] I sent it back to the supplier [14:46:46] I'll get feedback by tomorrow [14:48:06] :( [14:50:06] Emmanuel: we have our public office hours in 10'. This is the time when anyone can join us and ask any question they want [14:50:27] We usually get a few regular people who want to chat either about Search or Wikidata Query Service [14:50:52] I'm tired too [14:50:56] That's also a good occasion for you to learn a bit more about what we do and how we interact with our communities [14:51:17] Feel free to skip if your day was already long enough! [14:51:29] I was excited yesternight about getting the work system [14:51:55] Now, I just feel disappointed [14:52:13] I'll join the office hours [14:52:35] disappointed about what? the laptop? You'll get it running at some point! [14:52:50] Yes [14:53:07] I just want to get started [14:54:25] yeah, I can understand that! Hopefully you'll get the laptop back tomorrow and can work with Zbyszko to get a dev environment ready [14:54:49] Aiit thanks [14:59:07] gehel: sorry, was distracted, I'm happy with the current state! [15:00:26] dumps cleaned up, wdqs2008 just needs to catch up on lag [15:00:38] Office Hours are now, and we have a couple of guests! [15:51:56] ryankemper: fyi, I had a chat with SDineshKumar a few days ago and he is trying to help us move T278378 forward. You might want to chat with him at some point! [15:51:57] T278378: Pull Elasticsearch config out of Spicerack - https://phabricator.wikimedia.org/T278378 [15:52:50] gehel: awesome! [15:54:28] :) [17:35:09] * ebernhardson goes back to guessing how maven works [18:37:02] ebernhardson: ping me if you need [18:37:16] * gehel tends to know more about Maven than he really should [18:38:10] Do we track the most searched pages? There is a question about it in the #no-stupid-questions slack channel [18:46:37] gehel: well, it "works" but i suspect someone will complain in review :P We've excluded commons-logging from everywhere, so attempting to use the apache http lib in mw-oauth-proxy fails with NoClassDefFound.
What i've done for now is hardcode a version of commons-logging into the mw-oauth-proxy pom.xml [18:47:00] but i'm sure i'll find out that that can't be loaded into jetty because blazegraph already loaded something else :P [18:48:56] gehel: we could calculate metrics on pages most clicked through, or most often returned; we have the underlying data but don't have any particularly refined data about it [18:49:41] just about anyone with superset access could calculate full_text from the discovery.query_clicks_daily table fairly easily (i hope)
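(As an aside on the offsets-expiry thread from earlier in the log: a hedged sketch, not the team's actual tooling, of how one could verify that a consumer group still has committed offsets before restarting its updater. It assumes kafka-python is available; broker and group names are taken from the conversation above.)

```python
# Check whether a consumer group still has committed offsets. Offsets older
# than the broker's offsets.retention.minutes (~7 days by default) are
# purged, which is what bit the ~15-day data reload discussed above.
from kafka import KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers='kafka-main2001.codfw.wmnet:9092')

for group in ('wdqs1009', 'wdqs2008'):
    offsets = admin.list_consumer_group_offsets(group)
    if not offsets:
        print(f'{group}: no committed offsets, needs resetting')
    else:
        for tp, om in sorted(offsets.items()):
            print(f'{group} {tp.topic}[{tp.partition}] = {om.offset}')
```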