[10:50:30] Lunch + errands
[10:53:26] stopped blazegraph on wdqs1013 while we mitigate the maxlag problem
[10:53:47] also stopped the updater (removing the data_loaded flag)
[10:54:10] stopped puppet as well, hoping that it would prevent blazegraph from being restarted
[10:54:25] inflatador, ryankemper: ^
[10:54:36] issue is T360993
[10:54:37] T360993: WDQS lag propagation to wikidata not working as intended - https://phabricator.wikimedia.org/T360993
[11:07:36] moved ^ to our workboard as high (might be a UBN since we can't depool machines to catch up on lag after a long failure or a data transfer)
[11:09:22] lunch
[13:18:16] o/
[13:24:02] dcausse: looking at that wdqs ticket, we do specify a user-agent in those prometheus checks... could that be used to filter out monitoring queries?
[13:24:35] we can probably change the frequency of the checks too, I believe we just used the default
[13:24:48] inflatador: possibly? I was looking at query logs hoping to find some headers set on the edges by varnish
[13:29:30] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/query_service/monitor/wikidata_public.pp the user agent is listed here... will try to figure out the frequency
[13:43:23] hmm, from what I can tell the scrape interval is 15s... unfortunately, it is necessary to create a check for each team that gets notified, so that's 2 checks. Monitoring checks should still be well below 1 qps just for the blackbox probes. We do have other probes that use the jmx exporter though... checking
[13:55:54] could we filter based on x-forwarded-for? I see these UAs coming from 10.* addresses or from 2620:0:* ipv6; the ipv4 is private but the ipv6 I have no clue about
[14:00:32] I haven't looked at how the prometheus metric is created yet, just doing the math on scrape interval * pollers
[14:00:35] https://phabricator.wikimedia.org/T360993#9661626
[14:17:27] re: ipv6, we only use public ipv6 addresses even if we don't route them publicly; should be easy enough to get the IPs of the prom pollers
[14:27:20] \o
[14:30:31] based on the access logs from wdqs2019, the pollers are basically hitting every second. This doesn't seem to match the prometheus configs... need to fix the nginx logs so they use XFF, but something def looks wrong
[14:46:08] o/
[14:46:19] could be pybal?
[14:53:35] yeah, or envoy...
[14:56:58] can envoy send requests on its own (something not forwarded)?
[15:03:03] not sure... it seems really weird that it would, but we do have tons of requests from that user agent
[15:03:17] workout, back in ~40
[15:11:37] hmm, where should documentation about what the saneitizer is and does go? Not planning to write a ton, maybe 2-3 paragraphs. Perhaps the top of the Source impl
[15:17:27] Justin did some research around search, and there are quite a few good ideas. If you have time to review https://docs.google.com/presentation/d/17Zcb6-OiUS1mqJSCzLzh694SHkmZPEusKQ0jUUJoZzM/edit#slide=id.g23ce2a303f9_0_0, this might be a good starting point for our discussion with the Web / App teams
[15:19:15] ebernhardson: +1 for the source impl
[15:37:43] inflatador: going to use a UA list I think, the list probably set from puppet and then used by nginx to set a header like "X-MONITORING-QUERY: true", and adapt blazegraph to ignore those requests in the metric used for detecting the pooling status
[15:54:34] back
[16:37:56] dcausse: ACK, that will help for sure. I'll keep trying to figure out why the pollers are hitting the hosts every second or so (or at least, that's what it looks like from the nginx access log)
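To put numbers on the "scrape interval * pollers" math above: a rough sketch of the expected blackbox-probe rate versus what the wdqs2019 access logs seem to show. The poller count below is an illustrative assumption, not a value read from the actual prometheus or puppet config.

```python
# Back-of-the-envelope estimate of the blackbox-probe traffic per wdqs host.
# SCRAPE_INTERVAL_S and CHECKS_PER_HOST come from the discussion above;
# POLLERS is an assumed number of prometheus pollers scraping each host.
SCRAPE_INTERVAL_S = 15
CHECKS_PER_HOST = 2   # one check per team that gets notified
POLLERS = 2           # assumption, not read from the prometheus config

expected_qps = CHECKS_PER_HOST * POLLERS / SCRAPE_INTERVAL_S
print(f"expected probe rate: {expected_qps:.2f} qps")   # ~0.27 qps

# The wdqs2019 access logs look closer to ~1 request/second for that UA,
# i.e. a few times more than the probes alone should generate.
observed_qps = 1.0
print(f"observed / expected: {observed_qps / expected_qps:.1f}x")
```

If the observed rate really is around 1 req/s, that is several times the expected probe traffic, which fits the suspicion that something else (pybal, envoy, or the jmx-exporter probes) is sending requests with the same user agent.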
[17:02:12] not sure why, but I'm always nervous when calling http_request.getHeader(headerName); always afraid that it might end up being case-sensitive
[17:02:25] lol, yea
[17:21:57] ebernhardson: something I forgot to check regarding the saneitizer is that rerenders do not trigger an upsert, but in the context of fixing an issue discovered by the saneitizer we might want an upsert
[17:22:21] dcausse: oh! indeed that does sound important
[17:22:38] will need a stronger variant of re-render or some such
[17:23:27] dr0ptp4kt: just wanted to say, great job on T359062... still wrapping my head around the implications, but thanks for your work on this
[17:23:28] T359062: Assess Wikidata dump import hardware - https://phabricator.wikimedia.org/T359062
[17:24:12] not sure if sad or not: "A current state of the art cloud compute instance approaches the performance characteristics of a 2018 gaming desktop and 2019 Intel-based MacBook Pro."
[17:26:41] yeah, but the cloud compute is powered by VC money, so it's automatically better ;P
[17:44:11] lunch, back in ~40
[18:09:22] * ebernhardson wonders, while looking at i18n files, if sanity is going to be an annoying term to localize...
[18:09:44] i guess i can use correctness in the actual translated strings
[18:16:02] hmm, seems like the backfill on wikidata finished almost 24h ago but didn't shut down :( looking
[18:19:52] the job completed, but it seems the flinkdeployment state never got updated. I suppose the script could query the flink rest api instead; it's just more annoying to manage a port-forward or `kubectl exec ...` to run the request inside the pod
[18:22:45] :/
[18:34:50] back
[18:42:05] dinner
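On the idea of having the backfill script query the Flink REST API directly instead of trusting the flinkdeployment status: a minimal sketch of what that check could look like, assuming the JobManager's REST port (8081 by default) has already been exposed locally with `kubectl port-forward`. The namespace and service name in the comment are hypothetical; the /jobs endpoint and its id/status fields are part of the upstream Flink REST API.

```python
import json
import urllib.request

# Assumes something like the following has been run first (namespace and
# service name are hypothetical, adjust to the actual flink deployment):
#   kubectl -n <namespace> port-forward svc/<jobmanager-service> 8081:8081
FLINK_REST = "http://localhost:8081"


def jobs_overview():
    """Fetch the job list from the JobManager's /jobs endpoint."""
    with urllib.request.urlopen(f"{FLINK_REST}/jobs", timeout=10) as resp:
        return json.load(resp)["jobs"]


def backfill_finished(jobs):
    """True once no job is RUNNING anymore, i.e. the backfill can be torn down."""
    return all(job["status"] != "RUNNING" for job in jobs)


if __name__ == "__main__":
    jobs = jobs_overview()
    for job in jobs:
        print(job["id"], job["status"])
    print("safe to shut down:", backfill_finished(jobs))
```

The same request could also be run with `kubectl exec` inside the pod, as mentioned above; the trade-off is only in how the port or pod access is managed.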