[08:07:46] wdqs102[234] are barely reachable in the mornings
[08:09:30] What would be the procedure in this case? Check resource consumption? Restart?
[08:10:49] pfischer: I think it depends, in general an SRE tries to access the console and see what's going on; here we're unsure what's going on, every morning these servers misbehave
[08:11:48] we suspect a cronjob that runs around this time, possibly bringing the server to its knees
[08:12:37] there's also one nfs mount which I've seen in the past can hang the machine for some time
[08:13:12] Does that show up in I/O metrics?
[08:19:16] yes, a bit, but sadly some metrics are unavailable during that time because the host is unreachable (https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=wdqs1023&var-datasource=thanos&var-cluster=wdqs-test)
[08:46:19] ryankemper: I marked T352878 as blocking T350464 (tried to run some queries this morning and the availability issues on these machines made this almost impossible)
[08:46:19] T352878: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878
[08:46:20] T350464: Expose SPARQL endpoints with full wikidata data set and with split graph to enable experimentation on federation with a split graph - https://phabricator.wikimedia.org/T350464
[10:37:28] errand+lunch
[12:28:38] dcausse: here's my spin on estimating page_rerender topic sizes (per wiki): https://docs.google.com/spreadsheets/d/1Fp44MdLxUVlxi03MBD_64m0zQErny-9jUD5C6RGf_bU/edit#gid=670687915, turns out: <2% of the wikis that were active in the observed period are responsible for 92% of records
[13:09:50] pfischer: how did you estimate this?
[13:10:13] I counted events in page-links-changed
[13:11:05] Translated that into shares of the total number of records in that topic and multiplied it by the expected topic size for page_rerender
[13:11:58] pfischer: I don't know page-links-changed enough to know if it's a proxy, Erik used the cirrusSearchLinksUpdate topic
[13:12:05] *good proxy
[13:12:32] No worries, I'll rerun the script.
[13:14:34] pfischer: see T352335#9380712
[13:14:35] T352335: Deploy the new Cirrus Updater to update select wikis in cloudelastic - https://phabricator.wikimedia.org/T352335
[13:15:29] commons, wikidata, fr, it should have rerenders enabled so they can be used to test the estimation
[13:16:20] also I'd perhaps not worry too much and just double-check that the current size of the rerender topic matches what we had expected when enabling these wikis?
[13:16:48] and then agree on 2 or 3 other groups of wikis to enable
[13:18:36] https://grafana-rw.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-kafka_broker=All&var-topic=codfw.mediawiki.cirrussearch.page_rerender.v1&var-topic=eqiad.mediawiki.cirrussearch.page_rerender.v1
[13:19:27] That matches Erik's estimate quite precisely
[13:20:01] yes, seems like it
[13:20:56] perhaps let's pink Luca with this data and ask if it's ok enabling another set of wikis, picking those based on the data you extracted?
[13:21:00] s
[13:21:04] s/pink/ping/
[13:23:07] I'm still trying to figure out how Erik extracted the per-wiki shares of cirrusSearchLinksUpdate jobs. Do the logs contain that information?
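(editor's sketch) One way to get per-wiki shares like the ones discussed above is to consume the proxy topic directly, count records per wiki, and scale the shares by the expected total size of the page_rerender topic. The snippet below is a minimal illustration using kafka-python; the broker address, the event field holding the wiki id, and the expected total size are placeholders and this is not the actual script Erik used:

    from collections import Counter
    import json

    from kafka import KafkaConsumer  # pip install kafka-python

    TOPIC = "codfw.mediawiki.job.cirrusSearchLinksUpdate"
    EXPECTED_RERENDER_BYTES = 100 * 1024**3  # placeholder for the expected page_rerender topic size

    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",  # placeholder: point at the kafka-main brokers
        auto_offset_reset="earliest",
        consumer_timeout_ms=60_000,  # stop once no new message arrives for a minute
    )

    counts = Counter()
    for record in consumer:
        event = json.loads(record.value)
        # "database" is an assumption about where the wiki id lives in the job event
        counts[event.get("database", "unknown")] += 1

    total = sum(counts.values())
    for wiki, n in counts.most_common(20):
        share = n / total
        print(f"{wiki}\t{share:.2%}\t~{share * EXPECTED_RERENDER_BYTES / 1024**2:.0f} MiB")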
[13:24:05] the cirrusSearchLinksUpdate topic should have the wiki id, yes; the per-message size estimate is 0.6kb (what we used so far)
[13:24:07] Based on page-links-changed the share of commons would be much higher: 60% instead of 13% (Erik)
[13:24:23] oh I see
[13:24:56] if you have something that can extract data from kafka you might be able to adapt it to run on top of this topic?
[13:25:08] we can try to find where has put this script as well
[13:25:19] we can try to find where *Erik has put his script as well
[13:25:27] Is this a topic? https://stream.wikimedia.org/v2/ui/#/ does not know it.
[13:25:39] no, it's not public :/
[13:25:49] it will have to be fetched from kafka-main directly
[13:26:01] Sure, I'll do that.
[13:26:06] Is it replicated to jumbo?
[13:26:19] no, it's a job so not replicated there
[13:27:02] Okay, I'll run my Kafka proxy against main…
[13:27:39] since we run in codfw you should consume codfw.mediawiki.job.cirrusSearchLinksUpdate
[13:47:36] Updated the spreadsheet, it's closer to Erik's shares again
[13:48:12] Color-coded 10GB (per broker) batches
[13:59:25] o/
[14:05:39] o/
[14:05:43] pfischer: thanks!
[14:15:11] dcausse are you actively working on the graph split hosts? I was going to try updating all their firmware
[14:15:45] inflatador: extracting a couple numbers, just a sec
[14:19:43] inflatador: I'm done
[14:20:00] dcausse cool, will give a heads-up when I get started
[14:30:51] dcausse do we need NFS on these hosts anymore, since the data is already loaded?
[14:31:06] inflatador: I don't think so
[14:31:35] OK, that might be something I try too...although we have NFS on a couple of other hosts with no issues I'm aware of
[14:34:09] side note, do we have a dashboard for WDQS that shows HTTP error rates? I can't seem to find it. Something like https://grafana.wikimedia.org/d/000000503/varnish-http-errors?orgId=1
[14:34:55] Dashboard links on https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Main_dashboards are broken...I'll need to fix that ;(
[14:36:29] inflatador: the "wikidata-query-service" dashboard is not broken?
[14:36:46] I use it all the time: https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m
[14:36:52] but perhaps it's something else?
[14:37:59] re errors, there's a "Ratio of failed queries" at the end of this dashboard, could it be what you're looking for?
[14:39:39] dcausse I was looking for 5xx errors specifically...I think we are using those as part of the SLO. I generated a bunch of them by accident when I was setting up monitoring for the LDF endpoint
[14:40:03] maybe error rate?
[14:40:28] although that seems to come from BG itself...not sure
[14:40:36] error rate should be from blazegraph, did you hit blazegraph with these queries?
[14:41:24] Y, if Blazegraph gets a request with an Accept header, it throws a 500
[14:41:47] wow
[14:42:00] you mean without?
[14:42:17] Y, without
[14:42:34] the ldf endpoint runs its own code so might not be captured by this error rate
[14:42:51] do you remember when it happened?
[14:43:11] Y, I can dig it up...been working on this one for several weeks now.
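(editor's sketch) To confirm the "500 without an Accept header" behaviour described above before wiring the header into the blackbox probe, comparing the two requests side by side is enough. The endpoint URL and query below are illustrative only, not the actual probe configuration:

    import requests

    # Placeholder URL: substitute the wdqs host / path actually being probed.
    URL = "http://wdqs-test.example.org/sparql"
    PARAMS = {"query": "SELECT * WHERE { ?s ?p ?o } LIMIT 1"}

    with_accept = requests.get(
        URL, params=PARAMS, timeout=10,
        headers={"Accept": "application/sparql-results+json"},
    )
    without_accept = requests.get(URL, params=PARAMS, timeout=10)

    print("with Accept header:   ", with_accept.status_code)
    print("without Accept header:", without_accept.status_code)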
More context here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/983415/
[15:02:08] workout, back in ~40
[15:57:19] back
[15:59:16] hello, wdqs1024 has a unit that is failing and complains that the service is not here at all: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service
[15:59:26] Service prometheus-blazegraph-exporter-wdqs-categories not present or not running
[16:01:03] volans: thanks for the heads up, these hosts should not run the category endpoint
[16:01:33] perhaps some puppet adjustments are needed so that this service does not try to start
[16:02:13] wdqs1022 and 1023 probably have the same problem
[16:38:35] dcausse pfischer any opinion on https://phabricator.wikimedia.org/T349848 ? Was wondering if this needs further work, if it's blocked, or if it should maybe be closed if not important?
[16:40:14] dcausse pfischer another one from our mtg: https://phabricator.wikimedia.org/T350186 . Do we think the relforge indices are OK, or do they need further work?
[16:42:35] inflatador: re mw api usage, up to you; if this task is helpful to you we should keep it?
[16:43:36] re relforge index correctness, Peter extracted new numbers yesterday IIRC
[16:46:00] pfischer: should we revert https://gerrit.wikimedia.org/r/c/search/extra/+/982048? seems like the second workaround you implemented is a better approach
[16:59:15] inflatador: is T353672 something you're working on? If so, can you move it to in progress?
[16:59:16] T353672: Expose Prometheus Blackbox Exporter's ability to add http headers in puppet module - https://phabricator.wikimedia.org/T353672
[17:02:31] gehel ACK, done
[17:03:49] inflatador: sorry, I was unclear. I was thinking about having it on our board and in the right column. I'll move it.
[18:07:18] lunch, back in ~1h
[18:14:35] going offline, happy end-of-year festivities and take care
[18:52:01] was back, but I'm taking my son to art class
[18:52:04] back in ~20
[21:48:33] dcausse: yes, we should revert the extra plugin to wmf9; at least the "feature" should be removed from the codebase again so wmf11 will no longer support it
[21:52:19] inflatador: regarding relforge correctness, I checked yesterday: https://phabricator.wikimedia.org/P54487
[21:55:08] It improved by an order of magnitude since Erik ran it last time
[21:55:43] pfischer thanks...are these numbers within tolerances or do we still need to keep this open? Rev mismatch doesn't seem like a huge problem unless it's way off?
[22:38:21] inflatador: good question, I don't think we agreed on what's good enough. Now that I'm looking at those numbers, I also notice that we don't know how many (page-wise) distinct updates were performed since the index snapshot that's the ground truth of relforge. So we only see 10k mismatches, which might be okay if we ran 1M updates but may not be okay if we only ran 100k updates. I'll check the numbers on Thursday
[22:38:22] once more.
[22:38:51] pfischer no worries. There is no pressure to close this, although we probably need to make the AC a bit more clear ;)
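(editor's sketch) The open question at the end is really about a rate rather than an absolute count: 10k mismatches reads very differently depending on how many distinct pages were updated since the relforge snapshot. A tiny helper makes that comparison explicit; the numbers below are illustrative assumptions, not measurements from P54487:

    def mismatch_rate(mismatches: int, updates_since_snapshot: int) -> float:
        """Share of pages updated since the snapshot that still disagree with relforge."""
        if updates_since_snapshot <= 0:
            raise ValueError("need the number of updates performed since the snapshot")
        return mismatches / updates_since_snapshot

    # Illustrative: the same 10k mismatches at two hypothetical update volumes.
    print(f"{mismatch_rate(10_000, 1_000_000):.1%}")  # 1.0%  -> probably within tolerance
    print(f"{mismatch_rate(10_000, 100_000):.1%}")    # 10.0% -> probably not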