[09:13:05] Trey314159: https://www.mediawiki.org/wiki/Topic:Xtc9mi1igunqr1g2 you might have opinions (relates to documentation and formatting)
[11:23:26] lunch
[14:26:37] o/
[14:35:04] dr0ptp4kt: running a couple tests on wikibase_rdf_scholarly_split_refactor (seems to have a more recent run than wikibase_rdf_scholarly_split)
[14:35:57] dcausse: yeah, ran that Friday. see you after a bit, heading to appt
[15:15:32] hm... hive allows me to query Adam's db but spark fails with "Permission denied: user=dcausse, access=EXECUTE, inode="/user/dr0ptp4kt/wikibase_rdf_scholarly_split_refactor/snapshot=20231106":dr0ptp4kt:dr0ptp4kt..."
[15:19:09] dr0ptp4kt: could you run: "hdfs dfs -chgrp -R analytics-search-users hdfs:///user/dr0ptp4kt/wikibase_rdf_scholarly_split_refactor" to see if this helps me getting access to this dataset?
[15:20:57] dcausse looks like the puppet 7 upgrade started wdqs-blazegraph on the graph split hosts ;( . Will that break the data reload? The updater did not start.
[15:21:33] inflatador: you mean "restarted"
[15:21:35] ?
[15:23:22] dcausse ah, nm...it didn't start them, they were already started by the cookbook
[15:23:40] just the exporters
[15:25:52] the import "should" restart a couple of the curl requests but it's better to avoid maint operations while we do a reload
[15:25:57] so we're all good, sorry for the confusion
[15:26:27] Yeah, unfortunately the puppet upgrade wasn't communicated until after it was done ;(
[15:27:35] ok, a puppet run should not restart blazegraph (I hope)
[15:28:52] we're past 12B triples imported on wdqs1022 :)
[15:29:16] nah, just checked and puppet is still disabled and the BG processes are OK, I just got an email over the weekend saying puppet was upgraded on those hosts. Sounds like nothing to worry about after all
[15:30:39] kk
[15:36:33] dcausse: will update perms in middle/end of the meeting when back at an ssh console, sorry about the hassle
[15:36:57] sure np and no rush!
[15:42:33] getting alerts for search-loader too, hmmm https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1
[15:43:16] Will be 5-10' late for triage
[15:43:31] \o
[15:43:51] looks like the consumer group lag is dropping re: search-loader
[15:44:56] yes saw this this morning, could not find what was wrong, might be some prometheus query issues, I don't see any errors
[15:45:22] o/
[15:53:52] if anyone has time to review https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/973242 LMK, Otherwise I'll probably self-merge. Still having the issue from Thurs where only one release is stable
[15:54:57] inflatador: do you know what's the error now?
[15:58:03] dcausse: the same from friday, it's an over-quota error
[15:58:08] s/friday/thursday/
[15:58:16] dcausse no, looks like quota issues to me...it keeps asking to make a worker and failing.
[15:58:59] If I submit a single release (commons OR wikidata) the job will stabilize
[16:01:25] inflatador: ok, we can increase the quota but we'll have to revert at some point
[16:01:41] pfischer, Trey314159: triage meeting: https://meet.google.com/eki-rafx-cxi
[16:06:00] dcausse: thanks for the link to that discussion. I have opinions, but I'm not 100% sure I made things clearer—but I did write some stuff!
[16:06:12] thanks!
[16:54:12] ryankemper not going to make SRE meeting today. Added a brief note about the flink operator stuff to the agenda just as an FYI
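For reference, a minimal sketch of the permission fix discussed at 15:15/15:19 above, assuming the path and group quoted in the log; the `-ls` and `-chmod` steps are extra assumptions in case the group read/execute bits also need to be set, not something stated in the conversation:

```
# Inspect current owner/group/mode on the dataset directory
hdfs dfs -ls -d /user/dr0ptp4kt/wikibase_rdf_scholarly_split_refactor

# Hand the tree over to the shared group (the command quoted in the log)
hdfs dfs -chgrp -R analytics-search-users /user/dr0ptp4kt/wikibase_rdf_scholarly_split_refactor

# Assumption: also grant the group read plus execute-on-directories if those bits are missing
hdfs dfs -chmod -R g+rX /user/dr0ptp4kt/wikibase_rdf_scholarly_split_refactor
```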
[16:54:34] back in ~45
[17:04:53] inflatador: ack
[17:29:57] one option for the high volume of api requests we will make, not sure though: https://phabricator.wikimedia.org/T345185#9321210
[17:36:06] oh, now i remember the problem for mjolnir...we accidentally archived the search/mjolnir/deploy repo when we archived search/mjolnir
[17:36:18] i guess dupe it over to gitlab?
[17:51:28] back
[17:52:34] Re: migrating mjolnir repo, I can take care of that
[17:52:39] https://gerrit.wikimedia.org/r/c/operations/puppet/+/973849 should do the trick
[17:52:48] i already created the repo and had it import from gerrit
[17:54:50] ACK, thanks...will merge that later today
[18:09:39] hmm, seems like the reindex after adding page_id didn't make it to cloudelastic enwiki index, wonder if we missed others :S
[18:11:39] Applied the quota increase to staging and it looks like both wikidata and commons are making taskmanagers now...maybe this worked? Still checking tho
[18:13:55] Maybe not. `The producer attempted to use a producer id which is not currently assigned to its transactional id`
[18:14:08] ^^ from the wikidata jobmgr
[18:19:45] looking, not clear what that is :S
[18:21:06] * ebernhardson didn't realize producers had id's :P
[18:22:59] not sure...flinkdeployment resources are not stabilizing like they did before though
[18:23:13] random reading suggests it should go away and the new producer id should be auto-assigned, unless we somehow reused the same producer id in both sides
[18:24:08] it's not clear where the producer id comes from though, there is a `client.id` reported in the logs, but that seems to be unique per application
[18:24:33] How could I check that? That does sound like it could be relevant, since it seems like one will run when the other's disabled. Let me try disabling one job again
[18:24:38] oh i think the client.id is consumers, not producers
[18:24:55] inflatador: i'm wondering as well :) I had never seen a producer id in the configs
[18:25:35] no worries, I just destroyed the commons release, let's see if wikidata stabilizes
[18:25:38] i guess it reports the producer id in the exception, but it's just some number
[18:29:05] i suspect the "Appendix: Known Critical Issues with current KafkaSink" of https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=255071710 could be relevant here
[18:29:28] > At any point in time, a Flink job may fail for any reason, causing it restart from the last completed Flink checkpoint. Within the Flink checkpoints, the KafkaSink stores TIDs (Kafka's transactional.id) of Kafka transactions that were pre-commited as part of the checkpoint. On restart, these TIDs are restored and their previous ongoing transactions should be (possibly redundantly)
[18:29:30] committed.
[18:30:10] although, if you destroyed it should have gotten rid of the old checkpoints
[18:34:47] Have you seen the cirrus updater remove checkpoints after a destroy? I haven't looked at that at all
[18:35:07] inflatador: yes, the checkpoint counter restarts from 0
[18:36:03] interesting, I want to say that doesn't happen with the rdf-streaming-updater...let me check logs though
[18:36:08] although it's not clear that this one did, i see this in the logs: " Reset the checkpoint ID of job fb5b74147271e3dc88494bcd9ed9fc44 to 2160553."
[18:37:03] unfortunately that's not the logs i was previously looking at in cirrus, in cirrus i was looking at the reported # on the checkpoints it creates. But this doesn't get that far
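A rough sketch of how the flinkdeployment state and the errors above can be inspected, assuming the flink-kubernetes-operator CRDs; the namespace and resource names below are placeholders, not taken from the log:

```
# The operator's FlinkDeployment CRD typically prints job and lifecycle state columns,
# which is a quick way to see whether a release has stabilized
kubectl -n rdf-streaming-updater get flinkdeployments

# Full status, including the last reconciled spec and any error the operator recorded
kubectl -n rdf-streaming-updater describe flinkdeployment commons

# Grep the jobmanager logs for the checkpoint counter and the producerId commit failures
kubectl -n rdf-streaming-updater logs deployment/commons | grep -Ei 'checkpoint|producerId'
```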
[18:43:14] well, somehow I broke my logstash query ;(
[18:44:05] andd...we're back
[18:49:39] wikidata job has definitely not stabilized...going to apply again and head to lunch but probably not going to help
[18:54:53] inflatador: did you destroy the wikidata release, it's restoring from s3://rdf-streaming-updater-staging/wikidata/savepoint_202311061445/savepoint-8cff20-f928f0373166 which is probably too old
[18:55:19] curiously the commons app has `Failed to commit KafkaCommittable{producerId=554654, ...` and the wikidata app has ` Failed to commit KafkaCommittable{producerId=556451, ...`
[18:55:23] i'm surprised they have the same producerId
[18:55:28] * ebernhardson still isn't entirely sure what that is :P
[18:55:37] oh, nevermind they aren't the same
[18:55:39] just close
[18:56:52] commons is perhaps having the same issue if it's restoring from 7+days save point
[18:57:46] re T345185#9321210, do
[18:57:47] T345185: Provide a method for internal services to run api requests for private wikis - https://phabricator.wikimedia.org/T345185
[18:57:50] oops
[18:58:28] dcausse: it's a bit of a fake-api call via jobs. Seems plausible if awkward
[18:58:47] we'd have to reconstruct whatever the json version of a job looks like
[18:58:47] do I understand this as "RunSingleJobHandler" could be called from an api request and the response could be the output of the cirrusbuilddoc
[18:59:25] ok
[18:59:31] dcausse: yes, the problem is basically to try and keep the load inside the jobrunner cluster, because we can't easily shift servers between mediawiki clusters. Aaron's idea is to submit fake jobs
[19:00:16] it also works around the private wiki problem though, because jobs do their own thing
[19:00:55] ok but we'd stay on the jobrunner cluster..
[19:01:20] indeed it punts the problem down the road, once mw-on-k8s is ready i think we'd want to get out of that
[19:37:59] back
[19:38:55] dcausse confirmed, I haven't made any new savepoints. I can try restoring from a checkpoint from today. Not sure what to do about the commons job though
[19:39:39] inflatador: for commons@staging you can simply start without any savepoint
[19:40:38] dcausse ACK, if I need to make savepoints before I destroy in staging LMK
[19:48:53] Is it possible to tell how long checkpoints are valid?
[19:50:19] inflatador: i'm not 100% sure, but i saw this earlier today and suspect it's relevant:
[19:50:21] In Kafka, transactional.id.expiration.ms is the time in milliseconds that the transaction coordinator will wait without receiving any transaction status updates. The default value is 7 days
[19:50:28] inflatador: it depends on how long kafka transactions are valid
[19:52:38] not sure we want checkpoints older than 7 days to still be valid anyways, most topics have a 7-day retention period
[19:53:10] makes sense
[20:10:12] OK, pointed the wikidata job at the last checkpoint and it's stable
[20:10:37] will do commons no
[20:10:38] w
[20:14:28] re page_id from earlier, i checked the 9 clusters and only enwiki on cloudelastic is missing the page_id field, looks like it hasn't managed to complete a reindex. Seems it couldn't hurt so i kicked off that one now
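A sketch of how the 7-day transaction window mentioned at 19:50 could be confirmed on the brokers; `kafka-configs.sh` and its `--describe --all` mode are stock Kafka tooling, but the bootstrap host and broker id below are placeholders:

```
# List every config for one broker, defaults included, and pick out the transaction expiry
kafka-configs.sh --bootstrap-server kafka-broker.example:9092 \
  --entity-type brokers --entity-name 1001 --describe --all \
  | grep transactional.id.expiration.ms
```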
[20:41:34] Commons and staging wikidata apps are now stable...I guess we wait a few days and migrate to prod
[20:45:34] randomly interesting, fetching a couple fields from ~80k docs/sec increases cpu load in cloudelastic from 5% to 35%
[20:50:27] also has the awkward behaviour that cloudelastic doesn't have lvs, so all the json responses are being encoded by a single host
[20:52:56] oh, we do have something? cloudelastic.wikimedia.org resolves and returns different hosts at least. Wonder how i could have forgotten
[20:53:11] Y, we do have LVS https://config-master.wikimedia.org/pybal/eqiad/
[20:53:31] excellent
[20:54:27] i think i also encoded that limitation into the cirrus updater helm bits, will have to double check
[21:23:22] Hi all. I'm planning to schedule my Amazon OpenSearch Elasticsearch 6.5 cluster's upgrade to 6.8 on Thursday morning. Since it does a blue/green deployment, changing the IP addresses of the cluster's endpoint hostname at the end of the migration, I have Nginx configured to refresh DNS every 60 seconds. The cluster typically gets a few hundred searches per
[21:23:23] minute and the indexing rate is typically very low, < 10 per minute, but it can spike occasionally into the few hundred. Should I lower the DNS refresh rate during the upgrade to something like 20 seconds, just to help minimize the number of search errors, or is 60 seconds reasonable here?
[21:26:00] I'll let Erik or Brian opine but it sounds like leaving it at 60s should be fine
[21:26:19] justinl: hmm, does the old cluster immediately shut down, or does it hang around for a minute or two? If it hangs around long enough for the refresh to come through, you should be all good
[21:26:41] "Amazon OpenSearch Service uses a blue/green deployment process when updating domains. A blue/green deployment creates an idle environment for domain updates that copies the production environment, and routes users to the new environment after those updates are complete. In a blue/green deployment, the blue environment is the current production
[21:26:42] environment. The green environment is the idle environment.
[21:26:42] Data is migrated from the blue environment to the green environment. When the new environment is ready, OpenSearch Service switches over the environments to promote the green environment to be the new production environment. The switchover happens with no data loss. This practice minimizes downtime and maintains the original environment in the
[21:26:43] event that deployment to the new environment is unsuccessful."
[21:27:12] However, Nginx has to know about the new IPs, hence the special `resolver` config it needs.
[21:28:09] So it's likely that for some short period, depending on when the last refresh was relative to when DNS is updated, there will be some time where search accesses and index updates would fail.
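Two quick DNS checks along the lines of the discussion above; `dig` is standard, and the OpenSearch hostname is just the placeholder from the pasted nginx config, not a real endpoint:

```
# cloudelastic is behind LVS/pybal, so this should show the service address it publishes
dig +short cloudelastic.wikimedia.org

# For the OpenSearch endpoint, the answer's TTL (second column) bounds how stale a cached
# lookup can get, independently of nginx's own `resolver ... valid=` window
dig +noall +answer vpc-my-cluster-name.us-east-1.es.amazonaws.com
```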
[21:28:50] hmm
[21:29:41] I'm not really sure then, we've done all our upgrades in-place so I haven't thought too deeply before about how this would work out
[21:29:56] It sounds like reducing the dns refresh couldn't hurt
[21:30:13] FWIW here's what the nginx config looks like:
[21:30:16] ```server {
[21:30:17]     listen 127.0.0.1:9200;
[21:30:17]     server_name localhost 127.0.0.1;
[21:30:18]     resolver 169.254.169.253 valid=60s;
[21:30:18]     access_log /var/log/nginx/elasticsearch/access.log combined if=$log_request;
[21:30:19]     error_log /var/log/nginx/elasticsearch/error.log;
[21:30:19]     location / {
[21:30:20]         allow 127.0.0.1/32;
[21:30:20]         deny all;
[21:30:21]         set $proxy_backend vpc-my-cluster-name.us-east-1.es.amazonaws.com;
[21:30:21]         proxy_pass https://$proxy_backend;
[21:30:22]         proxy_redirect off;
[21:30:22]         proxy_buffering off;
[21:30:23]         proxy_set_header Host vpc-my-cluster-name.us-east-1.es.amazonaws.com;
[21:30:23]         proxy_set_header X-Forwarded-Host $http_host;
[21:30:24]         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
[21:30:24]     }
[21:30:25] }```
[21:31:05] I've had to do this before with service software updates, but I can't control when those happen. I can plan much better with actual version upgrades, so if I can reduce the risks, so much the better.
[21:31:33] ebernhardson re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/973849 , since we deploy mjolnir with scap, do we need to update anything related to scap? Or is that kubernetes.yaml enough?
[21:33:29] inflatador: i suspect that being called "kubernetes" is a misnomer. From the git history when we last changed the repo definition it was in role/common/deployment_server.yaml, it's since been moved into role/common/deployment_server/kubernetes.yaml
[21:34:27] justinl: almost sounds like it'd be nice if nginx could refresh on-failure, but somehow i doubt it has that. Reducing the refresh should at least limit your downtime
[21:35:52] inflatador: it looks like kubernetes there basically refers to the type of deployment server, and in prod all deployment servers are k8s deployment servers
[21:36:34] ebernhardson ACK, FWIW /srv/deployment/search/mjolnir/deploy is pointing to the correct gitlab repo
[21:38:03] inflatador: excellent, in that case we just need to run a normal scap deploy and it should push out the new python3.10 version
[21:39:04] ebernhardson Now you've got me wondering if I could somehow get such failures to generate an event on the Salt bus (I use Salt for server configurations), which could trigger an orchestration that runs an Nginx reload (which does the refresh).
[21:40:52] But for this upgrade, I'll just lower the refresh to like 20 seconds at the start of the upgrade procedure.
[21:43:11] +1
[21:56:56] `scap deploy` + `systemctl restart mjolnir-kafka-*` = mjolnir running successfully on py3.10
[21:59:20] \o/
[22:53:21] low priority puppet patch for the search-loader decom: https://gerrit.wikimedia.org/r/c/operations/puppet/+/973880/
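A minimal sketch of the "lower the refresh for the upgrade, then put it back" step described above, assuming the nginx config lives at the placeholder path below and that nginx runs under systemd:

```
# Drop the DNS re-resolution window from 60s to 20s for the duration of the upgrade
sudo sed -i 's/valid=60s/valid=20s/' /etc/nginx/conf.d/elasticsearch-proxy.conf

# Validate and reload; new workers start with an empty resolver cache, so the next
# request triggers a fresh lookup (the "reload does the refresh" point from the log)
sudo nginx -t && sudo systemctl reload nginx

# After the blue/green switchover settles, revert to the normal 60s window
sudo sed -i 's/valid=20s/valid=60s/' /etc/nginx/conf.d/elasticsearch-proxy.conf
sudo nginx -t && sudo systemctl reload nginx
```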