[09:50:21] lunch
[13:23:39] \o
[13:32:54] o/
[14:58:38] DPE Search - Triage/Planning starting in a moment
[16:02:07] ebernhardson: if you have a sec: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1059898
[16:03:55] dcausse: should be fine
[16:03:59] thanks!
[16:04:42] realized i didn't figure out how we get secrets into helmfile yet and into the http request headers, will ponder
[16:06:49] ebernhardson: there are secret helmfile values in /etc/helmfile-defaults/private/main_services/cirrus-streaming-updater/
[16:07:05] I think it's a git repo somewhere
[16:07:27] hmm, says it's not a git repo. Might come from puppet secrets
[16:08:11] ah right, I think I remember Brian editing those on puppetmaster (or somewhere else)
[16:08:44] yea, i think it's managed similar to wmf-config private, with a repo that only exists on the deploy hosts
[16:08:53] or in this case, puppetmaster
[17:45:51] a bit suspicious that the lag reports something around 1min (https://grafana-rw.wikimedia.org/d/8xDerelVz/search-update-lag-slo?orgId=1&var-slo_period=7d&var-threshold=600&var-source=eqiad%20prometheus%2Fk8s&var-job=search_eqiad&from=now-3h&to=now)
[17:46:09] was expecting to see the impact of the 10min window
[17:46:35] event-time might not be propagated from the producer to the consumer...
[17:47:17] hmm, yea that sounds like maybe only the internal lag
[17:47:37] yes...
[17:47:41] dinner
[18:06:37] * ebernhardson wonders what the difference is between kubesvc and kubepods
[18:06:53] kubesvc is internal orchestration stuff maybe?
[18:07:50] ebernhardson: yes, kubesvc is the control plane API
[18:10:32] err sorry, that's not right, if you're talking about conftool
[18:11:30] that's the conftool/lvs cluster for "all k8s service endpoints", because of how they're deployed
[18:11:32] cdanis: well, i was looking for the network masks that identify connections coming from cirrus streaming updater
[18:11:46] cdanis: and in modules/network/data/data.yaml it has both
[18:12:10] i realize it wouldn't just be cirrus updater, it would just be the generic hosts that run kubemaster.svc.eqiad.wmnet
[18:15:04] I think it's actually different than that, I *think* that kubesvc is the range the k8s cluster uses for its service IPs (which includes its kubemaster api service), and kubepods is another range used for the per-pod IPs
[18:16:41] hmm, ok that makes sense. Thanks!
[18:17:37] https://netbox.wikimedia.org/ipam/prefixes/376/prefixes/
[18:19:17] ah and https://netbox.wikimedia.org/ipam/prefixes/636/
[18:19:45] i guess i should learn more about netbox :) Interesting
[18:20:14] /16 certainly gives it plenty of space to work in
[18:53:12] dcausse: kicked off the `wdqs-main` reload on `wdqs1021` like so: `ryankemper@cumin2002:~$ test-cookbook -c 1053205 sre.wdqs.data-reload --task-id T370754 --reason "WDQS main subgraph" --reload-data wikidata_main --from-hdfs hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20240729/ --stat-host stat1009.eqiad.wmnet wdqs1021.eqiad.wmnet`
[18:53:13] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754
[19:10:10] ryankemper: thanks!
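To make the 17:46 lag discussion above concrete: the SLO dashboard can only show end-to-end update lag if the consumer computes it against the event time originally stamped by the producer; if records are re-stamped somewhere inside the pipeline, the metric collapses to internal lag and the 10min window never shows up. A minimal sketch of the distinction (not the actual updater code; the timestamps and class name are made up):

```java
import java.time.Duration;
import java.time.Instant;

public class UpdateLagSketch {
    // Lag relative to whatever timestamp the record carries. If the producer's
    // original event time is propagated, this is end-to-end lag; if the record
    // was re-stamped at a later stage, it only reflects internal lag.
    static Duration lag(Instant recordEventTime, Instant processingTime) {
        return Duration.between(recordEventTime, processingTime);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Instant originalEventTime  = now.minus(Duration.ofMinutes(11));  // hypothetical edit time
        Instant restampedEventTime = now.minus(Duration.ofSeconds(50));  // hypothetical re-stamp downstream

        // ~11 minutes: would show the impact of the 10min window
        System.out.println("end-to-end lag: " + lag(originalEventTime, now));
        // ~50 seconds: matches the suspicious ~1min dashboard reading
        System.out.println("internal lag:   " + lag(restampedEventTime, now));
    }
}
```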
[19:13:56] \o/
[19:25:40] ryankemper: checking /lib/systemd/system/wdqs-updater.service, it does seem to point to the eqiad.rdf-streaming-updater.mutation topic on wdqs1021
[19:27:08] reloading wikidata_main, it should be eqiad.rdf-streaming-updater.mutation-main
[19:30:39] dcausse: are you saying it should point to that after it's done, or that it should be pointing to that now and therefore something is wrong?
[19:30:58] dcausse: apart from the above, looks like it failed fast. error message doesn't have a lot:
[19:31:01] https://www.irccloud.com/pastebin/P7cRnaxx/
[19:32:09] ryankemper: I think it should point to eqiad.rdf-streaming-updater.mutation-main right before the reload, unless puppet runs and updates it right before starting the updater
[19:32:54] ryankemper: re reload failure, could you upload the debug log somewhere?
[19:33:33] test -f /srv/wdqs/wikidata.jnl, seems like blazegraph did not even start...
[19:34:56] dcausse: this is a freshly reimaged host, maybe I need to `touch /srv/wdqs/wikidata.jnl` so it doesn't bomb out? feels like that shouldn't be necessary though
[19:40:24] dcausse: here's the full log, made it private just in case: https://phabricator.wikimedia.org/P67225
[19:46:06] ryankemper: thanks, /srv/wdqs/wikidata.jnl should be created by blazegraph when it starts, looking at the logs
[19:47:43] ryankemper: sudo journalctl -u wdqs-blazegraph.service shows some errors like 'Failed at step CHDIR spawning /bin/bash: No such file or directory'
[19:52:57] can't start it manually
[19:53:24] ah, it's because scap did not run, /srv/deployment/wdqs/wdqs is empty
[19:54:25] it's the WorkingDirectory=/srv/deployment/wdqs/wdqs entry in the systemd unit that fails, I think
[20:43:55] * ebernhardson always finds things like http auth in java so complicated... in php you would simply set the header. In java you implement classes that support 10x the functionality you need, and can't pass parameters into them because they get injected at different levels and instead need to perform cartwheels :P
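On the 20:43 http-auth grumble: with the JDK's built-in java.net.http client it is in fact possible to just set the header per request rather than wiring up an Authenticator. A minimal sketch (not the streaming updater's actual code), assuming the credentials arrive as environment variables, e.g. from the private helmfile values mentioned at 16:06; the variable names and URL below are made up:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class SimpleHeaderAuth {
    public static void main(String[] args) throws Exception {
        // Hypothetical: credentials injected into the pod as env vars by the chart.
        String user = System.getenv("UPDATER_HTTP_USER");
        String pass = System.getenv("UPDATER_HTTP_PASS");
        String basic = Base64.getEncoder()
                .encodeToString((user + ":" + pass).getBytes(StandardCharsets.UTF_8));

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.invalid/_search"))  // placeholder endpoint
                .header("Authorization", "Basic " + basic)           // "simply set the header"
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```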