[01:28:46] Traffic, [Archived]Wikidata Dev Team, Prod-Kubernetes, SRE, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10782607 (ArthurPSmith) So https://www.wikidata.org/wiki/Property:P13551 has now bee...
[07:50:54] Traffic, [Archived]Wikidata Dev Team, Prod-Kubernetes, SRE, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10782973 (Ollie.Shotton_WMDE) > Is the problem currently resolved by a process that...
[09:14:21] Traffic, [Archived]Wikidata Dev Team, Prod-Kubernetes, SRE, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10783059 (A_smart_kitten) It might be that the fix hasn't actually been deployed yet...
[10:15:22] Traffic, Citoid, Editing-team, RESTBase Sunsetting, and 5 others: Switch from restbase to rest-gateway for Citoid - https://phabricator.wikimedia.org/T361576#10783133 (Mvolz)
[13:53:30] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10783352 (cmooney) p:High→Medium
[14:12:46] Traffic, serviceops, Content-Transform-Team (Work In Progress): Purging edge caches doesn't work for articles with ":" in their title - https://phabricator.wikimedia.org/T392849#10783395 (cscott)
[17:59:39] brett: i’m a little late, 5 mins
[17:59:47] No worries! I'm just idling in the meet
[19:23:50] sukhe: We're currently having puppet compilation errors on wdqs hosts, citing issues with realserver.pp realserver_ips value errors.
[19:24:14] My guess is that we need to remove "include profile::lvs::realserver" from modules/role/manifests/wdqs/main.pp as well
[19:24:52] brett: removal of load balanced service right?
[19:24:59] yeah
[19:25:03] yeah sounds right
[19:25:07] that needs to be removed
[19:25:13] What about include profile::lvs::realserver::ipip ?
[19:26:03] we'll just remove both for now
[19:26:13] brett: remind me please, which service is this?
[19:26:15] wdqs-main?
[19:26:17] wdqs
[19:26:20] wdqs-internal
[19:26:58] and which host are you getting error on?
[19:27:02] All of em ;)
[19:27:05] ok looking
[19:27:09] wdqs1011 is an example
[19:27:12] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Class[Lvs::Realserver]:
[19:27:14] parameter 'realserver_ips' variant 0 expects size to be at least 1, got 0
[19:27:16] parameter 'realserver_ips' variant 1 expects a Hash value, got Array (file: /srv/puppet_code/environments/production/modules/profile/manifests/lvs/realserver.pp, line: 27, column: 5) on node wdqs1011.eqiad.wmnet
[19:27:16] yeah
[19:27:34] remove both
[19:28:08] let me see the task to get the full context
[19:28:37] Take the old eqiad and codfw wdqs-internal hosts offline to reconfigure them as necessary for the new graph split
[19:28:40] Bring hosts back online
[19:28:50] ok, yeah remove it for PCC sake at least
[19:29:21] https://wikitech.wikimedia.org/wiki/LVS#Remove_the_service_from_the_load-balancers_and_the_backend_servers
[19:29:27] Change state: lvs_setup to state: service_setup, and remove the service stanza from profile::lvs::realserver::pools. Then:
[19:29:56] Looks like we should add those lines to remove as well
[19:30:01] yep
[19:39:11] Yep, that fixed it!
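Editor's note: the compilation failure above came from role classes that still included the realserver profile after the service's pool stanza was removed. A minimal sketch of a pre-merge check for such leftovers follows, assuming the manifests live under a local checkout directory. The function name `find_lvs_includes` and the regular expression are illustrative assumptions, not part of the puppet repository or any existing tooling.

```python
import re
from pathlib import Path

# Matches "include profile::lvs::realserver" and sub-classes such as ::ipip.
# The pattern is an assumption based on the include lines quoted in the log.
LVS_INCLUDE = re.compile(
    r"^\s*include\s+(?:::)?profile::lvs::realserver(?:::\w+)*\s*$"
)


def find_lvs_includes(manifest_root):
    """Return (path, line_number, text) for every leftover realserver include."""
    hits = []
    for manifest in sorted(Path(manifest_root).rglob("*.pp")):
        lines = manifest.read_text().splitlines()
        for number, line in enumerate(lines, start=1):
            if LVS_INCLUDE.match(line):
                hits.append((str(manifest), number, line.strip()))
    return hits
```

Run against a directory such as modules/role/manifests/wdqs, any returned entry would point at a line to delete before re-running the compiler. It only reports matches and does not modify files.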
[19:39:14] I'll update the doc
[19:39:16] nice and thanks
[19:39:19] makes sense
[19:45:00] (not all services will have ipip so mention "if present" or something)
[19:46:59] I wrote "Also remove any inclusions (such as include profile::lvs::realserver) from any role classes or puppet compilation may fail"
[19:50:54] sukhe: One more question: https://config-master.wikimedia.org/pybal/eqiad/wdqs-internal is still hanging around... why's that?
[19:57:03] brett: removed from conftool-data and all?
[19:58:28] once removed from there and puppet merge is run, it should remove it from there as well, I think.
[19:58:33] indeed, via https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136756
[20:07:56] sukhe: do you know where puppet agent has to run for the etcd entries to be removed? is it just configmaster?
[20:08:34] if it has to run on the wdqs-internal backends, we merged the followup patch that flips their site.pp role to insetup so that could explain it...but i don't see why it'd have to run on the backends i'd expect it to just be the config masters
[20:09:53] puppet merge should take care of it
[20:10:10] as far as the configmasters are concerned, I see the entries gone from `/srv/config-master/discovery/discovery-basic.yaml` as well as the .tmpl files under `/etc/confd/templates`, but we still see `/srv/config-master/pybal/eqiad/wdqs-internal` and `/srv/config-master/pybal/codfw/wdqs-internal` as present
[20:10:22] if not, we are probably missing something
[20:10:36] happy to take a look shortly
[20:10:56] as long as you have done the other removal, it is all good
[20:12:06] yeah we've removed everything else, this is the last step before we can declare the teardown done (well, except one final step of removing the lvs VIPs from netbox and the corresponding a records from the dns repo https://gerrit.wikimedia.org/r/c/operations/dns/+/1139936)
[20:12:44] I'm at a bit of a loss, we've definitely removed the service catalog, the conftool service discovery entry, and
the conftool-data entries in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136756
[20:14:11] will look
[20:22:35] looking now
[20:24:34] maybe it has to run on the hosts themselves and after that on config master. this can be the case when exported resources are involved.
[20:26:41] mutante: unlikely but to be sure I applied again and no change
[20:31:08] sukhe: the wdqs-internal hosts have insetup role now and not internal
[20:31:24] so maybe i should put a patch to restore the role, run puppet, and then run puppet on configmasters again?
[20:33:07] ryankemper: I don't think so; basing this on for example:
[20:33:07] sukhe@cumin1002:~$ etcdctl -C https://conf1007.eqiad.wmnet:4001 ls /conftool/v1/pools/codfw/wdqs-internal/wdqs
[20:33:12] which returns nothing as expected
[20:33:48] nothing in the pybal pools themselves as well
[20:33:49] yeah I see same when I do `confctl --tags dc=codfw,cluster=wdqs-internal,service=wdqs --action get all` on configmaster, I get `{}` like I'd expect
[20:34:01] so looking at config-master and forcing a run there
[20:34:16] (which I did before too but now doing on all hosts :)
[20:35:14] we're talking about it in the Meet but I'm pretty sure if we just `rm /srv/config-master/pybal/codfw/wdqs-internal` it'll go away for good
[20:35:31] I am not sure about that
[20:37:17] Maybe try it and run puppet again?
My best guess is that etcd created that file (directly or indirectly) and there's no mechanism to remove it when the data disappears from etcd...LMK if I'm wrong about that
[20:38:14] we haven't needed to do that in the past for service removals, plus, I manually ran conftool-merge as well, which I *think* already takes care of that
[20:38:26] confirming
[20:40:11] and there is no reference to wdqs-internal in /etc/conftool/data
[20:40:53] ah
[20:41:52] sukhe@config-master1001:/srv/config-master/pybal/eqiad$ ls -l wdqs-internal
[20:41:55] -r--r--r-- 1 root root 444 Dec 12 17:10 wdqs-internal
[20:42:04] in profile::configmaster it uses the class pybal::web which then uses pybal::conf_file .. that is what is writing this, right?
[20:42:35] mutante: I checked that, the catalog no longer has a pybal::conf_file object for `wdqs-internal`
[20:42:41] because. these do have an $ensure parameter.. but it's also just:
[20:42:41] class { 'pybal::web':
[20:42:41] ensure => present,
[20:42:43] so this file really hasn't changed in a while. and the question is why, because the conftool-merge should have taken care of this
[20:42:51] so that ensure => present does not look flexible
[20:43:01] because there is certainly no pools per pybal
[20:43:05] wdqs-internal ones that is
[20:43:26] and why we didn't see that before? because well, we mostly look at the pool output from pybal and not from this
[20:43:38] curl localhost:9090/pools/
[20:43:41] on a pybal host for example
[20:44:12] mutante: yeah, we should probably fix this
[20:44:22] some of the files have not been changed there since 2023
[20:44:26] but they still persist
[20:44:41] ryankemper: inflatador: anyway, it's pretty clear that there is nothing for you to do here
[20:44:46] it seems like you COULD absent this stuff.. but it just never is
[20:45:09] another thing that might be useful?
[20:45:10] # Script to dump pool states to a json file.
[20:45:24] /usr/local/bin/dump-conftool-pools
[20:47:34] mutante: as a confirmation, yeah
[20:48:07] also no wdqs-internal
[20:48:13] yeah, we are good here. I can file a task for this tomorrow.
[20:49:18] ryankemper: inflatador: so yeah, if wdqs-internal is going away for good, you can simply rm the file and also the etcd directory (that one is more tricky, so make sure you paste the command here or somewhere else for review)
[20:49:32] if you are going to bring this service back, just leave this alone
[20:50:24] as long as you both and brett have followed through https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service and there are no pending alerts, it's all good
[20:51:17] excellent, yes this service is going away for good so I will nuke those files
[20:51:22] starting with the simple rm then will paste etcd dir
[20:51:26] ok
[20:51:31] yes thanks
[20:51:42] and when you remove the IPs from netbox, make sure to run the sre.dns.netbox cookbook as well
[20:53:01] ah, will do that
[20:54:18] your etcd removal command should be
[20:54:21] etcdctl -C https://conf1007.eqiad.wmnet:4001 rmdir /conftool/v1/pools/eqiad/wdqs-internal/
[20:54:28] etcdctl -C https://conf1007.eqiad.wmnet:4001 rmdir /conftool/v1/pools/codfw/wdqs-internal/
[20:55:02] please double check
[20:55:10] checking
[20:55:50] and if you get an error 11 during the removal, you will need to pass --username root and the password in /root/.etcdrc
[20:55:54] rather, 110
[20:56:11] commands look good, will try
[20:59:45] sukhe: success (had to rmdir the subdirectory first), pasting commands in wm-ops
[20:59:51] fair enough
[21:02:56] looks good
[21:03:23] thanks for the verbose logging btw. SAL is helpful for people who may repeat steps later
[21:03:39] I have to head out now but I think we are set
[21:08:41] Thanks for all your help!
[21:26:24] s-ukhe if you end up filing a task for removing those old files LMK.
One thing I might suggest is to use a caching layer (memcached/redis etc) in front of etcd instead of reading those flat files. That way, old pools would just drop out of cache. Dunno if it's worth the effort, but assuming we already have code that caches etcd data it shouldn't be too hard
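Editor's note: the caching idea in the last message can be illustrated with a small time-to-live cache. This is a rough sketch only, assuming some read function for pool data already exists. `PoolCache`, `fetch`, and the injectable `clock` are hypothetical names for illustration and do not correspond to conftool, confd, or pybal code.

```python
import time


class PoolCache:
    """Tiny TTL cache; `fetch` is a hypothetical stand-in for an etcd read."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # pool name -> (expiry time, value)

    def get(self, pool):
        now = self._clock()
        cached = self._entries.get(pool)
        if cached is not None and cached[0] > now:
            # Entry is still fresh: serve it without touching the backend.
            return cached[1]
        value = self._fetch(pool)
        if value is None:
            # Pool no longer exists upstream: forget it rather than keep stale data.
            self._entries.pop(pool, None)
            return None
        self._entries[pool] = (now + self._ttl, value)
        return value
```

With this shape, a removed pool is served for at most one TTL after deletion and then disappears on its own, which is the behaviour the flat files under /srv/config-master/pybal did not have in this incident.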