[10:18:48] Hi! Regarding: https://phabricator.wikimedia.org/T353456
[10:19:42] I just sent a patch with Eric's suggestion but this might need some sre coordination given that it's Friday before global holidays
[10:24:02] I've no natural understanding of scap stuff, but it seems wise to deploy this - akosiaris wdyt?
[10:27:45] I am a long way from a Cassandra expert, but I think it would be worth a deploy today
[10:28:59] I've been deploying restbase lately so I am OK to do the actual scap deploy.
[11:54:14] serviceops, Commons, MediaWiki-File-management, SRE, and 4 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (C.Suthorn) >>! In T266155#8710759, @TheDJ wrote: >>>! In T266155#8707579, @doctaxon wrote: >> @TheDJ...
[12:05:21] Who would be the right person to add for review for this patch? https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/985116
[12:32:05] * akosiaris checking
[12:34:08] so, this is going to be a noop on 30/38 nodes and will only decrease workers by 8 in restbase[2028-2035]
[12:34:50] the 30 noop hosts have 40 CPUs so ncpu=40 which is the same value as the one set in num_workers
[12:36:45] which matches Eric's last comment on the task
[12:37:20] and decrease will be ofc by 24, not 8 as I originally said above
[12:37:55] however, IIRC this setting is set via puppet
[12:38:01] and can't be deployed via scap
[12:38:23] * akosiaris doublechecking
[12:40:39] yeah, it's set via puppet in /etc/restbase/config-vars.yaml which then gets merged with the scap supplied one by /usr/local/bin/apply-config-restbase and installed
[12:46:08] so +2 and merge and run puppet, then?
[12:47:14] it won't work, that's what I am saying
[12:47:29] puppet supplied settings take precedence
[12:47:49] I am crafting a puppet patch to do the same thing though
[12:50:22] OIC, sorry
[12:51:37] https://gerrit.wikimedia.org/r/c/operations/puppet/+/985154
[12:53:01] as a side note, this is a prime example of software that someone tried to tune to the hardware available (even with such crude heuristics as ncpu) instead of tuning it to the needs of users.
[12:57:57] thanks alex
[13:04:05] thanks akosiaris
[13:17:54] Will the puppet patch also require an extra scap deploy to pick up the changes ?
[13:20:40] that's what I am trying to find out
[13:20:53] puppet runs the following
[13:21:06] /usr/bin/scap deploy-local -D 'log_json:False' --repo restbase/deploy --force config_deploy;
[13:21:15] which spews out
[13:21:29] Rendering config_file: /srv/deployment/restbase/deploy-cache/revs/40c15b1d55475ef5936b48bf923a5538dbd828bb/.git/config-files/etc/restbase/config.yaml using /etc/restbase/config-vars.yaml
[13:21:36] so it shouldn't need anything else
[13:21:45] but somehow I don't see a difference
[13:22:14] I wonder whether during some migration the behavior changed...
[13:23:50] scap config-files.yaml has this
[13:23:57] ---
[13:23:57] template: config.yaml.j2
[13:23:57] erb_syntax: True
[13:23:57] remote_vars: /etc/restbase/config-vars.yaml
[13:24:02] ---
[13:24:02] /etc/restbase/config.yaml:
[13:24:03] template: config.yaml.j2
[13:24:03] erb_syntax: True
[13:24:03] remote_vars: /etc/restbase/config-vars.yaml
[13:24:13] so it should honor it...
[13:25:45] omg
[13:26:17] it's definitely not honoring the variable, the resulting file is still saying num_workers: ncpu
[13:29:09] the config yaml template has it hardcoded in the first place
[13:29:49] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/restbase/deploy/+/refs/heads/master/scap/templates/config.yaml.j2#4
[13:30:49] indeed, so it's not picking up anything apparently?
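(Editor's note on the finding above: a minimal sketch of why the puppet-supplied value had no effect while the template hardcoded the setting. It uses Python's `string.Template` purely as a stand-in for scap's real template rendering, which is not reproduced here; the names `HARDCODED`, `PARAMETERISED` and `render` are hypothetical, and the value 40 is the one quoted in the log.)

```python
from string import Template

# Hypothetical stand-ins for the template line before and after the fix.
# The real template uses ERB-style syntax; string.Template is only an analogy.
HARDCODED = Template("num_workers: ncpu")
PARAMETERISED = Template("num_workers: $num_workers")


def render(template: Template, config_vars: dict) -> str:
    """Render a template against the contents of a variables file."""
    # Variables that the template never references are silently unused,
    # so an override in the vars file changes nothing in the output.
    return template.safe_substitute(config_vars)


# Assumed to mirror what puppet writes to /etc/restbase/config-vars.yaml.
config_vars = {"num_workers": 40}

print(render(HARDCODED, config_vars))      # num_workers: ncpu
print(render(PARAMETERISED, config_vars))  # num_workers: 40
```

The point of the sketch is only that a variables file can be read and merged correctly and still have no visible effect if the template never refers to the variable.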
[13:31:12] wanna deploy a change to switch that to num_workers: <%= num_workers %> ?
[13:31:24] ok, let me send a patch
[13:34:20] should i just restore my previous patch?
[13:34:55] sure, go ahead, although you probably want just 1 of the 5 files
[13:36:14] nvm: https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/985159
[13:42:04] ok should i try scap?
[13:42:09] deploy
[13:44:45] yup
[13:46:37] ok can you check what restbase2015.codfw picked up?
[13:46:46] *what config
[13:47:06] doing so
[13:47:29] num_workers: 40
[13:47:30] done
[13:47:32] ok
[13:47:37] worked, awesome.
[13:47:40] i will run scap to the rest of the nodes then
[13:47:58] Now as to why it didn't work on my local tests... somehow I suspect some scap magic
[13:48:14] anyway, this worked, so I am happy
[13:53:34] waiting for scap to complete deployment to all nodes, but so far so good
[14:05:41] serviceops, Commons, MediaWiki-File-management, Thumbor: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (TheDJ)
[14:08:38] serviceops, Commons, MediaWiki-File-management, SRE, and 4 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (TheDJ) >>! In T266155#9423263, @C.Suthorn wrote: > It would make a much better UX, if instead of the...
[14:21:28] \o/
[14:21:39] nemo-yiannis: was the host list updated for scap?
[14:24:08] nemo-yiannis: that looks to be no (which unfortunately means that the hosts that needed the new setting, didn't get it)
[14:27:31] nemo-yiannis: https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/985161
[14:28:26] urandom: that feels like something it'd be nice if it could be updated automatically / templated from hiera
[14:29:13] (though that's obviously not a Friday-22-December sort of change!)
[14:29:22] Emperor: automatic would be nice, yes. I'm not sure how you'd do that via Puppet though
[14:29:48] better yet would be to get RESTBase off of that cluster, and then we wouldn't need a scap deploy at all :)
[14:30:40] +1 :)
[14:44:48] Can you merge? Then i can run another scap deploy
[14:45:08] urandom: ^
[14:45:40] nemo-yiannis: are you able to +1?
[14:45:59] i can, but i really don't know anything about the nodes
[14:46:06] patch looks OK
[14:46:06] ok
[14:46:55] nemo-yiannis: merged.
[14:47:15] ok
[14:51:16] gah, it just occurred to me that 34 & 35 are going to error, because those haven't been added to the cluster (yet)
[14:52:20] (yet more points in favor of not duplicating lists of things across systems)
[14:52:59] should i continue with scap or press stop ?
[14:53:07] i only deployed the canary nodes
[14:53:53] what will scap do here? will it still successfully deploy to all the others?
[14:54:00] it won't try to rollback, will it?
[14:54:13] not sure, what errors are you expecting ?
[14:54:26] I mean, they aren't set up for a deploy
[14:54:41] i don't know what's gonna happen
[14:55:04] if you're at a good stopping point, let me push another change
[14:55:17] I can't see anything bad happening, but let's play it safe
[14:55:30] ok
[14:57:14] nemo-yiannis: OK, going to merge https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/985163
[14:57:29] ok
[14:57:36] (...taking restbase203[4-5] off that list for now)
[14:58:05] nemo-yiannis: Ok, done.
[14:58:08] sorry for the confusion
[14:58:31] ok
[14:58:56] trying again
[15:16:16] serviceops, SRE, ops-codfw: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (Jhancock.wm) Open→Resolved a:Jhancock.wm Got the part replaced. The bios settings got wiped when the CMOS battery got swapped out. I believe everything is back to how it should be. the idra...
[18:02:22] serviceops, SRE, ops-codfw: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (Volans) @Jhancock.wm the easiest and safest way to reconfigure a BIOS is to run the `sre.hosts.provision` cookbook like it was a new host just with some options to skip unnecessary steps like `--no-dhc...
[19:54:51] serviceops, SRE, ops-codfw: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (Jhancock.wm) thank you. I'll remember that one.
[20:49:12] serviceops, SRE, ops-codfw: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (sbassett) Any objections to making this public now?