[09:51:23] Traffic pad is up for today's meeting; we have a few very important/urgent topics to discuss, so I propose we timebox them and do the normal roundtable async this time.
[09:51:33] if something really needs to be discussed, add it to the topics :)
[09:57:48] ack!
[11:44:03] Traffic, MW-on-K8s, Performance-Team, SRE, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (jijiki) >>! In T290536#7376383, @akosiaris wrote: >>>! In T290536#7371552, @jijiki wrote: >>>>! In T290536#7364817, @Joe wrote: >>> We could thus start wit...
[11:46:38] Traffic, MW-on-K8s, Performance-Team, SRE, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (jijiki)
[12:26:15] Traffic, MW-on-K8s, Performance-Team, SRE, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (akosiaris) >>! In T290536#7383272, @jijiki wrote: > That is a good idea, I started a different task to discuss our options in partitioning our mediawiki...
[13:42:25] Traffic, Phabricator, Release-Engineering-Team: Phabricator search times out - https://phabricator.wikimedia.org/T291775 (hashar) It seems to be a search for `Mediawiki-in` across all documents (not just Tasks). I am guessing it is similar to T258803 which is about searching for `gerrit`. A comment...
[13:43:36] Traffic, Phabricator, Release-Engineering-Team (Seen): Phabricator search times out - https://phabricator.wikimedia.org/T291775 (hashar)
[14:29:45] <_joe_> heads up: I'm restarting low-traffic pybals in eqiad and codfw in a few
[15:54:42] Traffic, Analytics, Analytics-Kanban, Data-Engineering, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (Ottomata) Tested in beta, I think this is working now.
[16:02:17] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (aborrero) a:nskaggs→aborrero We will need 2 NICs connected on these servers: * primary NIC, with a public IPv4 address, `cloud...
[16:08:00] Traffic, Analytics, Analytics-Kanban, Data-Engineering, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (Ottomata) a:Mholloway→Ottomata
[17:16:57] (VarnishTrafficDrop) firing: 68% GET drop in text@eqsin during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[17:21:57] (VarnishTrafficDrop) resolved: 68% GET drop in text@eqsin during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[18:00:09] Traffic, Analytics, Analytics-Kanban, Data-Engineering, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (Ottomata) Scheduled for a backport window tomorrow.
[18:18:38] netops, Continuous-Integration-Infrastructure, DC-Ops, Infrastructure-Foundations, serviceops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (Dzahn) @hashar This should be between netops and dcops I think.
[19:03:24] \o chugging along on the `wcqs` work. here's the patch to enable trafficserver backend routing: https://gerrit.wikimedia.org/r/c/operations/puppet/+/720078 It should be pretty straightforward so planning on merging it in an hour or two after deploying a couple of other patches but let me know if anything is missing/wrong
[19:09:22] heh, I'm here to deploy roughly the same change as ryankemper except for toolhub: https://gerrit.wikimedia.org/r/c/operations/puppet/+/711648
[19:09:40] ryankemper: do you know if puppet takes care of all the changes? or do services need to be manually restarted?
[19:11:26] bblack: ^ if you're around
[19:12:32] legoktm: I believe puppet takes care of it. I remember adding a new entry for http://query-preview.wikidata.org and I don't remember needing to restart anything specifically. Not 100% sure though
[19:12:47] did you manually run puppet on specific servers?
[19:14:08] looking at puppet, ats gets auto restarted when the config file changes, but I don't see the same for varnish https://github.com/wikimedia/puppet/blob/production/modules/varnish/manifests/common/vcl.pp
[19:15:12] legoktm: https://phabricator.wikimedia.org/T266470#6884920 starting from around this comment has some context
[19:16:41] I'm not sure if that ats-cp was related to the backend change or was for something subsequent
[19:18:00] majavah: in my past experience, vcl changes just require a puppet run
[19:18:48] ryankemper: ack, do you want to pair and we can roll out our changes one after another?
[19:19:04] legoktm: so I think if I'm reading that old ticket correctly, I merged the patch, ran puppet on one `cp-ats` server and made sure that `sudo cat /etc/trafficserver/remap.config | grep query` showed the change, then ran puppet on the rest of `cp-ats`
[19:20:29] legoktm: sure! only thing is I need to merge a couple patches before, so might not be ready to proceed with mine for another 45 mins or so
[19:21:03] cumin no longer has a "cp-ats" alias
[19:21:44] just cp-{text,upload,etc.} and cp-{text,upload,etc.}_{datacenter}
[19:23:46] ryankemper: how about 1:30 PT?
[19:24:04] legoktm: sounds perfect
[19:38:17] yeah just to confirm: puppet alone will do VCL/ATS -level changes like these
[19:41:44] bblack: for my change it should be the text cache, so sounds like the procedure for me will be something like: `disable puppet on cp-text` -> `run puppet on single cp-text host and verify remap.config looks good` -> `run/re-enable puppet on all of cp-text`?
[19:42:00] or should I disable puppet on all `cp*`, or alternatively not bother disabling puppet at all?
[20:15:13] ryankemper: the general and somewhat conservative method is: disable on all cp-text, run on one to verify that puppet succeeds on a real host, then enable+run on the whole cp-text fleet (but staggered out a bit with cumin batching, e.g. "-b 5")
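
Note: the conservative rollout described in the last message (disable everywhere, run on one host to verify, then enable and run across the fleet in small batches) could be sketched roughly as below. This is only an illustration of the ordering, not the actual tooling: the host names, the step labels, and the helper names `batches` and `rollout_plan` are all hypothetical, and the fixed-size grouping is a simplification that may not match how cumin's own `-b` batching schedules hosts.

```python
from typing import List, Tuple

# (action label, hosts the action applies to); the labels are illustrative only
Step = Tuple[str, List[str]]


def batches(hosts: List[str], size: int) -> List[List[str]]:
    """Split hosts into consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def rollout_plan(hosts: List[str], canary: str, batch_size: int = 5) -> List[Step]:
    """Order the steps: disable on all, run on one host, then staggered enable+run."""
    if canary not in hosts:
        raise ValueError("canary must be one of the hosts")
    plan: List[Step] = [
        ("disable puppet", list(hosts)),
        ("run puppet and verify", [canary]),
    ]
    # Final phase covers the whole fleet, a few hosts at a time.
    for group in batches(list(hosts), batch_size):
        plan.append(("enable + run puppet", group))
    return plan


if __name__ == "__main__":
    # Hypothetical host names, used only to show the shape of the plan.
    fleet = [f"cp-text-{n:02d}" for n in range(1, 13)]
    for action, targets in rollout_plan(fleet, canary=fleet[0]):
        print(f"{action}: {', '.join(targets)}")
```

The point of the ordering is that a bad change is caught on a single real host while puppet is still disabled everywhere else, and the later batching limits how many cache hosts pick up the change at once.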