[08:25:24] elukey: everything looking good on cp4032 after running puppet there.. it looks like puppet automatically restarts varnishkafka on config updates
[08:26:26] vgutierrez: perfect!
[08:26:41] lemme check metrics
[08:27:09] I'm gonna apply it cluster wide at puppet level and proceed as suggested in wikitech (sudo cumin -s5 -b3)
[08:29:09] uh...
[08:29:35] I think I've found a typo on our puppetization that prevents from having monitoring on statsv
[08:31:10] https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&from=now-3h&to=now&var-datasource=ulsfo%20prometheus%2Fops&var-source=statsv&var-cp_cluster=All&var-instance=cp4032
[08:31:14] yeah
[08:31:22] elukey: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/cache/text.yaml#199
[08:31:24] :(
[08:31:42] it looks to me like that line should say statsv instead of eventlogging
[08:31:49] the eventlogging one is on line 216
[08:32:13] but at line 200 I see profile::cache::kafka::statsv::monitoring_enabled: true
[08:32:45] hmmm right
[08:32:47] it's just a dupped line
[08:32:59] E_COFFEE
[08:34:38] so on cp4032 I see /var/cache/varnishkafka/statsv.stats.json
[08:34:49] that in theory should be picked up by the prometheus exporter
[08:38:14] it isn't happening for any reason?
[08:43:39] Traffic, DNS, SRE: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (Vgutierrez) updating the subscribers list to add "our" Brandon :)
[08:44:33] Traffic, DNS, SRE: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (jcrespo) ups, sorry, my bad!
[08:47:38] elukey: from https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&from=now-3h&to=now&var-datasource=ulsfo%20prometheus%2Fops&var-source=statsv&var-cp_cluster=All&var-instance=All it looks like something is not ok with statsv on cp4032
[08:48:21] ah ok so the metrics are flowing, but the dashboard may be wrong
[08:48:22] and indeed... Aug 26 08:47:58 cp4032 varnishkafka[14019]: KAFKADR: Kafka message delivery error: Local: Message timed out
[08:48:34] ok ok lemme check
[08:49:27] maybe a proper restart of the service is required?
[08:49:44] cause obviously cp4032 is able to reach port 9093 on kafka-main1*
[08:50:40] the main difference with the other varnishkafkas is that this one pushes to kafka main
[08:50:45] meanwhile the others to jumbo
[08:51:07] let's try with a restart, even if I think something else is going on
[08:51:34] gotcha
[08:51:37] vgutierrez@cp4032:~$ nc -zv kafka-main1001.eqiad.wmnet 9093
[08:51:37] kafka-main1001.eqiad.wmnet [10.64.0.200] 9093 (?) open
[08:52:44] elukey: hmmm different cluster, different TLS material then?
[08:52:51] for mTLS authentication
[08:54:07] so I recall that we added https://wikitech.wikimedia.org/wiki/Kafka/Administration#Kafka_ACLs to Jumbo
[08:54:32] but I thought that by default nothing was enforced (on Jumbo we allow only varnishkafka to push to the webrequest topics)
[08:54:56] maybe let's revert on the host, and follow up in a task
[08:55:59] gotcha
[08:56:16] vgutierrez: I think we'd need to add acls, check on kafka-main1001 'kafka acls --list'
[08:56:29] we have mirror_maker stated explicitly
[08:56:48] something like
[08:56:52] kafka acls --add --allow-principal User:CN=varnishkafka --producer --topic statsv
[08:57:38] hmmm but if that's missing
[08:57:52] transport wouldn't matter, right?
[08:58:27] or moving to TLS isn't just a transport change from kafka's point of view?
[08:59:02] vgutierrez: let's do a manual try and remove the vk config bits for client tls auth
[08:59:33] I can do it if you give me the green light
[08:59:49] oh... I see.. we are also moving from pushing anonymously to CN=varnishkafka
[08:59:57] elukey: sure, go ahead
[09:00:07] exactly yes
[09:01:51] (It seems like I was right and godog jinxed it after all lol)
[09:02:43] elukey: so.. can we easily fix the ACL?
[09:03:26] vgutierrez: let's see if the last restart works, if so I think that what I wrote above should be the fix
[09:03:39] or we could simply skip client auth
[09:04:44] errors went back to 0 for cp4032
[09:06:39] perfect, it is how kafka interprets acls then
[09:06:46] vgutierrez: haha! if only it was so simple
[09:07:47] vgutierrez: so executing the above ACL should grant ANONYMOUS the right to read the statsv topic, and allow vk to produce to it
[09:07:55] (denying the other ones)
[09:08:02] that seems a good solution
[09:08:10] hmm s/ANONYMOUS/CN=varnishkakfa?
[09:08:23] *kafka
[09:08:59] nono ANONYMOUS will be able to read even after the acl is introduced
[09:09:15] otherwise the consumers will stop working
[09:09:22] oh gotcha
[09:09:37] we'll also need kafka acls --add --deny-principal User:ANONYMOUS --operation Write as second step to allow only vk to produce
[09:10:07] but as starter we can skip it
[09:13:25] this also brings up again the question "where do we store kafka acls?"
[09:13:39] I don't recall if they are in puppet
[09:13:49] anyway, I'll open a taks :D
[09:13:52] *task
[09:14:00] vgutierrez: ok if I proceed with the above acl?
[09:14:11] please
[09:14:19] it looks like I cannot reach bast3005.wikimedia.org
[09:14:20] :/
[09:15:11] siiiigh I was about to say, same here, traceroute stops here for me
[09:15:12] 13. wikimedia-ic316335-adm-b3.ip.twelve99-cust.net 0.0% 44 48.1 51.1 47.1 82.9 8.2
[09:15:27] Current ACLs for resource `Topic:statsv`:
[09:15:27] User:CN=varnishkafka has Allow permission for operations: Write from hosts: *
[09:15:30] User:CN=varnishkafka has Allow permission for operations: Describe from hosts: *
[09:16:28] restarting vk on cp4032
[09:18:58] tried on kafka-main1001 : kafkacat -t statsv -C -b localhost:9092
[09:19:00] all good
[09:19:08] (so anonymous can read)
[09:19:35] elukey: <3
[09:19:37] vgutierrez: I'd say let's wait 10/15 mins for new metrics, then if you want you can proceed with the rollout
[09:19:37] thanks
[09:19:43] sure
[09:20:05] <3
[09:27:35] Traffic, DNS, SRE: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (Brandon) >>! In T289618#7311261, @Vgutierrez wrote: > updating the subscribers list to add "our" Brandon :) I feel so unwanted 🥲
[09:27:51] OMG!
[10:05:03] elukey: all looking good in https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&from=now-3h&to=now&var-datasource=ulsfo%20prometheus%2Fops&var-source=statsv&var-cp_cluster=All&var-instance=All, proceeding with the rollout
[10:06:52] \o/
[10:28:04] Traffic, DNS, SRE, Patch-For-Review: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (Vgutierrez) Open→Resolved a:Vgutierrez `vgutierrez@carrot:~/wikimedia.org/operations/dns/templates$ host dev.learn.wiki. ns0.wikimedia.org. dev.learn.wiki has addres...
[10:32:49] nice job re: statsv kafka ssl
[10:49:29] netops, DC-Ops, Infrastructure-Foundations, SRE: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (cmooney) a:CDanis→cmooney
[11:14:24] Traffic, SRE, vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (Dzahn) a:Dzahn No problem, I can do this (soon).
[11:21:28] Traffic, SRE, vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (Dzahn) ` Ready to create Ganeti VM durum1001.eqiad.wmnet in the ganeti01.svc.eqiad.wmnet cluster on row D with 2 vCPUs, 8GB of RAM, 15GB of disk in the private network. `
[12:11:03] vgutierrez, godog - there are some follow ups to do in my opinion:
[12:11:53] 1) add some docs in puppet about what to add to vk if TLS is being used (the kafka acls etc..). Not a great solution, but we do it for kafka mirror maker as well. Long term it would be great to have those in puppet somehow (the commands I mean), I can open a task
[12:12:21] 2) More immediate, lock ANONYMOUS from producing to statsv (so only consumers will be able to pull without auth). We do it for vk in jumbo
[12:12:26] (for the webrequest topics)
[12:13:21] basically: kafka acls --add --deny-principal User:ANONYMOUS --operation Write --topic statsv
[12:13:36] (see https://wikitech.wikimedia.org/wiki/Kafka/Administration#Kafka_ACLs)
[12:19:50] elukey: both SGTM yeah
[12:20:41] definitely +1 on a task to at least document/keep track of the fact that acls aren't in puppet
[12:25:23] netops, DC-Ops, Infrastructure-Foundations, SRE: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (cmooney) @RobH I've updated that Wiki page now with instructions on how to create the USB disk image and begin the install. https://wikitech.w...
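Editor's sketch of the statsv ACL sequence discussed between 08:56 and 12:13. It is a dry run that only prints the commands quoted in the log (`kafka acls --list`, the `--producer` allow for `User:CN=varnishkafka`, and the later `--deny-principal User:ANONYMOUS` step) so they can be reviewed before anyone runs them on a broker. The function name `acl_plan` and the `TOPIC` / `PRINCIPAL` variables are hypothetical helpers, not something from the conversation, and the `kafka acls` wrapper is assumed to behave as quoted above.

```shell
#!/bin/sh
# Dry-run sketch: prints the ACL commands from the log, runs nothing.
# TOPIC and PRINCIPAL are illustrative variables; defaults come from the chat.
TOPIC="${TOPIC:-statsv}"
PRINCIPAL="${PRINCIPAL:-User:CN=varnishkafka}"

acl_plan() {
    # Inspect the ACLs that already exist on the cluster.
    echo "kafka acls --list"
    # Step 1: allow the TLS-authenticated varnishkafka principal to produce.
    echo "kafka acls --add --allow-principal ${PRINCIPAL} --producer --topic ${TOPIC}"
    # Step 2 (follow-up): stop unauthenticated clients from producing.
    echo "kafka acls --add --deny-principal User:ANONYMOUS --operation Write --topic ${TOPIC}"
}

acl_plan
```

Printing the plan instead of executing it keeps the sketch safe to run anywhere; per the log, the consumer check after step 1 was `kafkacat -t statsv -C -b localhost:9092` on kafka-main1001.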
[12:26:56] netops, DC-Ops, Infrastructure-Foundations, SRE: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (cmooney)
[12:43:48] Traffic, SRE, SRE-swift-storage, Thumbor: Thumbnail of deleted image shown in "File history" after new image with same filename got uploaded - https://phabricator.wikimedia.org/T281780 (jcrespo) @AntiCompositeNumber I tried purging it with no success- my guess would be that it is not on cache, bu...
[13:55:45] netops, Infrastructure-Foundations, SRE, Patch-For-Review, and 2 others: LLDP: Ganeti hosts dont correctly report lldp_parent - https://phabricator.wikimedia.org/T289679 (jcrespo) a:jbond I am assigning this to you to reflect the fact that you seem to have created a fix or workaround for it-...
[14:38:17] netops, Infrastructure-Foundations, SRE, Patch-For-Review, and 2 others: LLDP: Ganeti hosts dont correctly report lldp_parent - https://phabricator.wikimedia.org/T289679 (jbond) thanks jcrespo this has now been fixed ` lang=console,name=ganeti5001 $ sudo facter -p lldp_parent...
[14:38:26] netops, Infrastructure-Foundations, SRE, Patch-For-Review, and 2 others: LLDP: Ganeti hosts dont correctly report lldp_parent - https://phabricator.wikimedia.org/T289679 (jbond) Open→Resolved
[15:16:28] netops, Datasets-General-or-Unknown, Dumps-Generation, Infrastructure-Foundations, SRE: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713 (dcaro)
[15:36:39] netops, DC-Ops, Infrastructure-Foundations, SRE: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (Papaul) @cmooney USB in place
[15:45:04] Traffic: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (BBlack) p:Triage→Medium
[15:56:26] Traffic, SRE, PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (Aklapper)
[15:57:32] netops, DC-Ops, Infrastructure-Foundations, SRE: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (cmooney)
[16:02:44] Traffic, SRE, PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (Ladsgroup) I'm sorry if it sounds stupid or you already considered it but for the sake of being consistent with most other teams. You can have `Traffic-team` for tracking the ongoing work and `Traffic` stay...
[16:22:57] Traffic, SRE, vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (Dzahn) Open→Resolved The VM has been created and is up and running.
[16:23:37] Traffic, SRE, vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (ssingh) >>! In T289693#7312422, @Dzahn wrote: > The VM has been created and is up and running. Yes, thanks, sorry, should have updated the ticket!
[16:34:44] Traffic, SRE, PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (BBlack) >>! In T289787#7312331, @Ladsgroup wrote: > I'm sorry if it sounds stupid or you already considered it but for the sake of being consistent with most other teams. You can have `Traffic-team` for tra...
[17:09:51] Traffic, SRE, PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (Ladsgroup) Sure. Let me know if I can help with anything!
[23:16:40] netops, Infrastructure-Foundations: 2021-08-26 Primary inbound port utilisation over 80% page for mr1-esams.wikimedia.org - https://phabricator.wikimedia.org/T289820 (Legoktm)