[07:48:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:58:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[08:00:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[08:05:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[09:51:06] Morning all. I'm planning to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/742747 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/755435 this morning.
[09:52:30] So I'll briefly stop puppet on all cp-* nodes while I roll out the changes on cp3050 and test. Once all is clear, I'll restart puppet on all nodes.
[09:52:50] mmandere, vgutierrez --^ :)
[09:53:06] thanks for the ping
[09:53:07] * vgutierrez checking
[09:53:20] Any objections to my proceeding with the change? elu.key will be on-hand in case things do not go as planned.
[09:54:19] LGTM :)
[09:55:24] Thanks. I'll start in a few minutes then.
[09:56:44] thanks vgutierrez <4
[09:56:46] <3
[10:05:29] <4 is <3 + 1 :beer
[10:05:38] 🍺
[10:05:38] xDD
[10:05:59] sure :D
[10:38:38] Those two patches have been merged and tests with cp3050 have all been fine, so I'm about to re-enable puppet on all of the cp-* nodes again.
[10:41:02] !log re-enabled puppet on all cp-* nodes.
[10:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:09] vgutierrez: o/
[15:20:28] qq - are cp403[56] nodes not serving traffic or similar?
[15:21:20] from what I can see varnishkafka doesn't work in those nodes
[15:21:29] I mean, it doesn't send traffic to kafka
[15:21:59] :?
[15:22:09] that's weird
[15:22:56] cp4035 is a regular text node
[15:23:14] same as cp4036 actually
[15:23:16] * vgutierrez checking
[15:24:27] yeah I cannot see kafka messages for 35/36 from kafkacat, but I can see the other nodes
[15:25:08] both are serving traffic
[15:25:31] and varnishkafka-webrequest is apparently up & running
[15:25:58] it is, but not sending traffic to kafka afaics
[15:26:11] and even before today's change
[15:26:22] have you restarted it?
[15:26:27] it only has a 17 minutes uptime
[15:26:39] Active: active (running) since Wed 2022-01-26 15:07:47 UTC; 17min ago
[15:27:09] I have yes
[15:27:25] there was a metric that indicated some kafka issues, so I tried to restart
[15:27:31] but then I noticed the rest
[15:27:43] https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&from=now-12h&to=now&var-datasource=ulsfo%20prometheus%2Fops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=cp4036
[15:27:50] tcpdump shows attempts of connections
[15:28:04] yeah old logs say
[15:28:05] KAFKAERR: Kafka error (-195): ssl://kafka-main1004.eqiad.wmnet:9093/1004: Disconnected (after 1199322ms in state UP)
[15:28:28] wait, kafka-main??
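[The per-host check described above — consuming the webrequest topic and looking for messages from a given cache node — can be sketched roughly as below. The `kafkacat` invocation in the comment (broker name, topic, flags) is an assumption for illustration; the filter itself runs on two sample JSON lines so the sketch works anywhere.]

```shell
# Sketch of the "can I see kafka messages for this node?" check.
# In production the stream would come from something like
#   kafkacat -C -b <jumbo-broker>:9093 -t webrequest_text -o end
# (broker/topic names here are assumptions); we feed two sample
# webrequest-style JSON lines instead so this is self-contained.
printf '%s\n' \
  '{"hostname":"cp4035.ulsfo.wmnet","uri_path":"/wiki/Main_Page"}' \
  '{"hostname":"cp4024.ulsfo.wmnet","uri_path":"/w/index.php"}' \
  | grep -c '"hostname":"cp4035'
# Prints 1 here; against the real stream during the incident, a count
# of 0 for cp4035/cp4036 was the symptom.
```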
[15:28:35] ah yes statsv
[15:28:42] and webrequest as well
[15:28:52] oh no
[15:28:53] that's jumbo
[15:28:54] sorry
[15:29:07] weird I can reach both clusters via telnet
[15:29:24] yeah, nc shows the same
[15:29:34] BUT tcpdump doesn't show any traffic to the jumbo cluster on port 9093
[15:30:17] it's only showing traffic to kafka-main
[15:31:03] 4035 right?
[15:31:42] sorry?
[15:31:58] are you checking cp4035 ?
[15:32:02] oh yes
[15:32:11] sorry didn't specify :)
[15:32:16] super weird
[15:32:19] dunno why I was thinking about port numbers
[15:32:25] yes yes my bad
[15:33:08] vgutierrez: if you are ok I'd like to try to spin up another vk that prints to console
[15:33:12] and see if it works
[15:33:16] indeed
[15:33:18] go ahead please
[15:33:48] so the current instance of varnishkafka-webrequest hasn't logged anything
[15:33:53] that's weird
[15:34:32] openat(AT_FDCWD, "/var/lib/varnish/frontend/_.vsm", O_RDONLY) = -1 ENOENT (No such file or directory)
[15:34:45] * elukey cries in a corner
[15:34:58] I was about to suggest if it wasn't something related to the varnish shm
[15:35:44] yeah
[15:35:45] %7 VSM_OPEN: Failed to open Varnish VSL: Cannot open /var/lib/varnish/frontend/_.vsm: No such file or directory
[15:35:49] this is with debug logging
[15:37:03] I fear that we have been dropping traffic for ages on those nodes
[15:37:55] so /var/lib/varnish/frontend on those two looks identical to the other nodes
[15:38:00] _.vsm doesn't exist
[15:38:14] but _.vsm_child and _.vsm_mgt does
[15:38:22] yep
[15:38:49] got it
[15:38:58] ldd of varnishkafka
[15:39:02] on cp4024, libvarnishapi.so.2 => /lib/x86_64-linux-gnu/libvarnishapi.so.2 (0x00007f0d36d5a000)
[15:39:12] cp4035, libvarnishapi.so.1 => /lib/x86_64-linux-gnu/libvarnishapi.so.1 (0x00007fb06d9c1000)
[15:39:14] FFS
[15:39:55] ahh libvarnishapi 5.1.3-1wm15
[15:40:00] vs 6.x on the rest of the nodes
[15:40:01] indeed
[15:40:04] let me fix that
[15:41:32] removing the 5.1.3 version also triggered
[15:41:33] Unpacking varnishkafka (1.1.0-1) over (1.0.14-1) ...
[15:42:34] restarting varnish now...
[15:42:43] I'll restart the varnishkafka instances afterwards
[15:43:46] https://debmonitor.wikimedia.org/packages/libvarnishapi1
[15:43:50] we have it on a lot of nodes
[15:44:55] I guess that the main issue is the varnishkafka version
[15:45:04] 1.0.14 isn't enough
[15:45:09] I assume that you're missing metrics from 12 nodes
[15:45:43] cp1087, cp4021, cp4033, cp4034 and cp4036
[15:45:48] and all the drmrs cluster
[15:46:24] and also cp5006 and cp5012
[15:46:24] ok.. we got traffic to the jumbo cluster on cp4035 again
[15:46:46] uh, those two are using the right varnishkafka version
[15:46:51] at least according to debmonitor
[15:48:38] ok... is cp4035 back to normal regarding varnishkafka?
[15:48:42] lemme check
[15:48:52] if this is started https://phabricator.wikimedia.org/T264074 it is really bad :(
[15:49:06] you don't trust debmonitor? :-P
[15:49:50] see? 2022 is 2020 season 3
[15:50:00] that's a task from September 2020
[15:50:05] lol
[15:50:31] volans: of course I do
[15:51:01] at least when I use it
[15:51:04] * vgutierrez runs away
[15:51:05] :D
[15:51:39] vgutierrez: confirmed that cp4035 works
[15:51:40] https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&from=now-3h&to=now&var-datasource=ulsfo%20prometheus%2Fops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All
[15:52:35] vgutierrez: if you have time can we proceed with the rest? :(
[15:53:04] so I'm gonna trigger a varnishkafka upgrade on cp[6002,6005,6009-6013].drmrs.wmnet,cp1087.eqiad.wmnet,cp[4021,4033-4034,4036].ulsfo.wmnet
[15:54:02] I'll check the eqsin nodes
[15:54:41] those are running the expected version apparently
[15:54:43] maybe a restart is missing?
[15:55:40] Traffic, SRE: Create Ganeti VMs for Wikidough in drmrs - https://phabricator.wikimedia.org/T300156 (ssingh)
[15:55:42] is ok if I restart the varnishkafka instances on those 12 nodes elukey?
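[The root symptom above was a stale package: varnishkafka 1.0.14-1 linked against libvarnishapi.so.1 (varnish 5.x) on the broken nodes, vs 1.1.0-1 against libvarnishapi.so.2 (varnish 6.x) on the healthy ones. A minimal, self-contained sketch of the "installed older than expected" check, using version-aware sort (which orders simple Debian-style version strings like these the same way dpkg does):]

```shell
# Sketch: flag a node whose installed varnishkafka is older than the
# expected build. Versions hard-coded from the log; in practice they
# would come from `dpkg-query -W -f '${Version}' varnishkafka`.
installed="1.0.14-1"
expected="1.1.0-1"
newest=$(printf '%s\n' "$installed" "$expected" | sort -V | tail -n1)
if [ "$newest" != "$installed" ]; then
  echo "upgrade needed: $installed -> $expected"
fi
# → upgrade needed: 1.0.14-1 -> 1.1.0-1
```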
[15:55:47] +1
[15:57:32] cp[4021,4033-4034].ulsfo.wmnet should be fixed already
[15:58:55] cp4036 and the whole drmrs cluster should be done now
[15:58:56] super thanks
[15:59:53] so varnishkafka in cp5006 has a 5h uptime.. so it isn't the same issue
[15:59:59] *varnishkafka-webrequest
[16:00:02] nono sorry red-herring
[16:00:04] they work
[16:00:06] just checked
[16:00:08] ack :)
[16:00:27] so the problem was the missing upgrade on the nodes
[16:00:39] hmmm it's weird
[16:00:44] very yes
[16:00:48] drmrs nodes have been installed very recently
[16:00:54] I don't know why they got the old version :/
[16:01:12] Traffic, SRE: Create Ganeti VMs for durum in drmrs - https://phabricator.wikimedia.org/T300158 (ssingh)
[16:01:17] we have a lot of monitors for vk (errors mostly) but not on single instance traffic
[16:01:20] sigh
[16:01:34] ok I'll open a task to track this mess and see for how long we have dropped traffic
[16:01:40] thanks a lot vgutierrez
[16:01:41] <3
[16:01:45] thank you :D
[16:02:17] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (Cmjohnson) @cmooney I went through all the cabling and confirmed the correct patches. the connections at the demarc are pretty foolproof wit...
[16:05:13] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (Cmjohnson) @cmooney it appears to be disabled cmjohnson@re0.cr1-eqiad> show interfaces descriptions Interface Admin Link Description...
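[The restarts above target cumin-style host expressions like `cp[4033-4034].ulsfo.wmnet`. As a rough illustration of how such an expression maps to concrete hostnames, here is a hypothetical helper (not cumin itself, and deliberately simplified: it handles a single `[lo-hi]` range, not comma-separated lists):]

```shell
# Hypothetical sketch: expand a simple "prefix[lo-hi]suffix" host
# expression into individual hostnames, the way cumin-style target
# lists resolve before running e.g. `systemctl restart varnishkafka-webrequest`.
expand() {
  prefix=${1%%\[*}; rest=${1#*\[}
  range=${rest%%\]*}; suffix=${rest#*\]}
  lo=${range%-*}; hi=${range#*-}
  seq "$lo" "$hi" | while read -r n; do echo "${prefix}${n}${suffix}"; done
}
expand 'cp[4033-4034].ulsfo.wmnet'
# → cp4033.ulsfo.wmnet
# → cp4034.ulsfo.wmnet
```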
[16:23:01] Traffic: Serve redirect wikimediastatus.net --> www.wikimediastatus.net - https://phabricator.wikimedia.org/T300161 (CDanis)
[16:23:06] restarted vk instances on cp1087 as well
[16:23:16] Traffic: Serve redirect wikimediastatus.net --> www.wikimediastatus.net - https://phabricator.wikimedia.org/T300161 (CDanis)
[16:25:20] elukey: oh thanks, I missed that one
[16:29:26] vgutierrez: something good - https://phabricator.wikimedia.org/T290694
[16:29:44] the nodes are relatively recent, I think that your theory about new nodes coming up with the wrong vk is very valid
[16:30:24] cp1087 is old though
[16:31:29] not that old
[16:31:31] Debian GNU/Linux 10 auto-installed on Fri Jun 4 20:02:59 UTC 2021
[16:32:42] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (cmooney) @Cmjohnson thanks. The interfaces on the CR are down by default. Not sure if you changed anything but there is no improvement rig...
[16:32:47] ack taking notes, the outage should have started when the os was installed for the last time on these nodes
[16:33:02] funny enough I've reinstalled cp1088 on Monday and it looks ok
[16:33:06] dropping data since Jun 2021 (2k rps) is very great
[16:44:22] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (Cmjohnson) @cmooney, I have a light meter and I see light from lsw-f and lsw-e to the demarc, and then I see light to cr1 and cr2 from old c...
[16:48:11] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (cmooney) @Cmjohnson thanks ok. yeah it is odd. All the switch->switch links have come up ok (using the same CWDM4 optics), so it'd be unus...
[17:39:58] vgutierrez: found the issue
[17:40:04] varnishkafka | 1.0.14-1 | buster-wikimedia | main | amd64, source
[17:40:07] varnishkafka | 1.1.0-1 | buster-wikimedia | component/varnish6 | amd64, source
[17:40:13] and varnishkafka init.pp installs from main
[17:40:26] so the component is then added, but puppet doesn't force the upgrade
[17:40:32] that is left there hanging
[17:40:50] Sigh
[17:41:03] ok so at least there is some sense :D
[17:42:12] yeah I had related issues bringing up the new cp6 nodes recently
[17:42:46] IIRC we had to just manually force-install varnish and maybe one of the related libraries
[17:43:16] (to get everything on the right versions. puppet wouldn't do it for us, or even succeed at puppetization due to dep failures because of it)
[17:44:03] probably needs some kind of fixing to the puppetized apt pinning or whatever
[17:47:22] I opened https://phabricator.wikimedia.org/T300164 to track what happened
[18:09:54] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (cmooney) @Cmjohnson reversed the fibers and we got the links up: ` cmooney@re0.cr1-eqiad> show interfaces diagnostics optics et-1/0/2 | mat...
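[Root cause recap: varnishkafka exists in both the `main` suite (1.0.14-1) and the `component/varnish6` component (1.1.0-1) of buster-wikimedia, and puppet installs with `ensure => present`, so a node that got 1.0.14-1 before the component was added never upgrades. One conventional fix sketch (an assumption for illustration, not the actual patch from T300164) is an apt preferences pin so the component build always wins version selection:]

```
# Hypothetical /etc/apt/preferences.d/varnishkafka sketch:
# prefer the component/varnish6 build over the one in main, so a
# plain `apt-get install varnishkafka` (or puppet ensure => latest)
# resolves to 1.1.0-1 even when 1.0.14-1 is already installed.
Package: varnishkafka
Pin: release c=component/varnish6
Pin-Priority: 1001
```

A priority above 1000 makes apt prefer the pinned version even if it would otherwise count as a downgrade or the other version is installed; whether pinning or `ensure => latest` (or both) is the right fix for the puppetization is exactly the kind of decision T300164 was opened to track.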