[06:53:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:03:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:41:28] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster
[08:07:17] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster executed with errors: - cp6002...
[10:08:18] netops, Infrastructure-Foundations: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (ayounsi) p:Triage→High
[10:09:19] netops, Infrastructure-Foundations, SRE: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (ayounsi) Thanks for taking care of it. Proper fix is most likely T295672.
[10:43:55] netops, Infrastructure-Foundations, SRE: Rebuild Routinator (rpki) VMs with larger disk - https://phabricator.wikimedia.org/T292503 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin1001 for hosts: `rpki1001.eqiad.wmnet` - rpki1001.eqiad.wmnet (**PASS**) - Downtimed hos...
[11:51:30] netops, Infrastructure-Foundations, SRE, Patch-For-Review: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (cmooney) Please ignore the above, unrelated CRs. I pasted the wrong task ID when doing the commit.
[12:24:16] netops, Infrastructure-Foundations, SRE: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (cmooney) a:cmooney
[12:38:43] netops, Infrastructure-Foundations, SRE: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (cmooney) > My guess would be that this is Charter filtering traffic on their IXP port to only routers they have peerings with, for security/anti-DDoS reasons. > >...
[14:50:22] netops, Infrastructure-Foundations: Upgrade core routers to Junos 20+ - https://phabricator.wikimedia.org/T295690 (ayounsi) p:Triage→Low
[14:54:37] netops, Infrastructure-Foundations, fundraising-tech-ops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (ayounsi) p:Triage→Low
[17:14:57] (EdgeTrafficDrop) firing: 69% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org
[17:19:57] (EdgeTrafficDrop) resolved: 69% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org
[18:33:01] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster
[18:36:08] vgutierrez: did you open a task by any chance for cp2032?
[18:36:31] not sure if related bug I saw this in syslog a couple of minutes before the last log line before the reboot
[18:36:59] Nov 15 17:32:57 cp2032 cadvisor[1102]: E1115 17:32:57.904462 1102 info.go:87] Failed to get disk map: open /sys/block/nvme0c33n1/dev: no such file or directory
[18:37:42] Hmm nope
[18:38:02] s/bug/but/
[18:39:08] nah, red herring
[18:39:29] we have 288 of those messages in each syslog file (~per day)
[18:46:33] I never have time to dig into it
[18:47:03] but every time I notice cadvisor on our nodes, it bugs me on some level. I know it seems to consume an inordinate amount of resources for what metrics it's bringing us in practice
[18:58:37] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster executed with errors: - cp6001...
[19:31:36] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster
[19:57:29] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster executed with errors: - cp6001...
[20:07:38] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster
[20:29:41] bblack: when retiring a service behind trafficserver, first remove ATS config snippet and then remove from DNS? or first delete from DNS wait and later remove from ATS? doesn't matter?
[20:30:25] also that was just a way to say I am about to delete scholarships.wm.org unless it's a bad time to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/737979/2/hieradata/common/profile/trafficserver/backend.yaml
[20:43:54] mutante: I'd say pull the traffiserver config first
[20:44:13] bblack: ok, ACK, ty
[20:49:13] Traffic, SRE, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster completed: - cp6001 (**WARN**)...
[20:50:41] merged and checked on cp1079, letting puppet do the rest by itself