[07:02:03] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Login attempts from bd808 get 500 on Debmonitor - https://phabricator.wikimedia.org/T369205#9972298 (SLyngshede-WMF) @bd808 Sorry I haven't booked a time slot for debugging, but time hasn't been on my side. For now I think we should just try t...
[07:44:19] slyngs: FYI I see that you removed chelsyx from data.yaml, but it's still in the nda group and reported by the daily_account_consistency_check
[07:44:49] Yes, I removed them from the nda group this morning, but thank you
[07:45:02] ack, thx
[08:26:33] topranks: https://gnmic.openconfig.net/changelog/ "Commit confirmed gNMI extension" :) Even though I don't know if Junos or Sonic supports it
[10:19:08] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9972893 (cmooney) Open→Resolved
[10:47:51] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9972928 (ayounsi) First Puppetization of new Netbox frontends: * `sudo mkdir /srv/deployment/` was needed. TODO: Add to Puppet * And then this error, fixed with `sudo mkdir /srv/ne...
[11:25:48] FIRING: [23x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:26:28] FIRING: SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:51:08] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973522 (MatthewVernon) ms and thanos frontends depooled, you're good to go from a swift POV.
[14:08:39] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973584 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9abb3472-bf69-45f5-8c93-e3c8cfbe9e4e) se...
[14:09:14] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973587 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d7f08b17-a319-4077-a271-a0ef15a438a3) se...
[14:12:58] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973590 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1d5a6d4b-345e-4f18-8342-05572d6411e7) se...
[14:19:51] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973616 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=de50ae5f-fec9-4347-b2ef-225a3af373f6) se...
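A minimal Python sketch of the kind of cross-check the daily_account_consistency_check above performs: report users who are still in an LDAP group (here 'nda') but no longer present, or absented, in data.yaml. The data.yaml path and structure, the ldapsearch invocation and the group name are illustrative assumptions, not the real check script in operations/puppet.

    #!/usr/bin/env python3
    # Sketch: flag users in an LDAP group but not (or absented) in data.yaml.
    # Paths, LDAP defaults and the data.yaml layout are assumptions.
    import subprocess
    import yaml

    DATA_YAML = "modules/admin/data/data.yaml"  # assumed path in operations/puppet

    def data_yaml_users(path: str) -> set:
        """Users still defined (not ensure: absent) in data.yaml."""
        with open(path) as f:
            data = yaml.safe_load(f)
        return {
            name for name, attrs in data.get("users", {}).items()
            if (attrs or {}).get("ensure") != "absent"
        }

    def ldap_group_members(group: str) -> set:
        """Members of an LDAP group via ldapsearch (server/base from ldap.conf defaults)."""
        out = subprocess.run(
            ["ldapsearch", "-x", "-LLL", f"(&(objectClass=groupOfNames)(cn={group}))", "member"],
            capture_output=True, text=True, check=True,
        ).stdout
        return {
            line.split("uid=", 1)[1].split(",", 1)[0]
            for line in out.splitlines()
            if line.startswith("member:") and "uid=" in line
        }

    if __name__ == "__main__":
        stale = ldap_group_members("nda") - data_yaml_users(DATA_YAML)
        for user in sorted(stale):
            print(f"{user}: in LDAP group 'nda' but not (or absented) in data.yaml")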
[14:34:15] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973728 (cmooney) Switch upgrade complete, all looks good, hosts are online and responding to ping again. Thanks f...
[14:47:24] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973806 (ABran-WMF) dbhost repooling, dbproxy reloaded, backuphost checked and looks green
[14:57:12] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973844 (MatthewVernon) Swift and thanos frontends repooled, all seems OK.
[15:22:59] topranks: XioNoX: how do I get a two-year view of a 'traffic bill' in librenms?
[16:04:39] Hello. I have a patch for adding a new thirdparty repository, if anyone has time to review. Thanks. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053724
[16:13:35] cdanis: hey
[16:13:49] tbh I'm not sure you can get that easily
[16:13:58] is there one in particular you're interested in?
[16:15:24] topranks: yeah, eqiad peering/transit
[16:15:52] and you want the percentiles?
[16:18:44] topranks: no I just want to see total bandwidth
[16:18:54] a stacked bits plot would be fine
[16:19:17] ok, yeah I don't think billing can cover it but I think we can make a custom dashboard perhaps
[16:19:19] let me have a look
[16:19:38] definitely the kind of thing that's easier with prometheus/grafana
[16:21:17] the context is, I noticed that in March this year traffic on the equinix peering link in eqiad increased a lot https://librenms.wikimedia.org/graphs/to=1720714500/id=11600/type=port_bits/from=1689178500/
[16:21:55] and I was wondering if that was traffic sloshing around from other links or what
[16:26:27] cdanis: Thx for the review.
[16:27:27] np :)
[16:36:49] yeah that is quite significant, I'm trying to think... nothing is jumping out at me in terms of policy or other changes that would account for it
[16:36:53] (I may be missing something)
[16:37:01] this is the combined peering/transit in eqiad:
[16:37:02] https://librenms.wikimedia.org/graphs/id=11611%2C11624%2C11599%2C11605%2C7292%2C11600%2C11610%2C11615/type=multiport_bits_separate/from=1657643400/to=1720715400/
[16:37:34] uhh
[16:37:55] is that WME, perhaps?
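The multi-port LibreNMS graphs above are assembled by hand-listing port IDs in the URL; a small Python helper along these lines can build the same kind of link for an arbitrary port set and time window. The URL layout mirrors the links quoted in this log; how useful a two-year window actually is still depends on RRD retention, and the example port list is just the eqiad peering/transit set from the discussion above.

    # Build a LibreNMS multiport_bits_separate graph URL for a set of port IDs.
    # The path layout is copied from the URLs in this log.
    from datetime import datetime, timedelta
    from urllib.parse import quote

    LIBRENMS = "https://librenms.wikimedia.org"

    def multiport_bits_url(port_ids, days_back=730):
        """Return a stacked-bits graph URL covering the last `days_back` days."""
        now = datetime.now()
        frm = int((now - timedelta(days=days_back)).timestamp())
        to = int(now.timestamp())
        ids = quote(",".join(str(p) for p in port_ids))  # commas become %2C, as in the log
        return f"{LIBRENMS}/graphs/id={ids}/type=multiport_bits_separate/from={frm}/to={to}/"

    # e.g. the eqiad peering/transit ports quoted above:
    eqiad_ports = [11611, 11624, 11599, 11605, 7292, 11600, 11610, 11615]
    print(multiport_bits_url(eqiad_ports))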
[16:41:13] I think it's shifted traffic from elsewhere, codfw possibly
[16:41:21] this is our total transit globally: https://librenms.wikimedia.org/graphs/id=7159,11599,11605,11611,11624,13969,16334,16765,16767,16841,16842,17843,17844,19112,19399,19413,19414,21980,22269,23135,23136,23194,23199,31633,31685,31687/type=multiport_bits_separate/from=1720629640/to=1720716040
[16:41:39] and this is total peering: https://librenms.wikimedia.org/graphs/id=7292,11600,11610,11615,13971,13972,16550,16721,16766,16788,16839,17845,17846,17849,19105,19109,19110,19339,22276,23132,23197,31635/type=multiport_bits_separate/from=1720629640/to=1720716040
[16:42:11] hm
[16:42:13] it doesn't stand out on those as you'd expect if it was "new" traffic
[16:42:26] I need to make a similar one for codfw
[16:42:41] topranks: no, there's a definite increase
[16:42:52] look, here's total global peering+transit:
[16:42:54] https://librenms.wikimedia.org/graphs/id=7292%2C11600%2C11610%2C11615%2C13971%2C13972%2C16550%2C16721%2C16766%2C16788%2C16839%2C17845%2C17846%2C17849%2C19105%2C19109%2C19110%2C19339%2C22276%2C23132%2C23197%2C31635%2C7159%2C11599%2C11605%2C11611%2C11624%2C13969%2C16334%2C16765%2C16767%2C16841%2C16842%2C17843%2C17844%2C19112%2C19399%2C19413%2C19414%2C21980%2C22269%2C23135%2C23136%2C23194%2C23
[16:42:56] 199%2C31633%2C31685%2C31687/type=multiport_bits_separate/from=1657644000/to=1720716000/
[16:42:58] uh
[16:43:04] https://w.wiki/AdW8 there it is lol
[16:43:34] so, the complicated thing is, this is the two-year view with the *current* mapping of device ports to links
[16:43:42] so this doesn't tell you much
[16:44:16] that being said, we do look hotter on egress in June than we did in April/May
[16:45:55] indeed yeah the ports changing is something to bear in mind, although most probably haven't
[16:46:28] yeah I am kinda skeptical of the two-year view but it really does look like things changed even from a few months ago
[16:47:13] we are hotter alright, I was more focussing on the March 17th jump in use in eqiad
[16:47:14] https://w.wiki/AdWM
[16:50:14] for which I don't see an equivalent jump on total transit/peering combined: https://w.wiki/AdWQ
[16:52:51] topranks: I'm not convinced I'm holding this netflows query right, but, check this out: https://w.wiki/AdWW
[16:53:20] is 'AS Src 64600' the right thing to use as a filter here?
[16:53:28] (that's our internal confederation number or something?)
[16:53:42] oh that's pybal lol
[16:54:20] 64600 is PyBal
[16:54:21] yeah
[16:54:33] I didn't know we had netflow data back this far
[16:55:26] https://w.wiki/AdWZ
[16:55:34] it seems fair to say that our egress is up
[16:55:37] just, globally
[16:55:57] maybe the real question is, how much of this is AI researchers trying to mass-download Commons
[16:57:16] that is indeed a very interesting question to ask
[16:59:20] hm, looks like you don't get IP Src or related fields earlier than May
[16:59:23] does seem up, yep. I need to dig more into how the ASN stuff works out in our netflows
[16:59:55] me too tbh :)
[17:00:19] oh hm I also don't remember if all sites have netflow exporters, nowadays
[17:00:23] I think it should be fine to use that 64600 to get all the PyBal stuff, to the routers it's traffic from an external network same as any other
[17:00:29] yeah
[17:00:31] I think they do
[17:00:51] there are VMs at each site anyway
[17:01:24] oh maybe that was the original issue, I think we added netflow before we had ganeti at every site
[17:13:06] * elukey afk!
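A hedged sketch of the 'AS Src 64600' idea discussed above: given pmacct/netflow-style flow records, sum the sampled bytes sourced from PyBal's AS per day to approximate total egress. The column names (as_src, bytes, stamp_inserted), the 1:1000 sampling factor and the parquet file in the usage comment are assumptions about the export format, not the real schema behind the w.wiki dashboards linked above.

    # Approximate daily egress from flow records sourced by PyBal (AS 64600).
    # Column names and the sampling factor are assumptions, not a known schema.
    import pandas as pd

    SAMPLING_RATE = 1000  # assumed 1:1000 flow sampling

    def daily_egress_gbit(flows: pd.DataFrame) -> pd.Series:
        """Average daily egress (Gbit/s) for traffic with src AS 64600."""
        pybal = flows[flows["as_src"] == 64600].copy()
        pybal["day"] = pd.to_datetime(pybal["stamp_inserted"]).dt.floor("D")
        bytes_per_day = pybal.groupby("day")["bytes"].sum() * SAMPLING_RATE
        return bytes_per_day * 8 / 1e9 / 86400  # bytes/day -> average Gbit/s

    # usage (hypothetical export file):
    # print(daily_egress_gbit(pd.read_parquet("netflow_sample.parquet")))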
o/
[17:19:24] cdanis: I tried to map out the global usage using some of the gnmi stats
[17:19:26] https://w.wiki/AdXT
[17:19:48] that's packets-per-second based on the qos queue data, we only have the throughput stats for the past 2 weeks
[17:20:21] definitely showing about a million pps more now than at the tail end of 2023
[17:53:26] yeesh
[17:53:31] topranks: any chance you have bytes instead of pps?
[18:15:36] cdanis: afraid not, we are collecting those stats now but have only been doing so for the past few weeks
[18:16:27] oh wait maybe I'm mixed up - one sec
[18:18:16] cdanis: yep I was being dumb, I've changed it now to show bits/sec
[18:19:46] topranks: this is a subset of the data, right?
[18:20:18] I believe it should be all of it apart from magru
[18:20:29] hm
[18:20:33] okay cool
[18:20:47] and it's doing a query based on interface description - which is not ideal but does mean the physical port changes aren't a factor
[18:20:47] I was just looking at this https://grafana.wikimedia.org/goto/3kyLNBlSg?orgId=1
[18:21:19] to a first approximation cp_upload ethernet TX *is* our internet egress, and yeah that lines up pretty well
[18:21:49] yep seems to
[18:22:12] ulsfo and eqiad egress are both greatly increased over the past few months
[18:24:48] hmm yep
[18:27:01] both locations where you'd not be surprised to find AI datacenters, either
[18:27:05] mhm
[18:28:41] during the second of yesterday's incidents caused by commons scraping (lol) I had been thinking, it would probably make sense to do some splitting at the geodns level of at least the largest public cloud regions
[18:29:09] like, it is kind of a liability to have all of aws us-east-1 pointed at eqiad
[18:29:10] yeah might not be a bad idea
[18:29:55] it's in Ashburn too though so it makes sense on one level
[18:29:58] but yeah
[18:30:07] oh yeah, and it's a definite hit to latency
[18:30:11] is it better if the scraping hits our active core site?
[18:30:23] not if the scraping of images is making our egress too hot
[18:30:24] in the sense of reducing traffic across transport links?
[18:30:30] hmm
[18:30:31] no
[18:30:34] you're right
[18:30:44] for scraping of uncached stuff we do have to worry about transport links
[18:30:45] it's all a balancing act really
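A sketch of the 'cp_upload ethernet TX is roughly our internet egress' check above, done via the Prometheus HTTP API. The Prometheus endpoint is a placeholder, and the node_exporter metric selection (cluster and device labels, site grouping) is an assumption about how the cache hosts are scraped and labelled, not the exact query behind the Grafana link.

    # Estimate cache_upload egress per site from node_exporter transmit counters.
    # Endpoint and label names are assumptions for illustration.
    import requests

    PROM = "https://prometheus.example.org/api/v1/query"  # placeholder endpoint
    QUERY = (
        'sum by (site) ('
        'rate(node_network_transmit_bytes_total{cluster="cache_upload",device=~"en.*"}[5m])'
        ') * 8'
    )

    def cache_upload_egress_bps() -> dict:
        """Return estimated cache_upload egress in bits/sec, keyed by site."""
        resp = requests.get(PROM, params={"query": QUERY}, timeout=10)
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        return {r["metric"].get("site", "unknown"): float(r["value"][1]) for r in results}

    if __name__ == "__main__":
        for site, bps in sorted(cache_upload_egress_bps().items()):
            print(f"{site}: {bps / 1e9:.1f} Gbit/s")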