[07:02:03] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Login attempts from bd808 get 500 on Debmonitor - https://phabricator.wikimedia.org/T369205#9972298 (SLyngshede-WMF) @bd808 Sorry I haven't booked a time slot for debugging, but time hasn't been on my side. For now I think we should just try t...
[07:44:19] slyngs: FYI I see that you removed chelsyx from data.yaml, but it's still in the nda group and reported by the daily_account_consistency_check
[07:44:49] Yes, I removed them from the nda group this morning, but thank you
[07:45:02] ack, thx
[08:26:33] topranks: https://gnmic.openconfig.net/changelog/ "Commit confirmed gNMI extension" :) Even though I don't know if Junos or Sonic supports it
[10:19:08] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9972893 (cmooney) Open→Resolved
[10:47:51] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9972928 (ayounsi) First Puppetization of new Netbox frontends: * `sudo mkdir /srv/deployment/` was needed. TODO: Add to Puppet * And then this error, fixed with `sudo mkdir /srv/ne...
[11:25:48] FIRING: [23x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:26:28] FIRING: SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:51:08] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973522 (MatthewVernon) ms and thanos frontends depooled, you're good to go from a swift POV.
[14:08:39] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973584 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9abb3472-bf69-45f5-8c93-e3c8cfbe9e4e) se...
[14:09:14] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973587 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d7f08b17-a319-4077-a271-a0ef15a438a3) se...
[14:12:58] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973590 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1d5a6d4b-345e-4f18-8342-05572d6411e7) se...
[14:19:51] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973616 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=de50ae5f-fec9-4347-b2ef-225a3af373f6) se...
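A minimal Python sketch of the kind of cross-check the daily_account_consistency_check above performs: report users who are still in an LDAP group (here 'nda') but no longer present, or absented, in data.yaml. The data.yaml path and structure, the ldapsearch invocation and the group name are illustrative assumptions, not the real check script in operations/puppet.

    #!/usr/bin/env python3
    # Sketch: flag users in an LDAP group but not (or absented) in data.yaml.
    # Paths, LDAP defaults and the data.yaml layout are assumptions.
    import subprocess
    import yaml

    DATA_YAML = "modules/admin/data/data.yaml"  # assumed path in operations/puppet

    def data_yaml_users(path: str) -> set:
        """Users still defined (not ensure: absent) in data.yaml."""
        with open(path) as f:
            data = yaml.safe_load(f)
        return {
            name for name, attrs in data.get("users", {}).items()
            if (attrs or {}).get("ensure") != "absent"
        }

    def ldap_group_members(group: str) -> set:
        """Members of an LDAP group via ldapsearch (server/base from ldap.conf defaults)."""
        out = subprocess.run(
            ["ldapsearch", "-x", "-LLL", f"(&(objectClass=groupOfNames)(cn={group}))", "member"],
            capture_output=True, text=True, check=True,
        ).stdout
        return {
            line.split("uid=", 1)[1].split(",", 1)[0]
            for line in out.splitlines()
            if line.startswith("member:") and "uid=" in line
        }

    if __name__ == "__main__":
        stale = ldap_group_members("nda") - data_yaml_users(DATA_YAML)
        for user in sorted(stale):
            print(f"{user}: in LDAP group 'nda' but not (or absented) in data.yaml")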
[14:34:15] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973728 (cmooney) Switch upgrade complete, all looks good, hosts are online and responding to ping again. Thanks f...
[14:47:24] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973806 (ABran-WMF) dbhost repooling, dbproxy reloaded, backuphost checked and looks green
[14:57:12] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973844 (MatthewVernon) Swift and thanos frontends repooled, all seems OK.
[15:22:59] topranks: XioNoX: how do I get a two-year view of a 'traffic bill' in librenms?
[16:04:39] Hello. I have a patch for adding a new thirdparty repository, if anyone has time to review. Thanks. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053724
[16:13:35] cdanis: hey
[16:13:49] tbh I'm not sure you can get that easily
[16:13:58] is there one in particular you're interested in?
[16:15:24] topranks: yeah, eqiad peering/transit
[16:15:52] and you want the percentiles?
[16:18:44] topranks: no I just want to see total bandwidth
[16:18:54] a stacked bits plot would be fine
[16:19:17] ok, yeah I don't think billing can cover it but I think we can make a custom dashboard perhaps
[16:19:19] let me have a look
[16:19:38] definitely the kind of thing that's easier with prometheus/grafana
[16:21:17] the context is, I noticed that in March this year traffic on the equinix peering link in eqiad increased a lot https://librenms.wikimedia.org/graphs/to=1720714500/id=11600/type=port_bits/from=1689178500/
[16:21:55] and I was wondering if that was traffic sloshing around from other links or what
[16:26:27] cdanis: Thx for the review.
[16:27:27] np :)
[16:36:49] yeah that is quite significant, I'm trying to think... nothing is jumping out at me in terms of policy or other changes that would account for it
[16:36:53] (I may be missing something)
[16:37:01] this is the combined peering/transit in eqiad:
[16:37:02] https://librenms.wikimedia.org/graphs/id=11611%2C11624%2C11599%2C11605%2C7292%2C11600%2C11610%2C11615/type=multiport_bits_separate/from=1657643400/to=1720715400/
[16:37:34] uhh
[16:37:55] is that WME, perhaps?
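The multi-port LibreNMS graphs above are assembled by hand-listing port IDs in the URL; a small Python helper along these lines can build the same kind of link for an arbitrary port set and time window. The URL layout mirrors the links quoted in this log; how useful a two-year window actually is still depends on RRD retention, and the example port list is just the eqiad peering/transit set from the discussion above.

    # Build a LibreNMS multiport_bits_separate graph URL for a set of port IDs.
    # The path layout is copied from the URLs in this log.
    from datetime import datetime, timedelta
    from urllib.parse import quote

    LIBRENMS = "https://librenms.wikimedia.org"

    def multiport_bits_url(port_ids, days_back=730):
        """Return a stacked-bits graph URL covering the last `days_back` days."""
        now = datetime.now()
        frm = int((now - timedelta(days=days_back)).timestamp())
        to = int(now.timestamp())
        ids = quote(",".join(str(p) for p in port_ids))  # commas become %2C, as in the log
        return f"{LIBRENMS}/graphs/id={ids}/type=multiport_bits_separate/from={frm}/to={to}/"

    # e.g. the eqiad peering/transit ports quoted above:
    eqiad_ports = [11611, 11624, 11599, 11605, 7292, 11600, 11610, 11615]
    print(multiport_bits_url(eqiad_ports))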
[16:41:13] I think it's shifted traffic from elsewhere, codfw possibly
[16:41:21] this is our total transit globally: https://librenms.wikimedia.org/graphs/id=7159,11599,11605,11611,11624,13969,16334,16765,16767,16841,16842,17843,17844,19112,19399,19413,19414,21980,22269,23135,23136,23194,23199,31633,31685,31687/type=multiport_bits_separate/from=1720629640/to=1720716040
[16:41:39] and this is total peering: https://librenms.wikimedia.org/graphs/id=7292,11600,11610,11615,13971,13972,16550,16721,16766,16788,16839,17845,17846,17849,19105,19109,19110,19339,22276,23132,23197,31635/type=multiport_bits_separate/from=1720629640/to=1720716040
[16:42:11] hm
[16:42:13] it doesn't stand out on those as you'd expect if it was "new" traffic
[16:42:26] I need to make a similar one for codfw
[16:42:41] topranks: no, there's a definite increase
[16:42:52] look, here's total global peering+transit:
[16:42:54] https://librenms.wikimedia.org/graphs/id=7292%2C11600%2C11610%2C11615%2C13971%2C13972%2C16550%2C16721%2C16766%2C16788%2C16839%2C17845%2C17846%2C17849%2C19105%2C19109%2C19110%2C19339%2C22276%2C23132%2C23197%2C31635%2C7159%2C11599%2C11605%2C11611%2C11624%2C13969%2C16334%2C16765%2C16767%2C16841%2C16842%2C17843%2C17844%2C19112%2C19399%2C19413%2C19414%2C21980%2C22269%2C23135%2C23136%2C23194%2C23
[16:42:56] 199%2C31633%2C31685%2C31687/type=multiport_bits_separate/from=1657644000/to=1720716000/
[16:42:58] uh
[16:43:04] https://w.wiki/AdW8 there it is lol
[16:43:34] so, the complicated thing is, this is the two-year view with the *current* mapping of device ports to links
[16:43:42] so this doesn't tell you much
[16:44:16] that being said, we do look hotter on egress in June than we did in April/May
[16:45:55] indeed yeah the ports changing is something to bear in mind, although most probably haven't
[16:46:28] yeah I am kinda skeptical of the two-year view but it really does look like things changed even from a few months ago
[16:47:13] we are hotter alright, I was more focussing on the March 17th jump in use in eqiad
[16:47:14] https://w.wiki/AdWM
[16:50:14] for which I don't see an equivalent jump on total transit/peering combined: https://w.wiki/AdWQ
[16:52:51] topranks: I'm not convinced I'm holding this netflows query right, but, check this out: https://w.wiki/AdWW
[16:53:20] is 'AS Src 64600' the right thing to use as a filter here?
[16:53:28] (that's our internal confederation number or something?)
[16:53:42] oh that's pybal lol
[16:54:20] 64600 is PyBal
[16:54:21] yeah
[16:54:33] I didn't know we had netflow data back this far
[16:55:26] https://w.wiki/AdWZ
[16:55:34] it seems fair to say that our egress is up
[16:55:37] just, globally
[16:55:57] maybe the real question is, how much of this is AI researchers trying to mass-download Commons
[16:57:16] that is indeed a very interesting question to ask
[16:59:20] hm, looks like you don't get IP Src or related fields earlier than May
[16:59:23] does seem up, yep. I need to dig more into how the ASN stuff works out in our netflows
[16:59:55] me too tbh :)
[17:00:19] oh hm I also don't remember if all sites have netflow exporters, nowadays
[17:00:23] I think it should be fine to use that 64600 to get all the PyBal stuff, to the routers it's traffic from an external network same as any other
[17:00:29] yeah
[17:00:31] I think they do
[17:00:51] there are VMs at each site anyway
[17:01:24] oh maybe that was the original issue, I think we added netflow before we had ganeti at every site
[17:13:06] * elukey afk!
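A hedged sketch of the 'AS Src 64600' idea discussed above: given pmacct/netflow-style flow records, sum the sampled bytes sourced from PyBal's AS per day to approximate total egress. The column names (as_src, bytes, stamp_inserted), the 1:1000 sampling factor and the parquet file in the usage comment are assumptions about the export format, not the real schema behind the w.wiki dashboards linked above.

    # Approximate daily egress from flow records sourced by PyBal (AS 64600).
    # Column names and the sampling factor are assumptions, not a known schema.
    import pandas as pd

    SAMPLING_RATE = 1000  # assumed 1:1000 flow sampling

    def daily_egress_gbit(flows: pd.DataFrame) -> pd.Series:
        """Average daily egress (Gbit/s) for traffic with src AS 64600."""
        pybal = flows[flows["as_src"] == 64600].copy()
        pybal["day"] = pd.to_datetime(pybal["stamp_inserted"]).dt.floor("D")
        bytes_per_day = pybal.groupby("day")["bytes"].sum() * SAMPLING_RATE
        return bytes_per_day * 8 / 1e9 / 86400  # bytes/day -> average Gbit/s

    # usage (hypothetical export file):
    # print(daily_egress_gbit(pd.read_parquet("netflow_sample.parquet")))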
o/
[17:19:24] cdanis: I tried to map out the global usage using some of the gnmi stats
[17:19:26] https://w.wiki/AdXT
[17:19:48] that's packets-per-second based on the qos queue data, we only have the throughput stats for the past 2 weeks
[17:20:21] definitely showing about a million pps more now than at the tail end of 2023
[17:53:26] yeesh
[17:53:31] topranks: any chance you have bytes instead of pps?
[18:15:36] cdanis: afraid not, we are collecting those stats now but have only been doing so for the past few weeks
[18:16:27] oh wait maybe I'm mixed up - one sec
[18:18:16] cdanis: yep I was being dumb, I've changed it now to show bits/sec
[18:19:46] topranks: this is a subset of the data, right?
[18:20:18] I believe it should be all of it apart from magru
[18:20:29] hm
[18:20:33] okay cool
[18:20:47] and it's doing a query based on interface description - which is not ideal but does mean the physical port changes aren't a factor
[18:20:47] I was just looking at this https://grafana.wikimedia.org/goto/3kyLNBlSg?orgId=1
[18:21:19] to a first approximation cp_upload ethernet TX *is* our internet egress, and yeah that lines up pretty well
[18:21:49] yep seems to
[18:22:12] ulsfo and eqiad egress are both greatly increased over the past few months
[18:24:48] hmm yep
[18:27:01] both locations where you'd not be surprised to find AI datacenters, either
[18:27:05] mhm
[18:28:41] during the second of yesterday's incidents caused by commons scraping (lol) I had been thinking, it would probably make sense to do some splitting at the geodns level of at least the largest public cloud regions
[18:29:09] like, it is kind of a liability to have all of aws us-east-1 pointed at eqiad
[18:29:10] yeah might not be a bad idea
[18:29:55] it's in Ashburn too though so it makes sense on one level
[18:29:58] but yeah
[18:30:07] oh yeah, and it's a definite hit to latency
[18:30:11] is it better if the scraping hits our active core site?
[18:30:23] not if the scraping of images is making our egress too hot
[18:30:24] in the sense of reducing traffic across transport links?
[18:30:30] hmm
[18:30:31] no
[18:30:34] you're right
[18:30:44] for scraping of uncached stuff we do have to worry about transport links
[18:30:45] it's all a balancing act really
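A sketch of the 'cp_upload ethernet TX is roughly our internet egress' check above, done via the Prometheus HTTP API. The Prometheus endpoint is a placeholder, and the node_exporter metric selection (cluster and device labels, site grouping) is an assumption about how the cache hosts are scraped and labelled, not the exact query behind the Grafana link.

    # Estimate cache_upload egress per site from node_exporter transmit counters.
    # Endpoint and label names are assumptions for illustration.
    import requests

    PROM = "https://prometheus.example.org/api/v1/query"  # placeholder endpoint
    QUERY = (
        'sum by (site) ('
        'rate(node_network_transmit_bytes_total{cluster="cache_upload",device=~"en.*"}[5m])'
        ') * 8'
    )

    def cache_upload_egress_bps() -> dict:
        """Return estimated cache_upload egress in bits/sec, keyed by site."""
        resp = requests.get(PROM, params={"query": QUERY}, timeout=10)
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        return {r["metric"].get("site", "unknown"): float(r["value"][1]) for r in results}

    if __name__ == "__main__":
        for site, bps in sorted(cache_upload_egress_bps().items()):
            print(f"{site}: {bps / 1e9:.1f} Gbit/s")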