[00:20:23] Traffic, GitLab (Project Migration), Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (BCornwall)
[06:55:05] netops, DBA, Infrastructure-Foundations, SRE, and 3 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (ayounsi) Resolved→Open Thanks @Ladsgroup yeah some devices got way too verbose at sending debug logs and we don't use debug level logs for alerting so the ab...
[11:50:25] Hi folks! I have a probably trivial question: when gdnsd has to take a decision about what A Record to return for a discovery record, how does it do it?
[11:50:48] for example: say that I have inference.discovery.wmnet, that is mapped to two IPs, active active
[11:51:16] I know about confd dns-disc etc.., up to configs like /var/lib/gdnsd/discovery-inference.state on the dns servers
[11:51:44] but I am still puzzled about how, for example, in esams inference.discovery.wmnet returns a specific A record
[11:52:18] /etc/gdnsd/discovery-map seemed promising but I still have some doubts
[11:53:16] to go back to the example, in esams inference.discovery.wmnet is mapped to the eqiad VIP, where is this info mapped?
[11:54:59] (it is fine to send me to some Wikitech doc with a sad glare)
[11:56:21] IIRC but I'm not 100% sure it should be in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/geo-maps#281 basically the geo-mapping of our own DCs IP space to the various other DCs
[11:56:51] and we should probably improve https://wikitech.wikimedia.org/wiki/DNS/Discovery with the authoritative answer :D
[12:33:51] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye
[13:14:20] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye...
[13:16:46] netops, Infrastructure-Foundations, SRE: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (cmooney) p:Triage→High
[13:29:55] netops, Infrastructure-Foundations, SRE: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (cmooney) Reset completed, the card came back up briefly but quickly failed again ` cmooney@re0.cr1-esams> show chassis fpc 1 detail Slot 1 information: State...
[13:42:31] netops, Infrastructure-Foundations, SRE: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (cmooney) Some logs following the issue of the "request chassis fpc online slot 1" command: {F41507770}
[13:45:00] elukey: see modules/profile/files/dns/auth/discovery-map
[13:45:14] so from esams, you should hit eqiad as expected
[13:46:17] and of course you can vary this by using the ECS option and so if you do dig +subnet=eqsin/24 even from an esams host, you will hit codfw instead of eqiad because all it is doing is checking that
[13:47:12] this is fed from:
[13:47:12] disc-inference => {
[13:47:13]     map => discovery-map,
[13:47:13]     service_types => discovery-state-inference,
[13:47:13]     dcmap => {
[13:47:15]         codfw => 10.2.1.63
[13:47:18]         eqiad => 10.2.2.63
[13:47:20]     }
[13:47:23] }
[13:47:25] which is discovery-map above
[13:48:43] netops, Infrastructure-Foundations, SRE: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (ayounsi) JTAC case 2023-1115-011066 opened.
[13:48:50] netops, Infrastructure-Foundations, SRE: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (ayounsi) a:ayounsi
[13:50:29] sukhe: o/
[13:51:09] so IIUC the "nets" map in discovery-map should be read as "if the client is within this range, then this is the preference"
[13:51:58] like
[13:51:59] 10.64.0.0/12 => [eqiad, codfw], # eqiad private/mgmt
[13:52:03] yep
[13:52:38] super all makes sense now
[13:53:01] I will update the docs shortly
[13:53:03] so ECS can be used as well to tweak the range, and get different answers
[13:53:07] sukhe: <3
[13:53:16] otherwise I can do it, didn't mean to force you :)
[13:54:01] elukey: yeah, dig +subnet= @ns0.wikimedia.org example
[13:54:16] * elukey nods
[13:54:41] elukey: happy to update the docs, you can leave it to us :)
[13:56:46] super
[14:00:14] (played with ECS, really nice)
[14:03:55] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cp1102:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:12:08] (PurgedHighEventLag) firing: (3) High event process lag with purged on cp3066:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[14:17:02] (PurgedHighEventLag) resolved: (6) High event process lag with purged on cp3066:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[14:31:59] netops, DBA, Infrastructure-Foundations, SRE, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (Ladsgroup) Thanks for the patch! I hope it'll make a dent, I'll monitor it. While I was monitoring it, I tried this: ` root@db1217.eqiad.wmnet[librenms]> select * f...
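The lookup discussed in the 13:45-13:54 exchange (the client's range selects an ordered datacenter preference, then dcmap supplies the address) can be sketched in Python. This is an illustration under stated assumptions, not gdnsd's actual implementation: the `resolve` function, the `up` set standing in for the discovery state, the longest-prefix tie-break, and the 10.192.0.0/12 entry are all hypothetical. Only the 10.64.0.0/12 entry and the two dcmap addresses come from the log.

```python
import ipaddress

# "nets" map: client range -> ordered datacenter preference.
NETS = {
    "10.64.0.0/12": ["eqiad", "codfw"],   # eqiad private/mgmt (quoted in the log)
    "10.192.0.0/12": ["codfw", "eqiad"],  # hypothetical entry, for illustration only
}

# dcmap for disc-inference, as pasted in the log.
DCMAP = {"codfw": "10.2.1.63", "eqiad": "10.2.2.63"}


def resolve(client_ip, dcmap, nets, up):
    """Return the address of the first preferred datacenter that is up.

    `up` is a set of datacenter names considered healthy (an assumption
    standing in for the discovery state files). Returns None when the
    client matches no range or no preferred datacenter is up.
    """
    addr = ipaddress.ip_address(client_ip)
    best_net, best_prefs = None, None
    for cidr, prefs in nets.items():
        net = ipaddress.ip_network(cidr)
        # Keep the most specific range that contains the client.
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_prefs = net, prefs
    if best_prefs is None:
        return None
    for dc in best_prefs:
        if dc in up and dc in dcmap:
            return dcmap[dc]
    return None


if __name__ == "__main__":
    both_up = {"eqiad", "codfw"}
    print(resolve("10.64.1.5", DCMAP, NETS, both_up))
    print(resolve("10.64.1.5", DCMAP, NETS, {"codfw"}))
```

Passing a different client address plays the role of the ECS subnet in the dig example: the answer changes only because the address falls in a different range. gdnsd has its own handling for clients outside the nets map; the sketch simply returns None there.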
[14:43:33] netops, DBA, Infrastructure-Foundations, SRE, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (Marostegui) @ayounsi okay to truncate that table?
[14:50:22] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye
[14:50:49] netops, Infrastructure-Foundations, SRE: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (ayounsi)
[15:09:53] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1001.eqiad.wmnet with OS bullseye
[15:49:37] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[16:04:01] netops, DBA, Infrastructure-Foundations, SRE, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (Marostegui) Open→Resolved Per my chat with Arzhel in irc, table truncated! `root@db1119.eqiad.wmnet[librenms]> truncate table syslog; Query OK, 0 rows affect...
[17:06:53] ahoy - I'd like to move a service to lvs_setup, would I be okay to do an lvs restart in the next few minutes? https://gerrit.wikimedia.org/r/c/operations/puppet/+/973825
[17:10:13] hnowlan: should be fine, go for it
[17:10:25] I am guessing you know which hosts to affect already?
[17:15:46] sukhe: yeah I think so - puppet says lvs1020 and lvs2014 for secondary, lvs1019 and lvs2013 for primary
[17:16:23] sounds right, thanks
[17:16:26] let us know if we can help
[17:16:45] Traffic, Phabricator, SRE: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (Aklapper) p:Medium→Low
[17:30:00] looks like it's all good - thanks!
[17:30:20] thanks!
[18:02:06] Traffic, Data-Engineering, Observability-Logging: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (Ottomata) > moved all traffic to HAProxy ...We did?! Wow. Can you link some other tasks so I can get some context?
[18:56:51] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1001.eqiad.wmnet with OS bullseye...
[20:48:50] Traffic, Data-Engineering, Observability-Logging: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (Fabfur) Sure! The main task was https://phabricator.wikimedia.org/T323557
[21:39:04] Traffic, GitLab (Project Migration), Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (BCornwall)