[06:45:56] (EdgeTrafficDrop) firing: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:50:56] (EdgeTrafficDrop) resolved: 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:16:03] Traffic, SRE, observability, Discovery-Search (Current work): flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (elukey) @Dzahn do you know if the list[12]* nodes are scheduled to be upgraded to Bullseye during the next few weeks? It would...
[08:22:56] (EdgeTrafficDrop) firing: 63% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[08:24:24] Traffic, Observability-Logging, SRE, Patch-For-Review, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (ema) >>! In T293879#7460663, @gerritbot wrote: > Change 734893 **merged** by Ema: > %%%[operations/puppet@produ...
[08:27:56] (EdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[11:46:39] hi gtraffic can i get a review on https://gerrit.wikimedia.org/r/c/operations/dns/+/734262 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/734263
[13:11:03] netops, Infrastructure-Foundations, SRE, ops-eqiad: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (ayounsi) Circuit is up.
[14:03:21] ema: deployment-cache-text06 has puppet disabled by you, still needed or ok to re-enable?
[14:04:12] majavah: not needed, re-enabled and running now
[14:04:13] thanks!
[14:04:21] thx!
[14:53:13] netops, Infrastructure-Foundations, SRE, netbox, Patch-For-Review: Stage drmrs in Netbox - https://phabricator.wikimedia.org/T283594 (ayounsi) Open→Resolved a:ayounsi Netbox now reflects reality. Only cable IDs and asset tags are missing.
[15:25:06] I've gotta push the digicert cert out to hosts today too, we're at 8 days from issuance now, and too many more days and we'll run risks on the other side for at least RSA
[15:25:36] going to push just the cert patch first - https://gerrit.wikimedia.org/r/c/operations/puppet/+/732009 - this doesn't make it live anywhere, just makes it deployed and available.
[15:35:01] netops, Infrastructure-Foundations, SRE, ops-eqiad: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (RobH) Open→Resolved
[15:42:58] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul)
[15:43:32] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul) @Dzahn mw2255 is done
[18:41:40] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (RobH)
[18:41:49] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (RobH)
[18:43:45] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (RobH) a:Jclark-ctr
[19:22:11] as we're getting pretty close to the local daily minimum on eqsin traffic patterns, gonna push the digicert-2021 update there shortly
[19:22:41] (will start with one node, then all eqsin, and then leave esams for tomorrow)
[20:00:55] bblack: might be interesting for you right now: https://logstash.wikimedia.org/goto/492ad53f42741268eafa830ecd52b9f5
[20:00:59] (interesting in that, nothing seems to have changed)
[20:30:13] yeah good news :)
[20:31:12] esams switch is @ https://gerrit.wikimedia.org/r/c/operations/puppet/+/735009 - will save for tomorrow
[20:55:16] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (jijiki)
[21:03:45] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) Thanks @Papaul ! it's back in service now I am not sure what is next exac...
[21:43:25] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul) @Dzahn thank you. I think it is best to just close this task and go "on d...
[22:27:24] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) Open→Resolved a:Dzahn I agree and boldly resolve it, expecting...
[22:38:44] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn)
[22:55:28] Traffic, SRE, observability, Discovery-Search (Current work): flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (Dzahn) @elukey Not list* but we could potentially test it with librenms.wikimedia.org. That fulfills the requirements of "uses...