[08:59:13] topranks: hmmm maybe I'm getting old but I'd say that I've updated T286065 regarding acmechief1001 and disable puppet on acme chief clients... has been updated afterwards after seeing how smooth row D went?
[08:59:13] T286065: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065
[09:00:24] phabricator diff says that I actually did that change :)
[09:00:29] vgutierrez: I changed it as the plan is still to disable puppet fleet-wide.
[09:00:41] gotcha
[09:00:48] sorry, you must have. We are both getting old :)
[09:01:15] But yeah, after talking with Moritz it didn't seem to make sense to take action on it specifically if we were disabling puppet everywhere temporarily.
[09:01:21] yup
[09:01:26] cool thanks
[09:02:40] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:20:57] (VarnishTrafficDrop) firing: 56% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[09:26:16] netops, Data-Persistence-Backup, Infrastructure-Foundations, SRE, bacula: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (cmooney) Thanks @jcrespo. Yes this makes perfect sense. Due...
[10:00:57] (VarnishTrafficDrop) resolved: 69% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[10:07:27] (VarnishTrafficDrop) firing: 62% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[10:37:27] (VarnishTrafficDrop) resolved: 66% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[11:38:51] Traffic, SRE: (adjust cert monitoring on planet and phabricator) Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (Dzahn) >>! In T286713#7215152, @Vgutierrez wrote: > we should reduce the threshold, 3 weeks should be better for a LE acme-chief manage...
[11:40:42] Traffic, SRE: (adjust cert monitoring on planet and phabricator) Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (Dzahn) Same with the https://phabricator.wikimedia.org cert, it is still a DigiCert cert for me. So this is about adjusting the monitor...
[11:41:28] netops, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Marostegui)
[11:41:40] mutante: I guess you're in europe now.. from US point of view is still let's encrypt
[11:50:46] vgutierrez: yes, i am spending summer in Europe, heh
[11:51:11] so since the Icinga servers are in US... i should monitor like they are LE certs then?
[11:51:43] but if i change both then we have no more monitoring of the DigiCert cert expiry
[11:52:34] it's just a bit confusing to people this is tied to specifically planet and phabricator alerts.. but yea.. they just happen to _also_ check the cert while checking https works
[11:54:02] maybe I should try to keep monitoring https but without the cert expiry option..if possible.. and then make a single new check just for the DigiCert
[11:54:57] or... and this is the easiest fix.. I change the planet check to use the "LE" checkcommand.. and leave the other untouched. then we still have 1 alert for DigiCert but not 2
[11:55:24] i'll think that i can do for sure, no point in having 2 alerts for the same expiry
[12:28:58] so the unified cert is monitored on a lot of places
[12:29:04] basically on every cp server that uses it
[12:30:14] for planet itself it's a pity that we don't have a check that monitors the expire date based on the total life of the cert
[12:48:37] Traffic, SRE, Sustainability (Incident Followup): False positives on PyBal IPVS diff check - https://phabricator.wikimedia.org/T286913 (Vgutierrez)
[12:50:07] Traffic, SRE, Sustainability (Incident Followup): LVS can't handle losing a NIC on eqiad and codfw - https://phabricator.wikimedia.org/T286924 (Vgutierrez)
[14:40:57] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (ops-monitoring-bot) Icinga downtime set by mmandere@cumin2002 for 1:00:00 4 host(s) and their services with reason: Eqiad row C maintenance ` cp[...
[14:43:00] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Vgutierrez)
[14:46:57] (VarnishTrafficDrop) firing: 69% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[14:49:33] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (herron)
[14:50:26] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (ops-monitoring-bot) Icinga downtime set by mmandere@cumin2002 for 1:00:00 1 host(s) and their services with reason: Eqiad row C maintenance ` lvs...
[14:51:57] (VarnishTrafficDrop) resolved: 69% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[14:53:57] (VarnishTrafficDrop) firing: 67% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[14:54:46] vgutierrez: heads up on our network change on row C in a few mins.
[14:54:55] we're ready for you :)
[14:55:01] ok great :)
[14:55:46] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Vgutierrez)
[15:03:57] (VarnishTrafficDrop) resolved: 68% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[15:04:10] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff)
[15:07:56] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (aborrero)
[15:10:28] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (herron)
[15:22:15] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (aborrero)
[15:23:12] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[15:23:49] netops, Infrastructure-Foundations, SRE: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[15:27:35] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (Papaul) switch shipped out today tracking information below Tracking Number: 1ZA19A021295420730
[16:49:57] (VarnishTrafficDrop) firing: 50% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[16:54:57] (VarnishTrafficDrop) resolved: 50% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[17:02:21] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney) All went very well with the change, this time I ran rapid ping from the CR to see if any packet loss was observed, and did detect some loss,...
[17:02:37] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney) Open→Resolved
[17:02:45] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)