[01:54:25] serviceops, Patch-For-Review: Test running php7.2 and php7.4 in parallel on the beta cluster - https://phabricator.wikimedia.org/T295578 (tstarling) I ran the following maintenance scripts on the beta cluster with PHP 7.4: * mwscript maintenance/cleanupUploadStash.php --wiki=enwiki ** Broken independent...
[03:12:02] serviceops, Patch-For-Review: Test running php7.2 and php7.4 in parallel on the beta cluster - https://phabricator.wikimedia.org/T295578 (tstarling) I filed the first bug as T310765.
[03:40:10] serviceops: Allow coexisting php version in our puppet code - https://phabricator.wikimedia.org/T293450 (tstarling)
[03:40:28] serviceops, Patch-For-Review: Test running php7.2 and php7.4 in parallel on the beta cluster - https://phabricator.wikimedia.org/T295578 (tstarling) I filed the second bug as T310767. Note that both bugs represent backwards-compatible changes in PHP 7.4 in which incorrect code was previously silent but...
[03:41:02] serviceops, Patch-For-Review: Test running php7.2 and php7.4 in parallel on the beta cluster - https://phabricator.wikimedia.org/T295578 (tstarling) Open→Resolved a:JMeybohm→tstarling
[07:40:10] serviceops, DNS, SRE, Traffic, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (SLyngshede-WMF) p:Triage→Medium
[15:30:39] hi, I have requested a new ssh key pair for the deployment server keyholder. It will be used by scap to push wikiversions.json updates back to Gerrit (we currently use local personal key pairs). https://phabricator.wikimedia.org/T310620
[15:31:02] if that can please be added to some backlog to be processed by anyone who knows about keyholder
[16:04:41] howdy all. i'm trying to figure out a phab deploy for T310742, which is breaking things for a bunch of users. should be brief, but i'm afraid of paging everybody again. advice welcome.
[16:07:33] serviceops, Data-Engineering, Event-Platform, SRE: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Ottomata) a:Jelto→None
[16:08:59] serviceops, Data-Engineering, Data-Engineering-Kanban, Event-Platform, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (JArguello-WMF)
[16:14:02] serviceops, Data-Engineering, Data-Engineering-Kanban, Event-Platform, SRE: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Ottomata)
[16:37:54] good luck brennen :')
[16:44:42] at least it's well documented! https://gerrit.wikimedia.org/g/operations/puppet/+/1a6b2be46b48460f50105a72d09bb3df29a095ca/modules/phabricator/README
[16:47:48] although, note the use of the passive voice on line 84 of that file, "an outage needs to be planned" --- just vague enough to cover a lot of unknowns :)
[16:52:06] would it be a Very Bad Idea (tm) to manually patch https://phabricator.wikimedia.org/D1202 to fix the issue, then plan an outage for some other time..?
[16:52:35] i have had the same thought, but i cannot think of a way to simply patch the file in question.
[16:52:55] thcipriani and i were discussing this - it stays in php-fpm cache basically forever, right?
[16:53:31] also phd is a long-running process, but i don't know enough to know whether phd loads this code for anything. possibly not.
[16:53:39] that's my understanding: https://www.php.net/manual/en/opcache.configuration.php#ini.opcache.validate-timestamps is falsy on phab
[16:54:20] and there's some bug in opcache_reset and opcache_invalidate that leads to corruption (at least that's what we found with MediaWiki)
[16:54:26] how disruptive is restarting php-fpm going to be..?
[16:55:03] good q. we could downtime everything we know how to downtime in icinga and aim for a fast restart of php-fpm.
[16:55:04] I don't think it will be disruptive. It will probably wake up everyone who gets paged is our worry.
[16:55:11] ah!
[16:55:23] yeah, looking to avoid a repeat of yesterday.
[16:55:38] we need a section of our upgrade docs that reads: how not to further sre alarm fatigue
[16:58:21] okay daft question maybe but can't you "just" entirely downtime the phab hosts `phab1001` & `phab2001`?
[16:59:14] * TheresNoTime sees a button labelled "Schedule downtime for this host and all services" 🤷
[16:59:59] so yesterday we downtimed everything we could find in icinga and still managed to page some folks, but we're not sure how we did that, so we need one of the folks who got paged to help us figure out what paged them, unfortunately :(
[17:00:45] ah fair enough, and good on y'all for making sure you don't needless page people if you can help it :)
[17:01:04] s/needless/needlessly
[17:01:59] we really need to have a procedure here for this kind of short notice bugfix, but phab deploys are in sort of a weird state generally.
[17:02:22] or really what i'd say is: our knowledge of doing them is.
[17:02:29] ^
[17:19:57] afaict it would've been https://alerts.wikimedia.org/?q=instance%3D~%5Ephab.%2A%24 + https://icinga.wikimedia.org/cgi-bin/icinga/notifications.cgi?host=phabricator.wikimedia.org&ts_start=1655337600&ts_end=1655399508&limit=0&type=0&order=new2old&timeperiod=last7days&start_time=2022-06-16+00%3A00%3A00&end_time=2022-06-16+17%3A11%3A48 ...
[18:25:44] serviceops, Data-Engineering, Data-Engineering-Kanban, Event-Platform, SRE: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (SLyngshede-WMF) p:Triage→Medium
[18:38:14] serviceops, Data-Catalog, Data-Engineering, SRE, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (JMeybohm) Open→Resolved All merged. Thanks! 🎉
[18:51:32] serviceops, GitLab (Infrastructure), Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Dzahn)
[18:53:28] serviceops, GitLab (Infrastructure), Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: `gitlab-runner1001.eqiad.wmnet` - gitlab-runner1001.eqiad.w...
[19:01:04] serviceops, GitLab (Infrastructure), Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: `gitlab-runner1001.eqiad.wmnet` - gitlab-runner1001.eqiad.w...
[19:39:47] serviceops, GitLab (Infrastructure), Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by aokoth@cumin1001 for hosts: `gitlab-runner2001.codfw.wmnet` - gitlab-runner2001.codfw....
[20:36:10] serviceops, Infrastructure-Foundations, Scap, Patch-For-Review, Release-Engineering-Team (Deployment Autopilot 🛩️): Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (dancy)
[20:59:42] serviceops, Beta-Cluster-Infrastructure, SRE, Scap, Release-Engineering-Team (Seen): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (Krinkle)