[04:10:42] serviceops, Performance-Team, SRE, SRE-swift-storage, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) Open→Resolved a: tstarling
[05:29:54] serviceops, Performance-Team, SRE, SRE-swift-storage, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) Here's a model of the benefit of the multi-DC project for users west of codfw. The servers are 30ms closer, but codfw seems a bit slower, so if...
[08:03:59] serviceops, MW-on-K8s, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹ī¸): Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (jnuche) @Joe we were thinking we can flatten the configs one level, since we are already parsing the ent...
[08:11:03] serviceops, Discovery-Search: Coordinate with ServiceOps Team about a rework of the Search Update Pipeline - https://phabricator.wikimedia.org/T317283 (Gehel)
[09:16:50] serviceops: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ebbd495c-3b68-48ff-9689-249fb45b7e02) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 4 host(s) and their services with reason: Downtiming replaced...
[09:23:03] serviceops, Prod-Kubernetes, Shellbox, Performance-Team (Radar), Wikimedia-production-error: Shellbox error rate increased from 100/d to 1000/d, 2022-07-12 - https://phabricator.wikimedia.org/T313374 (Krinkle)
[09:56:40] serviceops, Prod-Kubernetes, Shellbox, Performance-Team (Radar), Wikimedia-production-error: Shellbox error rate increased from 100/d to 1000/d, 2022-07-12 - https://phabricator.wikimedia.org/T313374 (JMeybohm) Looking at logstash (https://logstash.wikimedia.org/goto/8b05ef476e1c74f8cb625fda5...
[10:03:02] serviceops, Parsoid, Patch-For-Review, Performance-Team (Radar): Parsoid migration to php 7.4 - https://phabricator.wikimedia.org/T312638 (Clement_Goubert) 100% of parse traffic served in php 7.4
[10:06:15] serviceops: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6f733b0f-40ea-4041-9f04-a763b9c4800e) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 3 host(s) and their services with reason: Downtiming replaced...
[10:11:40] serviceops, Discovery-Search: Coordinate with ServiceOps Team about a rework of the Search Update Pipeline - https://phabricator.wikimedia.org/T317283 (Joe) There is a general problem I have with this plan, which is that as we stand, the API and appserver clusters are reserved (as much as possible) to li...
[10:13:07] serviceops: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 (Clement_Goubert) `parse1019.eqiad.wmnet` replaced `wtp1028.eqiad.wmnet` `parse1020.eqiad.wmnet` replaced `wtp1029.eqiad.wmnet` `parse1021.eqiad.wmnet` replaced `wtp1030.eqiad.wmnet` `parse1022.eqiad.wmnet` replace...
[10:15:06] \o/ 100% of parsoid traffic is now php7.4
[10:18:30] serviceops: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cab1609c-6ed0-4c7d-b9ad-b74825cd9914) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 7 host(s) and their services with reason: Downtiming replaced...
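A rough sketch of the kind of latency model tstarling describes at 05:29 (his task comment is truncated above). Only the 30 ms figure comes from the log; every other number below is an invented placeholder:

```python
# Back-of-envelope model of the multi-DC benefit for a user west of codfw.
# Assumption: a page view costs several network round trips (TCP + TLS
# handshakes plus the HTTP exchange), so an RTT saving is paid multiple times.
RTT_SAVING_MS = 30      # codfw is ~30 ms closer than eqiad (from the log)
SERVER_PENALTY_MS = 15  # hypothetical: codfw serves each request a bit slower
ROUND_TRIPS = 3         # hypothetical round-trip count per page view

net_saving_ms = ROUND_TRIPS * RTT_SAVING_MS - SERVER_PENALTY_MS
print(f"modelled net saving per page view: {net_saving_ms} ms")  # 75 ms
```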
[10:19:00] serviceops: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=70990857-b496-4c85-adb4-d10e9748c1fe) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 15 host(s) and their services with reason: Downtiming replace...
[10:32:46] serviceops, SRE: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Mvolz)
[10:33:06] serviceops, SRE: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Mvolz)
[11:59:31] jayme: o/ ping about 409s received, I'll try to look into it, sigh
[11:59:50] elukey: might be a non-issue, though
[12:00:12] just stumbled over it 🤷
[12:00:41] the 409 return code is also new for me
[12:00:50] but better to check of course
[12:01:25] I think it happens when something tries to PATCH an outdated resource version
[12:24:32] serviceops, API Platform, Growth-Structured-Tasks, Image-Suggestions, and 7 others: GrowthExperiments\NewcomerTasks\AddImage\ServiceImageRecommendationProvider::get Unable to decode JSON response for page {title} upstream connect error or disconnect/reset b... - https://phabricator.wikimedia.org/T313973
[12:25:55] serviceops: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=44dff5a3-6ba9-406e-bc1e-cda6337f8c64) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 2 host(s) and their services with reason: Downtiming replaced...
[12:26:24] serviceops: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ad41fb53-7491-45d5-83e5-1e634c9d9190) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 2 host(s) and their services with reason: Downtiming replaced...
[12:38:02] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert) Change this task to a proper decommission checklist.
[12:38:23] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert)
[12:39:55] For decom of appservers, is the order "remove from puppet, then run the decom cookbook" like the checklist form seems to imply, or "run the decom cookbook first, then merge the puppet change removing the hosts" as https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Removing_old_appservers_from_production_(decom) says? I'd rather be sure :)
[12:44:52] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert)
[12:47:19] Nevermind, I think I was just confused by "[] - any service group puppet/hiera/dsh config removed"
[12:54:26] serviceops, SRE: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Jdforrester-WMF)
[13:12:26] claime: the decom should be run before removing the hosts from site.pp
[13:13:22] volans: thanks! :)
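On the 409s elukey and jayme puzzle over at 11:59-12:01: the Kubernetes apiserver answers 409 Conflict when a write carries a resourceVersion that is no longer current; this is its optimistic-concurrency check, and it matches the explanation at 12:01:25 (a PATCH whose body pins a resourceVersion behaves the same way). A minimal sketch with the official Python client; the ConfigMap name and namespace are made up for illustration:

```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()

# Read-modify-write: the object we read carries the resourceVersion we saw.
cm = v1.read_namespaced_config_map("example-config", "default")  # hypothetical object
cm.data = {"key": "new-value"}
try:
    # If anything else updated the object after our read, the apiserver
    # rejects the write with 409 Conflict instead of clobbering that change.
    v1.replace_namespaced_config_map("example-config", "default", cm)
except ApiException as e:
    if e.status == 409:
        print("stale resourceVersion; re-read the object and retry")
    else:
        raise
```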
[13:15:01] a.kosiaris, _j.oe_, I'm starting the first decom of the wtp servers (oldest one removed from rotation), I'll ping you if I run into trouble
[13:31:18] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1001 for hosts: `wtp1034.eqiad.wmnet` - wtp1034.eqiad.wmnet (**PASS**) - Downtimed h...
[13:33:12] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert)
[13:43:53] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1001 for hosts: `wtp1035.eqiad.wmnet` - wtp1035.eqiad.wmnet (**PASS**) - Downtimed h...
[13:47:52] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert)
[13:50:35] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert)
[13:55:47] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1001 for hosts: `wtp1036.eqiad.wmnet` - wtp1036.eqiad.wmnet (**PASS**) - Downtimed h...
[14:06:19] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1001 for hosts: `wtp1037.eqiad.wmnet` - wtp1037.eqiad.wmnet (**PASS**) - Downtimed h...
[14:08:43] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert)
[14:23:15] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1001 for hosts: `wtp[1038-1042].eqiad.wmnet` - wtp1038.eqiad.wmnet (**PASS**) - Down...
[14:23:51] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert)
[14:39:04] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1001 for hosts: `wtp[1043-1047].eqiad.wmnet` - wtp1043.eqiad.wmnet (**PASS**) - Down...
[14:39:25] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert)
[14:57:21] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1001 for hosts: `wtp[1025-1028,1048].eqiad.wmnet` - wtp1025.eqiad.wmnet (**PASS**) -...
[14:57:32] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert)
[14:58:57] _joe_, akosiaris, down to the last 5, if you want to take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/830802 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/830803 before I proceed
[15:00:22] serviceops, decommission-hardware, Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert)
[15:02:40] <_joe_> claime: did you already decom the servers?
[15:04:56] _joe_, The last 5 ? No, not yet. About to if you give me the go ahead
[15:05:14] hmm the mediawiki config change should have gone before decomming them
[15:05:15] <_joe_> yes
[15:05:21] <_joe_> akosiaris: not really
[15:05:24] ?
[15:05:27] <_joe_> it should go whenever we want
[15:05:39] <_joe_> unless someone reassigns those IPs today
[15:06:15] ah, won't hurt vs what would have been better ?
[15:06:29] it won't hurt, not arguing there
[15:07:59] I'll add it to the decom doc then, because there's no mention of it and when it should be done
[15:08:21] Or it's not directly apparent to me anyway
[15:09:12] I'll do the config change before continuing then
[15:09:59] claime: thanks!
[15:19:41] I'm getting a few failures at the sync-file stage
[15:20:09] 2022-09-08 15:18:18,309 [WARNING] LB lvs2009:9090 reports pool api-https_443/mw2299.codfw.wmnet as enabled/up/pooled, should be disabled/*/not pooled
[15:20:11] 2022-09-08 15:18:23,314 [ERROR] Error depooling the servers: enabled/up/pooled
[15:20:32] + a python stacktrace for safe-service-restart on poolcounter
[15:21:04] php-fpm-restart: 28% (in-flight: 31; ok: 69; fail: 14; left: 182)
[15:23:32] akosiaris: ^^
[15:24:26] I have a tmux running on deploy1002
[15:24:54] hmm
[15:25:52] It's all codfw, a fallout of multidc?
[15:27:24] fail: 111 ?
[15:27:28] yes
[15:27:45] php-fpm-restart: 92% (in-flight: 23; ok: 162; fail: 111; left: 0)
[15:28:05] well, 124 now
[15:28:10] Yeah
[15:28:14] _joe_: ^
[15:28:21] php 7.4 related?
[15:28:36] uhm.. something's wrong with conf2005/etcdmirror
[15:28:49] maybe that's related
[15:29:38] ah that ticks the codfw checkbox better
[15:30:44] Connection to etcd failed due to MaxRetryError
[15:30:52] yep
[15:31:05] MaxRetryError("HTTPSConnectionPool(host='conf1009.eqiad.wmnet', port=4001)
[15:31:11] <_joe_> ok
[15:31:18] <_joe_> so it's a network issue?
[15:31:23] <_joe_> let's move to #sre
[15:31:30] <_joe_> can someone restart etcdmirror?
[15:31:41] I can talk to the port from conf2005
[15:31:42] yup, restarting it now
[15:31:54] but yeah, the network connection seems fine right now
[15:31:57] hiccup?
[15:31:57] <_joe_> did conf2005 page?
[15:32:07] <_joe_> akosiaris: I would assume a network blip yes
[15:32:13] I'm in a meeting currently - can jump out if I'm needed
[15:32:19] I deployed nginx updates a little while ago, but that should really induce just a very tiny window of non-avail given how nginx refreshes the binary on update?
[15:32:30] I did not get paged, just saw by accident
[15:32:39] <_joe_> moritzm: sigh yeah that could kill etcdmirror
[15:32:43] PROBLEM - Check systemd state on conf2005 is CRITICAL: CRITICAL - degraded: The following units failed: etcdmirror-conftool-eqiad-wmnet.service
[15:32:48] so yeah on IRC we did get it
[15:32:52] <_joe_> yeah this was a paging alert
[15:32:54] I don't know about VO ?
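A conceptual sketch of the check behind the 15:20 warnings above (this is the idea, not the actual safe-service-restart code): the restart script depools a server via etcd/conftool and then verifies that the load balancer's own view, the pybal instrumentation seen as lvs2009:9090 in the log, agrees before it restarts php-fpm. The `lb_state` callable is a hypothetical stand-in for that interface:

```python
import time

def wait_for_depool(lb_state, server, timeout_s=10.0, interval_s=1.0):
    """Poll the load balancer's view of `server` until it shows as depooled.

    `lb_state` is a stand-in callable returning the LB's current record for
    the server, e.g. {"enabled": True, "pooled": True}.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not lb_state(server)["pooled"]:
            return True
        time.sleep(interval_s)
    # The depool was written to etcd in eqiad, but with etcdmirror down the
    # codfw load balancers never saw it; refusing to restart at this point
    # is what produced the "should be disabled/*/not pooled" failures above.
    return False
```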
[15:33:20] ah, at least I can't remember this tripped for any past nginx updates
[15:33:29] restarted etcdmirror
[15:33:31] looks fine
[15:33:34] <_joe_> ok
[15:33:35] thx
[15:33:38] It's replicating ok
[15:33:57] So, I relaunch my sync-file ?
[15:33:57] <_joe_> ahhh that's why the restarts failed
[15:34:08] <_joe_> claime: wait
[15:34:12] ok holding
[15:34:19] <_joe_> I fear we've probably just caused an issue
[15:34:40] <_joe_> yes
[15:34:42] <_joe_> https://config-master.wikimedia.org/pybal/codfw/api-https
[15:34:44] <_joe_> fuck
[15:34:50] ah shit
[15:36:40] I'm letting you take the wheel, I'm out of my depth here
[15:36:50] <_joe_> ok fixed
[15:37:19] so, 19m ago the nginx restart
[15:37:24] I think it times well ?
[15:38:10] Did the rebase at 1514Z, the sync-file not long after
[15:38:12] Yes
[15:38:58] conf2005 started alerting at 1518Z
[15:39:25] Looks like it times correctly yeah
[15:40:43] moritzm: thanks for adding that tidbit of information, really helped pinning it down
[15:40:54] Yeah, definitely
[15:40:59] I might have missed something, sorry - why would all the hosts get depooled by this?
[15:40:59] I
[15:41:11] I'm wondering how to prevent it for the next nginx update/restart, though?
[15:41:19] So what should I do with my half-scattered config file?
[15:41:27] claime: sync it again
[15:41:33] akosiaris: ack, doing
[15:44:53] claime: no failures this time around I see
[15:45:03] akosiaris: yep it's going ok now
[15:45:20] I'll add to my procedure "check etcdmirrors, objects may be closer than they appear"
[15:45:25] claime: you are lucky, you know that?
[15:45:46] Yeah. I figured.
[15:46:04] <_joe_> so, for claime's sake and anyone else curious, what happened was
[15:46:17] <_joe_> * moritz upgrades nginx
[15:46:23] <_joe_> * etcdmirror crashes
[15:47:02] <_joe_> * clement makes a deployment. The restart script sends the depool command, but given the replica is broken, that does not propagate to the codfw lvs servers
[15:47:14] <_joe_> so the script detects that and refuses to restart the server
[15:47:51] <_joe_> * when i suggest to restart etcdmirror, that replicates a state where all backends are depooled
[15:48:23] <_joe_> ah we need to fix parsoid and the jobrunners too
[15:48:39] claime: I think you will be filling your first incident report. For an almost incident, which is the best kind of incident
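A conceptual illustration of the failure chain _joe_ lays out above (not the real etcdmirror): while replication was down, each failed restart attempt left another "pooled: no" record in the eqiad source tree, and restarting the mirror delivered all of them to codfw at once. The key layout and host count below are hypothetical:

```python
# Hypothetical key layout; real conftool state lives under an etcd prefix.
source = {f"mw{n}": {"pooled": False} for n in range(2251, 2300)}  # depools piled up in eqiad
target = {f"mw{n}": {"pooled": True} for n in range(2251, 2300)}   # codfw's stale view

def resync(src, dst):
    """Catch the lagging replica up to the source's current state."""
    dst.clear()
    dst.update(src)

resync(source, target)
pooled = sum(1 for s in target.values() if s["pooled"])
print(f"{pooled} of {len(target)} codfw backends still pooled")  # 0 of 49
```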
[15:48:51] https://wikitech.wikimedia.org/wiki/Incident_status
[15:49:07] _joe_: fixing parsoid
[15:49:17] click create report, fill in the stuff you feel capable of filling in and we'll take it from there
[15:49:25] but that's bureaucracy for later
[15:49:36] <_joe_> it was an incident https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&var-site=codfw&var-cluster=appserver&var-method=GET&var-code=200&var-php_version=All&from=1662651131187&to=1662651531187
[15:49:52] <_joe_> not an outage, just degradation
[15:49:56] oh shoot
[15:50:01] it was indeed
[15:50:03] <_joe_> that is pybal's depool threshold saving our arses
[15:50:21] thanks for the summary Giuseppe
[15:50:32] <_joe_> so technically
[15:50:38] <_joe_> i earned a new t-shirt
[15:51:18] codfw parsoid repooled
[15:51:44] jobrunners look ok
[15:51:52] https://config-master.wikimedia.org/pybal/codfw/jobrunner
[15:52:38] <_joe_> ah right
[15:52:43] <_joe_> we don't restart them
[16:06:36] hi, can someone here take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/830850 ? Specifically, is it fine to change the helpfile-defaults as indicated in pcc https://puppet-compiler.wmflabs.org/pcc-worker1002/37171/deploy2002.codfw.wmnet/index.html
[16:13:59] no longer needed
[16:16:29] serviceops: Incident: 2022-09-08 codfw api-https api appserver appserver parsoid degradation - https://phabricator.wikimedia.org/T317340 (Clement_Goubert)
[16:23:12] serviceops, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 2 others: 502 Server Hangup Error on esams for "Upload a new version of this file" on Special:Upload on Commons - https://phabricator.wikimedia.org/T247454 (Krinkle)
[16:23:40] serviceops, Commons, MediaWiki-Uploading, SRE-swift-storage, and 2 others: 502 Server Hangup Error on esams for "Upload a new version of this file" on Special:Upload on Commons - https://phabricator.wikimedia.org/T247454 (Krinkle)
[16:26:03] serviceops, serviceops-collab, GitLab (CI & Job Runners), Security: Findings in Security Readiness Reviews of Trusted GitLab Runners - https://phabricator.wikimedia.org/T317341 (Jelto)
[16:29:19] serviceops, serviceops-collab, GitLab (CI & Job Runners), Security: Findings in Security Readiness Reviews of Trusted GitLab Runners - https://phabricator.wikimedia.org/T317341 (Jelto)
[18:37:10] serviceops, MW-on-K8s, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹ī¸): Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (dancy)
[18:39:36] serviceops, serviceops-collab, GitLab (CI & Job Runners), Security: Findings in Security Readiness Reviews of Trusted GitLab Runners - https://phabricator.wikimedia.org/T317341 (Mstyles)
[19:28:43] serviceops, API Platform, Growth-Structured-Tasks, Image-Suggestions, and 7 others: GrowthExperiments\NewcomerTasks\AddImage\ServiceImageRecommendationProvider::get Unable to decode JSON response for page {title} upstream connect error or disconnect/reset b... - https://phabricator.wikimedia.org/T313973
[22:16:53] serviceops, API Platform, Growth-Structured-Tasks, Image-Suggestions, and 7 others: GrowthExperiments\NewcomerTasks\AddImage\ServiceImageRecommendationProvider::get Unable to decode JSON response for page {title} upstream connect error or disconnect/reset b... - https://phabricator.wikimedia.org/T313973
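A closing note on the "depool threshold" _joe_ credits at 15:50:03: pybal refuses to honour depools past the point where too little of a pool would remain, so even with etcd claiming every backend was depooled, codfw kept serving, hence degradation rather than outage. A conceptual model only, with an illustrative threshold value rather than pybal's actual configuration:

```python
def effective_pool(servers, depool_threshold=0.5):
    """Model of a depool threshold: never serve from less than the threshold
    fraction of the pool, even if etcd marks more servers as depooled.
    Conceptual sketch, not pybal's implementation.
    """
    pooled = [s for s in servers if s["pooled"]]
    minimum = max(1, int(len(servers) * depool_threshold))
    if len(pooled) >= minimum:
        return pooled
    # Too many simultaneous depools (the incident above): keep some
    # nominally-depooled servers in rotation rather than serve from nothing.
    depooled = [s for s in servers if not s["pooled"]]
    return pooled + depooled[: minimum - len(pooled)]

# With every backend depooled, half the pool still serves traffic:
hosts = [{"name": f"mw{n}", "pooled": False} for n in range(8)]
print(len(effective_pool(hosts)))  # 4
```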