[00:12:45] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:13:10] (CR) Krinkle: [C: +2] ResourceLoader: Remove DependencyStore::renew [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/813670 (https://phabricator.wikimedia.org/T113916) (owner: Krinkle)
[00:17:13] * Krinkle staging on mwdebug1002
[00:17:43] (PS3) Krinkle: wikitech.php: Minor cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/785889 (owner: Reedy)
[00:17:51] (CR) Krinkle: [C: +2] wikitech.php: Minor cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/785889 (owner: Reedy)
[00:18:34] (Merged) jenkins-bot: wikitech.php: Minor cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/785889 (owner: Reedy)
[00:25:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[00:28:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[00:28:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[00:29:02] !log krinkle@deploy1002 Synchronized wmf-config/wikitech.php: Ib539da0c0953 (duration: 02m 47s)
[00:29:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[00:30:22] (Merged) jenkins-bot: ResourceLoader: Remove DependencyStore::renew [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/813670 (https://phabricator.wikimedia.org/T113916) (owner: Krinkle)
[00:31:47] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[00:35:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[00:35:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[00:36:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[00:36:24] (CR) Krinkle: enwiki: Raise wgPageTriageMaxAge to indefinite (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[00:39:00] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:15] !log krinkle@deploy1002 Synchronized php-1.39.0-wmf.19/includes/ResourceLoader/: Ie11bdfdcf5e6724 (duration: 02m 55s)
[00:44:50] (PS1) Krinkle: Enable wgResourceLoaderUseObjectCacheForDeps for all wikis (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/813725 (https://phabricator.wikimedia.org/T113916)
[00:45:42] (PS2) Krinkle: Enable wgResourceLoaderUseObjectCacheForDeps for all wikis (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/813725 (https://phabricator.wikimedia.org/T113916)
[00:54:40] (CR) Krinkle: [C: +2] Enable wgResourceLoaderUseObjectCacheForDeps for all wikis (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/813725 (https://phabricator.wikimedia.org/T113916) (owner: Krinkle)
[00:55:31] (Merged) jenkins-bot: Enable wgResourceLoaderUseObjectCacheForDeps for all wikis (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/813725 (https://phabricator.wikimedia.org/T113916) (owner: Krinkle)
[01:01:33] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:02:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:02:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:03:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:03:41] !log krinkle@deploy1002 Synchronized php-1.39.0-wmf.19/includes/ResourceLoader/: Ie11bdfdcf5e6724 (duration: 02m 55s)
[01:09:11] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:09:44] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I73fbfee8248c (duration: 02m 45s)
[01:11:31] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:12:58] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I73fbfee8248c (duration: 02m 56s)
[01:14:03] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[01:16:37] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[01:23:07] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:31:53] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:39:17] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] (JobUnavailable) resolved: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:24:39] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:57:46] SRE, ops-codfw, Discovery-Search, Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (EBernhardson) From the other ticket these are the messages that were coming in on dmesg before the reimage was attempted: ` [Tue Jul 5 17:45:27 2022] ata2: exception Emask 0...
[03:32:40] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:39:29] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:18:54] SRE, ops-codfw, Discovery-Search, Elasticsearch: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (Papaul) @EBernhardson thanks this line is helpful "SATA link down" telling me i need to check connection from main board to disks. I will look into it once onsite
[04:20:18] (CR) Krinkle: [C: -2] "Feature is broken such that this would likely deindex over 90% of Wikipedia mainspace from search engines." [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[04:23:27] !log tstarling@puppetmaster1001 conftool action : get/ReadOnly; selector: name=ReadOnly,scope=codfw
[04:25:46] !log tstarling@puppetmaster1001 conftool action : edit; selector: name=ReadOnly,scope=codfw
[04:32:52] !log oblivian@puppetmaster1001 conftool action : edit; selector: name=ReadOnly,scope=codfw
[04:33:58] Does the global rename seem to be stuck? Is there any server issues? https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress
[04:38:55] (CR) Phuedx: [C: +1] Add sampling to android.breadcrumbs event stream. [mediawiki-config] - https://gerrit.wikimedia.org/r/811765 (https://phabricator.wikimedia.org/T310847) (owner: Dbrant)
[04:56:13] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:02:22] morning apergos ^^ looks like there's nothing to deploy later on, but I'll be around for training if there's any late additions — I also did an unexpected deploy (with ur/banecm's support) the other day ^^
[05:02:38] okey dokey!
[05:03:03] I'm two for two on not breaking everything :D
[05:03:04] we won't really know for sure until the time, people often add stuff at the last minute.
[05:12:15] (PS1) Marostegui: db2164: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/813734 (https://phabricator.wikimedia.org/T311493)
[05:13:51] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) So a few days ago db1132 died during one of the incidents. It had PS enabled and: ` performance-schema-instrument='memo...
[05:14:27] (CR) Marostegui: [C: +2] db2164: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/813734 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:18:12] (PS1) Marostegui: instances.yaml: Add db2164 to dbctl [puppet] - https://gerrit.wikimedia.org/r/813735 (https://phabricator.wikimedia.org/T311493)
[05:19:29] (CR) Marostegui: [C: +2] instances.yaml: Add db2164 to dbctl [puppet] - https://gerrit.wikimedia.org/r/813735 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:20:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2164 to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P31038 and previous config saved to /var/cache/conftool/dbconfig/20220714-052056-marostegui.json
[05:20:57] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:21:00] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[05:28:00] (PS1) Marostegui: change_oaac_accepted_T312977.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/813736 (https://phabricator.wikimedia.org/T312977)
[06:00:05] kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220714T0600).
[06:21:52] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:29:41] thcipriani (cc apergos): I'd ideally like to add myself to the `UTC morning backport and config training` window (in `deployments-calendar.yaml`) - is that okay, or should I wait a bit?
[06:30:28] {[worksforme}}
[06:30:31] er
[06:30:43] {[ -> {{
[06:31:01] ^^
[06:34:58] I might screen record the next deploy I do if that'd be useful for training others in the future? :)
[06:35:50] hmm let's think about that
[06:36:14] I mean, it's fine by me if you want to screen record it, but we would need to think about how the material could be effectively used in a training
[06:36:39] certainly just plopping it in a folder and saying to someone "watch this" wouldn't really do much :-D
[06:37:02] perhaps with a voiceover walking through the steps? ^^
[06:38:09] tell me what the goal is you have in mind and I can better comment on how to help make it happen :-)
[06:38:24] I've not got that far yet ;P
[06:38:28] ah!
[06:38:45] * TheresNoTime just has "ideas", no claim as to if they're *good* or *well thought out* :D
[06:39:04] ideas are great, I just wanna figure out how we can use them
[06:39:21] and how to implement them depends on where we want to get in using them
[06:40:55] anyways I would say, record what you like, and we can figure out what to do with it later
[06:42:01] :) my semi-formed idea is having live training is *very important*, but sometimes there's nothing to deploy — having a resource to watch while someone does the same "talking through" as you (very well!) did might be useful?
[06:43:02] ok, let's find a place to put that so we don't lose it, and talk about training ideas. it's been awhile anyways.
[06:43:37] I can make a post at https://wikitech.wikimedia.org/wiki/Talk:Deployments/Training if that's a good place?
[06:44:45] No idea :-D try it and see? make sure Tyler sees it at a minimum
[06:45:36] we had a series of discussions about trainings, including a 'training for trainers' session I gave
[06:45:43] revisiting some of that is probably a good idea
[06:45:58] and also discussing the time of this window, it's not much used
[06:46:30] <_joe_> TheresNoTime: I dream of a time where deployments don't need special training and voodoo on a production server :)
[06:46:46] apergos: I have noticed this window is often fairly empty..
[06:46:53] _joe_: keep dreaming! ;P /j
[06:47:20] <_joe_> well if the org lets us finish mediawiki on kubernetes, it's easier :)
[06:47:23] (I know the one-day dream is a full CI setup which "just does it all"?)
[06:47:29] <_joe_> a huge if
[06:47:40] <_joe_> I kind of disagree on the CI stuff
[06:47:48] <_joe_> we don't have good enough test coverage to do it
[06:47:55] <_joe_> not now, not in a foreseeable future
[06:47:57] a good point..
[06:48:11] Pro Tip: Scope creep is an effective tool to keep your production environments from ever being updated
[06:48:12] <_joe_> unit test and integration tests I mean
[06:48:23] though perhaps that can "just" be remedied by a gated CI with human interaction to do that final "push"?
[06:48:45] <_joe_> yeah I think all the merging and black magic should happen automagically
[06:48:57] <_joe_> but the deployment is another beast
[06:49:18] (PS3) David Caro: wmcs: Add novafullstack alerts [alerts] - https://gerrit.wikimedia.org/r/813274
[06:50:53] that can be future engineer's (tm) problem to figure out /s
[06:58:44] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:05] Amir1, apergos, and jnuche: #bothumor I � Unicode. All rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220714T0700).
[07:00:19] (here)
[07:00:24] good morning!
[07:00:36] there are no trainees signed up for today, and no patches in the window.
[07:01:10] We do have two deployers here today :-)
[07:01:26] instead of deploying, I guess we can Get Inspired (tm)
[07:01:51] if anyone would like to get their patch in and self-deploy, now is the time; I will wander off if there are no takers in about 10 minutes.
[07:01:58] <_joe_> apergos: https://untranslatable.co/p/anonymous/chi-ha-i-denti-non-ha-il-pane-e-chi-ha-il-pane-non-ha-i-denti
[07:02:24] lolol
[07:12:57] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on ganeti2028.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[07:13:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ganeti2028.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[07:14:08] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.192 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:19:00] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:22:48] that 10 minutes is long since up so I'm wandering off. Have a good week everyone and see you next time!
[07:27:04] (PS1) Muehlenhoff: Extend access for dalezhou [puppet] - https://gerrit.wikimedia.org/r/813825
[07:29:11] (CR) Muehlenhoff: [C: +2] Extend access for dalezhou [puppet] - https://gerrit.wikimedia.org/r/813825 (owner: Muehlenhoff)
[07:29:29] o/
[07:35:57] (PS1) David Caro: grid:exec: cleanup /tmp of stale files [puppet] - https://gerrit.wikimedia.org/r/813826 (https://phabricator.wikimedia.org/T313006)
[07:36:47] (CR) CI reject: [V: -1] grid:exec: cleanup /tmp of stale files [puppet] - https://gerrit.wikimedia.org/r/813826 (https://phabricator.wikimedia.org/T313006) (owner: David Caro)
[07:37:36] (PS2) David Caro: grid:exec: cleanup /tmp of stale files [puppet] - https://gerrit.wikimedia.org/r/813826 (https://phabricator.wikimedia.org/T313006)
[07:39:21] (PS3) David Caro: grid:exec: cleanup /tmp of stale files [puppet] - https://gerrit.wikimedia.org/r/813826 (https://phabricator.wikimedia.org/T313006)
[08:41:09] SRE, ops-eqiad, DC-Ops: Failed disk on analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T293111 (BTullis) HI @wiki_willy - Apologies for the missing tag. No, we can leave this disk in a failed state thanks. I'll work on the decom soon. Thanks.
[08:50:01] (PS2) Btullis: Remove more alerts that have moved to alertmanager [puppet] - https://gerrit.wikimedia.org/r/744809 (https://phabricator.wikimedia.org/T293399)
[08:55:35] (CR) David Caro: wmcs: Add novafullstack alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/813274 (owner: David Caro)
[08:57:06] (CR) David Caro: [C: +2] wmcs: use run_* instead of run_sync/run_async [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812901 (owner: David Caro)
[08:57:09] (CR) David Caro: [C: +2] ceph: add alert handling to ceph custer downtime [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812900 (owner: David Caro)
[08:57:12] (CR) David Caro: [C: +2] wmcs.openstack: Use the known cloudcontrols instead of asking [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810915 (owner: David Caro)
[08:57:16] (CR) David Caro: [C: +2] wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810914 (owner: David Caro)
[08:57:20] (CR) David Caro: [C: +2] toolforge.grid.get_cluster_status: show extended queue info [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812902 (owner: David Caro)
[08:57:29] (CR) David Caro: [C: +2] wmcs: move openstack/__init__.py to openstack/common.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/813661 (owner: David Caro)
[08:57:48] (CR) David Caro: [C: +2] wmcs: move wmcs/__init__.py to wmcs/libs/common.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/813662 (owner: David Caro)
[09:02:01] (CR) Ladsgroup: change_oaac_accepted_T312977.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/813736 (https://phabricator.wikimedia.org/T312977) (owner: Marostegui)
[09:02:52] (CR) Marostegui: change_oaac_accepted_T312977.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/813736 (https://phabricator.wikimedia.org/T312977) (owner: Marostegui)
[09:03:29] (PS2) Marostegui: change_oaac_accepted_T312977.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/813736 (https://phabricator.wikimedia.org/T312977)
[09:03:51] (CR) CI reject: [V: -1] change_oaac_accepted_T312977.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/813736 (https://phabricator.wikimedia.org/T312977) (owner: Marostegui)
[09:03:53] (CR) Ladsgroup: [C: +1] change_oaac_accepted_T312977.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/813736 (https://phabricator.wikimedia.org/T312977) (owner: Marostegui)
[09:04:05] (Merged) jenkins-bot: wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810914 (owner: David Caro)
[09:04:20] (Merged) jenkins-bot: wmcs.openstack: Use the known cloudcontrols instead of asking [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810915 (owner: David Caro)
[09:04:22] (Merged) jenkins-bot: ceph: add alert handling to ceph custer downtime [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812900 (owner: David Caro)
[09:05:46] (Merged) jenkins-bot: wmcs: use run_* instead of run_sync/run_async [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812901 (owner: David Caro)
[09:05:48] (Merged) jenkins-bot: toolforge.grid.get_cluster_status: show extended queue info [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812902 (owner: David Caro)
[09:05:50] (Merged) jenkins-bot: wmcs: move openstack/__init__.py to openstack/common.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/813661 (owner: David Caro)
[09:05:55] (PS3) Marostegui: change_oaac_accepted_T312977.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/813736 (https://phabricator.wikimedia.org/T312977)
[09:06:53] (PS1) Matthias Mullie: Sync with current master [extensions/ImageSuggestions] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/813828
[09:07:54] (CR) Marostegui: [C: +2] change_oaac_accepted_T312977.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/813736 (https://phabricator.wikimedia.org/T312977) (owner: Marostegui)
[09:09:51] (PS1) Matthias Mullie: Sync with current master [extensions/ImageSuggestions] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/813829
[09:10:04] (Abandoned) Matthias Mullie: Sync with current master [extensions/ImageSuggestions] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/813828 (owner: Matthias Mullie)
[09:12:34] (Merged) jenkins-bot: wmcs: move wmcs/__init__.py to wmcs/libs/common.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/813662 (owner: David Caro)
[09:12:36] (Merged) jenkins-bot: change_oaac_accepted_T312977.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/813736 (https://phabricator.wikimedia.org/T312977) (owner: Marostegui)
[09:16:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:16:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:16:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 10 hosts with reason: Maintenance
[09:17:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 10 hosts with reason: Maintenance
[09:23:57] (PS1) Marostegui: change_oaac_accepted_T312977.py: Fix [software/schema-changes] - https://gerrit.wikimedia.org/r/813832
[09:24:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:24:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:24:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 10 hosts with reason: Maintenance
[09:24:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 10 hosts with reason: Maintenance
[09:25:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[09:25:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[09:26:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[09:26:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[09:26:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T312977)', diff saved to https://phabricator.wikimedia.org/P31039 and previous config saved to /var/cache/conftool/dbconfig/20220714-092618-marostegui.json
[09:26:43] (CR) Marostegui: "recheck" [software/schema-changes] - https://gerrit.wikimedia.org/r/813832 (owner: Marostegui)
[09:29:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T312977)', diff saved to https://phabricator.wikimedia.org/P31040 and previous config saved to /var/cache/conftool/dbconfig/20220714-092901-marostegui.json
[09:29:24] (CR) Marostegui: [C: +2] change_oaac_accepted_T312977.py: Fix [software/schema-changes] - https://gerrit.wikimedia.org/r/813832 (owner: Marostegui)
[09:32:24] (Merged) jenkins-bot: change_oaac_accepted_T312977.py: Fix [software/schema-changes] - https://gerrit.wikimedia.org/r/813832 (owner: Marostegui)
[09:33:55] (CR) Btullis: [C: +2] Remove more alerts that have moved to alertmanager [puppet] - https://gerrit.wikimedia.org/r/744809 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[09:35:14] (PS4) David Caro: wmcs: Parse enums at argparse level [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810285
[09:39:07] (PS1) Btullis: Remove trailing check_promethus checks for hadoop [puppet] - https://gerrit.wikimedia.org/r/813833 (https://phabricator.wikimedia.org/T293399)
[09:44:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P31041 and previous config saved to /var/cache/conftool/dbconfig/20220714-094406-marostegui.json
[09:44:13] (CR) David Caro: "You should probably rebase on top of latest wmcs branch, just merged a bunch of refactoring stuff." [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/812916 (owner: Nskaggs)
[09:44:32] (CR) David Caro: [C: +2] wmcs: Parse enums at argparse level [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810285 (owner: David Caro)
[09:44:38] (Abandoned) Btullis: dbproxy: add clouddb sections to conftool [puppet] - https://gerrit.wikimedia.org/r/779926 (https://phabricator.wikimedia.org/T304478) (owner: Razzi)
[09:46:07] (CR) David Caro: [C: +1] "LGTM waiting for someone else to ack the repo changes" [puppet] - https://gerrit.wikimedia.org/r/810421 (owner: Majavah)
[09:46:57] (CR) JMeybohm: [C: +2] k8s: Add KubernetesNode.taints propertry [software/spicerack] - https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[09:47:13] (CR) JMeybohm: [C: +2] k8s: Retry checks for expected pods on drain (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/811331 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[09:47:26] (CR) JMeybohm: [C: +2] k8s: Retry pod evictions on HTTP 429 from API server [software/spicerack] - https://gerrit.wikimedia.org/r/811983 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[09:47:35] (PS1) Marostegui: db1135,dbproxy1021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/813835 (https://phabricator.wikimedia.org/T308339)
[09:47:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1135 for onsite maintenance T308339', diff saved to https://phabricator.wikimedia.org/P31042 and previous config saved to /var/cache/conftool/dbconfig/20220714-094756-root.json
[09:48:31] (CR) Marostegui: [C: +2] db1135,dbproxy1021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/813835 (https://phabricator.wikimedia.org/T308339) (owner: Marostegui)
[09:49:13] (CR) Muehlenhoff: [C: +1] "Looks good, but you can also simply drop the part related to modules/aptrepo/files/distributions-wikimedia, stretch-wikimedia will be reti" [puppet] - https://gerrit.wikimedia.org/r/810421 (owner: Majavah)
[09:49:46] (CR) David Caro: [C: +2] "❤️" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/813679 (owner: RhinosF1)
[09:51:42] SRE, ops-eqiad, DBA, Patch-For-Review: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (Marostegui) @Cmjohnson db1135 and dbproxy1021 are now off and ready for the move.
[09:52:22] (03Merged) 10jenkins-bot: wmcs: Parse enums at argparse level [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810285 (owner: 10David Caro) [09:52:41] (03PS3) 10JMeybohm: k8s/reboot-nodes: Error if nodes are cordoned [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) [09:55:24] (03CR) 10Btullis: [C: 03+2] superset: Turn template processing back on [puppet] - 10https://gerrit.wikimedia.org/r/811766 (https://phabricator.wikimedia.org/T312134) (owner: 10Ebernhardson) [09:55:57] (03Merged) 10jenkins-bot: k8s: Retry checks for expected pods on drain [software/spicerack] - 10https://gerrit.wikimedia.org/r/811331 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:55:59] (03Merged) 10jenkins-bot: k8s: Add KubernetesNode.taints propertry [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:56:01] (03Merged) 10jenkins-bot: k8s: Retry pod evictions on HTTP 429 from API server [software/spicerack] - 10https://gerrit.wikimedia.org/r/811983 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:57:20] (03CR) 10CI reject: [V: 04-1] k8s/reboot-nodes: Error if nodes are cordoned [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:58:11] (03PS3) 10David Caro: Change formatting of a few openstack calls [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810107 (owner: 10Andrew Bogott) [09:58:13] (03PS10) 10David Caro: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [09:58:50] (03PS3) 10RhinosF1: unset_cluster_maintenance: fix formatting error [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/813679 [09:59:10] (03CR) 10David Caro: "Just rebased on top of the latest changes, please +1 if it still looks ok and I'll merge" [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [09:59:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P31043 and previous config saved to /var/cache/conftool/dbconfig/20220714-095911-marostegui.json [09:59:31] (03CR) 10David Caro: "Just rebased" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810107 (owner: 10Andrew Bogott) [10:00:05] mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220714T1000). [10:00:08] (03CR) 10David Caro: unset_cluster_maintenance: fix formatting error [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/813679 (owner: 10RhinosF1) [10:01:42] (03CR) 10Btullis: [C: 03+1] druid: Fixed UID/GIDs are universally in use now [puppet] - 10https://gerrit.wikimedia.org/r/812286 (owner: 10Muehlenhoff) [10:01:47] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/812891 (owner: 10PipelineBot) [10:02:02] (03CR) 10Btullis: [C: 03+2] Remove trailing check_promethus checks for hadoop [puppet] - 10https://gerrit.wikimedia.org/r/813833 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [10:03:06] (03PS4) 10JMeybohm: k8s/reboot-nodes: Error if nodes are cordoned [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) [10:04:16] (03CR) 10JMeybohm: [C: 03+2] Remove statsd from _scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/812333 (owner: 10JMeybohm) [10:05:17] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/812891 (owner: 10PipelineBot) [10:06:46] (03CR) 10CI reject: [V: 04-1] wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [10:06:58] (03CR) 10CI reject: [V: 04-1] 
Change formatting of a few openstack calls [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810107 (owner: Andrew Bogott) [10:07:52] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [10:08:44] (Merged) jenkins-bot: unset_cluster_maintenance: fix formatting error [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/813679 (owner: RhinosF1) [10:08:49] (Merged) jenkins-bot: Remove statsd from _scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/812333 (owner: JMeybohm) [10:11:40] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [10:11:49] (PS1) David Caro: wmcs.novafullstack: Update the VM name on every loop [puppet] - https://gerrit.wikimedia.org/r/813838 [10:14:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T312977)', diff saved to https://phabricator.wikimedia.org/P31044 and previous config saved to /var/cache/conftool/dbconfig/20220714-101418-marostegui.json [10:14:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1136.eqiad.wmnet with reason: Maintenance [10:14:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1136.eqiad.wmnet with reason: Maintenance [10:14:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T312977)', diff saved to https://phabricator.wikimedia.org/P31045 and previous config saved to /var/cache/conftool/dbconfig/20220714-101438-marostegui.json [10:15:59] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [10:16:48] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [10:20:36] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [10:21:28] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [10:25:27] (PS3) Majavah: Remove systemd241 Stretch backport [puppet] - 
https://gerrit.wikimedia.org/r/810421 [10:25:40] (CR) Majavah: Remove systemd241 Stretch backport (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810421 (owner: Majavah) [10:26:22] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:35:31] (PS1) Ladsgroup: Add fix_flaggedrevs_timestamps.py [software/schema-changes] - https://gerrit.wikimedia.org/r/813839 (https://phabricator.wikimedia.org/T312984) [10:36:15] (CR) Marostegui: Add fix_flaggedrevs_timestamps.py (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/813839 (https://phabricator.wikimedia.org/T312984) (owner: Ladsgroup) [10:40:03] (CR) Ladsgroup: Add fix_flaggedrevs_timestamps.py (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/813839 (https://phabricator.wikimedia.org/T312984) (owner: Ladsgroup) [10:43:23] (CR) Marostegui: Add fix_flaggedrevs_timestamps.py (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/813839 (https://phabricator.wikimedia.org/T312984) (owner: Ladsgroup) [10:46:51] (PS2) Ladsgroup: Add fix_flaggedrevs_timestamps.py [software/schema-changes] - https://gerrit.wikimedia.org/r/813839 (https://phabricator.wikimedia.org/T312984) [10:47:03] (CR) Ladsgroup: Add fix_flaggedrevs_timestamps.py (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/813839 (https://phabricator.wikimedia.org/T312984) (owner: Ladsgroup) [10:48:23] (CR) Marostegui: [C: +1] Add fix_flaggedrevs_timestamps.py [software/schema-changes] - https://gerrit.wikimedia.org/r/813839 (https://phabricator.wikimedia.org/T312984) (owner: Ladsgroup) [10:51:35] (PS1) Marostegui: site.pp: Remove db216[2-5] from insetup [puppet] - https://gerrit.wikimedia.org/r/813840 (https://phabricator.wikimedia.org/T311493) [10:52:34] (CR) Marostegui: [C: +2] 
site.pp: Remove db216[2-5] from insetup [puppet] - https://gerrit.wikimedia.org/r/813840 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui) [10:57:50] (PS1) Btullis: Add roles and cumin aliases for the new dse_k8s cluster [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) [10:58:42] (CR) CI reject: [V: -1] Add roles and cumin aliases for the new dse_k8s cluster [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: Btullis) [11:00:11] (PS2) Btullis: Add roles and cumin aliases for the new dse_k8s cluster [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) [11:01:09] (PS3) Btullis: Add roles and cumin aliases for the new dse_k8s cluster [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) [11:01:23] (CR) Ladsgroup: [C: +2] Add fix_flaggedrevs_timestamps.py [software/schema-changes] - https://gerrit.wikimedia.org/r/813839 (https://phabricator.wikimedia.org/T312984) (owner: Ladsgroup) [11:01:29] (PS1) JMeybohm: Actually run tests on type: php scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/813843 [11:01:50] (Merged) jenkins-bot: Add fix_flaggedrevs_timestamps.py [software/schema-changes] - https://gerrit.wikimedia.org/r/813839 (https://phabricator.wikimedia.org/T312984) (owner: Ladsgroup) [11:03:02] (CR) Elukey: "Saw the change passing by, I have few comments/suggestions:" [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: Btullis) [11:03:59] (CR) JMeybohm: New service: function-evaluator (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: Ori) [11:07:29] (CR) JMeybohm: Actually run tests on type: php scaffold (1 comment) [deployment-charts] - 
https://gerrit.wikimedia.org/r/813843 (owner: JMeybohm) [11:07:44] (PS1) Elukey: ml-services: update image in ml-staging for enwiki editquality-goodfaith [deployment-charts] - https://gerrit.wikimedia.org/r/813866 (https://phabricator.wikimedia.org/T301878) [11:08:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T312977)', diff saved to https://phabricator.wikimedia.org/P31046 and previous config saved to /var/cache/conftool/dbconfig/20220714-110759-marostegui.json [11:08:04] T312977: Adjust the field type of oauth_accepted_consumer.oaac_accepted to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312977 [11:09:00] (CR) Elukey: "Forgot to add - there are also a lot of hiera values to add for master/worker roles, check what it is present for the ml-serve clusters. M" [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: Btullis) [11:10:08] (CR) Btullis: Add roles and cumin aliases for the new dse_k8s cluster (4 comments) [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: Btullis) [11:12:13] (CR) Btullis: "This change is ready for review." 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: Btullis) [11:12:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [11:12:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [11:13:01] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 503 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expe [11:13:01] 00) https://wikitech.wikimedia.org/wiki/CX [11:13:43] (PS4) Btullis: Add roles and cumin aliases for the new dse_k8s cluster [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) [11:14:39] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [11:16:25] (PS5) Btullis: Add roles and cumin aliases for the new dse_k8s cluster [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) [11:17:33] (CR) Elukey: [C: +2] ml-services: update image in ml-staging for enwiki editquality-goodfaith [deployment-charts] - https://gerrit.wikimedia.org/r/813866 (https://phabricator.wikimedia.org/T301878) (owner: Elukey) [11:20:32] (CR) Elukey: Add roles and cumin aliases for the new dse_k8s cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: Btullis) [11:22:38] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 
'revscoring-editquality-goodfaith' for release 'main' . [11:23:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P31047 and previous config saved to /var/cache/conftool/dbconfig/20220714-112304-marostegui.json [11:27:31] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:35:19] (CR) Btullis: Add roles and cumin aliases for the new dse_k8s cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: Btullis) [11:36:13] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:36:27] (PS1) Cparle: Add custommatch search feature config for commons [mediawiki-config] - https://gerrit.wikimedia.org/r/813880 [11:37:04] (PS1) Cparle: Update boosts for weighted_tags [mediawiki-config] - https://gerrit.wikimedia.org/r/813881 [11:37:19] (CR) CI reject: [V: -1] Add custommatch search feature config for commons [mediawiki-config] - https://gerrit.wikimedia.org/r/813880 (owner: Cparle) [11:38:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P31048 and previous config saved to /var/cache/conftool/dbconfig/20220714-113811-marostegui.json [11:42:11] (PS2) Cparle: Add custommatch search feature config for commons [mediawiki-config] - https://gerrit.wikimedia.org/r/813880 [11:52:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T312977)', 
diff saved to https://phabricator.wikimedia.org/P31049 and previous config saved to /var/cache/conftool/dbconfig/20220714-115316-marostegui.json [11:53:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:53:21] T312977: Adjust the field type of oauth_accepted_consumer.oaac_accepted to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312977 [11:53:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:54:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:54:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:54:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T312977)', diff saved to https://phabricator.wikimedia.org/P31050 and previous config saved to /var/cache/conftool/dbconfig/20220714-115448-marostegui.json [11:57:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312977)', diff saved to https://phabricator.wikimedia.org/P31051 and previous config saved to /var/cache/conftool/dbconfig/20220714-115701-marostegui.json [12:00:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [12:00:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [12:00:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2129.codfw.wmnet with reason: Maintenance [12:00:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2129.codfw.wmnet with reason: Maintenance [12:00:53] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 20:00:00 on 8 hosts with reason: Maintenance [12:01:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 8 hosts with reason: Maintenance [12:02:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:44] SRE, Editing-Team-Request, Editing-team, MediaWiki-extensions-Score, and 3 others: Reduce Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (CDanis) @Esanders @VPuffetMichel hello from SRE, just wanted to make sure this task was on your radar for a quick patch soon... [12:06:54] (PS28) Jbond: beaker: add initial beaker files (WIP) [puppet] - https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) [12:12:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P31052 and previous config saved to /var/cache/conftool/dbconfig/20220714-121209-marostegui.json [12:23:22] (PS1) Ladsgroup: labs: Make sure templatelinks config overrides production [mediawiki-config] - https://gerrit.wikimedia.org/r/813892 (https://phabricator.wikimedia.org/T306673) [12:24:29] (PS2) Ladsgroup: labs: Make sure templatelinks config overrides production [mediawiki-config] - https://gerrit.wikimedia.org/r/813892 (https://phabricator.wikimedia.org/T306673) [12:27:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P31053 and previous config saved to /var/cache/conftool/dbconfig/20220714-122714-marostegui.json [12:27:27] (CR) Ladsgroup: [C: +2] labs: Make sure templatelinks config overrides production [mediawiki-config] - https://gerrit.wikimedia.org/r/813892 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup) [12:28:44] (Merged) 
jenkins-bot: labs: Make sure templatelinks config overrides production [mediawiki-config] - https://gerrit.wikimedia.org/r/813892 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup) [12:32:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:33:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:33:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:33:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:37:39] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:42:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312977)', diff saved to https://phabricator.wikimedia.org/P31054 and previous config saved to /var/cache/conftool/dbconfig/20220714-124219-marostegui.json [12:42:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:42:26] T312977: Adjust the field type of oauth_accepted_consumer.oaac_accepted to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312977 [12:42:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:42:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T312977)', diff saved to https://phabricator.wikimedia.org/P31055 and previous config saved to /var/cache/conftool/dbconfig/20220714-124239-marostegui.json [12:43:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1131.eqiad.wmnet with reason: Maintenance [12:43:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1131.eqiad.wmnet with reason: Maintenance [12:43:22] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T312984)', diff saved to https://phabricator.wikimedia.org/P31056 and previous config saved to /var/cache/conftool/dbconfig/20220714-124321-ladsgroup.json [12:43:26] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [12:45:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312977)', diff saved to https://phabricator.wikimedia.org/P31057 and previous config saved to /var/cache/conftool/dbconfig/20220714-124515-marostegui.json [12:50:09] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.223 second response time https://wikitech.wikimedia.org/wiki/Swift [12:55:03] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [13:00:04] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220714T1300) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220714T1300). [13:00:04] matthiasmullie, cormacparle, and MichaelG_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:09] o/ [13:00:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P31058 and previous config saved to /var/cache/conftool/dbconfig/20220714-130020-marostegui.json [13:00:23] 👋 [13:00:27] o/ [13:00:54] and also o/ for cormac's patches - he's out, I'll take care of those [13:01:09] question: I have a backport with i18n changes I want to deploy, but I won't need those messages until next Wed. Should I sync-world those now (and if so, is it ok to do in current backports window?), or will a sync-world just happen before then anyway (e.g. as part of train), in which case I can probably simply merge that patch & wait out the sync-world that will happen later anyway? [13:01:40] I was going to ask you to change the commit message of that patch anyways, by the way [13:01:55] because I at least wouldn’t be happy to deploy “Sync with current master”, that’s not enough information IMHO [13:02:02] Sure [13:02:05] (though I can’t stop you if you’re going to deploy it yourself ^^) [13:02:18] What do you mean that's not descriptive enough? :p [13:02:22] I think a sync-world as part of the backport window can happen [13:02:24] I'll change it right away ;) [13:02:35] let’s just do it at the end of the window, since it’ll take so long? :) [13:02:49] but that seems safer to me than pulling the changes without a sync-world [13:03:07] (PS2) Matthias Mullie: Improve maint script output & update i18n messages [extensions/ImageSuggestions] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/813829 [13:03:22] (CR) Matthias Mullie: [C: +1] Update boosts for weighted_tags [mediawiki-config] - https://gerrit.wikimedia.org/r/813881 (owner: Cparle) [13:03:31] (CR) Matthias Mullie: [C: +1] Add custommatch search feature config for commons [mediawiki-config] - https://gerrit.wikimedia.org/r/813880 (owner: Cparle) [13:03:43] sounds good! 
[13:03:49] do you want to deploy the config changes or should I do it? [13:04:01] I’d like to do the Special:NewLexemeAlpha one with MichaelG_WMDE, but otherwise I don’t mind either way [13:04:18] sure, go ahead with that one [13:04:24] cool thanks [13:04:38] either WFM; happy to do mine myself after that, and free up your time [13:04:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T312984)', diff saved to https://phabricator.wikimedia.org/P31059 and previous config saved to /var/cache/conftool/dbconfig/20220714-130445-ladsgroup.json [13:04:49] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [13:04:52] alright, then let’s do that first [13:05:23] (PS2) Lucas Werkmeister (WMDE): Enable Special:NewLexemeAlpha on Wikidata and TestWikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/813609 (https://phabricator.wikimedia.org/T306016) (owner: Michael Große) [13:07:23] (CR) Lucas Werkmeister (WMDE): [C: +2] Enable Special:NewLexemeAlpha on Wikidata and TestWikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/813609 (https://phabricator.wikimedia.org/T306016) (owner: Michael Große) [13:08:30] (Merged) jenkins-bot: Enable Special:NewLexemeAlpha on Wikidata and TestWikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/813609 (https://phabricator.wikimedia.org/T306016) (owner: Michael Große) [13:09:39] MichaelG_WMDE: the change is on mwdebug1001, please test [13:12:00] it works as expected on Wikidata [13:12:04] great [13:12:45] syncing [13:13:44] (CR) David Caro: [C: +2] "Manually verified on coludcontrol1003 by copying the modified script and running for some time." 
[puppet] - https://gerrit.wikimedia.org/r/813838 (owner: David Caro) [13:14:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:15:00] (PS3) Matthias Mullie: Add custommatch search feature config for commons [mediawiki-config] - https://gerrit.wikimedia.org/r/813880 (owner: Cparle) [13:15:12] (CR) Matthias Mullie: [C: +2] Improve maint script output & update i18n messages [extensions/ImageSuggestions] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/813829 (owner: Matthias Mullie) [13:15:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:15:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P31060 and previous config saved to /var/cache/conftool/dbconfig/20220714-131525-marostegui.json [13:15:27] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:813609|Enable Special:NewLexemeAlpha on Wikidata and TestWikidata (T306016)]] (duration: 02m 57s) [13:15:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:15:35] T306016: Enable new Special:NewLexeme page in production (in parallel to the current page) - https://phabricator.wikimedia.org/T306016 [13:16:09] SRE, SRE-Access-Requests: Add Zabe to #mediawiki_security - https://phabricator.wikimedia.org/T313026 (Zabe) [13:16:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:17:08] SRE, SRE-Access-Requests: Add Zabe to #mediawiki_security - https://phabricator.wikimedia.org/T313026 (Zabe) [13:17:17] weird, mw1430 still doesn’t seem to have that change [13:17:45] not seeing it on mw1385 yet either [13:19:25] PROBLEM - Check systemd state on mw2392 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P31061 and previous config saved to /var/cache/conftool/dbconfig/20220714-131950-ladsgroup.json [13:20:25] let’s just resync the file and see if that helps… [13:20:44] syncing again [13:21:10] (there was nothing unusual in the output of the previous scap as far as I could tell, no warnings) [13:22:45] (PS4) David Caro: wmcs: Add novafullstack alerts [alerts] - https://gerrit.wikimedia.org/r/813274 [13:23:20] (Merged) jenkins-bot: Improve maint script output & update i18n messages [extensions/ImageSuggestions] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/813829 (owner: Matthias Mullie) [13:23:25] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:813609|Enable Special:NewLexemeAlpha on Wikidata and TestWikidata (T306016)]] (re-sync, config change seemingly not consistently picked up) (duration: 02m 45s) [13:23:29] T306016: Enable new Special:NewLexeme page in production (in parallel to the current page) - https://phabricator.wikimedia.org/T306016 [13:24:00] now it seems to be working more consistently [13:24:09] been a while since double syncs were necessary, though :S [13:24:24] it now consistently works for me on production [13:24:30] but I think I’m done, matthiasmullie you’re good to go [13:24:39] (CR) Matthias Mullie: [C: +2] Add custommatch search feature config for commons [mediawiki-config] - https://gerrit.wikimedia.org/r/813880 (owner: Cparle) [13:24:45] Lucas_WMDE: thanks, starting [13:26:14] and sorry it took a bit longer than expected :/ [13:26:19] (Merged) jenkins-bot: Add custommatch search feature config for commons [mediawiki-config] - https://gerrit.wikimedia.org/r/813880 (owner: Cparle) [13:26:22] hopefully that’s not a general scap issue again 
[13:26:32] I guess you’re about to find out [13:27:08] T311788 has not been closed yet, could be related [13:27:09] T311788: MW wmf-config tmp cache stays outdated after Scap deploy (opcache revalidation is off) - https://phabricator.wikimedia.org/T311788 [13:28:11] (PS2) Matthias Mullie: Update boosts for weighted_tags [mediawiki-config] - https://gerrit.wikimedia.org/r/813881 (owner: Cparle) [13:28:19] (CR) Matthias Mullie: [C: +2] Update boosts for weighted_tags [mediawiki-config] - https://gerrit.wikimedia.org/r/813881 (owner: Cparle) [13:29:13] oh, I wasn’t aware of that [13:29:27] (Merged) jenkins-bot: Update boosts for weighted_tags [mediawiki-config] - https://gerrit.wikimedia.org/r/813881 (owner: Cparle) [13:29:39] thanks zabe [13:30:29] !log mlitn@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:813880|Add custommatch search feature config for commons]] (duration: 02m 58s) [13:30:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312977)', diff saved to https://phabricator.wikimedia.org/P31062 and previous config saved to /var/cache/conftool/dbconfig/20220714-133031-marostegui.json [13:30:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:30:36] T312977: Adjust the field type of oauth_accepted_consumer.oaac_accepted to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312977 [13:30:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:30:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T312977)', diff saved to https://phabricator.wikimedia.org/P31063 and previous config saved to /var/cache/conftool/dbconfig/20220714-133051-marostegui.json [13:31:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:31:44] 
good to know that it’s at least specific to config (if I understand correctly), and code shouldn’t be affected [13:32:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:32:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:33:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312977)', diff saved to https://phabricator.wikimedia.org/P31064 and previous config saved to /var/cache/conftool/dbconfig/20220714-133331-marostegui.json [13:34:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:34:29] !log mlitn@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:813881|Update boosts for weighted_tags]] (duration: 02m 45s) [13:34:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P31065 and previous config saved to /var/cache/conftool/dbconfig/20220714-133455-ladsgroup.json [13:35:01] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.060 second response time https://wikitech.wikimedia.org/wiki/Swift [13:37:23] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Swift [13:37:26] !log mlitn@deploy1002 Started scap: Backport: [[gerrit:813829|Improve maint script output & update i18n messages]] [13:39:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:40:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:40:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:41:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:48:04] (PS1) David Caro: labstore: 
Send prom stats for getent_check [puppet] - https://gerrit.wikimedia.org/r/813898 [13:48:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P31067 and previous config saved to /var/cache/conftool/dbconfig/20220714-134836-marostegui.json [13:49:32] (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36266/console" [puppet] - https://gerrit.wikimedia.org/r/813898 (owner: David Caro) [13:50:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T312984)', diff saved to https://phabricator.wikimedia.org/P31068 and previous config saved to /var/cache/conftool/dbconfig/20220714-135000-ladsgroup.json [13:50:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:50:05] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [13:50:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:50:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:50:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:50:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T312984)', diff saved to https://phabricator.wikimedia.org/P31069 and previous config saved to /var/cache/conftool/dbconfig/20220714-135038-ladsgroup.json [13:53:32] !log mlitn@deploy1002 Finished scap: Backport: [[gerrit:813829|Improve maint script output & update i18n messages]] (duration: 16m 05s) [14:01:02] (PS2) 
10David Caro: labstore: Send prom stats for getent_check [puppet] - 10https://gerrit.wikimedia.org/r/813898 [14:02:17] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36269/console" [puppet] - 10https://gerrit.wikimedia.org/r/813898 (owner: 10David Caro) [14:02:20] !log UTC afternoon backport window done [14:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P31070 and previous config saved to /var/cache/conftool/dbconfig/20220714-140341-marostegui.json [14:07:59] (03PS3) 10David Caro: labstore: Send prom stats for getent_check [puppet] - 10https://gerrit.wikimedia.org/r/813898 [14:09:23] (03PS1) 10Papaul: Add new PDU model for ps1-a6-codfw [puppet] - 10https://gerrit.wikimedia.org/r/813902 (https://phabricator.wikimedia.org/T309957) [14:10:01] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36270/console" [puppet] - 10https://gerrit.wikimedia.org/r/813898 (owner: 10David Caro) [14:12:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T312984)', diff saved to https://phabricator.wikimedia.org/P31071 and previous config saved to /var/cache/conftool/dbconfig/20220714-141201-ladsgroup.json [14:12:05] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [14:12:26] (03CR) 10David Caro: [V: 03+1] "Hmmm... 
I'm not sure this is correct:" [puppet] - 10https://gerrit.wikimedia.org/r/813898 (owner: 10David Caro) [14:12:39] PROBLEM - Host db1120.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:13:01] (03CR) 10David Caro: [V: 03+1] labstore: Send prom stats for getent_check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813898 (owner: 10David Caro) [14:14:39] (03CR) 10Majavah: [C: 04-1] labstore: Send prom stats for getent_check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/813898 (owner: 10David Caro) [14:18:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312977)', diff saved to https://phabricator.wikimedia.org/P31072 and previous config saved to /var/cache/conftool/dbconfig/20220714-141846-marostegui.json [14:18:47] (03PS1) 10Muehlenhoff: Extend access for shubhankar [puppet] - 10https://gerrit.wikimedia.org/r/813903 [14:18:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1101.eqiad.wmnet with reason: Maintenance [14:18:52] T312977: Adjust the field type of oauth_accepted_consumer.oaac_accepted to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312977 [14:18:53] !log on going PU maintenance in rack A6 codfw [14:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:03] !log on going PU maintenance in rack A6 codfw [14:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:10] !log on going PDU maintenance in rack A6 codfw [14:19:11] RECOVERY - Host db1120.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [14:19:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1101.eqiad.wmnet with reason: Maintenance [14:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T312977)', diff saved to 
https://phabricator.wikimedia.org/P31073 and previous config saved to /var/cache/conftool/dbconfig/20220714-141917-marostegui.json [14:19:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/810421 (owner: 10Majavah) [14:19:59] (03PS4) 10David Caro: labstore: Send prom stats for getent_check [puppet] - 10https://gerrit.wikimedia.org/r/813898 [14:20:01] (03CR) 10David Caro: labstore: Send prom stats for getent_check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813898 (owner: 10David Caro) [14:20:17] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for shubhankar [puppet] - 10https://gerrit.wikimedia.org/r/813903 (owner: 10Muehlenhoff) [14:21:40] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36271/console" [puppet] - 10https://gerrit.wikimedia.org/r/813898 (owner: 10David Caro) [14:22:19] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson db1185 A1 21 port:33 23000053 db1186 A8 23 port:33 23000075 db1187 B1 35 port:33 2300033 db1188 B3 26 p... 
[14:23:08] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Jclark-ctr) [14:27:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P31074 and previous config saved to /var/cache/conftool/dbconfig/20220714-142706-ladsgroup.json [14:28:56] (03CR) 10David Caro: [C: 03+2] Remove systemd241 Stretch backport [puppet] - 10https://gerrit.wikimedia.org/r/810421 (owner: 10Majavah) [14:29:03] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36272/console" [puppet] - 10https://gerrit.wikimedia.org/r/812402 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall) [14:32:46] 10SRE, 10ops-eqiad, 10DC-Ops: Relabel db1183 to be dbstore1007 - https://phabricator.wikimedia.org/T284126 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Relabeled Server [14:35:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312977)', diff saved to https://phabricator.wikimedia.org/P31075 and previous config saved to /var/cache/conftool/dbconfig/20220714-143525-marostegui.json [14:35:30] T312977: Adjust the field type of oauth_accepted_consumer.oaac_accepted to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312977 [14:36:20] (03CR) 10Zabe: [C: 03+1] mediawiki: Replace deprecated blacklist parameter in captchaloop [puppet] - 10https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936) (owner: 10Reedy) [14:40:32] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.201 second response time https://wikitech.wikimedia.org/wiki/Swift [14:41:04] (03CR) 10Jelto: [V: 03+1 C: 04-1] "This looks mostly good, there is a small bug in the systemd service definition. See inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/812402 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall) [14:42:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P31076 and previous config saved to /var/cache/conftool/dbconfig/20220714-144211-ladsgroup.json [14:43:37] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Swift [14:45:51] (03CR) 10Jelto: [V: 03+1 C: 04-1] "one additional comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/812402 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall) [14:47:52] PROBLEM - IPMI Sensor Status on aqs2004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:48:46] PROBLEM - IPMI Sensor Status on es2024 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:49:13] papaul: ^ could you check their power supplies? 
[14:49:46] PROBLEM - IPMI Sensor Status on mw2303 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:49:48] marostegui: is there not maintenance today [14:50:10] RhinosF1: Yeah, just checked SAL :) [14:50:13] papaul: ignore me :) [14:50:14] Yeah, it's ongoing [14:50:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P31077 and previous config saved to /var/cache/conftool/dbconfig/20220714-145030-marostegui.json [14:50:43] marostegui: ok [14:50:50] marostegui: i only know because i looked after Tuesday [14:50:56] I never saw SAL too [14:52:50] PROBLEM - IPMI Sensor Status on es2027 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:54:42] PROBLEM - IPMI Sensor Status on ml-staging2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:56:24] PROBLEM - IPMI Sensor Status on es2028 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:57:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T312984)', diff saved to https://phabricator.wikimedia.org/P31078 and previous config saved to /var/cache/conftool/dbconfig/20220714-145716-ladsgroup.json [14:57:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1113.eqiad.wmnet 
with reason: Maintenance [14:57:20] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [14:57:24] PROBLEM - IPMI Sensor Status on aqs2003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:57:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1113.eqiad.wmnet with reason: Maintenance [14:57:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T312984)', diff saved to https://phabricator.wikimedia.org/P31079 and previous config saved to /var/cache/conftool/dbconfig/20220714-145736-ladsgroup.json [14:57:42] (03CR) 10Ori: New service: function-evaluator (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [14:59:42] (03CR) 10JMeybohm: [C: 03+1] New service: function-evaluator (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [15:01:25] PROBLEM - IPMI Sensor Status on thumbor2005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:01:53] PROBLEM - Host db1135.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:02:03] PROBLEM - IPMI Sensor Status on aqs2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:05:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to 
https://phabricator.wikimedia.org/P31080 and previous config saved to /var/cache/conftool/dbconfig/20220714-150535-marostegui.json [15:07:17] RECOVERY - Host db1135.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms [15:07:37] (03PS1) 10David Caro: wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 [15:08:27] (03CR) 10Hnowlan: [C: 03+2] image-suggestion: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/813242 (https://phabricator.wikimedia.org/T304885) (owner: 10Hnowlan) [15:08:29] (03PS5) 10David Caro: labstore: Send prom stats for getent_check [puppet] - 10https://gerrit.wikimedia.org/r/813898 [15:08:34] (03CR) 10CI reject: [V: 04-1] wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 (owner: 10David Caro) [15:09:01] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:09:02] (03PS2) 10David Caro: wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 [15:09:10] (03PS3) 10David Caro: wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 [15:11:50] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@b8f66e9]: (no justification provided) [15:12:00] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@b8f66e9]: (no justification provided) (duration: 00m 10s) [15:12:12] (03Merged) 10jenkins-bot: image-suggestion: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/813242 (https://phabricator.wikimedia.org/T304885) (owner: 10Hnowlan) [15:12:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Cmjohnson) @akosiaris Any update on moving forward with this decom? I could really use the rack space. 
[15:13:08] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: sync [15:13:15] (03PS1) 10Marostegui: core.pp: Make sync_binlog configurable [puppet] - 10https://gerrit.wikimedia.org/r/813917 [15:13:23] PROBLEM - IPMI Sensor Status on aqs2002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:13:39] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: sync [15:14:26] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: sync [15:14:59] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: sync [15:15:08] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/image-suggestion: sync [15:15:39] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: sync [15:15:58] (03CR) 10Cathal Mooney: [C: 03+1] "Looks good!" [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [15:17:05] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Nice." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/813589 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [15:18:41] (03PS2) 10Marostegui: core.pp: Make sync_binlog configurable [puppet] - 10https://gerrit.wikimedia.org/r/813917 [15:20:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312977)', diff saved to https://phabricator.wikimedia.org/P31081 and previous config saved to /var/cache/conftool/dbconfig/20220714-152040-marostegui.json [15:20:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1158.eqiad.wmnet with reason: Maintenance [15:20:45] T312977: Adjust the field type of oauth_accepted_consumer.oaac_accepted to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312977 [15:20:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1158.eqiad.wmnet with reason: Maintenance [15:20:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:21:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:21:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T312977)', diff saved to https://phabricator.wikimedia.org/P31082 and previous config saved to /var/cache/conftool/dbconfig/20220714-152118-marostegui.json [15:23:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312977)', diff saved to https://phabricator.wikimedia.org/P31083 and previous config saved to /var/cache/conftool/dbconfig/20220714-152331-marostegui.json [15:24:22] (03CR) 10Papaul: [C: 03+2] Add new PDU model for ps1-a6-codfw [puppet] - 10https://gerrit.wikimedia.org/r/813902 (https://phabricator.wikimedia.org/T309957) (owner: 10Papaul) [15:30:23] 10SRE, 
10ops-eqiad, 10DC-Ops: Please verify location of an-worker1111.eqiad.wmnet - https://phabricator.wikimedia.org/T298785 (10Cmjohnson) a:03BTullis @BTullis Confirmed netbox is correct, this server is in C2/U1. Please resolve this once you updated puppet. [15:31:03] 10SRE, 10ops-eqiad, 10DC-Ops: Please verify location of an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T298621 (10Cmjohnson) a:03BTullis @BTullis confirmed Netbox is correct. Please resolve once you updated puppet. assigning to you [15:32:08] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Cmjohnson) @Eevans take your time, I just want to make sure that we're not falling behind on-site. Let me know whenever you're ready. [15:33:21] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Cmjohnson) @Marostegui db1135 and dbproxy1021 are back online. [15:35:21] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Marostegui) Thank you Chris! [15:35:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Cmjohnson) [15:35:42] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Cmjohnson) [15:35:53] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Cmjohnson) The remainder of these server moves can happen once we are able to resolve T306162. That will free up space in rack d6. Currently, this row is at maximum capacity for 1G servers. 
[15:38:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31084 and previous config saved to /var/cache/conftool/dbconfig/20220714-153836-marostegui.json [15:38:45] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:43:13] (03PS2) 10Cmjohnson: Adding cloudweb1003/4 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/813697 (https://phabricator.wikimedia.org/T305414) [15:44:59] (03PS3) 10Cmjohnson: Adding cloudweb1003/4 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/813697 (https://phabricator.wikimedia.org/T305414) [15:46:18] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Cmjohnson) [15:46:27] (03PS1) 10Mforns: analytics:refinery:job:data_purge: Add --allowed-interval to deletion jobs [puppet] - 10https://gerrit.wikimedia.org/r/813921 (https://phabricator.wikimedia.org/T270433) [15:46:36] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Cmjohnson) [15:46:57] (03CR) 10Mforns: [V: 04-1] "Please, do not merge yet :]" [puppet] - 10https://gerrit.wikimedia.org/r/813921 (https://phabricator.wikimedia.org/T270433) (owner: 10Mforns) [15:48:43] RECOVERY - IPMI Sensor Status on aqs2004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:49:45] RECOVERY - IPMI Sensor Status on es2024 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:50:01] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 
(10Papaul) [15:50:39] RECOVERY - IPMI Sensor Status on mw2303 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:50:44] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [15:52:40] (03PS2) 10Mforns: analytics:refinery:job:data_purge: Add --allowed-interval to deletion jobs [puppet] - 10https://gerrit.wikimedia.org/r/813921 (https://phabricator.wikimedia.org/T270433) [15:52:45] (Device rebooted) firing: Alert for device ps1-a6-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [15:52:57] 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) [15:53:00] (03CR) 10Cmjohnson: [C: 03+2] Adding cloudweb1003/4 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/813697 (https://phabricator.wikimedia.org/T305414) (owner: 10Cmjohnson) [15:53:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31085 and previous config saved to /var/cache/conftool/dbconfig/20220714-155341-marostegui.json [15:54:11] RECOVERY - IPMI Sensor Status on es2027 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:54:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T312984)', diff saved to https://phabricator.wikimedia.org/P31086 and previous config saved to /var/cache/conftool/dbconfig/20220714-155418-ladsgroup.json [15:54:22] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [15:54:42] 10SRE, 10Editing-Team-Request, 
10Editing-team, 10MediaWiki-extensions-Score, and 3 others: Reduce Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (10Esanders) > One point of clarification: I originally thought the debounce value meant "we'll parse after the user stops typin... [15:55:13] PROBLEM - IPMI Sensor Status on mw2304 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:55:48] 10SRE, 10ops-eqiad, 10DC-Ops: Failed disk on analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T293111 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson resolving this since it's going to be decom'd anyway. [15:56:15] RECOVERY - IPMI Sensor Status on ml-staging2001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:57:45] (Device rebooted) resolved: Device ps1-a6-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [15:58:15] RECOVERY - IPMI Sensor Status on es2028 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:00:05] jbond and rzl: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220714T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:26] (03PS1) 10Mforns: Refine WikibaseTermboxInteraction schema using eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/813925 (https://phabricator.wikimedia.org/T290303) [16:02:27] RECOVERY - IPMI Sensor Status on thumbor2005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:02:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [16:03:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye [16:03:02] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [16:03:05] RECOVERY - IPMI Sensor Status on aqs2001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:03:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye execut... 
[16:05:14] (03PS1) 10David Caro: WIP wmcs: add labstore related alerts [alerts] - 10https://gerrit.wikimedia.org/r/813926 [16:07:52] (03CR) 10CI reject: [V: 04-1] WIP wmcs: add labstore related alerts [alerts] - 10https://gerrit.wikimedia.org/r/813926 (owner: 10David Caro) [16:08:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312977)', diff saved to https://phabricator.wikimedia.org/P31087 and previous config saved to /var/cache/conftool/dbconfig/20220714-160846-marostegui.json [16:08:53] T312977: Adjust the field type of oauth_accepted_consumer.oaac_accepted to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312977 [16:08:55] RECOVERY - IPMI Sensor Status on aqs2003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:09:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) @cmooney cloudnet1006 nic f/w was updated but still fails, if you get a moment can you take a look.
I am not sure what I am missing [16:09:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P31088 and previous config saved to /var/cache/conftool/dbconfig/20220714-160923-ladsgroup.json [16:09:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) There is an OS on the server but has not gone through puppet and unable to ssh [16:11:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Bethany) I don't have access to analytics-privatedata-users When I log into superset, I cannot view any databases. I get an error every time I try to view... [16:13:09] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.225 second response time https://wikitech.wikimedia.org/wiki/Swift [16:15:23] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Swift [16:15:27] RECOVERY - IPMI Sensor Status on aqs2002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:24:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P31089 and previous config saved to /var/cache/conftool/dbconfig/20220714-162428-ladsgroup.json [16:26:39] RECOVERY - IPMI Sensor Status on mw2304 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:39:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1113:3316 (T312984)', diff saved to https://phabricator.wikimedia.org/P31090 and previous config saved to /var/cache/conftool/dbconfig/20220714-163933-ladsgroup.json [16:39:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [16:39:38] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [16:39:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [16:39:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T312984)', diff saved to https://phabricator.wikimedia.org/P31091 and previous config saved to /var/cache/conftool/dbconfig/20220714-163953-ladsgroup.json [17:00:04] bd808: May I have your attention please! Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220714T1700) [17:03:07] * bd808 looks to see if there is anything worthy of a deploy [17:06:00] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2022-07-14-111908-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/813935 [17:08:35] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.306 second response time https://wikitech.wikimedia.org/wiki/Swift [17:09:51] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2022-07-14-111908-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/813935 (owner: 10BryanDavis) [17:10:29] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.271 second response time https://wikitech.wikimedia.org/wiki/Swift [17:12:51] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.020 second response time 
https://wikitech.wikimedia.org/wiki/Swift [17:13:03] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2022-07-14-111908-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/813935 (owner: 10BryanDavis) [17:14:31] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:14:52] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:15:16] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:15:54] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:15:59] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Swift [17:17:18] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:17:57] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:37:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T312984)', diff saved to https://phabricator.wikimedia.org/P31092 and previous config saved to /var/cache/conftool/dbconfig/20220714-173753-ladsgroup.json [17:37:59] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [17:41:37] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:46:15] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.075 second response time https://wikitech.wikimedia.org/wiki/Swift [17:52:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P31093 and previous config saved to 
/var/cache/conftool/dbconfig/20220714-175258-ladsgroup.json [17:53:04] (03PS9) 10Ottomata: Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) [17:54:20] 10SRE, 10Infrastructure-Foundations, 10netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10taavi) Thank you for your comments everyone! In the interests of not having this stall forever, could we move forward with r/o access to objects that we don't hav... [17:56:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudweb1003.wikimedia.org with OS bullseye [17:56:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye [17:58:45] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift [18:02:57] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudweb1003.wikimedia.org with OS bullseye [18:03:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye ex... 
[18:08:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P31094 and previous config saved to /var/cache/conftool/dbconfig/20220714-180803-ladsgroup.json [18:10:18] (03PS1) 10CDanis: fix flask/jinja2 semver snafu [software/klaxon] - 10https://gerrit.wikimedia.org/r/813938 [18:10:20] (03PS1) 10CDanis: restore styling accidentally removed in 16f1d6c [software/klaxon] - 10https://gerrit.wikimedia.org/r/813939 [18:10:22] (03PS1) 10CDanis: Don't hardcode v1 of the api in the base path [software/klaxon] - 10https://gerrit.wikimedia.org/r/813940 [18:10:24] (03PS1) 10CDanis: Add support for fetching current oncallers [software/klaxon] - 10https://gerrit.wikimedia.org/r/813941 [18:10:26] (03PS1) 10CDanis: display current oncallers in Klaxon UI [software/klaxon] - 10https://gerrit.wikimedia.org/r/813942 [18:10:42] (03PS6) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 [18:10:49] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 6.148 second response time https://wikitech.wikimedia.org/wiki/Swift [18:11:08] (03CR) 10CI reject: [V: 04-1] Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [18:12:09] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudweb1003.wikimedia.org with OS bullseye [18:12:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye [18:13:05] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:15:43] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Swift [18:16:22] (03PS1) 10Majavah: O:openstack: prepare for dedicated rabbit nodes [puppet] - 10https://gerrit.wikimedia.org/r/813944 [18:17:54] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36275/console" [puppet] - 10https://gerrit.wikimedia.org/r/813944 (owner: 10Majavah) [18:19:03] (03CR) 10CI reject: [V: 04-1] O:openstack: prepare for dedicated rabbit nodes [puppet] - 10https://gerrit.wikimedia.org/r/813944 (owner: 10Majavah) [18:19:05] (03PS2) 10Majavah: Use ProxyFix middleware to correctly recognize HTTPS usage [software/klaxon] - 10https://gerrit.wikimedia.org/r/794759 (https://phabricator.wikimedia.org/T308941) (owner: 10Legoktm) [18:19:57] (03PS2) 10Majavah: O:openstack: prepare for dedicated rabbit nodes [puppet] - 10https://gerrit.wikimedia.org/r/813944 [18:21:15] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36276/console" [puppet] - 10https://gerrit.wikimedia.org/r/813944 (owner: 10Majavah) [18:21:40] (03CR) 10CDanis: [C: 03+2] Use ProxyFix middleware to correctly recognize HTTPS usage [software/klaxon] - 10https://gerrit.wikimedia.org/r/794759 (https://phabricator.wikimedia.org/T308941) (owner: 10Legoktm) [18:23:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T312984)', diff saved to https://phabricator.wikimedia.org/P31095 and previous config saved to /var/cache/conftool/dbconfig/20220714-182308-ladsgroup.json [18:23:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [18:23:13] T312984: Adjust the field type of 
flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [18:23:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [18:23:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T312984)', diff saved to https://phabricator.wikimedia.org/P31096 and previous config saved to /var/cache/conftool/dbconfig/20220714-182328-ladsgroup.json [18:35:09] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:38:05] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.190 second response time https://wikitech.wikimedia.org/wiki/Swift [18:42:43] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift [19:03:42] (03PS7) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 [19:10:11] (03CR) 10CI reject: [V: 04-1] Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [19:11:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T312984)', diff saved to https://phabricator.wikimedia.org/P31097 and previous config saved to /var/cache/conftool/dbconfig/20220714-191140-ladsgroup.json [19:11:44] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [19:12:08] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudweb1004.wikimedia.org with OS bullseye [19:12:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install 
cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1004.wikimedia.org with OS bullseye [19:23:41] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.290 second response time https://wikitech.wikimedia.org/wiki/Swift [19:24:11] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudweb1003.wikimedia.org with OS bullseye [19:24:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye ex... [19:25:31] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.117 second response time https://wikitech.wikimedia.org/wiki/Swift [19:26:05] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [19:26:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P31098 and previous config saved to /var/cache/conftool/dbconfig/20220714-192645-ladsgroup.json [19:36:31] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:38:03] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Swift [19:38:37] (03PS8) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 [19:41:51] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P31100 and previous config saved to /var/cache/conftool/dbconfig/20220714-194150-ladsgroup.json [19:46:14] (03CR) 10CI reject: [V: 04-1] Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [19:50:06] (03CR) 10Ottomata: [C: 03+2] Refine WikibaseTermboxInteraction schema using eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/813925 (https://phabricator.wikimedia.org/T290303) (owner: 10Mforns) [19:55:45] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.068 second response time https://wikitech.wikimedia.org/wiki/Swift [19:56:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T312984)', diff saved to https://phabricator.wikimedia.org/P31102 and previous config saved to /var/cache/conftool/dbconfig/20220714-195655-ladsgroup.json [19:56:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1168.eqiad.wmnet with reason: Maintenance [19:57:00] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [19:57:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1168.eqiad.wmnet with reason: Maintenance [19:57:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T312984)', diff saved to https://phabricator.wikimedia.org/P31103 and previous config saved to /var/cache/conftool/dbconfig/20220714-195715-ladsgroup.json [19:58:56] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.135 second response time https://wikitech.wikimedia.org/wiki/Swift [20:00:05] brennen: It is that lovely time of the day again! 
You are hereby commanded to deploy UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220714T2000). [20:00:05] thcipriani: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:25] o/ [20:08:17] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Swift [20:08:55] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Swift [20:13:46] (03PS2) 10Thcipriani: CampaignEvents: backport extension for Jul 18 beta deploy [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813657 (https://phabricator.wikimedia.org/T311752) [20:14:23] (03CR) 10Thcipriani: CampaignEvents: backport extension for Jul 18 beta deploy (031 comment) [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813657 (https://phabricator.wikimedia.org/T311752) (owner: 10Thcipriani) [20:14:33] (PuppetFailure) firing: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:14:48] (03PS3) 10Thcipriani: CampaignEvents: backport extension for Jul 18 beta deploy [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813657 (https://phabricator.wikimedia.org/T311752) [20:15:13] (03CR) 10Thcipriani: [C: 03+2] CampaignEvents: backport extension for Jul 18 beta deploy [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813657 (https://phabricator.wikimedia.org/T311752) (owner: 10Thcipriani) [20:15:27] now to await jenkins [20:16:37] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 
309 bytes in 7.180 second response time https://wikitech.wikimedia.org/wiki/Swift [20:17:14] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [20:17:17] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [20:18:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T312984)', diff saved to https://phabricator.wikimedia.org/P31104 and previous config saved to /var/cache/conftool/dbconfig/20220714-201812-ladsgroup.json [20:18:16] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [20:18:20] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2027.codfw.wmnet with OS bullseye [20:18:29] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2027.codfw.wmnet with OS bullseye [20:19:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:21:18] (ProbeDown) firing: Service search-psi-https:9643 has failed probes (http_search-psi-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:21:18] (ProbeDown) firing: Service search-psi-https:9643 has failed probes (http_search-psi-https_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:21:38] ryankemper: expected? ^ [20:22:00] here [20:22:20] rzl ryankemper I believe so, we will ACK [20:22:22] hey [20:22:23] rzl: not expected, looking now [20:23:19] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - search-psi-https_9643: Servers elastic2043.codfw.wmnet, elastic2036.codfw.wmnet, elastic2040.codfw.wmnet, elastic2039.codfw.wmnet, elastic2044.codfw.wmnet, elastic2055.codfw.wmnet, elastic2058.codfw.wmnet, elastic2053.codfw.wmnet, elastic2054.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:24:05] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [20:24:09] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudweb1004.wikimedia.org with OS bullseye [20:24:09] Coincidental timing, we kicked off a reimage of a single host but these failures are in addition to the host we actually reimaged [20:24:14] Looks like we lost a master [20:24:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1004.wikimedia.org with OS bullseye ex... 
[20:24:59] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - search-psi-https_9643: Servers elastic2043.codfw.wmnet, elastic2036.codfw.wmnet, elastic2032.codfw.wmnet, elastic2048.codfw.wmnet, elastic2033.codfw.wmnet, elastic2029.codfw.wmnet, elastic2055.codfw.wmnet, elastic2058.codfw.wmnet, elastic2053.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:25:16] ryankemper: what are the hosts in question? [20:25:37] PROBLEM - ElasticSearch health check for shards on 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.codfw.wmnet:9643/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.codfw.wmnet, port=9643): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:26:42] bblack: We're reimaging `elastic2027`, it looks like it is one of the 3 masters for the psi (port 9643) codfw elasticsearch cluster [20:26:55] It's supposed to fail over to one of the other two masters, so it's odd that it hasn't [20:28:21] Okay we see the problem. `elastic2049`, which is also one of the masters, is out of the cluster due to hw failure. We'll promote a new master [20:29:17] does search actively use codfw under normal conditions? (user affecting?) [20:29:47] As far as impact, eqiad is the active cluster, so there's no user queries routing there [20:29:50] So no user-visible impact [20:30:06] We can go ahead and resolve the pages so people in EU/etc hours aren't getting paged [20:30:29] ack, thanks! 
[20:30:48] !log [Elastic] We're working on promoting `elastic2054` to a master to replace `elastic2049` which is in hw failure [20:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:35] (03PS1) 10Bking: elastic: promote new master [puppet] - 10https://gerrit.wikimedia.org/r/813974 (https://phabricator.wikimedia.org/T311939) [20:32:01] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] elastic: promote new master [puppet] - 10https://gerrit.wikimedia.org/r/813974 (https://phabricator.wikimedia.org/T311939) (owner: 10Bking) [20:33:17] !log [Elastic] `ryankemper@elastic2054:~$ sudo run-puppet-agent` to add 2054 as an eligible master for codfw-psi [20:33:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P31105 and previous config saved to /var/cache/conftool/dbconfig/20220714-203317-ladsgroup.json [20:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:16] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2027.codfw.wmnet with reason: host reimage [20:34:46] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic2027.codfw.wmnet with reason: host reimage [20:35:59] !log Restarting elastic services `ryankemper@elastic2054:~$ sudo systemctl restart elasticsearch_6@production*` [20:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:06] (03Merged) 10jenkins-bot: CampaignEvents: backport extension for Jul 18 beta deploy [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/813657 (https://phabricator.wikimedia.org/T311752) (owner: 10Thcipriani) [20:36:22] We're back, we should see the icinga alerts (pybal etc) resolve very soon [20:36:33] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [20:37:21] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:37:24] * thcipriani continues with backports as this looks unrelated/handled. [20:37:40] thcipriani: yup, feel free [20:37:50] ryankemper: will do, thanks for confirming :) [20:37:53] RECOVERY - ElasticSearch health check for shards on 9643 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-psi-codfw: cluster_name: production-search-psi-codfw, status: yellow, timed_out: False, number_of_nodes: 15, number_of_data_nodes: 15, active_primary_shards: 1534, active_shards: 4026, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 575, delayed_unassigned_shards: 0, number_of_pending_tasks [20:37:53] ber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.5027168006955 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:38:13] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:41:18] (ProbeDown) resolved: Service search-psi-https:9643 has failed probes (http_search-psi-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:41:18] (ProbeDown) resolved: Service search-psi-https:9643 has failed probes (http_search-psi-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:41:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:44:05] RECOVERY - 
ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [20:44:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:44:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:45:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:45:30] !log thcipriani@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/CampaignEvents: Backport: [[gerrit:813657|CampaignEvents: backport extension for Jul 18 beta deploy (T311752)]] (duration: 02m 49s) [20:45:33] T311752: Release V0 of the CampaignEvents extension to the Beta Cluster - https://phabricator.wikimedia.org/T311752 [20:45:52] !log utc-late backport window complete [20:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:09] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:48:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P31106 and previous config saved to /var/cache/conftool/dbconfig/20220714-204822-ladsgroup.json [20:50:07] (03PS1) 10Cwhite: klaxon_config: add esc_policy_ids_filter to type definition [puppet] - 10https://gerrit.wikimedia.org/r/813978 [20:53:03] (03PS1) 10Ryan Kemper: elastic: don't disable replica alloc for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/813979 [20:54:03] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2027.codfw.wmnet with OS bullseye [20:54:12] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2027.codfw.wmnet with OS bullseye comp... [20:54:17] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.071 second response time https://wikitech.wikimedia.org/wiki/Swift [20:56:05] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.232 second response time https://wikitech.wikimedia.org/wiki/Swift [20:56:39] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [20:56:51] (03PS2) 10Ryan Kemper: elastic: don't disable replica alloc for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/813979 [21:01:47] (03PS3) 10Ryan Kemper: elastic: don't disable replica alloc for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/813979 (https://phabricator.wikimedia.org/T289135) [21:02:13] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [21:02:20] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [21:03:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T312984)', diff saved to https://phabricator.wikimedia.org/P31107 and previous config saved to /var/cache/conftool/dbconfig/20220714-210327-ladsgroup.json [21:03:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1180.eqiad.wmnet with reason: Maintenance [21:03:32] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [21:03:42] !log T289135 First host reimage done, manually killed rolling-operation 
cookbook before the next host reimage so that we can test out https://gerrit.wikimedia.org/r/813979 [21:03:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1180.eqiad.wmnet with reason: Maintenance [21:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T312984)', diff saved to https://phabricator.wikimedia.org/P31108 and previous config saved to /var/cache/conftool/dbconfig/20220714-210347-ladsgroup.json [21:05:55] (03CR) 10Bking: [V: 03+1 C: 03+2] elastic: don't disable replica alloc for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/813979 (https://phabricator.wikimedia.org/T289135) (owner: 10Ryan Kemper) [21:06:01] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Swift [21:09:36] (03Merged) 10jenkins-bot: elastic: don't disable replica alloc for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/813979 (https://phabricator.wikimedia.org/T289135) (owner: 10Ryan Kemper) [21:15:10] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [21:15:14] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [21:18:41] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.283 second response time https://wikitech.wikimedia.org/wiki/Swift [21:20:29] (03PS9) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 [21:21:05] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.036 second response time 
https://wikitech.wikimedia.org/wiki/Swift [21:21:55] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.147 second response time https://wikitech.wikimedia.org/wiki/Swift [21:22:27] (03PS10) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 [21:23:11] PROBLEM - ElasticSearch setting check - 9400 on elastic2047 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2049.codfw.wmnet:9700] does not match [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2054.codfw.wmnet:9700] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [21:25:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T312984)', diff saved to https://phabricator.wikimedia.org/P31109 and previous config saved to /var/cache/conftool/dbconfig/20220714-212556-ladsgroup.json [21:26:01] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [21:28:31] (03CR) 10CI reject: [V: 04-1] Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [21:29:25] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [21:31:19] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.180 second response time https://wikitech.wikimedia.org/wiki/Swift [21:33:43] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift [21:34:47] (03CR) 10CDanis: [C: 03+2] klaxon_config: add esc_policy_ids_filter to type definition [puppet] - 10https://gerrit.wikimedia.org/r/813978 
(owner: 10Cwhite) [21:35:19] (03PS11) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 [21:35:25] (03CR) 10Nskaggs: Ensure quota_increase cookbook runs and validates (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [21:41:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P31110 and previous config saved to /var/cache/conftool/dbconfig/20220714-214101-ladsgroup.json [21:41:13] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [21:41:16] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [21:41:57] (03CR) 10CI reject: [V: 04-1] Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [21:43:29] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.149 second response time https://wikitech.wikimedia.org/wiki/Swift [21:44:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:45:48] ACKNOWLEDGEMENT - ElasticSearch setting check - 9400 on elastic2047 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2049.codfw.wmnet:9700] does not match [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2054.codfw.wmnet:9700] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T289135 https://wikitech.wikimedia.org/wiki/Search%23Administration 
[21:46:44] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Swift [21:47:48] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:52:18] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.282 second response time https://wikitech.wikimedia.org/wiki/Swift [21:55:16] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift [21:56:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P31111 and previous config saved to /var/cache/conftool/dbconfig/20220714-215606-ladsgroup.json [21:56:34] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.090 second response time https://wikitech.wikimedia.org/wiki/Swift [21:58:12] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [21:58:20] (PS1) Krinkle: doc: Remove old travis and coveralls badge from readme [software/conftool] - https://gerrit.wikimedia.org/r/813981 [21:59:33] (PuppetFailure) resolved: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:02:42] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.054 second response time https://wikitech.wikimedia.org/wiki/Swift [22:09:16] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 
0.023 second response time https://wikitech.wikimedia.org/wiki/Swift [22:11:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T312984)', diff saved to https://phabricator.wikimedia.org/P31112 and previous config saved to /var/cache/conftool/dbconfig/20220714-221112-ladsgroup.json [22:11:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1140.eqiad.wmnet with reason: Maintenance [22:11:17] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [22:11:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1140.eqiad.wmnet with reason: Maintenance [22:21:36] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.068 second response time https://wikitech.wikimedia.org/wiki/Swift [22:28:52] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Swift [22:38:02] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.060 second response time https://wikitech.wikimedia.org/wiki/Swift [22:38:56] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-system-prune-dangling.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:08] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.265 second response time https://wikitech.wikimedia.org/wiki/Swift [22:47:44] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:48:28] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.035 
second response time https://wikitech.wikimedia.org/wiki/Swift [22:49:14] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:55:09] (PS1) Cwhite: loki-beta: increase grpc message size [puppet] - https://gerrit.wikimedia.org/r/813985 (https://phabricator.wikimedia.org/T222826) [22:55:52] (PS2) Cwhite: hiera: deploy and enable loki on grafana hosts [puppet] - https://gerrit.wikimedia.org/r/813724 (https://phabricator.wikimedia.org/T222826) [23:00:00] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [23:00:02] (PS1) Daimona Eaytoy: Add CampaignEvents to extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/813986 (https://phabricator.wikimedia.org/T311752) [23:04:09] Is all the people on commons complaining about swift errors (Which seems to correspond to a spike in 502 errors in logstash) a known thing? 
I don't seem to see a bug for it [23:06:12] bawolff: you are the first report here that I've seen today [23:06:30] ok, I guess I'll file a bug then :) [23:06:36] (PS1) Daimona Eaytoy: Add config variable for the CampaignEvents extension [mediawiki-config] - https://gerrit.wikimedia.org/r/813989 (https://phabricator.wikimedia.org/T311752) [23:06:46] based on logstash, seems to have started ~16:00 utc july 12 [23:09:46] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.217 second response time https://wikitech.wikimedia.org/wiki/Swift [23:10:06] Well speak of the devil [23:10:42] (PS1) Daimona Eaytoy: Enable the CampaignEvents extension on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/813990 (https://phabricator.wikimedia.org/T311752) [23:11:09] SRE-swift-storage: Spike in Swift errors - https://phabricator.wikimedia.org/T313102 (Bawolff) [23:13:10] SRE-swift-storage: Spike in Swift errors - https://phabricator.wikimedia.org/T313102 (Bawolff) [23:14:00] bawolff: I see in irc logs that there was an unexpected power event in codfw around that 2022-07-12 16:00Z time, but I don't think that should have any lingering effect on swift [23:14:32] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift [23:14:46] I didn't see anything in SAL that looked relevant [23:15:15] (PS1) Daimona Eaytoy: Load and configure the CampaignEvents extension where enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/813991 (https://phabricator.wikimedia.org/T311752) [23:17:44] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.078 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:00] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.026 second response time 
https://wikitech.wikimedia.org/wiki/Swift [23:53:52] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.297 second response time https://wikitech.wikimedia.org/wiki/Swift [23:54:30] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.074 second response time https://wikitech.wikimedia.org/wiki/Swift [23:56:52] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Swift